## Question

Asked By – erik

What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the `cvpartition` or `crossvalind` functions in MATLAB.


## Answer

If you want to split the data set once into two parts, you can use `numpy.random.shuffle`, or `numpy.random.permutation` if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):

```
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]
```

or

```
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
```
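As the answer notes, fixing the random seed makes the split reproducible. One way to do that without touching NumPy's global random state is a seeded `numpy.random.default_rng` generator; this is a sketch of the same index-based split above, with the seed value 42 being an arbitrary choice:

```python
import numpy

# x is your dataset
x = numpy.random.rand(100, 5)

# Seed a dedicated Generator so the permutation (and hence the split)
# is the same on every run; 42 is an arbitrary example seed.
rng = numpy.random.default_rng(42)
indices = rng.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx, :], x[test_idx, :]
```

Re-running this script yields identical `training_idx` and `test_idx` every time, while leaving `numpy.random`'s global state alone.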

There are many other ways to repeatedly partition the same data set for cross validation. Many of those are available in the `sklearn` library (k-fold, leave-n-out, …). `sklearn` also includes more advanced “stratified sampling” methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
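For illustration, the repeated-partition idea behind k-fold cross validation can be sketched with plain NumPy: shuffle the indices once, split them into k folds, and let each fold take a turn as the test set (this is a simplified stand-in for the `sklearn` utilities mentioned above, not their actual implementation):

```python
import numpy

# x is your dataset
x = numpy.random.rand(100, 5)

k = 5
# Shuffle indices once, then cut them into k (nearly) equal folds.
indices = numpy.random.permutation(x.shape[0])
folds = numpy.array_split(indices, k)

for i in range(k):
    # Fold i is the test set; the remaining k-1 folds form the training set.
    test_idx = folds[i]
    train_idx = numpy.concatenate(folds[:i] + folds[i + 1:])
    training, test = x[train_idx, :], x[test_idx, :]
    # ... fit and evaluate your model here ...
```

Each of the k iterations holds out a different 20-row fold, so every sample is used for testing exactly once across the k rounds.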

Answered By – pberkes

**This answer is collected from Stack Overflow and reviewed by FixPython community admins; it is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0**