Pandas is one of the most prominent libraries used by data scientists and analysts for data analysis and manipulation. It offers an extensive array of data cleansing, preparation, and transformation capabilities. One such task, discussed in this article, is shuffling: randomizing the order of the rows of a DataFrame. In this article, you will learn how to shuffle the rows of a Pandas DataFrame with Python.
What is Pandas Shuffle?
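Shuffling means randomly reordering the rows of a DataFrame. Pandas does not provide a dedicated shuffle() function; the sample() method is typically used instead, and libraries such as Sci-Kit Learn and NumPy offer alternatives. Shuffling is useful when we wish to randomize the order of our data, which matters in particular when a dataset was collected or stored in a meaningful order. By shuffling the data, we can avoid bias that may arise from that order.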
Loading a Sample Pandas Dataframe
The Python code in the following section generates a sample Pandas Dataframe. If you wish to follow this tutorial line by line, feel free to copy the code below in order. You can also use your own dataframe, though the results will differ from those in the tutorial.
import pandas as pd
# create a dictionary containing data for the DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
# create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)
# print the DataFrame
print(df)
Output:
# Name Age City
#0 Alice 25 New York
#1 Bob 30 Paris
#2 Charlie 35 London
#3 David 40 Berlin
#4 Eva 45 Sydney
Shuffle a Pandas Dataframe with sample
You can shuffle a Pandas DataFrame using the sample() method with the frac parameter set to 1, which randomly samples all rows of the DataFrame without replacement. Here is an example:
import pandas as pd
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)
# shuffle the DataFrame
shuffled_df = df.sample(frac=1)
# print the shuffled DataFrame
print(shuffled_df)
Output:
# Name Age City
#3 David 40 Berlin
#1 Bob 30 Paris
#0 Alice 25 New York
#2 Charlie 35 London
#4 Eva 45 Sydney
In this example, we first create a sample DataFrame. We then use the sample() method to shuffle the rows of the DataFrame, with the frac parameter set to 1 to sample all rows. Finally, we print the shuffled DataFrame using print(). The output is the DataFrame with its rows in a random order, so your result will most likely differ from the one shown. Note that each row keeps its original index label.
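As a quick sanity check, a shuffle should change only the order of the rows, never their content. The sketch below assumes the same sample DataFrame as above; the names restored_df and same_rows are illustrative. It sorts the shuffled result by its original index labels and compares it with the original.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney'],
})

# shuffle the DataFrame
shuffled_df = df.sample(frac=1)

# sample() keeps the original index labels, so sorting by index
# puts the rows back in their original order
restored_df = shuffled_df.sort_index()

# compare the restored DataFrame with the original
same_rows = restored_df.equals(df)
print(same_rows)
```

If the comparison fails, something other than the row order changed.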
We can reset the index with the reset_index() method, which relabels the rows from 0 onwards. Let’s see how this looks:
import pandas as pd
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)
# shuffle the DataFrame
shuffled_df = df.sample(frac=1)
# reset the index
shuffled_df = shuffled_df.reset_index(drop=True)
# print the shuffled and reset DataFrame
print(shuffled_df)
Output:
# Name Age City
#0 Alice 25 New York
#1 David 40 Berlin
#2 Bob 30 Paris
#3 Charlie 35 London
#4 Eva 45 Sydney
In this example, we first create a sample DataFrame. We then use the sample() method to shuffle the rows of the DataFrame, with the frac parameter set to 1 to sample all rows. Next, we use the reset_index() method to reset the index of the shuffled DataFrame, with drop=True to discard the old index rather than keep it as a column. Finally, we print the shuffled and reset DataFrame using print(). The output is the shuffled DataFrame with a new index running from 0 onwards.
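If you are on pandas 1.3.0 or newer, sample() also accepts an ignore_index parameter, which should make the separate reset_index() call unnecessary. A minimal sketch, assuming that version requirement is met:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney'],
})

# shuffle and relabel the rows 0..n-1 in one step
# (ignore_index requires pandas >= 1.3.0)
shuffled_df = df.sample(frac=1, ignore_index=True)

# the index now runs from 0 to 4 regardless of the row order
print(shuffled_df.index.tolist())
```

On older pandas versions, the reset_index(drop=True) approach shown above remains the way to go.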
Reproduce a Shuffled Pandas Dataframe
If you want to reproduce a shuffled Pandas DataFrame with a specific random seed, you can set the random_state parameter of the sample() method.
Here’s an example:
import pandas as pd
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)
# shuffle the DataFrame with a random seed of 42
shuffled_df = df.sample(frac=1, random_state=42)
# reset the index
shuffled_df = shuffled_df.reset_index(drop=True)
# print the shuffled and reset DataFrame
print(shuffled_df)
Output:
# Name Age City
#0 Bob 30 Paris
#1 Eva 45 Sydney
#2 Charlie 35 London
#3 Alice 25 New York
#4 David 40 Berlin
In this example, we set the random_state parameter of the sample() method to 42. This ensures that the shuffled DataFrame is the same every time the code runs, as long as the seed and the input DataFrame are unchanged. If you want to reproduce the same shuffled DataFrame in the future, simply use the same seed value. The output is the shuffled DataFrame with a new index running from 0 onwards.
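To see the effect of the seed directly, the sketch below shuffles the same sample DataFrame twice with random_state=42 and compares the results. The variable names first and second are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney'],
})

# two independent shuffles with the same seed
first = df.sample(frac=1, random_state=42)
second = df.sample(frac=1, random_state=42)

# both shuffles produce the same row order
print(first.equals(second))
```

Without random_state, two such calls are free to return different orders.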
Shuffle a Pandas Dataframe with Sci-Kit Learn’s shuffle
Another helpful way to randomize a Pandas Dataframe is to use the machine learning library, sklearn. Sci-Kit Learn is a popular machine learning library, so using its shuffle() function to shuffle a Pandas DataFrame makes it easy to integrate the shuffled data into a machine learning pipeline.
Here’s an example:
import pandas as pd
from sklearn.utils import shuffle
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)
# shuffle the DataFrame with Sci-Kit Learn's shuffle function and a random seed of 1
shuffled_df = shuffle(df, random_state=1)
# reset the index
shuffled_df = shuffled_df.reset_index(drop=True)
# print the shuffled and reset DataFrame
print(shuffled_df)
Output:
# Name Age City
#0 Alice 25 New York
#1 Charlie 35 London
#2 Eva 45 Sydney
#3 Bob 30 Paris
#4 David 40 Berlin
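If we want to reproduce our results, we can use the random_state parameter, which the example above already sets. Running the same code again with the same seed gives the same shuffled DataFrame: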
import pandas as pd
from sklearn.utils import shuffle
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)
# shuffle the DataFrame with Sci-Kit Learn's shuffle function and a random seed of 1
shuffled_df = shuffle(df, random_state=1)
# reset the index
shuffled_df = shuffled_df.reset_index(drop=True)
# print the shuffled and reset DataFrame
print(shuffled_df)
In this example, we pass the value 1 to the random_state parameter of the shuffle() function. This ensures that the shuffled DataFrame is the same every time the code runs, as long as the seed is 1. The output is the shuffled DataFrame with a new index running from 0 onwards, identical to the previous DataFrame shuffled with a seed of 1.
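One reason the sklearn utility fits machine learning workflows is that shuffle() accepts several arrays at once and applies the same permutation to each. The sketch below assumes a hypothetical split in which Age is the target and the remaining columns are features; the names X and y are illustrative only.

```python
import pandas as pd
from sklearn.utils import shuffle

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney'],
})

# hypothetical feature/target split, for illustration only
X = df[['Name', 'City']]
y = df['Age']

# shuffle both objects with the same permutation
X_shuffled, y_shuffled = shuffle(X, y, random_state=1)

# the index labels still line up, so each row keeps its own target value
print(X_shuffled.index.equals(y_shuffled.index))
```

Shuffling X and y in separate calls without a shared seed would break the pairing between rows and targets.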
Shuffle a Pandas Dataframe with Numpy’s random.permutation
Another way to shuffle a Pandas DataFrame is to use NumPy’s random.permutation() function. Given an integer n, this function returns a random permutation of the integers from 0 to n - 1, which can be used to reorder the rows of a DataFrame.
Here’s an example:
import pandas as pd
import numpy as np
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 45],
'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)
# shuffle the DataFrame with NumPy's random permutation function
shuffled_df = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
# print the shuffled DataFrame
print(shuffled_df)
Output:
# Name Age City
#0 Charlie 35 London
#1 Bob 30 Paris
#2 Eva 45 Sydney
#3 David 40 Berlin
#4 Alice 25 New York
In this example, we use NumPy’s random.permutation() function to generate a random permutation of the row positions of the DataFrame. We then use the iloc indexer to select the rows of the DataFrame in the shuffled order. Finally, we reset the index to start from 0 using the reset_index() method with drop=True.
This method can be useful if you are already using NumPy in your project and don’t want to import another library like Sci-Kit Learn just for shuffling a DataFrame. However, it’s worth noting that random.permutation() generates a new permutation every time it’s called, so if you need to reproduce the same shuffled DataFrame in the future, you’ll need to set a random seed first using NumPy’s random.seed() function.
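To make the NumPy approach reproducible, one option is a seeded generator from np.random.default_rng(), which avoids changing NumPy's global random state. This is a minimal sketch rather than the only way to do it; the seed value 42 is arbitrary, and the names rng and rng_again are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney'],
})

# a generator with a fixed seed (42 is an arbitrary choice)
rng = np.random.default_rng(42)
shuffled_df = df.iloc[rng.permutation(len(df))].reset_index(drop=True)

# a second generator with the same seed yields the same permutation
rng_again = np.random.default_rng(42)
shuffled_again = df.iloc[rng_again.permutation(len(df))].reset_index(drop=True)

print(shuffled_df.equals(shuffled_again))
```

Note that a seeded generator and np.random.seed() with the same seed value are not expected to produce the same permutation, so stick to one approach within a project.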
The Fastest Way to Shuffle a Pandas Dataframe
You may be unsure of which method to choose at this juncture. I would suggest determining which method suits your workflow the best. For instance, if you are constructing a data science pipeline with sklearn, you may wish to add the shuffle utility into your pipeline.
Check this comparison below:
Method | Description | Pros | Cons | Speed |
---|---|---|---|---|
Pandas sample(frac=1) | Shuffle the DataFrame by sampling all of its rows with the sample() method | Easy to use, needs no extra imports, supports random_state, doesn’t change the original DataFrame | Returns a new DataFrame, which can use more memory for large DataFrames (sample() has no inplace parameter); keeps the old index unless you reset it | Typically fast; measure on your own data |
Sci-Kit Learn’s shuffle() | Shuffle the DataFrame using Sci-Kit Learn’s shuffle() function | Easy to use, works with NumPy arrays as well as DataFrames, supports random_state | Requires importing an additional library | Typically similar; measure on your own data |
NumPy’s random.permutation() | Shuffle the DataFrame by selecting rows with iloc in a permuted order | Works well if NumPy is already being used in the project | Generates a new permutation every time it’s called, so it is not reproducible without setting a random seed | Typically similar; measure on your own data |
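Relative speed depends on the size and dtypes of your data and on your library versions, so it is worth measuring rather than assuming. The sketch below times sample() against the NumPy permutation approach on a synthetic DataFrame. The 100,000-row size, the column names, and the helper function names are arbitrary assumptions, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

# synthetic data for the benchmark (size and columns are arbitrary)
n_rows = 100_000
df = pd.DataFrame({'a': np.arange(n_rows), 'b': np.random.rand(n_rows)})

def shuffle_with_sample():
    # shuffle with the Pandas sample() method
    return df.sample(frac=1).reset_index(drop=True)

def shuffle_with_permutation():
    # shuffle with NumPy's random.permutation() and iloc
    return df.iloc[np.random.permutation(len(df))].reset_index(drop=True)

# time each approach over several runs and report the mean per call
n_runs = 20
for func in (shuffle_with_sample, shuffle_with_permutation):
    seconds = timeit.timeit(func, number=n_runs)
    print(f"{func.__name__}: {seconds / n_runs:.5f} s per call")
```

Run it a few times on data that resembles your own before drawing conclusions.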
Wrap up
You learned how to shuffle a Pandas Dataframe using the Pandas sample() method in this tutorial. The method lets us randomly sample rows. To shuffle our dataframe, we simply take a random sample of the entire dataframe. Using the random_state parameter, we can even reproduce our shuffled dataframe.
You also learned how to use the sklearn and numpy libraries to shuffle your dataframe, giving you even more control over how your results are generated. Using sklearn, for instance, enables you to readily incorporate this step into machine learning pipelines.
Check out the official documentation located here to learn more about the methods outlined in this tutorial:
Thanks for reading. Happy coding!