In this guide, we’ll explore the basics of CSV files, how to read them with Pandas, and some useful tips and tricks to make your data analysis smoother and more efficient. CSV (Comma-Separated Values) files are a common format for storing and exchanging data, especially in the data science and machine learning fields. Pandas, a Python library for data manipulation and analysis, makes it easy to read, manipulate and analyze CSV files.
What is a CSV file?
A CSV file is a plain-text document that stores data in tabular form: each line is a record (row), and the values within a line, separated by commas, are that record’s fields (columns). CSV is a popular format for storing and exchanging data between applications and systems because it is simple to read and write. Plain text editors and spreadsheet programs such as Microsoft Excel, Google Sheets, and LibreOffice Calc can all open and edit CSV files.
This tutorial explains several ways to read CSV files into Python using the following CSV file named ‘example.csv’:
Name,Income,Gender
Alice,20000,female
Bob,80000,male
Charlie,70000,male
David,150000,male
Sarah,40000,female
Example 1: Read CSV File into pandas DataFrame
To read a CSV file into a pandas DataFrame, you can use the read_csv() function from the pandas library. Here’s an example of how to use it:
import pandas as pd
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('example.csv')
# Display the first few rows of the DataFrame
print(df.head())
Output:
# Name Income Gender
#0 Alice 20000 female
#1 Bob 80000 male
#2 Charlie 70000 male
#3 David 150000 male
#4 Sarah 40000 female
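If you want to try this without creating a file first, the following is a minimal sketch that feeds the same data to read_csv() from memory. The csv_text variable and the use of io.StringIO are assumptions made for illustration; read_csv() accepts any file-like object as well as a path.

```python
import io
import pandas as pd

# In-memory copy of example.csv so the snippet runs without a file on disk
csv_text = """Name,Income,Gender
Alice,20000,female
Bob,80000,male
Charlie,70000,male
David,150000,male
Sarah,40000,female
"""

# StringIO wraps the string in a file-like object that read_csv can parse
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type pandas inferred for each column
```

Checking shape and dtypes right after loading is a quick way to confirm that the header was picked up and that numeric columns such as Income were parsed as numbers rather than text.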
Example 2: Read Specific Columns from CSV File
To read specific columns from a CSV file into a pandas DataFrame, you can use the usecols parameter of the read_csv() function.
Here’s an example:
import pandas as pd
# Read only specific columns from the CSV file
df = pd.read_csv('example.csv', usecols=['Name', 'Gender'])
# Display the first few rows of the DataFrame
print(df.head())
Output:
# Name Gender
#0 Alice female
#1 Bob male
#2 Charlie male
#3 David male
#4 Sarah female
As a second option, you can select the same columns by their zero-based positions instead of their names:
import pandas as pd
# Read only specific columns from the CSV file
df = pd.read_csv('example.csv', usecols=[0,2])
# Display the first few rows of the DataFrame
print(df.head())
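To check that selecting by name and selecting by position give the same result, here is a small sketch that runs both against an in-memory copy of the data. The csv_text, by_name, and by_index names are chosen only for this illustration.

```python
import io
import pandas as pd

# In-memory copy of example.csv (illustrative; a file path works the same way)
csv_text = """Name,Income,Gender
Alice,20000,female
Bob,80000,male
Charlie,70000,male
David,150000,male
Sarah,40000,female
"""

# Select columns by name
by_name = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Gender'])

# Select the same columns by zero-based position
by_index = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2])

print(by_name.columns.tolist())
print(by_name.equals(by_index))
```

Names are usually the safer choice, because positions silently select different data if the column order of the file changes.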
Example 3: Specify Header Row when Importing CSV File
To specify the header row when importing a CSV file using Pandas, you can use the header parameter of the read_csv() function. This parameter tells Pandas which row of the CSV file to use for the column names. By default, Pandas assumes that the first row of the CSV file contains the column names.
Here’s an example:
import pandas as pd
# read CSV file with header row at index 0
df = pd.read_csv('example.csv', header=0)
# Display the first few rows of the DataFrame
print(df.head())
Output:
# Name Income Gender
#0 Alice 20000 female
#1 Bob 80000 male
#2 Charlie 70000 male
#3 David 150000 male
#4 Sarah 40000 female
In this example, the header parameter is set to 0, which tells Pandas to use the first row of the CSV file as the header row. If the header is located on a different row, pass that row’s zero-based index to the header parameter instead.
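As an illustration of a header that is not on the first row, the sketch below assumes a file whose first line is a title and whose column names are on the second line. The title text and the csv_text variable are made up for this example.

```python
import io
import pandas as pd

# Hypothetical file: line 0 is a title, line 1 holds the column names
csv_text = """Exported from a hypothetical payroll tool
Name,Income,Gender
Alice,20000,female
Bob,80000,male
"""

# header=1 uses the second line as the header; the line before it is discarded
df = pd.read_csv(io.StringIO(csv_text), header=1)

print(df.columns.tolist())
print(len(df))
```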
If your CSV file doesn’t have a header row, you can set the header parameter to None and then use the names parameter to specify the column names:
import pandas as pd
# read a CSV file that has no header row and supply the column names
df = pd.read_csv('example.csv', header=None, names=['Name', 'Income', 'Gender'])
# Display the first few rows of the DataFrame
print(df.head())
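In this example, the header parameter is set to None to indicate that the CSV file doesn’t have a header row. The names parameter is then used to specify the column names as a list of strings. Use this only for files that really lack a header: if the file does contain one, as the sample example.csv shown earlier does, the header line is read as the first row of data.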
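The sketch below shows the same idea on a file that really has no header row. The in-memory csv_text is an assumption for illustration; it also shows what Pandas does when header=None is used without names.

```python
import io
import pandas as pd

# Hypothetical headerless file: the first line is already data
csv_text = """Alice,20000,female
Bob,80000,male
"""

# Supply the column names explicitly
named = pd.read_csv(io.StringIO(csv_text), header=None,
                    names=['Name', 'Income', 'Gender'])

# Without names, pandas labels the columns 0, 1, 2, ...
unnamed = pd.read_csv(io.StringIO(csv_text), header=None)

print(named.columns.tolist())
print(unnamed.columns.tolist())
```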
Example 4: Skip Rows when Importing CSV File
To skip rows when importing a CSV file using Pandas, you can use the skiprows parameter of the read_csv() function. This parameter tells Pandas which lines of the CSV file to skip during the import. When you pass a list, its entries are zero-based line indices, so index 0 is the header line and index 1 is the first data row.
Here’s an example:
import pandas as pd
# read CSV and skip the second line of the file (index 1), i.e. the first data row
df = pd.read_csv('example.csv', skiprows=[1])
# Display the first few rows of the DataFrame
print(df.head())
Output:
# Name Income Gender
#0 Bob 80000 male
#1 Charlie 70000 male
#2 David 150000 male
#3 Sarah 40000 female
The following code shows how to skip the second and third lines of the file when importing it:
import pandas as pd
# read CSV and skip the second and third lines (indices 1 and 2)
df = pd.read_csv('example.csv', skiprows=[1,2])
# Display the first few rows of the DataFrame
print(df.head())
Output:
# Name Income Gender
#0 Charlie 70000 male
#1 David 150000 male
#2 Sarah 40000 female
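One detail that is easy to trip over: skiprows also accepts a plain integer, and that behaves differently from a list. A list names the individual lines to skip, while an integer skips that many lines from the top of the file, header included. The sketch below, using an in-memory copy of the data (an assumption for illustration), shows the difference.

```python
import io
import pandas as pd

# In-memory copy of example.csv (illustrative)
csv_text = """Name,Income,Gender
Alice,20000,female
Bob,80000,male
Charlie,70000,male
David,150000,male
Sarah,40000,female
"""

# List: skip the lines at indices 1 and 2 (Alice and Bob); the header is kept
df_list = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

# Integer: skip the first 2 lines (header and Alice); Bob's line becomes the header
df_int = pd.read_csv(io.StringIO(csv_text), skiprows=2)

print(df_list.columns.tolist())
print(df_int.columns.tolist())
```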
Skip rows based on a condition
You can also pass a function to the skiprows parameter. Pandas calls it with each line’s zero-based index (an integer), not with the line’s text, and skips the lines for which it returns True. Because the function never sees the content of a row, a check such as row.startswith('#') cannot work there. If your CSV file contains comment rows that begin with a # symbol, use the comment parameter instead:
import pandas as pd
# read the CSV file and ignore lines that start with '#'
# (on other lines, anything after a '#' is dropped as well)
df = pd.read_csv('example.csv', comment='#')
# display the DataFrame
print(df.head())
In this example, the comment parameter is set to #, so Pandas ignores every line that starts with that character and imports only the header and the data rows. A function passed to skiprows is still useful for index-based rules: for example, skiprows=lambda i: i > 0 and i % 2 == 0 keeps the header and skips every line with an even index after it.
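One detail worth checking when you pass a function to skiprows: Pandas calls it with the line’s integer index, not with the line’s text. The sketch below assumes a small in-memory file with two made-up comment lines and compares content-based skipping through the comment parameter with index-based skipping through a function.

```python
import io
import pandas as pd

# Hypothetical file with comment lines at indices 1 and 3
csv_text = """Name,Income,Gender
# reviewed by a hypothetical auditor
Alice,20000,female
# another note
Bob,80000,male
"""

# Content-based: ignore every line that starts with '#'
by_comment = pd.read_csv(io.StringIO(csv_text), comment='#')

# Index-based: the function receives the line index (an int) and returns True to skip
by_index = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i in {1, 3})

print(by_comment['Name'].tolist())
print(by_comment.equals(by_index))
```

The comment parameter is the more robust of the two here, since it keeps working when comment lines are added or moved.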
Example 5: Read CSV Files with Custom Delimiter
Sometimes you may have a CSV file with a delimiter that is different from a comma.
To read CSV files with a custom delimiter using Pandas, you can use the sep parameter of the read_csv() function, or delimiter, which is an alias for it. By default, Pandas assumes that CSV files are comma-separated.
Here’s an example of how to read a CSV file with a tab delimiter using the delimiter parameter:
import pandas as pd
# read CSV file with tab delimiter
df = pd.read_csv('my_data.csv', delimiter='\t')
# display the DataFrame
print(df.head())
In this example, the delimiter parameter is set to \t, which tells Pandas to use a tab character as the delimiter.
Alternatively, you can use the sep parameter to specify the delimiter:
import pandas as pd
# read CSV file with pipe delimiter
df = pd.read_csv('my_data.csv', sep='|')
# display the DataFrame
print(df.head())
In this example, the sep parameter is set to |, which tells Pandas to use a pipe character as the delimiter.
You can also specify a regular expression pattern as the delimiter using the sep parameter. For example, if your CSV file uses a delimiter that consists of multiple spaces, you can specify a regular expression pattern to match the delimiter:
import pandas as pd
# read CSV file with multiple-space delimiter
df = pd.read_csv('my_data.csv', sep=r'\s+')
# display the DataFrame
print(df.head())
In this example, the sep parameter is set to \s+, which is a regular expression pattern that matches one or more whitespace characters. This allows Pandas to correctly parse the file even if the delimiter consists of multiple spaces. Writing the pattern as a raw string (r'\s+') keeps Python from treating the backslash as the start of an escape sequence.
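To see the three delimiter styles side by side, here is a short sketch that parses the same two columns from tab-, pipe-, and whitespace-separated text. The in-memory strings are assumptions for illustration and stand in for files like my_data.csv.

```python
import io
import pandas as pd

# The same small table written with three different delimiters (illustrative data)
tab_text = "Name\tIncome\nAlice\t20000\nBob\t80000\n"
pipe_text = "Name|Income\nAlice|20000\nBob|80000\n"
space_text = "Name    Income\nAlice   20000\nBob     80000\n"

df_tab = pd.read_csv(io.StringIO(tab_text), sep='\t')
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep='|')
# Raw string so the backslash reaches the regex engine unchanged
df_space = pd.read_csv(io.StringIO(space_text), sep=r'\s+')

print(df_tab.equals(df_pipe))
print(df_tab.equals(df_space))
```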
Wrap up
To learn more about the Pandas read_csv() function, check out the official documentation here
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Thanks for reading. Happy coding!