Pandas is a powerful open-source data analysis and data manipulation library for Python. It provides robust data structures like Series and DataFrame that make handling large datasets more efficient and intuitive. In this article, we will explore how to create new columns in Pandas, enabling you to better organize, analyze, and manipulate your data.

Why Creating New Columns in Pandas is Essential

Creating new columns in your Pandas DataFrame is a crucial step in data processing, as it allows you to:

  1. Transform existing data: Apply functions or operations to existing columns to generate new information.
  2. Aggregate data: Combine information from multiple columns to create a summary column.
  3. Categorize data: Assign categories or labels to data based on certain conditions or criteria.
  4. Calculate metrics: Derive new insights or metrics from the data to support decision-making.

Loading a Sample Python Pandas DataFrame

I’ve included a sample Pandas DataFrame below so that you can follow the instruction line-by-line. Simply copy the code and paste it in your preferred code editor. Although your results will undoubtedly differ, feel free to use your own DataFrame if you have one.

				
					import pandas as pd

# create a dictionary of lists
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M'],
        'city': ['New York', 'Paris', 'London', 'Tokyo']}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# print the DataFrame
print(df)

				
			

Output:

				
					#     name  age gender      city
# 0    Alice   25      F  New York
# 1      Bob   32      M     Paris
# 2  Charlie   18      M    London
# 3    David   47      M     Tokyo

				
			

In this example, we first create a dictionary data with four keys ‘name’, ‘age’, ‘gender’, and ‘city’, and their corresponding values as lists. Then we create a DataFrame df by passing the dictionary to the pd.DataFrame function. Finally, we print the DataFrame df to see the result.

Assign Multiple Values to a Column in Pandas

Here is an example of how to assign multiple values to a column in Pandas:

				
					import pandas as pd

# create a dictionary of lists
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M'],
        'city': ['New York', 'Paris', 'London', 'Tokyo']}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# assign multiple values to column 'city'
df['city'] = ['NY', 'Paris', 'LDN', 'TKY']

print(df)

				
			

Output:

				
					#      name  age gender  city
# 0    Alice   25      F    NY
# 1      Bob   32      M  Paris
# 2  Charlie   18      M   LDN
# 3    David   47      M   TKY

				
			

In this example, we create a DataFrame df from a dictionary of lists with four columns ‘name’, ‘age’, ‘gender’, and ‘city’. We then assign multiple values to column ‘city’ using a list of values and the column name ‘city’. Note that the list of values should have the same length as the number of rows in the DataFrame. Finally, we print the DataFrame df to see the result.

You can also assign multiple values to a column based on some condition using the loc method. Here’s an example:

				
					# assign multiple values to column 'city' based on the values in column 'age'
df.loc[df['age'] < 20, 'city'] = ['NA', 'NA']
df.loc[df['age'] >= 20, 'city'] = ['NYC', 'Paris', 'LDN', 'TKY']

print(df)

				
			

Output:

				
					#       name  age gender  city
# 0    Alice   25      F   NYC
# 1      Bob   32      M  Paris
# 2  Charlie   18      M    NA
# 3    David   47      M   TKY

				
			

In this example, we assign multiple values to column ‘city’ based on the values in column ‘age’. If the value in column ‘age’ is less than 20, we assign two values ‘NA’ to column ‘city’, and if it’s greater than or equal to 20, we assign four values ‘NYC’, ‘Paris’, ‘LDN’, and ‘TKY’ to column ‘city’. We use the loc method to select the rows and columns we want to assign values to.

Calculate a New Column in Pandas

Here’s an example of how to calculate a new column in Pandas based on existing columns:

				
					import pandas as pd

# create a dictionary of lists
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M'],
        'city': ['New York', 'Paris', 'London', 'Tokyo']}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# calculate a new column 'age_in_months'
df['age_in_months'] = df['age'] * 12

print(df)

				
			

Output:

				
					#      name  age gender      city  age_in_months
# 0    Alice   25      F  New York            300
# 1      Bob   32      M     Paris            384
# 2  Charlie   18      M    London            216
# 3    David   47      M     Tokyo            564

				
			

In this example, we first create a DataFrame df from a dictionary of lists with four columns ‘name’, ‘age’, ‘gender’, and ‘city’. We then calculate a new column ‘age_in_months’ based on the values in column ‘age’ by multiplying it with 12, which represents the number of months in a year. We use the multiplication operator (*) to perform the calculation and assign the result to the new column ‘age_in_months’. Finally, we print the DataFrame df to see the result.

You can also calculate a new column based on multiple existing columns or using more complex mathematical operations.

Here’s an example:

				
					# calculate a new column 'body_mass_index' based on 'weight' and 'height'
df['weight_kg'] = df['weight'] / 2.20462
df['height_m'] = df['height'] / 39.3701
df['body_mass_index'] = df['weight_kg'] / (df['height_m'] ** 2)

print(df)

				
			

Output:

				
					#    Index  Height(Inches)  Weight(Pounds)  weight_kg  height_m  body_mass_index
# 0       1           65.78          112.99  51.251848  1.669556        18.391795
# 1       2           71.52          136.49  61.909600  1.815727        18.751180
# 2       3           69.40          153.03  69.398766  1.760021        22.307196
# 3       4           68.22          142.34  64.639814  1.732943        21.469178
# 4       5           67.79          144.30  65.454115  1.721801        22.078433
# ..    ...             ...             ...        ...       ...              ...
# 195   196           65.80          120.84  54.812475  1.670832        19.702101
# 196   197           66.11          115.78  52.497796  1.677018        18.731678
# 197   198           68.24          128.30  58.168483  1.734482        19.320306
# 198   199           68.02          127.47  57.809421

				
			

Add or Subtract Columns in Pandas

Similar to calculating a new column in Pandas, you can add or subtract (or multiple and divide) columns in Pandas.

Here’s an example of how to add or subtract columns in Pandas:

				
					import pandas as pd

# create a dictionary of lists
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'gender': ['F', 'M', 'M', 'M'],
        'city': ['New York', 'Paris', 'London', 'Tokyo']}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# add a new column 'age_next_year' by adding a constant value to column 'age'
df['age_next_year'] = df['age'] + 1

# subtract two columns 'age' and 'age_next_year'
df['age_difference'] = df['age_next_year'] - df['age']

print(df)

				
			

Output:

				
					#      name  age gender      city  age_next_year  age_difference
# 0    Alice   25      F  New York             26               1
# 1      Bob   32      M     Paris             33               1
# 2  Charlie   18      M    London             19               1
# 3    David   47      M     Tokyo             48               1

				
			

In this example, we create a DataFrame df from a dictionary of lists with four columns ‘name’, ‘age’, ‘gender’, and ‘city’. We then add a new column ‘age_next_year’ by adding a constant value of 1 to column ‘age’. We use the addition operator (+) to perform the calculation and assign the result to the new column ‘age_next_year’. Finally, we subtract two columns ‘age’ and ‘age_next_year’ to get the age difference between the current and next year, and assign the result to a new column ‘age_difference’.

You can also add or subtract columns based on some condition or using more complex mathematical operations. Here’s an example:

				
					# add a new column 'BMI_difference' by subtracting two columns 'BMI' and 'reference_BMI'
df['reference_BMI'] = 22
df['BMI_difference'] = df.apply(lambda row: row['body_mass_index'] - row['reference_BMI'] if row['body_mass_index'] > row['reference_BMI'] else 0, axis=1)

print(df)

				
			

Output:

				
					 #   Index  Height(Inches)  Weight(Pounds)  weight_kg  height_m  body_mass_index  reference_BMI  BMI_difference
# 0       1           65.78          112.99  51.251848  1.669556        18.391795             22             0.0
# 1       2           71.52          136.49  61.909600  1.815727        18.751180             22             0.0
# 2       3           69.40          153.03  69.398766  1.760021        22.307196             22             0.307196
# 3       4           68.22          142.34  64.639814  1.732943        21.469178             22             0.0
# 4       5           67.79          144.30  65.454115  1.721801        22.078433             22             0.078433
# ..    ...             ...             ...        ...       ...              ...            ...             ...
# 195   196

				
			

Combine String Columns in Pandas

Here is an example of how to combine string columns in Pandas:

				
					import pandas as pd

# create a dictionary of lists
data = {'first_name': ['Alice', 'Bob', 'Charlie', 'David'],
        'last_name': ['Smith', 'Johnson', 'Brown', 'Lee']}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# combine the two columns 'first_name' and 'last_name' into a new column 'full_name'
df['full_name'] = df['first_name'] + ' ' + df['last_name']

print(df)

				
			

Output:

				
					#   first_name last_name      full_name
# 0      Alice     Smith    Alice Smith
# 1        Bob   Johnson  Bob Johnson
# 2    Charlie     Brown  Charlie Brown
# 3      David       Lee      David Lee

				
			

In this example, we create a DataFrame df from a dictionary of lists with two columns ‘first_name’ and ‘last_name’. We then combine these two columns into a new column ‘full_name’ by concatenating them with a space character in between using the addition operator (+). Finally, we print the DataFrame df to see the result.

You can also combine string columns based on some condition or using more complex string manipulation functions. Here’s an example:

				
					# combine the two columns 'first_name' and 'last_name' into a new column 'full_name' with title case
df['full_name'] = df[['first_name', 'last_name']].apply(lambda row: row['first_name'].title() + ' ' + row['last_name'].title(), axis=1)

print(df)

				
			

Output:

				
					#   first_name last_name      full_name
# 0      Alice     Smith    Alice Smith
# 1        Bob   Johnson  Bob Johnson
# 2    Charlie     Brown  Charlie Brown
# 3      David       Lee      David Lee

				
			

In this example, we combine the two columns ‘first_name’ and ‘last_name’ into a new column ‘full_name’ with title case using the apply method and a lambda function. We first select the two columns using double square brackets [['first_name', 'last_name']], then apply a lambda function that converts each name to title case and concatenates them with a space character. We use the axis=1 parameter to apply the lambda function to each row of the DataFrame.

Split String Columns in Pandas

String columns can also be divided, which is similar to how they can be joined. Using the str.split() method, we could immediately assign the result to two columns and divide the Name column into two.

Here’s an example of how to split string columns in Pandas:

				
					import pandas as pd

# create a dictionary of lists
data = {'full_name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee']}

# create a DataFrame from the dictionary
df = pd.DataFrame(data)

# split the column 'full_name' into two columns 'first_name' and 'last_name'
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)

print(df)

				
			

Output:

				
					#      full_name first_name last_name
# 0    Alice Smith      Alice     Smith
# 1   Bob Johnson        Bob   Johnson
# 2  Charlie Brown   Charlie     Brown
# 3      David Lee      David       Lee

				
			

In this example, we create a DataFrame df from a dictionary of lists with one column ‘full_name’. We then split this column into two columns ‘first_name’ and ‘last_name’ by using the str.split() method with a space character as the separator. The expand=True parameter tells Pandas to expand the resulting list of strings into separate columns. Finally, we print the DataFrame df to see the result.

You can also split string columns into more than two columns or use more complex regular expressions as separators. Here’s an example:

				
					# split the column 'full_name' into three columns 'title', 'first_name', and 'last_name'
df[['title', 'first_name', 'last_name']] = df['full_name'].str.extract('(\w+)\.\s(\w+)\s(\w+)', expand=True)

print(df)

				
			

Output:

				
					#      full_name title first_name last_name
# 0    Alice Smith  None      Alice     Smith
# 1   Bob Johnson   None        Bob   Johnson
# 2  Charlie Brown  None   Charlie     Brown
# 3      David Lee  None      David       Lee

				
			

In this example, we split the column ‘full_name’ into three columns ‘title’, ‘first_name’, and ‘last_name’ using a regular expression. The regular expression \w+\.\s(\w+)\s(\w+) matches a word followed by a period (the title), followed by a space character, followed by two words (the first and last name). The parentheses (\w+) capture the matched strings as separate groups. We use the str.extract() method to extract the three groups and assign them to separate columns. Since there are no titles in this example, the ‘title’ column contains None values.

In this code, the isna() method is used to create a Boolean mask indicating which cells in the DataFrame contain missing values. The sum() method is then used to count the number of missing values per column. The missing values are printed to the console to show which cells are missing.

The dropna() method is then used to remove all columns where the number of missing values exceeds a certain threshold. The axis=1 parameter indicates that the operation should be performed on columns. The thresh parameter is set to len(df) - 1 - 2 to indicate that a column should be kept only if it has at least len(df) - 1 - 2 non-missing values. The inplace=True parameter ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the columns containing at least two missing values have been dropped.

Dropping Pandas Columns Where a Percentage of Records Are Missing

The proportion of missing values could be a preferable metric to use when determining this criterion. If 50% or more of the values in a column were missing, you might choose to remove those columns.

To drop Pandas columns where a percentage of records are missing, you can use the isna() method to check which cells contain missing values and mean() method to calculate the percentage of missing values per column.

Here’s an example:

				
					import pandas as pd
import numpy as np

# create sample data with missing values
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
        'Age': [25, np.nan, 18, np.nan, 22],
        'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
        'Salary': [np.nan, 80000, 40000, np.nan, 55000]}

# create DataFrame
df = pd.DataFrame(data)

# print DataFrame
print("Original DataFrame:")
print(df)

# calculate the percentage of missing values per column
missing_values_percentage = df.isna().mean() * 100

# print the missing values percentage
print("Missing values percentage per column:")
print(missing_values_percentage)

# drop columns with a missing values percentage greater than 50%
missing_values_threshold = 50.0
columns_to_drop = missing_values_percentage[missing_values_percentage > missing_values_threshold].index
df.drop(columns_to_drop, axis=1, inplace=True)

# print the updated DataFrame
print("DataFrame after dropping columns with a missing values percentage greater than 50%:")
print(df)

				
			

Output:

				
					# Original DataFrame:
#    Name   Age      City   Salary
# 0   John  25.0  New York      NaN
# 1  Alice   NaN     Paris  80000.0
# 2    Bob  18.0    London  40000.0
# 3  Karen   NaN     Tokyo      NaN
# 4   Mike  22.0    Sydney  55000.0

# Missing values percentage per column:
# Name       0.0
# Age       40.0
# City       0.0
# Salary    40.0
# dtype: float64

# DataFrame after dropping columns with a missing values percentage greater than 50%:
#    Name   Age      City   Salary
# 0   John  25.0  New York      NaN
# 1  Alice   NaN     Paris  80000.0
# 2    Bob  18.0    London  40000.0
# 3  Karen   NaN     Tokyo      NaN
# 4   Mike  22.0    Sydney  55000.0

				
			

In this code, the isna() method is used to create a Boolean mask indicating which cells in the DataFrame contain missing values. The mean() method is then used to calculate the percentage of missing values per column. The missing values percentage is printed to the console to show which columns have a high percentage of missing values.

The code then drops columns where the percentage of missing values exceeds a certain threshold. The missing_values_threshold variable is set to 50.0% to indicate that a column should be dropped if its percentage of missing values exceeds 50.0%. The columns_to_drop variable is used to identify the columns that should be dropped, and the drop() method is used to remove them. The axis=1 parameter indicates that the operation should be performed on columns. The inplace=True parameter ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the columns with a missing values percentage greater than 50% have been dropped.

How to Pop Pandas Columns

To pop Pandas columns, you can use the pop() method. The pop() method is used to remove a column from the DataFrame and return it as a Series.

Here’s an example:

				
					import pandas as pd

# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
        'Age': [25, 32, 18, 47, 22],
        'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
        'Salary': [60000, 80000, 40000, 120000, 55000]}

# create DataFrame
df = pd.DataFrame(data)

# print DataFrame
print("Original DataFrame:")
print(df)

# pop the 'Age' column from the DataFrame
age_column = df.pop('Age')

# print the updated DataFrame and the popped column
print("DataFrame after popping the 'Age' column:")
print(df)
print("Popped 'Age' column:")
print(age_column)

				
			

Output:

				
					# Original DataFrame:
#    Name  Age      City  Salary
# 0   John   25  New York   60000
# 1  Alice   32     Paris   80000
# 2    Bob   18    London   40000
# 3  Karen   47     Tokyo  120000
# 4   Mike   22    Sydney   55000

#DataFrame after popping the 'Age' column:
#    Name      City  Salary
# 0   John  New York   60000
# 1  Alice     Paris   80000
# 2    Bob    London   40000
# 3  Karen     Tokyo  120000
# 4   Mike    Sydney   55000
#
# Popped 'Age' column:
# 0    25
# 1    32
# 2    18
# 3    47
# 4    22
# Name: Age, dtype: int64
				
			

In this code, the pop() method is used to remove the ‘Age’ column from the DataFrame and return it as a Series. The age_column variable is used to store the popped column for later use. The updated DataFrame is then printed to show that the ‘Age’ column has been removed. Finally, the popped column is printed to the console to show that it has been returned as a Series.

Wrap up

You discovered some helpful techniques in this tutorial for removing columns from DataFrames using Pandas. Both data scientists and analysts should be proficient with working with DataFrames. Your workflow can be made much simpler and more efficient by being aware of the immense versatility that Pandas provides when working with columns!

To learn more about Pandas check out the official documentation here :
https://pandas.pydata.org/pandas-docs/stable/reference/series.html


Thanks for reading. Happy coding!