This article contains affiliate links. For more, please read the T&Cs.
Working with multiple DataFrames in Pandas is a common thing. Often, you will want to merge two or more DataFrames. For instance, one DataFrame could contain the customer bio-data while the other could contain his transaction history. In such a case you might want to join the two DataFrames to take advantage of information from both of them during your analysis. In such a case the data could be merged on the unique customer. However, there are several ways of merging that dictate which information will be taken from either DataFrame and which will be omitted. In this article, let’s explore that and more.
Pandas three-way joining multiple DataFrames on columns
Pandas allows us to merge more than one DataFrame. The only condition here is that the DataFrames need to have a common column. Let’s illustrate this by creating three DataFrames. They will have a common Name
column.
First, import Pandas and NumPy. Next, create a dictionary containing the data that will be used to create the DataFrames.
import pandas as pd import numpy as np names_dict ={ ‘Name’:[‘Ken’,’Jeff’,’John’,’Mike’,’Andrew’,’Ann’,’Sylvia’,’Dorothy’,’Emily’,’Loyford’], ‘Age’:[31,52,56,12,45,np.nan,78,85,46,135], ‘Phone’:[52,79,80,75,43,125,74,44,85,45], ‘Uni’:[‘One’,’Two’,’Three’,’One’,’Two’,’Three’,’One’,’Two’,’Three’,’One’] } names_dict3 ={ ‘Name’:[‘Ken’,’Jeff’,’John’,’Mike’,’Andrew’,’Ann’,’Sylvia’,’Dorothy’,’Emily’,’Loyford’], ‘Weight’:[23,45,13,45,45,45,43,16,28,56], ‘Distance’:[52,79,80,75,43,125,74,44,85,45], ‘Skill’:[‘Four’,’Five’,’six’,’Seven’,’Eight’,’Nine’,’Ten’,’Eleven’,’Twelve’,’One’] } names_dict2 ={ ‘Name’:[‘Ken’,’John’,’Mike’,’Andrew’,’Ann’,’Sylvia’,’Dorothy’,’Emily’,’Loyford’], ‘Wage’:[24,52,45,14,45,np.nan,12,49,10], ‘Height’:[23,43,45,43,34,74,45,56,23], ‘Level’:[‘One’,’Three’,’One’,’Two’,’Three’,’One’,’Two’,’Three’,’One’] } |
Next, let’s create three DataFrames using the data above.
df1 = pd.DataFrame(names_dict) df2 = pd.DataFrame(names_dict2) df3 = pd.DataFrame(names_dict3) |
Let’s now illustrate several ways of merging the above DataFrames
Using the reduce function
The reduce
function is used to apply a single function to all the elements of a sequence. We can use it to merge all the DataFrames. First, we need to create a list containing all of them.
dfs = [df1, df2, df3] |
After that, we can use it to apply the Pandas merge
function on the list of DataFrames.
from functools import reduce df_final = reduce(lambda left,right: pd.merge(left,right,on=’Name’), dfs) |
The merge
function is used when joining DataFrames on the index or columns. When columns are used the indices are ignored and vice versa. In this case, we are merging on the `Name` column.
Understanding the merging parameters
By default Pandas, will perform an Inner
join. However, this can be changed by specifying the how
argument. The options available are:
- Left. This uses the keys from the left DataFrame only. It is similar to a SQL left outer join.
- Right. Uses keys from the right DataFrame only. It is similar to a SQL right outer join.
- Outer. Uses the union of keys from both DataFrames. It is similar to a SQL full outer join.
- Inner. Uses the intersection of keys from both DataFrames. It is similar to the SQL inner join.
Let’s now apply the merge
function using all the above options and see the difference.
For instance, applying outer
will result in some null values because Pandas will pick data from all DataFrames.
pd.merge(pd.merge(df1,df2,on=’Name’,how=”outer”),df3,on=’Name’,how=”outer”) |
The inner join doesn’t result in null values because it picks data that matches from all DataFrames.
pd.merge(pd.merge(df1,df2,on=’Name’,how=”inner”),df3,on=’Name’,how=”inner”) |
The left join will pick everything from the left table. Some null values can be observed since df1
has some null values.
pd.merge(pd.merge(df1,df2,on=’Name’,how=”left”),df3,on=’Name’,how=”left”) |
The right join will pick everything from the right table. The null values result from the fact that df2
doesn’t have Jeff
in the Name
column.
pd.merge(pd.merge(df1,df2,on=’Name’,how=”right”),df3,on=’Name’,how=”right”) |
When merging DataFrames, you, therefore, have to consider the final result to select the most appropriate way to perform the merging.
Let’s take a look at another way of merging the above DataFrames. The merge
function can be called on a Pandas DataFrame. The DataFrame it’s called on acts as the left DataFrame. You can then define the right DataFrame and then specify how you want the merging to take place.
df1.merge(df2,on=’Name’).merge(df3,on=’Name’) |
What is the difference between join and merge in Pandas?
You have already seen how to work with the merge
function in Pandas. The join
method is a bit different because it joins the DataFrames on the index or a key column. If the indices are overlapping, you will get this error.
ValueError: Indexes have overlapping values: Index(['Name'], dtype='object')
To join the DataFrames on the Name
column, you first have to set it as the index of the DataFrame.
dfs = [df.set_index(‘Name’) for df in dfs] pd.DataFrame().join(dfs,how=”outer”) |
Alternatively, you can set the index on the individual DataFrames then performing the merging.
df_1 = df1.set_index(‘Name’) df_2 = df2.set_index(‘Name’) df_1.join(df_2) |
If the columns have columns with similar names, you will get this error;
ValueError: columns overlap but no suffix specified: Index(['Age', 'Phone', 'Uni'], dtype='object')
This can be addressed by passing the lsuffix
(suffix to use from left DataFrame’s overlapping columns) and rsuffix
(suffix to use from right DataFrame’s overlapping columns) arguments. Pandas will then append them to the conflicting columns.
Concatenating Pandas DataFrames
Concatenating Pandas DataFrames is common when you have several DataFrames that contain the same information. In such cases, the second DataFrame can be placed below the first one. This is ideal when both DataFrames have similar and the same number of columns. Otherwise, you will see a lot of null values as seen below.
pd.concat([df1,df2,df3]) |
Final Thoughts
In this article, you have learned how to perform merging and joining on various DataFrames. The join and merge methods can be used interchangeably. However, the merge method is more straightforward. It is also the most commonly used method.
To reference the code used in this post, please see this Notebook.
This tutorial contains affiliate links. For more, please read the T & Cs.