A Quick Guide for Combining Data in Pandas Using merge(), .join(), concat(), and .append()
For one of my projects, I was stuck (and confused) with which combination methods to use to combine my dataframes. Granted that I am not superb at coding, and my brain can be foggy after 8 hours of data cleaning. Hence, I put this quick guide together to remind myself (& hopefully bring newbies some valuable info) on the differences between merge() / .join() / concat() / .append(). So, here we go:
Before we start, we need to first understand the difference between the FOUR forms of joins:
Let’s also take a look at the visual representation of different joins (see left image).
Ok, now let’s start. I have created 3 dataframes, each dataframes have 2 columns with different number of rows. And with this blog post being on my photography account… the DataFrames are ISO, F_STOPS, and SHUTTER. The goal is to combine these 3 DataFrames and create a new DataFrame so that we have a cheap sheet chart for photographers.
Inner join is the default setting for merge( ). One key thing about inner join is that you WILL LOSE rows that don’t have a match in the other DataFrame’s key column. In our case, our key column is the index column. So here is how it looks when we use inner join.
Note that our iso DataFrame has 9 rows, and the f_stop DataFrame has 15 rows. However, after using .merge ( ) default option – inner join – we lost 6 rows from the f_stop Dataframe after combining, because ISO column don’t have index 10-15.
Hence, default .merge ( ) option will make your new dataframe smaller. (re: the circles image above). Some data loss can occur when you do an inner join.
You can also specified the key columns to join on using on=[‘column_name’, ‘column_name2’] within the .merge( ). But in this case, pandas figures that the common column we have here is the index, so it just automatically does it itself.
Within .merge( ), you can merge ALL rows from both Dataframes by using outer join. This way, you don’t lose any data. That’s the key part about outer join.
If a row doesn’t have a match in the other DataFrame (based on the key column(s)), pandas will automatically fill the cells with NaN. (Yay more data to clean!)
LEFT / RIGHT OUTER JOIN
Using either a left or right outer join will give your newly merged DataFrame with all rows from the left or right DataFrame (which ever left or right DataFrame you specified), while discarding rows from the left (or right) DataFrame that don’t have a match in the key column(s) of the left (or right) DataFrame.
In this example, the f_stop is my left DataFrame, and Shutter is my right DataFrame. Then I told pandas that I’d like to use left outer join (how = ‘left’). So pandas merge the two DataFrames and keep all rows from f_stop, while discarding the remaining rows that don’t match from the shutter.
Originally, f_stop DataFrame has 14 rows, shutter has 18 rows. But my new merged DataFrame has only 14 rows. Makes sense?
While merge( ) is a module function, .join( ) is an object function that enables you to specify only one Dataframe. In addition, .join( ) uses merge( ) under the hood (whattttt?). Don’t worry about it. It’s quite easy to understand once you see the code.
In my example here, you can see that I specified a left join here, with ISO being the left DataFrame and f_stop being the right Dataframe.
If you flip the DataFrames (f_stop being the left one instead, and ISO being the right one), you will find your new merged DataFrames bigger/longer, with the missing cells from ISO filled with NaN values.
With concatenation, your datasets are put together along an axis – row axis (0) or column axis (1).
In my first example, I concatenated using row axis (axis = 0), and it just put two DataFrames together like stacking them on top of each other to create a long DataFrame. It also fills the missing cells with NaN.
In my second example, I concatenated using column axis (axis = 1), and pandas find the key columns and stack them together side by side to create the ideal chart that I’d want.
By default, .concate( ) combine the two DataFrames by row axis if you don’t specified.
By default, a concatenation results in a set union, where all data is preserved (rows or columns). This only happens when axis labels are matched.
Note: if the images to the left are too small, you can left click on the images and choose “Open Image in New Tab” to see the actual size.
.append( ) is just a shortcut to .concate( ). It is simpler, but more restricted than concatenation. We can use .append( ) on both Series and DataFrame objects.
new_df = df1.append(df2)
So after some more combining, and data cleaning, we now have a small cheat sheet for photographers. And if we are feeling fancier, we’d go into the data and specified more on when to use each combination by adding new features. Maybe later.. probably not. I find it easier to learn and understand how to use all 3 features instead of trying to memorize a chart. Feel free to reach out if you want to discuss more about data science and/or photography! My door is always open as long as I have time.
Summery: Inner join loses some data. Outer join keeps all the data.