Introduction:
Working with data has a strong connection with programming. Therefore, a Data Scientist who knows the best practices of software development will have advantages while solving Data Science problems. To achieve this, one needs to go beyond the basics while studying Pandas. In this tutorial, we will debug some Pandas code with the help of Pandas-Log.
Debugging errors in the source code and preparing data is a great part of a data scientist’s job. But what if you could solve errors faster? Perhaps, looking “under the hood” of Pandas will make us find the solution faster. And improve the explainability of our operations in the data for free!
For this objective, we will use Pandas-Log, an open-source Pandas logging tool. This tool will help us to see what is happening while we are applying processing steps to data. A Pandas operation can affect rows and columns, Pandas-Log shows us how many columns and rows changed. In the case of any alteration of the Data Frame, it shows what are the remaining rows and columns.
Let’s get a dataset to work with and show a few logging examples. We will use game sales data from Kaggle, Google’s Data Science community, it is available on the following link: Video Game Sales – Kaggle. If the data is no longer available on Kaggle, you can download it from here on my Github.
The data contains features such as name, platform, year, genre, publisher, and the number of sales by global region. Let’s explore the data and do an analysis of it? If we find some difficulty while doing it, we will use the help of Pandas-Log.
Seeing “under the hood” of Queries with Pandas-Log:
In this step, we will see the structure of the dataset:
import pandas as pd import numpy as np df = pd.read_csv("vgsales.csv") df.head()
Games sales data:
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
How this simple call of the Pandas head() method would look like if we were using Pandas-Log? To use it, just install it via PIP along with its required packages.
pip install --user pandas-log pip install --user pandas-flavor pip install --user humanize
After the installation of the pandas-log package and its requirements, we can use it by importing pandas_log:
import pandas_log with pandas_log.enable(): df.head()
Pandas-Log Output:
1) head(n: int = 5): Metadata: * Picked the first 5 rows. Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 5.5 MB. * Output Dataframe size is 1.7 kB.
With Pandas-log enabled, the description of the function “head()” is displayed, such as the size of the input Dataframe and the size of the output Dataframe.
After looking at the data, everyone can agree that Nintendo is a strong video game publisher. To see clearer the Nintendo influence in the video game market, we will filter the data to Nintendo games developed before the 2000s. Only Nintendo games in the top 50 will be accepted.
old_nintendo = df.copy().query("Year<2000").query("Rank<50").query("Publisher=='Nlntendo'") old_nintendo
Best-selling Nintendo old Games… Wait! Zero games?!
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
---|
The output DataFrame is empty! Nintendo didn’t make any successful videogame before the 2000s? What about Mario and Donkey Kong?
Let’s bring Pandas-Log to the rescue! It will help us find any mistake made in the query.
with pandas_log.enable(): old_nintendo = df.copy().query("Year<2000").query("Rank<50").query("Publisher=='Nlntendo'")
Pandas-Log output:
1) query(expr="Year<2000", inplace=False): Metadata: * Removed 14624 rows (88.10700084347512%), 1974 rows remaining. Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 5.5 MB. * Output Dataframe size is 661.8 kB. 2) query(expr="Rank<50", inplace=False): Metadata: * Removed 1964 rows (99.49341438703141%), 10 rows remaining. Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 661.8 kB. * Output Dataframe size is 3.3 kB. 3) query(expr="Publisher=='Nlntendo'", inplace=False): Metadata: * Removed 10 rows (100.0%), 0 rows remaining. Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 3.3 kB. * Output Dataframe size is 0 Bytes.
Looks like the incorrect step was in the last query, the query that checks if the Publisher of the remaining 10 rows is Nintendo. Wait! Looks like “Nlntendo” contains a typo, it should be “Nintendo”. See how the output Dataframe looks like after the correction:
old_nintendo = df.copy().query("Year<2000").query("Rank<50").query("Publisher=='Nintendo'") old_nintendo
Best-selling old Nintendo games under the Top 50:
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
5 | 6 | Tetris | GB | 1989.0 | Puzzle | Nintendo | 23.20 | 2.26 | 4.22 | 0.58 | 30.26 |
9 | 10 | Duck Hunt | NES | 1984.0 | Shooter | Nintendo | 26.93 | 0.63 | 0.28 | 0.47 | 28.31 |
12 | 13 | Pokemon Gold/Pokemon Silver | GB | 1999.0 | Role-Playing | Nintendo | 9.00 | 6.18 | 7.20 | 0.71 | 23.10 |
18 | 19 | Super Mario World | SNES | 1990.0 | Platform | Nintendo | 12.78 | 3.75 | 3.54 | 0.55 | 20.61 |
21 | 22 | Super Mario Land | GB | 1989.0 | Platform | Nintendo | 10.83 | 2.71 | 4.18 | 0.42 | 18.14 |
22 | 23 | Super Mario Bros. 3 | NES | 1988.0 | Platform | Nintendo | 9.54 | 3.44 | 3.84 | 0.46 | 17.28 |
30 | 31 | Pokémon Yellow: Special Pikachu Edition | GB | 1998.0 | Role-Playing | Nintendo | 5.89 | 5.04 | 3.12 | 0.59 | 14.64 |
46 | 47 | Super Mario 64 | N64 | 1996.0 | Platform | Nintendo | 6.91 | 2.85 | 1.91 | 0.23 | 11.89 |
Let’s see how another query looks under the hood with Pandas-Log. A query will return a DataFrame containing the top 10 sports games of the Playstation 2 platform. The last operation will sort the games by the number of their US sales:
df.query("Platform=='PS2'").query("Genre=='Sports'").nlargest(10,"NA_Sales")
Best-Selling PS2 sports games in the US:
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
179 | 180 | Madden NFL 2004 | PS2 | NaN | Sports | Electronic Arts | 4.26 | 0.26 | 0.01 | 0.71 | 5.23 |
238 | 239 | Madden NFL 2005 | PS2 | 2004.0 | Sports | Electronic Arts | 4.18 | 0.26 | 0.01 | 0.08 | 4.53 |
211 | 212 | Madden NFL 06 | PS2 | 2005.0 | Sports | Electronic Arts | 3.98 | 0.26 | 0.01 | 0.66 | 4.91 |
240 | 241 | Madden NFL 07 | PS2 | 2006.0 | Sports | Electronic Arts | 3.63 | 0.24 | 0.01 | 0.61 | 4.49 |
279 | 280 | Madden NFL 2003 | PS2 | 2002.0 | Sports | Electronic Arts | 3.36 | 0.21 | 0.01 | 0.56 | 4.14 |
248 | 249 | Tony Hawk’s Pro Skater 3 | PS2 | 2001.0 | Sports | Activision | 2.66 | 1.29 | 0.01 | 0.46 | 4.41 |
451 | 452 | Madden NFL 2002 | PS2 | 2001.0 | Sports | Electronic Arts | 2.50 | 0.16 | 0.01 | 0.42 | 3.08 |
306 | 307 | Tony Hawk’s Underground | PS2 | 2003.0 | Sports | Activision | 2.29 | 1.17 | 0.01 | 0.42 | 3.90 |
534 | 535 | NBA Street | PS2 | 2001.0 | Sports | Electronic Arts | 2.19 | 0.22 | 0.00 | 0.38 | 2.79 |
580 | 581 | ESPN NFL 2K5 | PS2 | 2004.0 | Sports | Sega | 2.15 | 0.12 | 0.00 | 0.36 | 2.62 |
with pandas_log.enable(): df.query("Platform=='PS2'").query("Genre=='Sports'").nlargest(10,"NA_Sales")
Pandas-Log Output:
1) query(expr="Platform=='PS2'", inplace=False): Metadata: * Removed 14437 rows (86.98035907940715%), 2161 rows remaining. Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 5.5 MB. * Output Dataframe size is 737.8 kB. 2) query(expr="Genre=='Sports'", inplace=False): Metadata: * Removed 1761 rows (81.49005090236003%), 400 rows remaining. Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 737.8 kB. * Output Dataframe size is 135.9 kB. 3) nlargest(n=10, columns="NA_Sales", keep='first'): Metadata: * Picked 10 largest rows by columns (NA_Sales). Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 135.9 kB. * Output Dataframe size is 3.3 kB
We can see six Madden NFL games taking place in the US top 10 for PS2 sports games and Pandas-Log describing how the operations took place. With those examples, we can see the expressivity of the Pandas-Log tool and its usefulness while debugging and documenting Data Science Projects.
Utilization example of Pandas-Log in a Data Science Project:
Suppose your client wants his Data Science project error-free and wants to check its inner workings once in a while. In this case, show them how you perform the data queries, by giving your client access to a Verbosity option. A Verbosity variable will determine if the program will show extra output while doing queries:
Verbosity variable use example with Pandas-Log:
import pandas as pd import numpy as np import pandas_log # Set if the program will show the Pandas-Log output while doing a query # verbosity = False verbosity = True dataframe = pd.read_csv("vgsales.csv") # Execute two important queries asked by the client # When the client wants to be assured of the results, he would set the verbosity to 1 if(verbosity): with pandas_log.enable(): top_five_take_two_games = dataframe.query("Publisher=='Take-Two Interactive'").nsmallest(5,"Rank") else: top_five_take_two_games = dataframe.query("Publisher=='Take-Two Interactive'").nsmallest(5,"Rank") top_five_take_two_games
Pandas-Log output:
1) query(expr="Publisher=='Take-Two Interactive'", inplace=False): Metadata: * Removed 16185 rows (97.5117484034221%), 413 rows remaining. Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 5.5 MB. * Output Dataframe size is 140.6 kB. 2) nsmallest(n=5, columns="Rank", keep='first'): Metadata: * Picked 5 smallest rows by columns (Rank). Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 140.6 kB. * Output Dataframe size is 1.7 kB.
Top 5 Best-selling games from Take-Two Interactive:
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
16 | 17 | Grand Theft Auto V | PS3 | 2013.0 | Action | Take-Two Interactive | 7.01 | 9.27 | 0.97 | 4.14 | 21.40 |
17 | 18 | Grand Theft Auto: San Andreas | PS2 | 2004.0 | Action | Take-Two Interactive | 9.43 | 0.40 | 0.41 | 10.57 | 20.81 |
23 | 24 | Grand Theft Auto V | X360 | 2013.0 | Action | Take-Two Interactive | 9.63 | 5.31 | 0.06 | 1.38 | 16.38 |
24 | 25 | Grand Theft Auto: Vice City | PS2 | 2002.0 | Action | Take-Two Interactive | 8.41 | 5.49 | 0.47 | 1.78 | 16.15 |
38 | 39 | Grand Theft Auto III | PS2 | 2001.0 | Action | Take-Two Interactive | 6.99 | 4.51 | 0.30 | 1.30 | 13.10 |
if(verbosity): with pandas_log.enable(): top_five_sony_games = dataframe.query("Publisher=='Sony Computer Entertainment'").nsmallest(5,"Rank") else: top_five_sony_games = dataframe.query("Publisher=='Sony Computer Entertainment'").nsmallest(5,"Rank") top_five_sony_games
Pandas-Log output:
1) query(expr="Publisher=='Sony Computer Entertainment'", inplace=False): Metadata: * Removed 15915 rows (95.88504639113147%), 683 rows remaining. Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 5.5 MB. * Output Dataframe size is 237.8 kB. 2) nsmallest(n=5, columns="Rank", keep='first'): Metadata: * Picked 5 smallest rows by columns (Rank). Execution Stats: * Execution time: Step Took a moment seconds.. * Input Dataframe size is 237.8 kB. * Output Dataframe size is 1.7 kB.
Top 5 Best-selling games from Sony Computer Entertainment:
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
28 | 29 | Gran Turismo 3: A-Spec | PS2 | 2001.0 | Racing | Sony Computer Entertainment | 6.85 | 5.09 | 1.87 | 1.16 | 14.98 |
47 | 48 | Gran Turismo 4 | PS2 | 2004.0 | Racing | Sony Computer Entertainment | 3.01 | 0.01 | 1.10 | 7.53 | 11.66 |
52 | 53 | Gran Turismo | PS | 1997.0 | Racing | Sony Computer Entertainment | 4.02 | 3.87 | 2.54 | 0.52 | 10.95 |
54 | 55 | Gran Turismo 5 | PS3 | 2010.0 | Racing | Sony Computer Entertainment | 2.96 | 4.88 | 0.81 | 2.12 | 10.77 |
66 | 67 | Final Fantasy VII | PS | 1997.0 | Role-Playing | Sony Computer Entertainment | 3.01 | 2.47 | 3.28 | 0.96 | 9.72 |
Pandas-Log, make your Pandas debugging easier:
Those are the use examples of Pandas-Log, with it, you can add those Pandas-Logs to the documentation of your project, so your clients can be sure that the queries that you make are working. Thanks for reading!
Are you still curious about Pandas-Log and the functions used in this tutorial? If yes, check their documentation by accessing their links below:
The code utilized in this post is available here on Github.