Pandas-Log and Its Debugging Capabilities

Introduction:

Working with data has a strong connection with programming. Therefore, a Data Scientist who knows the best practices of software development will have advantages while solving Data Science problems. To achieve this, one needs to go beyond the basics while studying Pandas. In this tutorial, we will debug some Pandas code with the help of Pandas-Log.

Debugging errors in the source code and preparing data is a great part of a data scientist’s job. But what if you could solve errors faster? Perhaps, looking “under the hood” of Pandas will make us find the solution faster. And improve the explainability of our operations in the data for free!

For this objective, we will use Pandas-Log, an open-source Pandas logging tool. This tool will help us to see what is happening while we are applying processing steps to data. A Pandas operation can affect rows and columns, Pandas-Log shows us how many columns and rows changed. In the case of any alteration of the Data Frame, it shows what are the remaining rows and columns.

Let’s get a dataset to work with and show a few logging examples. We will use game sales data from Kaggle, Google’s Data Science community, it is available on the following link: Video Game Sales – Kaggle. If the data is no longer available on Kaggle, you can download it from here on my Github.

The data contains features such as name, platform, year, genre, publisher, and the number of sales by global region. Let’s explore the data and do an analysis of it? If we find some difficulty while doing it, we will use the help of Pandas-Log.

Seeing “under the hood” of Queries with Pandas-Log:

In this step, we will see the structure of the dataset:

import pandas as pd
import numpy as np

df = pd.read_csv("vgsales.csv")

df.head()

Games sales data:

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
0	1	Wii Sports	Wii	2006.0	Sports	Nintendo	41.49	29.02	3.77	8.46	82.74
1	2	Super Mario Bros.	NES	1985.0	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24
2	3	Mario Kart Wii	Wii	2008.0	Racing	Nintendo	15.85	12.88	3.79	3.31	35.82
3	4	Wii Sports Resort	Wii	2009.0	Sports	Nintendo	15.75	11.01	3.28	2.96	33.00
4	5	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37

How this simple call of the Pandas head() method would look like if we were using Pandas-Log? To use it, just install it via PIP along with its required packages.

pip install --user pandas-log
pip install --user pandas-flavor
pip install --user humanize

After the installation of the pandas-log package and its requirements, we can use it by importing pandas_log:

import pandas_log
with pandas_log.enable():
    df.head()

Pandas-Log Output:

1) head(n: int = 5):
    Metadata:
    * Picked the first 5 rows.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 1.7 kB.

With Pandas-log enabled, the description of the function “head()” is displayed, such as the size of the input Dataframe and the size of the output Dataframe.

After looking at the data, everyone can agree that Nintendo is a strong video game publisher. To see clearer the Nintendo influence in the video game market, we will filter the data to Nintendo games developed before the 2000s. Only Nintendo games in the top 50 will be accepted.

old_nintendo = df.copy().query("Year<2000").query("Rank<50").query("Publisher=='Nlntendo'")
old_nintendo

Best-selling Nintendo old Games… Wait! Zero games?!

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales

The output DataFrame is empty! Nintendo didn’t make any successful videogame before the 2000s? What about Mario and Donkey Kong?

Let’s bring Pandas-Log to the rescue! It will help us find any mistake made in the query.

with pandas_log.enable():
    old_nintendo = df.copy().query("Year<2000").query("Rank<50").query("Publisher=='Nlntendo'")

Pandas-Log output:

1) query(expr="Year<2000", inplace=False):
    Metadata:
    * Removed 14624 rows (88.10700084347512%), 1974 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 661.8 kB.

2) query(expr="Rank<50", inplace=False):
    Metadata:
    * Removed 1964 rows (99.49341438703141%), 10 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 661.8 kB.
    * Output Dataframe size is 3.3 kB.

3) query(expr="Publisher=='Nlntendo'", inplace=False):
    Metadata:
    * Removed 10 rows (100.0%), 0 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 3.3 kB.
    * Output Dataframe size is 0 Bytes.

Looks like the incorrect step was in the last query, the query that checks if the Publisher of the remaining 10 rows is Nintendo. Wait! Looks like “Nlntendo” contains a typo, it should be “Nintendo”. See how the output Dataframe looks like after the correction:

old_nintendo = df.copy().query("Year<2000").query("Rank<50").query("Publisher=='Nintendo'")
old_nintendo

Best-selling old Nintendo games under the Top 50:

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
1	2	Super Mario Bros.	NES	1985.0	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24
4	5	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37
5	6	Tetris	GB	1989.0	Puzzle	Nintendo	23.20	2.26	4.22	0.58	30.26
9	10	Duck Hunt	NES	1984.0	Shooter	Nintendo	26.93	0.63	0.28	0.47	28.31
12	13	Pokemon Gold/Pokemon Silver	GB	1999.0	Role-Playing	Nintendo	9.00	6.18	7.20	0.71	23.10
18	19	Super Mario World	SNES	1990.0	Platform	Nintendo	12.78	3.75	3.54	0.55	20.61
21	22	Super Mario Land	GB	1989.0	Platform	Nintendo	10.83	2.71	4.18	0.42	18.14
22	23	Super Mario Bros. 3	NES	1988.0	Platform	Nintendo	9.54	3.44	3.84	0.46	17.28
30	31	Pokémon Yellow: Special Pikachu Edition	GB	1998.0	Role-Playing	Nintendo	5.89	5.04	3.12	0.59	14.64
46	47	Super Mario 64	N64	1996.0	Platform	Nintendo	6.91	2.85	1.91	0.23	11.89

Let’s see how another query looks under the hood with Pandas-Log. A query will return a DataFrame containing the top 10 sports games of the Playstation 2 platform. The last operation will sort the games by the number of their US sales:

df.query("Platform=='PS2'").query("Genre=='Sports'").nlargest(10,"NA_Sales")

Best-Selling PS2 sports games in the US:

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
179	180	Madden NFL 2004	PS2	NaN	Sports	Electronic Arts	4.26	0.26	0.01	0.71	5.23
238	239	Madden NFL 2005	PS2	2004.0	Sports	Electronic Arts	4.18	0.26	0.01	0.08	4.53
211	212	Madden NFL 06	PS2	2005.0	Sports	Electronic Arts	3.98	0.26	0.01	0.66	4.91
240	241	Madden NFL 07	PS2	2006.0	Sports	Electronic Arts	3.63	0.24	0.01	0.61	4.49
279	280	Madden NFL 2003	PS2	2002.0	Sports	Electronic Arts	3.36	0.21	0.01	0.56	4.14
248	249	Tony Hawk’s Pro Skater 3	PS2	2001.0	Sports	Activision	2.66	1.29	0.01	0.46	4.41
451	452	Madden NFL 2002	PS2	2001.0	Sports	Electronic Arts	2.50	0.16	0.01	0.42	3.08
306	307	Tony Hawk’s Underground	PS2	2003.0	Sports	Activision	2.29	1.17	0.01	0.42	3.90
534	535	NBA Street	PS2	2001.0	Sports	Electronic Arts	2.19	0.22	0.00	0.38	2.79
580	581	ESPN NFL 2K5	PS2	2004.0	Sports	Sega	2.15	0.12	0.00	0.36	2.62

with pandas_log.enable():
    df.query("Platform=='PS2'").query("Genre=='Sports'").nlargest(10,"NA_Sales")

Pandas-Log Output:

1) query(expr="Platform=='PS2'", inplace=False):
    Metadata:
    * Removed 14437 rows (86.98035907940715%), 2161 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 737.8 kB.

2) query(expr="Genre=='Sports'", inplace=False):
    Metadata:
    * Removed 1761 rows (81.49005090236003%), 400 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 737.8 kB.
    * Output Dataframe size is 135.9 kB.

3) nlargest(n=10, columns="NA_Sales", keep='first'):
    Metadata:
    * Picked 10 largest rows by columns (NA_Sales).
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 135.9 kB.
    * Output Dataframe size is 3.3 kB

We can see six Madden NFL games taking place in the US top 10 for PS2 sports games and Pandas-Log describing how the operations took place. With those examples, we can see the expressivity of the Pandas-Log tool and its usefulness while debugging and documenting Data Science Projects.

Utilization example of Pandas-Log in a Data Science Project:

Suppose your client wants his Data Science project error-free and wants to check its inner workings once in a while. In this case, show them how you perform the data queries, by giving your client access to a Verbosity option. A Verbosity variable will determine if the program will show extra output while doing queries:

Verbosity variable use example with Pandas-Log:

import pandas as pd
import numpy as np
import pandas_log

# Set if the program will show the Pandas-Log output while doing a query
# verbosity = False
verbosity = True

dataframe = pd.read_csv("vgsales.csv")

# Execute two important queries asked by the client
# When the client wants to be assured of the results, he would set the verbosity to 1
if(verbosity):
    with pandas_log.enable():
        top_five_take_two_games = dataframe.query("Publisher=='Take-Two Interactive'").nsmallest(5,"Rank")
else:
    top_five_take_two_games = dataframe.query("Publisher=='Take-Two Interactive'").nsmallest(5,"Rank")

top_five_take_two_games

Pandas-Log output:

1) query(expr="Publisher=='Take-Two Interactive'", inplace=False):
    Metadata:
    * Removed 16185 rows (97.5117484034221%), 413 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 140.6 kB.

2) nsmallest(n=5, columns="Rank", keep='first'):
    Metadata:
    * Picked 5 smallest rows by columns (Rank).
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 140.6 kB.
    * Output Dataframe size is 1.7 kB.

Top 5 Best-selling games from Take-Two Interactive:

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
16	17	Grand Theft Auto V	PS3	2013.0	Action	Take-Two Interactive	7.01	9.27	0.97	4.14	21.40
17	18	Grand Theft Auto: San Andreas	PS2	2004.0	Action	Take-Two Interactive	9.43	0.40	0.41	10.57	20.81
23	24	Grand Theft Auto V	X360	2013.0	Action	Take-Two Interactive	9.63	5.31	0.06	1.38	16.38
24	25	Grand Theft Auto: Vice City	PS2	2002.0	Action	Take-Two Interactive	8.41	5.49	0.47	1.78	16.15
38	39	Grand Theft Auto III	PS2	2001.0	Action	Take-Two Interactive	6.99	4.51	0.30	1.30	13.10

if(verbosity):
    with pandas_log.enable():
        top_five_sony_games = dataframe.query("Publisher=='Sony Computer Entertainment'").nsmallest(5,"Rank")
else:
    top_five_sony_games = dataframe.query("Publisher=='Sony Computer Entertainment'").nsmallest(5,"Rank")

top_five_sony_games

Pandas-Log output:

1) query(expr="Publisher=='Sony Computer Entertainment'", inplace=False):
    Metadata:
    * Removed 15915 rows (95.88504639113147%), 683 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 237.8 kB.

2) nsmallest(n=5, columns="Rank", keep='first'):
    Metadata:
    * Picked 5 smallest rows by columns (Rank).
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 237.8 kB.
    * Output Dataframe size is 1.7 kB.

Top 5 Best-selling games from Sony Computer Entertainment:

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
28	29	Gran Turismo 3: A-Spec	PS2	2001.0	Racing	Sony Computer Entertainment	6.85	5.09	1.87	1.16	14.98
47	48	Gran Turismo 4	PS2	2004.0	Racing	Sony Computer Entertainment	3.01	0.01	1.10	7.53	11.66
52	53	Gran Turismo	PS	1997.0	Racing	Sony Computer Entertainment	4.02	3.87	2.54	0.52	10.95
54	55	Gran Turismo 5	PS3	2010.0	Racing	Sony Computer Entertainment	2.96	4.88	0.81	2.12	10.77
66	67	Final Fantasy VII	PS	1997.0	Role-Playing	Sony Computer Entertainment	3.01	2.47	3.28	0.96	9.72

Pandas-Log, make your Pandas debugging easier:

Those are the use examples of Pandas-Log, with it, you can add those Pandas-Logs to the documentation of your project, so your clients can be sure that the queries that you make are working. Thanks for reading!

Are you still curious about Pandas-Log and the functions used in this tutorial? If yes, check their documentation by accessing their links below:

The code utilized in this post is available here on Github.

Data Courses

Introduction:

Seeing “under the hood” of Queries with Pandas-Log:

Games sales data:

Best-selling Nintendo old Games… Wait! Zero games?!

Best-selling old Nintendo games under the Top 50:

Best-Selling PS2 sports games in the US:

Utilization example of Pandas-Log in a Data Science Project:

Verbosity variable use example with Pandas-Log:

Top 5 Best-selling games from Take-Two Interactive:

Top 5 Best-selling games from Sony Computer Entertainment:

Pandas-Log, make your Pandas debugging easier:

Are you still curious about Pandas-Log and the functions used in this tutorial? If yes, check their documentation by accessing their links below: