NumPy is one of the most widely used tools in data science and machine learning, and many other data science libraries are built on top of it. Pandas is one of them. Both belong in every data scientist’s toolkit. A handful of issues come up again and again when working with these libraries, and resolving them quickly lets you get back to the important business of analyzing data. In this article, we’ll look at these issues and how you can address them.
Convert Pandas DataFrame to NumPy array
In certain cases, you might want to convert a Pandas DataFrame to a NumPy array. For instance, you may want to do this when applying the data to a machine learning algorithm. There are a couple of ways to do that.
Let’s create some dummy data to illustrate this. The first step is to import Pandas and NumPy. Next, we create a Python dictionary containing some dummy data. Finally, we pass the dictionary to the Pandas `DataFrame` constructor to create the new DataFrame.
import pandas as pd
import numpy as np

names_dict = {
    'Name': ['Ken', 'Jeff', 'John', 'Mike', 'Andrew', 'Ann', 'Sylvia', 'Dorothy', 'Emily', 'Loyford'],
    'Age': [31, 52, 56, 12, 45, np.nan, 78, 85, 46, 135],
    'Phone': [52, 79, 80, 75, 43, 125, 74, 44, 85, 45],
    'Uni': ['One', 'Two', 'Three', 'One', 'Two', 'Three', 'One', 'Two', 'Three', 'One']
}
df = pd.DataFrame(names_dict)
One of the ways the above DataFrame can be converted to a NumPy array is by using the `values` attribute.
df.values
You can confirm that it is a NumPy array by checking the type.
type(df.values)
The other way a Pandas DataFrame can be converted to a NumPy array is by using the `to_numpy()` method, which the Pandas documentation recommends over `values`.
df.to_numpy()
By default, the method represents null values using `nan`. You can, however, change that through the `na_value` parameter.
df.to_numpy(na_value="NAN")
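As a minimal sketch of why the conversion options matter for machine learning input, the example below uses a smaller, made-up DataFrame rather than the full dummy data. It assumes a reasonably recent Pandas version in which `to_numpy()` accepts both `dtype` and `na_value`. Mixed column types produce a generic `object` array, while selecting only the numeric columns produces a float array.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a trimmed-down version of the dummy DataFrame.
df = pd.DataFrame({
    "Name": ["Ken", "Jeff", "Ann"],
    "Age": [31, 52, np.nan],
    "Phone": [52, 79, 125],
})

# Text and numbers together force NumPy to fall back to an object array.
mixed = df.to_numpy()
print(mixed.dtype)

# Numeric columns only: request floats and replace the missing age with 0.0.
numeric = df[["Age", "Phone"]].to_numpy(dtype="float64", na_value=0.0)
print(numeric.dtype)
print(numeric)
```

Whether 0.0 is a sensible replacement for a missing value depends on the model; it is used here only to show the parameter.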
The `to_records` method is used when you want to export the DataFrame together with its column data types. The data is exported as a NumPy record array.
df.to_records()
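The sketch below shows what a record array gives you in practice, using a small made-up DataFrame. Passing `index=False` (an optional argument) leaves the index out of the export, and each column becomes a named field that keeps its own data type.

```python
import pandas as pd

# Illustrative two-column DataFrame.
df = pd.DataFrame({"Age": [31, 52], "Phone": [52, 79]})

# Export without the index; each column becomes a named field.
records = df.to_records(index=False)

print(records.dtype.names)  # field names taken from the column labels
print(records["Age"])       # select a whole field by name
print(records[0])           # a single row as a record
```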
The final alternative is to use NumPy’s `asarray` function to convert the DataFrame to an array.
np.asarray(df)
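For a quick sanity check, the following sketch compares three of the conversion routes on a small made-up DataFrame. It deliberately has no missing values, because `nan` never compares equal to itself and would make the comparison fail.

```python
import numpy as np
import pandas as pd

# Illustrative all-integer DataFrame (no missing values on purpose).
df = pd.DataFrame({"Age": [31, 52], "Phone": [52, 79]})

via_values = df.values
via_to_numpy = df.to_numpy()
via_asarray = np.asarray(df)

# All three routes yield the same two-dimensional array.
same = np.array_equal(via_values, via_to_numpy) and np.array_equal(via_to_numpy, via_asarray)
print(same)
```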
How do I convert a Pandas Series or index to a NumPy array?
A Pandas Series can be converted to a NumPy array in the same way as a Pandas DataFrame. The conversion is done by selecting a column in the DataFrame and accessing its `values` attribute.
df['Age'].values
A DataFrame’s index can be converted to a NumPy array in the same way. This is done by obtaining the index and accessing its `values` attribute.
df.index.values
Alternatively, you can convert the index to a list then use NumPy to convert to an array.
np.array(df.index.tolist())
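Both a Series and an Index also provide a `to_numpy()` method, so the same call works on either one. The sketch below uses a small made-up DataFrame with string labels as its index.

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame with a custom string index.
df = pd.DataFrame({"Age": [31, 52, np.nan]}, index=["a", "b", "c"])

ages = df["Age"].to_numpy()   # Series -> one-dimensional array
labels = df.index.to_numpy()  # Index -> one-dimensional array

print(type(ages), ages.dtype)
print(labels)
```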
NumPy or Pandas: Keeping array type as integer while having a NaN value
Pandas can keep a Series as an integer type even when it contains null values. This is done by specifying the nullable integer data type `Int64` (note the capital “I”) when creating the Series.
s = pd.Series([1, 2, np.nan], dtype='Int64')
When you add integers to this series, the type will remain the same.
s + 3
However, when you add a float to the Series, the result is coerced to a float type.
s2 = s + 9.45
You can round the result and convert it back to integers if you want the final result to be integers.
s2.round().astype('Int64')
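The sketch below puts the steps side by side and adds a plain Series for contrast, which is not part of the original example. The exact name of the float type after adding 9.45 can differ between Pandas versions, so the code prints it rather than assuming it.

```python
import numpy as np
import pandas as pd

# Without an explicit dtype, the NaN forces the whole Series to float64.
plain = pd.Series([1, 2, np.nan])
print(plain.dtype)

# With the nullable Int64 dtype, the integers stay integers.
s = pd.Series([1, 2, np.nan], dtype="Int64")
print(s.dtype)
print((s + 3).dtype)  # adding an integer keeps Int64

# Adding a float switches the result to a float type.
s2 = s + 9.45
print(s2.dtype)

# Rounding and casting restores the nullable integer type.
back = s2.round().astype("Int64")
print(back)
```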
Creating a Pandas DataFrame from a NumPy array: How do I specify the index column and column headers?
When creating a Pandas DataFrame the index can be specified using the `index` parameter. The data is specified using the `data` argument while the columns are specified using the `columns` argument.
Let’s work through an example using this data.
data = np.array([['', 'Col1', 'Col2'], ['Row1', 1, 2], ['Row2', 3, 4]])
Creating a DataFrame from this array relies on NumPy indexing. The values we want start at row 1 and column 1, because row 0 holds the column headers and column 0 holds the row labels. We want every row and every column from those positions onward, so both slices leave the upper bound open.
data[1:,1:]
The index and columns can be selected in a similar way.
data[1:,0]
data[0,1:]
Let’s now create that DataFrame while specifying the required parameters.
pd.DataFrame(data=data[1:,1:], index=data[1:,0], columns=data[0,1:])
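One detail worth checking: because the array mixes strings and numbers, NumPy stores every element as a string, so the resulting DataFrame holds text rather than integers. The sketch below rebuilds the example and adds an `astype(int)` step, which is an addition to the original code, to recover numeric values.

```python
import numpy as np
import pandas as pd

data = np.array([["", "Col1", "Col2"], ["Row1", 1, 2], ["Row2", 3, 4]])

# Mixing strings and numbers makes NumPy store everything as strings.
print(data.dtype)

frame = pd.DataFrame(data=data[1:, 1:], index=data[1:, 0], columns=data[0, 1:])

# Convert the string values back to integers before doing arithmetic.
frame = frame.astype(int)
print(frame)
print(frame["Col1"].sum())
```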
What are the differences between Pandas and NumPy+SciPy in Python?
The difference between Pandas, NumPy, and SciPy may be a bit confusing, especially the first time you hear the terms. Let’s differentiate them here.
NumPy is a Python package that is used for numerical computation. It is mainly known for its arrays referred to as NumPy arrays. NumPy provides the building blocks for various scientific packages such as Pandas.
Pandas is a Python library that is mainly used for data wrangling and analysis. It is built on top of NumPy. It provides common functions for grouping, cleaning, and merging data.
SciPy is the open-source ecosystem for these scientific packages. It houses not just NumPy and Pandas but also other scientific packages such as Matplotlib. It also offers the SciPy library which is a core library in the SciPy stack. The library provides functions for interpolation, optimization, statistics, and linear algebra. The SciPy library depends on NumPy.
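To make the division of labor between NumPy and Pandas concrete, here is a small sketch with made-up values. NumPy handles the raw numerical computation on an array, while Pandas wraps the same array with labels and adds grouping on top. The SciPy library is left out of this example.

```python
import numpy as np
import pandas as pd

# NumPy: a plain numerical array with fast vectorized operations.
ages = np.array([31, 52, 56, 12])
overall_mean = ages.mean()
print(overall_mean)

# Pandas: the same numbers with column labels, plus grouping by category.
df = pd.DataFrame({"Uni": ["One", "Two", "One", "Two"], "Age": ages})
mean_by_uni = df.groupby("Uni")["Age"].mean()
print(mean_by_uni)
```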
Final Thoughts
In this article, we have addressed a couple of common issues you might face while working with NumPy and Pandas. Converting data to NumPy arrays is a common task, especially when selecting data that needs to be fed to machine learning algorithms. The ability to properly select data from the arrays is also very practical. For instance, when visualizing data, you might want to plot certain columns and leave out others.
To access the code from this post, please see this Notebook.