Well, somehow it’s May and I’ve yet to share anything new. I’m a bit behind my goal of 12 posts for the year but I suppose there’s still time!
Since starting at Pinterest, I’ve found myself putting down R and using much more Python. We have an awesome Jupyter setup with some internal libraries, notebook extensions, and magics that make gathering and analyzing data much easier. At first, it felt a bit foreign to be using Python but as with most things, the more you do them, the more familiar they become.
I’d like to share 3 of my favorite methods/functions that I’ve discovered since working with Pandas and NumPy a bit more. I’ve often felt that it’s difficult to know what you don’t know, so hopefully these will be as new and useful for you as they were for me.
One thing I’ve always loved about R is the built-in datasets. I always found it challenging to make random or fake data to play with, and the built-in datasets completely removed this headache. I stumbled upon the PyDataset library, which has many of the R datasets ported over to Python. PyDataset is easily installed with pip. By the way, here’s something I didn’t know as a Jupyter newcomer: if you start a line in a cell with ‘!’, the command that follows will run at the command line.
!pip install pydataset
import pandas as pd
import numpy as np
from pydataset import data

df = data('Cars93')
df.head()
Take a look at the GitHub page for a quick demo. The show_doc option is super helpful to get some information on the dataset. I’ll use the Cars93 dataset for the rest of this post.
My favorite Pandas discovery to date is the value_counts() method. It is super helpful in exploring a new dataset, especially one that is in a tidy format where each row is an observation. Essentially, it will quickly give you a pivot table of each observation and the number of times it appears, sorted descending, for a given Series.
I find myself using it to get an idea of the data filling a column. You could also use something like the unique() method, but I prefer value_counts() since it also gives you a sense of the number of times each observation appears. Looking at the Cars93 dataset, we’ll use this method to get a sense of the different car manufacturers in the data.
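Here is a minimal sketch of what that call looks like. To keep it runnable without PyDataset, it uses a tiny made-up stand-in frame (the rows are invented for illustration and are not the real Cars93 values).

```python
import pandas as pd

# Made-up stand-in rows, NOT the real Cars93 data.
# With PyDataset installed you would use df = data('Cars93') instead.
df = pd.DataFrame({
    "Manufacturer": ["Ford", "Ford", "Ford", "Chevrolet", "Chevrolet", "Saab"],
    "Passengers": [5, 4, 7, 5, 6, 5],
})

# Count how often each manufacturer appears, most frequent first.
counts = df["Manufacturer"].value_counts()
print(counts)
```

On the real Cars93 frame loaded earlier, the same call is simply df['Manufacturer'].value_counts().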
Right away, because the data is sorted nicely, we can see that this data is made up of mostly Ford, Chevrolet, and Dodge vehicles and very few Chrysler, Suzuki, and Saab vehicles.
The isin method is quite useful when comparing the contents of lists or Series. Pandas provides it on Series and DataFrames, and NumPy has an equivalent numpy.isin function. I think this is best demonstrated by a practical example.
Let’s say we’re looking to purchase a new car. Luckily we have a nice dataset that we can use to help narrow down our search! Conveniently, we have a “Passengers” column that the documentation tells us is the passenger capacity.
We need to make sure that our family of 4 can all fit in the car at the same time, but we also live in the city, so too big of a car will be a hassle to park. We’ll create a separate list of the passenger counts that we feel will be appropriate for our new car.
seats_neeeded = [4, 5]
Let’s say we want to know which Manufacturer has the most cars that will fit our needs. This is where isin becomes handy. We can use isin to subset our DataFrame to only the cars that we know have enough seats for us. We can then grab only the “Manufacturer” column and call value_counts() to identify the Manufacturer that has the most cars for us.
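The steps above can be sketched roughly like this. Again, the frame is a small made-up stand-in rather than the real Cars93 data, and the variable names fits and best are my own for illustration.

```python
import pandas as pd

# Made-up stand-in rows, NOT the real Cars93 data.
df = pd.DataFrame({
    "Manufacturer": ["Ford", "Ford", "Ford", "Chevrolet", "Chevrolet", "Saab"],
    "Passengers": [5, 4, 7, 5, 6, 5],
})

seats_neeeded = [4, 5]

# Keep only the rows whose passenger capacity is in our list.
fits = df[df["Passengers"].isin(seats_neeeded)]

# Count the remaining cars per manufacturer, most frequent first.
best = fits["Manufacturer"].value_counts()
print(best)
```

With the real data, this collapses into one line: df[df['Passengers'].isin(seats_neeeded)]['Manufacturer'].value_counts().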
With this line written, we can substitute “Manufacturer” with just about any other column of the DataFrame to help analyze the data and find the right car for us. And while this is a relatively small dataset, imagine having something like 500,000 rows and needing to check against 250 different options. isin becomes extremely useful in those scenarios.
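To get a feel for that larger scenario, here is a rough sketch with randomly generated data. The column name item_id, the ID range, and the seed are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 500,000 rows of integer IDs between 0 and 9,999.
big = pd.DataFrame({"item_id": rng.integers(0, 10_000, size=500_000)})

# 250 distinct IDs that we want to check against.
options = rng.choice(10_000, size=250, replace=False)

# One vectorized call builds a boolean mask over all 500,000 rows.
mask = big["item_id"].isin(options)
subset = big[mask]
print(len(subset), "rows matched")
```

The call is the same one we used on the small dataset; only the size of the inputs changes.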
The last helper we’ll take a look at is numpy.where. numpy.where works almost exactly like R’s ifelse and is helpful in labeling or categorizing data. We’ll use it with isin to label the cars we can consider and the cars that won’t fit our needs. In this case, we want to keep the whole dataset in case we change our mind later down the line and want to consider a vehicle that is slightly larger.
df['Consideration'] = np.where(df['Passengers'].isin(seats_neeeded), 'Consider', 'Not For Me')
Here, we’re passing a condition to check and, in this case, the condition uses the isin method that we discussed earlier. If the condition is met, the first label is applied. If the condition is not met, the second label is applied. These labels will be placed in a new column that we’re calling “Consideration.”
To bring this full circle, we can now use value_counts() to see how many cars we’ll be considering and how many don’t fit our needs.
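Putting the pieces together, a minimal sketch might look like this. As before, the frame is a small made-up stand-in rather than the real Cars93 data, and summary is just a name I picked for the result.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows, NOT the real Cars93 data.
df = pd.DataFrame({
    "Manufacturer": ["Ford", "Ford", "Ford", "Chevrolet", "Chevrolet", "Saab"],
    "Passengers": [5, 4, 7, 5, 6, 5],
})

seats_neeeded = [4, 5]

# Label each row depending on whether its capacity is in our list.
df["Consideration"] = np.where(
    df["Passengers"].isin(seats_neeeded), "Consider", "Not For Me"
)

# Tally how many cars received each label.
summary = df["Consideration"].value_counts()
print(summary)
```

On the real data, the last step is df['Consideration'].value_counts().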
Bonus – %whos magic
One of the things I miss most stepping away from R is RStudio. I particularly like the Environment Pane and felt like I was working in the dark without it in Jupyter notebooks. I recently stumbled upon a tweet from @python_tip with a Jupyter magic that somewhat recreates the feel of the Environment Pane.
This is a handy magic to see what’s going on in your workspace and I recommend following @python_tip for other useful bits.
That’s all I have for now. You can check out my GitHub Page for the notebook I used for this post. Thanks for reading!