My Top 3 Methods & Functions From Pandas & Numpy

Well, somehow it’s May and I’ve yet to share anything new. I’m a bit behind my goal of 12 posts for the year but I suppose there’s still time!

Since starting at Pinterest, I’ve found myself putting down R and using much more Python. We have an awesome Jupyter setup with some internal libraries, notebook extensions, and magics that make gathering and analyzing data much easier. At first, it felt a bit foreign to be using Python but as with most things, the more you do them, the more familiar they become.

I’d like to share 3 of my favorite methods/functions that I’ve discovered since working with Pandas and Numpy a bit more. I’ve often felt that it’s difficult to know what you don’t know, so hopefully these will be new and as useful for you as they are for me.


One thing I’ve always loved about R is the built-in datasets. I always found it challenging to make random or fake data to play with, and the built-in datasets completely removed this headache. I stumbled upon the PyDataset library, which has many of the R datasets ported over to Python. PyDataset is easily installed with pip. By the way, if Jupyter notebooks are new to you: prefix a command with ‘!’ in a cell and it will run at the command line.

!pip install pydataset
import pandas as pd
import numpy as np

from pydataset import data

df = data('Cars93')


Take a look at the GitHub page for a quick demo. The show_doc option is super helpful to get some information on the dataset. I’ll use the Cars93 dataset for the rest of this post.

data('Cars93', show_doc=True)



My favorite Pandas discovery to date is the value_counts() method. It is super helpful in exploring a new dataset, especially one that is in a tidy format where each row is an observation. Essentially, it will quickly give you a frequency table of each unique value and the number of times it appears, sorted descending, for a given Series.

I find myself using it to get an idea of the data filling a column. You could also use something like the unique() method, but I prefer value_counts() since it also gives you a sense of the number of times each observation appears. Looking at the Cars93 dataset, we’ll use this method to get a sense of the different car manufacturers in the data.


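As a minimal sketch of the call (shown here on a small made-up DataFrame rather than the real Cars93 data, so the counts below are illustrative only):

```python
import pandas as pd

# Toy stand-in for the Cars93 data (illustrative values only)
df = pd.DataFrame({'Manufacturer': ['Ford', 'Ford', 'Chevrolet', 'Dodge', 'Ford']})

# Count how many times each manufacturer appears, sorted descending
counts = df['Manufacturer'].value_counts()
print(counts)
# 'Ford' appears three times in this toy data, so it sorts to the top
```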

Right away, because the data is sorted nicely, we can see that this data is made up of mostly Ford, Chevrolet, and Dodge vehicles and very few Chrysler, Suzuki, and Saab vehicles.


Numpy’s isin function is quite useful when comparing the contents of lists or Series. I think this is best demonstrated by a practical example.

Let’s say we’re looking to purchase a new car. Luckily we have a nice dataset that we can use to help narrow down our search! Conveniently, we have a “Passengers” column that the documentation tells us is the passenger capacity.


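value_counts() gives a quick look at that column too. A sketch, again with made-up capacities rather than the real Cars93 values:

```python
import pandas as pd

# Illustrative passenger capacities (not the actual Cars93 figures)
df = pd.DataFrame({'Passengers': [5, 4, 6, 5, 2, 4, 5]})

# How many cars offer each passenger capacity?
capacity_counts = df['Passengers'].value_counts()
print(capacity_counts)
```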

We need to make sure that our family of 4 can all fit in the car at the same time, but we also live in the city, so too big of a car will be a hassle to park. We’ll create a separate list of the passenger capacities that we feel will be appropriate for our new car.

seats_needed = [4, 5]

Let’s say we want to know which Manufacturer has the most cars that will fit our needs. This is where isin becomes handy. We can use isin to subset our DataFrame to only the cars that we know have enough seats for us. We can then grab only the “Manufacturer” column and call value_counts() to identify the Manufacturer that has the most cars for us.


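A sketch of that chain on a toy DataFrame (the list of seat counts mirrors the example above; the data itself is made up):

```python
import pandas as pd

# Toy stand-in for Cars93 (illustrative values only)
df = pd.DataFrame({
    'Manufacturer': ['Ford', 'Chevrolet', 'Ford', 'Dodge', 'Suzuki'],
    'Passengers':   [5, 4, 7, 5, 2],
})

seats_needed = [4, 5]

# Keep only the rows whose passenger capacity is in our list,
# then count the remaining cars per manufacturer
fit_counts = df[df['Passengers'].isin(seats_needed)]['Manufacturer'].value_counts()
print(fit_counts)
```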

With this line written, we can substitute “Manufacturer” with just about any other column of the DataFrame to help analyze the data and find the right car for us. And while this is a relatively small dataset, imagine having something like 500,000 rows and needing to check against 250 different options. isin becomes extremely useful in those scenarios.


The last helper we’ll take a look at is numpy.where. numpy.where works almost exactly like R’s ifelse and is helpful in labeling or categorizing data. We’ll use it with isin to label the cars we can consider and the cars that won’t fit our needs. In this case, we want to keep the whole dataset in case we change our minds later down the line and want to consider a vehicle that is slightly larger.

df['Consideration'] = np.where(df['Passengers'].isin(seats_needed), 'Consider', 'Not For Me')

Here, we’re passing a condition to check; in this case, the condition uses the isin method we discussed earlier. Where the condition is met, the first label is applied; where it is not, the second label is applied. These labels are placed in a new column that we’re calling “Consideration.”

To bring this full circle, we can now use value_counts() to see how many cars we’ll be considering and how many don’t fit our needs.


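Putting the labeling and the tally together, a self-contained sketch (again on made-up data, so the counts are only illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for Cars93 (illustrative values only)
df = pd.DataFrame({'Passengers': [5, 4, 7, 5, 2]})
seats_needed = [4, 5]

# Label each row based on whether its capacity fits our list,
# then tally the labels
df['Consideration'] = np.where(df['Passengers'].isin(seats_needed), 'Consider', 'Not For Me')
label_counts = df['Consideration'].value_counts()
print(label_counts)
```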

Bonus – %whos magic

One of the things I miss most stepping away from R is RStudio. I particularly like the Environment Pane and felt like I was working in the dark without it in Jupyter notebooks. I recently stumbled upon a tweet from @python_tip with a Jupyter magic that somewhat recreates the feel of the Environment Pane.


[Screenshot: output of the %whos magic, listing each variable in the workspace along with its type and a summary of its value]

This is a handy magic to see what’s going on in your workspace and I recommend following @python_tip for other useful bits.

That’s all I have for now. You can check out my GitHub page for the notebook I used for this post. Thanks for reading!


