Finding Problematic Tweets with Tweepy

How many times have you said something you’ve later regretted? On top of that, how many times have you posted something on social media that you’ve later regretted? Unfortunately, it’s probably happened to just about everyone and, to make things a bit worse, just about everything is saved on social media. Athletes such as Donte DiVincenzo and Josh Allen have had huge moments in their careers only to be ripped to shreds for problematic tweets from the past.

Interestingly, Twitter has opened up its platform and built a rich API that allows users to interact with its data. Packages such as R’s rtweet and Python’s tweepy can easily connect you to the various Twitter APIs. You might use these connections to create a simple bot or dive into tweets for some text analysis. In this post, we’re going to use tweepy to find old, potentially problematic tweets by pulling a user’s tweets and searching them for a keyword.

To get started, we’ll need to authenticate, and to do that we’ll need a Consumer Key, a Consumer Secret, an Access Token, and an Access Token Secret. There are several tutorials out there for obtaining these bits, but I found rtweet’s tutorial the clearest. Once we have these, we can run the Hello Tweepy example found in tweepy’s documentation. As a quick note, I’m importing my credentials from another file so I don’t share them with the world, as I believe that violates Twitter’s terms and conditions.
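For anyone wondering what that separate credentials file might look like, here is a minimal sketch that reads the four values from environment variables rather than hard-coding them. The environment variable names and the load_credentials helper are my own assumptions for illustration; they are not part of tweepy or of the original setup.

```python
import os

# Hypothetical environment variable names; pick whatever names suit you.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, failing loudly if any is missing."""
    missing = [env for env in ENV_NAMES.values() if not environ.get(env)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[env] for name, env in ENV_NAMES.items()}


# Inside a file such as tweepy_creds.py, the module-level names could then be
# filled in like this (left commented out so the sketch runs without real keys):
# creds = load_credentials()
# consumer_key = creds["consumer_key"]
# consumer_secret = creds["consumer_secret"]
# access_token = creds["access_token"]
# access_token_secret = creds["access_token_secret"]
```

Whichever approach you choose, the point is the same: keep the file (or the variables) out of anything you publish.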

import tweepy

# Credentials live in a separate, unshared file (tweepy_creds.py)
from tweepy_creds import consumer_key, consumer_secret, access_token, access_token_secret

# Authenticate with the consumer key/secret, then attach the access token
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

# Print the text of each tweet on the authenticated user's home timeline
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

Home Timeline

If all goes well, we should get the tweets from the home timeline. Reading through the documentation, user_timeline() returns the 20 most recent statuses from the authenticated user or from a specific user. According to Wikipedia, Katy Perry has the most followers on Twitter, so we’ll use her as an example.

my_tweets = api.user_timeline(screen_name='katyperry')
for tweet in my_tweets:
    print(tweet.text)
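As a side note, user_timeline() also accepts a count argument (the endpoint caps it at 200 per request) and tweet_mode='extended', which asks Twitter for the untruncated full_text of each tweet instead of the shortened text. The recent_texts helper below is a hypothetical wrapper of my own, sketched under the assumption that api is an authenticated tweepy.API object; retweets may still come back shortened, so treat it as a starting point.

```python
def recent_texts(api, screen_name, count=20):
    """Return the text of a user's most recent tweets, untruncated where possible."""
    statuses = api.user_timeline(
        screen_name=screen_name,
        count=count,            # the endpoint caps this at 200 per request
        tweet_mode="extended",  # ask for full_text rather than truncated text
    )
    # Fall back to .text when a status carries no full_text attribute.
    return [getattr(status, "full_text", None) or status.text for status in statuses]


# Example usage with a real, authenticated client:
# for text in recent_texts(api, "katyperry", count=50):
#     print(text)
```

This matters for keyword searches later on: a keyword that sits past the truncation point of text would otherwise be missed.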

But there has to be more than pulling the text of someone’s tweets, right? Of course there is.

API Response

Printing the whole response, we can see that there is quite a lot we can get back. However, for this project, we’ll keep things simple and focus on just three fields: “created_at”, the time of the tweet; “text”, the text of the tweet; and “id”, the id that Twitter has assigned to the tweet, which can be used to build a URL to the tweet. With these pieces in place, we can start putting everything together to create something more useful.

import pandas as pd
import tweepy

# Credentials live in a separate, unshared file (tweepy_creds.py)
from tweepy_creds import consumer_key, consumer_secret, access_token, access_token_secret

# Whose timeline to pull, and the word to search for later
screen_name = 'nick_bap'
keyword = 'python'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

# One list per column of the eventual dataframe
tweet_time = []
my_tweet = []
tweet_url = []

# Cursor pages through the timeline instead of stopping at the first 20 tweets
for status in tweepy.Cursor(api.user_timeline, screen_name=screen_name).items():
    time = status.created_at
    tweet = status.text
    tweet_id = str(status.id)

    # The tweet id is enough to rebuild a link to the tweet
    url = f'https://twitter.com/{screen_name}/status/{tweet_id}'

    tweet_time.append(time)
    my_tweet.append(tweet)
    tweet_url.append(url)

# None removes the column width limit so tweets are not cut off when printed
pd.set_option('display.max_colwidth', None)

df = pd.DataFrame(list(zip(tweet_time, my_tweet, tweet_url)),
                  columns=['Time of Tweet', 'Tweet', 'URL to Tweet'])

print(df.head())

df.to_csv('My Tweets - All.csv', index=False)

Here, we’ve essentially looped through the response, grabbing the date and time, the text, and the id of each tweet. Then we appended each bit to a list and gathered the lists into a dataframe. We’ve also used pagination, via tweepy.Cursor, to iterate through the user’s timeline since the original call only returned 20 tweets. As another quick note, I’ve only used the “Standard” subscription, which will not return a user’s entire timeline. At most, I’ve returned about 3,000 tweets, which lines up with the roughly 3,200 most recent tweets the user timeline endpoint is documented to return; to get a full archive, it appears that you may need to upgrade to the “Enterprise” subscription. We’ve also used pd.set_option('display.max_colwidth', None) so that as much of each tweet will print as possible (older pandas versions accepted -1 here, but that spelling was deprecated in pandas 1.0), and we’ve written a CSV to review the data as well.
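The same gathering step can also be written with one dictionary per tweet rather than three parallel lists, which keeps each tweet’s fields together. This is only a sketch of an alternative: collect_rows and build_tweet_url are hypothetical helper names of my own, and statuses is assumed to be any iterable of status objects, such as the one returned by tweepy.Cursor(...).items().

```python
from itertools import islice


def build_tweet_url(screen_name, tweet_id):
    """Build a link to a tweet from the user's screen name and the tweet id."""
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"


def collect_rows(statuses, screen_name, limit=None):
    """Turn status objects into one dict per tweet; limit=None takes everything."""
    rows = []
    for status in islice(statuses, limit):
        rows.append({
            "Time of Tweet": status.created_at,
            "Tweet": status.text,
            "URL to Tweet": build_tweet_url(screen_name, status.id),
        })
    return rows


# Example usage with a real, authenticated client:
# cursor = tweepy.Cursor(api.user_timeline, screen_name=screen_name)
# rows = collect_rows(cursor.items(), screen_name)
# df = pd.DataFrame(rows)  # a list of dicts becomes one row per dict
```

The limit argument is handy while experimenting, since it lets you stop after a handful of tweets instead of paging through the whole timeline each time.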

DataFrame Head

While this is great, we’re really after tweets that contain a specific word. We’ll use str.contains to find the tweets that contain the keyword anywhere in their text, ignoring case. For a “stricter” filter, you can use str.match, which only matches at the start of the string; here’s a quick tutorial/comparison of the two.

df_filtered = df[df['Tweet'].str.contains(keyword, case=False)]

print(df_filtered)

df_filtered.to_csv(f'My Tweets - Filtered for {keyword}.csv', index=False)
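To make the difference between the two filters concrete: str.contains behaves like re.search (a match anywhere in the text), while str.match behaves like re.match (a match only at the start). The sketch below uses the standard library’s re module on a few made-up tweets rather than pandas, purely so it runs without any data.

```python
import re

# Made-up example tweets, for illustration only
tweets = ["Python is fun", "I love python", "Monty's flying circus"]
keyword = "python"

# Like df['Tweet'].str.contains(keyword, case=False): match anywhere
contains_hits = [t for t in tweets if re.search(keyword, t, flags=re.IGNORECASE)]

# Like df['Tweet'].str.match(keyword, case=False): match at the start only
match_hits = [t for t in tweets if re.match(keyword, t, flags=re.IGNORECASE)]

print(contains_hits)  # ['Python is fun', 'I love python']
print(match_hits)     # ['Python is fun']

# Keywords are treated as regular expressions by default, so one containing
# special characters (for example "c++") should be escaped first.
literal_hits = [t for t in ["I like C++"] if re.search(re.escape("c++"), t, flags=re.IGNORECASE)]
print(literal_hits)   # ['I like C++']
```

In pandas itself, passing regex=False to str.contains is another way to search for a keyword literally.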

Now we’re showing all of the tweets I’ve shared that have “python” in them. With this in place, I’ll leave it to your imagination to think of words and phrases that may not come off well at a later time. As a bit of a hint after hearing about Donte DiVincenzo and Josh Allen, consider song lyrics, quotes, and conversations that may be taken out of context. Cheers!
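If you end up with a whole list of words and phrases rather than a single keyword, one possible approach is to combine them into a single case-insensitive pattern. The sketch below is an assumption of mine rather than part of the walkthrough above: build_pattern and flag_tweets are hypothetical helpers, and the word boundaries suit plain words and phrases more than keywords made of punctuation.

```python
import re


def build_pattern(keywords):
    """Combine keywords into one case-insensitive, whole-word pattern."""
    escaped = (re.escape(keyword) for keyword in keywords)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", flags=re.IGNORECASE)


def flag_tweets(tweets, keywords):
    """Return the tweets that mention at least one keyword as a whole word."""
    pattern = build_pattern(keywords)
    return [tweet for tweet in tweets if pattern.search(tweet)]


# With the dataframe from earlier, the compiled pattern could be reused as:
# df_flagged = df[df['Tweet'].str.contains(build_pattern(keywords))]
```

Matching whole words means a keyword such as “python” will not flag “pythonic”, which cuts down on false alarms at the cost of missing some variations.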

Resources

tweepy Documentation
rtweet’s Authentication Tutorial
Pandas: Select rows that match a string – David Hamann
