Pesky Date Formats

If you’ve worked with Excel and/or with data for a while, you’ve probably dealt with pesky date formats. Dates come in a variety of makes and models including “d/m/y”, “mm/dd/yy”, “yyyy-mm-dd” and on and on. There’s nothing worse than opening Excel, applying filters, then dropping the filter menu down to find that Excel is reading a date column as strings.

Dates as Strings

Normally we would see the correct month and day but instead we see a string of numbers and slashes. Aside from not being able to convert this to your preferred format, perhaps the bigger problem is that you cannot easily sort or filter this data by date and who knows if a database can correctly parse it should you try to upload the data there.

Excel

My personal favorite date format is yyyy-mm-dd so we’re going to go through how I would fix this in Excel, R, and Python. The  date format for this data is “dd/mm/yyyy hh:mm.” For Excel, I would do the following:

  1. Add a column to the right of the Created column.
  2. Use the text to Column function to break apart the cell by the space.
  3. Delete the Time Column.
  4. Add 3 columns to the right of the Created column.
  5. Use the text to column function to break apart the column by the forward slash.
  6. In the empty column, use the Date function and specify the year, month, and day.

Now, we should have the proper date format we need and we can check this by applying a filter and looking for the month names and days in the filter options. If you’re quick with the keyboard, this may not be so bad, but many people rely on their mouse, making this a lengthy process.

R

This manipulation can be done quite a bit easier with R and the lubridate package. After reading the csv into R and calling head(), we see that the Created column is a character.

Screen Shot 2018-01-16 at 10.33.33 PM

First, we need to explain how to convert the data type to dttm (date time) and we’ll use the dmy_hm() function to do so.

data$`Cleaned Date` <- dmy_hm(data$Created)

I’m placing the “Cleaned Date” in its own column so that we can see the change, but you could overwrite the Created column as well.

Screen Shot 2018-01-16 at 10.37.35 PM

If we want to get this in the exact format as Excel,”yyyy-mm-dd”, we can call as_date() to hack off the time at the end.

data$`Cleaned Date` <- as_date(data$`Cleaned Date`)

Screen Shot 2018-01-16 at 10.40.01 PM

Note that the data type of Cleaned Date is now date and in just a few lines of code, we’ve converted these dates to the format we want. Here’s a link to the Lubridate GitHub Page so that you can see some of the functions that exist and another link to R for Data Science – Dates and Times Chapter.  By the way, R for Data Science is a free book written by Garrett Grolemund and Hadley Wickham.

Python

Using Pandas and Datetime, we can make this fix in just about one line of code. Reading the csv and calling head, we see that we still have the bad date format.

df head before

import pandas as pd
import datetime as dt

df = pd.read_csv(“Date Format Data.csv”)

df['Cleaned Date'] = pd.to_datetime(df['Created'],format="%d/%m/%Y %H:%M").dt.date

Here, we’ve used pandas.to_datetime to convert to the date time format, the format argument to explain the date format we’re expecting, and then datetime.date to keep only the date. This one line is certainly much easier and faster than the 6 step Excel fix.

Screen Shot 2018-01-27 at 9.26.30 AM

While this is a relatively simple example that doesn’t scratch the surface for all of the date formats that you might see, we can see the power of program languages. In my opinion, dealing with date formats is reason enough to begin learning Python or R!

Resources

Python
Pandas.to_datetime Documentation

R
Lubridate GitHub Page
R for Data Science – Dates and Times Chapter

Code Snippets

R

library(tidyverse)
library(lubridate)

data <- read_csv("Date Format Data.csv")

head(data)

data$`Cleaned Date` <- dmy_hm(data$Created)

data$`Cleaned Date`<- as_date(data$`Cleaned Date`)

Python

import pandas as pd
import datetime as dt

df = pd.read_csv('Date Format Data.csv')

df.head()

df['Cleaned Date'] = pd.to_datetime(df['Created'],format="%d/%m/%Y %H:%M").dt.date

 

One thought on “Pesky Date Formats

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s