If you’ve worked with Excel and/or with data for a while, you’ve probably dealt with pesky date formats. Dates come in a variety of makes and models including “d/m/y”, “mm/dd/yy”, “yyyy-mm-dd” and on and on. There’s nothing worse than opening Excel, applying filters, then dropping the filter menu down to find that Excel is reading a date column as strings.
Normally we would see the correct month and day but instead we see a string of numbers and slashes. Aside from not being able to convert this to your preferred format, perhaps the bigger problem is that you cannot easily sort or filter this data by date and who knows if a database can correctly parse it should you try to upload the data there.
My personal favorite date format is yyyy-mm-dd so we’re going to go through how I would fix this in Excel, R, and Python. The date format for this data is “dd/mm/yyyy hh:mm.” For Excel, I would do the following:
- Add a column to the right of the Created column.
- Use the text to Column function to break apart the cell by the space.
- Delete the Time Column.
- Add 3 columns to the right of the Created column.
- Use the text to column function to break apart the column by the forward slash.
- In the empty column, use the Date function and specify the year, month, and day.
Now, we should have the proper date format we need and we can check this by applying a filter and looking for the month names and days in the filter options. If you’re quick with the keyboard, this may not be so bad, but many people rely on their mouse, making this a lengthy process.
This manipulation can be done quite a bit easier with R and the lubridate package. After reading the csv into R and calling head(), we see that the Created column is a character.
First, we need to explain how to convert the data type to dttm (date time) and we’ll use the dmy_hm() function to do so.
data$`Cleaned Date` <- dmy_hm(data$Created)
I’m placing the “Cleaned Date” in its own column so that we can see the change, but you could overwrite the Created column as well.
If we want to get this in the exact format as Excel,”yyyy-mm-dd”, we can call as_date() to hack off the time at the end.
data$`Cleaned Date` <- as_date(data$`Cleaned Date`)
Note that the data type of Cleaned Date is now date and in just a few lines of code, we’ve converted these dates to the format we want. Here’s a link to the Lubridate GitHub Page so that you can see some of the functions that exist and another link to R for Data Science – Dates and Times Chapter. By the way, R for Data Science is a free book written by Garrett Grolemund and Hadley Wickham.
Using Pandas and Datetime, we can make this fix in just about one line of code. Reading the csv and calling head, we see that we still have the bad date format.
import pandas as pd import datetime as dt df = pd.read_csv(“Date Format Data.csv”) df['Cleaned Date'] = pd.to_datetime(df['Created'],format="%d/%m/%Y %H:%M").dt.date
Here, we’ve used pandas.to_datetime to convert to the date time format, the format argument to explain the date format we’re expecting, and then datetime.date to keep only the date. This one line is certainly much easier and faster than the 6 step Excel fix.
While this is a relatively simple example that doesn’t scratch the surface for all of the date formats that you might see, we can see the power of program languages. In my opinion, dealing with date formats is reason enough to begin learning Python or R!
library(tidyverse) library(lubridate) data <- read_csv("Date Format Data.csv") head(data) data$`Cleaned Date` <- dmy_hm(data$Created) data$`Cleaned Date`<- as_date(data$`Cleaned Date`)
import pandas as pd import datetime as dt df = pd.read_csv('Date Format Data.csv') df.head() df['Cleaned Date'] = pd.to_datetime(df['Created'],format="%d/%m/%Y %H:%M").dt.date