Skip to main content

Wrangling Dates with SCL

ยท 7 min read
Mark Wainwright

You have a structured data file you need to process. I'll use CSV here but it could be any data format that SCL supports. It has all the information you need but the date fields are spread out across multiple columns and several different date formats are used.

In this blog post I'll show how to use SCL to combine all the date fields into one and how to handle multiple formats

Sample Dataโ€‹

The following is the CSV file I'll be using in this example.

"Date","Time","TimeZone","Location"
"January, 01 2000","01:23:45","+00:00","UK"
"December, 31 1999","15:23:45","-10:00","Hawaii"
"January, 01 2000","02:23:45","+01:00","France"
"January, 01 2000","06:53:45","+05:30","India"
"01/01/2000","01:23:45","+00:00","Iceland"
"31/12/1999","15:23:45","-10:00","Hawaii"
"01/01/2000","02:23:45","+01:00","Germany"
"01/01/2000","06:53:45","+05:30","Sri Lanka"

It has the same date and time written in eight different ways with two different time formats and four different timezones.

As an additional headache, the Date, Time, and Timezone data is spread across three different fields.

Download the example file here. Or just copy the example above.

Reading the Dataโ€‹

To read the data, make sure you have the File System and Structured Data connectors configured, then run:

- <data> = FileRead 'dates-example.csv' | FromCSV

This reads the contents of the file and pipes the resulting stream into the FromCSV step which converts it into entities.

The rest of this process works with any data source, not just CSV.

Combining the Date Fieldsโ€‹

To parse the dates correctly, we need to combine the three fields. We can use ArrayMap to add another field which is the concatenation of all three.

- <newdata> = <data>
| ArrayMap (In <> Set: 'FullDate' To: $"{<>['Time']} {<>['Date']} {<>['TimeZone']}")

Here we use ArrayMap which maps every entity in our array (FromCSV produces an Array of Entities) to a new value.

Here the new value is the same entity but with an additional 'FullDate' field added which is the values of the three other fields concatenated together using an interpolated string.

The Schemaโ€‹

In order to convert this date string to an actual date we need to create a schema to tell the Transform step what the data is supposed to look like.

- <schema> = (
'type': 'object'
'additionalProperties': false
'properties': (
'Location': ('type': 'string')
'FullDate': ('type': 'string', 'format': 'date-time')
)
'required': ['FullDate', 'Location']
)

To explain the schema:

  • 'type': 'object' tells the Transform step to expect an entity.
  • 'additionalProperties': False will remove all the properties which are not listed in the schema. This is how we get rid of the now redundant 'Date', 'Time', and 'TimeZone' fields.
  • 'properties': is the list of properties to expect and their types.
  • 'Location': ('type': 'string') the location field is just a plain string.
  • 'FullDate': ('type': 'string', 'format':'date-time') The date field is also a string but the 'format':'date-time' tells the Transform step to convert it into a DateTime.
  • 'required': ['Date', 'Location' ] tells the Transform step to give an error or warning if any of these fields are missing

The Date Formatsโ€‹

- <dateFormats> = ['HH:mm:ss MMMM, dd yyyy zzz', 'HH:mm:ss dd/MM/yyyy zzz']

For the Transform step to correctly parse the dates, we need to tell it what date formats to expect. In this case we give it an array with the two different formats present in the data. It will try the first one first, then the second, and if that is unsucessful it will give you an error message.

If we had just one date format we could give just that string rather than an array with one element.

If we had several full date columns with a different format per column we could give an entity mapping column names to formats but that is not required in this case.

SCL is run inside C# and uses the C# Date parsing features and format specifiers. See this document which explains how those work and how to use them but briefly:

  • HH:mm:ss means hours, minutes, and seconds separated by colons
  • MMMM is the full word name for the month while 'MM' is the two digit representation
  • dd and yyyy are day number and year respectively
  • zzz is the timezone hours and minutes offset from UTC. The timezone specifiers can be quite fiddly and you may have to use multiple formats to get it right (e.g. if the timezone is omitted or just '0' for UTC but another format everywhere else)

The Transform Stepโ€‹

Now that we have our data, our schema, and our date formats we can run the Transform step.

<result> = <newdata> | Transform Schema: <schema> DateInputFormats: <dateFormats>

The Transform step has a lot of optional arguments but apart from DateInputFormats the defaults are good enough for us in this case.

We could have set

  • ErrorBehavior to control what happens if a transformation is not possible. By default you get an error but you could get a warning or ignore it entirely. This also lets you control whether or not to output the entity which caused the error.
  • ArrayDelimiters lets you specify which characters can be array delimiters. This way you could transform strings like 'alpha|beta|gamma' into arrays.
  • BooleanTrueFormats, BooleanFalseFormats, and NullFormats allow you to specify which strings can be converted to True, False, and Null respectively. As with the DateInputFormats you can do this on a column by column basis if needed.
  • CaseSensitive allows you to specify whether or not the above formats are case sensitive. This is false by default.
  • RoundingPrecision lets you control how close a floating point number needs to be to an integer to be rounded to it.

Exporting the Resultsโ€‹

We can output the data also as a CSV, but this time with tabs as the separator.

<result> | ToCSV DateTimeFormat: 'yyyy-MM-dd HH:mm:ss' Delimiter: "\t"

DateTimeFormat lets us specify how the DateTime is formatted in the output.

caution

The delimiter \t must be in double quotes. Single quotes are interpreted as a literal string.

Final Outputโ€‹

In the final output we see that all dates are now the same, which is correct:

Location    FullDate
UK 01:23:45 2000-01-01
Hawaii 01:23:45 2000-01-01
France 01:23:45 2000-01-01
India 01:23:45 2000-01-01
Iceland 01:23:45 2000-01-01
Hawaii 01:23:45 2000-01-01
Germany 01:23:45 2000-01-01
Sri Lanka 01:23:45 2000-01-01

Final SCLโ€‹

- <schema> = (
'type': 'object'
'additionalProperties': false
'properties': (
'Location': ('type': 'string')
'FullDate': ('type': 'string', 'format': 'date-time')
)
'required': ['FullDate', 'Location']
)

- <dateFormats> = ['HH:mm:ss MMMM, dd yyyy zzz', 'HH:mm:ss dd/MM/yyyy zzz']

- <data>
| FromCSV
| ArrayMap (In <> Set: 'FullDate' To: $"{<>['Time']} {<>['Date']} {<>['TimeZone']}" )
| Transform Schema: <schema> DateInputFormats: <dateFormats>
| ToCSV DateTimeFormat: 'yyyy-MM-dd HH:mm:ss' Delimiter: "\t"

Try on the Playgroundโ€‹

You can try this SCL on the playground, just copy, paste, and run.

- <data> = """
"Date","Time","TimeZone","Location"
"January, 01 2000","01:23:45","+00:00","UK"
"December, 31 1999","15:23:45","-10:00","Hawaii"
"January, 01 2000","02:23:45","+01:00","France"
"January, 01 2000","06:53:45","+05:30","India"
"01/01/2000","01:23:45","+00:00","Iceland"
"31/12/1999","15:23:45","-10:00","Hawaii"
"01/01/2000","02:23:45","+01:00","Germany"
"01/01/2000","06:53:45","+05:30","Sri Lanka"
"""

- <schema> = (
'type': 'object'
'additionalProperties': false
'properties': (
'Location': ('type': 'string')
'FullDate': ('type': 'string', 'format': 'date-time')
)
'required': ['FullDate', 'Location']
)

- <dateFormats> = ['HH:mm:ss MMMM, dd yyyy zzz', 'HH:mm:ss dd/MM/yyyy zzz']

- <data>
| FromCSV
| ArrayMap (In <> Set: 'FullDate' To: $"{<>['Time']} {<>['Date']} {<>['TimeZone']}" )
| Transform Schema: <schema> DateInputFormats: <dateFormats>
| ToCSV DateTimeFormat: 'yyyy-MM-dd HH:mm:ss' Delimiter: "\t"
| Print