
Performance Improvements in v0.18.0

Mark Wainwright · 12 min read

The focus for Sequence release v0.18.0 has been on performance. This post explains the changes we've made, how they've improved performance, and why we decided to implement them. This is aimed at both SCL users and C# developers who are interested in performance.

What is SCL?

Sequence Configuration Language (SCL) is a programming language for forensic and eDiscovery technicians. The typical use case involves reading, writing, and manipulating large amounts of data, so performance is critical. The language is implemented in C#, so performance improvements are achieved by optimizing C# code.

Measuring Performance

Before we improve performance we need to measure it. This means benchmarking. I created a new console project and added the excellent BenchmarkDotNet package to it.
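
For anyone unfamiliar with it, a BenchmarkDotNet benchmark is just a class with attributed methods; the runner handles warmup, iteration counts, and statistics. Here is a minimal sketch - the class and workload below are illustrative stand-ins, not the actual Sequence benchmark project:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // report heap allocations alongside timings
public class SclBenchmarks
{
    private string _text = string.Empty;

    [GlobalSetup]
    public void Setup()
    {
        // Illustrative setup; a real benchmark would prepare an SCL sequence here.
        _text = new string('a', 1_000_000);
    }

    [Benchmark]
    public int CountCharacters()
    {
        // Stand-in workload so the sketch is self-contained.
        var count = 0;
        foreach (var c in _text)
            if (c == 'a')
                count++;
        return count;
    }
}

public static class Program
{
    // Benchmarks must be run in a Release build to give meaningful numbers.
    public static void Main() => BenchmarkRunner.Run<SclBenchmarks>();
}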

As we want to optimize real-world performance, I found some real-world data to run the benchmarks on. This data is a 1MB concordance file (very similar to a CSV) with 1000 rows and 44 columns. The data comes from the Enron dataset, but I was still very careful to add it to .gitignore and not commit it to the open-source repository.

I wrote several SCL scripts representing different workflows. For example, this script removes a column from the data:

- <path> = PathCombine ['Data', 'TestData.dat']
- FileRead <path>
| FromConcordance
| EntityMap (EntityRemoveProperty <> Property: "Custodian" )
| ToConcordance
| StringLength

Note the StringLength at the end. This is to ensure that the stream of concordance data is read to the end (and therefore all of the data is converted to concordance). In a real-world scenario one would probably write the data to a file or send it across the internet, but doing that would make the benchmarks a lot more fiddly, and the code that writes to a file is outside my control anyway, so there is not much point in benchmarking it.

Benchmarks

Here is a summary of all the benchmarks and the initial results, running on a 12th gen Intel i7 CPU on Windows.

| Name | Summary | Mean / μs | Error / μs | Standard Deviation / μs |
| --- | --- | --- | --- | --- |
| BasicSCL | Prints 'Hello World' | 68.11 | 1.327 | 1.363 |
| CombineColumns | Reads a concordance file, adds a new column by combining two existing ones, and converts it back to concordance | 306,342.04 | 2,856.823 | 2,385.576 |
| ConvertDates | Reads a concordance file, converts dates to a different representation, and converts it back to concordance | 121,637.43 | 1,641.555 | 1,535.512 |
| DatToDat | Reads a concordance file and converts it back to concordance | 92,824.18 | 810.059 | 718.096 |
| DatToJson | Reads a concordance file and converts it to JSON | 86,831.86 | 1,723.177 | 2,471.330 |
| JsonToDat | Reads a JSON file and converts it to concordance | 74,657.57 | 1,249.805 | 1,169.068 |
| RemapColumns | Reads a concordance file, renames one of the columns, and converts it back to concordance | 98,585.92 | 1,902.414 | 2,035.561 |
| RemoveColumn | Reads a concordance file, removes one of the columns, and converts it back to concordance | 149,538.83 | 1,581.342 | 1,479.188 |
| SchemaValidate | Reads a concordance file, confirms that all rows conform to a particular JSON schema, and converts it back to concordance | 133,736.27 | 793.008 | 702.981 |
| StringReplace | Reads a concordance file, performs some string replacements on each row, and converts it to JSON | 335,875.11 | 4,519.005 | 4,227.080 |

Note that all times are in microseconds, so the longest sequences are taking about one third of a second.

Performance Improvements

Replacing Task with ValueTask

SCL steps all have a Run method which returns the result wrapped in a Task. One of the performance optimizations I was eager to try was wrapping the result in a ValueTask instead. This should reduce the number of heap allocations per step by one, and I was expecting a small performance improvement for sequences which run cheap steps inside a loop.

The tradeoff is that there are some things you can do with a Task but not with a ValueTask, such as awaiting it multiple times. Fortunately, the Run method was only being used in a few places, so I was able to easily check that none of the missing functionality was being relied on.
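
As a simplified sketch of the shape of the change (IStateMonad is reduced to an empty placeholder here, and the real Run method wraps its result in a Result<T, IError>):

using System.Threading;
using System.Threading.Tasks;

// Simplified stand-in for the real SCL state type.
public interface IStateMonad { }

public interface IRunnableStep<T>
{
    // Before: Task<T> Run(IStateMonad stateMonad, CancellationToken cancellationToken);
    // After: a step that completes synchronously no longer allocates a Task object.
    ValueTask<T> Run(IStateMonad stateMonad, CancellationToken cancellationToken);
}

public static class ValueTaskCaveat
{
    public static async Task<int> ConsumeAsync(IRunnableStep<int> step, IStateMonad state)
    {
        var pending = step.Run(state, CancellationToken.None);

        var result = await pending;   // fine: awaited exactly once
        // var again = await pending; // not allowed: a ValueTask must not be awaited twice

        return result;
    }
}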

The results of this update:

| Name | Initial / μs | After ValueTask / μs | Ratio |
| --- | --- | --- | --- |
| BasicSCL | 68.11 | 64.56 | 0.947878432 |
| CombineColumns | 306,342.04 | 306,125.29 | 0.999292458 |
| ConvertDates | 121,637.43 | 120,553.24 | 0.991086707 |
| DatToDat | 92,824.18 | 93,733.29 | 1.009793892 |
| DatToJson | 86,831.86 | 86,622.87 | 0.997593165 |
| JsonToDat | 74,657.57 | 74,910.93 | 1.003393628 |
| RemapColumns | 98,585.92 | 98,365.44 | 0.997763575 |
| RemoveColumn | 149,538.83 | 148,584.52 | 0.993618313 |
| SchemaValidate | 133,736.27 | 131,679.90 | 0.984623693 |
| StringReplace | 335,875.11 | 328,927.93 | 0.979316181 |

Finding a Bug

A 2% performance improvement on the slowest benchmark was nice but I was actually hoping for a larger improvement. The StringReplace sequence was taking about three times as long as the ConvertDates sequence despite doing what seemed like a lot less work.

I thought that the reason might be that StringReplace converts the data at the end into JSON rather than concordance like most of the other benchmarks. I chose to do this initially because I wanted to make sure that performing an action on the data first didn't make the conversion to JSON slower. At this stage I wrote additional benchmarks to check if that was what was happening. It wasn't - converting to JSON was consistently ~6000μs faster than converting to concordance across the board.

Thinking again, I thought the difference might be down to the fixed cost of running a step. ConvertDates uses the Transform step, which applies a JSON schema to every entity in an entity stream, whereas StringReplace uses the ArrayMap step to call the StringReplace step on every entity in the stream. This means that the StringReplace sequence is running about 1000 very cheap steps whereas ConvertDates is calling one very expensive step.

I decided to use the Visual Studio performance profiler to try to find out what was going on. I started by using the CPU usage tool, as I thought the CPU was the most likely performance bottleneck.

I discovered that a lot of time was being spent in the IRunnableStep<T>.Run method. This was not unexpected, but what was unexpected was that almost three times as long was being spent there as in the CompoundStep<T>.Run method, despite IRunnableStep<T>.Run having been designed as a thin wrapper around that method which exists purely to enable logging.

Looking at the method code I immediately spotted the problem:

async Task<Result<T, IError>> IRunnableStep<T>.Run(IStateMonad stateMonad, CancellationToken cancellationToken)
{
    using (stateMonad.Logger.BeginScope(Name))
    {
        object[] GetEnterStepArgs()
        {
            var properties = AllProperties
                .ToDictionary(x => x.Name, x => x.Serialize(SerializeOptions.SanitizedName));

            return new object[] { Name, properties };
        }

        LogSituation.EnterStep.Log(stateMonad, this, GetEnterStepArgs());

        var result = await Run(stateMonad, cancellationToken);

        if (result.IsFailure)
        {
            LogSituation.ExitStepFailure.Log(stateMonad, this, Name, result.Error.AsString);
        }
        else
        {
            var resultValue = result.Value.Serialize(SerializeOptions.SanitizedName);

            LogSituation.ExitStepSuccess.Log(stateMonad, this, Name, resultValue);
        }

        return result;
    }
}

The EnterStep arguments are being created every time the step is run - building an expensive dictionary in the process - regardless of the logging level (they should only be logged if the log level is Trace, and the benchmarks use Info). Clearly what happened was that the LogSituation.EnterStep.Log method once took a Func<object[]> but was at some point refactored to take an object[], and I missed the performance implications when updating this invocation.
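
The general shape of the fix is to build the arguments only when they will actually be logged. Here is a sketch of that pattern using plain Microsoft.Extensions.Logging rather than the real LogSituation API, so the names below are illustrative:

using System.Collections.Generic;
using Microsoft.Extensions.Logging;

public static class EnterStepLogging
{
    public static void LogEnterStep(
        ILogger logger,
        string stepName,
        IReadOnlyCollection<(string Name, string Value)> properties)
    {
        // Only pay for the dictionary when Trace logging is actually enabled.
        if (!logger.IsEnabled(LogLevel.Trace))
            return;

        var args = new Dictionary<string, string>();

        foreach (var (name, value) in properties)
            args[name] = value;

        logger.LogTrace("Entering step {StepName} with {Properties}", stepName, args);
    }
}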

I fixed the code and also checked other invocations of the method to make sure I wasn't making the same mistake somewhere else (I wasn't). Then I ran the benchmarks again:

| Name | After ValueTask / μs | After EnterStep Fix / μs | Ratio |
| --- | --- | --- | --- |
| BasicSCL | 64.56 | 35.39 | 0.548172243 |
| CombineColumns | 306,125.29 | 103,820.13 | 0.339142611 |
| ConvertDates | 120,553.24 | 119,108.53 | 0.988016 |
| DatToDat | 93,733.29 | 92,621.39 | 0.988137619 |
| DatToJson | 86,622.87 | 86,615.62 | 0.999916304 |
| JsonToDat | 74,910.93 | 77,218.89 | 1.030809389 |
| RemapColumns | 98,365.44 | 97,755.47 | 0.99379894 |
| RemoveColumn | 148,584.52 | 98,192.86 | 0.660855249 |
| SchemaValidate | 131,679.90 | 131,270.77 | 0.996892996 |
| StringReplace | 328,927.93 | 95,778.85 | 0.291184911 |

The three benchmarks which run steps in a loop all got a lot faster, and the BasicSCL sequence got almost twice as fast. This just goes to show that finding and fixing performance bugs is often massively more effective than micro-optimizing heap allocations.

Rearchitecting Entities

The other way to get big performance wins is to use the right collections in the right places. SCL has an Entity type which is used to represent data. It's a key-value or property-value container, similar to a JavaScript object. The most obvious way to represent this in C# was to use a Dictionary<string, object>, and that was essentially how the Entity type was implemented - though using the SCL-friendly types StringStream and ISCLObject instead. I wanted to try a different implementation using two ImmutableArrays. ImmutableArray is an extremely thin wrapper around an array and has essentially the same performance characteristics.

I refactored the Entity type to use an ImmutableArray<StringStream> for the property names and an ImmutableArray<ISCLObject> for the values. I made them private to ensure that they would always be kept in sync. I updated the implementations of all the entity methods but not their signatures. Keeping the interface the same meant that the refactor was very smooth, and I hardly had to change any code outside the Entity class.

Making this change also allowed me to sneak in a few other performance optimizations. When creating entities from a consistent data structure (such as a CSV file or a SQL table) I could let all the entities share the same headers collection. Note that this would have been a bad idea if I had used a regular array rather than an immutable one.
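
A rough sketch of the shape of this layout, with string and object standing in for StringStream and ISCLObject and only a lookup method shown (the real Entity class has a much larger API):

using System;
using System.Collections.Immutable;

public sealed class Entity
{
    // Kept private so the two arrays can never get out of sync.
    private readonly ImmutableArray<string> _propertyNames;
    private readonly ImmutableArray<object> _values;

    public Entity(ImmutableArray<string> propertyNames, ImmutableArray<object> values)
    {
        if (propertyNames.Length != values.Length)
            throw new ArgumentException("Property names and values must have the same length.");

        _propertyNames = propertyNames;
        _values = values;
    }

    // A linear scan over a small array is cheap and allocation-free.
    public bool TryGetValue(string propertyName, out object? value)
    {
        for (var i = 0; i < _propertyNames.Length; i++)
        {
            if (_propertyNames[i] == propertyName)
            {
                value = _values[i];
                return true;
            }
        }

        value = null;
        return false;
    }

    // Rows read from the same concordance/CSV file or SQL table can all share
    // one header array instead of each row allocating its own set of keys.
    public static Entity CreateRow(ImmutableArray<string> sharedHeaders, params object[] rowValues)
        => new(sharedHeaders, ImmutableArray.Create(rowValues));
}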

I ran the benchmarks again and compared to the previous results:

| Name | After EnterStep fix / μs | After Entity Refactor / μs | Change |
| --- | --- | --- | --- |
| BasicSCL | 35.39 | 35.81 | 1.011867759 |
| CombineColumns | 103,820.13 | 70,840.40 | 0.682337809 |
| ConvertDates | 119,108.53 | 62,156.69 | 0.521849191 |
| DatToDat | 92,621.39 | 61,528.62 | 0.66430249 |
| DatToJson | 86,615.62 | 41,558.64 | 0.479805375 |
| JsonToDat | 77,218.89 | 50,542.41 | 0.654534273 |
| RemapColumns | 97,755.47 | 60,397.03 | 0.617837856 |
| RemoveColumn | 98,192.86 | 62,817.56 | 0.639736535 |
| SchemaValidate | 131,270.77 | 104,101.35 | 0.793027648 |
| StringReplace | 95,778.85 | 52,792.74 | 0.551194131 |

This change almost ❗ doubled ❗ performance across the board (note that the BasicSCL sequence does not use entities). I was very happy.

The table and charts below show the complete journey. Some of the sequences were now several times as fast, and this difference would likely be magnified in real-world scenarios that perform multiple transformations, since the costs of reading the file and converting between data formats are fixed.

| Name | Initial / μs | After ValueTask / μs | Change | After EnterStep fix / μs | Change | After Entity Refactor / μs | Change | Total Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BasicSCL | 68.11 | 64.56 | 0.947878432 | 35.39 | 0.548172243 | 35.81 | 1.011867759 | 0.525767141 |
| CombineColumns | 306,342.04 | 306,125.29 | 0.999292458 | 103,820.13 | 0.339142611 | 70,840.40 | 0.682337809 | 0.231246093 |
| ConvertDates | 121,637.43 | 120,553.24 | 0.991086707 | 119,108.53 | 0.988016 | 62,156.69 | 0.521849191 | 0.510999698 |
| DatToDat | 92,824.18 | 93,733.29 | 1.009793892 | 92,621.39 | 0.988137619 | 61,528.62 | 0.66430249 | 0.66285121 |
| DatToJson | 86,831.86 | 86,622.87 | 0.997593165 | 86,615.62 | 0.999916304 | 41,558.64 | 0.479805375 | 0.478610501 |
| JsonToDat | 74,657.57 | 74,910.93 | 1.003393628 | 77,218.89 | 1.030809389 | 50,542.41 | 0.654534273 | 0.676989755 |
| RemapColumns | 98,585.92 | 98,365.44 | 0.997763575 | 97,755.47 | 0.99379894 | 60,397.03 | 0.617837856 | 0.612633427 |
| RemoveColumn | 149,538.83 | 148,584.52 | 0.993618313 | 98,192.86 | 0.660855249 | 62,817.56 | 0.639736535 | 0.420075241 |
| SchemaValidate | 133,736.27 | 131,679.90 | 0.984623693 | 131,270.77 | 0.996892996 | 104,101.35 | 0.793027648 | 0.778407757 |
| StringReplace | 335,875.11 | 328,927.93 | 0.979316181 | 95,778.85 | 0.291184911 | 52,792.74 | 0.551194131 | 0.157179673 |

[Chart: Performance Improvement Benchmarks]

Takeaways

The top takeaways from this project are:

  • Fixing bugs is often the best way to improve performance
  • Dictionaries are slow - plain old arrays are often faster, even when a Dictionary or HashMap is the "theoretically" correct choice (see the sketch after this list).
  • Refactoring can be easier than you expect, especially if the code you are changing is well insulated and the API to that code remains consistent.
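
On the dictionaries point, one way to see why arrays can beat dictionaries for small, uniform records is to compare the cost of building each structure per row: a fresh Dictionary hashes every key and allocates buckets and entries, whereas rows that share one immutable header array only pay for their values. A hypothetical micro-benchmark along these lines (not part of the Sequence repository) makes the difference easy to measure:

using System;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.Linq;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class EntityConstructionBenchmarks
{
    private string[] _headers = Array.Empty<string>();
    private object[] _row = Array.Empty<object>();
    private ImmutableArray<string> _sharedHeaders;

    [GlobalSetup]
    public void Setup()
    {
        // 44 columns, matching the shape of the test data.
        _headers = Enumerable.Range(0, 44).Select(i => $"Column{i}").ToArray();
        _row = Enumerable.Range(0, 44).Select(i => (object)$"Value{i}").ToArray();
        _sharedHeaders = ImmutableArray.Create(_headers);
    }

    [Benchmark(Baseline = true)]
    public Dictionary<string, object> DictionaryPerRow()
    {
        // Hashes every key and allocates buckets and entries for every single row.
        var dictionary = new Dictionary<string, object>(_headers.Length);
        for (var i = 0; i < _headers.Length; i++)
            dictionary[_headers[i]] = _row[i];
        return dictionary;
    }

    [Benchmark]
    public (ImmutableArray<string> Headers, ImmutableArray<object> Values) SharedHeadersPerRow()
    {
        // The header array was built once in Setup and is shared by every row;
        // only the values are copied per row.
        return (_sharedHeaders, ImmutableArray.Create(_row));
    }
}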

One final thing - even though the ValueTask update only showed a very small improvement initially, that improvement was orthogonal to the other improvements. It shaved about 5ms off the initial StringReplace benchmark, and once I had completed the other improvements I tried reverting just the ValueTask changes and running the benchmarks again. StringReplace was now about 5ms slower. Had I done this change last instead of first, it would have looked like an 8% improvement instead of a 2% one. This shows the value of small incremental improvements. Sometimes they don't just add up - they multiply!