Using Preprocessors in the DataMapper

Preprocessors are an extremely powerful feature of the DataMapper. They allow you to make changes to the input data file before anything else happens – even before the file settings are applied and the boundaries are set!

Using preprocessors, you can do things such as:

  • Convert the data to a different format
  • Filter out or replace characters or entire lines
  • Add data to the file

In this how-to, we'll use a preprocessor to transform a CSV file with a variable number of fields per line into something the DataMapper can actually process. CSV files are expected to have the same number of fields on each line, but some files mix several different types of lines, each with its own number of fields. The DataMapper cannot handle such files natively because it expects to be able to refer to all cells in all records in a consistent manner.

Here’s what the actual data looks like:

1,Invoice
2,INV0001,CUS9876,2021-10-03,228.77
3,Some item,2,1,24.95,49.90
3,Some other item,3,1,41.49,124.47
3,Yet another item,1,1,54.40,54.40
1,Invoice
2,INV0002,CUS5432,2021-10-03,
3,Some item,1,1,24.95,24.95
3,Yet another item,2,1,54.40,108.80

In our data sample, we can see there are 3 types of lines:

  • Lines starting with a 1 indicate the start of a new record
  • Lines starting with a 2 contain invoice information
  • Lines starting with a 3 contain the detail items for each invoice

Now if you try to open this CSV file with the DataMapper, you won’t get anything usable. But if we could transform the file so that each line has the same number of fields, the DataMapper would be able to open it and we could then set the boundaries whenever we find a 1 in the first column.

So the idea is simple: we need all lines to contain the same number of fields. To achieve that, we’ll first inspect all lines in the file and we’ll record which one has the largest number of fields. We will then go back to the beginning and create a copy of each line in a new file, adding as many empty fields as we need so that they all have as many fields as the line with the most fields in the file.

Putting it together

First, let’s get set up properly:

  1. Copy the above data into your favorite text editor and save the file as data.txt.
  2. Create a new data mapping configuration, selecting CSV as the data source.
  3. Pick the file you just saved and press Finish.

At this stage, the Tabular viewer pane displays NO DATA because the DataMapper is unable to make heads or tails of the file. But we’ll be changing that next.

  1. Switch to the Steps pane and click on the Preprocessor step (the first one).
  2. In the step’s properties, under the Preprocessor section, click on the Add button.
  3. Give the Preprocessor a descriptive name, like Convert to CSV.
  4. In the Preprocessor definition field, copy/paste the following code:
var maxFields = 0;
var oneLine;

// First pass: determine largest number of fields 
var inFile = openTextReader(data.filename);
while(oneLine = inFile.readLine()){
  maxFields = Math.max(oneLine.split(",").length, maxFields);
}
inFile.close();

// Second pass: pad each line with empty fields as required
inFile = openTextReader(data.filename);
var outFile = openTextWriter(data.filename + ".csv");
while(oneLine = inFile.readLine()){
  outFile.write(oneLine + ",".repeat(maxFields - oneLine.split(",").length) + "\n");
}
inFile.close();
outFile.close();
// Overwrite the original data file with the padded copy
copyFile(data.filename + ".csv", data.filename);

Explanation of the code

The maxFields variable is used to store the largest number of fields found in the file. This is done in the First pass loop, when the file is first opened and read through. For each line read, the code uses the split() method to determine how many comma-delimited chunks the line contains, which gives us the number of fields for that line. We then use the Math.max() method to determine whether this number is higher than what we already have in the maxFields variable, and if it is, we store it as our new maximum number of fields so far. This operation is repeated until all the lines in the file have been read.
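
For instance, counting the fields on the first two lines of our sample by hand (a quick illustration, not part of the preprocessor itself):

var a = "1,Invoice".split(",").length;                            // 2 fields
var b = "2,INV0001,CUS9876,2021-10-03,228.77".split(",").length;  // 5 fields
var newMax = Math.max(a, b);                                      // 5, the largest so far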

The Second pass loop then starts, again reading all lines, but this time writing each line to an output file, optionally adding as many empty fields as required by using the ",".repeat(...) statement with the number of fields to add to this line (obtained by subtracting the line's field count from the maxFields variable). A line feed is also added at the end of each line to make sure we don't simply append content to the previous line we wrote.
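
In our sample, the type 3 detail lines have six fields, so maxFields ends up at 6 and the first record is rewritten like this:

1,Invoice,,,,
2,INV0001,CUS9876,2021-10-03,228.77,
3,Some item,2,1,24.95,49.90
3,Some other item,3,1,41.49,124.47
3,Yet another item,1,1,54.40,54.40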

Converting and using the data

Now that we have everything ready, we can move on to the final steps.

  1. Make sure the tick box next to the preprocessor name is checked, then click the Apply button.
  2. Double-check that the Tabular viewer displays the data. If you followed each step, you should get some data displayed, but the headers are all wrong. That’s because by default, the DataMapper assumes the first line of the file contains the column headers.
  3. Click back on the Settings tab and untick the First row contains field names option. This immediately fixes the issue.
  4. Now that we have a proper CSV, we can set the boundaries with a Trigger set to On field value, selecting the COLUMN1 field and setting the expression to 1.

That’s it, the transformation is complete and the data is now as easy to process as any standard CSV file.

Notes

At design time, preprocessor scripts do not run automatically each time you open a new data file. You have to explicitly select them and run them for the changes to be applied to the data file.

Also, to simplify this article and focus specifically on demonstrating the use of the preprocessor feature in the DataMapper, the code above implements a very basic approach to parsing CSV lines. It wouldn’t work properly if some of the fields were double-quoted and contained commas (e.g. "Springfield, Missouri").
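
To see why, consider how the naive split() miscounts such a line (illustrative only):

'"Springfield, Missouri",2021'.split(",").length   // 3, but the line only has 2 fields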

For a more robust implementation of CSV parsing, use the following code instead:

// Thanks to awwsmm for this clever RegEx
// https://gist.github.com/awwsmm/886ac0ce0cef517ad7092915f708175f
var re = new RegExp('(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))','g');

var maxFields = 0;

var oneLine;
// First pass: determine the largest number of fields
var inFile = openTextReader(data.filename);
while(oneLine = inFile.readLine()){
  maxFields = Math.max(getNumberOfFields(oneLine,re), maxFields);
}
inFile.close();

// Second pass: pad each line with empty fields as required
inFile = openTextReader(data.filename);
var outFile = openTextWriter(data.filename+".csv");
while(oneLine = inFile.readLine()){
  outFile.write(oneLine + ",".repeat(maxFields-getNumberOfFields(oneLine,re))+"\n");
}
inFile.close();
outFile.close();
copyFile(data.filename+".csv",data.filename);

function getNumberOfFields(str,regex){
  var match, counter=0;
  while(match = regex.exec(str)) counter++;
  return counter;
}

The overall logic is identical, so the same explanations apply, except that in this more elaborate code sample we are using a regular expression to accurately determine the number of fields on each line. For a thorough explanation of how it works, visit the URL referenced at the top of the code.
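
As a quick sanity check (illustrative only, reusing the re and getNumberOfFields defined above):

getNumberOfFields('"Springfield, Missouri",2021', re);   // 2, the quoted comma is no longer counted
getNumberOfFields('1,Invoice', re);                      // 2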
