Scripted boundaries for PDF

OL Learn Blog Automation

What do you do when you have a single PDF file that contains multiple variable length documents and the only distinctive mark is a barcode? Read on to find out!

The challenge

Earlier this week there was a post in one of our user forums that I found quite intriguing: the user had a PDF file that contains several documents of variable length and he wanted to be able to determine the boundaries of each document within the PDF. So far, so good, that’s the bread and butter of the DataMapper.

However, there was a catch: there is no distinctive information that marks the beginning or the end of a document other than a datamatrix barcode on the first page of each document in the file. Fascinating, I thought, in my best Spock impersonation. The DataMapper is definitely not able to read the contents of a barcode, but the Workflow task named Barcode Scan can, so it should be relatively simple to link the two in order to process that file.

Another user had already mentioned the Barcode Scan task, and was proposing that the PDF file be split into as many documents as that task's results call for, so that each chunk of the original file could then be processed independently. That made perfect sense but of course, me being me, I thought: I’m sure I can do better and handle the original file in a single pass, without having to split it first.

Please don’t judge me… 😜

The journey to discovery

I started tinkering with my idea while watching a hopeless hockey game in which my favorite local team was getting trounced again. Soon, this little project became much more interesting than the hockey game and I finally came up with a pretty clever solution. Well at least in my mind, it was pretty clever!

So I started writing up a response to the forum post, taking screenshots and explaining in detail what I did. I was eager to publish it, because I derive a lot of satisfaction from being able to help our users with these kinds of challenges. But then, at some point, I noticed a little something that should have jumped out at me from the start: the question had been asked in the PlanetPress Classic forum. And PlanetPress Classic does not have a DataMapper module. Which means that the other user’s proposal of splitting the PDF was in fact the best solution.

Gulp.

After kicking myself in the behind many times, just like what was being done to my favorite hockey team that night, I figured that my solution was still clever enough and that it might help OL Connect users facing the same kind of challenge, which led me to write this very article.

The Workflow part

All right. So we have a PDF file in which some pages have a datamatrix barcode. This indicates that a new document is starting. Handling that part is relatively easy in Workflow: capture the file and use the Barcode Scan task to generate metadata. The Workflow process would look something like this:

In the Barcode Scan task, I’m selecting only the datamatrix barcode option and setting the Process by dropdown to File, so that the task examines every single page in the original PDF. Once the task has run, it produces metadata that contains a single Document, and as many DataPage nodes as there are pages in the PDF. If a datamatrix barcode is not detected on a particular page, the metadata for that data page contains a single field named BarcodeCount, with a value of 0. Otherwise, BarcodeCount contains a positive number and additional fields are added to the metadata as well. Of those, the only one we need for the purpose of this article is BarcodeValue.

I then use a script to extract the index of each page where a barcode was found, and I store it in a JSON structure which is itself stored in a process variable that I can pass later on to my DataMapper as a runtime parameter.

Here’s the script:

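// Load the metadata produced by the Barcode Scan task
var myMeta = new ActiveXObject("MetadataLib.MetaFile");
myMeta.LoadFromFile(Watch.GetMetadataFilename());
var docs = [];
var doc = myMeta.Job().Group(0).Document(0);
// Each DataPage node corresponds to one page of the PDF
for(var i=0;i<doc.Count;i++){
  var onePage = doc.Datapage(i);
  // Keep only the pages that carry a barcode value
  if(onePage.FieldByName("BarcodeValue")){
    docs.push({page:i,value:onePage.FieldByName("BarcodeValue")});
  }
}
// Store the array as a JSON string in the DocArray process variable
Watch.SetVariable("DocArray",JSON.stringify(docs));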

The script goes through the entire metadata and, whenever a barcode is present on a page, extracts that page's index. It stores the index (along with the barcode’s value, for good measure) in a simple object that gets added to an array. In the end, that array will look something like this:

[
  {"page":0,"value":"INV2195569"},
  {"page":3,"value":"INV8588592"},
  {"page":4,"value":"INV8832571"},
  {"page":7,"value":"INV3829483"}
]

This array gets stored in a process variable named DocArray and it is that variable that gets passed as a runtime parameter to the DataMapper configuration that will actually be handling the PDF.
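To make the meaning of that array more concrete, here is a minimal sketch in plain JavaScript that turns the list of start pages into page ranges, one per document. It is only an illustration: the toPageRanges helper and the total page count of 9 are assumptions made for this example, and nothing in the actual solution requires this step.

```javascript
// Hypothetical helper (not part of Workflow or the DataMapper):
// converts the list of start pages into one page range per document.
// totalPages is assumed to be known; 9 is an invented value for this example.
function toPageRanges(docs, totalPages) {
  var ranges = [];
  for (var i = 0; i < docs.length; i++) {
    var first = docs[i].page;
    // A document ends right before the next one starts, or on the last page.
    var next = (i + 1 < docs.length) ? docs[i + 1].page : totalPages;
    var last = next - 1;
    ranges.push({ value: docs[i].value, first: first, last: last, pages: last - first + 1 });
  }
  return ranges;
}

var docArray = [
  {"page":0,"value":"INV2195569"},
  {"page":3,"value":"INV8588592"},
  {"page":4,"value":"INV8832571"},
  {"page":7,"value":"INV3829483"}
];

var ranges = toPageRanges(docArray, 9);
for (var r = 0; r < ranges.length; r++) {
  console.log(ranges[r].value + ": pages " + ranges[r].first + " to " + ranges[r].last);
}
```

With those assumed 9 pages, the four documents cover the zero-based pages 0 to 2, 3 to 3, 4 to 6 and 7 to 8, which is exactly the kind of variable-length split the boundaries need to reproduce.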

The DataMapper bit

The DataMapper config now receives two pieces of info: the original data file itself, and a runtime parameter containing an array of objects that specify on which page each individual document starts in the PDF.

To set the boundaries according to the values in that array, we’ll need a script. That means the Trigger option in the DataMapper’s Boundaries settings has to be changed from On Page to On Script. The script itself is deceptively simple:

var docs = JSON.parse(automation.parameters.docArray);
if(docs.find(function(e){return e.page===(boundaries.currentDelim-1)})){
    boundaries.set();
}

If you aren’t familiar with scripted boundaries, this script is run whenever a natural boundary is encountered in the data. In a PDF file, the natural boundary is a page, which means this script gets executed for every page in the PDF, before the data mapping process itself starts.

So for each page, the script does the following:

  • Parse the runtime parameter and store it in a local array (docs)
  • Use the array.find() method to determine if the current PDF page (boundaries.currentDelim) is present in the docs array (notice we are subtracting 1 because in the DataMapper, the first page’s index is 1, not 0)
  • If the index is found in the array, then a new boundary is set

When executing the data mapping process, we now have the proper number of variable length documents and we can start doing whatever needs to be done for each document.
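If you want to see the per-page logic in action outside the DataMapper, the following sketch runs the same test in plain JavaScript. The automation and boundaries objects here are simple stand-ins written for this example, not the real DataMapper objects, and the page count of 9 is an assumption.

```javascript
// Simulation only: "automation" and "boundaries" below are mock objects
// that imitate just enough of the DataMapper to exercise the script's logic.
function simulateBoundaries(docArrayJson, totalPages) {
  var automation = { parameters: { docArray: docArrayJson } };
  var startPages = [];
  var boundaries = {
    currentDelim: 0,
    // The mock records the 1-based page on which a boundary is set.
    set: function () { startPages.push(this.currentDelim); }
  };

  // The DataMapper calls the boundary script once per page (1-based).
  for (var page = 1; page <= totalPages; page++) {
    boundaries.currentDelim = page;

    // Same logic as the boundary script shown earlier
    var docs = JSON.parse(automation.parameters.docArray);
    if (docs.find(function (e) { return e.page === (boundaries.currentDelim - 1); })) {
      boundaries.set();
    }
  }
  return startPages;
}

var json = '[{"page":0,"value":"INV2195569"},{"page":3,"value":"INV8588592"},' +
           '{"page":4,"value":"INV8832571"},{"page":7,"value":"INV3829483"}]';
var starts = simulateBoundaries(json, 9);
console.log(starts.join(",")); // prints "1,4,5,8"
```

The boundaries land on the 1-based pages 1, 4, 5 and 8, which correspond to the zero-based indices 0, 3, 4 and 7 stored by the Workflow script.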

Conclusion

This technique of passing the boundaries inside a runtime parameter can be used in other instances; it doesn’t necessarily have to be a PDF with barcodes. For instance, it could be used with a CSV file where the runtime parameter would let the DataMapper know how many CSV lines per document it should process.
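As a rough sketch of that CSV idea, the snippet below converts a list of line counts into the record numbers at which each document starts, then answers the "is this a boundary?" question for a given record. Everything in it is hypothetical: the lineCounts values, the helper names, and the assumption that each CSV data line counts as one 1-based delimiter (with any header row ignored). It is plain JavaScript meant to show the arithmetic, not a ready-made boundary script.

```javascript
// Hypothetical runtime parameter content: number of CSV lines in each document.
var lineCounts = [2, 1, 3];

// Convert the counts into the 1-based line number on which each document starts.
function startsFromCounts(counts) {
  var starts = [];
  var next = 1;
  for (var i = 0; i < counts.length; i++) {
    starts.push(next);
    next += counts[i];
  }
  return starts;
}

var csvStarts = startsFromCounts(lineCounts);

// Hypothetical check a boundary script could perform for the current line.
function isBoundary(currentLine) {
  return csvStarts.indexOf(currentLine) !== -1;
}

console.log(csvStarts.join(",")); // prints "1,3,4"
```

In this made-up case the documents start on lines 1, 3 and 4, so a boundary would be set on those lines and nowhere else.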

But even if you never use it, I didn’t want to let my oh-so-clever(!?!) solution go to waste, so you can at least use it as a training exercise!

Tagged in: barcode, boundaries, runtime parameters




All comments (1)

  • Erik

    Thanks, Phil, great article!