Stop doing that! – 2


Last time around, we looked at some bad practices commonly found in OL Workflow configurations. Today, we’re taking a look at some of my pet peeves with DataMapper configurations.

XML/JSON loops

XML and JSON are some of the most widely used data formats out there, so it makes sense to try and understand how to process them efficiently. In both cases, that means using a query language that allows the DataMapper to precisely target elements that need to be processed, while ignoring the others as if they simply don’t exist.

For XML, that query language is called XPATH, and if you start googling for examples of using it on an XML file, you may very well find it overwhelming.

Because it is.

However, you don’t need to master the entire XPATH language; you just need to focus on a few key methods that will make your life a whole lot easier and your data mapping processes lightning fast.

Consider this example: your company is a wholesaler and you’ve been tasked with extracting specific data from invoices. In this instance, the challenge is to extract only the line items whose total value is above $10,000.

Here’s the kind of logic I see all the time in DM configs:

Seems perfectly valid, doesn’t it? We just loop on all line items, and we use a condition to check if the total is larger than 10,000. If it is, we extract the line item’s info. All good… right?

Nope. The problem with this logic is that the process examines each individual line item. In my sample data, I have 43 line items, but only 13 actually match the condition. That means I have 30 extraneous iterations that just slow down the processing for no reason. And if my data file contains thousands of invoices, each having as many (or more) line items… you do the math!

Here’s what each line item looks like in the XML file:

<ITEM>
  <Number>IPW8813950</Number>
  <Description>Coating: Plasma</Description>
  <UnitPrice>232.75</UnitPrice>
  <Ordered>6</Ordered>
  <Shipped>3</Shipped>
  <BackOrder>3</BackOrder>
  <Total>1396.50</Total>
</ITEM>

For the loop shown above, the Collection property is set to the following XPATH statement:

./CUSTOMER/INVOICE/ITEM

Which targets all ITEM elements, as the datamap indicators show:

But we can make a very simple change to that XPATH statement in order to target specific elements, based on the value of one of their children:

./CUSTOMER/INVOICE/ITEM[Total>10000]

And now, the resulting datamap:

That screenshot demonstrates how the loop now targets specific elements, which means we can do away with the condition inside the loop. We end up with this very elegant and efficient process:

And by the way, for a JSON file with a similar structure, the JsonPath syntax for the loop would be:

.ITEM[?(@.Total>10000)]
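
Once you know this filtering syntax, you can also combine criteria. For instance, if you only wanted the items above $10,000 that also have units on back order (that extra condition is purely an illustration, reusing the BackOrder element from the sample ITEM shown earlier), the XPATH statement would become:

./CUSTOMER/INVOICE/ITEM[Total>10000 and BackOrder>0]

and the JsonPath equivalent would look something like this (the exact filter syntax may vary slightly from one JsonPath engine to another):

.ITEM[?(@.Total>10000 && @.BackOrder>0)]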

As you can see, you don’t have to learn all of XPATH or JsonPath. Just learning the proper syntax for filtering elements will already be a huge improvement.

So if you are still using conditions as filters inside your loops… please stop doing that!

Text/PDF loops

Yes, loops again. There’s a reason why I have this fixation on loops: that’s where most of the time is spent in data mapping configurations. And any loop iteration that does nothing is a waste of time, plain and simple. We already saw how this can be improved in XML and JSON files; now let’s take a look at Text and PDF files. I’ll be using a text file as an example, but the same principles apply to PDFs.

The data here is an invoice that may – or may not – overflow onto multiple pages. The challenge is to extract all line items without having to examine every single line in the invoice.

Here’s what the data looks like for a single invoice (this is a bit small, but we’ll focus on more precise details later on):

This is typical of a text file that was designed specifically for use with pre-printed forms on dot matrix printers. If you don’t know what dot matrix printers are, count yourself lucky! They are to laser printers what rock and chisel were to papyrus and ink. But anyway… each page contains a header, and the line items just spill over to the next page.

Most of the time, this is processed with the same kind of logic as with XML files: loop through all lines, with a condition examining each line to determine whether or not it is a line item, while also double-checking whether we’ve reached the end of the record. Each line item looks like this:

So usually, the condition inside the loop checks if there is a numeric value at a certain position, or a decimal point, or perhaps a currency sign. But if you go back to the full view of the data above, you eventually realize that there are numbers and decimal points in the same horizontal position that do not belong to line items:

That forces you to tweak your condition inside the loop, sometimes requiring you to create a compound condition (i.e. one that checks for multiple matches).
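
Just to give you an idea of the shape this takes, here is that kind of compound check written out as plain JavaScript (the column positions and the "Subtotal" test are entirely made up for illustration; this is not an actual DataMapper condition, just the logic it ends up encoding):

// Invented illustration: the kind of compound test that gets evaluated on EVERY line
function looksLikeLineItem(line) {
  // a decimal amount sitting in the price columns...
  var hasAmount = /\d+\.\d\d/.test(line.substring(60, 75));
  // ...but not one of the summary lines that also happen to contain numbers
  var isSummary = line.indexOf("Subtotal") !== -1;
  return hasAmount && !isSummary;
}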

Unlike XML and JSON, text and PDF files have no query language that would allow you to target specific elements. But there are ways to emulate such a query language with… you guessed it… Regular Expressions!

If you look carefully at the data, you’ll see that each product ID uses a consistent format: 3 capital letters, followed by 7 digits. This can be easily formalized with the following regex:

[A-Z]{3}\d{7}

So if we could target all lines that match that regular expression, we could do away with our conditions. Unfortunately, in Text/PDF mode, the loop doesn’t have filtering options like the XML/JSON loops do, so you still have to specify that it should run until no more elements are found.

However, a text loop always includes a Goto step. In a standard loop, that Goto step usually targets the next line, which means the loop iterates through all lines. But instead of doing that, you can have the step target the Next occurrence of the above regular expression, making sure that it only looks for matches at the beginning of each line. Which magically gives us the following:

While the loop is initially told to iterate on all lines, it’s the Goto step in this instance that allows us to skip any line that’s not a line item without having to use any condition inside the loop. This technique, just like the filtering for XML/JSON data, will tremendously speed up your data mapping process while streamlining its design.
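
If it helps to picture what that regex-driven Goto is accomplishing, here is the same idea expressed as a small piece of plain JavaScript (the sample lines are made up; only the [A-Z]{3}\d{7} pattern comes from the data above, and this is a conceptual sketch, not how the DataMapper actually implements the step):

// Conceptual sketch only: which lines does the regex-driven Goto land on?
var itemPattern = /^[A-Z]{3}\d{7}/;   // product ID anchored to the start of the line

var lines = [
  "IPW8813950 Coating: Plasma       232.75   6   3   3   1396.50",   // a line item
  "           Carried forward                         12450.00",     // made-up non-item line containing numbers
  "ABC1234567 Some other product     19.99  10  10   0    199.90"    // made-up line item
];

lines.forEach(function (line) {
  if (itemPattern.test(line)) {
    // in the real configuration, the extraction steps inside the loop would run here
    console.log("Line item: " + line);
  }
});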

Yes, I know: I’m also fixated on regular expressions. But believe me when I say that spending an hour learning the very basics will save you countless hours of processing in the long run.

So if you are still using conditions as filters inside your loops (take 2)… please stop doing that!

Splitting data values

Now this one is a real pet peeve of mine. Say you have a line of data that looks like this:

North Charleston, SC

And you want to extract the City and State into two separate fields in your data model. I have probably seen more variations on this than there are cities in your country! They usually involve scripts of varying complexity.

First, the Field Definition mode property is set to Script (instead of Location). Then a script similar to the following is used:

var cityState = data.extract(1,60,9,2,"");   // grab the raw "City, ST" string from the data
var comma = cityState.indexOf(",");          // locate the separator
city = cityState.slice(0,comma);             // keep everything before the comma
city;                                        // the last expression becomes the field's value

and the equivalent script for the state looks like this:

var cityState = data.extract(1,60,9,2,"");
var comma = cityState.indexOf(",");
state = cityState.slice(comma+1).trim();
state;

Of course, the more geeky-minded will write something like this:

data.extract(1,60,9,2,"").slice(0,data.extract(1,60,9,2,"").indexOf(","))

Because – they think – it makes it soooooooooooooo much more readable… NOT!

But either way, when the Field Definition mode is set to Script, the extracted data is no longer highlighted in the data Viewer (that’s by design), which can be confusing because you have to keep double-checking whether or not you’ve already extracted that piece of information.

But really, all you need to do is leave the Field Definition mode at its default Location value and then simply use a Post function. For the city:

split(",")[0].trim()

and for the state:

split(",")[1].trim()

The code is much cleaner, and the extracted area is still being highlighted in the data Viewer.
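
And if you want to convince yourself that those two Post functions really do the same job as the scripts above, here is the equivalent written as standalone JavaScript (run outside the DataMapper, purely as an illustration):

// Same logic as the two Post functions, applied to the sample value from above
var extracted = "North Charleston, SC";        // what the Location-based field picks up
var city  = extracted.split(",")[0].trim();    // "North Charleston"
var state = extracted.split(",")[1].trim();    // "SC"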

So if you have a tendency to overthink how to handle a simple task like this… please stop doing that!

Conclusion

Hopefully, this article has allowed you to identify processes that you can improve quickly without having to redesign the entire thing. And it has allowed me to get a few pet peeves off my chest. Win-win!!

If you want more examples of … uhm … questionable practices, let us know in the comments and we may write follow-ups to this article.

Tagged in: datamapper, JSON, workflow, xml




All comments (3)

  • Darren Keenan

    Nice article with helpful tips which I’m guilty of. Could you add a save as PDF button so I can easily save a PDF to give to a customer?

    • Philippe Fontan

      We’ll see if we can plug that in real quick. But in the meantime, why not just tell your customer to subscribe to the blog?

  • Darren Keenan

    Nice article with some really helpful tips. I know that I’m guilty