2023.1 – DM performance metrics

Starting with version 2023.1, the DataMapper provides performance metrics for each step of the DM flow. Those metrics help you identify the steps that consume the most time, which allows you to accurately pinpoint which steps should be reviewed and tweaked to improve overall performance.

How to obtain metrics

That’s the easy part: once you have implemented your logic in the DataMapper, click the Validate all records button (Alt-V). The operation behaves as it did before: it runs your DM flow for all records (up to the limit you defined in the Record limit setting of the Boundaries section).

But starting with version 2023.1, the DM keeps an internal record of how much time each step takes and how many times each step is called. Once the validation process is complete, a window is displayed, listing all the metrics recorded during the operation.

Obviously, the more records are validated, the more precise those metrics are likely to be.

Interpreting the metrics

The metrics are displayed in descending order of Max time in record, the rightmost column in the table. The values in that column show the longest time (in microseconds) spent executing each individual step across all the records that were validated, along with the index of the record in which that peak occurred. Clicking on one of the table rows automatically moves the current record to that index, so you can immediately start investigating why the step took longer for that record than for others. In addition to moving to the target record, clicking on a line also causes the DataMapper to highlight and select the corresponding step in your DM flow.
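To make the layout of that table easier to picture, here is a rough model of a few of its rows, expressed as plain JavaScript data. The step names, call counts, percentages and record indices match the example discussed below; the field names and the microsecond values are invented for illustration:

// Field names and any values not mentioned in this post are invented;
// the step names, the call counts, the 92%/6% split and record indices
// #8 and #18 come from the example described below.
var metrics = [
  { step: "Extract item",    calls: 1508, timePercent: 92, maxTimeMicros: 850, maxTimeRecord: 8 },
  { step: "Goto next item",  calls: 1508, timePercent: 6,  maxTimeMicros: 120, maxTimeRecord: 18 },
  { step: "Find first item", calls: 50,   timePercent: 1,  maxTimeMicros: 95,  maxTimeRecord: 3 }
];
// Rows are sorted in descending order of maxTimeMicros, so the single
// slowest execution of any step, for any record, always appears first.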

So in the example above, if we were to click on the first line, the DataMapper would automatically take us to record #8 and highlight the Extract item step inside the loop, but if we clicked on the second line, the DataMapper would navigate to record #18 and highlight the Goto next item step.

The second important metric is the Calls column. This shows you how many times the step was executed in total, for all the records that were validated. In the above example, my file only has 50 records (which you can see by looking at the metrics for the Find first item step, at the bottom of the table). But the Extract item step was executed 1508 times in total, because that’s the step that extracts each detail line from the invoices (which means each invoice has, on average, about 30 line items).

The number of calls is important because it helps you gauge whether the Time(%) spent in a step is critical or not. In the example above, two steps run 1508 times each, but one of them takes a whopping 92% of the time while the other takes 6%. So obviously, you should focus on the former first if you plan on optimizing your flow.
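To make that reasoning concrete, here is a small JavaScript sketch using the numbers from the example above (the variable names are just for illustration):

// Numbers taken from the example in this post:
var recordCount  = 50;
var extractItem  = { calls: 1508, timePercent: 92 };
var gotoNextItem = { calls: 1508, timePercent: 6 };

// On average, each invoice has about 30 detail lines:
var avgItemsPerRecord = extractItem.calls / recordCount;          // 1508 / 50 ≈ 30.2

// Both steps run the exact same number of times, yet one of them accounts
// for roughly 15 times more of the total processing time than the other,
// which makes it the obvious first candidate for optimization:
var weight = extractItem.timePercent / gotoNextItem.timePercent;  // 92 / 6 ≈ 15.3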

Note that each time you run the validation process, those numbers will change, and the order of the steps might change as well. That’s normal: your machine might be busier on one particular occasion, or the differences could be so minute that a few microseconds are enough to alter the order. So you should not take those numbers at face value, but rather regard them as indications of where a DM flow might benefit from optimization.

Detected problems

A second tab in the metrics dialog displays any potential problems that were identified during the validation process. Those problems are not errors per se, but they highlight aspects of the configuration that did not cause an actual error with the current file, yet might with a different one.

As an illustration, I altered the above configuration so that the document boundaries are set to All pages (instead of actually identifying specific boundaries for the 50 invoices in my file). Now, when I validate the file, I get the following:

The validation process detected two potential issues. The first one explains that it could not validate all the line items extracted to the detail table, because their number exceeds the limit set in Preferences > Editing > Detail records preview limit. You can raise that limit temporarily and validate the file again to make sure everything is fine.

The second potential issue has to do with the boundaries themselves: the entire file only generates a single record. Now that may be a perfectly valid and intentional setting, but the DM validation process still highlights it: if you throw a file at it that contains thousands of pages, you may end up with zillions of detail items in your table, which will lead to performance issues and possibly memory issues as well. So if you know this will never occur, you can safely ignore the warning.

There are a few other types of warnings and informational tidbits that can be displayed in this table. You can get all the information from the help page.

Note that there are no errors displayed in this tab: that’s because the Performance/Problems window is never displayed when the validation process runs into errors. The reasoning behind this is simple: it’s useless to start optimizing a process that doesn’t work in the first place! So fix your errors first, and then you can start looking at potential ways to improve it.

What isn’t validated

The pre- and post-processors are not measured because they are not executed for each record in the data stream. If you use scripted boundaries, those are not measured either, for the same reason.

Also, for Repeat and Condition steps, the measurement is based on the time spent evaluating the condition for the step, not on the entire branch (the individual step timings cover that). So, for instance, the time displayed for a Repeat step does not include the time spent on the steps inside the loop structure, as the sketch below illustrates.
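Here is a small, purely illustrative JavaScript sketch of how you would combine those individual timings yourself if you wanted the full cost of a loop for a given record (all the numbers are invented):

// What the metrics report for each line (invented numbers):
var repeatConditionMicros = 40;    // the Repeat step itself: condition evaluation only
var extractItemMicros     = 520;   // Extract item (inside the loop)
var gotoNextItemMicros    = 35;    // Goto next item (inside the loop)
var iterations            = 30;    // detail lines in this particular record

// The full cost of the loop for that record is the condition check plus
// every inner step, repeated for each iteration:
var totalLoopMicros = (repeatConditionMicros + extractItemMicros + gotoNextItemMicros) * iterations;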

Some final notes

The performance metrics are a very useful tool for identifying steps that may require optimization. But don’t go crazy with optimization: we are talking microseconds here! If you shave 10 microseconds off a step, then after 100 000 records you will have gained one whole second! You are probably better off spending your time on some more rewarding projects…

That said, the steps most likely to be identified as slower are usually script-based. That’s because the script engine must be initialized for each step that uses it, which adds some overhead to the actual execution time. As a rule of thumb, avoid scripts when you can achieve the same thing with standard steps. But when you do use scripts, make sure you write clean and efficient code, especially if your script contains internal loops.
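As an example of what clean and efficient code means inside a loop, here is a generic JavaScript sketch, not tied to any specific DataMapper API, showing the kind of change that pays off: do invariant work once, outside the loop, rather than redoing it on every iteration.

// Slower: a new regular expression object is built on every iteration.
function sumAmountsSlow(lines) {
  var total = 0;
  for (var i = 0; i < lines.length; i++) {
    var match = lines[i].match(new RegExp("([0-9]+\\.[0-9]{2})$"));
    if (match) total += parseFloat(match[1]);
  }
  return total;
}

// Faster: the regular expression is created once and reused for every line.
var amountPattern = /([0-9]+\.[0-9]{2})$/;
function sumAmountsFast(lines) {
  var total = 0;
  for (var i = 0; i < lines.length; i++) {
    var match = lines[i].match(amountPattern);
    if (match) total += parseFloat(match[1]);
  }
  return total;
}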


