2023.1 – PDF with non-embedded fonts


When processing PDF input with OL Connect, it is often a problem if the input files do not have all fonts embedded. At the very least this can make the content look bad, but it can also cause DataMapper’s text extraction to produce incorrect data. With version 2023.1, OL Connect can be provided with the missing fonts, or with substitute fonts, so that both the output and the extracted text come out right.

PDFs that have missing fonts often look fine in a PDF viewer like Acrobat Reader, because it will look for fonts on your system, and substitute a lookalike if it can’t find a match. Often this looks fine, since many people can’t spot the difference between similar fonts anyway. To a person reading a document, the most important thing is that the text is legible. If legibility is a problem, it can be remedied by installing an additional font, but the document doesn’t need to change. Corporate look is always an issue with font substitution, but that’s not always considered important.

With OL Connect, things are different. It partly relies on font information for text extraction, and it embeds fonts in its output to get reliable results. But fonts can’t be easily changed once embedded, so embedding substitute fonts may be a bad choice.

The best way to prevent issues with PDF input in OL Connect is to have all fonts embedded. However, sometimes there is insufficient control over how the PDF input for Connect is produced. Before 2023.1, that made things difficult: text extraction could fail (especially with Asian languages), and missing fonts would fall back to Courier in the output.

The benefit of falling back to Courier is that it’s hard to miss (in most cases): it looks really bad. But that’s of course also a problem: the output is not acceptable.

OL Connect output from PDF input with missing fonts

With version 2023.1, users can provide the missing fonts as separate files. This can be the actual font or a substitute font.

Providing external fonts

The sample above is from a file that uses fonts from the Galano Grotesque family. If we happen to have those fonts, we can make the OL Connect output look correct. This is how it works:

  1. Copy the fonts to the folder C:\ProgramData\Objectif Lune\OL Connect\Resources\PDF\fonts
  2. Create a file named Fontmap in the parent folder C:\ProgramData\Objectif Lune\OL Connect\Resources\PDF
  3. In the font map file, map the font names to the appropriate font files that sit in the fonts\ subfolder

In case of the Galano Grotesque fonts, the font map file would look like this:

font "GalanoGrotesque-Regular"	"RENE_BIEDER-GALANO_GROTESQUE.OTF"
font "GalanoGrotesque-Bold"	"RENE BIEDER - GALANO GROTESQUE BOLD.OTF"
font "GalanoGrotesque-Italic"	"RENE BIEDER - GALANO GROTESQUE ITALIC.OTF"

When we now create the same output, it looks a lot better:

OL Connect output from PDF input with external fonts provided by the user
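
The setup steps in this section can also be scripted, which helps when the same font map has to be rolled out to several servers. The following is only a minimal Python sketch, not an official tool: the function names fontmap_text and write_fontmap are made up for illustration, the tab separator simply copies the example above, and writing the file as UTF-8 is an assumption on my part.

```python
from pathlib import Path


def fontmap_text(fonts):
    """Return font map text with one 'font' line per (font name, file name) pair."""
    # The tab between the two quoted strings mirrors the example in this post.
    return "".join(f'font "{name}"\t"{file}"\n' for name, file in fonts.items())


def write_fontmap(pdf_dir, fonts):
    """Write <pdf_dir>/Fontmap and return the font files missing from <pdf_dir>/fonts."""
    pdf_dir = Path(pdf_dir)
    # Report entries whose font file has not been copied to the fonts subfolder yet.
    missing = [f for f in fonts.values() if not (pdf_dir / "fonts" / f).is_file()]
    # Assumption: UTF-8 encoding; the post does not state which encoding is expected.
    (pdf_dir / "Fontmap").write_text(fontmap_text(fonts), encoding="utf-8")
    return missing
```

Calling write_fontmap with the folder C:\ProgramData\Objectif Lune\OL Connect\Resources\PDF and a dictionary of the three Galano Grotesque entries would produce the font map shown above. Any file names it returns still need to be copied to the fonts subfolder.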

Font substitution

The example above doesn’t show a font mapping for a bold italic variant of Galano Grotesque. That’s because I don’t have that font. What I do have is a semi-bold italic variant. I can add that font to my font map, and then use the semi-bold italic as a substitute for bold italic by creating an alias. An alias refers to another font name instead of an actual font file. It looks like this:

font "GalanoGrotesque-SemiBoldItalic" 	"RENE BIEDER - GALANO GROTESQUE SEMIBOLD ITALIC.OTF"
alias "GalanoGrotesque-BoldItalic" "GalanoGrotesque-SemiBoldItalic"

Although these fonts are likely very similar, it would still be advisable to check if the result is acceptable.
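
Before checking the output, it can help to confirm that every alias actually ends at a font file. Below is a small Python sketch of such a check. It is based on assumptions: the line format (a keyword followed by two double-quoted strings) is inferred only from the examples in this post, the names parse_fontmap and resolve are hypothetical, and whether OL Connect itself follows chains of aliases is not something I can confirm here.

```python
import re

# Assumed line shape, inferred from the examples in this post only:
# 'font' or 'alias', then two double-quoted strings separated by spaces or tabs.
ENTRY = re.compile(r'^\s*(font|alias)\s+"([^"]+)"\s+"([^"]+)"\s*$')


def parse_fontmap(text):
    """Split font map text into {font name: file} and {alias name: target name}."""
    fonts, aliases = {}, {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if not match:
            continue  # blank or unrecognized lines are skipped in this sketch
        kind, name, target = match.groups()
        (fonts if kind == "font" else aliases)[name] = target
    return fonts, aliases


def resolve(name, fonts, aliases):
    """Follow aliases until a font entry is reached; return its file name or None."""
    seen = set()
    while name in aliases and name not in fonts:
        if name in seen:
            return None  # aliases point at each other in a loop
        seen.add(name)
        name = aliases[name]
    return fonts.get(name)
```

With the two lines above as input, resolve("GalanoGrotesque-BoldItalic", fonts, aliases) returns the semi-bold italic file name. A result of None points to an alias whose target has no font entry.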

Text extraction

There are cases where text extraction fails because of missing font information. Providing the missing fonts can solve this. If the actual font is not available, providing a substitute can also fix the text extraction.

The screenshot below shows a problem with extraction of a Japanese text:

Even if you don’t know Japanese, it’s pretty clear that 請求NO: is not the same as ‮!!! ٻ !. When we add a suitable font and substitution to the font map:

font "KozMinPr6N-Regular" "KozMinPr6N-Regular.otf"
alias "MS明朝" "KozMinPr6N-Regular"

We get the result we want:

Unfortunately, there can be other issues with PDF input that cause text extraction to fail, so this is not guaranteed to work. We will continue to make improvements in this area.

System fonts

OL Connect currently does not use system fonts to resolve missing fonts or substitute fonts. The reason is that not every system has the same fonts. For instance, I have more than 200 fonts on my desktop system, but a vanilla Windows Server has only around 90 system fonts. If OL Connect fell back to system fonts on my system, I could be under the impression that my input is fine, while that same input would look all wrong when processed on a different system.

The explicit font substitution mechanism ensures that you will be aware that the input requires external fonts.



