SPONSORED – If you use PDFs somewhere in your enterprise workflow, there’s a strong chance that – like 70% of the Fortune 50 – you’ve relied on the work of Raf Hens, CTO at Belgium’s iText, and his global team of developers.
From archiving to invoicing, pension statements to utility bills, PDFs are deeply embedded in many enterprise workflows. Behind the scenes at the company, which sees over 15,000 downloads of its software daily, considerable innovation is happening around iText’s widely deployed server-side PDF software.
Speaking to The Stack, Raf highlights that the company’s blue chip customers are not just generating PDF documents for customers or internal use. They are also underpinning increasingly sophisticated Intelligent Document Processing (IDP), which spans data entry as well as data extraction and analysis.
iText is also helping clients radically reduce computational and storage overheads through document optimisation (e.g. via automated removal of unneeded elements that can bloat documents).
As he puts it: “We’ve carved out the space between the data world and the document world. Early on, the initial use case of iText was turning data into PDF documents and automating this in a programmatic way: e.g. a process to take data from a database, and turn it into a PDF invoice. Over the years we have built on top of that: powering not just document generation from data, but also processing of existing documents.”
That includes content extraction. As Raf adds: “You can think of PDF not just as a document format, but a data container with different blocks of content: textual and image content, obviously, but it can also contain font information, colour information, and more.”
“We’ve had cases where people were dealing with documents that turned out to be bloated with font information: what started as a 300 kilobyte document became a seven megabyte document, slowing their transmission times from a couple of seconds to a couple of minutes. The ability to remove superfluous embedded information through iText’s pdfOptimizer was very valuable to them.”
Building a stack on top of the core library
Raf’s team focus on building a very clean and efficient base library with all the robust underlying PDF functionalities relied on by many colossal enterprises. “With great power comes great responsibility,” he says drily, “because you don’t want to get that wrong – unless you want to intentionally create corrupt PDF files to test your document workflow, and some organisations do that! But what we do after that is build a ‘convenience stack’ on top as well, that abstracts away more of the PDF specifics. For example, in the first stage, it might be interesting to get just a piece of text out; but in the end you’re interested in what that data means…”
One of the ways in which iText is doing this is through intelligent templates. A customer can take a sample invoice, for example, annotate it (indicating where the interesting pieces of information are) and then automate extraction for similarly formatted documents. Just how similar do the other documents need to be?
“They need to be structurally similar, but not identical in layout”, Raf explains.
“To give you a very straightforward example, let’s say you get invoices from the same vendor. One month, the invoice could be one page, and the line items on the invoice are just in a table on one page. If, the next month, that invoice turns out to be two or three pages and the invoice lines span a couple of pages, then our extraction system is intelligent enough to recognise that.”
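As a rough illustration of the idea of "structurally similar, but not identical in layout", the sketch below treats an invoice as a list of pages, each holding table rows, and collects the line items whether they sit on one page or several. It is not iText's extraction logic or API: the function name `merge_line_items`, the row format, and the assumption that the table header may repeat at the top of each page are all hypothetical.

```python
# Hypothetical sketch: not iText's API or its actual extraction method.
# Assumption: each page has already been reduced to a list of table rows,
# and the table header row may be repeated on continuation pages.

HEADER = ("Description", "Qty", "Amount")


def merge_line_items(pages, header=HEADER):
    """Collect invoice line items across pages, skipping repeated headers."""
    items = []
    for rows in pages:
        for row in rows:
            if row == header:
                continue  # header repeated on a new page, not a line item
            items.append(row)
    return items


# The same three line items, laid out on one page and then across three pages.
one_page = [
    [HEADER, ("Paper", 2, 10.0), ("Toner", 1, 55.0), ("Shipping", 1, 5.0)],
]
three_pages = [
    [HEADER, ("Paper", 2, 10.0)],
    [HEADER, ("Toner", 1, 55.0)],
    [("Shipping", 1, 5.0)],
]

# Both layouts yield the same extracted data.
same_result = merge_line_items(one_page) == merge_line_items(three_pages)
```

The point of the sketch is only that the extracted data depends on the structure (a table of line items), not on how many pages the table happens to occupy.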
(It’s not just extraction: the team also recently released iText DITO 2.0 – an updated template design and document generation engine that lets users easily create unique invoices or personalised documents.)
With many modern ERP systems offering similar data extraction and analysis capabilities out of the box before a document has become a PDF, why would these document-to-data capabilities be needed?
There are usually three main cases, Raf notes: “We really provide developer-oriented components that can augment or improve certain document processes. So even if a customer has an existing document workflow system in place, they may use iText technology, including pdf2Data, to pre-process or post-process certain documents, and then ingest that data in the rest of their document flow logic.
“Alternatively, they might be building a custom application that needs iText technology in it, because they have very specific document needs that are not covered by existing solutions; that’s not uncommon. Or then again, sometimes it’s a ‘build versus buy’ decision – where the ‘buy’ option is immensely expensive, and it makes sense to just tweak their flow to the use case that they are trying to solve specifically, using our software.”
When it comes to recent custom applications built with iText software at their heart, Raf has a topical example that springs rapidly to mind: a company using its capabilities to underpin its extensive work on FDA (US Food and Drug Administration) approvals. “Instead of shipping truckloads of physical paper (still common given the sensitivity of many applications) they’re doing it in PDF format; but there are still very strict requirements on how that needs to be presented for approval. Many of the documents pre-exist, so the content is there, but it still needs to be shaped into the right format or structure for submission.”
“Adding a table of contents linking to all the different documents is not something you want to do manually at this scale! But based on certain document information you can extract the start of a chapter, for example, and build an automatic table of contents based on that. Doing that in a generic way is tricky, because it comes with a lot of specific requirements for FDA approval, so building a custom application makes sense.”
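The automatic table of contents Raf describes can be sketched in a few lines, under simplifying assumptions. The example below assumes each source document has already been reduced to a title and a page count (both hypothetical fields; extracting them from real PDFs is not shown) and computes the page on which each document starts once they are merged in order. It is an illustration only, not iText's API and not the customer's actual application.

```python
# Hypothetical sketch: names and fields are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class SourceDoc:
    title: str       # assumed to have been extracted from the document
    page_count: int  # number of pages in the document


def build_toc(docs, first_page=1):
    """Return (title, start_page) pairs for documents merged in order.

    first_page is the page number of the first merged document, e.g. 2
    if a one-page table of contents is placed in front of the bundle.
    """
    toc = []
    next_page = first_page
    for doc in docs:
        if doc.page_count < 1:
            raise ValueError(f"{doc.title!r} must have at least one page")
        toc.append((doc.title, next_page))
        next_page += doc.page_count  # the next document starts after this one
    return toc


docs = [
    SourceDoc("Cover letter", 2),
    SourceDoc("Study report", 40),
    SourceDoc("Labelling", 5),
]

for title, page in build_toc(docs):
    print(f"{title}: page {page}")
```

In a real submission the entries would also need to become links and follow the regulator's formatting rules, which is the part Raf says is hard to do generically.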
Compliance is a major issue for iText’s community of developers, many of whom are working in banking, insurance, pharmaceuticals, or government. The ISO27001-accredited company is highly attentive to this, remaining heavily involved upstream with work around the technical specifications that define PDF and its sub-standards, like PDF/A for archiving. Its ongoing work on AI and data extraction (the company plans to make acquisitions in this space to further bolster its capabilities in the near future) is also important.
But a rock-solid commitment to stability and performance remains at the heart of the company.
Raf looks ahead: “There’s a variety of regulations around how long a document needs to be stored; in Belgium, for example, some documents have to be stored for 75 years. Imagine trying to open a document from 10 years ago now: I struggle with compatibility issues. So imagine having to do that 75 years from now! I’m not even sure I dare to venture a guess what computing technology will look like in 75 years.”
“PDF/A has all the provisions in place to make a PDF document as self-contained as possible, with all the information necessary to still render it or present it for consumption to the user years afterwards. That’s an important one, when it comes to compliance. So is accessibility: making document content suitable for those who require assistive technology, which is required by law for certain industries and something we strongly support.”