What is the best file format on the market?

A comparison of output formats

Scanner data sheets usually list TIFF and JPG as output formats. For the scan service provider, these files are the starting point for further processing, up to and including conversion into the output format the customer requires.

Customers often ask which image format is the right one. The answer usually depends heavily on the customer's downstream archive system. Nevertheless, it is important to know which formats are the best choice for the documents to be digitized and for the customer's requirements. In this article, we compare three of the most common output formats and summarize their advantages and disadvantages.

TIFF
The Tagged Image File Format (TIFF) was developed in the mid-1980s by Aldus (taken over by Adobe in 1994) together with Microsoft, originally for scanned raster graphics and color separation. Until the turn of the millennium, it was the preferred format for archiving scanned documents.

The CCITT Group 3 (G3) and Group 4 (G4) compression methods reduce digitized files to a minimum. However, they store only black-and-white (bilevel) data, so any existing color information is lost. This is not a problem for black-and-white documents or for documents where color plays no role. If the documents to be digitized also contain images, for example claims documents for insurance companies, they can no longer be compressed with G4, and the resulting color files are very large compared with G4-compressed ones.
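To make the size difference concrete, here is a rough back-of-the-envelope sketch. The page size (A4), the resolution (300 dpi), and the toy run-length function are our assumptions for illustration. Real G3/G4 coders combine run lengths with Huffman codes and, in the case of G4, two-dimensional coding, so this is not the actual algorithm, only the basic idea behind it.

```python
# Rough size arithmetic for one scanned page.
# Assumptions for illustration: A4 paper, 300 dpi. Not a CCITT G4 implementation.

A4_INCHES = (8.27, 11.69)   # width and height of an A4 page in inches
DPI = 300                   # assumed scan resolution


def raw_size_bytes(bits_per_pixel: int, dpi: int = DPI) -> int:
    """Uncompressed size of one A4 page at the given bit depth."""
    width_px = round(A4_INCHES[0] * dpi)
    height_px = round(A4_INCHES[1] * dpi)
    return width_px * height_px * bits_per_pixel // 8


def run_lengths(row: list[int]) -> list[tuple[int, int]]:
    """Collapse a bilevel pixel row (0 = white, 1 = black) into (value, count) runs."""
    runs: list[tuple[int, int]] = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1] = (pixel, runs[-1][1] + 1)
        else:
            runs.append((pixel, 1))
    return runs


bilevel = raw_size_bytes(1)    # black and white: 1 bit per pixel
color = raw_size_bytes(24)     # RGB color: 24 bits per pixel
print(f"bilevel: {bilevel / 1e6:.1f} MB, color: {color / 1e6:.1f} MB")

# A mostly white pixel row with one short black stroke collapses into three runs.
row = [0] * 1000 + [1] * 12 + [0] * 1469
print(run_lengths(row))
```

Even before any compression, the color page is 24 times larger than the bilevel page. The long white runs typical of text pages are what bilevel coders exploit on top of that.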

TIFF can also store multi-page documents in a single file. Such a multi-page TIFF lets you page forwards and backwards through the document.

+ small file sizes for black and white images
+ multi-page documents in one file
- color information creates large files
- only limited support for document metadata

JPEG
The JPEG format was introduced in the early 1990s. It is named after the Joint Photographic Experts Group, which developed the format for compressing color images. The abbreviation "JPG" has since become established, as the file extension shows.

It is an image format rather than a document format and has, for example, no page logic. Multi-page documents are therefore stored as several individual JPGs. In addition, depending on the configuration, the lossy compression blurs sharp edges in letters, so full-text recognition via OCR delivers poorer results. To avoid this disadvantage, the scanning service provider must either process the documents losslessly or optimize them for OCR by means of binarization, i.e. by generating a TIFF from the JPG, which makes the files significantly larger. The provider has to compensate for this by compressing the files again before handing them over to the customer.
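The binarization step mentioned above can be sketched in a few lines. This is a minimal illustration, assuming a grayscale image given as rows of values from 0 (black) to 255 (white) and a fixed threshold of 128. Both the function name and the threshold are our assumptions; production pipelines typically choose the threshold adaptively.

```python
# Fixed-threshold binarization of a grayscale image (rows of 0-255 values).
# The default threshold of 128 is an assumption for illustration.

def binarize(gray_rows, threshold=128):
    """Map each pixel to 1 (black) if it is darker than the threshold, else 0 (white)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


# A soft edge, as lossy compression might leave it around a letter stroke
blurred_edge = [[250, 240, 180, 90, 20, 10]]
print(binarize(blurred_edge))  # [[0, 0, 0, 1, 1, 1]]
```

The gradual transition becomes a hard black-and-white edge again, which is the form OCR engines generally handle best.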

+ small file sizes for images
+ faithful document reproduction, as color information is included
- Multi-page documents are saved in several JPGs
- poor OCR results

PDF
The Portable Document Format was developed by Adobe and first introduced in 1993. The intention of Adobe co-founder John Warnock at the time was to provide the IT world with an easy-to-use format that required little storage space and facilitated file exchange between different systems. PDF has long since established itself as a document format, not least because Adobe has continuously developed it further.

Functions such as metadata handling, the integration of digital signatures, and integrated compression have been added and have promoted the acceptance of PDF. In the case of an invoice, for example, the associated invoice data can be stored in the PDF alongside the image, which simplifies incoming invoice processes, among other things. The market offers numerous solutions for creating, editing, and compressing PDF files. The current version is PDF 2.0, which was published by the ISO in 2017.

For example, we create PDF files that serve as a container for the generated TIFF or JPG images. This makes it possible to store color and black-and-white documents in one PDF file. Generating multi-page documents is not a problem either.
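The container idea can be illustrated with a small sketch. PDF can carry JPEG data unchanged as an image object using the DCTDecode filter, so no recompression is needed. The function below is a minimal illustration and not our production tooling: the name jpeg_to_pdf is hypothetical, the caller is assumed to know the pixel dimensions, and the JPEG is assumed to be RGB (a grayscale JPEG would need a different color space entry).

```python
def jpeg_to_pdf(jpeg_bytes: bytes, width_px: int, height_px: int, dpi: int = 300) -> bytes:
    """Wrap an RGB JPEG in a one-page PDF without re-encoding the image data."""
    # Page size in PDF points (1 point = 1/72 inch)
    page_w = width_px * 72 / dpi
    page_h = height_px * 72 / dpi
    # Content stream: scale the unit square to the page and paint the image
    content = f"q {page_w:.2f} 0 0 {page_h:.2f} 0 0 cm /Im0 Do Q".encode("ascii")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {page_w:.2f} {page_h:.2f}] "
         f"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>").encode("ascii"),
        (f"<< /Type /XObject /Subtype /Image /Width {width_px} /Height {height_px} "
         f"/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /DCTDecode "
         f"/Length {len(jpeg_bytes)} >>\nstream\n").encode("ascii")
        + jpeg_bytes + b"\nendstream",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))          # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"       # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# Placeholder bytes stand in for real JPEG data here, so this output would not render.
pdf = jpeg_to_pdf(b"\xff\xd8\xff\xd9", width_px=2481, height_px=3507)
print(pdf[:8])  # b'%PDF-1.4'
```

A black-and-white page could be embedded in the same way with G4-compressed data and the CCITTFaxDecode filter, which is how color and bilevel pages can share one file.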

+ Text and image can be integrated
+ multi-page documents in one file
+ good OCR results, resulting in document searchability
+ Metadata support
+ low storage requirements, even for color documents

PDF/A
PDF/A is a sub-specification of PDF and the ISO standard for long-term archiving. This format ensures that your files can be reproduced for decades. That is sometimes essential, for example for credit files, construction drawings, patient records, or simply invoices. To ensure this, the standard defines certain rules. For example, a PDF/A file must contain everything required to display it (e.g. fonts) and must not contain anything that could impair this display (e.g. dependencies on external resources).
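A PDF/A file declares its claimed conformance in its XMP metadata through the properties pdfaid:part and pdfaid:conformance. The following sketch reads that claim. It is a heuristic under stated assumptions: the function name is hypothetical, the XMP packet is assumed to be stored uncompressed in the file, and a claim is not proof of conformance. Actual validation requires a dedicated validator such as veraPDF.

```python
import re

# Match both XMP serializations:
#   attribute form: pdfaid:part="2"
#   element form:   <pdfaid:part>2</pdfaid:part>
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
_CONFORMANCE = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_level(pdf_bytes: bytes):
    """Return the claimed PDF/A level such as '2b', or None if no claim is found."""
    part = _PART.search(pdf_bytes)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conformance = _CONFORMANCE.search(pdf_bytes)
    if conformance:
        level += conformance.group(1).decode("ascii").lower()
    return level


sample = b'<rdf:Description pdfaid:part="2" pdfaid:conformance="B"/>'
print(claimed_pdfa_level(sample))  # 2b
```

A check like this is useful as a quick plausibility test on incoming files, for example to spot ordinary PDFs that were delivered where PDF/A was agreed.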

+ stable long-term format
+ multi-page documents in one file
+ good OCR results, resulting in document searchability
+ small file sizes regardless of the content
+ Metadata support

We recommend that our customers store digitized documents in PDF/A format. In our opinion, TIFF should still be selected if the systems that process the digital images accept only this format. JPEG, as an image format, is only an option if the color information is important and the documents otherwise only need to be stored as "dumb" images.

Contact us if you have any questions about the formats!