
Overview of the PDF/A standards

The PDF document format was developed by Adobe in the early 1990s on the basis of the page description language PostScript. At first it was a proprietary but openly documented file format. In 2008 it was standardized by the ISO, and since then version 1.7 has formed ISO standard 32000.

PDF/A – The PDF for archiving:

PDF/A is the name of ISO standard 19005, which defines a standard document format for the long-term archiving of electronic documents. The standard specifies which PDF functions must or must not be present so that documents remain usable in the long term.

Important: The PDF/A standards build on each other. A document that conforms to PDF/A-1 uses only functions that PDF/A-2 and PDF/A-3 also allow; the higher parts simply permit more PDF functions. There is no “better” or “worse” PDF/A level: choose the part and level that provide the functions you need.

PDF/A-1 (since 2005)

For PDF/A-1 there are 2 levels:

  • PDF/A-1b: basic – ensures the reliable visual reproducibility of PDF/A documents.
  • PDF/A-1a: accessible – like 1b, but the document must also contain its content structure (tagged PDF). Direct conversion, scanning, OCR or printer drivers can technically produce this level automatically, but in practice the content structure usually has to be created and completed manually in the source application.

PDF/A-2 (since June 2011)

For PDF/A-2 there are 3 levels:

  • PDF/A-2b: basic – corresponds to 1b, with the extensions of PDF/A-2
  • PDF/A-2a: accessible – corresponds to 1a, with the extensions of PDF/A-2
  • PDF/A-2u: unicode – has no equivalent in PDF/A-1; corresponds to 2b, but the embedded text must be mapped to the Unicode standard.

Extensions compared to PDF/A-1:

  • JPEG2000 compression
  • Transparency
  • Layers
  • OpenType fonts
  • Digital signatures as PAdES (PDF Advanced Electronic Signatures)
  • Container: PDF/A-1 files can be embedded in PDF/A-2 files
  • The maximum page size was extended to 381 x 381 km
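
The 381 km page limit listed above can be reproduced with simple arithmetic on PDF page units. The following is a minimal sketch, assuming the commonly cited limits of 14,400 default units per page side (1/72 inch each) and a page scaling factor (UserUnit) of up to 75,000. Neither number is stated in this article, so treat them as assumptions.

```python
# Assumed limits (not stated in the article): 14,400 units per page side,
# 1/72 inch per default unit, and a UserUnit scaling factor of up to 75,000.
MAX_UNITS_PER_SIDE = 14_400
UNITS_PER_INCH = 72
MAX_USER_UNIT = 75_000
METERS_PER_INCH = 0.0254


def max_side_km(user_unit: float = 1.0) -> float:
    """Largest page side in kilometers for a given UserUnit factor."""
    inches = MAX_UNITS_PER_SIDE / UNITS_PER_INCH * user_unit
    return inches * METERS_PER_INCH / 1000


print(max_side_km())               # unscaled page: about 0.00508 km (5.08 m)
print(max_side_km(MAX_USER_UNIT))  # fully scaled page: about 381 km
```

Without scaling, a page side is limited to 200 inches (about 5 m); the scaling factor is what raises the limit to roughly 381 km.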

PDF/A-3 (since October 2012)

The essential extension of PDF/A-3 is that any file can be embedded in the PDF/A. For archiving, the PDF file, which is used for searching, displaying and printing, can thus be combined with the source file. If you archived only the PDF version of an MS-Excel file, potentially important additional information, such as the formulas it is based on, would be lost. The embedded (source) files can be extracted from the PDF at any time.

Further ISO-standardized PDF standards are:

  • PDF/E – PDF for Engineering: ISO 24517. PDF/E documents support layers for installation and construction plans as well as three-dimensional models including predefined 3D views.
  • PDF/H (Healthcare) – PDF in the healthcare sector (best practice) for diagnostic imaging and for the storage of patient data and medical reports.
  • PDF/X (Exchange) for print masters: ISO 15929 / 15930 – The PDF/X standard was developed for the exchange of advertisement data for newspapers as well as for the transfer of print masters and print jobs. PDF/X is available in the following levels: 1a, 2, 3, 4, 5, 5g, 5pg, 5n
  • PDF/UA (Universal Accessibility) – ISO 14289 – for universally accessible documents, e.g. as a reading aid for visually impaired people.
  • PDF/VT (Variable Transactional) – ISO 16612-2 – for the “printing of variable or transactional document contents”.
  • PDF 1.7 – ISO 32000: The ISO has approved the Portable Document Format (PDF) 1.7 as an international standard.

New Web-Site – www.OCRServer.at – online

We have gathered all of our OCR products on our newly created website.

You can find more information about the following products there:

  • AutoOCR
  • AutoOCR light
  • DropOCR
  • FineOCR
  • ifresco Transformer
  • FileConverter (pro)
  • ifresco Profiler + Plugins

FileConverter – automatically convert documents and e-mails from folders or e-mail boxes to PDF, PDF/A and TIFF

The FileConverter is an application that can be installed as a service on MS-Windows (32 and 64 bit). It monitors folders and e-mail boxes and automatically converts the documents they contain to the PDF, PDF/A or TIFF file format. Multiple folders as well as MS-Exchange and POP3 mailboxes can be configured and monitored.

The following input document formats are supported:

  • DOC, DOCX, RTF, TXT
  • XLS, XLSX
  • PPT, PPTX
  • XFDF, FDF
  • PNG, BMP, TIF, TIFF, JPG, JPEG
  • ZIP, RAR, 7Z
  • MSG, EML
  • PDF
  • HTM, HTML, MHTML
  • PMT, PMTX

File format features:

  • With ZIP/RAR/7Z containers, all contained and supported documents are automatically extracted and converted. The folder structure of the container is recreated in the output directory.
  • PMT and PMTX are PDFMerge XML data formats that contain hierarchical structure information as well as links to the documents or the documents themselves. From these files the FileConverter produces, like the PDFMerge server, a single overall PDF file merged from the individual documents after their conversion to PDF. The structure defined in the XML is displayed as PDF bookmarks.

Conversion:

  • The PDF/TIFF conversion takes place directly, without using the source application, so no installation of MS-Office or Adobe Acrobat is necessary for processing. Optionally, the PDFs can also be exported in the ISO-standardized PDF/A-1b format.
  • The standard scope also includes the iOCR engine for creating searchable PDF(/A)s from PDF or image documents. Optionally, Abbyy, in our view the most capable OCR engine at the moment, can also be installed. During OCR processing, PDF documents are analyzed page by page, and only documents that do not yet contain text information are processed (intelligent OCR processing). This saves resources and increases both quality and processing speed.

Functions – general:

  • MS-Windows service application for converting MS-Office, PDF, image, HTML, ZIP, MSG and e-mail documents to PDF, PDF/A or TIFF
  • Multiple folders as well as MS-Exchange and POP3 e-mail boxes can be monitored and processed in parallel.
  • Direct conversion without additional source applications (MS-Office, Adobe Acrobat) or printer drivers.
  • Flattening of filled PDF forms: PDF forms (XFDF, FDF) can be converted into normal PDF documents. The forms can either be stored permanently or loaded anew every time.
  • Parallel processing with a configurable number of processes allows optimal utilization of the hardware and ensures fast processing.
  • Logging of all conversion runs, forwarding of failed e-mail conversions or sending of error e-mails via SMTP

In / out folder processing:

  • Processing of files and folders from configured in / out folders, triggered by a time interval or a “ready” file, incl. subfolder processing (one level)
  • Creation of an index text file listing all files generated during a processing run.
  • After processing: deleting, moving into an archive folder, or renaming of the files or folders (.con / .err)
  • Configuration of the filename extensions that should not be converted; these are ignored and not processed. E-mails with attachments that have unidentifiable extensions are handled as errors and forwarded to an e-mail address.
  • Single-page output with a configurable number of digits for the page index
  • Configuration of the TIFF conversion: compression / color depth / resolution / JPEG quality
  • Extensive parameters for the OCR processing with iOCR or Abbyy; the FileConverter has the same OCR functions as AutoOCR
  • Parameters for the HTML conversion: page size and margins; HTML documents and e-mails are scaled automatically.
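
The in-folder workflow described in the list above (wait for a “ready” file, skip configured extensions, rename each file afterwards) can be illustrated with a small sketch. This is not the FileConverter's actual implementation: the function name, the `convert` callback, the default marker name and the reading of .con as “converted” and .err as “error” are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable


def process_in_folder(in_dir: Path,
                      convert: Callable[[Path], None],
                      ready_name: str = "ready",
                      skip_ext: frozenset = frozenset({".tmp"})) -> dict:
    """Process all files in in_dir once the ready marker file is present.

    Hypothetical sketch: returns a mapping of file name to the suffix
    that was appended (".con" on success, ".err" on failure).
    """
    results = {}
    marker = in_dir / ready_name
    if not marker.exists():
        return results  # the producer has not finished writing yet

    # Take a snapshot of the folder first, because files are renamed below.
    for src in sorted(p for p in in_dir.iterdir() if p.is_file()):
        already_done = src.suffix.lower() in {".con", ".err"}
        if src == marker or already_done or src.suffix.lower() in skip_ext:
            continue  # ignored extensions are left untouched
        try:
            convert(src)
            suffix = ".con"
        except Exception:
            suffix = ".err"
        src.rename(src.with_name(src.name + suffix))
        results[src.name] = suffix

    marker.unlink()  # consume the marker so the next run waits again
    return results
```

A service would call such a function periodically for every configured folder; the time-interval trigger mentioned above would replace the marker check with a timer.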

Processing of e-mail boxes:

  • Processing of POP3 / MS-Exchange e-mail boxes: forwarding or deleting after successful or failed processing, or moving into an archive / error folder under MS-Exchange. Direct access to MS-Exchange 2007/2010/2013 through the SOAP web service interface.
  • EML and MSG: body and attachments are converted, and the e-mail header information (from, date, to, subject) is generated in the body document
  • Output of an XML file listing the processed e-mails with their metadata and file links; configurable: from, to, cc, bcc, received, subject, body, attachments
  • Output per e-mail in separate subfolders or “flat” in the destination folder.
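
To give an idea of what the XML output with e-mail metadata and file links could look like, here is a minimal sketch using only the Python standard library. The element names, the `link` attribute and the function name are hypothetical; the actual XML layout written by the FileConverter is not described in this article.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

# Configurable subset of header fields to export (illustrative default).
FIELDS = ("from", "to", "cc", "subject", "date")


def email_to_xml(raw_eml: str, fields=FIELDS) -> ET.Element:
    """Build an <email> element with selected headers and attachment names."""
    msg = message_from_string(raw_eml)
    node = ET.Element("email")
    for name in fields:
        value = msg.get(name)  # header lookup is case-insensitive
        if value is not None:
            ET.SubElement(node, name).text = value
    attachments = ET.SubElement(node, "attachments")
    for part in msg.walk():
        filename = part.get_filename()
        if filename:
            # Hypothetical: link to the converted attachment by file name.
            ET.SubElement(attachments, "file", link=filename)
    return node


SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Invoice\n"
    "\n"
    "Please see the invoice.\n"
)

print(ET.tostring(email_to_xml(SAMPLE), encoding="unicode"))
```

Headers that are missing from a message (here cc and date) are simply omitted, which mirrors the idea of a configurable field list.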

 

FileConverter screenshots: 1 general settings (e-mail & folder processing), 2 processing options, 3 service configuration, 4 SMTP server configuration, 5 folder processing configuration, 6 e-mail box processing configuration, 7 MS-Exchange configuration, 8 POP3 configuration, 9 TIFF conversion settings, 10 OCR settings, 11 HTML conversion settings, 12 log

  Download – FileConverter – documents & e-mails to PDF, PDF/A and TIFF >>>

New features of the ifresco Transformer for Alfresco – with AutoOCR version 1.10.3

With the new AutoOCR version 1.10.3, new features are available for the ifresco AutoOCR Transformer:

  • iOCR – new default OCR engine in addition to Abbyy
  • Intelligent processing of PDF documents
  • Alfresco integration – ready to test without installing an OCR server: you can use our AutoOCR test server, which is accessible from the internet.
  • New step-by-step installation and setup documentation.

iOCR – additional OCR engine available

Starting with AutoOCR version 1.10.3, the setup installs iOCR as the default OCR engine, which can be used standalone or in addition to the Abbyy OCR engine. iOCR has no page license limitations, can process PDF, TIFF or JPEG as input, and can generate searchable PDFs and TXT files.

Differences between iOCR and Abbyy

  • iOCR supports fewer languages than Abbyy
  • No mixed-language recognition: only one main language can be selected
  • Lower accuracy and recognition quality than Abbyy
  • No image pre-processing functions
  • No page orientation detection (autorotate)
  • Fewer configurable functions and features, and fewer input / output formats.

Nevertheless, iOCR is a good solution for low-cost, high-volume OCR, e.g. to extract text from PDFs and images in order to build up a full-text index (e.g. Alfresco Transformer > TXT) and to create searchable PDFs from good-quality scans.

It is best to test with your own documents to see which OCR engine fits your needs. Both engines, Abbyy and iOCR, can be installed and used in parallel; you only have to create different OCR profiles for the different settings and OCR engines. Both OCR engines can also be tested on our ready-to-use AutoOCR test server (autoocr.may.co.at).

Intelligent PDF processing:

A PDF document may contain only images from a scanner, or it may have been created e.g. by a printer driver or a direct PDF export. An image PDF does not contain any text and has to be OCR processed. A “normal” PDF already contains text and does not need OCR processing. The Alfresco Transformer cannot tell the two apart and so cannot decide whether a PDF has to be OCR processed. Because OCR processing costs time and resources, we implemented “intelligent PDF-OCR processing” starting with AutoOCR version 1.10.3. When this option is enabled, each PDF document sent to the AutoOCR server is checked, and if the file already contains text, the PDF is not OCR processed. In this case the PDF or the extracted TXT data is sent back directly. To enable this feature, the OCR profile on the AutoOCR server has to be configured for “intelligent OCR processing of PDF files”.
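
The decision behind this option can be sketched in a few lines. The example assumes that the text of each page has already been extracted by some PDF library and is passed in as a list of strings. How AutoOCR actually detects existing text is not described here, so the function names and the `min_chars` threshold are purely illustrative.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str], min_chars: int = 1) -> bool:
    """Return True when no page carries at least min_chars of real text."""
    return not any(len(text.strip()) >= min_chars for text in page_texts)


def route(page_texts: Sequence[str]) -> str:
    """Decide whether a PDF is sent to OCR or returned unchanged."""
    return "ocr" if needs_ocr(page_texts) else "pass-through"


print(route(["", "   "]))            # image-only PDF: ocr
print(route(["", "Invoice no. 42"])) # PDF with a text layer: pass-through
```

Raising `min_chars` would treat PDFs that contain only a few stray characters, for example a stamped page number, as image PDFs as well.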

Screenshots: PDF intelligent OCR processing with Abbyy; PDF intelligent OCR processing with iOCR

AutoOCR Test server – ready to use

By installing 2 AMPs you can integrate the AutoOCR server with Alfresco. The integration works like a standard Alfresco Transformer and can also be used via scripting or Java. AutoOCR and Alfresco communicate via HTTP(S) using REST. To make it easier to start testing AutoOCR and the Alfresco integration, you can use our pre-installed and configured AutoOCR test server (autoocr.may.co.at), which is reachable over the internet and has both OCR engines (Abbyy and iOCR) installed.

Step by Step – Installation and Setup documentation

This document describes in detail, with screenshots, each step of installing the Abbyy engine and AutoOCR, licensing, using our test server and integrating with Alfresco.

Download – Installation and Setup documentation – ifresco AutoOCR transformer for Alfresco >>>

Test and Demo version is available – please contact us >>>

ifresco AutoOCR Transformer – OCR processing integrated with Alfresco Share

The AutoOCR server is integrated via REST as a dynamically configurable Alfresco document transformer. AutoOCR creates searchable PDFs or other document formats such as TXT, DOC(X), XLS(X), PPT(X), XML, RTF and HTML from image or PDF files. The OCR functions can be used via Java, JavaScript or as a document transformer. Configuration is done in the Share UI, which also has a new document action “Transform” that gives access to all Alfresco transformers.

AutoOCR is an OCR server / service based on the OCR engine from Abbyy, which we consider the best available. The AutoOCR server has a REST web service interface, which was used to integrate it with Alfresco. AutoOCR can convert image or PDF files to searchable PDFs. In addition to PDF, other document formats such as TXT, DOC(X), XLS(X), PPT(X), XML, RTF and HTML can also be created.

The configuration is simple and uses OCR profiles to bundle all available settings. An AMP install module provides the direct integration of AutoOCR into Alfresco. The OCR functions are available in Alfresco as a dynamically configurable transformer. Appropriate bindings also allow the OCR services to be used from JavaScript and Java. From Alfresco 4.0 on, configuration and monitoring are done directly in the Share administration console.

In addition, we have extended the Alfresco Share document actions with the Alfresco Transformer integration. Transformer functions are available on any document via the Share interface and allow documents to be converted into different formats.

AutoOCR as Alfresco Transformer:

The OCR function can be bound to a folder as an action. If, for example, a scanned document is placed in this folder, processing starts automatically and the document is passed to the AutoOCR server. The result is a searchable PDF or other document format that can immediately be searched for and found via the Alfresco full-text index.

AutoOCR JavaScript binding for Alfresco:

The JavaScript API allows direct access to the AutoOCR service from Alfresco scripts. All features of the AutoOCR API can be addressed from repository JavaScript (web script controller scripts, scripted actions). This API is completely independent of the integration of the AutoOCR services as an Alfresco Transformer.

Alfresco Share – “Transform” document action

The additional “Transform” document action in the Share UI lets you use all your Alfresco transformers, not only the AutoOCR transformers. The “Transform” action is implemented generically and is not OCR-specific.

Highlights / features:

  • Direct AutoOCR integration as Alfresco transformer with REST web service interface.
  • Separate AutoOCR service / server which does not strain the Alfresco server
  • Based on ABBYY – the leading OCR engine
  • Easy configuration by selecting OCR profiles – all available ABBYY OCR engine settings are combined.
  • In addition to PDF other output formats can be generated (TXT, RTF, DOC, etc.)
  • Dynamic transformer configuration at runtime using the Alfresco Share Admin interface.
  • JavaScript client for the AutoOCR service, available in Alfresco repository scripts (WebScripts, actions, etc.)
  • Java client for the AutoOCR service, for use in Java code.
  • The Java client itself has no dependencies on Alfresco.
  • New Share document action “Transform” enhances Share not only with OCR but with all supported transformers.

Requirements:

  • Alfresco 4.x – dynamic configuration via the Share user interface
  • Alfresco 3.x – manual configuration w/o Share UI
  • AutoOCR from version 1.9.8, running on Microsoft Windows as a service
  • ABBYY FineReader Engine 10 (starting with 10,000 pages per month)

Screenshots: AutoOCR admin status, admin transformer configuration, admin jobs, action menu, Share action dialog, Share action transform (waiting), Share action results, Share action transformed documents

Step by Step – Setup & Installation documentation for ifresco AutoOCR Transformer >>>

Test and Demo version is available – please contact us for details >>>

You can find price information here >>>

Webshop