ifresco Profiler – scan, edit, OCR, barcode, capture metadata, Alfresco integration

The ifresco Profiler provides easy-to-use, page-oriented document processing functions for PDF and image documents on any workstation. It lets you quickly store documents in Alfresco as searchable PDFs, together with metadata captured through individual, purpose-built profiling masks. Essential features include area OCR with an integrated OCR engine, creation of searchable PDFs on export using the integrated OCR or an external AutoOCR server, barcode recognition for file names and document splitting, and export to a folder, as an e-mail attachment, or via installed plugins to Alfresco together with metadata.

The application consists of two components – the Profiler base software, which contains all general functions, and one or more installable plugins. The plugins are the interface to Alfresco and make it possible to use profiling masks tailored to individual requirements and fields of application. The complete logic for metadata, filing structure and naming is implemented in the plugin.

ifresco Profiler base:

  • Processes PDF and image files – black & white, grayscale, color – without having to pay attention to file format or color mode – all functions work across all of them.
  • Integrated scan function for locally connected scanners. Scan settings can be selected directly via preconfigured scan profiles.
  • Capturing of documents from folders – displayed as a document list, e.g. for multifunction devices, network scanners with a scan-to-folder function, documents created via a printer driver, or documents received by e-mail.
  • Quick renaming of documents – the next file in the list is selected automatically once the change is finished.
  • Area OCR using the locally integrated OCR engine, to assign file names.
  • Deleting / cutting out areas
  • Page preview – zoom, paging, rotating – as well as thumbnails of the whole document
  • Page-oriented document processing – rotate pages left/right, delete pages, move pages in the thumbnail view via drag & drop.
  • Split a document – at marked pages, after x pages, or by barcode.
  • Merge single documents into one – specifying the order, with automatic deletion of the single documents.
  • Export – to a folder, as an e-mail attachment, or to Alfresco with metadata via profiling – in the native format, as image PDF or as PDF-OCR
  • During export – generation of searchable PDF-OCR documents by the locally integrated iOCR engine or by the AutoOCR Server with Abbyy OCR, integrated via web service
  • Intelligent OCR processing – only image pages are OCR processed – normal PDF pages are taken over unchanged.
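
The split options in the list above (after x pages, at marked or barcode pages) can be illustrated with a small sketch. This is not the Profiler's code: the function names are hypothetical, pages are represented only by their index, and it is an assumption that a marked or barcode page becomes the first page of the next document.

```python
from typing import List, Set


def split_every(page_count: int, x: int) -> List[List[int]]:
    """Split pages 0..page_count-1 into consecutive chunks of at most x pages."""
    if x < 1:
        raise ValueError("x must be at least 1")
    return [list(range(start, min(start + x, page_count)))
            for start in range(0, page_count, x)]


def split_at(page_count: int, start_pages: Set[int]) -> List[List[int]]:
    """Start a new document at every page index in start_pages.

    Assumption: a marked or barcode page is the first page of the next document.
    """
    documents: List[List[int]] = []
    for page in range(page_count):
        # Open a new document at each start page, and for the very first page.
        if page in start_pages or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents
```

For a five-page scan, `split_every(5, 2)` and `split_at(5, {2, 4})` both yield three documents containing pages 0-1, 2-3 and 4.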

ifresco Profiler plugins:

The profile form and the profiling logic for filing documents in Alfresco are implemented in the ifresco Profiler via plugins. Because every company has its own data model and filing logic, plugins are developed and implemented individually to specification. Existing plugins can serve as a starting point. A demo plugin and plugins from earlier projects are available for testing and to illustrate the possibilities.

  • Installable plugins – for profiling and capturing metadata when filing documents in Alfresco.
  • One or more plugins can be installed and selected, which also makes it possible to switch between Alfresco servers – each plugin contains its own profiling logic as a separately installed .NET / C# application that plugs into the ifresco Profiler base framework and uses its functions.
  • Profile mask and document preview are displayed side by side while metadata is captured.
  • Freely programmable logic and functions of the profiling mask, e.g.: external XML template rules with dynamic fields so that names / titles are always built the same way; access to external data sources – MS Excel (XLS), SQL, web service (e.g. SugarCRM); linked tables and pre-filling of fields with table values; type-ahead substring search over single or combined fields; use of Alfresco categories as lookups; assignment of existing Alfresco tags; automatic creation of new tags; automatic creation of the Alfresco folder structure and of file names from profile field values; searching for folders in Alfresco; counters via web service; stamping the document with information from the metadata before upload; searching for documents already in Alfresco and taking over their profile values; and so on.
  • Interactive processing – with OCR and upload – or alternatively
  • Background / batch processing – for PDF-OCR conversion and Alfresco upload – the user can continue working while OCR processing and the Alfresco upload take place in the background.
  • Keep existing profile values / clear the mask
  • Automatic loading of the next document in the list – the processed document is deleted or moved to the archive area after upload.
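
The idea of building folder structure and file names from profile field values can be sketched as follows. Real plugins are .NET / C# applications with their own rule format; the Python below only illustrates the principle. The template syntax, the field names and the function `build_target` are all assumptions made for this example.

```python
from string import Template


def build_target(folder_template: str, name_template: str, fields: dict) -> str:
    """Fill folder and file name templates from profile field values."""
    # substitute() raises KeyError when a field is missing, so an incomplete
    # profile is detected before anything would be uploaded.
    folder = Template(folder_template).substitute(fields)
    name = Template(name_template).substitute(fields)
    return f"{folder}/{name}"


# Hypothetical profile values captured in the profiling mask.
profile = {"customer": "ACME", "year": "2017", "doctype": "Invoice", "number": "4711"}
print(build_target("Customers/$customer/$year", "${doctype}_${number}.pdf", profile))
# prints: Customers/ACME/2017/Invoice_4711.pdf
```

Keeping the naming rule in a template rather than in code is what ensures every document of a given type is named the same way.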

Download ifresco Profiler >>>

 

Intelligent PDF OCR processing via AutoOCR for Abbyy and iOCR

PDF documents can be generated in different ways, and a single PDF can combine content from various sources. Pages can be built from “normal” PDF content consisting of text, images, and vector graphics; such pages typically already have textual content that can be used for full-text indexing and search. However, a PDF document can also contain scanned pages in black and white or color. Such pages or documents must undergo OCR so that the textual information needed for indexing and searching is added.

So there are PDF documents that should not undergo any OCR processing, and others in which individual pages or all pages have to be processed because they were produced by a scan.

Normally, all of these types of PDF documents occur in business processes, and the user cannot tell whether a document needs OCR – viewed in Adobe Reader or on a printout, the difference is not immediately recognizable.

If you processed every PDF document / page in the same way, regardless of its structure and of whether OCR processing makes sense, there would be several disadvantages:

Each PDF page is “rasterized” again regardless of structure and content, i.e. converted into an image and then OCR processed. This is like printing the document, scanning it again and then subjecting it to OCR. The result is an image of a “normal” PDF page with underlying text recognized by the OCR engine.

  • the quality is not the same as before
  • the documents become bigger
  • special PDF properties are lost (bookmarks, links, etc.)
  • Processing time and resources are consumed
  • OCR page licenses are consumed unnecessarily

PDF OCR processing should therefore be “intelligent”, so that neither the process nor the user has to make the difficult decision of whether a PDF document must undergo OCR. The decision is even harder when a single PDF document consists of mixed normal and scanned parts.

That’s why we’ve integrated intelligent OCR processing into AutoOCR, which works in the same way with both the Abbyy and the iOCR engine. It can be controlled per input folder, or for the web service interface via the OCR profile, and is available for both PDF > PDF and PDF > TXT processing.


Highlights – Intelligent PDF OCR processing:

  • works for both PDF > PDF and PDF > TXT processing
  • for the Abbyy OCR and iOCR engine
  • for folder processing as well as web service processing
  • the PDF document as well as every single page is analyzed, and only those pages that do not contain any text are OCR processed – these are usually scanned pages that have not yet been processed by OCR.
  • existing normal PDF documents and pages are taken over unchanged and not processed
  • OCRed documents and pages are not processed again.
  • in the case of PDF > TXT processing, the text is extracted from the normal PDF pages and OCR is only performed on pages without text.
  • PDF functions and bookmarks are retained and are included in the target document.
  • Saves processing time and Abbyy OCR page licenses
  • the files are not enlarged
  • the quality of the PDF pages is preserved.
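
The per-page decision described in the highlights can be sketched as follows. This is a simplified model, not AutoOCR's implementation: the `Page` type, the function names and the rule "a page with an empty text layer needs OCR" are assumptions for illustration, and the OCR engine is passed in as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    text: str            # text layer already present in the PDF page ("" if none)
    image: bytes = b""   # raster content of a scanned page


def needs_ocr(page: Page) -> bool:
    """A page is an OCR candidate only if it carries no text at all."""
    return not page.text.strip()


def pdf_to_txt(pages: List[Page], ocr: Callable[[bytes], str]) -> List[str]:
    """PDF > TXT: reuse existing text, run OCR only on text-free pages."""
    return [ocr(p.image) if needs_ocr(p) else p.text for p in pages]
```

In a mixed document, only the scanned pages reach the `ocr` callable, so normal pages cost neither processing time nor page licenses.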

The “intelligent PDF OCR processing” is available not only in AutoOCR but also in all of our other software products that support OCR processing, e.g. ifresco Profiler, FileConverter, DropOCR, PDFMerge, etc.

New Features ifresco Transformer for Alfresco – with AutoOCR version 1.10.3

With the new AutoOCR version 1.10.3, new features are available for the ifresco AutoOCR Transformer:

  • iOCR – new default OCR engine in addition to Abbyy
  • intelligent processing of PDF documents
  • Alfresco integration – ready to test without installing an OCR server – you can use our AutoOCR test server, which is accessible from the internet.
  • New Step by Step installation and setup documentation.

iOCR – additional OCR engine available

Starting with AutoOCR version 1.10.3, the setup installs iOCR as the default OCR engine, which can be used standalone or in addition to the Abbyy OCR engine. iOCR has no page license limitations, can process PDF, TIFF or JPEG as input, and can generate searchable PDFs and TXT files.

Differences between iOCR and Abbyy

  • iOCR supports fewer languages than Abbyy
  • no mixed language recognition – only one main language can be selected
  • not the same level of accuracy and recognition quality as Abbyy
  • no image pre-processing functions
  • no page orientation detection (autorotate)
  • fewer configurable functions and features, and fewer input / output formats.

But iOCR is a good solution for low-cost, high-volume OCR, e.g. to extract text from PDFs and images in order to build up a full-text index (e.g. Alfresco Transformer > TXT) and to create searchable PDFs from scans with good quality.

The best approach is to test with your own documents to see which OCR engine best fits your needs. Both engines, Abbyy and iOCR, can be installed and used in parallel – you only have to create different OCR profiles for the different settings and OCR engines. Both OCR engines can also be tested using our ready-to-use AutoOCR test server (autoocr.may.co.at).

Intelligent PDF processing:

A PDF document can contain only images from a scanner, or it can be created e.g. by a printer driver or by a direct PDF export. An image PDF does not contain any text and has to be OCR processed. “Normal” PDFs already contain text and do not need OCR. The Alfresco Transformer cannot recognize the difference and decide whether a PDF has to be OCR processed. Because OCR processing costs time and resources, we implemented “intelligent PDF-OCR processing” starting with AutoOCR version 1.10.3. When this option is enabled, each PDF document sent to the AutoOCR server is checked, and if the file already contains text, it is not OCR processed. In this case the PDF or the extracted TXT data is sent back directly. To enable this feature, the OCR profile on the AutoOCR server has to be configured for “intelligent OCR processing of PDF files”.


AutoOCR Test server – ready to use

By installing two AMPs you can integrate the AutoOCR server with Alfresco. The integration works like a standard Alfresco transformer and can also be used via scripting or Java. Communication between AutoOCR and Alfresco takes place via HTTP(S) using REST. To make it easier to start testing AutoOCR and the Alfresco integration, you can use our installed and configured AutoOCR test server (autoocr.may.co.at), which is reachable over the internet and has both OCR engines (Abbyy and iOCR) installed.

Step by Step – Installation and Setup documentation

This document describes in detail, with screenshots, each step for installing the Abbyy engine and AutoOCR, licensing, using our test server, and integrating with Alfresco.

Download – Installation and Setup documentation – ifresco AutoOCR transformer for Alfresco >>>

Test and Demo version is available – please contact us >>>

ifresco AutoOCR Transformer – OCR processing integrated with Alfresco Share

The AutoOCR Server is integrated via REST as a dynamically configurable Alfresco document transformer. AutoOCR creates searchable PDFs or other document formats such as TXT, DOC(X), XLS(X), PPT(X), XML, RTF and HTML from image or PDF files. The OCR functions can be used via Java, JavaScript or as a document transformer. Configuration is done from the Share UI, which also has a new document action “Transform” that gives access to all Alfresco transformers.

AutoOCR is an OCR server / service based on the leading OCR engine from Abbyy. The AutoOCR server has a REST web service interface, which was used to integrate it with Alfresco. AutoOCR can convert image or PDF files to searchable PDFs. In addition to PDF, other document formats such as TXT, DOC(X), XLS(X), PPT(X), XML, RTF and HTML can also be created.

The configuration is simple and uses OCR profiles that bundle all available settings. An AMP install module provides the direct integration of AutoOCR with Alfresco. OCR functions are available in Alfresco as a dynamically configurable transformer. Appropriate bindings also allow the OCR services to be used from JavaScript and Java. As of Alfresco 4.0, configuration and monitoring are done directly in the Share administration console.

In addition, we have extended the Alfresco Share document actions with the Alfresco transformer integration. Transformer functions are available on any document via the Share interface and allow documents to be converted into different formats.

AutoOCR as Alfresco Transformer:

The OCR function can be bound to a folder as an action. If, for example, a scanned document is placed in this folder, processing starts automatically and the document is passed to the AutoOCR server. The result is a searchable PDF or other document format that can immediately be searched for and found via the Alfresco full-text index.

AutoOCR JavaScript binding for Alfresco:

The JavaScript API allows direct access to the AutoOCR service from Alfresco scripts. From repository JavaScript (web script controller scripts, scripted actions), all features of the AutoOCR API can be addressed. This API is completely independent of the integration of AutoOCR services as an Alfresco transformer.

Alfresco Share – “Transform” document action

By adding the “Transform” document action to the Share UI, you can use all of your Alfresco transformers, not only the AutoOCR transformers. The “Transform” action is implemented generically and is not OCR specific.

Highlights / features:

  • Direct AutoOCR integration as Alfresco transformer with REST web service interface.
  • Separate AutoOCR service / server which does not strain the Alfresco server
  • Based on ABBYY – the leading OCR engine
  • Easy configuration by selecting OCR profiles – all available ABBYY OCR engine settings are combined.
  • In addition to PDF other output formats can be generated (TXT, RTF, DOC, etc.)
  • Dynamic transformer configuration at runtime using the Alfresco Share Admin interface.
  • JavaScript client for the AutoOCR service, available in Alfresco repository scripts (WebScripts, actions, etc.)
  • Java client for the AutoOCR service, for use in Java code.
  • The Java client itself has no dependencies on Alfresco.
  • New Share document action “Transform” enhances Share not only with OCR but with all supported transformers.

Requirements:

  • Alfresco 4.x – dynamic configuration via the Share user interface
  • Alfresco 3.x – manual configuration without the Share UI
  • AutoOCR from version 1.9.8 on Microsoft Windows as a service
  • ABBYY FineReader Engine 10 (starting with 10,000 pages per month)


Step by Step – Setup & Installation documentation for ifresco AutoOCR Transformer >>>

Test and Demo version is available – please contact us for details >>>

Price information can be found here >>>

Windows Service – access to network resources – what to consider?

Our document conversion tools – FileConverter, AutoOCR and FileConverterPro – are used to monitor one or more input folders and to automatically start processing for new documents. This can be done either via local drives or via network resources.

Particularly when installing the applications as a service and using network resources, some things have to be considered to get the configuration right:

  • A service that needs access to network resources must run under a user account, not under the system account.
  • The user account under which the service runs must have the appropriate rights (read / write / delete) on the network resource.
  • Do not use a mapped drive to access the network resources (in / out / error / archive / log folder); use the direct network share (UNC path) instead.
  • The processing option for folder monitoring must be changed from “File System Events” to “Read File Blocks”.

The mapping of a network connection to a drive letter is managed via the “Network Connection Service”. There are some things you should know about mapped drives:

  • Option – Reconnect at logon – used to automatically restore the drive mapping at the next login.
  • Drive mappings are per user – if the user is not logged in, the mapped drives are not available.
  • Mapped drives are not available to a service – regardless of whether the service runs under the same account as the currently logged-in user – because a service only runs with the user's credentials but is not logged in.

In general – even if the applications are not installed and operated as a service – it is recommended to use network shares (UNC paths) instead of mapped drives. A network share – direct access to the network resource – is always available to a service (running under a user account) as well as to normal applications, and is defined by the remote server. However, this does not apply to the local system account: it has no access to network resources and therefore cannot be used for a service that needs them.
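
A quick way to review a folder configuration against this recommendation is to check whether each path is a UNC path. The sketch below uses Python's standard `ntpath` module, which parses Windows paths on any operating system; the helper name `is_unc_path` is an assumption for this example. Note that a drive-letter path may be either a local drive or a mapped drive; the string alone cannot tell them apart, so such paths still need a manual check.

```python
import ntpath


def is_unc_path(path: str) -> bool:
    """Return True for UNC paths (network shares), False for anything else."""
    # Normalize forward slashes so //server/share is handled like \\server\share.
    normalized = path.replace("/", "\\")
    drive, _ = ntpath.splitdrive(normalized)
    # For a UNC path, splitdrive returns "\\server\share" as the drive part.
    return drive.startswith("\\\\")
```

For example, `is_unc_path(r"\\fileserver\scans\in")` is true, while `is_unc_path(r"Z:\scans\in")` is false.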

Office2PDFA – Scripting Support

Office2PDFA now also supports scripting. It supports two CLR languages: VB.NET and C#. CLR = Common Language Runtime, i.e. C# and VB.NET run on the Common Language Runtime – there are other CLR programming languages such as J# and IronPython. VBScript is also supported, but only in the 32-bit version.

It is possible to execute a script before and after the conversion. Script execution can be enabled or disabled. The before and after scripts can be written in different programming languages.

A script consists of a list of functions and declarations. The runtime will generate a class from the functions and declarations and will execute the “Run” method. The Run method has one parameter of type IScriptContext.

IScriptContext properties:

  • SkipConversion (boolean): if the value is set to true, the application will not convert the document to PDF. The script is then responsible for converting the document and writing the result to a temporary folder. The path of the result in the temporary folder should be stored in the DestinationFile property. The destination file type (extension) should also be specified by the script, using the DestinationFileExt property.
  • DestinationFile (string): the fully qualified path of a file
  • DestinationFileExt (string): the extension of the destination file
  • FolderPath: the path of the folder of the input file
  • RootFolderPath: the root folder path; it can differ from FolderPath if subfolder monitoring is enabled
  • Error (boolean): the script should set this property if an error has occurred.
  • ErrorDescription (string): the script should set this property and provide a description of the error.
  • FilePath: the fully qualified path of the source file
  • Folder (IFolder): a reference to an Office2PDF monitored folder.

Folder properties – use this object to get additional information about the monitored folder:

  • Name (the name of the monitored folder)
  • InputFolder (path to the input folder)
  • OutputFolder (path of the output folder)
  • ErrorFolder (error folder)
  • ArchiveFolder (archive folder)

Sample VB.NET Script to convert a MS-Word DOCX to a DOC document – The following pre-script can be used to convert DOCX documents to DOC:

Download – Sample script to convert DOCX to DOC >>>

It is also possible to specify additional assembly references which are used by the scripts. For this script we used the following references:

  • System.Windows.Forms.dll
  • System.Data.dll
  • System.Drawing.dll
  • System.Xml.dll
  • Microsoft.VisualBasic.dll

The Office2PDFA workflow remains unchanged: the destination file is handled like a normal PDF conversion result. All options remain available, except that PDF metadata and other PDF properties cannot be applied.


Additional info about IScriptContext which is used by the Run method

It has 2 additional methods:

  • GetParam(name, defValue)
  • SetParam(name, value)

You can store your script state data in the IScriptContext parameter if you want to pass data from the pre-script to the post-script, because the same IScriptContext object is used for both scripts to ensure the correct workflow.
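
The way state travels from the pre-script to the post-script can be modelled in a few lines. The sketch is written in Python purely as an illustration; real Office2PDFA scripts are written in VB.NET, C# or VBScript. Only the method names GetParam and SetParam come from the description above. The `ScriptContext` class, the parameter key and the assumption that GetParam returns `defValue` for an unknown name are hypothetical.

```python
class ScriptContext:
    """Stand-in for IScriptContext; only GetParam / SetParam are modelled."""

    def __init__(self):
        self._params = {}

    def GetParam(self, name, defValue=None):
        # Assumption: an unknown name yields the supplied default value.
        return self._params.get(name, defValue)

    def SetParam(self, name, value):
        self._params[name] = value


def pre_script(context):
    # Hypothetical key and value, stored before the conversion runs.
    context.SetParam("original_name", "report.docx")


def post_script(context):
    # Reads back what the pre-script stored on the same context object.
    return context.GetParam("original_name", "unknown")


context = ScriptContext()  # one context object is shared by both scripts
pre_script(context)
print(post_script(context))
```

If the pre-script is disabled, the post-script in this model simply falls back to the default value.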

The post action script’s parameter contains all information about:

  • the source file/folder, subfolder
  • the destination file/folder, subfolder
  • the error (if any)

It is possible to use only the “post” action script, only the “pre” script, or both. Scripts can be written in C#, VB.NET or VBScript. All features of the .NET framework and all features of these programming languages are available. The script security context (Evidence) is the security context of the Office2PDFA application, which means that the script has the same security context as the application.

 

Office2PDFA also supports VBScript:

The method signature should look like:

sub Run(byref context)
end sub

All parameters are also available in VBScript.

PDFSecurity – eDocPrintPro Plugin – Attach PDF security settings

Often one would like to protect a generated PDF document with a password for opening it, or to restrict certain modification options and functions of the PDF document. The PDFSecurity plugin for the eDocPrintPro PDF printer driver is used for this purpose.


Functions – PDFSecurity Plugin:

  • Password to open
  • Password for setting and changing PDF permissions
  • Encryption 40/128 bit
  • Printing – block, or allow at low / high resolution
  • Document changes – block; allow inserting / deleting / rotating pages; allow filling in form fields and signing; allow commenting; allow anything except extracting pages; allow copying and extraction of content
  • Show settings before printing so they can be changed / apply preferences silently.
  • 32 / 64-bit versions – MSI setup
  • 30 days fully functional demo version – online unlockable

Download – eDocPrintPro PDFSecurity Plugin – 32bit >>>
Download – eDocPrintPro PDFSecurity Plugin – 64bit >>>
