Category: AutoOCR

ifresco Tools – RepoWorker scripts – convert Alfresco documents to searchable PDF or PDF/A automatically

The module ifresco Tools offers the following functions for the Alfresco ECM / DMS:

  • ifresco-RepoWorker – enables time-controlled execution of repository JavaScript on a definable set of documents.
  • ifresco-ScriptAction – enables the definition of Share actions that execute repository JavaScript on documents.

RepoWorker – scripts integrate AutoOCR and FileConverterPro:

With the RepoWorker we created a script-based extension for the ifresco Transformer. It converts all existing and/or newly added documents of specific content types or MIME types on an Alfresco server to searchable PDF or PDF/A documents. Users do not have to take any action: the conversion takes place automatically on the server, independent of how the documents are added to the ECM / DMS.

Functions:

  • time-controlled execution of JavaScript on a definable set of documents
  • existing documents of a specific content type and MIME type are converted to searchable PDF or PDF/A and replace the source documents
  • processed documents are marked with the “Transform” aspect to prevent repeated processing
  • one-time or repeated execution of scripts at definable intervals, e.g. every 5 minutes
  • scripts can be adjusted quickly and easily to new conditions and requirements
  • easy installation and configuration
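
The selection and marking behaviour listed above can be sketched in plain JavaScript. This is only an illustration under assumptions: the function name `runBatch`, the shape of the node objects, and the `convert` callback are hypothetical stand-ins, not the ifresco or Alfresco API. A real RepoWorker script would operate on repository nodes and call the OCR server instead of the stub converter.

```javascript
// Name of the marker aspect, taken from the list above.
const MARKER_ASPECT = "Transform";

// Sketch of one timed run (all names hypothetical): pick unprocessed
// documents of the wanted MIME types, convert them, replace the source
// content, and mark them so that the next run skips them.
function runBatch(nodes, mimeTypes, convert, batchSize) {
  const processed = [];
  for (const node of nodes) {
    if (processed.length >= batchSize) break;            // limit work per run
    if (!mimeTypes.includes(node.mimetype)) continue;    // not a wanted type
    if (node.aspects.includes(MARKER_ASPECT)) continue;  // already processed
    const result = convert(node);                        // e.g. TIFF to searchable PDF
    node.content = result.content;                       // replace the source document
    node.mimetype = result.mimetype;
    node.aspects.push(MARKER_ASPECT);                    // prevent repeated processing
    processed.push(node.name);
  }
  return processed;
}

// Stand-in converter; a real script would call the conversion server here.
const fakeConvert = (node) => ({
  content: "pdf:" + node.content,
  mimetype: "application/pdf",
});

const docs = [
  { name: "scan1.tif", mimetype: "image/tiff", content: "a", aspects: [] },
  { name: "note.txt", mimetype: "text/plain", content: "b", aspects: [] },
  { name: "scan2.tif", mimetype: "image/tiff", content: "c", aspects: [MARKER_ASPECT] },
];
console.log(runBatch(docs, ["image/tiff"], fakeConvert, 10));
```

Because converted documents carry the marker, running the same batch again at the next interval finds nothing left to do, which is what makes a repeated schedule such as "every 5 minutes" safe.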

Description – RepoWorker scripts for AutoOCR / FileConverterPro >>>

GitHub – RepoWorker scripts for AutoOCR / FileConverterPro >>>

Requirements:

  • Alfresco 4.x
  • AutoOCR or FileConverterPro
  • ifresco Transformer (AMP)
  • ifresco Tools (AMP)

A demo installation can also be found on our ifresco / Alfresco test server (admin / admin).

Screenshots: (1) a TIFF file is copied into an Alfresco folder; (2) the TIFF file is found, converted to a searchable PDF, and replaces the source file.

FileConverterPro & AutoOCR – test website available

To let you test the functions of FileConverterPro and AutoOCR and run your own conversions without installing the software, we have set up a server with FileConverterPro and AutoOCR that is accessible via the internet free of charge.

On MS Windows, the applications DropConvert (for FileConverterPro) and/or DropOCR (for AutoOCR) can be installed to process documents and run tests with these applications.

These services can also be used without installing client software, from any platform, with only a browser. For this purpose we have set up our own test websites where documents can be uploaded and converted to PDF or PDF/A and/or run through a PDF OCR conversion.

FileConverterPro – test website:

URL: http://autoocr.may.co.at:3000/fcpro

Supported input-document formats:

  • DOC, DOCX, DOCM, RTF, TXT, ODT
  • XLS, XLSX, XLSM
  • PPT, PPTX, PPS, PPSX
  • FDF, XFDF (Adobe forms)
  • XML
  • PNG, BMP, TIF, TIFF, JPG, JPEG, GIF
  • ZIP, RAR, 7Z
  • MSG, EML
  • PDF
  • HTM, HTML, MHTML
  • PMTX (PDFMerge)
  • DWG, DXF, DWF
  • Abbyy: PDF, TIF, TIFF, PNG, JPG, JPEG, BMP, GIF, PCX, DCX, JP2, JPC, DJV, DJVU, WDP
  • iOCR: PDF, TIFF, JPEG, PNG

Processing profiles:

In all profiles, placeholder pages are inserted when conversion errors occur and for file formats that cannot be converted.

  • Default – direct conversion without MS-Office 2010, no OCR processing
  • Direct + iOCR German – direct conversion without MS-Office 2010, iOCR German
  • Direct – no OCR – PDFA – direct conversion without MS-Office 2010, PDF/A, no OCR processing
  • Direct – no OCR – with draft stamp and overlay – direct conversion without MS-Office 2010, stamps at the top left with filename / date / time, watermark (stamp) “Draft”, sample stationery as underlay, no OCR processing
  • MS-Office + Abbyy + PDFA – conversion of the Office documents via MS-Office 2010, PDF/A-1b output, Abbyy OCR – German & English
  • MS-Office + Abbyy – conversion of the Office documents via MS-Office 2010, Abbyy OCR – German & English
  • MS-Office – no OCR – PDFA – conversion of the Office documents via MS-Office 2010, PDF/A-1b output, no OCR processing

 

AutoOCR – test website:

URL: http://autoocr.may.co.at:3000/autoocr

Supported input-document formats:

  • Abbyy: PDF, TIF, TIFF, PNG, JPG, JPEG, BMP, GIF, PCX, DCX, JP2, JPC, DJV, DJVU, WDP
  • iOCR:  PDF, TIFF, JPEG, PNG
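
As a small illustration of the two format lists above, the following sketch looks up which OCR engine accepts a given file name. It takes the lists literally (so `.tif` and `.jpg` are only matched for Abbyy, because the iOCR list names only TIFF and JPEG). Whether the server applies the same rule is an assumption, and the function name `enginesFor` is hypothetical.

```javascript
// Extension lists copied from the supported input formats above.
const ENGINE_FORMATS = {
  Abbyy: ["pdf", "tif", "tiff", "png", "jpg", "jpeg", "bmp", "gif",
          "pcx", "dcx", "jp2", "jpc", "djv", "djvu", "wdp"],
  iOCR: ["pdf", "tiff", "jpeg", "png"],
};

// Returns the names of the engines whose list contains the file's extension.
function enginesFor(fileName) {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return [];                               // no extension at all
  const ext = fileName.slice(dot + 1).toLowerCase();    // compare case-insensitively
  return Object.keys(ENGINE_FORMATS).filter((engine) =>
    ENGINE_FORMATS[engine].includes(ext)
  );
}

console.log(enginesFor("scan.TIFF"));
console.log(enginesFor("photo.bmp"));
console.log(enginesFor("letter.docx"));
```

A check like this can be useful on the client side to offer only those processing profiles that match the uploaded file.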

Processing profiles:

  • Abbyy PDFA – German & English – PDF/A output, languages – English & German
  • AbbyyFR10 – English & German – no PDF/A, languages – English & German
  • iOCR – English – PDFA – PDF/A output, language – English
  • iOCR – English – no PDF/A, language – English
  • iOCR – German – no PDF/A, language – German

On the test sites you can switch directly between the FileConverterPro and the AutoOCR test site.

 

Node.js as base for the test websites:

To implement the test websites for FileConverterPro and AutoOCR we used current web development tools. The programming was done entirely in JavaScript, on the client as well as on the server side.

The following components are used:

  1. Node.js – JavaScript for the server – http://nodejs.org/
  2. Node.js FileConverterPro / AutoOCR library – https://github.com/XKEYGmbH/node-fcpro
  3. Bootstrap – http://getbootstrap.com/
  4. AngularJS – https://angularjs.org/

Screenshots: (1) FileConverterPro test site: upload documents and convert them to PDF or PDF/A; (2) AutoOCR test site: upload scans, images, and PDFs and convert them to searchable PDF or PDF/A; (3) the added files are shown in the list, and the processing profile can be selected per file; (4) when the conversion is started, the files are uploaded to the test server and converted immediately; (5) after the conversion, the generated PDFs can be retrieved via the download link.

DropOCR – version 1.2.5 available

New in DropOCR version 1.2.5:

  • Direct selection of the AutoOCR processing profile through the context menu of the tray icon application
  • Function “Cancel all jobs” – cancels currently running transfers and processes immediately
  • The “AutoStart” option is now activated by default
  • The maximum page count is now preset to 1000
  • The connection data of the AutoOCR test server are preconfigured by the installation

Screenshots: DropOCR context menu of the tray icon application; DropOCR configuration settings 1.2.5.

Download – DropOCR >>>

DropOCR – version 1.2.1 available

New in DropOCR version 1.2.1:

  • User interface switchable between German and English
  • HTTP and HTTPS support
  • Logging of the conversion processes, deletion of the log file
  • AutoStart function to start the application when the PC is started
  • Double-click on the drop zone opens the destination folder
  • AutoOCR test server preconfigured

Our AutoOCR test server is reachable via the following URL and may be used for testing purposes:

  • https://autoocr.may.co.at:8001/AutoOCRService2/
  • User: admin
  • Password: autoocr

Screenshots: DropOCR configuration; DropOCR context menu and drop zone.

Download – DropOCR >>>

AutoOCR version 1.10.11 – run subsequent processing through DLL

Version 1.10.9 introduced a function to run a subsequent action after the OCR and the creation of the destination file. This is possible for monitored folders as well as for processing via web services, using C# or VB.NET scripts.

AutoOCR version 1.10.11 extends this further: external DLLs can now also be used to run subsequent functions.

A checkbox switches between source code (script) and DLL processing, and the DLL is chosen from a selection list.

For this there is a new action interface, IAction2, which inherits from IAction. For a DLL to be selectable, it has to be copied into the AutoOCR installation folder. All DLLs whose names follow the pattern %NAME%.AutoOCRPlugin.dll are referenced. Please keep in mind that when AutoOCR is installed as a Windows service, message boxes or other user interactions are not possible and therefore cannot be used.
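
The naming pattern %NAME%.AutoOCRPlugin.dll can be checked before copying a DLL into the installation folder. The following is a minimal sketch of such a check, not AutoOCR's own discovery code: the function name `listPluginNames` is hypothetical, and the case-insensitive comparison is an assumption based on Windows file name conventions.

```javascript
// Suffix from the naming pattern %NAME%.AutoOCRPlugin.dll described above.
const PLUGIN_SUFFIX = ".AutoOCRPlugin.dll";

// Returns the %NAME% part of every file name that matches the pattern.
function listPluginNames(fileNames) {
  const suffix = PLUGIN_SUFFIX.toLowerCase();
  return fileNames
    .filter((name) =>
      name.length > suffix.length &&            // %NAME% must not be empty
      name.toLowerCase().endsWith(suffix)       // assumption: case-insensitive match
    )
    .map((name) => name.slice(0, name.length - suffix.length));
}

console.log(listPluginNames([
  "MyAction.AutoOCRPlugin.dll",
  "Helper.dll",
  "archive.autoocrplugin.dll",
]));
```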

For the additional tab to show up and be configurable, AutoOCR has to be started with the command-line parameter /ShowAction.

Screenshots: additional tab in the folder properties for actions via DLL or script; additional tab in the OCR profiles for the web service interface, for actions via DLL or script.

Download – sample project – DLL action – C# / .NET >>>
Download – AutoOCR – OCR Server incl. iOCR engine (ca. 150MB) >>>

For the Abbyy OCR engine version 10, demo licenses for 30 days or 500 pages are available and can be requested from us.

Download – Abbyy FineReader 10.x Rel 4 OCR Engine Setup (ca. 460MB) >>>
Request a demo license key for the FineReader OCR engine

ifresco AutoOCR – Version 1.18 available

Version 1.18 of ifresco AutoOCR, the OCR server integration for Alfresco, brings new functions and extensions:

  • Implementation of the new paging API for the job list of the AutoOCR server – paging back and forth, deleting all jobs, deleting jobs older than x days, sorting jobs, selecting jobs by date.
  • Freely configurable runtime transformers. File-based as well as pipe-I/O-based command-line tools can be used to configure additional transformers.
  • Like the command-line-based runtime transformers, transformers can also be provided through JavaScript.
  • An AutoOCR content model extension for the OCR status (aspect) is installed so that the OCR status of a file can be stored and queried as metadata.
  • The optional ifresco Tools AMP allows background OCR processing at defined intervals, both for the initial processing of existing document collections and for the ongoing processing of newly added documents. Both the detection of the documents to be processed and the processing itself are done by JavaScripts that run on the server in the background, batch-oriented and timed. Additional Alfresco Share document actions can also be configured and executed through JavaScript, e.g. to convert the selected PDF and image documents to searchable PDF(/A) files through the AutoOCR server and automatically replace the input files with them. With the ifresco Tools, AutoOCR functions are available through JavaScript independently of the configured Alfresco transformer, for mass batch processing as well as interactive single-document processing.

AMPs of version 1.18 are available for the following Alfresco versions: 4.0.1 EE, 4.0.2 EE, 4.0d CE, 4.1.1 EE, 4.1.2 EE, 4.1.3 EE, 4.1.4 EE, 4.2b CE, 4.2c CE
AMPs of the ifresco Tools 1.1 are available for: 4.2c CE, 4.2d CE

Screenshots: ifresco AutoOCR new job functions; ifresco AutoOCR runtime transformer; ifresco AutoOCR transformer configuration; content model for ifresco AutoOCR.

Download – ifresco AutoOCR – Runtime Transformer description >>>
Download – ifresco AutoOCR – Transformer through JavaScript description >>>
Download – ifresco AutoOCR – Example JavaScript Transformer >>>

ifresco AutoOCR – JavaScript Binding for Alfresco

With the installation of the AMPs, Alfresco and AutoOCR are integrated through a REST web service interface. Server-side JavaScript offers an easy, flexible, and quickly implemented way to extend and adjust Alfresco functions.

JavaScripts can be started on a schedule as batch processes, e.g. to process a larger number of documents in the background. They can also be called from a client such as Alfresco Share, to be used as document actions for single or multiple documents.

The JavaScript binding of the AutoOCR functions allows direct access to the AutoOCR service from Alfresco scripts. In repository JavaScripts (WebScript controller scripts, scripted actions) all functions of the AutoOCR API can be called. This API is completely independent of the integration of the AutoOCR service as an Alfresco transformer. It makes it possible to use OCR functions from JavaScripts that are stored in Alfresco and executed directly on the server.

Download – Documentation JavaScript Binding for Alfresco >>>
Download – extensive demo script >>>

New Web-Site – www.OCRServer.at – online

We have summarized all of our OCR products on our newly created website.

You can find more information about the following products there:

  • AutoOCR
  • AutoOCR light
  • DropOCR
  • FineOCR
  • ifresco Transformer
  • FileConverter (pro)
  • ifresco Profiler + Plugins

Intelligent PDF OCR processing via AutoOCR for Abbyy and iOCR

PDF documents can be generated in different ways. A PDF can combine various contents and sources in one document. Pages can be constructed from “normal” PDF content consisting of text, images, and vector graphics, and typically already have textual content that can be used for full-text indexing and search. However, a PDF document can also contain scanned pages in black and white or color. Such pages or documents must undergo OCR to insert the textual information for indexing and searching.

So there are PDF documents that should not be subjected to any OCR processing at all, and others where individual pages or all pages have to be processed because they were generated by a scan process.

Normally, all these types of PDF documents occur in business processes, and the user cannot tell whether or not a document needs OCR: viewed from the outside, in Adobe Reader or on a printout, this cannot be recognized immediately.

If every PDF document / page were processed in the same way, regardless of how it is structured and whether OCR processing makes sense or not, there would be some disadvantages:

Each PDF page is “rasterized” again regardless of its structure and content, i.e. converted into an image and then run through OCR. This is like printing the document, scanning it again, and then subjecting it to OCR processing. The result is an image of a “normal” PDF page with underlying text recognized by the OCR engine.

  • the quality is not the same as before
  • the documents become bigger
  • special PDF properties are lost (bookmarks, links, etc.)
  • Processing time and resources are consumed
  • OCR page licenses are consumed unnecessarily

PDF OCR processing should therefore be “intelligent”, so that neither the process nor the user has to decide laboriously whether a PDF document must be subjected to OCR processing or not. This is even more difficult when a single PDF document consists of mixed normal and scanned parts.

That’s why we’ve integrated intelligent OCR processing into AutoOCR, which works in the same way with both the Abbyy and the iOCR OCR engine. It can be controlled per input folder, or for the web service interface via the OCR profile, and is available for both PDF > PDF and PDF > TXT processing.

Screenshots: AutoOCR Abbyy intelligent OCR processing; iOCR intelligent PDF processing.

Highlights – Intelligent PDF OCR processing:

  • works for both PDF > PDF and PDF > TXT processing
  • for the Abbyy OCR and iOCR engines
  • for folder processing as well as web service processing
  • the PDF document and every single page are analyzed, and only pages that do not contain any text are processed by OCR – these are usually scanned pages that have not yet been processed by OCR
  • existing normal PDF documents and pages are taken over unchanged and not processed
  • documents and pages that have already been OCRed are not processed again
  • in the case of PDF > TXT processing, the text is extracted from the normal PDF pages and OCR is only performed on pages without text
  • PDF functions and bookmarks are retained and included in the target document
  • saves processing time and Abbyy OCR page licenses
  • the files are not enlarged
  • the quality of the PDF pages is preserved
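
The page-level decision described in the highlights above can be sketched as follows. This is an illustration, not the AutoOCR implementation: it assumes that the existing text of each page has already been extracted into a `text` field, and the names `needsOcr`, `planPages`, and `toText` as well as the `ocr` callback are hypothetical.

```javascript
// A page needs OCR only if it carries no text yet (typically a scanned page).
function needsOcr(page) {
  return page.text.trim().length === 0;
}

// PDF > PDF: decide per page whether to keep it unchanged or run OCR on it.
function planPages(pages) {
  return pages.map((page, index) => ({
    page: index + 1,
    action: needsOcr(page) ? "ocr" : "keep",
  }));
}

// PDF > TXT: take existing text as is and call OCR only for empty pages.
function toText(pages, ocr) {
  return pages
    .map((page) => (needsOcr(page) ? ocr(page) : page.text))
    .join("\n");
}

const sample = [
  { text: "Page with a normal text layer" },  // kept unchanged
  { text: "" },                               // scanned page, needs OCR
  { text: "   " },                            // whitespace only, treated as empty
];
console.log(planPages(sample));
```

With this approach the number of OCR calls equals the number of pages without text, which is where the savings in processing time and page licenses come from.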

In addition to AutoOCR, the “intelligent PDF OCR processing” is available in all of our other software products that support OCR processing, e.g. ifresco Profiler, FileConverter, DropOCR, PDFMerge, etc.
