AutoOCR version 1.17.2 adds an option to delete empty pages before OCR processing. A page is detected as blank by means of a configurable threshold. The default value is 1%: a page is recognized as “empty” if less than 1% of its pixels are “not white”. This value may need to be adjusted to the scans being processed, because a scan with impurities can leave a blank page with more non-white pixels than the threshold allows, so that the page is not detected as empty. However, if the threshold is set too high, pages with little content may also be recognized as empty and therefore deleted.
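The threshold rule can be illustrated with a minimal Python sketch. It is not AutoOCR's implementation: the function names, the grayscale page representation (0 = black, 255 = white) and the `white_level` cut-off that decides what counts as “not white” are assumptions made for this example only.

```python
from typing import Sequence


def non_white_fraction(page: Sequence[Sequence[int]], white_level: int = 250) -> float:
    """Return the share of pixels darker than white_level (assumed cut-off)."""
    total = sum(len(row) for row in page)
    if total == 0:
        return 0.0
    non_white = sum(1 for row in page for pixel in row if pixel < white_level)
    return non_white / total


def is_blank(page: Sequence[Sequence[int]],
             threshold_percent: float = 1.0,
             white_level: int = 250) -> bool:
    """A page counts as blank if less than threshold_percent of its pixels are not white."""
    return non_white_fraction(page, white_level) * 100 < threshold_percent
```

With a 100 x 100 pixel page, 50 dark specks (0.5%) still count as blank at the default of 1%, while 150 dark pixels (1.5%) do not, unless the threshold is raised to 2%.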
Until now, the font used was embedded completely in the generated PDFs. As a result, quite large PDF output files were generated, especially for input files with only one page. However, since the PDFs generated by AutoOCR display an image in the foreground and do not need fonts for display, we have changed this. By default, iOCR no longer embeds fonts in the PDF. There is an option to embed only the part of the fonts that is actually used. Without embedded fonts, significantly smaller PDF files are created, especially for documents that consist of only one or a few pages.
Version 1.16.1 introduced an option to delay the start of processing for each monitored folder. This option is needed especially for multifunction devices that scan or copy a PDF or image file directly into a folder monitored by AutoOCR.
Certain multifunction devices create a 0-byte file right at the start of a scan process and then either fill it with data step by step, or collect the scans locally on the device and copy the finished file into the destination directory at the end. Depending on the data volume, the number of pages and the speed of the data connection, this can take anywhere from a few seconds to ten minutes or more.
If the start delay is not activated (parameter = 0), AutoOCR starts processing as soon as a file is created. If the file has not yet been written completely or is not ready for processing, every access or processing attempt by AutoOCR, repeated at short intervals, results in an error message in the log or in an error email. It can also lead to internal crashes of the OCR processing, which in turn trigger error messages and repeated processing attempts and cause the input file to be moved to the error folder.
This parameter configures, per folder, by how many seconds (0 to 999) the start of processing is delayed. A suitable value has to be determined for the requirements at hand.
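To make the idea of a start delay concrete, here is a small Python sketch. It is only an illustration under stated assumptions: how AutoOCR measures the delay internally is not described here, so the sketch measures the time since the file was last modified and additionally skips 0-byte files. The function names are hypothetical.

```python
import os
import time


def is_ready(size_bytes: int, last_modified: float, now: float, delay_seconds: int) -> bool:
    """Decide whether a file may be processed (assumed rule, not AutoOCR's own)."""
    if not 0 <= delay_seconds <= 999:
        raise ValueError("delay_seconds must be between 0 and 999")
    if size_bytes == 0:
        # A 0-byte placeholder created at the start of a scan is never ready.
        return False
    # Wait until the file has been left untouched for the configured delay.
    return now - last_modified >= delay_seconds


def file_is_ready(path: str, delay_seconds: int) -> bool:
    """Apply is_ready() to a real file, using its size and modification time."""
    info = os.stat(path)
    return is_ready(info.st_size, info.st_mtime, time.time(), delay_seconds)
```

A folder watcher built on this would call `file_is_ready()` repeatedly and hand the file to OCR only once it returns `True`. A device that needs several minutes to deliver a scan would call for a delay at the upper end of the range.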
With version 1.16.1, AutoOCR has been brought up to date with our basic components. Significant improvements have been made to the iOCR / vsOCR image processing routines. Page orientation is now recognized much better, so that rotated pages are automatically and reliably aligned correctly.
The iOCR / vsOCR setup, which contains the language and dictionary files of our standard OCR engine, is more than 270MB in size. To make the downloads and setups smaller, we decided to split the iOCR / vsOCR setup into a “base” setup and an “additional” setup. The base setup, which is available through our applications (e.g. AutoOCR, FileConverterPro or PDFmdx), now contains only a selection of major European languages and has been reduced to 127MB.
All available languages can still be installed at any time: the additional “exotic languages” are installed via a separate setup.
Danish, German, English, Finnish, French, Italian, Catalan, Modern Greek, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Swedish, Slovak, Slovenian, Spanish, Czech, Turkish, Ukrainian, Hungarian
iOCR extended languages:
Afrikaans, Albanian, Arabic, Azerbaijani, Bahasa Indonesia, Bengali, Bulgarian, Cherokee, Chinese – Traditional, Chinese – Simplified, Estonian, Franconian, Gallic, Hebrew, Hindi, Icelandic, Japanese, Korean, Croatian, Latvian, Lithuanian, Macedonian, Malay, Serbian, Swahili, Tagalog, Tamil, Telugu, Thai, Vietnamese, Belarusian
From version 1.15.3 onward, the AutoOCR setup checks a modified set of installation prerequisites. If they are already fulfilled, the setup skips the corresponding installation steps.
The following components are checked and installed if necessary:
- C++ 2010 Runtime 64bit
- C++ 2010 Runtime 32bit
- iOCR – now installed as vsOCR to C:\Program Files (x86)\Common Files\MAYComputer\vsOcr (~270MB)
If these components are already installed, they are not downloaded again and only AutoOCR is installed. If all or some of the components are missing or not present in the appropriate version, the AutoOCR setup tries to download them from our FTP server and install them. If the installation is to be carried out without an Internet connection, the setups of these components should be downloaded and installed beforehand.
The AutoOCR settings and the license are retained when uninstalling or updating to the new version.
AutoOCR can be operated with one or more different OCR engines. iOCR (vsOCR) is the standard engine.
Optionally, the Abbyy OCR engine can be used with AutoOCR, either in addition to iOCR or on its own. This requires downloading and installing an additional Abbyy setup. Demo licenses for the Abbyy OCR Engine version 10 are available for 30 days or 500 pages; you can request them from us.
If only the Abbyy OCR engine is to be used, the download and installation of iOCR can be skipped during setup.
New in AutoOCR version 1.15.3:
- New iOCR Engine – We replaced the previous standard iOCR engine with a new product, vsOCR. This results in a better recognition rate as well as significantly better performance on multicore / multiprocessor computers. The new OCR engine supports parallel / multithreaded processing of multi-page TIFF and PDF documents, so the OCR processing speed is multiplied if, for example, 4 or 8 cores are available.
- iOCR – PDF rendering resolution configurable – Since only image documents can be processed by OCR, PDF documents are always rendered to images prior to OCR processing. The rendering resolution can now be configured separately for black and white and for color; the default value for both is 300dpi.
- Abbyy OCR – New default settings – Based on our experience so far, we have redefined the default settings to achieve the best possible recognition rate and the highest possible OCR performance. A single option can change the processing speed by a factor of 5, 10 or more, especially for multi-page documents with a lot of text: a 10-page document can take either 10 seconds or 5 minutes to process, depending on whether the “Recognize font formatting” option is enabled.
- “Remove black border” – Added as a new general image processing function for iOCR and Abbyy. A black border, if present, is detected and removed from every document before OCR processing. The page size is not changed.
- Configurable response to an invalid license – Either the service is stopped (default) or a demo stamp is placed on the document.
- Further adjustments: Autostart of the AutoOCR User Interface is now activated by default. An error when creating the optional TXT file with iOCR has been fixed. Read-only PDF documents no longer cause endless loops during processing. The temporary Abbyy folder is now deleted correctly after the set number of days. Language-specific special characters are now encoded correctly in the Abbyy PDF/A output.
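The parallel processing of multi-page documents mentioned for the new iOCR engine can be pictured roughly as follows. This is a generic Python sketch, not the vsOCR implementation: `ocr_document` and the `recognize_page` callback are hypothetical names, and the per-page recognition is assumed to be independent of the other pages.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def ocr_document(pages: Sequence[bytes],
                 recognize_page: Callable[[bytes], str],
                 workers: int = 4) -> List[str]:
    """Run recognize_page on every page in parallel and keep the page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, even if pages finish out of order.
        return list(pool.map(recognize_page, pages))
```

Threads only speed things up in Python if the recognition step runs outside the interpreter lock, for example in a native library. For pure Python work a process pool would be the closer analogy to using 4 or 8 cores.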
Demo licenses for the Abbyy OCR Engine version 10 are available for 30 days or 500 pages; you can request them by email.
Features – AutoOCR processing / monitoring folders:
1.) Processing input folders / structures: Here an input folder or an entire folder structure is processed. The generated PDF files are stored in the same folder structure under the same name as the original file. PDF files are a special case, however, since some PDF files do not require OCR processing while others do. It may also happen that only certain pages of a PDF file need OCR processing.
To avoid reprocessing PDFs that AutoOCR has already processed, these files are marked with a “label” in the data structure.
When the AutoOCR service starts, the folder structure is scanned completely to identify backlogged files. Each PDF file has to be checked for this “label”. Note that with large data sets this takes a long time, since every PDF file must be opened and checked.
2.) Retain date / time of the original file: With this option, the creation, modification and last-access dates and times can be transferred from the source file to the PDF file generated by OCR. The PDF document is thus replaced without its attributes changing.
3.) Smart OCR processing of PDF files: PDFs can be pure image files without text, “normal” PDF files that already contain text, or mixed documents in which individual pages are scanned images with no text and the remaining pages are normal PDF content with text. Without special functionality, the whole PDF document is always OCR processed, i.e. all pages regardless of their content. This takes time and resources and enlarges the PDF files unnecessarily. That is why you should activate the “intelligent OCR processing”. Then only those documents and pages are OCR processed where it is necessary. “Normal” PDF files are not processed, but only marked with a “label” (see 1.).
4.) Folder Monitor – File System Events / block processing: If files added during ongoing processing must be detected and processed immediately, select “File System Events”. If “block processing” is selected, newly added files are not detected automatically. “Block processing” is designed specifically for the initial processing of large volumes of documents. After the initial processing, switch to “File System Events” so that newly added files are processed immediately. Whenever the AutoOCR service is stopped and restarted, the complete folder structure is always searched for unprocessed files first.
5.) Process files / folders from network shares: After installation, the AutoOCR service runs by default under the “Local System Account”. If files and folders on network shares are to be processed, you have to create a “user account” for the AutoOCR service that has the appropriate rights to access the network shares used.
Because PDFs can already contain text, and therefore not all documents / pages need to be subjected to OCR processing, we implemented the intelligent OCR processing. Previously, this feature was available only for PDF output.
For the Alfresco integration, AutoOCR can also be configured for plain text output. In this case AutoOCR generates only the text required for the Alfresco full-text search. With AutoOCR 1.10.17, the intelligent OCR processing is now available not only for PDF output but also for plain text output. Only image PDFs are OCR processed; for normal PDFs the text is extracted directly without OCR processing. This saves time and resources.
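The decision behind the intelligent OCR processing for plain text output can be sketched in a few lines of Python. This is an illustration only: the `Page` type, the `extract_text` function and the `run_ocr` callback are hypothetical, and treating a page as an image page whenever its embedded text is empty is an assumption, not AutoOCR's documented rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    embedded_text: str  # text already contained in the PDF page, "" for a pure scan
    image: bytes        # rendered page image, used only if OCR is needed


def extract_text(pages: List[Page], run_ocr: Callable[[bytes], str]) -> List[str]:
    """Return the text of each page, calling OCR only for pages without text."""
    result = []
    for page in pages:
        if page.embedded_text.strip():
            # "Normal" page: take the existing text directly, no OCR.
            result.append(page.embedded_text)
        else:
            # Image page: OCR is the only way to obtain text.
            result.append(run_ocr(page.image))
    return result
```

For a mixed document, OCR then runs only for the scanned pages, which is where the savings in time and resources come from.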
With DropOCR version 1.3.2, the parallel upload and the communication with the AutoOCR server were completely revised, fixing all deficiencies of the previous version. Especially with large documents with many pages, long processing times and a large number of documents to be processed, there were problems: not all documents were processed, errors were logged that had not actually occurred, or the communication with the AutoOCR server was aborted. All of these problems are fixed in version 1.3.2.