PDFmdx – Read position data via group / subgroup fields

In addition to document fields, PDFmdx can also read position data. Position data are lists or tables with rows and columns, typically found on invoices to list several items in the document. We use the term “sliding group / subgroup” for this. One or more columns (= fields) in one or more rows, on one or more pages, are searched for and read in a vertically defined area.

As of PDFmdx version 3.5.0 there is a 2-stage structure in which a subgroup level is possible in addition to the groups. One or more subgroup data records can be recognized and read out for each group data record. Some documents contain 2-stage position data, e.g. for textiles or clothing, where an item (number, description) can also have a “sub-level” with sizes or color specifications. The item itself is listed once, and the level below contains the quantities / prices for the individual characteristics.

Two-level readout of position data:

  • “Document/Group/Subgroup” fields define the detection level.

  • An area defined by 2 red horizontal boundary lines will be scanned on all pages of the document for the group (red boxes) and subgroup (green boxes) records.

  • The specified conditions are used to identify and read out the group (G) and subgroup (S) data records.

  • Along with the lowest-level records, the information of the group and document fields is also available.

As a starting point for your own tests, we have created two example templates with PDF test files. The *.pmdx templates only need to be imported into the PDFmdx Editor via drag & drop; the output path may need to be adjusted. For processing, you then need to create a job with input and error folders in the PDFmdx Processor and select the two test templates for the job.

Download – PDFmdx – Templates and examples for two-level reading of position data >>>
Download – PDFmdx Template Editor & Processor >>>

PDFmdx version 3.5.3 available

New features PDFmdx version 3.5.3:

  • Field / Area OCR / Invert area / Always execute OCR:

Normally, PDFmdx processes PDF files that already contain text: either “normal” PDFs or scanned PDFs that have received an additional text layer via a previous OCR process (e.g. via AutoOCR or FileConverterPro).

PDFmdx also has an integrated OCR function to determine the text in the areas of the positioned fields from the image information.

With the general PDFmdx OCR settings it is possible to specify how the texts from the PDF are to be obtained – “Original”, “OCR” or “SmartOCR”. With “Original” the text is always taken from the PDF, with “OCR” the text is always obtained via a PDFmdx OCR process, even if a text already exists in the PDF. With the “SmartOCR” setting, the PDFmdx OCR function is only executed if there is no text in the PDF, otherwise the existing text in the PDF is taken. These settings generally apply to the entire template and all associated layouts.

In this context, there are now 2 new field functions, one of which makes it possible to recognize white text on a black background.

Individual areas with white text on a black background cannot be recognized by an automatic OCR process, because the area would have to be inverted before the OCR process. Until now, this could only be done interactively by selecting the area manually.

In the PDFmdx Editor it is now possible to activate the option “Invert Area” in the field configuration. In this case, the field area is inverted for the OCR processing. This creates black text on a white background which can be recognized by the OCR.

There is another new field function “Execute OCR always” with which the general setting “SmartOCR” can be overridden. OCR recognition is then always executed for this field, even if an underlying text already exists.
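The interplay of the general OCR setting and the field option can be sketched as a small decision function. This is only an illustration of the rules described above, not PDFmdx code: the names `OcrMode` and `use_ocr` are hypothetical, and applying the field option when the general setting is “Original” is an assumption (the text only states that it overrides “SmartOCR”).

```python
from enum import Enum


class OcrMode(Enum):
    """General OCR setting of a template (values as named in the text)."""
    ORIGINAL = "Original"
    OCR = "OCR"
    SMART_OCR = "SmartOCR"


def use_ocr(mode: OcrMode, pdf_has_text: bool, always_ocr_field: bool = False) -> bool:
    """Return True if the field text should come from OCR instead of the PDF text."""
    if always_ocr_field:
        # Field option "Execute OCR always"; documented as overriding SmartOCR.
        # Treating it as an override for "Original" too is an assumption.
        return True
    if mode is OcrMode.OCR:
        return True                # always run OCR, even if text exists
    if mode is OcrMode.SMART_OCR:
        return not pdf_has_text    # run OCR only if the PDF has no text
    return False                   # "Original": always take the PDF text


# A scanned page with an existing text layer, general setting SmartOCR:
print(use_ocr(OcrMode.SMART_OCR, pdf_has_text=True))                         # False
print(use_ocr(OcrMode.SMART_OCR, pdf_has_text=True, always_ocr_field=True))  # True
```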

  

  • PDFmdx Editor – find condition, call layout: There is now a search function to search in the conditions for a (partial) string forward and backward. A line in the conditions can thus be jumped to directly. The linked layout can then be called directly from the condition line. This feature makes it easy to work with a large number of conditions.

  • The web service functions have been revised. In the web service example the metadata can now also be downloaded as XML.
  • For the metadata XML, the new variables JobID, JobName, JobDescription and ProcessID have been added.


PDFmdx Version 3.5.0 available

Innovations PDFmdx Version 3.5.0:

  • Subgroups – additional hierarchy for sliding groups: A sliding group is used, for example, to recognize invoice items that occur several times in a document or on a page and to form several data records from them. However, there are documents where these records require a further hierarchy level because there are multiple sub-records under one heading, e.g. to differentiate the characteristics of an article by color or size. This can take the form of either a list or a matrix. To recognize and read out such additional characteristics, it is now possible to form “subgroups” for a sliding group.

There are now 3 field levels – the “Document fields”, the “Group fields” and the “Subgroup fields”. Subgroup records are defined by conditions in the same way as group records. For subgroup records, the output also provides the information of the document and the group.

For the output, you can configure whether all data records are output or whether the group or document records are to be suppressed. The fields of the higher levels are also available in the group/subgroup data record. To identify the data record level, the variable %RECORD_LEVEL% can be used with the values (D)ocument, (G)roup, (S)ubgroup.
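How the three levels might end up in the output can be illustrated with a short sketch. It assumes a nested dictionary as input, which is not the PDFmdx data format; the function `flatten` and the field names (`InvoiceNo`, `Item`, `Size`, `Qty`) are hypothetical. Only the level codes D, G, S and the inheritance of higher-level fields are taken from the description above.

```python
def flatten(document: dict, suppress_levels: frozenset = frozenset()) -> list:
    """Flatten a document/group/subgroup tree into a list of output records.

    Each record carries RECORD_LEVEL (D, G or S) plus the fields of its own
    level and of all higher levels. Levels in suppress_levels are skipped.
    """
    records = []
    doc_fields = document["fields"]
    if "D" not in suppress_levels:
        records.append({"RECORD_LEVEL": "D", **doc_fields})
    for group in document.get("groups", []):
        group_fields = {**doc_fields, **group["fields"]}
        if "G" not in suppress_levels:
            records.append({"RECORD_LEVEL": "G", **group_fields})
        for sub in group.get("subgroups", []):
            if "S" not in suppress_levels:
                records.append({"RECORD_LEVEL": "S", **group_fields, **sub["fields"]})
    return records


# Hypothetical clothing invoice: one item with two size variants.
invoice = {
    "fields": {"InvoiceNo": "4711"},
    "groups": [
        {
            "fields": {"Item": "T-Shirt"},
            "subgroups": [
                {"fields": {"Size": "M", "Qty": "3"}},
                {"fields": {"Size": "L", "Qty": "5"}},
            ],
        }
    ],
}

# Output only the lowest level; document and group records are suppressed.
for row in flatten(invoice, suppress_levels=frozenset({"D", "G"})):
    print(row)
```

Each subgroup record still contains the invoice number and the item, so a downstream system does not need to join the levels again.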

The fields of the different levels are displayed in different colors in the PDFmdx Editor – document fields “Blue”, group fields “Red” and subgroup fields “Green”.

The working/search area for the sliding group/subgroup is represented in the PDFmdx Editor by 2 horizontal red lines, which can be positioned vertically in the preview. The search for data records takes place only within the specified range.

  • MS-SQL Database Support for Metadata / Log & Error Log Function: In addition to exporting the metadata to an XLSX/CSV/XML file, there is now also the option to write the records into MS-SQL database tables. The read-out variables are written as document/group/subgroup data records with configurable fields and contents; the log table has a fixed structure.

MS-SQL Export Functions:

    • Configuration – MS-SQL Server / Database.
    • Create / delete SQL tables / delete data from the tables.
    • Create / delete SQL columns in the selected table.
    • For each template, the SQL export can be activated and the SQL table can be selected. Fields (variables) or fixed text can be assigned to any SQL column.
    • Enable SQL – Logging / Error Log. The name of the log table is configurable.
    • The SQL log contains the following information: PROCESS_ID, computer name (WsName), user name (UserName), template (Template), layout, status (OK, ERROR), error code (ErrorCode), error message as text (ErrorMessage), information about the input/output file (InputPath, InputFileName, InputFolder, OutputPath, OutputFileName, OutputFolder), start/end of processing (StartTime, EndTime), processing time (ProcessingTime).

PDFmdx error codes in the log:

    • 0 = Successful processing.
    • 1 = No pages remaining in the PDF.
    • 2 = Configured stationery could not be found.
    • 3 = Missing license.
    • 4 = Error loading text plugin.
    • 5 = Error writing the PDF file.
    • 6 = No matching template/layout found for the specified criteria.
    • 7 = Error writing printer (PCF) configuration file.
    • 8 = Processing error.
    • 9 = Error creating the output folder.
    • 10 = Error creating the output file.
    • 11 = Error when overlaying/underlaying the stationery.
    • 12 = Error while signing.
    • 13 = Error when sending emails.
    • 14 = Error writing metadata.
    • 15 = Error generating the XML file.
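When evaluating the log table in your own scripts, the error codes listed above can be kept in a simple lookup table. The following sketch only restates that list; the names `PDFMDX_ERROR_CODES` and `describe_error` are hypothetical and not part of PDFmdx.

```python
# Error codes as listed in the PDFmdx log description above.
PDFMDX_ERROR_CODES = {
    0: "Successful processing",
    1: "No pages remaining in the PDF",
    2: "Configured stationery could not be found",
    3: "Missing license",
    4: "Error loading text plugin",
    5: "Error writing the PDF file",
    6: "No matching template/layout found for the specified criteria",
    7: "Error writing printer (PCF) configuration file",
    8: "Processing error",
    9: "Error creating the output folder",
    10: "Error creating the output file",
    11: "Error when overlaying/underlaying the stationery",
    12: "Error while signing",
    13: "Error when sending emails",
    14: "Error writing metadata",
    15: "Error generating the XML file",
}


def describe_error(code: int) -> str:
    """Translate an ErrorCode value from the log into readable text."""
    return PDFMDX_ERROR_CODES.get(code, f"Unknown error code {code}")


print(describe_error(6))
```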

  • PDFmdx Editor – Test Functions: The test feature in the PDFmdx Editor and the processing in the PDFmdx Processor are now based on the same component. This ensures that the “Test” in the PDFmdx Editor yields the same result for recognition, splitting and reading as the processing by the PDFmdx Processor.

In a PDFmdx template you can configure if and how a layout should be identified by conditions. In the “Test” function in the PDFmdx Editor, the conditions are checked, the recognized layout is identified and the fields specified in the layout are read out. On the test mask there is now a checkbox to ignore the layout recognition/criteria. The fields are then read out and displayed only via the manually selected layout.

  • Field substring from the end: The substring field function is now not only possible from the beginning of a field, but also from the end (switchable).

  • New OCR version, several recognition languages: The field OCR function has been updated and is now based on Tesseract version 4.0. As a result, it is now possible to recognize multiple languages.

  • Default values for fields – layout related: In addition to the general default value, it is now possible to assign an individual default value to a field for each layout. A variable is assigned the default value if the field was not positioned on a layout, or if the field was positioned but nothing can be read because the area is empty (= blank). This allows the layout recognition to assign a fixed value to a variable, e.g. a customer number that cannot be read directly from the document.
  • New “Composite” field type: The “Composite” type allows you to create combined fields, that consist of several other fields or text. Such composite fields are available for output (folder, filename, metadata), but not for conditions. These fields can be composed of variables of the documents, groups and subgroups.
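The idea of a composite field can be illustrated with a few lines of Python. This is a sketch under assumptions: the placeholder syntax `%NAME%` is borrowed from the PDFmdx variables mentioned elsewhere on this page (e.g. %FILENAME%), but how composite fields are actually configured in the Editor is not shown here, and `compose` as well as the field names are hypothetical.

```python
import re


def compose(pattern: str, variables: dict) -> str:
    """Build a composite value from fixed text and %NAME% placeholders.

    Placeholders without a matching variable are left unchanged.
    """
    def lookup(match: re.Match) -> str:
        name = match.group(1)
        return variables.get(name, match.group(0))

    return re.sub(r"%([A-Za-z_][A-Za-z0-9_]*)%", lookup, pattern)


# Variables from document, group and subgroup level (hypothetical names).
values = {"CUSTOMER": "ACME", "INVOICE_NO": "4711", "SIZE": "M"}
print(compose("%CUSTOMER%_%INVOICE_NO%_%SIZE%", values))
```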

  • Option – No remaining pages – Do not move document to the error folder: When splitting, deleting pages (cover pages) and deleting blank pages, it may happen that the document no longer has any pages left for processing. This option determines whether the “remaining document” is retained and moved to the error folder, or whether such a document is discarded and the event is only recorded in the error log.

  • Export of additional formats, selectable for – “Successful / Error / Both”: It is now also possible to convert PDF files that have been moved to the error folder into other formats (e.g. TXT) to carry out further evaluations.


PDFmdx Version 3.3.0 available

Innovations PDFmdx Version 3.3.0:

  • Export additional formats – By integrating the PDF2DOCX Converter, HTML, DOCX, XML, TXT and XLS can now be created in addition to the generated PDF. These additional files are created from the generated PDF and stored in the same output path as the PDF. One or more additional file formats can be created at the same time.

  • PDFmdx Editor – Save and load the conditions created in the editor as an XML file to easily and quickly save and reload different states of the conditions. When saving, a file name is suggested automatically based on the template name, date and time.

  • PDFmdx Editor – Move conditions up / down or to the beginning / end. This allows conditions to be easily resorted, grouped and to align related rows underneath each other.

 

  • PDFmdx Editor – Conditions – insert / rename a separator. Conditions can be provided with additional dividing lines to increase the readability and clarity of large structures. An inserted dividing line can be removed and its text can be edited.

  • Bug fix – An action associated with a condition (Detect, Split, Delete, Sliding Groups) can be limited to specific pages, for example only the first page or the first and second pages. This speeds up processing because not all pages in a batch have to be processed. Previously, the page limit was not applied and all pages were searched. With version 3.3.0, only the specified pages are processed.

  • Keep field contents from deleted pages. If pages were deleted via conditions, it was previously not possible to use the field information from these pages for conditions, for the output of the metadata or for the creation of the path and file name. A typical case is a barcode value on a cover page that serves as a document identifier: it is needed for separating a batch, for selecting the layout, for the file name, and finally for deleting this separator page. To preserve field contents despite the deletion of pages, the field definition now has the option “Persistant value”. This makes it possible to identify a layout, divide the stack, delete the pages and use the read-out value for the file name in a single condition and a single step.

  • PDFmdx Editor – Save template / layout structure as XML. The tree structure of the templates and layouts created in the PDFmdx Editor can be written to an XML file and automatically updated when the PDFmdx Editor is closed.

  • PDFmdx Editor – New field type “Filename” – The file name of the input file can now be used for the conditions of processing and layout recognition. For example, the layout to be used can be controlled by the file name or parts of the name.

    

  • PDFmdx Editor – Conditions – Direct selection of the layout to be used via the option <VALUE>. If you want to select a layout via the value of a variable (for example the file name), you can either create a separate condition for each layout and link them with “OR”, or use the selection <VALUE> in the conditions. This automatically checks the given variable against every layout name created for the template and selects the layout whose name matches the content of the field.
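The effect of <VALUE> can be pictured as a lookup of the variable content in the list of layout names, instead of one hand-written condition per layout. The sketch below is an illustration only: `select_layout` is a hypothetical name, and whether PDFmdx compares case-insensitively or trims whitespace is an assumption.

```python
from typing import Optional


def select_layout(value: str, layout_names: list) -> Optional[str]:
    """Return the layout whose name matches the variable content, else None.

    Assumption: exact match after trimming, ignoring upper/lower case.
    """
    wanted = value.strip().casefold()
    for name in layout_names:
        if name.casefold() == wanted:
            return name
    return None


layouts = ["Invoice", "DeliveryNote", "Reminder"]  # hypothetical layout names
print(select_layout("deliverynote", layouts))
print(select_layout("Offer", layouts))
```

With three layouts the difference to three “OR” conditions is small, but with dozens of layouts a single lookup is much easier to maintain.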

 

  • %FILENAME% variable – The case of the file name is preserved – previously the file name was always converted to lowercase.
  • Overwrite file / Append counter – There is now an option to overwrite files with the same name during processing. If this option is not checked then a new file will be created as usual and a counter will be added to the existing file name.
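The two behaviours for files with the same name can be sketched as follows. The function name `resolve_output_path` and the counter format (`_1`, `_2`, ...) are assumptions for illustration; the format PDFmdx actually uses for the counter is not stated here.

```python
from pathlib import Path


def resolve_output_path(target: Path, overwrite: bool) -> Path:
    """Decide where an output file is written if the name may already exist.

    overwrite=True  -> reuse the target path (an existing file is replaced)
    overwrite=False -> append a counter until an unused name is found
    """
    if overwrite or not target.exists():
        return target
    counter = 1
    while True:
        # Hypothetical counter format: invoice.pdf -> invoice_1.pdf, invoice_2.pdf
        candidate = target.with_name(f"{target.stem}_{counter}{target.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1
```

Called with `overwrite=False` for an existing `invoice.pdf`, the function returns `invoice_1.pdf`, or the next free counter value if that file exists as well.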


iPaper 3.x – MDX Option – Product Video available – Read content and use as variables

For iPaper version 3.x there is the “MDX – MetaDataXtraction” add-on module. Key features of the PDFmdx application have been integrated into iPaper. Documents can be recognized on the basis of content, the corresponding stationery can be selected, or field, template and layout definitions can be used to read out information from the document. Fields / variables are filled with values which can be used later on in the iPaper actions. Fixed information or information read from the document can also be “stamped” on the PDF as text or as 1D / 2D / QR barcode.

iPaper MDX applications:

  • Automatically select the stationery to be used on the basis of the document contents.
  • For serial letters or document stacks it can be recognized at which page a new document begins to select the stationery again or to start again with the first stationery page.
  • Read e-mail addresses from the document and use them to send the document immediately.
  • Documents can be recognized on the basis of criteria, fields can be read out of the document via layout masks, and variables can be assigned and used for iPaper actions such as e-mail, save as, program call and so on.
  • QR code barcodes (e.g. for quick transfers), 1D / 2D barcodes or text stamps can be applied to documents. It is also possible to use field contents read from the document for this.

iPaper MDX Product Video – Read content and use as file name:

PDFmdx Version 3.2.7 available

Innovations PDFmdx Version 3.2.7:

  • Multiline Edit Box for Barcode- and Text-stamp – Create QR code for payment instructions: Up to now, only a single-line string could be specified for the text and barcode stamping; CR / LF was not considered. Now there is a multi-line input field for capturing the texts. Line breaks (CR / LF) and blank lines are transferred correctly to the stamps and barcodes. QR codes can now also be generated for SEPA payment instructions (see QR-Code “Zahlen mit Code”). The basis for this QR code is a standard of the European Payments Council. Many banks offer eBanking apps for smartphones that can read such a QR code. The information is then automatically transferred into a bank transfer.
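Why line breaks matter for this QR code becomes clear when looking at the text it contains: in the European Payments Council format (EPC069-12, as far as we understand the guideline) every data element sits on its own line, and unused optional elements remain as blank lines. The following sketch builds such a multi-line text in Python. The function name `epc_payload` and the recipient data are placeholders, the field order should be checked against the EPC guideline before productive use, and creating the QR image itself is not shown.

```python
from decimal import Decimal


def epc_payload(name: str, iban: str, amount: str, remittance_text: str, bic: str = "") -> str:
    """Build the multi-line text for a SEPA credit transfer QR code.

    One data element per line; optional elements stay as empty lines.
    """
    lines = [
        "BCD",                          # service tag
        "002",                          # version (BIC optional in version 002)
        "1",                            # character set: 1 = UTF-8
        "SCT",                          # identification: SEPA credit transfer
        bic,                            # BIC of the recipient bank (may be empty)
        name,                           # name of the recipient
        iban.replace(" ", ""),          # IBAN without spaces
        f"EUR{Decimal(amount):.2f}",    # amount with currency prefix
        "",                             # purpose code (not used here)
        "",                             # structured reference (not used here)
        remittance_text,                # unstructured remittance text
    ]
    return "\n".join(lines)


# Placeholder data, not a real account.
text = epc_payload("Example GmbH", "DE89 3704 0044 0532 0130 00", "12.5", "Invoice 4711")
print(text)
```

A text like this could be entered in the multi-line field, with the fixed values replaced by variables read from the document.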

    

  • Identify the same receivers – Up to now, every PDF file created could only be sent in a separate email message. Now it is also possible, when processing a job, to collect all documents with the same recipient address and send them in a single message. The recipient receives one email containing all documents, instead of several mails with only one attachment each.
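Collecting documents per recipient amounts to a simple grouping step before sending. The sketch below shows the principle only; `group_by_recipient` is a hypothetical name, and treating addresses as equal regardless of upper/lower case is an assumption.

```python
from collections import defaultdict


def group_by_recipient(documents) -> dict:
    """Group (recipient_address, file_name) pairs into one attachment list per recipient."""
    mails = defaultdict(list)
    for recipient, file_name in documents:
        # Assumption: addresses are compared case-insensitively.
        mails[recipient.strip().lower()].append(file_name)
    return dict(mails)


# Hypothetical result of one job: three documents, two recipients.
job_output = [
    ("billing@example.com", "invoice_4711.pdf"),
    ("office@example.org", "invoice_4712.pdf"),
    ("Billing@example.com", "invoice_4713.pdf"),
]
for address, attachments in group_by_recipient(job_output).items():
    print(address, attachments)
```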

  • Remove characters – So far there was only the function to remove certain characters at the beginning and end of a field that was read. Now it is also possible to remove one or more fixed characters from the whole extracted string, no matter where they are.

  • Replace several characters at once – There was already a function to define several characters to be replaced. However, the replacements were not applied all at once but one after the other. Thus it was, for example, not possible to convert 1.234,56 to 1,234.56. This has been changed: all defined replacements are now applied at once, which makes such conversions possible.
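The difference between replacing one after the other and replacing at once shows up as soon as two characters are to be swapped, for example the decimal comma and the thousands separator of a number. The following Python sketch illustrates the principle; it is not the PDFmdx implementation, and both function names are hypothetical.

```python
def replace_sequential(text: str, mapping: dict) -> str:
    """Apply the replacements one after the other (old behaviour in principle)."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


def replace_at_once(text: str, mapping: dict) -> str:
    """Apply all single-character replacements in one pass over the text."""
    return text.translate(str.maketrans(mapping))


swap = {".": ",", ",": "."}
print(replace_sequential("1.234,56", swap))  # 1.234.56 (second step undoes the first)
print(replace_at_once("1.234,56", swap))     # 1,234.56
```

In the sequential variant the second replacement also hits the characters produced by the first one, so the swap cannot work. In the single-pass variant every character of the original text is looked up exactly once.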

  • XLSX instead of XLS, sheet name configurable – The MS-Excel XLS format has been replaced by the XLSX format. The sheet name can now be assigned freely. Previously, the sheet name in the XLS was fixed to “PDFmdx”.

  • Run Job weekly – Time-controlled execution of a job – In addition to the “Daily” option, there is now also the option “Weekly”

 

  • Email Address Search – Document / Page – Bug fix – In addition to reading e-mail addresses via fields, it is also possible to search for all e-mail addresses in the document or on certain pages and to use them for sending.

  • HTML Body – Embed images – Bug fix for sending HTML emails – With some email clients / web-based email services (e.g. Web.de), if images were embedded in the body, the message was displayed as HTML code / text and thus not displayed correctly.


PDFmdx-CL Version 1.0.25 – Commandline application available for PDFmdx

PDFmdx-CL is a command line application that transfers PDF documents or whole folder structures to a PDFmdx service via the web service interface and stores the results of the processing in a target folder.

PDFmdx-CL is a free add-on for the PDFmdx server, can be installed on any MS Windows workstation and requires no additional licensing.

PDFmdx-CL scope of application:

  • Recognize PDF documents via fields and their contents by means of stored criteria
  • Split document stacks into single documents by criteria
  • Read out field information from the documents and write it to a metadata (ASCII-TXT) file
  • PDF stationery underlay / overlay controlled via field contents
  • Sign PDF documents
  • Create PDF / A-1b or PDF / A-3b compliant documents
  • Fill PDF info fields with the read metadata
  • Copy text / watermark – fixed or via contents / variables from the document
  • 1D / 2D Barcodes – fixed or via contents / variables from the document

The PDFmdx server also offers the possibility to rename the documents, save them on the server in a folder structure, send them by e-mail, or print them using the PDF2Printer print server. These functions can only be used directly on the PDFmdx server, not yet via the PDFmdx-CL application.

PDFmdx-CL features:

  • Command line application for PDFmdx.

 

  • Web service communication (SOAP) – local (host) or remote PDFmdx processing service.
  • Processing of individual PDF files as well as all PDFs of a folder / ZIP file or folder structures.
  • User interface for the configuration as well as to set default settings.

  • Create job templates (name / description) and select the processing template (s). Processing templates are created via the PDFmdx editor and are stored on the PDFmdx server.

  • New processing jobs can be created using an existing job template and filled with documents (individual files or entire folders) – Required parameters are either specified explicitly or filled with default values.

  • The result documents (PDFs + metadata) are downloaded to the specified destination folder.
  • Job details can be displayed through the job list.

 

Download – PDFmdx-CL Commandline Add-on Client for PDFmdx >>>

pdfFM – PDF Folder Merge – Convert documents with the same name to a total PDF (/A)

With PDFmdx, document stacks can easily be split into single documents according to a wide range of criteria and named using the contents of defined areas. Sometimes, however, it is also necessary to automatically combine documents with the same name from different sources, in a certain sequence, into one overall document.

For a customer project, we have developed pdfFM – an application where 3 folders are specified. During processing, the folders are searched for documents with the same name; the matching documents are appended to a new overall PDF in the order of the specified folders and stored in a destination folder. If a file is missing in one of the folders, these documents are moved to the error folder. A log file records the processing. The processing can be executed either interactively or via a command line call.
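The matching step described above can be sketched in a few lines. This is not the pdfFM code: `plan_merge` is a hypothetical name, the sketch only determines which files belong together and which sets are incomplete, and the actual merging into one PDF would need a PDF library and is left out.

```python
from pathlib import Path


def plan_merge(folders: list) -> tuple:
    """Match PDF files with the same name across the given folders.

    Returns (complete, incomplete). Both map a file name to the list of
    paths found, in the order of the folders. A name is complete only if
    the file exists in every folder.
    """
    folders = [Path(f) for f in folders]
    names = set()
    for folder in folders:
        names.update(p.name for p in folder.glob("*.pdf"))

    complete, incomplete = {}, {}
    for name in sorted(names):
        parts = [f / name for f in folders if (f / name).is_file()]
        if len(parts) == len(folders):
            complete[name] = parts      # merge in folder order
        else:
            incomplete[name] = parts    # candidates for the error folder
    return complete, incomplete
```

The returned lists already have the folder order, so a merge step could append the files of each complete set one after the other.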

In addition to the merge to an overall PDF, the output file can also be converted to an ISO PDF / A-1b, 2b or 3b file.

pdfFM - Configuration  pdfFM - Command line parameters

PDFmdx – Video – Automatically send invoices via EMail

This PDFmdx application example shows how areas of a PDF document are read out and how the information is subsequently used for the automated email sending of the finished invoice.

  • Fields and areas are defined to read the company, the invoice number, the invoice date and the e-mail address from the document.
  • The input file is renamed based on the information read out. A PDF stationery is underlaid. In addition, the read-out invoice number is applied to the invoice as a 1D barcode and as a 2D QR code with a web link.
  • As a last step, an email message is generated via an HTML email template. Variables that have been inserted in the subject and in the message text are replaced with the read-out information. The PDF invoice and additional files are added as attachments and then sent automatically via an SMTP email server.

 

PDFmdx Version 3.2.5 available

Innovations PDFmdx Version 3.2.5:

  • New option for sending HTML emails – So far it was only possible to use external links, which were also available for the recipient, for pictures in the message. Now the images are embedded directly into the HTML message – either “all images” or “only the local images”. This means that no external resources accessible to all receivers need to be used.

HTML Body - Referenced images are sent embedded in the email

  • If the option to preserve the creation date / time is activated, then this information is now also transferred from the output file for files or subfiles that are moved to the error folder.
  • The %COUNTER% variable now supports values > 9999.
  • If the “Delete Blank Pages” function is active and a document is processed with only one blank page, it now correctly lands in the error folder and not in the destination folder.
