Connect trained processors to OGRRE

OGRRE relies on Google Document AI extractors for document processing. These processors must be trained before you incorporate them into OGRRE. To learn more about Document AI, see its documentation. To connect your trained processors to OGRRE, follow this guide.

Create Processor List and Schemas

OGRRE relies on JSON files to know which processors to look for and which versions of those processors to use. First, you must create a JSON file containing a list of your processors, called Extractor Processors.json. Each processor entry should contain the following:
  1. Processor Name: The name of the processor
  2. Processor ID: The processor ID on Google
  3. Model ID: The ID of the processor version you wish to use. Processors can have multiple trained versions. If not provided, OGRRE will use the default model.

For an example of this file, see here.
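
As a rough illustration of the list described above, the following Python sketch writes an Extractor Processors.json file. It assumes the JSON keys are spelled exactly like the labels in the list (Processor Name, Processor ID, Model ID), and that an unused Model ID is simply left out; neither is confirmed here, so check the linked example file for the exact format. The processor names and IDs are placeholders, and write_processor_list is a hypothetical helper, not part of OGRRE.

```python
import json
from pathlib import Path

# Assumption: key names mirror the labels in the list above. The values are
# placeholders, not real Google Document AI identifiers.
processors = [
    {
        "Processor Name": "Example Completion Report",
        "Processor ID": "0123456789abcdef",
        "Model ID": "example-version-id",
    },
    {
        # No "Model ID" here: per the guide, OGRRE then uses the default model.
        # Assumption: omitting the key is an accepted way to leave it unset.
        "Processor Name": "Example Plugging Report",
        "Processor ID": "fedcba9876543210",
    },
]


def write_processor_list(entries, directory="."):
    """Write the processor list to 'Extractor Processors.json' in directory."""
    path = Path(directory) / "Extractor Processors.json"
    path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return path
```

Calling write_processor_list(processors, "some/output/dir") produces the file in that directory; the file name, including the space, is the one this guide requires.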

For each processor you want to incorporate, you must create a schema. Google does not provide an API for fetching processor schemas, so OGRRE needs a schema to ensure that every field shows up on each document, whether or not the processor found it. Each schema should be a JSON file containing a list of the fields that you want to extract. The name of the file should match the name of the processor in Google Document AI. Each field should contain the following:
  1. page_order_sort: Order you want the field to show up in OGRRE
  2. name: The key of the field
  3. google_data_type: The Google data type (one of Plain text, Datetime, Checkbox, Parent, Address)
  4. occurrence: Field occurrence (one of Optional once, Optional multiple, Required once, Required multiple)
  5. grouping: Group that this field is associated with, if any (optional, for fields/subfields)
  6. database_data_type: The data type you want in your database (optional, one of int, str, float, bool, date, table)
  7. cleaning_function: The cleaning function you want applied to your field (optional, one of clean_date, clean_bool, string_to_float, string_to_int, convert_hole_size_to_decimal)
  8. model_enabled: Boolean indicating whether the field is enabled (optional)

For an example of a processor schema, see here.
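
Because a schema is hand-written, it can help to check each field against the allowed values listed above before adding the file to the project. The sketch below is one way to do that in Python. It assumes the four keys not marked optional (page_order_sort, name, google_data_type, occurrence) are required and that page_order_sort is an integer; the field names Well_Name and Spud_Date and the check_field helper are hypothetical and not part of OGRRE.

```python
# Allowed values, copied from the field list above.
GOOGLE_DATA_TYPES = {"Plain text", "Datetime", "Checkbox", "Parent", "Address"}
OCCURRENCES = {
    "Optional once", "Optional multiple", "Required once", "Required multiple",
}
DATABASE_DATA_TYPES = {"int", "str", "float", "bool", "date", "table"}
CLEANING_FUNCTIONS = {
    "clean_date", "clean_bool", "string_to_float", "string_to_int",
    "convert_hole_size_to_decimal",
}

# Assumption: keys not marked "optional" in the list are required.
REQUIRED_KEYS = ("page_order_sort", "name", "google_data_type", "occurrence")


def check_field(field):
    """Return a list of problems found in one schema field (empty if none)."""
    problems = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in field]
    checks = (
        ("google_data_type", GOOGLE_DATA_TYPES),
        ("occurrence", OCCURRENCES),
        ("database_data_type", DATABASE_DATA_TYPES),
        ("cleaning_function", CLEANING_FUNCTIONS),
    )
    for key, allowed in checks:
        # Only validate keys that are present; missing ones are reported above.
        if key in field and field[key] not in allowed:
            problems.append(f"{key} has unsupported value: {field[key]!r}")
    if "model_enabled" in field and not isinstance(field["model_enabled"], bool):
        problems.append("model_enabled must be a boolean")
    return problems


# Hypothetical two-field schema, in the shape described above.
example_schema = [
    {
        "page_order_sort": 1,
        "name": "Well_Name",
        "google_data_type": "Plain text",
        "occurrence": "Required once",
        "database_data_type": "str",
    },
    {
        "page_order_sort": 2,
        "name": "Spud_Date",
        "google_data_type": "Datetime",
        "occurrence": "Optional once",
        "database_data_type": "date",
        "cleaning_function": "clean_date",
        "model_enabled": True,
    },
]
```

Running check_field over every entry of a schema and confirming each result is empty catches misspelled values such as "Optional Once" before they reach OGRRE.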

Add JSON files to project

Once you have created the processor data files, you must add them to the proper location. OGRRE relies on a third repository, OGRRE_data_cleaning. This is where the processor schemas are defined. To include your processors in your version of OGRRE, follow these steps:
  1. Fork and clone OGRRE_data_cleaning
  2. Inside the repository, you will find the directory OGRRE_data_cleaning/src/ogrre_data_cleaning/processor_schemas. Inside this directory, create a subdirectory called <your-project>_extractors. Note: The project name must match the backend environment variable ENVIRONMENT. For more information, see the backend environment variables section.
  3. Inside <your-project>_extractors, add your Extractor Processors.json file and all your processor schemas.
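
Steps 2 and 3 can also be scripted. The following Python sketch copies every JSON file from a working directory into a <your-project>_extractors subdirectory. It assumes schemas_dir is the path to the processor_schemas directory in your local clone and that project is the same value as the backend ENVIRONMENT variable; install_processor_files is a hypothetical helper, not something OGRRE_data_cleaning provides.

```python
import shutil
from pathlib import Path

PROCESSOR_LIST_NAME = "Extractor Processors.json"


def install_processor_files(schemas_dir, project, source_dir):
    """Copy all .json files from source_dir into <project>_extractors.

    schemas_dir: path to the processor_schemas directory in your clone.
    project: project name; must match the backend ENVIRONMENT variable.
    source_dir: directory holding the processor list and schema files.
    """
    source = Path(source_dir)
    if not (source / PROCESSOR_LIST_NAME).is_file():
        raise FileNotFoundError(f"{PROCESSOR_LIST_NAME} not found in {source}")

    target = Path(schemas_dir) / f"{project}_extractors"
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for json_file in sorted(source.glob("*.json")):
        shutil.copy2(json_file, target / json_file.name)
        copied.append(json_file.name)
    return target, copied
```

The function refuses to copy anything if Extractor Processors.json is missing, since the schemas are of no use to OGRRE without the processor list.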