Schema Spreadsheet

To bridge the gap between Google's Document AI and the OGRRE Tool, the OGRRE Workflow relies on the Schema Spreadsheet, an xlsx-formatted spreadsheet. The Schema Spreadsheet starts with the primary Trained Models Sheet, named "Trained Models", which records the trained models for each processor and indicates which model to use for each processor. In addition to the Trained Models Sheet, a Schema Record Sheet is added for each Extractor Processor, using the processor's name as the sheet name. Schema Record Sheets record the processor's schema field definitions and define how each field's extracted values are processed by the OGRRE Tool.
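The sheet layout described above can be checked mechanically. The following is a minimal sketch, not part of the OGRRE Workflow itself: it assumes the sheet names and the Trained Models rows have already been read into plain Python values, and the helper name missing_schema_sheets and the example data are hypothetical. It also assumes the Processor Type cell begins with the word "Extractor".

```python
TRAINED_MODELS_SHEET = "Trained Models"


def missing_schema_sheets(sheet_names, trained_model_rows):
    """Return Extractor processor names that have no sheet of the same name."""
    names = set(sheet_names)
    if TRAINED_MODELS_SHEET not in names:
        raise ValueError(f'workbook has no "{TRAINED_MODELS_SHEET}" sheet')
    expected = {
        row["Processor Name"]
        for row in trained_model_rows
        # Assumption: the Processor Type cell starts with the word "Extractor".
        if str(row.get("Processor Type", "")).strip().lower().startswith("extractor")
    }
    return sorted(expected - names)


# Made-up example data; real values come from the Schema Spreadsheet.
sheet_names = ["Trained Models", "Example Extractor A"]
trained_model_rows = [
    {"Processor Type": "Classifier", "Processor Name": "Example Classifier"},
    {"Processor Type": "Extractor", "Processor Name": "Example Extractor A"},
    {"Processor Type": "Extractor", "Processor Name": "Example Extractor B"},
]
print(missing_schema_sheets(sheet_names, trained_model_rows))
```

An empty list means every Extractor Processor listed in the Trained Models Sheet has a matching Schema Record Sheet.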

Trained Models

The Trained Models sheet has one row for each model created for each processor, with the following column fields: Processor Type, Processor Name, Model Name, F1 Score, Primary Model in Processor, Training Documents, Testing Documents, Date Trained, Foundation Model, Processor ID, and Model ID.

Column Field	Description
Processor Type	The type of the processor: Splitter, Classifier, or Extractor
Processor Name	The name of the processor (must match exactly)
Model Name	The name of the model
F1 Score	The model's overall F1 Score
Primary Model in Processor	For each processor, the word "primary" is used to indicate which model will be used by the OGRRE Tool when processing documents
Training Documents	The number of training documents used when training the model
Testing Documents	The number of testing documents used when training the model
Date Trained	The date the model was trained
Foundation Model	Indicates which foundation model was used for training the model or if a Custom model was trained
Processor ID	The ID of the processor (found on the processor details/overview tab or in the Document AI "filepath" bar)
Model ID	The ID of the model (found on the Manage Versions/Deploy & Use tab under the column "Version ID")
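As an illustration of how the "primary" marker could be read, here is a minimal sketch. It does not reproduce the OGRRE Tool's own logic: the function name primary_models, the example rows, and the choices to compare the marker case-insensitively and to reject more than one primary model per processor are all assumptions.

```python
def primary_models(rows):
    """Map each Processor Name to the row marked "primary" for that processor."""
    selected = {}
    for row in rows:
        # Assumption: blank cells may arrive as None or as an empty string.
        marker = str(row.get("Primary Model in Processor") or "").strip().lower()
        if marker != "primary":
            continue
        name = row["Processor Name"]
        if name in selected:
            raise ValueError(f"more than one primary model for processor {name!r}")
        selected[name] = row
    return selected


# Made-up rows using the column fields from the table above.
rows = [
    {"Processor Name": "Example Extractor", "Model Name": "model-a",
     "Primary Model in Processor": None, "Processor ID": "proc-1", "Model ID": "ver-1"},
    {"Processor Name": "Example Extractor", "Model Name": "model-b",
     "Primary Model in Processor": "primary", "Processor ID": "proc-1", "Model ID": "ver-2"},
]
chosen = primary_models(rows)
print(chosen["Example Extractor"]["Model ID"])
```

The Processor ID and Model ID of the selected row are the two values needed to address a specific model version.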

Schema Record

For each Extractor Processor, a Schema Record Sheet is created in which each schema field is entered as a row, with the column fields: Google Processor Name, OGRRE Document Type, Page Order Sort, Name, Google Data type, Occurrence, Grouping, Database Data Type, Cleaning Function, Model Enabled, and (Model Name) F1 Scores.

Column Field	Description
Google Processor Name	The name of the processor
OGRRE Document Type	The name of the Document Type used in the OGRRE Tool
Page Order Sort	The order in which each field is expected to appear on the documents, starting from 1 for the top-leftmost field. This field is used by the OGRRE Tool.
Name	The name of the field (must match exactly). The convention used for Child Field Labels is (Table Label Name)::(Child Field Name), e.g. Packer::Setting_Depth
Google Data type	The Data Type in the Google Schema
Occurrence	The Occurrence in the Google Schema
Grouping	(Optional) This field can be used to tag related Field Labels
Database Data Type	The data type that the extracted value will be stored as. Available options are bool, str, int, float, date, or Table.
Cleaning Function	The cleaning function that will be applied to the field. (See Cleaning Function Documentation for available cleaning functions).
Model Enabled	(Optional) This field can be used to track which labels do or do not meet minimum requirements, and can then be referred to while disabling Schema Fields before training a new model.
(Model Name) F1 Scores	For each model, a separate column should be added to record the F1 scores for each field
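Two of the conventions in the table above lend themselves to a quick check: the (Table Label Name)::(Child Field Name) naming convention and the list of available Database Data Type options. The sketch below is illustrative only; the helper names split_field_name and check_schema_row are hypothetical and are not part of the OGRRE Tool.

```python
# Options listed for the Database Data Type column.
ALLOWED_DB_TYPES = {"bool", "str", "int", "float", "date", "Table"}


def split_field_name(name):
    """Return (table_label, field_name); table_label is None for non-child fields."""
    table, sep, child = name.partition("::")
    if sep:
        return table, child
    return None, name


def check_schema_row(row):
    """Return a list of problems found in one Schema Record row (empty if none)."""
    problems = []
    db_type = row.get("Database Data Type")
    if db_type not in ALLOWED_DB_TYPES:
        problems.append(f"unknown Database Data Type: {db_type!r}")
    table, field = split_field_name(str(row.get("Name") or ""))
    if not field or table == "":
        problems.append("field name or table label is empty")
    return problems


print(split_field_name("Packer::Setting_Depth"))
print(check_schema_row({"Name": "Packer::Setting_Depth", "Database Data Type": "float"}))
```

A row that passes returns an empty list; anything else is reported as a human-readable message so the spreadsheet can be corrected before it is converted.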

Updating the OGRRE Tool Schema

Running the Excel_to_Json(excel_file_path) command from excel_to_json.py converts the xlsx-formatted Schema Spreadsheet into a set of JSON files used by the OGRRE Tool.
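Before running the conversion, it can help to confirm that the path points at an existing xlsx file. The following is a small pre-flight sketch; check_schema_path is a hypothetical helper, the example filename is made up, and the import shown in the trailing comments assumes excel_to_json.py is importable from the working directory.

```python
from pathlib import Path


def check_schema_path(excel_file_path):
    """Return the path as a Path object if it looks like an existing xlsx file."""
    path = Path(excel_file_path)
    if path.suffix.lower() != ".xlsx":
        raise ValueError(f"expected an .xlsx file, got {path.name!r}")
    if not path.is_file():
        raise FileNotFoundError(f"Schema Spreadsheet not found: {path}")
    return path


# Intended use (commented out because excel_to_json.py ships with the workflow
# and the filename below is only a placeholder):
# from excel_to_json import Excel_to_Json
# Excel_to_Json(str(check_schema_path("schema_spreadsheet.xlsx")))
```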

Visual Record

To help ensure consistent labeling for Splitter and Classifier Processors, the OGRRE Workflow recommends creating a Visual Record for reference. Neither the Google interface nor the OGRRE Tool can be used to create a visual record of documents, so a separate program is needed; the OGRRE Team uses PowerPoint. As a best practice, the Visual Record consists of a section for each Document Type, subsectioned into Document Versions that are paired with separate Extractor Processors. Each Document Versions subsection should record even subtle variations. This helps if a Version subsection later needs to be divided, and it serves as a reference while relabeling the affected documents.
