To be clear, this problem isn’t entirely new. Government agencies have long struggled with how to make use of the unstructured data found in most documents. There are two general ways organizations address this issue, and both have serious shortcomings.
One option is to manually extract data from traditional electronic documents, such as PDFs, Word files or HTML documents. For structured data, such as the amount owed on an invoice, extraction may be simple to automate. But for unstructured data, it is less straightforward. For example, an invoice may include a description of the services provided. To process this information, project managers must review it and verify that the services align with an approved contract and describe work that was actually performed. This likely involves reviewing multiple other documents, all of which also contain unstructured data, and may require the specialized skills of additional government workers, such as lawyers or procurement officials.
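To see why the structured case is the easy one, consider a sketch of automated extraction. The invoice text and the "Amount Due" label below are hypothetical; real invoices vary widely in layout, which is exactly why rules like this are brittle.

```python
import re

# Hypothetical invoice text; real invoices vary widely in layout.
invoice_text = """
Invoice #4417
Amount Due: $12,450.00
Description: Network maintenance services, Q3
"""

def extract_amount_due(text):
    """Pull a dollar amount that follows an 'Amount Due' label."""
    match = re.search(r"Amount Due:\s*\$([\d,]+\.\d{2})", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(extract_amount_due(invoice_text))  # 12450.0
```

The description line, by contrast, follows no reliable pattern that a rule can capture, which is why it falls to human reviewers.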
The other option is to use structured documents — electronic documents in which the various elements have meaningful labels. The most common approach is to use a standard like XML, where the creator of a document can define a schema specifying the elements in the document, the data types of those elements, and any defaults or attributes. Unfortunately, as many software engineers have found, creating structured documents is easier said than done. The process can be tedious and technical, and changes to schemas must be closely monitored and validated, or the systems that depend on them can break.
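A minimal sketch shows what labeled elements buy you. The invoice markup below is invented for illustration — the element names come from no real government schema — but once data carries labels like these, a program can read it directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured invoice; element names are illustrative,
# not drawn from any real government schema.
xml_doc = """<invoice>
  <vendor>Acme Networks</vendor>
  <amountDue currency="USD">12450.00</amountDue>
  <serviceDescription>Network maintenance services, Q3</serviceDescription>
</invoice>"""

root = ET.fromstring(xml_doc)
amount = float(root.find("amountDue").text)       # typed value
currency = root.find("amountDue").get("currency")  # attribute
description = root.find("serviceDescription").text

print(amount, currency)  # 12450.0 USD
```

In practice, a schema language such as XSD would also enforce the data types and defaults, and keeping every document producer in sync with schema changes is where the maintenance burden lands.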
However, artificial intelligence is creating a new option for organizations to make better use of data in their documents. Using natural language processing, deep learning and other methods, AI can help recognize and categorize data in documents and then mark up that data to create a structured document. For example, NASA and the National Science Foundation have partnered with AI startup Docugami to explore how to use its technology to automatically scrape, structure and categorize documents and their elements.
Again, the challenge is not just extracting data from documents, but obtaining data and metadata from them to create meaning so that information can be understood in context. Indeed, Tim Berners-Lee (the inventor of the World Wide Web) and others promoted a vision of linked data called the “Semantic Web” over two decades ago, and the World Wide Web Consortium (W3C) has promoted various standards for encoding semantics with data, such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL). However, this vision for the web has largely gone unrealized for the same reasons organizations have struggled to move to structured data.
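The core idea behind RDF is that facts become subject–predicate–object triples that can be linked across documents. The sketch below models that idea with plain Python tuples rather than a real RDF library, and all of the identifiers are hypothetical, shortened stand-ins for URIs.

```python
# RDF expresses facts as (subject, predicate, object) triples.
# Identifiers here are hypothetical, shortened stand-ins for URIs.
triples = [
    ("invoice:4417", "ex:billedUnder", "contract:C-209"),
    ("invoice:4417", "ex:amountDue", "12450.00"),
    ("contract:C-209", "ex:approvedBy", "office:Procurement"),
]

def objects_of(subject, predicate, graph):
    """Return every object linked to a subject by a predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]

# Because facts are linked, a query can hop between documents:
contract = objects_of("invoice:4417", "ex:billedUnder", triples)[0]
print(objects_of(contract, "ex:approvedBy", triples))  # ['office:Procurement']
```

The catch, as the article notes, is that someone still has to produce those triples — the same labor problem that stalls structured documents generally.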
Using AI might help solve this problem. For example, Docugami’s founder notes that his company is focused on understanding not just “big data” but also “small data.” So if analysts are searching through thousands of unstructured medical documents for the word “penicillin,” the technology can distinguish between instances where the drug is listed as an allergy and those where it is listed as a prescription. For government agencies, this opens up new possibilities because more semantic data could help an agency not only better manage a wide variety of documents, such as invoices, contracts and proposals, but also eventually answer questions using the data contained within them — a search engine on steroids.
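A crude way to approximate the penicillin distinction is to look at the words surrounding each mention. The toy rule below is nothing like the trained models an AI system would actually use — the example sentences and cue words are invented — but it shows why context, not the keyword alone, carries the meaning.

```python
import re

# Toy clinical sentences; a real system would use trained models,
# not hand-written keyword rules like these.
notes = [
    "Patient reports an allergy to penicillin with hives.",
    "Prescribed penicillin 500 mg twice daily for ten days.",
]

def classify_mention(sentence, drug="penicillin"):
    """Crude context rule: look for allergy vs. prescription cues
    in the same sentence as the drug mention."""
    lowered = sentence.lower()
    if drug not in lowered:
        return "no mention"
    if re.search(r"allerg|reaction|hives", lowered):
        return "allergy"
    if re.search(r"prescrib|dose|\bmg\b|daily", lowered):
        return "prescription"
    return "unclassified"

for note in notes:
    print(classify_mention(note))  # allergy, then prescription
```

A plain keyword search would return both sentences indistinguishably; semantic labeling is what turns the match into an answer.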
Many government agencies still have significant digitization work ahead of them before these new technologies are likely to deliver real value. But these tools show where technology is heading and the possibilities that will emerge as AI continues to make inroads into government.