Archiving unstructured data is a business necessity across various industries. In this insight piece, we explore this need and the ways in which our high-volume scanning solution, ArcMate Capture, can help.

The amount of data generated by enterprises is enormous and constantly expanding, spanning petabytes of data and billions of files. Enterprises are required to retain a large share of that data for regulatory compliance. This is normally handled through information archiving, which covers the following:

  • Digitizing any data that is in paper form through scanners and scanning software.
  • Ingesting digital data such as electronic documents and electronic data from legacy applications.
  • Processing data to extract key information that is used in indexing the data to ensure that it can be easily searched and retrieved through dedicated interfaces.
  • Handling the backup and warehousing of the data, managing retention schedules, and offloading old data into cold storage.

Experts estimate that approximately 80% of enterprise data is unstructured, meaning it has not been organized in a predefined manner.

Until recently, information archiving has been very limited in extracting information from unstructured data. Very often, the cost-benefit analysis didn’t justify spending more time and resources on extracting more than just the basic metadata. Full-text OCR and indexing were common, but few companies were able to put them to good use.

This is now changing rapidly and there is a growing need to get more out of unstructured data. This is driven by several factors:

  • Compliance requirements are mounting: enterprises must maintain detailed records and be able to extract and present information easily, as opposed to pulling thousands of pages for manual review.
  • Enterprises are realizing there’s a huge untapped opportunity in all that unstructured data to extract countless insights that would not only serve regulatory compliance, but also help reduce costs, increase revenues, and improve customer experience.
  • Technology has gotten much better at processing unstructured data, from intelligent recognition of scanned paper documents to natural language processing, machine learning, and artificial intelligence applied to content.
  • With a lot of cold storage now shifting to cloud-native architecture, enterprise archives can easily plug into live applications through APIs or be delivered as a service.

For unstructured scanned documents, ArcMate Capture provides intelligent recognition of document content, with advanced configuration options. Using ArcMate Capture’s smart zones, the software can detect documents by appearance or patterns and process them accordingly. It can detect and capture key information on documents in various ways. For example:

  • It can locate and process dates and amounts regardless of their format.
  • It can locate names of people or entities and match them against predefined lists.
  • It can locate words matching predefined dictionaries.
  • It can locate surrounding words and sequences to extract key information, such as First Party and Second Party to a contract.
  • It can detect complex strings using regular expressions (regex). For example, account numbers that start with DXB-00 and have a total of 10 digits.
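
To make the account number example concrete, here is a short standalone Python sketch. It is not ArcMate Capture's smart zone syntax, which this piece does not show. The function name `find_account_numbers` is hypothetical, and the pattern assumes one reading of the rule: the prefix `DXB-` followed by exactly 10 digits, the first two of which are `00`.

```python
import re

# Assumed reading of the rule: "DXB-" followed by exactly 10 digits,
# the first two being "00". Adjust the quantifier if the rule differs.
ACCOUNT_PATTERN = re.compile(r"\bDXB-00\d{8}\b")


def find_account_numbers(text: str) -> list[str]:
    """Return every substring of `text` that matches the assumed pattern."""
    return ACCOUNT_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "Transfer from DXB-0012345678 to DXB-0112345678, ref DXB-00123."
    print(find_account_numbers(sample))
```

In the sample text only the first candidate matches: the second does not start with `DXB-00`, and the third has too few digits. The word boundaries (`\b`) keep the pattern from matching inside a longer run of digits.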

Example of Smart Zone Definition in ArcMate Capture

ArcMate Capture makes it possible to scale such operations with ease while maintaining performance and quality. With ArcMate Capture’s programmable workflows you can achieve the following:

  • Document scanning, classification, image quality optimization, data extraction, and quality control can all be automated and distributed across multiple workstations. These stages are managed from a central command and control dashboard, which allows for the best management and utilization of resources such as processing power and storage.
  • Documents can be sent through full text extraction stations that can run on separate schedules off-peak and can channel their output to third party applications that further process and analyze the extracted text.
  • Confidence levels in extracted data can be measured and managed to flag up anything that falls below a certain threshold for manual review.
  • Custom stages can be programmed to connect with other systems and fetch data that would help in the process.
  • Huge volumes of scanned documents can be processed per day at high throughput, thanks to ArcMate Capture’s optimized architecture.
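
The confidence threshold step in the list above can be sketched in a few lines of Python. This is a minimal illustration, not ArcMate Capture's workflow API: the `ExtractedField` type, the `split_for_review` function, and the 0.85 threshold are all hypothetical, and confidence is assumed to be a score between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    """One value captured from a document (hypothetical structure)."""
    name: str
    value: str
    confidence: float  # assumed to range from 0.0 to 1.0


def split_for_review(fields, threshold=0.85):
    """Separate fields that meet the threshold from those needing manual review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    flagged = [f for f in fields if f.confidence < threshold]
    return accepted, flagged


if __name__ == "__main__":
    fields = [
        ExtractedField("invoice_date", "2021-03-14", 0.97),
        ExtractedField("total_amount", "1,250.00", 0.62),
    ]
    accepted, flagged = split_for_review(fields)
    print("accepted:", [f.name for f in accepted])
    print("flagged for review:", [f.name for f in flagged])
```

A field whose confidence equals the threshold is accepted here; whether the boundary value should pass or be flagged is a policy choice that depends on how costly a missed error is.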

Definition of Stages in ArcMate Capture