
POST

Processor of Scanned Text

POST is a computer-aided system for processing scanned text. It is intended for small-scale projects that digitize scanned printed materials. POST aims to eliminate most of the routine work in such projects, limiting human involvement to proofing and minor editing of the computer-generated results. This way, high efficiency is achieved without compromising the quality of the end result.

POST runs on Unix (it has been tested on Linux and Mac OS X). It consists of two main parts, pbmprepare and CROX, which correspond to the two main stages of scanned text processing.

POST allows you to:
  • Split two-page images into separate pages

  • Deskew text on the page (i.e. eliminate the overall tilt)

  • Crop the pages to eliminate unwanted black shadows
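
As an illustration of the first step, here is a minimal Python sketch of one simple way to split a two-page scan: look for the emptiest pixel column near the middle of the image and cut there. This is not the heuristic POST itself uses. The function names find_gutter and split_pages, the search_fraction parameter, and the assumption that the gutter is mostly white are all hypothetical choices made for the example.

```python
def find_gutter(rows, search_fraction=0.2):
    """Return the column with the fewest black pixels near the image centre.

    rows: list of equal-length lists holding 0 (white) or 1 (black),
          the same convention as PBM.
    search_fraction: half-width of the search window as a fraction of
          the image width (an assumed, tunable value).
    """
    width = len(rows[0])
    centre = width // 2
    half_window = max(1, int(width * search_fraction))
    lo = max(0, centre - half_window)
    hi = min(width, centre + half_window + 1)
    # Count the black pixels in every column inside the search window.
    counts = [sum(row[x] for row in rows) for x in range(lo, hi)]
    # Pick the emptiest column; break ties by closeness to the centre.
    return min(range(lo, hi), key=lambda x: (counts[x - lo], abs(x - centre)))


def split_pages(rows):
    """Cut the image at the detected gutter and return (left, right)."""
    gutter = find_gutter(rows)
    left = [row[:gutter] for row in rows]
    right = [row[gutter:] for row in rows]
    return left, right


if __name__ == "__main__":
    # Two 4-pixel-wide "pages" separated by a 2-pixel white gutter.
    image = [[1, 1, 1, 1, 0, 0, 1, 1, 1, 1] for _ in range(3)]
    left, right = split_pages(image)
    print(len(left[0]), len(right[0]))
```

A real scan often has a dark shadow in the gutter rather than a white gap, so a column count alone would not be enough there; the sketch only shows the basic projection idea.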

CROX stands for "Crop that Works". It is a shell script that works with ImageMagick and allows a user to load any number of image files in any format, interactively crop each image, and write the cropped image files out in the original format. It is best suited for cropping a large number of files and can save work in a task file that can be used to resume cropping later. It is simple to use and has a minimal built-in help system. CROX is a standalone system that can be used independently of the rest of the processor.

pbmprepare is a shell script that prepares a task file for use with CROX starting from bitonal images in TIFF format. It can be thought of as a preprocessor for CROX and is the part of the system that does all the "intelligent" work. It does its job by using a set of specialized tools that work with intermediate PBM files and constitute an extension of netpbm. Some of the tools are listed below along with their functions:

  • pbmmedian - Uses a heuristic algorithm to split scanned images with two pages per image into the left- and right-hand-side pages.

  • pbmdeskew - Finds the skew angle of a bitonal image using a very fast and effective algorithm. It is a speed-enhanced version of the algorithm used in the lepton-lib. In fact, according to tests, it is probably the fastest such algorithm ever written.

  • pbmrotate - Rotates an image by a given angle using a very fast and accurate algorithm. It is specially tuned for images containing text, to minimize deskewing artifacts. It is several times faster than its closest competitor, lepton-lib, and produces images of unparalleled quality.

  • pbmpurify - Removes everything from an image that doesn't look like text. It makes the system robust against images that contain a lot of noise and scan shadows.

  • pbmanalyze - Uses an iterative heuristic algorithm to find the bounding (crop) box for the text area. Its output is used to create the task file for CROX.
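
To make the idea of a crop box concrete, the following Python sketch finds a bounding box from row and column counts of black pixels. It is a deliberately simple, single-pass stand-in, not the iterative heuristic used by pbmanalyze. The function name text_bounding_box and the min_black threshold are assumptions introduced for this example.

```python
def text_bounding_box(rows, min_black=2):
    """Return (left, top, right, bottom), inclusive, or None for a blank page.

    rows: list of equal-length lists holding 0 (white) or 1 (black).
    min_black: a row or column counts as content only if it holds at
          least this many black pixels, so isolated specks are ignored
          (an assumed, tunable value).
    """
    width = len(rows[0])
    # Projection profiles: black pixels per row and per column.
    row_counts = [sum(row) for row in rows]
    col_counts = [sum(row[x] for row in rows) for x in range(width)]
    ys = [y for y, count in enumerate(row_counts) if count >= min_black]
    xs = [x for x, count in enumerate(col_counts) if count >= min_black]
    if not ys or not xs:
        return None
    return xs[0], ys[0], xs[-1], ys[-1]


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0, 1],  # a single noise speck in the corner
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 1, 0],  # two rows of "text"
        [0, 1, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    print(text_bounding_box(page))  # (1, 2, 4, 3)
```

A fixed threshold like this ignores small specks but not large scan shadows, which is presumably why the real pipeline removes non-text content before the box is computed.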

Klein's lectures

An example of a scanned document (right-click to download the bitonal TIFF).

Left page Right page

The result of processing in POST (right-click to download the bitonal TIFFs).

Realistic Example and Benchmarks

To demonstrate the power and versatility of the processor, the results for a low-quality scan of an entire book are given below. Both before and after processing, the files were compressed using DjVu, one of the best compression technologies currently available.

Files:

  • Original scan: 2974 KB
  • Processed scan: 2492 KB

Number of pages: 231
Compressed size per page (original): 12.87 KB/page
Compressed size per page (processed): 10.79 KB/page
Processing mode: Automatic
Proofreading/Correction: Yes
Percentage of pages corrected: 3%

The entire processing cycle is highly streamlined: most of the processing is done automatically, while human intervention is limited (in the example above) to as few as 3% of the pages processed. This gives a crucial speed advantage while maintaining the high quality of the end result and reducing the compressed file size.
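
The per-page figures above follow directly from the file sizes and the page count. The short Python check below reproduces them and also derives the overall size reduction, which is not stated in the list but follows from the same numbers. The variable names are chosen for this example only.

```python
pages = 231
original_kb = 2974    # DjVu size of the original scan
processed_kb = 2492   # DjVu size after processing in POST

per_page_original = original_kb / pages
per_page_processed = processed_kb / pages
reduction = (original_kb - processed_kb) / original_kb

print(f"{per_page_original:.2f} KB/page")   # 12.87 KB/page
print(f"{per_page_processed:.2f} KB/page")  # 10.79 KB/page
print(f"{reduction:.1%} smaller")           # 16.2% smaller
```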

Molodshii's book

An example of a scanned document (right-click to download the bitonal TIFF).

Left page Right page

The result of processing in POST (right-click to download the bitonal TIFFs).

