POST: Processor of Scanned Text
POST is a computer-aided system for processing scanned text. It is
intended for small-scale projects that digitize scanned printed
materials. POST aims to eliminate most of the routine work in such
projects, limiting human involvement to proofing and minor editing of the
computer-generated results. This yields high efficiency without
compromising the quality of the end result.
POST runs under Unix (it has been tested on Linux and Mac OS X). It consists of two main parts, pbmprepare and CROX, which correspond to the two main stages of scanned text processing.
POST allows you to:
CROX stands for "Crop that Works". It is a shell script that works with ImageMagick and lets a user load any number of image files in any format, interactively crop each image, and write the cropped files out in their original format. It is best suited to cropping a large number of files, and it can save work in a task file that can be used to resume cropping later. It is simple to use and has a minimal built-in help system. CROX is a standalone system that can be used independently of the rest of the processor.
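The idea of a resumable task file can be illustrated with a minimal Python sketch. It is not CROX's actual implementation or file format: the tab-separated layout, the helper names `load_task`, `save_task`, and `pending`, and the file names are all assumptions made for illustration. The only borrowed convention is the `WxH+X+Y` geometry notation used by ImageMagick.

```python
from pathlib import Path


def load_task(task_path):
    """Read a hypothetical task file into {image_name: crop_geometry}.

    Each non-empty line is assumed to be "<image name><TAB><WxH+X+Y>".
    A missing file simply means no work has been saved yet.
    """
    task = {}
    path = Path(task_path)
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                name, geometry = line.split("\t")
                task[name] = geometry
    return task


def save_task(task_path, task):
    """Write the crops chosen so far, one image per line."""
    lines = [f"{name}\t{geometry}" for name, geometry in task.items()]
    Path(task_path).write_text("\n".join(lines) + ("\n" if lines else ""))


def pending(images, task):
    """Return the images that still have no crop recorded, in input order."""
    return [name for name in images if name not in task]
```

On resuming, a tool built this way would call `load_task`, skip every image already recorded, and continue with whatever `pending` returns.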
pbmprepare is a shell script that prepares a task file for use with
CROX, starting from bitonal images in TIFF format.
It can be thought of as a preprocessor for CROX
and is the part of the system that does all the "intelligent" work. It does
its job using a set of specialized tools that work with intermediate PBM
files and constitute an extension of
netpbm. Some of the tools are
listed below along with their functions:
An example of a scanned document (right-click to download the bitonal TIFF).
The result of processing in POST (right-click to download the bitonal TIFFs).
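To give a feel for the kind of analysis a preprocessor can perform on an intermediate PBM file, here is a minimal Python sketch that finds the bounding box of the black pixels in a plain (P1) PBM image and reports it as a `WxH+X+Y` crop geometry. This is an illustrative assumption, not pbmprepare's actual algorithm: the function names `parse_plain_pbm` and `content_geometry` and the `margin` parameter are hypothetical, and real scans would also need noise handling that is omitted here.

```python
def parse_plain_pbm(text):
    """Parse a plain (P1) PBM into a list of rows of 0/1 ints (1 = black)."""
    tokens = []
    for line in text.splitlines():
        # Everything after '#' on a line is a comment in the PBM format.
        tokens.extend(line.split("#", 1)[0].split())
    if not tokens or tokens[0] != "P1":
        raise ValueError("not a plain PBM (P1) file")
    width, height = int(tokens[1]), int(tokens[2])
    # P1 pixels may be written with or without whitespace between them.
    bits = "".join(tokens[3:])
    if len(bits) != width * height:
        raise ValueError("pixel count does not match the header")
    return [[int(b) for b in bits[r * width:(r + 1) * width]]
            for r in range(height)]


def content_geometry(rows, margin=0):
    """Return the bounding box of black pixels as 'WxH+X+Y', or None if blank."""
    height, width = len(rows), len(rows[0])
    ys = [y for y, row in enumerate(rows) if any(row)]
    if not ys:
        return None  # blank page: nothing to crop to
    xs = [x for x in range(width) if any(row[x] for row in rows)]
    # Grow the box by the margin, clamped to the image edges.
    left = max(min(xs) - margin, 0)
    top = max(min(ys) - margin, 0)
    right = min(max(xs) + margin, width - 1)
    bottom = min(max(ys) + margin, height - 1)
    return f"{right - left + 1}x{bottom - top + 1}+{left}+{top}"
```

A geometry string in this form could be stored in a task file or passed to ImageMagick's `-crop` option, leaving the final decision on each page to the person doing the proofing.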
Realistic Example and Benchmarks
To demonstrate the power and versatility of the processor, below are
results for a low-quality scan of an entire book.
Both before and after processing, the files were compressed using
DjVu, one of the best compression technologies currently available.
Files:
Number of pages: 231
The entire processing cycle is streamlined: most of the processing is done automatically by the processor, while human intervention is limited (in the example above) to as few as 3% of the pages processed. This gives a crucial speed advantage while maintaining the high quality of the end result and reducing the compressed file size.