Scanning with SANE’s scanimage from an ADF scanner to PDF and OCR’d text
January 5, 2008
Using libsane and tesseract, you can scan from an ADF (or non-ADF) scanner in Ubuntu to a PDF and an OCR’d text document in a few easy steps.
First we need to make sure we have the necessary packages installed.
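On a current Ubuntu the install would look something like `sudo apt-get install sane-utils tesseract-ocr libtiff-tools`. The `libtiff-tools` package (for `tiffcp` and `tiff2pdf`) is my assumption for the TIFF-to-PDF step, and package names may differ on your release. A small check like the sketch below, with a hypothetical helper named `missing_tools`, reports which commands are still missing:

```shell
#!/bin/sh
# Install the tools first (package names are assumptions, adjust as needed):
#   sudo apt-get install sane-utils tesseract-ocr libtiff-tools

# Print each named command that is not on the PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Prints nothing when everything needed for this walkthrough is installed.
missing_tools scanimage tesseract tiffcp tiff2pdf
```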
The tesseract-ocr package gives us a utility called tesseract, which takes a TIFF file as input and writes the recognized text to a .txt file.
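As a quick illustration, tesseract is called with the image and an output base name, and it adds the `.txt` extension itself. The wrapper name `ocr_page` below is hypothetical and only there to keep the example short:

```shell
#!/bin/sh
# OCR a single TIFF. The second argument to tesseract is the output base
# name; tesseract appends ".txt" on its own, so page1.tif gives page1.txt.
ocr_page() {
    tesseract "$1" "${1%.*}"
}
```

Running `ocr_page page1.tif` should leave `page1.txt` next to the image.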
Now we need a command-line way to grab the TIFFs from the scanner. For that, the sane-utils package comes to the rescue: its “scanimage” command does what we need here. It is a great utility, and I recommend reading up on its features and options, since they vary with the type of scanner you have. My scanner has an Auto Document Feeder (ADF), so be aware that my instructions are specific to an ADF scanner.
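To see what your own scanner supports, `scanimage -L` lists the detected devices, and `scanimage --help` prints the backend-specific options (sources, resolutions, scan area) for a device. A minimal sketch, with a hypothetical helper name `scanner_options`:

```shell
#!/bin/sh
# Show the options scanimage offers for one scanner.
# $1 (optional): a device name as printed by `scanimage -L`.
# Without it, scanimage describes its default device.
scanner_options() {
    if [ -n "$1" ]; then
        scanimage --help --device-name="$1"
    else
        scanimage --help
    fi
}
```

The source names listed there (for example an ADF entry) are what you would pass to `--source` when scanning.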
**Note: This example scans a letter-sized piece of paper in batch mode from an ADF and saves the output as TIFF.**
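A sketch of such a scan command follows. `--batch` and `--format` are standard scanimage options, while `--source`, `--resolution`, `-x` and `-y` depend on the backend, so the values here (source name `ADF`, 300 dpi) are assumptions to adapt to your scanner. The function name `scan_letter_batch` is hypothetical.

```shell
#!/bin/sh
# Scan every sheet in the feeder to numbered TIFFs (out1.tif, out2.tif, ...).
# US Letter is 8.5 x 11 in, i.e. 215.9 x 279.4 mm; scanimage sizes are in mm.
# $1 (optional): file name prefix for the pages, default "out".
scan_letter_batch() {
    scanimage \
        --batch="${1:-out}%d.tif" \
        --format=tiff \
        --source ADF \
        --resolution 300 \
        -x 215.9 -y 279.4
}
```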
This will output a new TIFF for each page that is scanned.
The script below combines several steps to output a single PDF document and a .txt file for a scan job.
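What follows is a reconstruction sketch of such a script rather than a definitive version. It assumes `tiffcp` and `tiff2pdf` from libtiff's tools for the PDF step, and the scanner options (`--source ADF`, 300 dpi, letter size) are assumptions you will likely need to adjust.

```shell
#!/bin/sh
# scandoc: scan all pages from the ADF into OUTPUT plus OUTPUT.txt.
# Usage: scandoc myoutput.pdf
scandoc() {
    out=$1
    work=$(mktemp -d) || return 1

    # 1. One TIFF per page. Zero-padded numbers keep the glob in page order.
    #    scanimage may exit non-zero once the feeder is empty, so success is
    #    judged by whether any page files were written.
    scanimage --batch="$work/page_%04d.tif" --format=tiff --source ADF \
        --resolution 300 -x 215.9 -y 279.4
    set -- "$work"/page_*.tif
    if [ ! -e "$1" ]; then
        echo "scandoc: no pages scanned" >&2
        rm -rf "$work"
        return 1
    fi

    # 2. Merge the pages into one multi-page TIFF, then convert it to PDF.
    if ! tiffcp "$@" "$work/all.tif" || ! tiff2pdf -o "$out" "$work/all.tif"; then
        rm -rf "$work"
        return 1
    fi

    # 3. OCR each page (tesseract writes <base>.txt) and join the text.
    : > "$out.txt"
    for page in "$@"; do
        tesseract "$page" "${page%.tif}" && cat "${page%.tif}.txt" >> "$out.txt"
    done

    rm -rf "$work"
}

# When saved as a script and given an argument, run the scan job.
if [ "$#" -gt 0 ]; then scandoc "$1"; fi
```

The temporary directory keeps the per-page files out of the current directory, so only the PDF and the text file remain afterwards.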
I name the above script “scandoc”. Running “scandoc myoutput.pdf” drops a PDF file (myoutput.pdf) and a text file (myoutput.pdf.txt) in the current directory, containing all the pages from the ADF. Very handy!
EDIT: I’ve added Joe’s contributions from the comments to a gist on GitHub.
EDIT2: Some copyediting and clarifications throughout.