This manual is for GNU Ocrad (version 0.22, 9 July 2013).
GNU Ocrad is an OCR (Optical Character Recognition) program and library based on a feature extraction method. It reads images in pbm (bitmap), pgm (greyscale) or ppm (color) formats and produces text in byte (8-bit) or UTF-8 formats. The pbm, pgm and ppm formats are collectively known as pnm.
Ocrad includes a layout analyser able to separate the columns or blocks of text normally found on printed pages.
1. Character sets: Input charsets and output formats
2. Invoking ocrad: Command line interface
3. Library version: Checking library version
4. Library functions: Descriptions of the library functions
5. Library error codes: Meaning of codes returned by functions
6. Image format conversion: How to convert other formats to pnm
7. Algorithm: How ocrad does its job
8. OCR results file: Description of the ORF file format
9. Reporting Bugs: Reporting bugs
Copyright © 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013 Antonio Diaz Diaz.
This manual is free documentation: you have unlimited permission to copy, distribute and modify it.
1. Character sets
The character set internally used by ocrad is ISO 10646, also known as UCS (Universal Character Set), which can represent over two thousand million characters (2^31).
As it is impractical to try to recognize one among so many different characters, you can tell ocrad which character sets to recognize. You do this with the `--charset' option.
If the input page contains characters from only one character set, say
`ISO-8859-15', you can use the default `byte' output
format. But in a page with `ISO-8859-9' and
`ISO-8859-15' characters, you can't tell if a code of 0xFD
represents a 'latin small letter i dotless' or a 'latin small letter y
with acute'. You should use `--format=utf8' instead.
Of course, you may request UTF-8 output in any case.
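The ambiguity of the code 0xFD can be seen by decoding the same byte under both character sets. The following Python sketch is only an illustration using Python's standard codecs; it does not call ocrad.

```python
import unicodedata

# The same byte value stands for different characters in different charsets.
raw = bytes([0xFD])

for codec in ("iso8859_9", "iso8859_15"):
    char = raw.decode(codec)
    print(codec, hex(ord(char)), unicodedata.name(char))

# Encoded as UTF-8, the two characters no longer collide.
dotless_i = raw.decode("iso8859_9").encode("utf-8")
y_acute = raw.decode("iso8859_15").encode("utf-8")
print(dotless_i, y_acute)
```

This is why `--format=utf8' is the safer choice when a page mixes character sets.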
NOTE: 10^9 is a thousand millions, a billion is a million millions (million^2), a trillion is a million million millions (million^3), and so on. Please, don't "embrace and extend" the meaning of prefixes, making communication among all people difficult. Thanks.
2. Invoking ocrad
The format for running ocrad is:
ocrad [options] [files]
Ocrad supports the following options:
`-h', `--help'
Print an informative help message describing the options and exit. `ocrad --verbose --help' also describes hidden options.
`-V', `--version'
Print the version number of ocrad on the standard output and exit.
`-a', `--append'
Append generated text to the output file instead of overwriting it.
`-c name', `--charset=name'
Enable recognition of the characters belonging to the given character set. You can repeat this option with different names to process a page containing characters from several character sets. If no charset is specified, `iso-8859-15' (latin9) is assumed. Try `--charset=help' for a list of valid charset names.
`-e name', `--filter=name'
Pass the output text through the given postprocessing filter.
`--filter=letters' forces every character that resembles a letter to be recognized as a letter. Other characters are output without change.
`--filter=letters_only' is the same as `--filter=letters', but other characters are discarded.
`--filter=numbers' forces every character that resembles a number to be recognized as a number. Other characters are output without change.
`--filter=numbers_only' is the same as `--filter=numbers', but other characters are discarded.
Try `--filter=help' for a list of valid filter names.
`-f', `--force'
Force overwrite of output files.
`-F name', `--format=name'
Select the output format. The valid names are `byte' and `utf8'. If no output format is specified, `byte' (8 bit) is assumed.
`-i', `--invert'
Invert image levels (white on black).
`-l', `--layout'
Enable page layout analysis. Ocrad is able to separate blocks of text of arbitrary shape as long as they are clearly delimited by white space.
`-o file', `--output=file'
Place the output into file instead of into the standard output.
`-q', `--quiet'
Quiet operation.
`-s value', `--scale=value'
Scale up the input image by value before layout analysis and recognition. If value is negative, the input image is scaled down by -value.
`-t name', `--transform=name'
Perform the given transformation (rotation or mirroring) on the input image before scaling, layout analysis and recognition. Try `--transform=help' for a list of valid transformation names.
`-T value', `--threshold=value'
Set the binarization threshold for pgm or ppm files, and for images scaled down with the `--scale' option. value should be a rational number between 0 and 1, and may be given as a percentage (50%), a fraction (1/2), or a decimal value (0.5). Image values greater than the threshold are converted to white. The default value is 0.5.
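The three accepted notations for the threshold value all denote the same rational number. As a rough sketch, a script that prepares ocrad command lines could normalize them as follows; `parse_threshold' is a hypothetical helper written for this example and is not part of ocrad.

```python
from fractions import Fraction

def parse_threshold(text):
    """Hypothetical helper: convert '50%', '1/2' or '0.5' to a float in [0, 1]."""
    text = text.strip()
    if text.endswith("%"):
        value = Fraction(text[:-1]) / 100
    else:
        value = Fraction(text)  # Fraction accepts both '1/2' and '0.5'
    if not 0 <= value <= 1:
        raise ValueError("threshold must be between 0 and 1: " + text)
    return float(value)

for text in ("50%", "1/2", "0.5"):
    print(text, "->", parse_threshold(text))
```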
`-u left,top,width,height', `--cut=left,top,width,height'
Cut the input image by the rectangle defined by left, top, width and height. Values may be relative to the image size (-1.0 <= value <= +1.0), or absolute (abs( value ) > 1). Negative values of left and top are relative to the right-bottom corner of the image. Values of width and height must be positive. Absolute and relative values can be mixed. For example, `ocrad --cut 700,960,1,1' will extract from `700,960' to the right-bottom corner of the image.
The cutting is performed before any other transformation (rotation or mirroring) on the input image, and before scaling, layout analysis and recognition.
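The rules for relative and absolute cut values can be sketched in a few lines of Python. This is only an interpretation of the description above, not ocrad's code: `resolve_cut' is a hypothetical helper, and it assumes that relative values are multiplied by the image size and that the resulting rectangle is clipped to the image bounds.

```python
def resolve_cut(left, top, width, height, img_w, img_h):
    """Hypothetical helper: return (x, y, w, h) in pixels for a cut request."""
    def to_pixels(value, size):
        # Values with abs(value) <= 1 are taken as relative to the image size.
        return round(value * size) if abs(value) <= 1 else round(value)

    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    x = to_pixels(left, img_w)
    y = to_pixels(top, img_h)
    # Negative left/top are measured from the right-bottom corner.
    if x < 0:
        x += img_w
    if y < 0:
        y += img_h
    if not (0 <= x < img_w and 0 <= y < img_h):
        raise ValueError("cut origin is outside the image")
    # Assumption: the rectangle is clipped to the image bounds.
    w = min(to_pixels(width, img_w), img_w - x)
    h = min(to_pixels(height, img_h), img_h - y)
    return x, y, w, h

# The example from the text, on an assumed 1000x1400 image.
print(resolve_cut(700, 960, 1, 1, 1000, 1400))
```

Under these assumptions a width or height of 1 means "the full image size", which after clipping extends the rectangle to the right-bottom corner.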
`-v', `--verbose'
Verbose mode.
`-x file', `--export=file'
Write (export) the OCR results file to file (see section OCR results file). `-x -' writes to stdout, overriding text output unless the text output has also been redirected with the `-o' option.
Exit status: 0 for a normal exit, 1 for environmental problems (file not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt or invalid input file, 3 for an internal consistency error (e.g., a bug) which caused ocrad to panic.
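A calling script can use the exit status to decide how to react. The Python sketch below assumes that an `ocrad' executable is available on the PATH; `run_ocrad' and `describe_exit_status' are hypothetical names chosen for this example.

```python
import subprocess

# Meanings of the exit status values listed above.
EXIT_STATUS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt or invalid input file",
    3: "internal consistency error",
}

def describe_exit_status(status):
    """Return a short description for an ocrad exit status."""
    return EXIT_STATUS.get(status, "unexpected exit status %d" % status)

def run_ocrad(arguments):
    """Run ocrad (assumed to be on the PATH); return (status, text, diagnostics).

    The text is returned as bytes because `byte' output is not necessarily
    valid UTF-8.
    """
    result = subprocess.run(["ocrad", *arguments], capture_output=True)
    return result.returncode, result.stdout, result.stderr
```

For example, a status of 2 suggests checking the input file, while a status of 3 is worth a bug report.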
3. Library version
Function: const char * OCRAD_version ( void )
Returns the library version as a string.
Constant: const char * OCRAD_version_string
This constant is defined in the header file `ocradlib.h'.
The application should compare OCRAD_version and OCRAD_version_string for consistency. If the first character differs, the library code actually used may be incompatible with the `ocradlib.h' header file used by the application.
if( OCRAD_version()[0] != OCRAD_version_string[0] ) error( "bad library version" );
4. Library functions
These are the OCRAD library functions. In case of error, all of them return -1 or a null pointer, except `OCRAD_open' whose return value must be verified by calling `OCRAD_get_errno' before using it.
Function: struct OCRAD_Descriptor * OCRAD_open ( void )
Initializes the internal library state and returns a pointer that can only be used as the ocrdes argument for the other OCRAD functions, or a null pointer if the descriptor could not be allocated.
The returned pointer must be verified by calling `OCRAD_get_errno' before using it. If `OCRAD_get_errno' does not return `OCRAD_ok', the returned pointer must not be used and should be freed with `OCRAD_close' to avoid memory leaks.
Function: int OCRAD_close ( struct OCRAD_Descriptor * const ocrdes )
Frees all dynamically allocated data structures for this descriptor. After a call to `OCRAD_close', ocrdes can no longer be used as an argument to any OCRAD function.
Function: enum OCRAD_Errno OCRAD_get_errno ( struct OCRAD_Descriptor * const ocrdes )
Returns the current error code for ocrdes (see section Library error codes).
Function: int OCRAD_set_image ( struct OCRAD_Descriptor * const ocrdes, const struct OCRAD_Pixmap * const image, const bool invert )
Loads image into the internal buffer. If invert is true, image levels are inverted (white on black). Loading a new image deletes any previous text results.
Function: int OCRAD_set_image_from_file ( struct OCRAD_Descriptor * const ocrdes, const char * const filename, const bool invert )
Loads an image from the file filename into the internal buffer. If invert is true, image levels are inverted (white on black). Loading a new image deletes any previous text results.
Function: int OCRAD_set_utf8_format ( struct OCRAD_Descriptor * const ocrdes, const bool utf8 )
Sets the output format to `byte' (if utf8=false) or to `utf8'. By default ocrad produces `byte' (8 bit) output.
Function: int OCRAD_set_threshold ( struct OCRAD_Descriptor * const ocrdes, const int threshold )
Sets the binarization threshold for greymap or RGB images. threshold values between 0 and 255 set a fixed threshold. A value of -1 sets an automatic threshold. Pixel values greater than the resulting threshold are converted to white. If this function is not called, the default threshold value is 127.
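The fixed-threshold rule described above can be illustrated with a short Python sketch. It only shows the comparison "greater than the threshold becomes white" on a list of grey levels; `binarize' is a hypothetical helper, not a library function, and the automatic threshold (-1) is not modelled because its method is not described here.

```python
def binarize(pixels, threshold=127):
    """Return True (white) or False (black) for each grey level in 0..255."""
    if not 0 <= threshold <= 255:
        raise ValueError("fixed threshold must be between 0 and 255")
    # Pixel values strictly greater than the threshold are converted to white.
    return [[value > threshold for value in row] for row in pixels]

row = [[0, 127, 128, 255]]
print(binarize(row))        # default threshold of 127
print(binarize(row, 200))   # a higher threshold turns more pixels black
```

Note that a pixel exactly equal to the threshold stays black.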
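Function: int OCRAD_scale ( struct OCRAD_Descriptor * const ocrdes, const int value )
Scales up the image in the internal buffer by value. If value is negative, the image is scaled down by -value.
Function: int OCRAD_recognize ( struct OCRAD_Descriptor * const ocrdes, const bool layout )
Recognizes the image loaded in the internal buffer and produces text results which can be retrieved later with the `OCRAD_result' functions. The same image can be recognized as many times as desired, for example setting a new threshold each time for 3D greymap recognition. Every time this function is called, the produced text results replace any previous ones. If layout is true, page layout analysis is enabled, probably producing more than one text block.
Function: int OCRAD_result_blocks ( struct OCRAD_Descriptor * const ocrdes )
Returns the number of text blocks found in the image by the layout analyser, or 1 if no layout analysis was requested.
Function: int OCRAD_result_lines ( struct OCRAD_Descriptor * const ocrdes, const int blocknum )
Returns the number of text lines contained in the given text block.
Function: int OCRAD_result_chars_total ( struct OCRAD_Descriptor * const ocrdes )
Returns the total number of text characters contained in the recognized image.
Function: int OCRAD_result_chars_block ( struct OCRAD_Descriptor * const ocrdes, const int blocknum )
Returns the number of text characters contained in the given text block.
Function: int OCRAD_result_chars_line ( struct OCRAD_Descriptor * const ocrdes, const int blocknum, const int linenum )
Returns the number of text characters contained in the given text line.
Function: const char * OCRAD_result_line ( struct OCRAD_Descriptor * const ocrdes, const int blocknum, const int linenum )
Returns the line of text specified by blocknum and linenum.
Function: int OCRAD_result_first_character ( struct OCRAD_Descriptor * const ocrdes )
Returns the byte result for the first character in the image. Returns 0 if the image has no characters or if the first character could not be recognized. This function is a convenient shortcut to the result for images containing a single character.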
5. Library error codes
Most library functions return -1 or a null pointer to indicate that they have failed. But this return value only tells you that an error has occurred. To find out what kind of error it was, you need to verify the error code by calling `OCRAD_get_errno'.
Library functions do not change the value returned by `OCRAD_get_errno' when they succeed; thus, the value returned by `OCRAD_get_errno' after a successful call is not necessarily OCRAD_ok, and you should not use `OCRAD_get_errno' to determine whether a call failed. If the call failed, then you can examine `OCRAD_get_errno'.
The error codes are defined in the header file `ocradlib.h'.
Constant: enum OCRAD_Errno OCRAD_ok
The value of this constant is 0 and is used to indicate that there is no error.
Constant: enum OCRAD_Errno OCRAD_bad_argument
At least one of the arguments passed to the library function was invalid.
Constant: enum OCRAD_Errno OCRAD_mem_error
No memory available. The system cannot allocate more virtual memory because its capacity is full.
Constant: enum OCRAD_Errno OCRAD_sequence_error
A library function was called in the wrong order. For example, `OCRAD_result_line' was called before `OCRAD_recognize'.
Constant: enum OCRAD_Errno OCRAD_library_error
A bug was detected in the library. Please report it (see section Reporting Bugs).
6. Image format conversion
There are a lot of image formats, but ocrad is able to decode only three of them: pbm, pgm and ppm. In this chapter you will find command examples and advice about how to convert image files to a format that ocrad can manage.
Portable Network Graphics file. Use the command `pngtopnm filename.png | ocrad'. In some cases, like the ocrad.png icon, you have to invert the image with the `-i' option: `pngtopnm filename.png | ocrad -i'.
Postscript or Portable Document Format file. Use the command `gs -sPAPERSIZE=a4 -sDEVICE=pnmraw -r300 -dNOPAUSE -dBATCH -sOutputFile=- -q filename.ps | ocrad'. You may also use the command `pstopnm -stdout -dpi=300 -pgm filename.ps | ocrad', but it seems not to work with pdf files. Also, old versions of pstopnm don't recognize the `-dpi' option and produce an image too small for OCR.
TIFF file. Use the command `tifftopnm filename.tiff | ocrad'.
JPEG file. Use the command `djpeg -greyscale -pnm filename.jpg | ocrad'. JPEG is a lossy format and is in general not recommended for text images.
Pnm file compressed with gzip. Use the command `gzip -cd filename.pnm.gz | ocrad'.
Pnm file compressed with lzip. Use the command `lzip -cd filename.pnm.lz | ocrad'.
7. Algorithm
Ocrad is mainly a research project. Many of the algorithms ocrad uses are ad hoc, and will change in successive releases as I myself gain understanding about OCR issues.
The overall working of ocrad may be described as follows:
1) read the image.
2) optionally, perform some transformations (cut, rotate, scale, etc).
3) optionally, perform layout detection.
4) remove frames and pictures.
5) detect characters and group them in lines.
6) recognize characters (very ad hoc; one algorithm per character).
7) correct some errors (transform l.OOO into 1.000, etc).
8) output result.
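Step 7 can be illustrated with a toy post-correction pass. The Python sketch below is not ocrad's code and its rule is an assumption made up for this example: a token made only of digits and digit look-alikes, with at least one `.' or `,' separator, is treated as a number.

```python
import re

# Letters that are easily confused with digits (toy table, not ocrad's).
LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})

# Digit-like groups joined by '.' or ',' separators, e.g. 'l.OOO'.
NUMBER_LIKE = re.compile(r"^[0-9lIOo]+(?:[.,][0-9lIOo]+)+$")

def fix_number(token):
    """Replace look-alike letters by digits if the token looks like a number."""
    return token.translate(LOOKALIKES) if NUMBER_LIKE.match(token) else token

def fix_line(line):
    """Apply fix_number to every space-separated token of a text line."""
    return " ".join(fix_number(token) for token in line.split(" "))

print(fix_line("total l.OOO units from Oslo"))
```

Ordinary words such as "Oslo" are left alone because they contain no separator; a real corrector needs far more context than this.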
Ocrad recognizes characters by their shape. The reason it is so fast is that it does not compare the shape of every character against some sort of database of shapes and then choose the best match. Instead, ocrad only examines the shape differences that are relevant to choose between two character categories, much like a binary search.
As there is no such thing as a free lunch, this approach has some drawbacks. It makes ocrad very sensitive to character defects, and makes it difficult to modify ocrad to recognize new characters.
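The idea of choosing between categories with a few targeted tests, instead of matching against every known shape, can be shown with a toy decision procedure. The two yes/no features below (`tall' and `closed_loop') are invented for this illustration and are not ocrad's actual tests.

```python
def classify(shape):
    """Toy classifier: pick one of 'b', 'h', 'o', 'c' using two yes/no tests.

    Each test discards half of the remaining candidates, so four
    characters need only two tests instead of four shape comparisons.
    """
    if shape["tall"]:
        return "b" if shape["closed_loop"] else "h"
    return "o" if shape["closed_loop"] else "c"

print(classify({"tall": True, "closed_loop": False}))
print(classify({"tall": False, "closed_loop": True}))
```

The sketch also shows both drawbacks: a defect that flips one feature (a broken loop, say) changes the answer outright, and adding a new character means reworking the tests.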
8. OCR results file
Calling ocrad with option `-x' produces an OCR results file (ORF), that is, a parsable file containing the OCR results. The ORF format is as follows:
For each text block in the source image, the following data follows:
For each line in every text block, the following data follows:
Running `./ocrad -x test.orf examples/test.pbm' in the source directory will give you an example ORF file.
9. Reporting Bugs
There are probably bugs in ocrad. There are certainly errors and omissions in this manual. If you report them, they will get fixed. If you don't, no one will ever know about them and they will remain unfixed for all eternity, if not longer.
If you find a bug in GNU Ocrad, please send electronic mail to bug-ocrad@gnu.org. Include the version number, which you can find by running `ocrad --version'.
This document was generated on July, 18 2013 using texi2html 1.76.