ocropy
======

[![Build Status](https://travis-ci.org/tmbdev/ocropy.svg)](https://travis-ci.org/tmbdev/ocropy)
[![CircleCI](https://circleci.com/gh/UB-Mannheim/ocropy/tree/pull%2F4.svg?style=svg)](https://circleci.com/gh/UB-Mannheim/ocropy/tree/pull%2F4)
[![license](https://img.shields.io/github/license/tmbdev/ocropy.svg)](https://github.com/tmbdev/ocropy/blob/master/LICENSE)
[![Wiki](https://img.shields.io/badge/wiki-11%20pages-orange.svg)](https://github.com/tmbdev/ocropy/wiki)
[![Join the chat at https://gitter.im/tmbdev/ocropy](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/tmbdev/ocropy?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

OCRopus is a collection of document analysis programs, not a turn-key OCR system.
In order to apply it to your documents, you may need to do some image preprocessing,
and possibly also train new models.

In addition to the recognition scripts themselves, there are a number of scripts for
ground truth editing and correction, measuring error rates, determining confusion matrices, etc.
OCRopus commands generally print a stack trace along with any error message;
this does not usually indicate a problem (a future release will suppress the
stack trace by default, since it confuses many users).
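
For example, once you have recognizer output and ground truth side by side, the
bundled evaluation scripts can be run roughly like this (illustrative
invocations only; the file layout is the one produced by the commands in the
Running section below, and each script's `--help` describes the exact options):

    # character error rate: compares each .gt.txt with the corresponding .txt
    ./ocropus-errs 'book/????/??????.gt.txt'

    # per-character confusion counts
    ./ocropus-econf 'book/????/??????.gt.txt'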

Installing
----------

To install OCRopus dependencies system-wide:

    $ sudo apt-get install $(cat PACKAGES)
    $ wget -nd http://www.tmbdev.net/en-default.pyrnn.gz
    $ mv en-default.pyrnn.gz models/
    $ sudo python setup.py install

Alternatively, dependencies can be installed into a
[Python Virtual Environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/):

    $ virtualenv ocropus_venv/
    $ source ocropus_venv/bin/activate
    $ pip install -r requirements.txt
    $ wget -nd http://www.tmbdev.net/en-default.pyrnn.gz
    $ mv en-default.pyrnn.gz models/
    $ python setup.py install

Installation into a [Conda](http://conda.pydata.org/) environment is also possible:

    $ conda create -n ocropus_env python=2.7
    $ source activate ocropus_env
    $ conda install --file requirements.txt
    $ wget -nd http://www.tmbdev.net/en-default.pyrnn.gz
    $ mv en-default.pyrnn.gz models/
    $ python setup.py install

To test the recognizer, run:

    $ ./run-test

Running
-------

To recognize pages of text, you need to run separate commands: binarization, page layout
analysis, and text line recognition. The default parameters and settings of OCRopus assume
300dpi binary black-on-white images. If your images are scanned at a different resolution, the
simplest thing to do is to downscale/upscale them to 300dpi. The text line recognizer is
fairly robust to different resolutions, but the layout analysis is quite resolution dependent.
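
For example, a scan whose density is known can be resampled with an external
tool such as ImageMagick (not part of OCRopus; file names here are
placeholders):

    # resample to 300 dpi with ImageMagick; if the scan carries no density
    # metadata, set it first (e.g. with -density 400)
    convert -units PixelsPerInch page-raw.png -resample 300 page-300dpi.png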

Here is an example for a page of Fraktur text (German);
you need to download the Fraktur model from tmbdev.net/ocropy/fraktur.pyrnn.gz to run this
example:
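
    # fetch the Fraktur model first (mirrors the model download in the
    # Installing section); skip this if models/fraktur.pyrnn.gz already exists
    wget -nd http://tmbdev.net/ocropy/fraktur.pyrnn.gz
    mv fraktur.pyrnn.gz models/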

    # perform binarization
    ./ocropus-nlbin tests/ersch.png -o book

    # perform page layout analysis
    ./ocropus-gpageseg 'book/????.bin.png'

    # perform text line recognition (on four cores, with a fraktur model)
    ./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/????/??????.bin.png'

    # generate HTML output
    ./ocropus-hocr 'book/????.bin.png' -o ersch.html

    # display the output
    firefox ersch.html

There are some things the currently trained models for ocropus-rpred
will not handle well, largely because they are nearly absent in the
current training data. That includes all-caps text, some special symbols
(including "?"), typewriter fonts, and subscripts/superscripts. This will
be addressed in a future release, and, of course, you are welcome to contribute
new, trained models.

You can also generate training data using ocropus-linegen:

    ocropus-linegen -t tests/tomsawyer.txt -f tests/DejaVuSans.ttf

This will create a directory "linegen/..." containing synthetic training data
suitable for training OCRopus.
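
The generated lines can then be fed to `ocropus-rtrain` to train a model on the
synthetic data. This is only a sketch: the output name is arbitrary and the
glob must match whatever files `ocropus-linegen` actually wrote (check the
`linegen/` directory and `ocropus-rtrain --help`):

    ocropus-rtrain -o synthetic-model 'linegen/*/*.png'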

## Roadmap

| Project Announcements |
|:----------------------|
| The text line recognizer has been ported to C++ and is now a separate project, the CLSTM project, available here: https://github.com/tmbdev/clstm |
| New GPU-capable text line recognizers and deep-learning based layout analysis methods are in the works and will be published as separate projects some time in 2017. |
| Please welcome @zuphilip and @kba as additional project maintainers. @tmb is busy developing new DNN models for document analysis (among other things). (10/15/2016) |

A lot of excellent packages have become available for deep learning, vision, and GPU computing over the last few years.
At the same time, it has now become feasible to address problems like layout analysis and text line following
through attentional and reinforcement learning mechanisms. I (@tmb) am planning to develop new software for the
traditional document analysis tasks using these new tools and techniques. The results will become available as
separate projects.

Note that for text line recognition and language modeling, you can also use the CLSTM command line tools. Apart from taking different command line options, they are drop-in replacements for the Python-based text line recognizer.

## Contributing

OCRopy and CLSTM are both command line driven programs. The best way to contribute is to create new command line programs using the same (simple) persistent representations as the rest of OCRopus.

The biggest needs are in the following areas:

 - text/image segmentation
 - text line detection and extraction
 - output generation (hOCR and hOCR-to-* transformations)
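
For orientation, the persistent representation shared by the existing tools is
just a directory tree of page and line files, roughly like the following
(a sketch inferred from the commands above; exact suffixes such as `.pseg.png`
may differ):

    book/
      0001.bin.png        # binarized page (ocropus-nlbin)
      0001.pseg.png       # page segmentation (ocropus-gpageseg)
      0001/
        010001.bin.png    # extracted text line image
        010001.txt        # recognizer output (ocropus-rpred)
        010001.gt.txt     # ground truth transcription, if available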

## CLSTM vs OCRopy

The CLSTM project (https://github.com/tmbdev/clstm) is a C++ replacement for
`ocropus-rtrain` and `ocropus-rpred` (it used to be a subproject of `ocropy`
but has since been moved into a separate project). It is significantly faster than
the Python versions and has minimal library dependencies, so it is suitable
for embedding into C++ programs.

Python and C++ models can _not_ be interchanged, both because the save file 
formats are different and because the text line normalization is slightly 
different. Error rates are about the same.

In addition, the C++ command line tool (`clstmctc`) has different command line 
options and currently requires loading training data into HDF5 files, instead
of being trained off a list of image files directly (image file-based training
will be added to `clstmctc` soon).

The CLSTM project also provides LSTM-based language modeling that works very
well for post-processing and correcting OCR output, as well as for a number
of other OCR-related tasks, such as dehyphenation or handling changes in
orthography (see our publications). You can train language models using `clstmtext`.

Generally, your best bet for CLSTM and OCRopy is to rely only on the command
line tools; that makes it easy to replace different components. In addition, you
should keep your OCR training data in .png/.gt.txt files so that you can easily 
retrain models as better recognizers become available.

After making CLSTM a full replacement for `ocropus-rtrain`/`ocropus-rpred`, the
next step will be to replace the binarization, text/image segmentation, and layout 
analysis in OCRopus with trainable 2D LSTM models.