scantools
1.0.4
Graphics manipulation with a view towards scanned documents
|
Reads and interprets HOCR files, the standard output file format for Optical Character Recognition systems. More...
#include <HOCRDocument.h>
Public Member Functions | |
HOCRDocument () | |
Constructs an empty HOCR document. | |
HOCRDocument (QIODevice *device) | |
Constructs an HOCR document from a QIODevice. More... | |
HOCRDocument (QString fileName) | |
Constructs an HOCR document from a file. More... | |
HOCRDocument (const QImage &image, QStringList languages=QStringList()) | |
Constructs an HOCR document by running the tesseract OCR engine. More... | |
void | clear () |
Resets the document. More... | |
bool | hasError () const |
Error status. More... | |
QString | error () const |
Error message. More... | |
bool | hasWarnings () const |
Warning status. More... | |
QSet< QString > | warnings () const |
Warning messages. More... | |
QSet< QString > | system () const |
System(s) that generated this file. More... | |
QSet< QString > | capabilities () const |
OCR capabilities. More... | |
QList< HOCRTextBox > | pages () const |
Pages in the document. More... | |
bool | isEmpty () const |
Returns true if the document contains no pages. More... | |
bool | hasText () const |
Checks whether the document contains text. More... | |
HOCRTextBox | takeFirstPage () |
Removes the first page of the document and returns it. More... | |
void | read (QIODevice *device) |
Reads an HOCR document from a QIODevice. More... | |
void | read (const QString &fileName) |
Reads an HOCR document from a file. More... | |
void | read (const QImage &image, const QStringList &languages=QStringList()) |
Generates an HOCR document by running the tesseract OCR engine. More... | |
QFont | suggestFont () const |
Suggest font. More... | |
QString | toPDF (const QString &fileName, resolution _resolution, const QString &title=QString(), const QPageSize &overridePageSize=QPageSize(), QFont *overrideFont=0) const |
Export to PDF. More... | |
QList< QImage > | toImages (QFont *overrideFont=0, QImage::Format format=QImage::Format_Grayscale8) const |
Export to images. More... | |
QString | toText () const |
Export this document as text. More... | |
void | append (const HOCRDocument &other) |
Appends other HOCRDocument. More... | |
Static Public Member Functions | |
static QStringList | tesseractLanguages () |
List of languages supported by tesseract. More... | |
static bool | areLanguagesSupportedByTesseract (const QStringList &lingos) |
Check if languages are supported by tesseract. More... | |
Reads and interprets HOCR files, the standard output file format for Optical Character Recognition systems.
It also provides an interface to the tesseract OCR engine.
This class reads and interprets HOCR files, or uses the tesseract OCR engine directly to turn an image into an HOCR document, which is then read. The result is a list of text boxes that can be rendered (e.g. onto a QImage) or turned into PDF drawing code.
The methods of this class are reentrant, but not thread safe.
Definition at line 41 of file HOCRDocument.h.
|
inline explicit |
Constructs an HOCR document from a QIODevice.
This is a convenience constructor that constructs an empty document and calls read(device).
device | Pointer to a QIODevice from which the file is read |
Definition at line 54 of file HOCRDocument.h.
|
inline explicit |
Constructs an HOCR document from a file.
This is a convenience constructor that constructs an empty document and calls read(fileName).
fileName | Name of the HOCR file |
Definition at line 63 of file HOCRDocument.h.
|
inline |
Constructs an HOCR document by running the tesseract OCR engine.
This is a convenience constructor that constructs an empty document and calls read(image, languages).
image | Image that is to be OCR'ed |
languages | List of languages that is passed on to the OCR engine. If empty, then "eng" (= English) is used as a default language. |
Definition at line 75 of file HOCRDocument.h.
void HOCRDocument::append | ( | const HOCRDocument & | other | ) |
Appends other HOCRDocument.
Appends another HOCRDocument to this document. This method must not be called if either this document or the other document has an error status set.
other | Document to be appended to the current one. |
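The precondition on append() can be wrapped in a small guard. The following is a minimal sketch, not part of scantools: the helper name appendIfClean and the stand-in type FakeDocument are hypothetical, and the helper relies only on the members hasError() and append() documented on this page.

```cpp
// Sketch of a guard around append(). It works with any document type that
// offers hasError() and append(), the members named in the text above.
// The name appendIfClean is hypothetical and not part of scantools.
template <typename Doc>
bool appendIfClean(Doc &target, const Doc &other)
{
    if (target.hasError() || other.hasError())
        return false;   // precondition of append() not met; target is left untouched
    target.append(other);
    return true;
}

// Hypothetical stand-in, used only to exercise the helper without Qt.
// It tracks a page count instead of real page content.
struct FakeDocument {
    bool failed;
    int pageCount;
    bool hasError() const { return failed; }
    void append(const FakeDocument &other) { pageCount += other.pageCount; }
};
```

With the real class, the same helper would be called as appendIfClean(doc, otherDoc), assuming both arguments are HOCRDocument objects.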
|
static |
Check if languages are supported by tesseract.
lingos | A list of language codes, such as "deu" or "fra" |
|
inline |
OCR capabilities.
If specified in the HOCR file, this method returns a set of strings describing the OCR capabilities ("ocr_page ocr_carea ocr_par ocr_line ocrx_word"). If nothing is specified, an empty set is returned.
Definition at line 129 of file HOCRDocument.h.
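To illustrate how a capability string such as the one quoted above maps to a set of strings, here is a minimal sketch. It assumes the capabilities are separated by whitespace, uses std::set<std::string> as a stand-in for QSet<QString>, and is not the parser used by this class. The function name splitCapabilities is hypothetical.

```cpp
#include <set>
#include <sstream>
#include <string>

// Splits a whitespace-separated capability string into a set of tokens.
// Assumption: tokens are separated by whitespace. An empty or all-blank
// input yields an empty set.
std::set<std::string> splitCapabilities(const std::string &content)
{
    std::set<std::string> result;
    std::istringstream stream(content);
    std::string token;
    while (stream >> token)
        result.insert(token);
    return result;
}
```

A caller could then test for a single capability, for instance word-level boxes, with splitCapabilities(text).count("ocrx_word").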
void HOCRDocument::clear | ( | ) |
Resets the document.
This method clears all errors, warnings, and any file content.
|
inline |
Error message.
Definition at line 96 of file HOCRDocument.h.
|
inline |
Error status.
Definition at line 88 of file HOCRDocument.h.
bool HOCRDocument::hasText | ( | ) | const |
Checks whether the document contains text.
This differs from 'isEmpty()' because a non-empty document can contain empty pages.
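The distinction can be made concrete with a toy model. This is a sketch under stated assumptions, not the real class: ModelDocument is a hypothetical type that stores one string per page, and it treats a page as empty when its string has length zero. How HOCRDocument itself treats whitespace-only pages is not stated here.

```cpp
#include <string>
#include <vector>

// Hypothetical model: one text string per page.
struct ModelDocument {
    std::vector<std::string> pageTexts;

    // True if there are no pages at all.
    bool isEmpty() const { return pageTexts.empty(); }

    // True if at least one page carries a non-empty string.
    bool hasText() const
    {
        for (const std::string &text : pageTexts)
            if (!text.empty())
                return true;
        return false;
    }
};
```

A document made of two blank pages is therefore not empty, yet it has no text.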
|
inline |
Warning status.
Definition at line 105 of file HOCRDocument.h.
|
inline |
Returns true if the document contains no pages.
Definition at line 142 of file HOCRDocument.h.
|
inline |
Pages in the document.
Definition at line 136 of file HOCRDocument.h.
void HOCRDocument::read | ( | const QImage & | image, |
const QStringList & | languages = QStringList() |
||
) |
Generates an HOCR document by running the tesseract OCR engine.
This method runs the tesseract OCR engine on the given image, and reads the results into the present document. If languages are not available in the present installation of tesseract, an error condition is set.
image | Image that is to be OCR'ed |
languages | The languages specified here will be passed on to tesseract, in order to improve recognition quality. Tesseract identifies languages by their 3-character ISO 639-2 language codes (e.g. "deu" for German or "fra" for French). The languages specified must be present in the current tesseract installation. If empty, then "eng" (= English) is used as a default language. |
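The language rules described above (fall back to "eng" when the list is empty, and require every language to be installed) can be modelled as follows. This is a sketch using standard containers in place of QStringList. The function names effectiveLanguages and allInstalled are hypothetical, and the list of installed languages is passed in explicitly rather than queried from tesseract.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the language list that would be used: the request itself, or
// {"eng"} when the request is empty (the documented default).
std::vector<std::string> effectiveLanguages(const std::vector<std::string> &requested)
{
    if (requested.empty())
        return std::vector<std::string>(1, "eng");
    return requested;
}

// True if every effective language occurs in the list of installed languages.
bool allInstalled(const std::vector<std::string> &requested,
                  const std::vector<std::string> &installed)
{
    const std::vector<std::string> wanted = effectiveLanguages(requested);
    for (const std::string &code : wanted)
        if (std::find(installed.begin(), installed.end(), code) == installed.end())
            return false;
    return true;
}
```

In real code, the static methods tesseractLanguages() and areLanguagesSupportedByTesseract() documented on this page serve the purpose of the second function.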
void HOCRDocument::read | ( | const QString & | fileName | ) |
Reads an HOCR document from a file.
This is a convenience method that opens a file and calls read(QIODevice *device).
fileName | Name of the HOCR file |
void HOCRDocument::read | ( | QIODevice * | device | ) |
Reads an HOCR document from a QIODevice.
This method clears the document, and reads an HOCR file from a QIODevice. After the method returns, the caller should check if an error occurred, by using the methods hasError() and/or error(). The caller might also wish to check if the file contained fixable errors, by calling the method warnings().
The absence of an error or of warnings does not imply that the HOCR file is valid. In fact, only minimal checks are made.
device | QIODevice from which the file should be read |
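The check recommended after read() can be written as a three-way classification. The sketch below is not part of scantools: ReadOutcome, classifyRead and StubStatus are hypothetical names, and the helper depends only on the documented members hasError() and hasWarnings().

```cpp
// Possible results of inspecting a document after read().
enum class ReadOutcome { Ok, OkWithWarnings, Failed };

// Classifies any document type that offers hasError() and hasWarnings().
// Note that Ok only means "no error and no warning was recorded"; as stated
// above, it does not imply that the HOCR file is valid.
template <typename Doc>
ReadOutcome classifyRead(const Doc &doc)
{
    if (doc.hasError())
        return ReadOutcome::Failed;
    if (doc.hasWarnings())
        return ReadOutcome::OkWithWarnings;
    return ReadOutcome::Ok;
}

// Hypothetical stand-in, used only to exercise the helper without Qt.
struct StubStatus {
    bool error;
    bool warning;
    bool hasError() const { return error; }
    bool hasWarnings() const { return warning; }
};
```

An error takes precedence over warnings, so a document that reports both is classified as Failed.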
QFont HOCRDocument::suggestFont | ( | ) | const |
Suggest font.
Suggests a font for rendering this document. Three standard fonts ("Helvetica", "Times", "Courier") are tried, and the one that fits the text box best is chosen.
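The selection idea can be sketched as follows. The criterion for "fits best" is not specified here, so the example makes a loud assumption: each candidate font comes with the width its rendered text would have, and the best fit is the candidate whose width is closest to the width of the text box. FontCandidate and bestFittingFamily are hypothetical names, and this is not the algorithm used by suggestFont().

```cpp
#include <cmath>
#include <string>
#include <vector>

// Hypothetical candidate: a font family and the width the text would have
// when rendered in that font (measured elsewhere, e.g. with font metrics).
struct FontCandidate {
    std::string family;
    double renderedWidth;
};

// Returns the family whose rendered width is closest to boxWidth.
// ASSUMPTION: "closest width" stands in for the unspecified fit criterion.
// Returns an empty string if there are no candidates; ties go to the
// candidate listed first.
std::string bestFittingFamily(const std::vector<FontCandidate> &candidates, double boxWidth)
{
    std::string best;
    double bestMismatch = -1.0;   // negative means "nothing chosen yet"
    for (const FontCandidate &candidate : candidates) {
        const double mismatch = std::fabs(candidate.renderedWidth - boxWidth);
        if (bestMismatch < 0.0 || mismatch < bestMismatch) {
            bestMismatch = mismatch;
            best = candidate.family;
        }
    }
    return best;
}
```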
|
inline |
System(s) that generated this file.
Definition at line 119 of file HOCRDocument.h.
|
inline |
Removes the first page of the document and returns it.
This function assumes the document contains pages. To avoid failure, call isEmpty() before calling this function.
Definition at line 162 of file HOCRDocument.h.
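A loop that respects this precondition checks isEmpty() before every call. The sketch below is not part of scantools: drainPages and StubPages are hypothetical names, and the helper relies only on the documented members isEmpty() and takeFirstPage().

```cpp
#include <deque>
#include <string>

// Removes pages one at a time, front to back, and hands each to visit().
// isEmpty() is checked before every takeFirstPage() call, so the
// precondition stated above always holds. Returns the number of pages taken.
template <typename Doc, typename Visitor>
int drainPages(Doc &doc, Visitor visit)
{
    int processed = 0;
    while (!doc.isEmpty()) {
        visit(doc.takeFirstPage());
        ++processed;
    }
    return processed;
}

// Hypothetical stand-in, used only to exercise the helper without Qt.
// Pages are plain strings here; the real class returns HOCRTextBox objects.
struct StubPages {
    std::deque<std::string> pages;
    bool isEmpty() const { return pages.empty(); }
    std::string takeFirstPage()
    {
        std::string first = pages.front();
        pages.pop_front();
        return first;
    }
};
```

After the loop finishes the document contains no pages, which is useful when pages are processed one by one and should not stay in memory.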
|
static |
List of languages supported by tesseract.
QList<QImage> HOCRDocument::toImages | ( | QFont * | overrideFont = 0 , |
QImage::Format | format = QImage::Format_Grayscale8 |
||
) | const |
Export to images.
Renders the pages of the document to a sequence of images. We expect that this method will mainly be used for debugging purposes. This method must not be called if an error condition is set.
overrideFont | If null, the method will try a few standard fonts, to see which one fits the document best. The best font will then be taken. If not null, then the specified font will be taken. |
format | Format of the resulting graphics files |
QString HOCRDocument::toPDF | ( | const QString & | fileName, |
resolution | _resolution, | ||
const QString & | title = QString() , |
||
const QPageSize & | overridePageSize = QPageSize() , |
||
QFont * | overrideFont = 0 |
||
) | const |
Export to PDF.
Renders the document to a PDF file. The file is not in PDF/A format, as we expect that this method will mainly be used for debugging purposes. This method must not be called if an error condition is set.
fileName | Name of the PDF file that will be created. If the file exists, it will be overwritten. |
_resolution | Resolution with which the document will be rendered. |
title | Title string that will be set in the PDF file's metadata. |
overridePageSize | If null, then the page size is computed individually for each page from the resolution and the page's bounding box. If not null, then the given page size will be used for all pages of the document. |
overrideFont | If null, the method will try a few standard fonts, to see which one fits the document best. The best font will then be taken. If not null, then the specified font will be taken. |
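When no page size is given, the size of each page follows from its bounding box and the resolution. The exact computation is not spelled out here. The sketch below shows the standard conversion from pixels to PDF points (1 point = 1/72 inch), assuming the bounding box is measured in pixels and the resolution in dots per inch. The function name pixelsToPoints is hypothetical, and a plain double is used in place of the resolution class.

```cpp
// Converts a length in pixels to PDF points (1/72 inch), given a
// resolution in dots per inch.
// ASSUMPTION: bounding boxes are in pixels; dpi is a plain double here.
double pixelsToPoints(int pixels, double dpi)
{
    return pixels * 72.0 / dpi;
}
```

For example, a page whose bounding box is 2480 by 3508 pixels, scanned at 300 dpi, comes out at 595.2 by 841.92 points, which is close to A4.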
QString HOCRDocument::toText | ( | ) | const |
Export this document as text.
This method must not be called if an error condition is set.
|
inline |
Warning messages.
Definition at line 111 of file HOCRDocument.h.