scantools  1.0.4
Graphics manipulation with a view towards scanned documents
HOCRDocument Class Reference

Reads and interprets HOCR files, the standard output file format for Optical Character Recognition systems.

#include <HOCRDocument.h>

Public Member Functions

 HOCRDocument ()
 Constructs an empty HOCR document.

 HOCRDocument (QIODevice *device)
 Constructs an HOCR document from a QIODevice.

 HOCRDocument (QString fileName)
 Constructs an HOCR document from a file.

 HOCRDocument (const QImage &image, QStringList languages=QStringList())
 Constructs an HOCR document by running the tesseract OCR engine.

void clear ()
 Resets the document.

bool hasError () const
 Error status.

QString error () const
 Error message.

bool hasWarnings () const
 Warning status.

QSet< QString > warnings () const
 Warning messages.

QSet< QString > system () const
 System(s) that generated this file.

QSet< QString > capabilities () const
 OCR capabilities.

QList< HOCRTextBox > pages () const
 Pages in the document.

bool isEmpty () const
 Returns true if the document contains no pages.

bool hasText () const
 Checks whether the document contains text.

HOCRTextBox takeFirstPage ()
 Removes the first page of the document and returns it.

void read (QIODevice *device)
 Reads an HOCR document from a QIODevice.

void read (const QString &fileName)
 Reads an HOCR document from a file.

void read (const QImage &image, const QStringList &languages=QStringList())
 Generates an HOCR document by running the tesseract OCR engine.

QFont suggestFont () const
 Suggest font.

QString toPDF (const QString &fileName, resolution _resolution, const QString &title=QString(), const QPageSize &overridePageSize=QPageSize(), QFont *overrideFont=0) const
 Export to PDF.

QList< QImage > toImages (QFont *overrideFont=0, QImage::Format format=QImage::Format_Grayscale8) const
 Export to images.

QString toText () const
 Export this document as text.

void append (const HOCRDocument &other)
 Appends other HOCRDocument.


Static Public Member Functions

static QStringList tesseractLanguages ()
 List of languages supported by tesseract.

static bool areLanguagesSupportedByTesseract (const QStringList &lingos)
 Check if languages are supported by tesseract.
 

Detailed Description

Reads and interprets HOCR files, the standard output file format for Optical Character Recognition systems.

It also provides an interface to the tesseract OCR engine.

This class reads and interprets HOCR files, or uses the tesseract OCR engine directly to turn an image into an HOCR document, which is then read. The result is a list of text boxes that can be rendered (e.g. onto a QImage) or turned into PDF drawing code.

The methods of this class are reentrant, but not thread-safe.

Definition at line 41 of file HOCRDocument.h.

Constructor & Destructor Documentation

◆ HOCRDocument() [1/3]

HOCRDocument::HOCRDocument ( QIODevice *  device)
inline explicit

Constructs an HOCR document from a QIODevice.

This is a convenience constructor that constructs an empty document and calls read(device).

Parameters
device  Pointer to a QIODevice from which the file is read

Definition at line 54 of file HOCRDocument.h.

◆ HOCRDocument() [2/3]

HOCRDocument::HOCRDocument ( QString  fileName)
inline explicit

Constructs an HOCR document from a file.

This is a convenience constructor that constructs an empty document and calls read(fileName).

Parameters
fileName  Name of the HOCR file

Definition at line 63 of file HOCRDocument.h.

◆ HOCRDocument() [3/3]

HOCRDocument::HOCRDocument ( const QImage &  image,
QStringList  languages = QStringList() 
)
inline

Constructs an HOCR document by running the tesseract OCR engine.

This is a convenience constructor that constructs an empty document and calls read(image, languages).

Parameters
image  Image that is to be OCR'ed
languages  List of languages that is passed on to the OCR engine. If empty, then "eng" (= English) is used as a default language.

Definition at line 75 of file HOCRDocument.h.

Member Function Documentation

◆ append()

void HOCRDocument::append ( const HOCRDocument &  other)

Appends other HOCRDocument.

Appends another HOCRDocument to this document. This method must not be called if either this document or the other document has an error status set.

Parameters
other  Document to be appended to the current one.

◆ areLanguagesSupportedByTesseract()

static bool HOCRDocument::areLanguagesSupportedByTesseract ( const QStringList &  lingos)
static

Check if languages are supported by tesseract.

Parameters
lingos  A list of language codes, such as "deu" or "fra"
Returns
true if every language code in lingos is contained in the list returned by tesseractLanguages()
See also
tesseractLanguages()
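
The subset check described above can be sketched in standard C++ as follows. This is only an illustration under stated assumptions, not the library's implementation: std::vector<std::string> stands in for QStringList, and the function name allLanguagesInstalled is hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for the documented check. Returns true if every
// requested language code also appears in the list of installed languages.
// An empty request is trivially a subset, so it yields true.
bool allLanguagesInstalled(const std::vector<std::string>& requested,
                           const std::vector<std::string>& installed)
{
    return std::all_of(requested.begin(), requested.end(),
                       [&installed](const std::string& code) {
                           return std::find(installed.begin(), installed.end(), code)
                                  != installed.end();
                       });
}
```

With an installed list of "eng" and "deu", a request for "deu" passes, while a request for "deu" and "fra" fails because "fra" is missing.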

◆ capabilities()

QSet<QString> HOCRDocument::capabilities ( ) const
inline

OCR capabilities.

If specified in the HOCR file, this method returns a set of strings describing the OCR capabilities ("ocr_page ocr_carea ocr_par ocr_line ocrx_word"). If nothing is specified, an empty set is returned.

Returns
Set of strings that describe the capabilities.

Definition at line 129 of file HOCRDocument.h.

◆ clear()

void HOCRDocument::clear ( )

Resets the document.

This method clears all errors, warnings, and any file content.

◆ error()

QString HOCRDocument::error ( ) const
inline

Error message.

Returns
If read() succeeded, the method returns an empty string. Otherwise, this method returns a description of the error in English.

Definition at line 96 of file HOCRDocument.h.

◆ hasError()

bool HOCRDocument::hasError ( ) const
inline

Error status.

Returns
true if read() terminated with an error. In that case, an error description can be retrieved using the error() method.

Definition at line 88 of file HOCRDocument.h.

◆ hasText()

bool HOCRDocument::hasText ( ) const

Checks whether the document contains text.

This differs from isEmpty() because a non-empty document can contain empty pages.

Returns
true if the document contains text.
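
The difference between isEmpty() and hasText() can be illustrated with a small model in standard C++. The sketch assumes that each page is reduced to the text it contains. The alias PageText and both free functions are hypothetical stand-ins, not part of the class.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical model: a page is represented only by its text content.
using PageText = std::string;

// Models the documented isEmpty(): true if there are no pages at all.
bool isEmpty(const std::vector<PageText>& pages)
{
    return pages.empty();
}

// Models the documented hasText(): true if at least one page holds text.
bool hasText(const std::vector<PageText>& pages)
{
    return std::any_of(pages.begin(), pages.end(),
                       [](const PageText& text) { return !text.empty(); });
}
```

A document made of two blank pages is not empty, yet it has no text. This is the case in which the two methods disagree.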

◆ hasWarnings()

bool HOCRDocument::hasWarnings ( ) const
inline

Warning status.

Returns
true if read() found issues in the HOCR file that could be fixed. In that case, descriptions of the issues can be retrieved using the warnings() method.

Definition at line 105 of file HOCRDocument.h.

◆ isEmpty()

bool HOCRDocument::isEmpty ( ) const
inline

Returns true if the document contains no pages.

Returns
true if the document contains no pages, otherwise false

Definition at line 142 of file HOCRDocument.h.

◆ pages()

QList<HOCRTextBox> HOCRDocument::pages ( ) const
inline

Pages in the document.

Returns
a list (possibly empty) containing all the pages of the document, in order.

Definition at line 136 of file HOCRDocument.h.

◆ read() [1/3]

void HOCRDocument::read ( const QImage &  image,
const QStringList &  languages = QStringList() 
)

Generates an HOCR document by running the tesseract OCR engine.

This method runs the tesseract OCR engine on the given image, and reads the results into the present document. If languages are not available in the present installation of tesseract, an error condition is set.

Parameters
image  Image that is to be OCR'ed
languages  The languages specified here will be passed on to tesseract, in order to improve recognition quality. Tesseract identifies languages by their 3-character ISO 639-2 language codes (e.g. "deu" for German or "fra" for French). The languages specified must be present in the current tesseract installation. If empty, then "eng" (= English) is used as a default language.
See also
tesseractLanguages()
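
The language handling described above (fall back to "eng" when the list is empty, then require every language to be installed) might look like the following sketch in standard C++. It does not show the library's code. The names effectiveLanguages and missingLanguages are hypothetical, and std::vector<std::string> stands in for QStringList.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Applies the documented default: an empty request means "eng".
std::vector<std::string> effectiveLanguages(const std::vector<std::string>& requested)
{
    if (requested.empty()) {
        return {"eng"};
    }
    return requested;
}

// Returns the effective language codes that are not installed.
// A non-empty result corresponds to the documented error condition.
std::vector<std::string> missingLanguages(const std::vector<std::string>& requested,
                                          const std::vector<std::string>& installed)
{
    std::vector<std::string> missing;
    for (const std::string& code : effectiveLanguages(requested)) {
        if (std::find(installed.begin(), installed.end(), code) == installed.end()) {
            missing.push_back(code);
        }
    }
    return missing;
}
```

In real code, a caller could perform the same check up front with areLanguagesSupportedByTesseract() before calling read().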

◆ read() [2/3]

void HOCRDocument::read ( const QString &  fileName)

Reads an HOCR document from a file.

This is a convenience method that opens a file and calls read(QIODevice *device).

Parameters
fileName  Name of the HOCR file

◆ read() [3/3]

void HOCRDocument::read ( QIODevice *  device)

Reads an HOCR document from a QIODevice.

This method clears the document, and reads an HOCR file from a QIODevice. After the method returns, the caller should check whether an error occurred, by using the methods hasError() and/or error(). The caller might also wish to check whether the file contained fixable errors, by calling the method warnings().

The absence of an error or of warnings does not imply that the HOCR file is valid. In fact, only minimal checks are made.

Parameters
device  QIODevice from which the file should be read
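
The checking order recommended above (errors first, then warnings) can be sketched with a small stand-in type in standard C++. DocumentStatus, ReadOutcome and classify are hypothetical names used only for illustration. The sketch also assumes that hasError() is equivalent to a non-empty error message, which matches the description of error() but is not confirmed by the source.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for the status accessors of the document class.
struct DocumentStatus {
    std::string error;               // empty means that read() succeeded
    std::set<std::string> warnings;  // descriptions of fixable issues

    bool hasError() const { return !error.empty(); }
    bool hasWarnings() const { return !warnings.empty(); }
};

enum class ReadOutcome { Failed, UsableWithWarnings, Clean };

// Checks the error status first; warnings only matter if reading succeeded.
ReadOutcome classify(const DocumentStatus& status)
{
    if (status.hasError()) {
        return ReadOutcome::Failed;
    }
    if (status.hasWarnings()) {
        return ReadOutcome::UsableWithWarnings;
    }
    return ReadOutcome::Clean;
}
```

Note that even a Clean outcome only means that the minimal checks passed. It does not prove that the HOCR file is valid.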

◆ suggestFont()

QFont HOCRDocument::suggestFont ( ) const

Suggest font.

Suggests a font for rendering this document. Three standard fonts ("Helvetica", "Times", "Courier") are tried, and the one that fits the text box best is chosen.

Returns
Font that fits the text box best
Warning
This method is expensive.

◆ system()

QSet<QString> HOCRDocument::system ( ) const
inline

System(s) that generated this file.

Returns
If specified in the HOCR file, this method returns strings describing the system that generated this file ("tesseract 3.04.01"). If nothing is specified, an empty set is returned.

Definition at line 119 of file HOCRDocument.h.

◆ takeFirstPage()

HOCRTextBox HOCRDocument::takeFirstPage ( )
inline

Removes the first page of the document and returns it.

This function assumes the document contains pages. To avoid failure, call isEmpty() before calling this function.

Returns
First page of the document. If the document is empty, an empty HOCRTextBox is returned.

Definition at line 162 of file HOCRDocument.h.
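
The recommended pattern of checking for emptiness before taking a page can be modelled in standard C++ as follows. This is a sketch under stated assumptions, not the class itself: Page is a hypothetical stand-in for HOCRTextBox, a std::deque stands in for the internal page list, and drainPages is an illustrative helper.

```cpp
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for HOCRTextBox.
using Page = std::string;

// Removes and returns the first page. Returns an empty Page if there is
// none, matching the documented behaviour for an empty document.
Page takeFirstPage(std::deque<Page>& pages)
{
    if (pages.empty()) {
        return Page();
    }
    Page first = std::move(pages.front());
    pages.pop_front();
    return first;
}

// Consumes the document page by page, checking for emptiness before
// each call so that the empty fallback is never needed.
std::vector<Page> drainPages(std::deque<Page>& pages)
{
    std::vector<Page> taken;
    while (!pages.empty()) {
        taken.push_back(takeFirstPage(pages));
    }
    return taken;
}
```

Processing pages this way releases each page from the document as soon as it has been handed over, which may help when documents are large.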

◆ tesseractLanguages()

static QStringList HOCRDocument::tesseractLanguages ( )
static

List of languages supported by tesseract.

Returns
A list of languages that are found in the current installation of the tesseract OCR engine. Typical results are strings such as "eng" for English or "deu" for German.

◆ toImages()

QList<QImage> HOCRDocument::toImages ( QFont *  overrideFont = 0,
QImage::Format  format = QImage::Format_Grayscale8 
) const

Export to images.

Renders the pages of the document to a sequence of images. We expect that this method will mainly be used for debugging purposes. This method must not be called if an error condition is set.

Parameters
overrideFont  If null, the method will try a few standard fonts, to see which one fits the document best. The best font will then be taken. If not null, then the specified font will be taken.
format  Format of the resulting graphics files
Returns
A list of images

◆ toPDF()

QString HOCRDocument::toPDF ( const QString &  fileName,
resolution  _resolution,
const QString &  title = QString(),
const QPageSize &  overridePageSize = QPageSize(),
QFont *  overrideFont = 0 
) const

Export to PDF.

Renders the document to a PDF file. The file is not in PDF/A format, as we expect that this method will mainly be used for debugging purposes. This method must not be called if an error condition is set.

Parameters
fileName  Name of the PDF file that will be created. If the file exists, it will be overwritten.
_resolution  Resolution with which the document will be rendered.
title  Title string that will be set in the PDF file's metadata.
overridePageSize  If null, then the page size is computed individually for each page from the resolution and the page's bounding box. If not null, then the given page size will be used for all pages of the document.
overrideFont  If null, the method will try a few standard fonts, to see which one fits the document best. The best font will then be taken. If not null, then the specified font will be taken.
Returns
An error message. On success, an empty string is returned.
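
How a page size can follow from the resolution and a bounding box is shown in the worked sketch below. It assumes that the bounding box is measured in pixels and the resolution in dots per inch; neither assumption is confirmed by the source. It relies on the fact that a PDF point is 1/72 inch. The names PageSizePt and pageSizeInPoints are hypothetical.

```cpp
// Page size in PDF points (1 point = 1/72 inch).
struct PageSizePt {
    double width;
    double height;
};

// Converts a bounding box in pixels to a page size in points.
// Assumption: dpi is the scan resolution in dots per inch and is positive.
PageSizePt pageSizeInPoints(int widthPx, int heightPx, double dpi)
{
    const double pointsPerInch = 72.0;
    return { widthPx / dpi * pointsPerInch, heightPx / dpi * pointsPerInch };
}
```

For example, a 2480 x 3508 pixel scan at 300 dpi gives roughly 595.2 x 841.92 points, which is close to A4.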

◆ toText()

QString HOCRDocument::toText ( ) const

Export this document as text.

Returns
Text contained in the document

This method must not be called if an error condition is set.

◆ warnings()

QSet<QString> HOCRDocument::warnings ( ) const
inline

Warning messages.

Returns
A set with descriptions of fixable issues found in the HOCR file.

Definition at line 111 of file HOCRDocument.h.


The documentation for this class was generated from the following file: