scantools  1.0.4
Graphics manipulation with a view towards scanned documents
PDFAWriter Class Reference

Simple generator for PDF/A-2b compliant documents.

#include <PDFAWriter.h>

Inherits QObject.

Public Slots

void waitForWorkerThreads ()
 Waits for all worker threads to finish.
 

Signals

void authorChanged ()
 Emitted when author changes.
 
void keywordsChanged ()
 Emitted when keywords change.
 
void subjectChanged ()
 Emitted when subject changes.
 
void titleChanged ()
 Emitted when title changes.
 
void pageSizeChanged ()
 Emitted when pageSize changes.
 
void resolutionOverrideHorizontalChanged ()
 Emitted when resolutionOverrideHorizontal changes.
 
void resolutionOverrideVerticalChanged ()
 Emitted when resolutionOverrideVertical changes.
 
void autoOCRChanged ()
 Emitted when autoOCR changes.
 
void autoOCRLanguagesChanged ()
 Emitted when autoOCRLanguages change.
 
void finished ()
 Emitted just before waitForWorkerThreads() returns.
 
void progress (qreal percentage)
 Progress indicator.
 

Public Member Functions

 ~PDFAWriter ()
 Destructor.
 
 PDFAWriter (bool bestCompression=false)
 Constructor.
 
QString author ()
 Metadata: Author.
 
void setAuthor (const QString &author)
 Set the author string in the PDF/A meta data.
 
QString keywords ()
 Metadata: Keywords.
 
void setKeywords (const QString &keywords)
 Set the keywords string in the PDF/A meta data.
 
QString subject ()
 Metadata: Subject string.
 
void setSubject (const QString &subject)
 Set the subject string in the PDF/A meta data.
 
QString title ()
 Metadata: Title String.
 
void setTitle (const QString &title)
 Set the title string in the PDF/A meta data.
 
paperSize pageSize ()
 Page Size.
 
void setPageSize (const paperSize &size)
 Sets page size, effective for future calls of the addPages() methods.
 
void setPageSize (paperSize::format size=paperSize::empty)
 Sets page size, effective for future calls of the addPages() methods.
 
resolution resolutionOverrideHorizontal ()
 Horizontal resolution.
 
void setResolutionOverrideHorizontal (resolution horizontal)
 Set horizontal resolution.
 
resolution resolutionOverrideVertical ()
 Vertical resolution.
 
void setResolutionOverrideVertical (resolution vertical)
 Set vertical resolution.
 
void setResolutionOverride (resolution horizontal, resolution vertical)
 Sets graphic resolution for future calls of the addPages() methods.
 
void setResolutionOverride (resolution res)
 Overloaded method that sets horizontal and vertical resolution to the same value.
 
void clearResolutionOverride ()
 Set horizontal and vertical override resolution to zero.
 
bool autoOCR ()
 AutoOCR.
 
void setAutoOCR (bool autoOCR)
 Specify if the tesseract OCR engine should be run automatically.
 
QStringList autoOCRLanguages ()
 List of languages used for OCR.
 
QString setAutoOCRLanguages (const QStringList &OCRLanguages)
 Specify languages used by the tesseract OCR engine.
 
void appendToOCRData (const HOCRDocument &doc)
 Specify pre-processed OCR data.
 
HOCRDocument OCRData ()
 Return a copy of the internal HOCRDocument.
 
void clearOCRData ()
 Delete all pages from the internal HOCRDocument.
 
QString addPages (const QImage &image, QStringList *warnings=0)
 Add an image to the PDF document.
 
QString addPages (const JBIG2Document &jbig2doc, QStringList *warnings=0)
 Add JBIG2 images to the PDF document.
 
QString addPages (const QString &imageFileName, QStringList *warnings=0)
 Add images to the PDF document.
 
 operator QByteArray ()
 Conversion to a QByteArray containing PDF data.
 

Detailed Description

Simple generator for PDF/A-2b compliant documents.

The class takes a number of images and embeds them into a PDF/A file (conformance level PDF/A-2b), generating one page per image. OCR data can optionally be used to create an invisible text overlay, which makes the PDF/A file searchable and allows text to be copied from it. The class contains a conversion operator to QByteArray, which generates the actual PDF/A file. This makes it simple to write the PDF data to a QFile or any other I/O device. The typical life cycle of a PDF/A writer object is as follows:

  1. Create a PDF/A writer
  2. Set meta data using the methods setAuthor(), setTitle(), etc.
  3. Add pages to the PDF/A, using one of the methods addPages()
  4. Generate a PDF/A document using the operator QByteArray()
  5. Delete the writer object.

Examples

A minimal example which creates a PDF/A file might look like this.

// Minimal example
// Generate document and add JBIG2 images as pages to the document.
// Perform OCR so that the resulting PDF file will have a text layer.
PDFAWriter writer;
writer.setAutoOCR(true); // Run tesseract OCR engine using default language English
writer.addPages("Test.jb2");
// Write PDF/A file to 'output.pdf'
QFile outFile("output.pdf");
outFile.open(QIODevice::WriteOnly);
outFile.write(writer); // Implicitly uses conversion operator to create PDF
// End

A more sophisticated example which uses preprocessed HOCR files to create a text overlay might look like this.

// Not quite minimal example
// Prepare an HOCR document which will be used to create a text overlay
HOCRDocument hoDoc("Test.hocr");
// Generate document
PDFAWriter writer;
// Add two image files to the document, using pages from hoDoc to create a text
// overlay.
writer.appendToOCRData(hoDoc);
writer.addPages("Test-1.jb2");
writer.addPages("Test-2.tif");
// Write PDF/A file to 'output.pdf'
QFile outFile("output.pdf");
outFile.open(QIODevice::WriteOnly);
outFile.write(writer); // Implicitly uses conversion operator to create PDF
// End

Implementation details and limitations

All methods of this class are reentrant and thread-safe.

Definition at line 127 of file PDFAWriter.h.

Constructor & Destructor Documentation

◆ ~PDFAWriter()

PDFAWriter::~PDFAWriter ( )

Destructor.

The destructor waits for all worker threads to finish and can therefore take considerable time.

◆ PDFAWriter()

PDFAWriter::PDFAWriter ( bool  bestCompression = false)
explicit

Constructor.

Constructs a PDFAWriter. The following default values are set.

  • The metadata strings 'author', 'keywords', 'subject', 'title' are set to empty.
  • The paper size is set to 'empty', so pages will be exactly the size of the graphics added to the PDFAWriter.
  • The override resolutions are set to zero, so all graphic files are expected to contain correct information about their resolutions.
  • autoOCR is set to 'false', so no OCR engine will be run automatically.
Parameters
bestCompression: If true, use the slow but very effective zopfli compression algorithm for the lossless compression of bitmap graphics. Once the PDFAWriter is constructed, this property cannot be changed.

Member Function Documentation

◆ addPages() [1/3]

QString PDFAWriter::addPages ( const JBIG2Document &  jbig2doc,
QStringList *  warnings = 0 
)

Add JBIG2 images to the PDF document.

This method differs from the generic method addPages() only in the arguments: it expects a JBIG2Document instead of a filename.

The images contained in jbig2doc will be embedded in the PDF without re-encoding. The method does not check in detail whether the data complies with the JBIG2 standard. If invalid input data is fed into this method, the resulting PDF file might not comply with the PDF/A standard.

Parameters
jbig2doc: Reference to a document whose images will be added to the PDF file
warnings: If non-zero, pointer to a QStringList where warnings will be stored
Returns
Error message, or empty string on success.

◆ addPages() [2/3]

QString PDFAWriter::addPages ( const QImage &  image,
QStringList *  warnings = 0 
)

Add an image to the PDF document.

This method differs from the generic method addPages() only in the arguments: it expects a QImage instead of a filename. The input image must not be empty. The format of the PDF/A data stream will be chosen according to the image content.

  • Black and white images are written into the PDF/A as images with depth 1.
  • Bitonal images are written into the PDF/A as images with depth 1 and a color table of length 2.
  • Grayscale images are written into the PDF/A as images with depth 8.
  • Images with an indexed color palette are written into the PDF/A without change.
  • All other images are written into the PDF/A as 24-bit RGB.

Alpha channels will be deleted. The images will be compressed using a lossless compressor. The method is therefore slow. Currently, black and white images and bitonal images are compressed using FAX G4 compression; all other images are compressed using zlib or zopfli compression with heuristic prediction.
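
The mapping from image content to stored format described above can be summarised as a small lookup. The following is only an illustrative model, not the library's code: the names ImageKind, Encoding and chooseEncoding are hypothetical, and a bit depth of 0 is used here merely to mean "kept as in the input".

```cpp
#include <cassert>
#include <string>

// Hypothetical content classes, named after the cases listed above.
enum class ImageKind { BlackAndWhite, Bitonal, Grayscale, Indexed, Other };

// Hypothetical description of how a page image would be stored.
struct Encoding {
    int bitsPerPixel;        // 0 stands for "written without change"
    bool hasColorTable;      // true if a palette accompanies the pixel data
    std::string compressor;  // lossless compressor named in the text
};

// Returns the storage format the documentation describes for each kind.
Encoding chooseEncoding(ImageKind kind) {
    switch (kind) {
    case ImageKind::BlackAndWhite: return {1, false, "FAX G4"};
    case ImageKind::Bitonal:       return {1, true, "FAX G4"};
    case ImageKind::Grayscale:     return {8, false, "zlib or zopfli"};
    case ImageKind::Indexed:       return {0, true, "zlib or zopfli"};
    case ImageKind::Other:         return {24, false, "zlib or zopfli"};
    }
    assert(false && "unknown image kind");
    return {24, false, "zlib or zopfli"};
}
```

Whether zlib or zopfli is used for the last three cases depends on the bestCompression argument passed to the constructor.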

Parameters
image: Image that is added to the document
warnings: If non-zero, pointer to a QStringList where warnings will be stored
Returns
Error message, or empty string on success.

◆ addPages() [3/3]

QString PDFAWriter::addPages ( const QString &  imageFileName,
QStringList *  warnings = 0 
)

Add images to the PDF document.

Adds all images contained in 'imageFileName' as individual pages to the PDF document. The method accepts files in JBIG2, JPEG and JPX format, and any other format that Qt can read. The way that the image is encoded in the PDF file depends on the file type.

  • JBIG2 files will be added without re-encoding.
  • JPEG files will be added without re-encoding.
  • JPEG2000 files in JPX format will be added without re-encoding.
  • JPEG2000 files in JP2 format will be converted to RGB, and encoded losslessly.
  • All other file types will be converted to RGB, and encoded losslessly in a way that depends on the image characteristics. The documentation of the addPages() overload that takes a QImage explains this in detail.
Warning
There are two image formats associated with JPEG2000, JP2 (file ending = jp2) and JPX (file ending = jpf or jpx). The PDF standard allows JPX files to be included directly in a PDF file, while JP2 files cannot be included.
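
As a rough sketch of the file-type rules above, the following model decides by file ending whether a file would be embedded as-is or re-encoded. This is an assumption-laden illustration: the library may well inspect file contents rather than file names, the names Embedding, lowerSuffix and embeddingFor are hypothetical, and the endings "jbig2", "jpg" and "jpeg" are guesses (only "jb2", "jp2", "jpf" and "jpx" appear in this documentation).

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical outcome of the file-type decision.
enum class Embedding { PassThrough, ReencodeLossless };

// Lower-cased text after the last dot, or "" if the name has no dot.
std::string lowerSuffix(const std::string &fileName) {
    const std::string::size_type dot = fileName.find_last_of('.');
    if (dot == std::string::npos)
        return "";
    std::string suffix = fileName.substr(dot + 1);
    std::transform(suffix.begin(), suffix.end(), suffix.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return suffix;
}

// Models the list above: JBIG2, JPEG and JPX are embedded unchanged;
// JP2 and every other format are converted and encoded losslessly.
Embedding embeddingFor(const std::string &fileName) {
    const std::string s = lowerSuffix(fileName);
    if (s == "jb2" || s == "jbig2")  // "jbig2" is an assumed ending
        return Embedding::PassThrough;
    if (s == "jpg" || s == "jpeg")   // assumed endings for JPEG
        return Embedding::PassThrough;
    if (s == "jpf" || s == "jpx")
        return Embedding::PassThrough;
    return Embedding::ReencodeLossless;
}
```

Under this model, "Test-1.jb2" from the examples would be passed through, while "Test-2.tif" and any "jp2" file would be re-encoded.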

If a non-empty page size has been set using the method setPageSize(), then the page will be of that size, and the graphics will be centered on their pages. Otherwise, the page size will be chosen to fit the graphic size exactly.
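
The page-size rule in the preceding paragraph amounts to a small centering computation. The sketch below is a model under stated assumptions, not the library's implementation: SizeMM, Placement and placeGraphic are hypothetical names, an 'empty' page size is modeled as a non-positive width or height, and the case of a graphic larger than the page is not covered by the documentation and is not treated specially here.

```cpp
#include <cassert>

// Hypothetical size in millimetres; not the library's paperSize class.
struct SizeMM {
    double width;
    double height;
};

// Resulting page size and the offset of the graphic on that page.
struct Placement {
    SizeMM page;
    double offsetX;
    double offsetY;
};

// If a page size is set, the graphic is centered on a page of that size.
// Otherwise the page takes exactly the size of the graphic.
Placement placeGraphic(SizeMM requestedPage, SizeMM graphic) {
    const bool empty = requestedPage.width <= 0.0 || requestedPage.height <= 0.0;
    if (empty)
        return {graphic, 0.0, 0.0};
    return {requestedPage,
            (requestedPage.width - graphic.width) / 2.0,
            (requestedPage.height - graphic.height) / 2.0};
}
```

For instance, a 200 mm by 287 mm graphic placed on a 210 mm by 297 mm page would get a margin of 5 mm on every side in this model.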

If preprocessed OCR data has been added to the internal HOCRDocument through the method appendToOCRData(), then a text overlay is generated from the first page of the internal HOCRDocument, and the first page is then deleted. If the internal HOCRDocument is empty and the property autoOCR is true, then the tesseract OCR engine is run to create the data needed to generate a text overlay. If the internal HOCRDocument is empty and autoOCR is false, no text overlay is generated.

This method will never leave the PDFAWriter in any invalid state. It will add as many pages to the document as can be read from the file without errors.

  • At present, the PDF text overlay created by this method can only handle Latin characters, or more precisely, the character set supported by the Windows-1252 encoding. If the internal HOCRDocument contains text with characters that cannot be encoded by Windows-1252, then these characters are silently deleted.
  • This method expects that the images retrieved from 'imageFileName' and the pages from the internal HOCRDocument share the same coordinate system. In particular, both documents must use the same resolution. Information about page size and resolution is taken from 'imageFileName' or from the properties resolutionOverrideHorizontal and resolutionOverrideVertical.
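
To illustrate the first limitation in the list above, here is a minimal sketch of what "silently deleting" characters outside Windows-1252 could look like. How the library actually performs this filtering is not documented here, so the function names and the approach are assumptions; the code point table follows the published Windows-1252 code page, in which bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>

// True if the Unicode code point has a Windows-1252 representation.
bool encodableInWindows1252(char32_t c) {
    if (c <= 0x7F)
        return true;  // ASCII range
    if (c >= 0xA0 && c <= 0xFF)
        return true;  // identical to Latin-1 in this range
    // Code points that Windows-1252 places in the byte range 0x80..0x9F.
    static const char32_t extras[] = {
        0x20AC, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, 0x02C6, 0x2030,
        0x0160, 0x2039, 0x0152, 0x017D, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022,
        0x2013, 0x2014, 0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x017E, 0x0178};
    return std::find(std::begin(extras), std::end(extras), c) != std::end(extras);
}

// Returns a copy of the text without the characters that cannot be encoded.
std::u32string dropUnencodable(const std::u32string &text) {
    std::u32string kept;
    for (char32_t c : text) {
        if (encodableInWindows1252(c))
            kept.push_back(c);
    }
    return kept;
}
```

In this model, German umlauts and the euro sign survive, while Cyrillic or CJK characters are dropped from the overlay text.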

The method might or might not return immediately, as most of the computationally intense jobs (image conversion, optical character recognition, compression) are run concurrently in separate worker threads.

Parameters
imageFileName: Name of a graphics file whose images are added one-by-one as pages to the PDF/A document.
warnings: If non-zero, warnings that come up while reading the graphics files are added to this list.
Returns
QString::null on success, or a human-readable error message in English otherwise.

◆ appendToOCRData()

void PDFAWriter::appendToOCRData ( const HOCRDocument &  doc)

Specify pre-processed OCR data.

This method can be used to specify pre-processed OCR data that will be used to generate a text layer whenever pages are added to the document.

More precisely: every PDFAWriter keeps an internal HOCRDocument, and this method appends the given HOCRDocument to the internal one. Whenever pages are added to the document, the first page of the internal document is used to generate a text layer and is then removed from the internal document. If the internal document is empty, then either the tesseract OCR engine is run (if autoOCR has been set to true) or no text layer is generated at all.
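
The behaviour described in the paragraph above is essentially that of a first-in, first-out queue with a fallback. The following sketch models it with standard containers; OverlayModel, OverlaySource and nextPage are hypothetical names, pages are represented as plain strings, and none of this is the library's actual code.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Where the text layer of the next added page would come from.
enum class OverlaySource { PreprocessedPage, AutoOCR, None };

// Toy model of the internal HOCRDocument and the autoOCR property.
struct OverlayModel {
    std::deque<std::string> pendingPages;  // stands in for the internal HOCRDocument
    bool autoOCR = false;

    // Appends pages to the internal queue, like appendToOCRData().
    void appendToOCRData(const std::vector<std::string> &pages) {
        pendingPages.insert(pendingPages.end(), pages.begin(), pages.end());
    }

    // Called once per page added: consumes the first pending page if there
    // is one, otherwise falls back to OCR or to no text layer at all.
    OverlaySource nextPage() {
        if (!pendingPages.empty()) {
            pendingPages.pop_front();
            return OverlaySource::PreprocessedPage;
        }
        return autoOCR ? OverlaySource::AutoOCR : OverlaySource::None;
    }
};
```

One consequence visible in the model: if fewer preprocessed pages are supplied than images are added, the remaining pages fall through to the autoOCR setting.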

Parameters
doc: HOCRDocument that will be appended to the internal document

◆ author()

QString PDFAWriter::author ( )

Metadata: Author.

Returns
the author string from the PDF/A meta data

◆ autoOCR()

bool PDFAWriter::autoOCR ( )

AutoOCR.

Returns
the value set previously with setAutoOCR()

◆ autoOCRLanguages()

QStringList PDFAWriter::autoOCRLanguages ( )

List of languages used for OCR.

Returns
the list of languages used by the tesseract OCR engine, as set previously with setAutoOCRLanguages()

◆ clearOCRData()

void PDFAWriter::clearOCRData ( )

Delete all pages from the internal HOCRDocument.

See also
appendToOCRData()

◆ finished

void PDFAWriter::finished ( )
signal

Emitted just before waitForWorkerThreads() returns.

This signal is emitted by the method waitForWorkerThreads(), immediately before the method returns.

See also
waitForWorkerThreads()

◆ keywords()

QString PDFAWriter::keywords ( )

Metadata: Keywords.

Returns
the keywords string from the PDF/A meta data

◆ OCRData()

HOCRDocument PDFAWriter::OCRData ( )

Return a copy of the internal HOCRDocument.

Returns
copy of the internal HOCRDocument
See also
appendToOCRData()

◆ operator QByteArray()

PDFAWriter::operator QByteArray ( )

Conversion to a QByteArray containing PDF data.

This operator converts the document to a QByteArray holding a PDF/A file. This makes it possible to write a PDFAWriter directly to a QFile, resulting in a valid PDF/A document on the disk.

This method waits for all worker threads to finish and can therefore take considerable time. Just before returning, the signal finished() is emitted.

Returns
QByteArray containing the PDF

◆ pageSize()

paperSize PDFAWriter::pageSize ( )

Page Size.

Returns
the page size set with setPageSize()

◆ progress

void PDFAWriter::progress ( qreal  percentage)
signal

Progress indicator.

This signal is emitted at irregular intervals while the method waitForWorkerThreads() is running, in order to provide progress information.

Parameters
percentage: Number in the interval [0.0 .. 1.0] that indicates the fraction of PDF objects that are still being constructed by worker threads among all PDF objects.
See also
waitForWorkerThreads()

◆ resolutionOverrideHorizontal()

resolution PDFAWriter::resolutionOverrideHorizontal ( )

Horizontal resolution.

Returns
horizontal resolution that is currently set

◆ resolutionOverrideVertical()

resolution PDFAWriter::resolutionOverrideVertical ( )

Vertical resolution.

Returns
vertical resolution that is currently set

◆ setAuthor()

void PDFAWriter::setAuthor ( const QString &  author)

Set the author string in the PDF/A meta data.

Parameters
author: Name of author

◆ setAutoOCR()

void PDFAWriter::setAutoOCR ( bool  autoOCR)

Specify if the tesseract OCR engine should be run automatically.

Parameters
autoOCR: If set to true, then the PDFAWriter will automatically run the tesseract OCR engine in the background whenever pages are added to the PDF, unless preprocessed OCR data has been specified via the method appendToOCRData().
See also
setAutoOCRLanguages()

◆ setAutoOCRLanguages()

QString PDFAWriter::setAutoOCRLanguages ( const QStringList &  OCRLanguages)

Specify languages used by the tesseract OCR engine.

To improve recognition quality, the tesseract OCR engine needs to know the language(s) of the text. The languages specified here will be passed on to tesseract in future runs.

Parameters
OCRLanguages: List of languages to be used in the OCR process. Tesseract identifies languages by their 3-character ISO 639-2 language codes (e.g. "deu" for German or "fra" for French). The languages specified must be present in the current tesseract installation. If an empty list is provided, English will be used as a default language.
Returns
An empty string in case of success or else a human-readable error message in English.
See also
HOCRDocument::tesseractLanguages()
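
As a small illustration of the default rule for the language list, the sketch below builds a single language specification from a list of codes. Joining codes with '+' is the convention of the tesseract command line; whether PDFAWriter hands the list to tesseract in exactly this form is an assumption, and the function name is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds a tesseract-style language specification such as "deu+fra".
// An empty list falls back to English ("eng"), as described above.
std::string tesseractLanguageArgument(const std::vector<std::string> &languages) {
    if (languages.empty())
        return "eng";
    std::string joined;
    for (const std::string &code : languages) {
        if (!joined.empty())
            joined += '+';
        joined += code;
    }
    return joined;
}
```

The sketch does not check that the codes are installed; the real method reports that kind of problem through its returned error string.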

◆ setKeywords()

void PDFAWriter::setKeywords ( const QString &  keywords)

Set the keywords string in the PDF/A meta data.

Parameters
keywords: Keyword string

◆ setPageSize() [1/2]

void PDFAWriter::setPageSize ( const paperSize &  size)

Sets page size, effective for future calls of the addPages() methods.

Parameters
size: Paper size

◆ setPageSize() [2/2]

void PDFAWriter::setPageSize ( paperSize::format  size = paperSize::empty)

Sets page size, effective for future calls of the addPages() methods.

Parameters
size: Paper size

◆ setResolutionOverride() [1/2]

void PDFAWriter::setResolutionOverride ( resolution  horizontal,
resolution  vertical 
)

Sets graphic resolution for future calls of the addPages() methods.

To add a raster graphic to a PDF file, the resolution of the raster graphic needs to be known. This method can be used to manually set resolutions before adding graphic files that either do not specify their resolution, or that contain incorrect information.

Parameters
horizontal: Horizontal resolution, which must either be valid (in other words, horizontal.isValid() must return true) or zero. If zero, this is interpreted as "no override resolution set".
vertical: Ditto for vertical resolution.
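
The reason the resolution must be known is that a PDF page is measured in points (1/72 inch), so pixel counts have to be converted using the resolution. The following sketch shows the override rule and that conversion with plain numbers in dots per inch; it does not use the library's resolution class, and both function names are hypothetical.

```cpp
#include <cassert>

// Zero means "no override resolution set": the value from the file is used.
double effectiveDpi(double overrideDpi, double fileDpi) {
    return overrideDpi > 0.0 ? overrideDpi : fileDpi;
}

// Converts a pixel count to PDF points (1 point = 1/72 inch).
double lengthInPoints(int pixels, double dpi) {
    assert(dpi > 0.0);  // a resolution is required for the conversion
    return pixels / dpi * 72.0;
}
```

For example, a scan that is 600 pixels wide at 300 dpi is two inches wide, or 144 points; with an override of 600 dpi the same pixels would span only 72 points.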

◆ setResolutionOverride() [2/2]

void PDFAWriter::setResolutionOverride ( resolution  res)
inline

Overloaded method that sets horizontal and vertical resolution to the same value.

Parameters
res: Resolution

Definition at line 280 of file PDFAWriter.h.

◆ setResolutionOverrideHorizontal()

void PDFAWriter::setResolutionOverrideHorizontal ( resolution  horizontal)

Set horizontal resolution.

Parameters
horizontal: Resolution
See also
setResolutionOverride()

◆ setResolutionOverrideVertical()

void PDFAWriter::setResolutionOverrideVertical ( resolution  vertical)

Set vertical resolution.

Parameters
vertical: Resolution
See also
setResolutionOverride()

◆ setSubject()

void PDFAWriter::setSubject ( const QString &  subject)

Set the subject string in the PDF/A meta data.

Parameters
subject: Subject string

◆ setTitle()

void PDFAWriter::setTitle ( const QString &  title)

Set the title string in the PDF/A meta data.

Parameters
title: Title string

◆ subject()

QString PDFAWriter::subject ( )

Metadata: Subject string.

Returns
the subject string from the PDF/A meta data

◆ title()

QString PDFAWriter::title ( )

Metadata: Title String.

Returns
the title string from the PDF/A meta data

◆ waitForWorkerThreads

void PDFAWriter::waitForWorkerThreads ( )
slot

Waits for all worker threads to finish.

This method blocks until all worker threads have finished execution. While this method is running, the signal progress() is emitted at irregular intervals. The signal finished() is emitted before the method exits, even if no thread was running at the time that the method was called.

See also
finished()
progress(qreal)
