Simple generator for PDF/A-2b compliant documents.

#include <PDFAWriter.h>

Inherits QObject.

Public Slots
void	waitForWorkerThreads ()
	Waits for all worker threads to finish.

Signals
void	authorChanged ()
	Emitted when author changes.

void	keywordsChanged ()
	Emitted when keywords change.

void	subjectChanged ()
	Emitted when subject changes.

void	titleChanged ()
	Emitted when title changes.

void	pageSizeChanged ()
	Emitted when pageSize changes.

void	resolutionOverrideHorizontalChanged ()
	Emitted when resolutionOverrideHorizontal changes.

void	resolutionOverrideVerticalChanged ()
	Emitted when resolutionOverrideVertical changes.

void	autoOCRChanged ()
	Emitted when autoOCR changes.

void	autoOCRLanguagesChanged ()
	Emitted when autoOCRLanguages change.

void	finished ()
	Emitted just before waitForWorkerThreads() returns.

void	progress (qreal percentage)
	Progress indicator.

Public Member Functions
	~PDFAWriter ()
	Destructor.

	PDFAWriter (bool bestCompression=false)
	Constructor.

QString	author ()
	Metadata: Author.

void	setAuthor (const QString &author)
	Set the author string in the PDF/A meta data.

QString	keywords ()
	Metadata: Keywords.

void	setKeywords (const QString &keywords)
	Set the keywords string in the PDF/A meta data.

QString	subject ()
	Metadata: Subject string.

void	setSubject (const QString &subject)
	Set the subject string in the PDF/A meta data.

QString	title ()
	Metadata: Title String.

void	setTitle (const QString &title)
	Set the title string in the PDF/A meta data.

paperSize	pageSize ()
	Page Size.

void	setPageSize (const paperSize &size)
	Sets page size, effective for future calls of the methods addPages().

void	setPageSize (paperSize::format size=paperSize::empty)
	Sets page size, effective for future calls of the methods addPages().

resolution	resolutionOverrideHorizontal ()
	Horizontal resolution.

void	setResolutionOverrideHorizontal (resolution horizontal)
	Set horizontal resolution.

resolution	resolutionOverrideVertical ()
	Vertical resolution.

void	setResolutionOverrideVertical (resolution vertical)
	Set vertical resolution.

void	setResolutionOverride (resolution horizontal, resolution vertical)
	Sets graphic resolution for future calls of the methods addPages().

void	setResolutionOverride (resolution res)
	Overloaded method that sets horizontal and vertical resolution to the same value.

void	clearResolutionOverride ()
	Set horizontal and vertical override resolution to zero.

bool	autoOCR ()
	AutoOCR.

void	setAutoOCR (bool autoOCR)
	Specify if the tesseract OCR engine should be run automatically.

QStringList	autoOCRLanguages ()
	List of languages used for OCR.

QString	setAutoOCRLanguages (const QStringList &OCRLanguages)
	Specify languages used by the tesseract OCR engine.

void	appendToOCRData (const HOCRDocument &doc)
	Specify pre-processed OCR data.

HOCRDocument	OCRData ()
	Return a copy of the internal HOCRDocument.

void	clearOCRData ()
	Delete all pages from the internal HOCRDocument.

QString	addPages (const QImage &image, QStringList *warnings=0)
	Add an image to the PDF document.

QString	addPages (const JBIG2Document &jbig2doc, QStringList *warnings=0)
	Add JBIG2 images to the PDF document.

QString	addPages (const QString &imageFileName, QStringList *warnings=0)
	Add images to the PDF document.

	operator QByteArray ()
	Conversion to a QByteArray containing PDF data.

Detailed Description

Simple generator for PDF/A-2b compliant documents.

The class takes a number of images and embeds them into a PDF/A file (conformance level PDF/A-2b), generating one page per image. OCR data can optionally be used to create an invisible text overlay, so that the PDF/A file is searchable and text can be copied from it. The class contains a conversion operator to QByteArray, which generates the actual PDF/A file. This makes it simple to write the PDF data to a QFile or any other I/O device. The typical life cycle of a PDF/A writer object is as follows:

Create a PDF/A writer
Set meta data using the methods setAuthor(), setTitle(), etc.
Add pages to the PDF/A, using one of the methods addPages()
Generate a PDF/A document using the operator QByteArray()
Delete the writer object.

Examples

A minimal example which creates a PDF/A file might look like this.

// Minimal example

// Generate document and add JBIG2 images as pages to the document.
// Perform OCR so that the resulting PDF file will have a text layer.
PDFAWriter writer;  // No parentheses: 'PDFAWriter writer();' would declare a function
writer.setAutoOCR(true); // Run tesseract OCR engine using default language English
writer.addPages("Test.jb2");

// Write PDF/A file to 'output.pdf'
QFile outFile("output.pdf");
outFile.open(QIODevice::WriteOnly);
outFile.write(writer);  // Implicitly uses conversion operator to create PDF

// End

A more sophisticated example which uses preprocessed HOCR files to create a text overlay might look like this.

// Not quite minimal example

// Prepare an HOCR document which will be used to create a text overlay
HOCRDocument hoDoc("Test.hocr");

// Generate document
PDFAWriter writer;  // No parentheses: 'PDFAWriter writer();' would declare a function

// Add two image files to the document, using pages from hoDoc to create a text
// overlay.
writer.appendToOCRData(hoDoc);
writer.addPages("Test-1.jb2");
writer.addPages("Test-2.tif");

// Write PDF/A file to 'output.pdf'
QFile outFile("output.pdf");
outFile.open(QIODevice::WriteOnly);
outFile.write(writer);  // Implicitly uses conversion operator to create PDF

// End

Implementation details and limitations

If the input data is valid, then output files comply with ISO PDF/A-2b, which is the industry standard for scanned documents. Input data is not checked for validity.
When generating a text overlay, characters that cannot be encoded in Windows-1252 are silently ignored. As a result, OCR data currently works with western languages only.
The PDF file is created in memory. This might be a problem when creating huge files that are several GB in size.
Metadata, which is mandatory in PDF/A files, is stored both as embedded XMP and as a PDF document information dictionary. The information dictionary, which is optional in the PDF/A standard, might be removed in future versions of this library. For now, it is included because many current PDF handling programs do not interpret XMP data correctly.
The creation date contained in the metadata section of the PDF file refers to the time when the constructor was called.
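The Windows-1252 limitation mentioned above can be illustrated with a small, library-independent sketch. It is not the code used by PDFAWriter; the names isWindows1252Encodable() and dropUnencodable() are hypothetical, and the sketch only assumes the published Windows-1252 code page layout (ASCII, the upper Latin-1 half, and 27 extra characters in the byte range 0x80..0x9F).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>

// Unicode code points that Windows-1252 maps into the byte range 0x80..0x9F.
// The bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined and have no entry.
static const std::array<char32_t, 27> kCp1252Extras = {
    0x20AC, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, 0x02C6, 0x2030,
    0x0160, 0x2039, 0x0152, 0x017D, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022,
    0x2013, 0x2014, 0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x017E, 0x0178};

// Hypothetical helper: true if the code point has a Windows-1252 byte.
bool isWindows1252Encodable(char32_t c) {
    if (c <= 0x7F) return true;               // ASCII
    if (c >= 0xA0 && c <= 0xFF) return true;  // upper half of Latin-1
    return std::find(kCp1252Extras.begin(), kCp1252Extras.end(), c) !=
           kCp1252Extras.end();
}

// Hypothetical helper: silently drops every character that cannot be encoded,
// mirroring the behaviour described for the text overlay.
std::u32string dropUnencodable(const std::u32string &text) {
    std::u32string result;
    for (char32_t c : text) {
        if (isWindows1252Encodable(c)) result.push_back(c);
    }
    return result;
}
```

With this model, German or French text passes through unchanged, while CJK or Cyrillic characters disappear from the overlay without any warning.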

All methods of this class are reentrant and thread-safe.

Definition at line 127 of file PDFAWriter.h.

Constructor & Destructor Documentation

◆ ~PDFAWriter()

PDFAWriter::~PDFAWriter ( )

Destructor.

The destructor waits for all worker threads to finish and can therefore take considerable time.

◆ PDFAWriter()

PDFAWriter::PDFAWriter ( bool bestCompression = false )

explicit

Constructor.

Constructs a PDFAWriter. The following default values are set.

The metadata strings 'author', 'keywords', 'subject', 'title' are set to empty.
The paper size is set to 'empty', so pages will be exactly the size of the graphics added to the PDFAWriter.
The override resolutions are set to zero, so all graphic files are expected to contain correct information about their resolutions.
autoOCR is set to 'false', so no OCR engine will be run automatically.

Parameters

bestCompression If true, then use the slow, but very effective zopfli compression algorithm for the lossless compression of bitmap graphics. Once the PDFAWriter is constructed, this property cannot be changed anymore.

Member Function Documentation

◆ addPages() [1/3]

QString PDFAWriter::addPages	(	const JBIG2Document &	jbig2doc,
		QStringList *	warnings = 0
	)

Add JBIG2 images to the PDF document.

This method differs from the generic method addPages() only in the arguments: it expects a JBIG2Document instead of a filename.

The images contained in jbig2doc will be embedded in the PDF without re-encoding. The method does not check in detail whether the data complies with the JBIG2 standard. If invalid input data is fed into this method, the resulting PDF file might not comply with the PDF/A standard.

Parameters

jbig2doc	Reference to a document whose images will be added to the PDF file
warnings	If non-zero, pointer to a QStringList where warnings will be stored

Returns: Error message, or empty string on success.

◆ addPages() [2/3]

QString PDFAWriter::addPages	(	const QImage &	image,
		QStringList *	warnings = 0
	)

Add an image to the PDF document.

This method differs from the generic method addPages() only in the arguments: it expects a QImage instead of a filename. The input image must not be empty. The format of the PDF/A data stream will be chosen according to the image content.

Black and white images are written into the PDF/A as images with depth 1.
Bitonal images are written into the PDF/A as images with depth 1 and a color table of length two.
Grayscale images are written into the PDF/A as images with depth 8.
Images with an indexed color palette are written into the PDF/A without change.
All other images are written into the PDF/A as 24-bit RGB.

Alpha channels will be deleted. The images will be compressed using a lossless compressor. The method is therefore slow. Currently, black and white and bitonal images are compressed using FAX G4 compression; all other images are compressed using zlib or zopfli compression with heuristic prediction.
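The choice of compressor described above can be summarised in a short sketch. The enums and the function chooseCompression() are hypothetical and not part of the library. The sketch also assumes that zlib is used whenever the constructor argument bestCompression is false, which the text suggests but does not state explicitly.

```cpp
#include <cassert>

// Hypothetical classification of the input image, following the list above.
enum class ImageKind { BlackAndWhite, Bitonal, Grayscale, Indexed, Other };

// Hypothetical names for the lossless compressors mentioned in the text.
enum class Compression { FaxG4, Zlib, Zopfli };

// Sketch of the selection rule: FAX G4 for one-bit images, otherwise zlib or,
// if the writer was constructed with bestCompression=true, zopfli (assumed).
Compression chooseCompression(ImageKind kind, bool bestCompression) {
    if (kind == ImageKind::BlackAndWhite || kind == ImageKind::Bitonal) {
        return Compression::FaxG4;
    }
    return bestCompression ? Compression::Zopfli : Compression::Zlib;
}
```

Under these assumptions, bestCompression has no effect on black and white or bitonal scans.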

Parameters

image	Image that is added to the document
warnings	If non-zero, pointer to a QStringList where warnings will be stored

Returns: Error message, or empty string on success.

◆ addPages() [3/3]

QString PDFAWriter::addPages	(	const QString &	imageFileName,
		QStringList *	warnings = 0
	)

Add images to the PDF document.

Adds all images contained in 'imageFileName' as individual pages to the PDF document. The method accepts files in JBIG2, JPEG and JPX format, and any other format that Qt can read. The way that the image is encoded in the PDF file depends on the file type.

JBIG2 files will be added without re-encoding.
JPEG files will be added without re-encoding.
JPEG2000 files in JPX format will be added without re-encoding.
JPEG2000 files in JP2 format will be converted to RGB, and encoded losslessly.
All other file types will be converted to RGB, and encoded losslessly in a way that depends on the image characteristics. The documentation of the overload addPages(const QImage &, QStringList *) explains this in detail.

Warning: There are two image formats associated with JPEG2000, JP2 (file ending = jp2) and JPX (file ending = jpf or jpx). The PDF standard allows JPX files to be included directly in a PDF file, while JP2 files cannot be included.
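The per-format handling listed above can be sketched as a simple dispatch on the file ending. This is only a model: the source does not say how PDFAWriter detects file types, so deciding by file ending is an assumption, as are the endings "jb2", "jpg" and "jpeg" and the names Embedding, lowerSuffix() and embeddingFor(). Only "jp2", "jpf" and "jpx" are taken from the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical result type: is the file embedded as-is or re-encoded?
enum class Embedding { Direct, ReencodeLossless };

// Returns the lower-case file ending without the dot, or "" if there is none.
std::string lowerSuffix(const std::string &fileName) {
    const std::string::size_type dot = fileName.find_last_of('.');
    if (dot == std::string::npos) return "";
    std::string suffix = fileName.substr(dot + 1);
    std::transform(suffix.begin(), suffix.end(), suffix.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return suffix;
}

// Sketch of the rule: JBIG2, JPEG and JPX are embedded without re-encoding;
// JP2 and everything else is converted and encoded losslessly.
Embedding embeddingFor(const std::string &fileName) {
    const std::string s = lowerSuffix(fileName);
    if (s == "jb2" || s == "jpg" || s == "jpeg" || s == "jpx" || s == "jpf") {
        return Embedding::Direct;
    }
    return Embedding::ReencodeLossless;
}
```

The sketch makes the JP2/JPX pitfall visible: renaming a file does not change its format, so a JP2 file is re-encoded even though it contains JPEG2000 data.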

If a non-empty page size has been set using the method setPageSize(), then the page will be of that size, and the graphics will be centered on their pages. Otherwise, the page size will be chosen to fit the graphic size exactly.
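The page-size rule can be made concrete with a small calculation. The sketch below is not library code: the struct Placement and the function placeGraphic() are hypothetical, sizes are expressed in PDF points (1/72 inch), and an "empty" page size is modelled as a width or height of zero. What happens when the graphic is larger than the page is not specified in the text; the sketch simply yields a negative offset in that case.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result: page size and position of the graphic, in PDF points.
struct Placement {
    double pageWidth;
    double pageHeight;
    double offsetX;  // distance of the graphic from the left page edge
    double offsetY;  // distance of the graphic from the bottom page edge
};

// Converts pixel dimensions to points and either fits the page to the graphic
// (empty page size) or centers the graphic on the requested page.
Placement placeGraphic(int widthPx, int heightPx, double dpiX, double dpiY,
                       double pageWidthPt = 0.0, double pageHeightPt = 0.0) {
    assert(dpiX > 0.0 && dpiY > 0.0);
    const double w = widthPx / dpiX * 72.0;
    const double h = heightPx / dpiY * 72.0;
    if (pageWidthPt <= 0.0 || pageHeightPt <= 0.0) {
        return {w, h, 0.0, 0.0};  // page fits the graphic exactly
    }
    return {pageWidthPt, pageHeightPt, (pageWidthPt - w) / 2.0,
            (pageHeightPt - h) / 2.0};
}
```

For example, a 600 x 300 pixel image at 300 dpi measures 144 x 72 points; on a 595 x 842 point page it would be placed 225.5 points from the left edge and 385 points from the bottom.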

If preprocessed OCR data has been added to the internal HOCRDocument through the method appendToOCRData(), then a text overlay is generated from the first page of the internal HOCRDocument, and that page is then deleted. If the internal HOCRDocument is empty and the property autoOCR is true, then the tesseract OCR engine is run to create the data needed to generate a text overlay. If the internal HOCRDocument is empty and autoOCR is false, no text overlay is generated.

This method will never leave the PDFAWriter in any invalid state. It will add as many pages to the document as can be read from the file without errors.

At present, the PDF text overlay created by this method can only handle Latin characters, or more precisely, the character set supported by the Windows-1252 encoding. If the internal HOCRDocument contains text with characters that cannot be encoded in Windows-1252, then these characters are silently deleted.
This method expects that the images retrieved from 'imageFileName' and the pages from the internal HOCRDocument share the same coordinate system. In particular, both documents must use the same resolution. Information about page size and resolution is taken from 'imageFileName' or from the properties resolutionOverrideHorizontal and resolutionOverrideVertical.

The method might or might not return immediately, as most of the computationally intense jobs (image conversion, optical character recognition, compression) are run concurrently in separate worker threads.

Parameters

imageFileName	Name of a graphics file whose images are added one-by-one as pages to the PDF/A document.
warnings	If non-zero, warnings that come up while reading the graphics files are added to this list.

Returns: QString::null on success, or a human-readable error message in English otherwise.

◆ appendToOCRData()

void PDFAWriter::appendToOCRData ( const HOCRDocument & doc )

Specify pre-processed OCR data.

This method can be used to specify pre-processed OCR data that will be used to generate a text layer whenever pages are added to the document.

To be more precise: every PDFAWriter keeps an internal HOCRDocument, and this method appends the given HOCRDocument to the internal one. Whenever a page is added to the document, the first page of the internal document is used to generate a text layer and is then removed from the internal document. If the internal document is empty, then either the tesseract OCR engine is run (if autoOCR has been set to true with setAutoOCR()) or no text layer is generated at all.
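The first-in, first-out behaviour described above can be modelled without the library. In the sketch below, the struct OcrQueueModel and the enum TextLayerSource are hypothetical stand-ins; strings take the place of HOCR pages, and nextPage() represents one page being added to the PDF.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical tag for where the text layer of a page comes from.
enum class TextLayerSource { Preprocessed, AutoOCR, None };

// Minimal model of the internal HOCRDocument and the autoOCR property.
struct OcrQueueModel {
    std::deque<std::string> pages;  // stand-in for pages of the internal document
    bool autoOCR = false;

    // Models appendToOCRData(): pages are appended at the end.
    void append(const std::deque<std::string> &doc) {
        pages.insert(pages.end(), doc.begin(), doc.end());
    }

    // Models adding one page to the PDF: consume the first preprocessed page if
    // there is one, otherwise fall back to automatic OCR or to no text layer.
    TextLayerSource nextPage() {
        if (!pages.empty()) {
            pages.pop_front();
            return TextLayerSource::Preprocessed;
        }
        return autoOCR ? TextLayerSource::AutoOCR : TextLayerSource::None;
    }
};
```

A practical consequence of this model is that the order of appendToOCRData() and addPages() calls matters: OCR pages must be appended in the same order in which the corresponding images are added.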

Parameters

doc	HOCRDocument that will be appended to the internal document

◆ author()

QString PDFAWriter::author ( )

Metadata: Author.

Returns: the author string from the PDF/A meta data

◆ autoOCR()

bool PDFAWriter::autoOCR ( )

AutoOCR.

Returns: the value set previously with setAutoOCR()

◆ autoOCRLanguages()

QStringList PDFAWriter::autoOCRLanguages ( )

List of languages used for OCR.

Returns: the list of languages used by the tesseract OCR engine, as set previously with setAutoOCRLanguages()

◆ clearOCRData()

void PDFAWriter::clearOCRData ( )

Delete all pages from the internal HOCRDocument.

See also: appendToOCRData()

◆ finished

void PDFAWriter::finished ( )

signal

Emitted just before waitForWorkerThreads() returns.

This signal is emitted by the method waitForWorkerThreads(), immediately before the method returns.

See also: waitForWorkerThreads()

◆ keywords()

QString PDFAWriter::keywords ( )

Metadata: Keywords.

Returns: the keywords string from the PDF/A meta data

◆ OCRData()

HOCRDocument PDFAWriter::OCRData ( )

Return a copy of the internal HOCRDocument.

Returns: copy of the internal HOCRDocument

See also: appendToOCRData()

◆ operator QByteArray()

PDFAWriter::operator QByteArray ( )

Conversion to a QByteArray containing PDF data.

This operator converts the document to a QByteArray holding a PDF/A file. This makes it possible to write a PDFAWriter directly to a QFile, resulting in a valid PDF/A document on disk.

This method waits for all worker threads to finish and can therefore take considerable time. Just before returning, the signal finished() is emitted.

Returns: QByteArray containing the PDF

◆ pageSize()

paperSize PDFAWriter::pageSize ( )

Page Size.

Returns: the page size set with setPageSize()

◆ progress

void PDFAWriter::progress ( qreal percentage )

signal

Progress indicator.

This signal is emitted at irregular intervals while the method waitForWorkerThreads() is running, in order to provide progress information.

Parameters

percentage Number in the interval [0.0 .. 1.0] that indicates the fraction of PDF objects that are still being constructed by worker threads among all PDF objects.

See also: waitForWorkerThreads()

◆ resolutionOverrideHorizontal()

resolution PDFAWriter::resolutionOverrideHorizontal ( )

Horizontal resolution.

Returns: horizontal resolution that is currently set

◆ resolutionOverrideVertical()

resolution PDFAWriter::resolutionOverrideVertical ( )

Vertical resolution.

Returns: vertical resolution that is currently set

◆ setAuthor()

void PDFAWriter::setAuthor ( const QString & author )

Set the author string in the PDF/A meta data.

Parameters

author Name of author

◆ setAutoOCR()

void PDFAWriter::setAutoOCR ( bool autoOCR )

Specify if the tesseract OCR engine should be run automatically.

Parameters

autoOCR If set to true, then the PDFAWriter will automatically run the tesseract OCR engine in the background whenever pages are added to the PDF, unless preprocessed OCR data has been specified via the method appendToOCRData().

See also: setAutoOCRLanguages()

◆ setAutoOCRLanguages()

QString PDFAWriter::setAutoOCRLanguages ( const QStringList & OCRLanguages )

Specify languages used by the tesseract OCR engine.

To improve recognition quality, the tesseract OCR engine needs to know the language(s) of the text. The languages specified here will be passed on to tesseract in future runs.

Parameters

OCRLanguages List of languages to be used in the OCR process. Tesseract identifies languages by their 3-character ISO 639-2 language codes (e.g. "deu" for German or "fra" for French). The languages specified must be present in the current tesseract installation. If an empty list is provided, English will be used as a default language.

Returns: An empty string in case of success or else a human-readable error message in English.

See also: HOCRDocument::tesseractLanguages()
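The contract of setAutoOCRLanguages() can be sketched as two plain functions. Both names, checkLanguages() and effectiveLanguages(), are hypothetical, as is the wording of the error message; the list of installed languages is passed in explicitly here, whereas the library presumably obtains it from the tesseract installation. "eng" is tesseract's code for English.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns an empty string if every requested language is installed, or an
// error message in English naming the first missing language.
std::string checkLanguages(const std::vector<std::string> &requested,
                           const std::vector<std::string> &installed) {
    for (const std::string &lang : requested) {
        if (std::find(installed.begin(), installed.end(), lang) == installed.end()) {
            return "Language '" + lang + "' is not available in this tesseract installation.";
        }
    }
    return "";
}

// An empty request falls back to English, as described for the parameter.
std::vector<std::string> effectiveLanguages(const std::vector<std::string> &requested) {
    if (requested.empty()) return std::vector<std::string>{"eng"};
    return requested;
}
```

Callers of the real method should likewise check the returned string, since a misspelled or missing language code is reported there rather than by an exception.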

◆ setKeywords()

void PDFAWriter::setKeywords ( const QString & keywords )

Set the keywords string in the PDF/A meta data.

Parameters

keywords Keyword string

◆ setPageSize() [1/2]

void PDFAWriter::setPageSize ( const paperSize & size )

Sets page size, effective for future calls of the methods addPages().

Parameters

size	Paper size

◆ setPageSize() [2/2]

void PDFAWriter::setPageSize ( paperSize::format size = paperSize::empty )

Sets page size, effective for future calls of the methods addPages().

Parameters

size	Paper size

◆ setResolutionOverride() [1/2]

void PDFAWriter::setResolutionOverride	(	resolution	horizontal,
		resolution	vertical
	)

Sets graphic resolution for future calls of the methods addPages().

To add a raster graphic to a PDF file, the resolution of the raster graphic needs to be known. This method can be used to manually set resolutions before adding graphic files that either do not specify their resolution, or that contain incorrect information.

Parameters

horizontal	Horizontal resolution, which must either be valid (in other words, horizontal.isValid() must return true) or zero. If zero, this is interpreted as "no override resolution set".
vertical	Ditto for vertical resolution.
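The meaning of a zero override can be sketched as follows. The struct Dpi and the function effectiveResolution() are hypothetical and use plain numbers in dots per inch instead of the library's resolution class. Treating the horizontal and vertical overrides independently is an assumption suggested by the two separate properties, not something the text confirms.

```cpp
#include <cassert>

// Hypothetical stand-in for a pair of resolutions, in dots per inch.
struct Dpi {
    double horizontal;
    double vertical;
};

// Zero means "no override set": the value from the graphics file is used.
// Any positive override replaces the value from the file (per axis, assumed).
Dpi effectiveResolution(Dpi fromFile, Dpi overrideDpi) {
    assert(overrideDpi.horizontal >= 0.0 && overrideDpi.vertical >= 0.0);
    return {overrideDpi.horizontal > 0.0 ? overrideDpi.horizontal : fromFile.horizontal,
            overrideDpi.vertical > 0.0 ? overrideDpi.vertical : fromFile.vertical};
}
```

In this model, clearResolutionOverride() corresponds to passing an override of {0.0, 0.0}, so that the resolution stored in each graphics file is trusted again.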

◆ setResolutionOverride() [2/2]

void PDFAWriter::setResolutionOverride ( resolution res )

inline

Overloaded method that sets horizontal and vertical resolution to the same value.

Parameters

res	resolution

Definition at line 280 of file PDFAWriter.h.

◆ setResolutionOverrideHorizontal()

void PDFAWriter::setResolutionOverrideHorizontal ( resolution horizontal )

Set horizontal resolution.

Parameters

horizontal Resolution

See also: setResolutionOverride

◆ setResolutionOverrideVertical()

void PDFAWriter::setResolutionOverrideVertical ( resolution vertical )

Set vertical resolution.

Parameters

vertical Resolution

See also: setResolutionOverride

◆ setSubject()

void PDFAWriter::setSubject ( const QString & subject )

Set the subject string in the PDF/A meta data.

Parameters

subject Subject string

◆ setTitle()

void PDFAWriter::setTitle ( const QString & title )

Set the title string in the PDF/A meta data.

Parameters

title Title string

◆ subject()

QString PDFAWriter::subject ( )

Metadata: Subject string.

Returns: the subject string from the PDF/A meta data

◆ title()

QString PDFAWriter::title ( )

Metadata: Title String.

Returns: the title string from the PDF/A meta data

◆ waitForWorkerThreads

void PDFAWriter::waitForWorkerThreads ( )

slot

Waits for all worker threads to finish.

This method blocks until all worker threads have finished execution. While this method is running, the signal progress() is emitted at infrequent intervals. The signal finished() is emitted before the method exits, even if no thread was running at the time the method was called.

See also: finished(); progress(qreal)

The documentation for this class was generated from the following file:

/home/kebekus/Software/projects/scantools/src/libscantools/PDFAWriter.h
