Using python-poppler

Quickstart

from poppler import load_from_file, PageRenderer

pdf_document = load_from_file("sample.pdf")
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()

renderer = PageRenderer()
image = renderer.render_page(page_1)
image_data = image.data

The PDF file is loaded into a Document. From the Document, you can extract general information such as document properties and font information. You can also extract Page objects, using the Document.create_page() method.

A Page gives you information about transitions and page orientation, and provides several methods to extract text.

Using a PageRenderer, you can convert a Page to an Image.

The most commonly used classes and functions are aliased directly in the poppler module. Therefore, you usually do not need to import anything other than poppler.

Working with documents

Loading a document

A poppler Document can be created from a file path using load_from_file(), or from binary data using load_from_data(). There is also a more general load() function, which accepts a file path, binary data, or a file-like object as its argument.
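To make the relationship between the three entry points concrete, here is a minimal sketch of a dispatching helper. The name open_document is hypothetical and not part of python-poppler; the sketch assumes that load_from_file() accepts a path string and that load_from_data() accepts a bytes object. In practice the general load() function already covers these cases.

```python
import os


def open_document(source):
    """Load a Document from a path, raw bytes, or a binary file-like object.

    Hypothetical helper, for illustration only.
    """
    import poppler  # imported lazily: only needed when the helper is called

    if isinstance(source, (bytes, bytearray)):
        # Binary PDF data that is already in memory.
        return poppler.load_from_data(bytes(source))
    if hasattr(source, "read"):
        # File-like object opened in binary mode: read it fully first.
        return poppler.load_from_data(source.read())
    # Anything else is treated as a file system path.
    return poppler.load_from_file(os.fspath(source))
```

For example, open_document("sample.pdf") and open_document(open("sample.pdf", "rb")) should both end up with a Document for the same file.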

Working with password protected documents

There are two kinds of passwords that can be applied to a PDF document: the User Password and the Owner Password.

The User Password, or Document open password, prevents the document from being opened or viewed without it.

The Owner Password, or Permission password, or master password, is used to set document restrictions, such as printing, copying contents, editing, extracting pages, commenting, etc. When this password is set, you need it to modify the document.

A PDF document can have a User Password, an Owner Password, or both. When both passwords are set, you only need one of them to be able to open the document. However, you need the Owner Password to be able to modify the document.

You can provide the password when loading the document, or later using the Document.unlock() method. Document.is_locked() tells you whether the document is still locked, that is, whether you lack the permission to view it. If you load a document with the wrong password, an error message is printed on the error console.

The possible document restrictions are given by the Permissions enum. You can check each permission using the Document.has_permission() method. If the document was opened with the right owner password, then each permission will be True. Otherwise, it will depend on the permissions set on the document itself.
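The unlock-then-check flow described above could look like the following sketch. Both helper names are hypothetical. It assumes that unlock() takes the owner password followed by the user password, and that the caller supplies a mapping from permission names to Permissions members (for instance poppler.Permissions.__members__, if the enum exposes it).

```python
def ensure_unlocked(document, owner_password, user_password):
    """Try to unlock a document; return True if it can now be viewed."""
    if document.is_locked():
        # The return value of unlock() is deliberately ignored; the lock
        # state is checked again with is_locked() instead.
        document.unlock(owner_password, user_password)
    return not document.is_locked()


def permission_report(document, permissions):
    """Map each permission name to the result of has_permission()."""
    return {
        name: document.has_permission(value)
        for name, value in permissions.items()
    }
```

Checking is_locked() after the call avoids relying on how unlock() reports success or failure.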

Document properties

The Document.infos() method is a convenient way to get all the document metadata as a Python dict. Alternatively, you can follow the poppler-cpp API: retrieve the list of available keys using Document.info_keys(), get individual values using Document.info_key() or Document.info_date(), and set them using Document.set_info_key() or Document.set_info_date().

The metadata is also available via individual properties: Document.author, Document.creation_date, Document.creator, Document.keywords, Document.metadata, Document.modification_date, Document.producer, Document.subject, and Document.title. All of these properties can be read or written.
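As a small illustration of both access styles, the sketch below reads the metadata dict and updates one property. The helper names describe_document and retitle are hypothetical; the only calls made on the document are infos() and the title property described above.

```python
def describe_document(document):
    """Return the non-empty metadata entries of a document, sorted by key."""
    infos = document.infos()  # dict mapping metadata keys to values
    return {key: infos[key] for key in sorted(infos) if infos[key]}


def retitle(document, new_title):
    """Set the title property and return the previous title."""
    old_title = document.title
    document.title = new_title
    return old_title
```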

Loading pages

You can query the number of pages a document has using Document.pages. Pages are indexed from 0. You can create a Page object using the Document.create_page() method. This method can take the page index, or a page label, as argument. However, it is more convenient to use an index, since you cannot know the label before the page is created.
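Because pages are indexed from 0 and the page count is given by Document.pages, walking through a document can be sketched as follows. The helpers iter_pages and page_labels are hypothetical names; they rely only on Document.pages, Document.create_page() and the Page.label property.

```python
def iter_pages(document):
    """Yield (index, page) pairs for every page of the document."""
    for index in range(document.pages):
        yield index, document.create_page(index)


def page_labels(document):
    """Return the label of every page, in page order."""
    return [page.label for _, page in iter_pages(document)]
```

Once the labels are known this way, they can be matched against the indexes for later lookups.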

Working with pages

Page objects are used to extract text, and to query information about transitions.

The Page.label property gives you the page name; it is usually the displayed page number. Page.page_rect() lets you query the size of the page.

Page transitions are mainly used by presentation software. Page.transition() gives you information about the kind of page transition, and Page.duration gives you the duration of the transition.

Extracting text

The Page.text() method returns all the text a Page contains, or only the text in a given area. For more precise information, Page.text_list() gives you the position of each text box, and the position of each character within a text box. Finally, the Page.search() method lets you search for a given text in a Page.
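The two extraction levels can be sketched as below. The helper names are hypothetical, and the sketch assumes that text_list() can be called without arguments and that each returned text box exposes text and bbox attributes.

```python
def document_text(document, separator="\n"):
    """Concatenate the text of every page of the document."""
    pages = (document.create_page(i) for i in range(document.pages))
    return separator.join(page.text() for page in pages)


def boxes_containing(page, needle):
    """Return (text, bbox) for each text box whose text contains needle."""
    return [
        (box.text, box.bbox)   # bbox: assumed position attribute of the box
        for box in page.text_list()
        if needle in box.text
    ]
```

Filtering text boxes this way is a simple substring match done in Python; it is not a replacement for Page.search().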

Getting font information

You can get the list of fonts in a Document using Document.create_font_iterator(). It returns an object you can iterate to get the list of fonts:

font_iterator = document.create_font_iterator()
for page, fonts in font_iterator:
    print(f"Fonts for page {page}")
    for font in fonts:
        print(f"- {font.name}")

Since Poppler 0.89, you can also get the font information associated with a TextBox. To do so, pass the text_list_include_font option to the Page.text_list() method.

boxes = pdf_page.text_list(pdf_page.TextListOption.text_list_include_font)
box = boxes[0]

assert box.has_font_info
print(box.get_font_name())
print(box.get_font_size())

Rendering image

Rendering is the process of converting a Page to an Image. To render a Page, you first need to create a PageRenderer object. Then you give the Page to the PageRenderer.render_page() method to obtain an Image object.
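A minimal sketch of rendering a whole document follows. The helper render_pages is a hypothetical name, and xres and yres are assumed to be the keyword names of render_page() for the horizontal and vertical resolution in dots per inch.

```python
def render_pages(document, renderer, dpi=72.0):
    """Render every page with the given PageRenderer and return the images."""
    images = []
    for index in range(document.pages):
        page = document.create_page(index)
        # xres/yres: assumed keyword names for the rendering resolution.
        images.append(renderer.render_page(page, xres=dpi, yres=dpi))
    return images
```

With the names from the quickstart, this would be called as render_pages(pdf_document, PageRenderer(), dpi=150). Reusing one renderer for all pages avoids creating a new PageRenderer per page.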

Working with images

Given an object named image that is an instance of Image, you can convert it to different formats in order to work with other libraries. Here are some examples.

Converting to PIL or Tk image

ImageFormat can be converted to a string representation, compatible with the PIL raw importer:

from PIL import Image, ImageTk

pil_image = Image.frombytes(
    "RGBA",
    (image.width, image.height),
    image.data,
    "raw",
    str(image.format),
)
tk_image = ImageTk.PhotoImage(pil_image)

Unfortunately, it is not possible to build a PIL image using the buffer interface. A copy of the image data is unavoidable.

If you need to use the image with Tk, you can create it from a PIL image.

Converting to QImage

There is no built-in mapping between poppler and Qt image formats, mainly to avoid introducing a dependency on Qt. However, it is easy to build one if needed, as in the following example:

P2QFormat = {
    ImageFormat.invalid: QtGui.QImage.Format_Invalid,
    ImageFormat.argb32: QtGui.QImage.Format_ARGB32,
    ImageFormat.bgr24: QtGui.QImage.Format_BGR888,
    ImageFormat.gray8: QtGui.QImage.Format_Grayscale8,
    ImageFormat.mono: QtGui.QImage.Format_Mono,
    ImageFormat.rgb24: QtGui.QImage.Format_RGB888,
}
qimg = QtGui.QImage(image.data, image.width, image.height,
                    image.bytes_per_row,
                    P2QFormat[image.format])

Converting image to numpy array

Image supports the buffer protocol through memoryview. This lets you access the image buffer directly from Python, without copying it.

You can create a numpy array using the memoryview() method. If you modify the array, image data will be automatically modified as well.

import numpy

a = numpy.array(image.memoryview(), copy=False)
print(a[0, 0, 0])
print(image.data[0])  # Value of the first byte of the image

a[0, 0, 0] = 0
print(image.data[0])  # It is now 0
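If numpy is not available, the raw bytes can also be indexed by hand. The sketch below is an illustration only: pixel_bytes is a hypothetical helper, it assumes a format with 4 bytes per pixel (such as argb32), and the order of the channels within those bytes depends on image.format. It uses bytes_per_row rather than the image width to find the start of a row, since rows may be padded.

```python
def pixel_bytes(data, bytes_per_row, x, y, bytes_per_pixel=4):
    """Return the raw bytes of the pixel at column x, row y."""
    # Rows may be padded, so the row offset comes from bytes_per_row
    # and not from width * bytes_per_pixel.
    start = y * bytes_per_row + x * bytes_per_pixel
    return bytes(data[start:start + bytes_per_pixel])
```

It would be called as pixel_bytes(image.data, image.bytes_per_row, x, y).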