Using python-poppler¶
Quickstart¶
from poppler import load_from_file, PageRenderer
pdf_document = load_from_file("sample.pdf")
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()
renderer = PageRenderer()
image = renderer.render_page(page_1)
image_data = image.data
The PDF file is loaded into a Document.
From the Document, you can extract general information such as properties and font information.
You can also extract Page objects, using the Document.create_page() method.
From the Page, you get information about transitions and page orientation, and various methods to extract text.
Using a PageRenderer, you can convert a Page to an Image.
The most used classes and functions are aliased directly in the poppler module.
Therefore, you usually do not need to import anything other than poppler.
Working with documents¶
Loading Document¶
A poppler Document can be created from a file path using load_from_file(), or from binary data using load_from_data(). There is also a more general load() function, which can take either a file path, binary data, or a file-like object as argument.
Working with password protected documents¶
There are two kinds of passwords that can be applied to a PDF document: User Password and Owner Password.
The User Password, or Document open password, prevents the document from being opened or viewed.
The Owner Password, or Permission password, or master password, is used to set document restrictions, such as printing, copying contents, editing, extracting pages, commenting, etc. When this password is set, you need it to modify the document.
A PDF document can have a User Password, an Owner Password, or both. When both passwords are set, you only need one of them to be able to open the document. However, you need the Owner Password to be able to modify the document.
You can provide the password when loading the document, or later using the Document.unlock() method.
The Document.is_locked() method tells you whether the document is still locked, that is, whether you lack the permission to view it.
If you load a document with the wrong password, an error message is printed on the error console.
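As a minimal sketch of unlocking after loading, the helper below tries a list of candidate passwords until the document is no longer locked. The function names unlock_with_any and open_protected are hypothetical. The sketch assumes that Document.unlock() takes the owner password and the user password in that order, and that the same string may be offered for both; check this against your installed version.

```python
def unlock_with_any(document, passwords):
    """Try each candidate password until the document is unlocked.

    `document` is expected to behave like a poppler Document: it needs
    `is_locked()` and `unlock(owner_password, user_password)`.
    Returns True when the document ends up unlocked.
    """
    for password in passwords:
        if not document.is_locked():
            break
        # Assumption: the same string may be passed as both passwords.
        document.unlock(password, password)
    return not document.is_locked()


def open_protected(path, passwords):
    """Usage sketch; requires python-poppler to be installed."""
    from poppler import load_from_file

    document = load_from_file(path)
    return document if unlock_with_any(document, passwords) else None
```

The helper checks is_locked() after each attempt instead of relying on the return value of unlock(), so it does not depend on how that return value is defined.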
The possible document restrictions are given by the Permissions enum.
You can check each permission using the Document.has_permission() method.
If the document was opened with the correct owner password, every permission is True.
Otherwise, the result depends on the permissions set on the document itself.
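The permission check can be sketched as a small report that calls Document.has_permission() once per permission. The function names permission_report and report_for_file are hypothetical. The sketch assumes that Permissions can be imported from the poppler module and that it exposes its members through a `__members__` mapping; neither is confirmed by the text above.

```python
def permission_report(document, permissions):
    """Map each permission's name to whether the document grants it.

    `permissions` is any iterable of permission values; each value is
    passed unchanged to `document.has_permission()`.
    """
    return {
        getattr(permission, "name", str(permission)): bool(
            document.has_permission(permission)
        )
        for permission in permissions
    }


def report_for_file(path):
    """Usage sketch; requires python-poppler to be installed."""
    from poppler import Permissions, load_from_file

    document = load_from_file(path)
    # Assumption: the enum lists its members in `__members__`.
    return permission_report(document, Permissions.__members__.values())
```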
Document properties¶
The Document.infos() method is a convenient way to get all the document metadata as a Python dict. Otherwise, you can follow the poppler-cpp API: retrieve the list of available keys using Document.info_keys(), get individual values using Document.info_key() or Document.info_date(), and set them using Document.set_info_key() or Document.set_info_date().
The same information is also available via individual properties: Document.author, Document.creation_date, Document.creator, Document.keywords, Document.metadata, Document.modification_date, Document.producer, Document.subject, and Document.title.
All those properties can be read or written.
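Since Document.infos() returns a plain dict, the metadata can be listed with ordinary dict handling. The following is a minimal sketch; the helper name format_infos is hypothetical, and the actual keys depend on the document.

```python
def format_infos(document):
    """Return the document metadata as sorted "key: value" lines.

    Entries whose value is None or an empty string are skipped.
    """
    infos = document.infos()
    lines = []
    for key in sorted(infos):
        value = infos[key]
        if value is None or value == "":
            continue
        lines.append(f"{key}: {value}")
    return lines


def print_infos(path):
    """Usage sketch; requires python-poppler to be installed."""
    from poppler import load_from_file

    for line in format_infos(load_from_file(path)):
        print(line)
```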
Loading pages¶
You can query the number of pages in a document using Document.pages.
Pages are indexed from 0.
You can create a Page object using the Document.create_page() method.
This method takes either a page index or a page label as argument. However, it is usually more convenient to use an index, since you cannot know the label before the page is created.
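Combining Document.pages with Document.create_page(), a loop over every page of a document can be sketched as a generator. The helper name iter_pages is hypothetical.

```python
def iter_pages(document):
    """Yield (index, page) pairs for every page, starting at index 0."""
    for index in range(document.pages):
        yield index, document.create_page(index)


def print_labels(path):
    """Usage sketch; requires python-poppler to be installed."""
    from poppler import load_from_file

    for index, page in iter_pages(load_from_file(path)):
        # The label is only known once the page has been created.
        print(index, page.label)
```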
Working with pages¶
Page objects are used to extract text and to query information about transitions.
The Page.label property gives you the page name; it is usually the displayed page number.
Page.page_rect() lets you query the size of the page.
Page transitions are mainly used by presentation software.
Page.transition() gives you information about the kind of page transition, and Page.duration gives you the duration of the transition.
Extracting text¶
The Page.text() method returns all the text a Page contains, or only the text in a given area.
For more precise information, Page.text_list() gives the position of each piece of text, and the position of each character within a text box. Finally, the Page.search() method lets you search for a given text in a Page.
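As a rough illustration of Page.text(), the sketch below lists the indexes of the pages whose extracted text contains a given string. It uses a plain Python substring test on the result of Page.text(), not Page.search(), so it does not report positions. The helper name pages_containing is hypothetical.

```python
def pages_containing(document, needle, case_sensitive=False):
    """Return the indexes of the pages whose text contains `needle`."""
    if not case_sensitive:
        needle = needle.casefold()
    hits = []
    for index in range(document.pages):
        text = document.create_page(index).text()
        if not case_sensitive:
            text = text.casefold()
        if needle in text:
            hits.append(index)
    return hits


def find_in_file(path, needle):
    """Usage sketch; requires python-poppler to be installed."""
    from poppler import load_from_file

    return pages_containing(load_from_file(path), needle)
```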
Getting font information¶
You can get the list of fonts in a Document using Document.create_font_iterator().
It returns an object you can iterate over to get the fonts of each page:
font_iterator = document.create_font_iterator()
for page, fonts in font_iterator:
    print(f"Fonts for page {page}")
    for font in fonts:
        print(f"- {font.name}")
Since Poppler 0.89, you can also get font information associated with a TextBox.
To get this information, you need to pass the text_list_include_font option to the Page.text_list() method.
boxes = pdf_page.text_list(pdf_page.TextListOption.text_list_include_font)
box = boxes[0]
assert box.has_font_info
print(box.get_font_name())
print(box.get_font_size())
Rendering image¶
Rendering is the process of converting a Page to an Image.
To render a Page, you first need to create a PageRenderer object.
Then you give the Page to the PageRenderer.render_page() method to obtain an Image object.
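When choosing a rendering resolution, it can help to estimate the size of the resulting image. PDF page dimensions are expressed in points of 1/72 inch, so the pixel size scales with the resolution divided by 72. The sketch below is an approximation: the rounding used by Poppler may differ by a pixel. The names pixel_size and render_at are hypothetical, and the xres and yres keyword arguments of render_page() are an assumption not confirmed by the text above.

```python
def pixel_size(width_pt, height_pt, dpi=72.0):
    """Estimate the rendered size in pixels of a page given in points."""
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    return round(width_pt * scale), round(height_pt * scale)


def render_at(page, dpi):
    """Usage sketch; requires python-poppler to be installed."""
    from poppler import PageRenderer

    renderer = PageRenderer()
    # Assumption: render_page() accepts xres and yres keyword arguments.
    return renderer.render_page(page, xres=dpi, yres=dpi)
```

For example, an A4 page of 595 by 842 points rendered at 144 dpi gives roughly 1190 by 1684 pixels.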
Working with images¶
Given that the image object is an instance of Image, you can convert it to different formats to interact with other libraries. Here are some examples.
Converting to PIL or Tk image¶
An ImageFormat can be converted to a string representation compatible with the PIL raw importer:
from PIL import Image, ImageTk

pil_image = Image.frombytes(
    "RGBA",
    (image.width, image.height),
    image.data,
    "raw",
    str(image.format),
)
tk_image = ImageTk.PhotoImage(pil_image)
Unfortunately, it is not possible to build a PIL image using the buffer interface. A copy of the image data is unavoidable.
If you need to use the image with Tk, you can create it from a PIL image.
Converting to QImage¶
There is no built-in mapping for the image formats, mainly to avoid introducing a dependency on Qt. However, it is easy to build one if needed, as in the following example:
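from poppler import ImageFormat  # assumption: aliased in the poppler module
from PyQt5 import QtGui  # assumption: PyQt5; adapt to your Qt binding

# Map poppler image formats to the corresponding QImage formats
P2QFormat = {
    ImageFormat.invalid: QtGui.QImage.Format_Invalid,
    ImageFormat.argb32: QtGui.QImage.Format_ARGB32,
    ImageFormat.bgr24: QtGui.QImage.Format_BGR888,
    ImageFormat.gray8: QtGui.QImage.Format_Grayscale8,
    ImageFormat.mono: QtGui.QImage.Format_Mono,
    ImageFormat.rgb24: QtGui.QImage.Format_RGB888,
}

qimg = QtGui.QImage(
    image.data,
    image.width,
    image.height,
    image.bytes_per_row,
    P2QFormat[image.format],
)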
Converting image to numpy array¶
Image supports the buffer protocol through memoryview.
This lets you access the image buffer directly from Python, without copying it.
You can create a numpy array using the memoryview() method.
If you modify the array, the image data is modified as well.
import numpy

a = numpy.array(image.memoryview(), copy=False)
print(a[0, 0, 0])
print(image.data[0])  # Value of the first byte of the image
a[0, 0, 0] = 0
print(image.data[0])  # It is now 0
Suppressing error messages¶
For some documents, Poppler may produce a lot of error messages, which are sent to stderr by default. If this is not desirable, it is possible to disable them altogether.
import poppler

# disable logging
poppler.enable_logging(False)
# enable logging to stderr again
poppler.enable_logging(True)