.. _tutorial:
Tutorial
********
.. figure:: images/pike-cartoon.png
:figwidth: 30%
:align: right
This brief tutorial should give you an introduction and orientation to pikepdf's
paradigm and syntax. From there, we refer you to various topics.
Opening and saving PDFs
-----------------------
In contrast to better known PDF libraries, pikepdf uses a single object to
represent a PDF, whether reading, writing or merging. We have cleverly named
this :class:`pikepdf.Pdf`. In this documentation, a ``Pdf`` is a class that
allows you to manipulate the PDF, meaning the file.
.. code-block:: python

   from pikepdf import Pdf

   new_pdf = Pdf.new()
   with Pdf.open('sample.pdf') as pdf:
       pdf.save('output.pdf')
You may of course use ``from pikepdf import Pdf as ...`` if the short class
name conflicts or ``from pikepdf import Pdf as PDF`` if you prefer uppercase.
:func:`pikepdf.open` is a shorthand for :meth:`pikepdf.Pdf.open`.
The PDF class API follows the example of the widely-used
`Pillow image library `_. For clarity
there is no default constructor since the arguments used for creation and
opening are different. ``Pdf.open()`` also accepts seekable streams as input,
and ``Pdf.save()`` accepts streams as output.
Inspecting pages
----------------
Manipulating pages is fundamental to PDFs. pikepdf presents the pages in a PDF
through the :attr:`pikepdf.Pdf.pages` property, which follows the ``list``
protocol. As such, page numbers begin at 0.
Let’s open a simple PDF that contains four pages.
.. ipython::
In [1]: from pikepdf import Pdf
In [2]: pdf = Pdf.open('../tests/resources/fourpages.pdf')
How many pages?
.. ipython::
In [2]: len(pdf.pages)
pikepdf integrates with IPython and Jupyter's rich object APIs so that you can
view PDFs, PDF pages, or images within PDFs in an IPython window or Jupyter
notebook. This makes it easier to test visual changes.
.. ipython::
:verbatim:
In [1]: pdf
Out[1]: « In Jupyter you would see the PDF here »
In [1]: pdf.pages[0]
Out[1]: « In Jupyter you would see an image of the PDF page here »
You can also examine individual pages, which we’ll explore in the next
section. Suffice it to say that you can access pages by indexing them and
slicing them.
.. ipython::
:verbatim:
In [1]: pdf.pages[0]
Out[1]: « In Jupyter you would see an image of the PDF page here »
.. note::
:meth:`pikepdf.Pdf.open` can open almost all types of encrypted PDF! Just
provide the ``password=`` keyword argument.
For more details on document assembly, see
:ref:`PDF split, merge and document assembly `.
Pages are dictionaries
----------------------
In PDFs, the main data structure is the **dictionary**, a key-value data
structure much like a Python ``dict`` or ``attrdict``. The major difference is
that the keys can only be **names**, and the values can only be PDF types,
including other dictionaries.
PDF dictionaries are represented as :class:`pikepdf.Dictionary`, and names
are of type :class:`pikepdf.Name`. A page is just a dictionary with a few
required fields and a reference from the document's "page tree". (pikepdf manages
the page tree for you.)
.. ipython::
In [1]: from pikepdf import Pdf
In [1]: example = Pdf.open('../tests/resources/congress.pdf')
In [1]: page1 = example.pages[0]
repr() output
-------------
Let's examine the page's ``repr()`` output:
.. ipython::
In [1]: page1
The angle brackets in the output indicate that this object cannot be constructed
with a Python expression because it contains a reference. When angle brackets
are omitted from the ``repr()`` of a pikepdf object, then the object can be
replicated with a Python expression, such that ``eval(repr(x)) == x``. Pages
typically contain indirect references to themselves and other pages, so they
cannot be represented as an expression.
In Jupyter and IPython, pikepdf will instead attempt to display a preview of the PDF
page, assuming a PDF rendering backend is available.
Item and attribute notation
---------------------------
Dictionary keys may be looked up using attributes (``page1.MediaBox``) or
keys (``page1['/MediaBox']``).
.. ipython::
In [1]: page1.MediaBox # preferred notation for required names
In [1]: page1['/MediaBox'] # also works
By convention, pikepdf uses attribute notation for standard names, and item
notation for names that are set by PDF developers. For example, the images
belonging to a page always appear at ``page.Resources.XObject`` but the name
of each image is set by the PDF creator:
.. ipython::
:verbatim:
In [1]: page1.Resources.XObject['/Im0']
Item notation here would be quite cumbersome:
``page1['/Resources']['/XObject']['/Im0']`` (not recommended).
Attribute notation is convenient, but not robust if elements are missing. For
elements that are not always present, you can use ``.get()``, which behaves like
``dict.get()`` in core Python. A library such as `glom
`_ might help when working with complex
structured data that is not always present.
(For now, we'll set aside what a page's ``MediaBox`` and ``Resources.XObject``
are for. See :ref:`Working with pages ` for details.)
Deleting pages
--------------
Removing pages is easy too.
.. ipython::
In [1]: del pdf.pages[1:3] # Remove pages 2-3 labeled "second page" and "third page"
.. ipython::
In [1]: len(pdf.pages)
Saving changes
--------------
Naturally, you can save your changes with :meth:`pikepdf.Pdf.save`.
``filename`` can be a :class:`pathlib.Path`, which we accept everywhere. (Saving
is commented out to avoid upsetting the documentation generator.)
.. ipython::
:verbatim:
In [1]: pdf.save('output.pdf')
You may save a file multiple times, and you may continue modifying it after
saving.
To save an encrypted (password protected) PDF, use a :class:`pikepdf.Encryption`
object to specify the encryption settings. By default, pikepdf selects the strongest
security handler and algorithm (AES-256), but allows full access to modify file contents.
A :class:`pikepdf.Permissions` object can be used to specify restrictions.
.. ipython::
:verbatim:
In [1]: import pikepdf
In [1]: no_extracting = pikepdf.Permissions(extract=False)
In [1]: pdf.save('encrypted.pdf', encryption=pikepdf.Encryption(
...: user="user password", owner="owner password", allow=no_extracting
...: ))
Next steps
----------
Have a look at pikepdf topics that interest you, or jump to our detailed API
reference...