summaryrefslogtreecommitdiff
path: root/docs/tutorial.rst
blob: 9cc6c8a26cdbbabd9e52de43f7647643badd8f26 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
.. _tutorial:

Tutorial
********

.. figure:: images/pike-cartoon.png
       :figwidth: 30%
       :align: right

This brief tutorial should give you an introduction and orientation to pikepdf's
paradigm and syntax. From there, we refer to you various topics.

Opening and saving PDFs
-----------------------

In contrast to better known PDF libraries, pikepdf uses a single object to
represent a PDF, whether reading, writing or merging. We have cleverly named
this :class:`pikepdf.Pdf`. In this documentation, a ``Pdf`` is a class that
allows manipulate the PDF, meaning the file.

.. code-block:: python

   from pikepdf import Pdf
   new_pdf = Pdf.new()
   with Pdf.open('sample.pdf') as pdf:
       pdf.save('output.pdf')

You may of course use ``from pikepdf import Pdf as ...`` if the short class
name conflicts or ``from pikepdf import Pdf as PDF`` if you prefer uppercase.

:func:`pikepdf.open` is a shorthand for :meth:`pikepdf.Pdf.open`.

The PDF class API follows the example of the widely-used
`Pillow image library <https://pillow.readthedocs.io/en/latest/>`_. For clarity
there is no default constructor since the arguments used for creation and
opening are different. ``Pdf.open()`` also accepts seekable streams as input,
and ``Pdf.save()`` accepts streams as output.

Inspecting pages
----------------

Manipulating pages is fundamental to PDFs. pikepdf presents the pages in a PDF
through the :attr:`pikepdf.Pdf.pages` property, which follows the ``list``
protocol. As such page numbers begin at 0.

Let’s open a simple PDF that contains four pages.

.. ipython::

    In [1]: from pikepdf import Pdf

    In [2]: pdf = Pdf.open('../tests/resources/fourpages.pdf')

How many pages?

.. ipython::

    In [2]: len(pdf.pages)

pikepdf integrates with IPython and Jupyter's rich object APIs so that you can
view PDFs, PDF pages, or images within PDF in a IPython window or Jupyter
notebook. This makes it to test visual changes.

.. ipython::
    :verbatim:

    In [1]: pdf
    Out[1]: « In Jupyter you would see the PDF here »

    In [1]: pdf.pages[0]
    Out[1]: « In Jupyter you would see an image of the PDF page here »

You can also examine individual pages, which we’ll explore in the next
section. Suffice to say that you can access pages by indexing them and
slicing them.

.. ipython::
    :verbatim:

    In [1]: pdf.pages[0]
    Out[1]: « In Jupyter you would see an image of the PDF page here »

.. note::

    :meth:`pikepdf.Pdf.open` can open almost all types of encrypted PDF! Just
    provide the ``password=`` keyword argument.

For more details on document assembly, see
:ref:`PDF split, merge and document assembly <docassembly>`.

Pages are dictionaries
----------------------

In PDFs, the main data structure is the **dictionary**, a key-value data
structure much like a Python ``dict`` or ``attrdict``. The major difference is
that the keys can only be **names**, and can only be PDF types, including
other dictionaries.

PDF dictionaries are represented as :class:`pikepdf.Dictionary`, and names
are of type :class:`pikepdf.Name`. A page is just a dictionary with a few
required files and a reference from the document's "page tree". (pikepdf manages
the page tree for you.)

.. ipython::

    In [1]: from pikepdf import Pdf

    In [1]: example = Pdf.open('../tests/resources/congress.pdf')

    In [1]: page1 = example.pages[0]

repr() output
-------------

Let's example the page's ``repr()`` output:

.. ipython::

    In [1]: page1

The angle brackets in the output indicate that this object cannot be constructed
with a Python expression because it contains a reference. When angle brackets
are omitted from the ``repr()`` of a pikepdf object, then the object can be
replicated with a Python expression, such as ``eval(repr(x)) == x``. Pages
typically concern indirect references to themselves and other pages, so they
cannot be represented as an expression.

In Jupyter and IPython, pikepdf will instead attempt to display a preview of the PDF
page, assuming a PDF rendering backend is available.

Item and attribute notation
---------------------------

Dictionary keys may be looked up using attributes (``page1.MediaBox``) or
keys (``page1['/MediaBox']``).

.. ipython::

    In [1]: page1.MediaBox      # preferred notation for required names

    In [1]: page1['/MediaBox']  # also works

By convention, pikepdf uses attribute notation for standard names, and item
notation for names that are set by PDF developers. For example, the images
belong to a page always appear at ``page.Resources.XObject`` but the name
of images is set by the PDF creator:

.. ipython::
    :verbatim:

    In [1]: page1.Resources.XObject['/Im0']

Item notation here would be quite cumbersome:
``['/Resources']['/XObject]['/Im0']`` (not recommended).

Attribute notation is convenient, but not robust if elements are missing. For
elements that are not always present, you can use ``.get()``, which behaves like
``dict.get()`` in core Python.  A library such as `glom
<https://github.com/mahmoud/glom>`_ might help when working with complex
structured data that is not always present.

(For now, we'll set aside what a page's ``MediaBox`` and ``Resources.XObject``
are for. See :ref:`Working with pages <work_with_pages>` for details.)

Deleting pages
--------------

Removing pages is easy too.

.. ipython::

    In [1]: del pdf.pages[1:3]  # Remove pages 2-3 labeled "second page" and "third page"

.. ipython::

    In [1]: len(pdf.pages)

Saving changes
--------------

Naturally, you can save your changes with :meth:`pikepdf.Pdf.save`.
``filename`` can be a :class:`pathlib.Path`, which we accept everywhere. (Saving
is commented out to avoid upsetting the documentation generator.)

.. ipython::
    :verbatim:

    In [1]: pdf.save('output.pdf')

You may save a file multiple times, and you may continue modifying it after
saving.

To save an encrypted (password protected) PDF, use a :class:`pikepdf.Encryption`
object to specify the encryption settings. By default, pikepdf selects the strongest
security handler and algorithm (AES-256), but allows full access to modify file contents.
A :class:`pikepdf.Permissions` object can be used to specify restrictions.

.. ipython::
    :verbatim:

    In [1]: no_extracting = pikepdf.Permissions(extract=False)

    In [1]: pdf.save('encrypted.pdf', encryption=pikepdf.Encryption(
       ...:      user="user password", owner="owner password", allow=no_extracting
       ...: ))

Next steps
----------

Have a look at pikepdf topics that interest you, or jump to our detailed API
reference...