Character encoding
******************

In most circumstances, pikepdf performs appropriate encodings and
decodings on its own, or returns :class:`pikepdf.String` if it is not clear
whether to present data as a string or binary data.

``str(pikepdf.String)`` works by inspecting the binary data. If the binary
data begins with a UTF-16 byte order mark, the data is interpreted as UTF-16
and returned as a Python ``str``. Otherwise, the binary data is interpreted
as PDFDocEncoding and decoded to ``str``. In most cases this is the correct
behavior and operates transparently.
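
The decision rule described above can be sketched in plain Python. This is
only an illustration, not pikepdf's actual implementation: the helper name
``decode_pdf_text`` and its ``fallback`` parameter are hypothetical, and the
default ``"pdfdoc"`` codec is only available after ``import pikepdf``.

.. code-block:: python

    import codecs

    def decode_pdf_text(raw: bytes, fallback: str = "pdfdoc") -> str:
        """Hypothetical helper that mirrors the rule described above."""
        # A leading UTF-16 byte order mark means the data is UTF-16.
        # The "utf-16" codec consumes the BOM and picks the byte order.
        if raw.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
            return raw.decode("utf-16")
        # Otherwise decode with the fallback codec. The default "pdfdoc"
        # is registered by pikepdf on import (assumed to be done by the caller).
        return raw.decode(fallback)

    # UTF-16 big-endian data with a BOM, as a PDF text string may store it.
    utf16_example = decode_pdf_text(b"\xfe\xff\x00H\x00i")

Passing an explicit ``fallback`` such as ``"ascii"`` lets the sketch run
without pikepdf installed, at the cost of not handling the characters that
are specific to PDFDocEncoding.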

Some functions are available in circumstances where it is necessary to force
a particular conversion.

PDFDocEncoding
==============

The PDF specification defines PDFDocEncoding, a character encoding used only
in PDFs. It is quite similar to ASCII but not equivalent.

When pikepdf is imported, it automatically registers ``"pdfdoc"`` as a codec
with the standard library, so that it may be used in string and byte
conversions.

.. code-block:: python

    "•".encode('pdfdoc') == b'\x80'

Other codecs
============

Two other codecs are commonly used in PDFs, but they are already part of the
standard library.

**WinAnsiEncoding** is identical to Windows Code Page 1252, and may be
converted using the ``"cp1252"`` codec.

**MacRomanEncoding** may be converted using the ``"macroman"`` codec.
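
As a minimal sketch of these two conversions, the following uses only the
standard library codecs. The byte strings and variable names are made up for
illustration; real data would come from a PDF's strings or content streams.

.. code-block:: python

    # Illustrative byte strings in each encoding (not taken from a real PDF).
    win_ansi_bytes = b"caf\xe9 \x80"   # 0xE9 is "é", 0x80 is "€" in cp1252
    mac_roman_bytes = b"caf\x8e"       # 0x8E is "é" in Mac OS Roman

    # Decode bytes to str using the standard library codecs.
    win_ansi_text = win_ansi_bytes.decode("cp1252")
    mac_roman_text = mac_roman_bytes.decode("macroman")

    # Encoding goes the other way, from str back to bytes.
    round_trip = mac_roman_text.encode("macroman")

The same character can map to different byte values in each encoding, so it
matters which codec is chosen for a given string.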