Character encoding
******************
In most circumstances, pikepdf performs appropriate encodings and
decodings on its own, or returns :class:`pikepdf.String` if it is not clear
whether to present data as a string or binary data.

``str(pikepdf.String)`` works by inspecting the binary data. If the binary
data begins with a UTF-16 byte order mark, then the data is interpreted as
UTF-16 and returned as a Python ``str``. Otherwise, the binary data is
interpreted as PDFDocEncoding and decoded to ``str``. Again, in most cases
this is correct behavior and will operate transparently.
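
The decision described above can be sketched in plain Python. This is an
illustration only, not pikepdf's implementation: ``decode_pdf_text`` is a
hypothetical helper, and ``"latin-1"`` is used as a stand-in fallback so the
sketch runs without pikepdf installed. With pikepdf imported, the fallback
would be the ``"pdfdoc"`` codec.

.. code-block:: python

    import codecs


    def decode_pdf_text(data: bytes, fallback_codec: str) -> str:
        """Decode as UTF-16 if a byte order mark is present, else use the fallback."""
        if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
            # The "utf-16" codec reads the byte order mark to choose the
            # byte order and does not include it in the result.
            return data.decode("utf-16")
        return data.decode(fallback_codec)


    # "latin-1" is an assumed stand-in for "pdfdoc" (see the note above).
    utf16_text = decode_pdf_text(b"\xfe\xff\x00H\x00i", "latin-1")
    plain_text = decode_pdf_text(b"Hi", "latin-1")
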
Some functions are available in circumstances where it is necessary to force
a particular conversion.
PDFDocEncoding
==============
The PDF specification defines PDFDocEncoding, a character encoding used only
in PDFs. It is quite similar to ASCII but not equivalent.
When pikepdf is imported, it automatically registers ``"pdfdoc"`` as a codec
with the standard library, so that it may be used in string and byte
conversions.

.. code-block:: python

    "•".encode('pdfdoc') == b'\x80'

Other codecs
============
Two other codecs are commonly used in PDFs, but they are already part of the
standard library.

**WinAnsiEncoding** is identical to Windows Code Page 1252, and may be
converted using the ``"cp1252"`` codec.

**MacRomanEncoding** may be converted using the ``"macroman"`` codec.
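
As a brief illustration using only the standard library, the same bullet
character maps to a different byte under each of these two codecs. The
variable names are arbitrary; ``"mac_roman"`` is the canonical Python name
for the codec, and ``"macroman"`` is an alias for it.

.. code-block:: python

    bullet = "•"

    # WinAnsiEncoding corresponds to Windows Code Page 1252.
    win_ansi_bytes = bullet.encode("cp1252")

    # MacRomanEncoding corresponds to Python's Mac OS Roman codec.
    mac_roman_bytes = bullet.encode("mac_roman")

    # Decoding with the matching codec recovers the original character.
    round_trip_ok = (
        win_ansi_bytes.decode("cp1252") == bullet
        and mac_roman_bytes.decode("macroman") == bullet
    )
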