summaryrefslogtreecommitdiff
path: root/doc/pdfgrep.txt
blob: 5e13f69eafa5edd72ea55f52e2c0c2d9e8bc4729 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
= pdfgrep(1)
:doctype: manpage
:man source: Pdfgrep
:man manual: Pdfgrep Manual
:man version: {pdfgrepversion}

== NAME
pdfgrep - search PDF files for a regular expression

== SYNOPSIS
[verse]
*pdfgrep* ['OPTION'...] 'PATTERN' ['FILE'...]
*pdfgrep* ['OPTION'...] [*-e* 'PATTERN' | *-f* 'FILE'] ['FILE'...]

== DESCRIPTION

Search for 'PATTERN' in each PDF 'FILE' and print matching lines. By
default, 'PATTERN' is an extended regular expression.

*pdfgrep* tries to be mostly compatible with *GNU grep* with some
 PDF-specific distinctions and additional options. Most notably, *-n*
 prints page instead of line numbers.

== OPTIONS
=== General Information

*--help* :: Print a short summary of the options.

*-V*, *--version* :: Show version information.

=== Pattern Interpretation

*-F*, *--fixed-strings* :: Interpret 'PATTERN' as a list of fixed
  strings separated by newlines, any of which is to be matched.

*-P*, *--perl-regexp* :: Interpret 'PATTERN' as a Perl compatible
  regular expression (PCRE). See 'pcresyntax'(3) for a quick overview.

=== Matching Control

*-e* 'PATTERN', *--regexp=*'PATTERN' :: Use 'PATTERN' as the pattern
  to search for. If this option is specified multiple times or
  combined with *--file*, all patterns are tried in turn until one of
  them matches.

*-f* 'FILE', *--file=*'FILE' :: Read patterns from 'FILE', one per
  line. If 'FILE' contains multiple patterns or if this option is
  applied multiple times or combined with *-e*, all patterns are tried
  in turn until one of them matches. An empty pattern list matches
  nothing.

*-i*, *--ignore-case* :: Ignore case distinctions in both the
  'PATTERN' and the input files.

=== General Output Control

*-c*, *--count* :: Suppress normal output. Instead print the number of
  matches for each input file. Note that unlike grep, multiple matches
  on the same page will be counted individually.

*-p*, *--page-count* :: Like *-c*, but prints the number of matches
  per page. Implies *-n*.

*--color* 'WHEN' :: Surround file names, page numbers and matched text
  with escape sequences to display them in color on the terminal.
  'WHEN' can be:
+
[horizontal]
  *always* ;; Always use colors, even when stdout is not a terminal.
  *never* ;; Do not use colors.
  *auto* ;; Use colors only when stdout is a terminal (this is the
   default).

*-L*, *--files-without-match* :: Suppress normal output. Instead print
  the name of each input file that doesn't contain a match. This works
  well with *-Z*, but many other output options like *-n* or *-c* are
  ignored when *-L* is specified.

*-l*, *--files-with-matches* :: Suppress normal output. Instead print
  the name of each input file that contains a match. This works well
  with *-Z*, but many other output options like *-n* or *-c* are
  ignored when *-l* is specified.

*-m*, *--max-count* 'NUM' :: Stop reading a file after 'NUM' matches.
  When the -c or --count option is also used, pdfgrep does not output
  a count greater than 'NUM'.

*-o*, *--only-matching* :: Print only the matched part of a line
  without any surrounding context.

*-q*, *--quiet* :: Suppress all normal output to stdout. Exit
  immediately with exit status 0 if a match is found, even in case of
  errors. Use this if you only care about the presence of matches, not
  their number or content.

=== Line Prefix Control

*-H*, *--with-filename* :: Print the file name for each match. This is
  the default setting when there is more than one file to search.

*-h*, *--no-filename* :: Suppress the prefixing of file name on
  output. This is the default setting when there is only one file to
  search.

*-n*, *--page-number* :: Prefix each match with the number of the page
  where it was found.

*-Z*, *--null* :: Output a null byte (called 'NUL' in ASCII and \'\0'
  in C) instead of the colon that usually separates a filename from
  the rest of the line. This option makes the output unambiguous in
  the presence of colons, spaces or newlines in the filename. It can
  be used in conjunction with commands such as 'xargs -0' or
  'perl -0'.

*--match-prefix-separator* 'SEP' :: Changes the colon used to separate
   filename, line number and text in the output to 'SEP', which can be
   an arbitrary string. This is useful when filenames contain colons,
   but only for interactive usage. For scripting, *--null* should be
   used.

=== Context Control

*-A* 'NUM', *--after-context=NUM*:: Print 'NUM' lines of context after
  matching lines. Contiguous groups of matches are separated by a line
  containing *--*. With *-o*, this option has no effect.

*-B* 'NUM', *--before-context=NUM*:: Print 'NUM' lines of context
  before matching lines. Contiguous groups of matches are separated by
  a line containing *--*. With *-o*, this option has no effect.

*-C* 'NUM', *--context=NUM*:: Print 'NUM' lines of context before and
  after matching lines. Contiguous groups of matches are separated by
  a line containing *--*. With *-o*, this option has no effect.

=== File Selection

*-r*, *--recursive*:: Recursively search all files (restricted by
  *--include* and *--exclude*) under each directory, following symlinks
  only if they are on the command line.

*-R*, *--dereference-recursive*:: Same as *-r*, but follows all
  symlinks.

*--exclude=*'GLOB' :: Skip files whose base name matches 'GLOB'. See
  'glob'(7) for wildcards you can use. You can use this option
  multiple times to exclude more patterns. It takes precedence over
  *--include*. Note, that in- and excludes apply only to files found
  via *--recursive* and not to the argument list.

*--include=*'GLOB' :: Only search files whose base name matches
  'GLOB'. See *--exclude* for details. The default is '*.pdf'.

=== Other Options

*--cache* :: Use a cache for the rendered text to speed up the
  operation on large files.

*--password=*'PASSWORD' :: Use PASSWORD to decrypt the PDF-files. Can
  be specified multiple times; all passwords will be tried on all
  PDFs.
  *Note* that this password will show up in your command history and
  the output of 'ps'(1). So please do not use this if the security of
  'PASSWORD' is important.

*--page-range=*'RANGE' :: Limit search to a specified set of pages.
   'RANGE' is a comma separated list of either a single page number or
   a range expression of the form `PAGE1-PAGE2`. Example:
   `2-3,5,7-10`.

*--debug* :: Enable debug output. *Note*: Due to limitations of
   poppler before version 0.30.0, some debug output is also printed
   without *--debug* when using such a poppler version.

*--warn-empty* :: Print a warning to 'stderr' if a PDF contains no
   searchable text. This is the case for PDFs that consist only of
   images, for example scanned documents.

*--unac* :: Remove accents and ligatures from both the search pattern
  and the PDF documents. This is useful if you want to search for a
  word containing "ae", but the PDF uses the single character "æ"
  instead. See *unac(3)* and *unaccent(1)* for details.
+
*This option is experimental and only available if pdfgrep is
compiled with unac support.*

== EXIT STATUS
Normally, the exit status is 0 if at least one match is found, 1 if no
match is found and 2 if an error occurred. But if the *--quiet* or
*-q* option is used and a match was found, *pdfgrep* will return 0
regardless of errors.

== ENVIRONMENT VARIABLES
The behavior of *pdfgrep* is affected by the following environment
variable.

*GREP_COLORS* :: Specifies the colors and other attributes used to
  highlight various parts of the output. The syntax and values are
  like *GREP_COLORS* of *grep*. See 'grep'(1) for more details.
  Currently only the capabilities *mt*, *ms*, *mc*, *fn*, *ln* and
  *se* are used by *pdfgrep*, where *mt*, *ms* and *mc* have the same
  effect.

== FILES

*$\{XDG_CACHE_HOME\}/pdfgrep/** :: Cache files written and used when
  *--cache* is enabled. At most 200 cache entries older than a day are
  retained.

== Examples
*Print the first ten lines matching 'pattern' and print their page number:* ::
+
--------------------------------------------------
pdfgrep -n --max-count 10 pattern foo.pdf
--------------------------------------------------

*Search all .pdf files whose names begin with 'foo' recursively in the current directory:* ::
+
--------------------------------------------------
pdfgrep -r --include "foo*.pdf" pattern
--------------------------------------------------

*Search all PDFs in the current directory for 'foo' that also contain 'bar':*::
+
--------------------------------------------------
pdfgrep -Z --files-with-matches "bar" *.pdf | xargs -0 pdfgrep -H foo
--------------------------------------------------

*Search all .pdf files that are smaller than 12M recursively in the current directory:* ::
+
--------------------------------------------------
find . -name "*.pdf" -size -12M -print0 | xargs -0 pdfgrep pattern
--------------------------------------------------
+
Note that in contrast to the previous examples, this task could not be
solved with pdfgrep alone, but the Unix tools *find(1)* and *xargs(1)*
had to be used. That's because pdfgrep itself doesn't include options
to exclude files by their size. But as you see, it doesn't have to!

== BUGS
=== Reporting Bugs
Bugs can either be reportet to the mailing list
(pdfgrep-users@pdfgrep.org) or to the bugtracker on gitlab
(https://gitlab.com/pdfgrep/pdfgrep/issues).

== AUTHORS
*pdfgrep* is maintained by Hans-Peter Deifel.

See the 'AUTHORS' file in the source for a full list of contributors.


== SEE ALSO
grep(1), pcre(3), regex(7)

See pdfgrep's website https://pdfgrep.org for more information,
downloads, git repository and more.