'\" t
.\"     Title: pdfgrep
.\"    Author: [see the "AUTHORS" section]
.\" Generator: DocBook XSL Stylesheets vsnapshot <http://docbook.sf.net/>
.\"      Date: 11/19/2018
.\"    Manual: Pdfgrep Manual
.\"    Source: Pdfgrep 2.1.1
.\"  Language: English
.\"
.TH "PDFGREP" "1" "11/19/2018" "Pdfgrep 2\&.1\&.1" "Pdfgrep Manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
pdfgrep \- search PDF files for a regular expression
.SH "SYNOPSIS"
.sp
.nf
\fBpdfgrep\fR [\fIOPTION\fR\&...] \fIPATTERN\fR [\fIFILE\fR\&...]
\fBpdfgrep\fR [\fIOPTION\fR\&...] [\fB\-e\fR \fIPATTERN\fR | \fB\-f\fR \fIFILE\fR] [\fIFILE\fR\&...]
.fi
.SH "DESCRIPTION"
.sp
Search for \fIPATTERN\fR in each PDF \fIFILE\fR and print matching lines\&. By default, \fIPATTERN\fR is an extended regular expression\&.
.sp
\fBpdfgrep\fR tries to be mostly compatible with \fBGNU grep\fR with some PDF\-specific distinctions and additional options\&. Most notably, \fB\-n\fR prints page instead of line numbers\&.
.SH "OPTIONS"
.SS "General Information"
.PP
\fB\-\-help\fR
.RS 4
Print a short summary of the options\&.
.RE
.PP
\fB\-V\fR, \fB\-\-version\fR
.RS 4
Show version information\&.
.RE
.SS "Pattern Interpretation"
.PP
\fB\-F\fR, \fB\-\-fixed\-strings\fR
.RS 4
Interpret
\fIPATTERN\fR
as a list of fixed strings separated by newlines, any of which is to be matched\&.
.RE
.PP
\fB\-P\fR, \fB\-\-perl\-regexp\fR
.RS 4
Interpret
\fIPATTERN\fR
as a Perl compatible regular expression (PCRE)\&. See
\fIpcresyntax\fR(3) for a quick overview\&.
.RE
.SS "Matching Control"
.PP
\fB\-e\fR \fIPATTERN\fR, \fB\-\-regexp=\fR\fIPATTERN\fR
.RS 4
Use
\fIPATTERN\fR
as the pattern to search for\&. If this option is specified multiple times or combined with
\fB\-\-file\fR, all patterns are tried in turn until one of them matches\&.
.RE
.PP
\fB\-f\fR \fIFILE\fR, \fB\-\-file=\fR\fIFILE\fR
.RS 4
Read patterns from
\fIFILE\fR, one per line\&. If
\fIFILE\fR
contains multiple patterns or if this option is applied multiple times or combined with
\fB\-e\fR, all patterns are tried in turn until one of them matches\&. An empty pattern list matches nothing\&.
.RE
.PP
\fB\-i\fR, \fB\-\-ignore\-case\fR
.RS 4
Ignore case distinctions in both the
\fIPATTERN\fR
and the input files\&.
.RE
.SS "General Output Control"
.PP
\fB\-c\fR, \fB\-\-count\fR
.RS 4
Suppress normal output\&. Instead print the number of matches for each input file\&. Note that unlike grep, multiple matches on the same page will be counted individually\&.
.RE
.PP
\fB\-p\fR, \fB\-\-page\-count\fR
.RS 4
Like
\fB\-c\fR, but prints the number of matches per page\&. Implies
\fB\-n\fR\&.
.RE
.PP
\fB\-\-color\fR \fIWHEN\fR
.RS 4
Surround file names, page numbers and matched text with escape sequences to display them in color on the terminal\&.
\fIWHEN\fR
can be:
.TS
tab(:);
lt lt
lt lt
lt lt.
T{
\fBalways\fR
T}:T{
Always use colors, even when stdout is not a terminal\&.
T}
T{
\fBnever\fR
T}:T{
Do not use colors\&.
T}
T{
\fBauto\fR
T}:T{
Use colors only when stdout is a terminal (this is the default)\&.
T}
.TE
.sp 1
.RE
.PP
\fB\-L\fR, \fB\-\-files\-without\-match\fR
.RS 4
Suppress normal output\&. Instead print the name of each input file that doesn\(cqt contain a match\&. This works well with
\fB\-Z\fR, but many other output options like
\fB\-n\fR
or
\fB\-c\fR
are ignored when
\fB\-L\fR
is specified\&.
.RE
.PP
\fB\-l\fR, \fB\-\-files\-with\-matches\fR
.RS 4
Suppress normal output\&. Instead print the name of each input file that contains a match\&. This works well with
\fB\-Z\fR, but many other output options like
\fB\-n\fR
or
\fB\-c\fR
are ignored when
\fB\-l\fR
is specified\&.
.RE
.PP
\fB\-m\fR, \fB\-\-max\-count\fR \fINUM\fR
.RS 4
Stop reading a file after
\fINUM\fR
matches\&. When the \fB\-c\fR or \fB\-\-count\fR option is also used, \fBpdfgrep\fR does not output a count greater than
\fINUM\fR\&.
.RE
.PP
\fB\-o\fR, \fB\-\-only\-matching\fR
.RS 4
Print only the matched part of a line without any surrounding context\&.
.RE
.PP
\fB\-q\fR, \fB\-\-quiet\fR
.RS 4
Suppress all normal output to stdout\&. Exit immediately with exit status 0 if a match is found, even in case of errors\&. Use this if you only care about the presence of matches, not their number or content\&.
.RE
.SS "Line Prefix Control"
.PP
\fB\-H\fR, \fB\-\-with\-filename\fR
.RS 4
Print the file name for each match\&. This is the default setting when there is more than one file to search\&.
.RE
.PP
\fB\-h\fR, \fB\-\-no\-filename\fR
.RS 4
Suppress the prefixing of file name on output\&. This is the default setting when there is only one file to search\&.
.RE
.PP
\fB\-n\fR, \fB\-\-page\-number\fR
.RS 4
Prefix each match with the number of the page where it was found\&.
.RE
.PP
\fB\-Z\fR, \fB\-\-null\fR
.RS 4
Output a null byte (called
\fINUL\fR
in ASCII and \*(Aq\e0\*(Aq in C) instead of the colon that usually separates a filename from the rest of the line\&. This option makes the output unambiguous in the presence of colons, spaces or newlines in the filename\&. It can be used in conjunction with commands such as
\fIxargs\ \&\-0\fR
or
\fIperl\ \&\-0\fR\&.
.RE
.PP
\fB\-\-match\-prefix\-separator\fR \fISEP\fR
.RS 4
Change the colon used to separate filename, page number and text in the output to
\fISEP\fR, which can be an arbitrary string\&. This is useful when filenames contain colons, but only for interactive usage\&. For scripting,
\fB\-\-null\fR
should be used\&.
.RE
.SS "Context Control"
.PP
\fB\-A\fR \fINUM\fR, \fB\-\-after\-context=NUM\fR
.RS 4
Print
\fINUM\fR
lines of context after matching lines\&. Contiguous groups of matches are separated by a line containing
\fB\-\-\fR\&. With
\fB\-o\fR, this option has no effect\&.
.RE
.PP
\fB\-B\fR \fINUM\fR, \fB\-\-before\-context=NUM\fR
.RS 4
Print
\fINUM\fR
lines of context before matching lines\&. Contiguous groups of matches are separated by a line containing
\fB\-\-\fR\&. With
\fB\-o\fR, this option has no effect\&.
.RE
.PP
\fB\-C\fR \fINUM\fR, \fB\-\-context=NUM\fR
.RS 4
Print
\fINUM\fR
lines of context before and after matching lines\&. Contiguous groups of matches are separated by a line containing
\fB\-\-\fR\&. With
\fB\-o\fR, this option has no effect\&.
.RE
.SS "File Selection"
.PP
\fB\-r\fR, \fB\-\-recursive\fR
.RS 4
Recursively search all files (restricted by
\fB\-\-include\fR
and
\fB\-\-exclude\fR) under each directory, following symlinks only if they are on the command line\&.
.RE
.PP
\fB\-R\fR, \fB\-\-dereference\-recursive\fR
.RS 4
Same as
\fB\-r\fR, but follows all symlinks\&.
.RE
.PP
\fB\-\-exclude=\fR\fIGLOB\fR
.RS 4
Skip files whose base name matches
\fIGLOB\fR\&. See
\fIglob\fR(7) for wildcards you can use\&. You can use this option multiple times to exclude more patterns\&. It takes precedence over
\fB\-\-include\fR\&. Note that includes and excludes apply only to files found via
\fB\-\-recursive\fR
and not to the argument list\&.
.RE
.PP
\fB\-\-include=\fR\fIGLOB\fR
.RS 4
Only search files whose base name matches
\fIGLOB\fR\&. See
\fB\-\-exclude\fR
for details\&. The default is
\fI*\&.pdf\fR\&.
.RE
.SS "Other Options"
.PP
\fB\-\-cache\fR
.RS 4
Use a cache for the rendered text to speed up the operation on large files\&.
.RE
.PP
\fB\-\-password=\fR\fIPASSWORD\fR
.RS 4
Use \fIPASSWORD\fR to decrypt the PDF files\&. Can be specified multiple times; all passwords will be tried on all PDFs\&.
\fBNote\fR
that this password will show up in your command history and the output of
\fIps\fR(1)\&. So please do not use this if the security of
\fIPASSWORD\fR
is important\&.
.RE
.PP
\fB\-\-page\-range=\fR\fIRANGE\fR
.RS 4
Limit search to a specified set of pages\&.
\fIRANGE\fR
is a comma\-separated list whose items are either single page numbers or range expressions of the form
PAGE1\-PAGE2\&. Example:
2\-3,5,7\-10\&.
.RE
.PP
\fB\-\-debug\fR
.RS 4
Enable debug output\&.
\fBNote\fR: Due to limitations of poppler before version 0\&.30\&.0, some debug output is also printed without
\fB\-\-debug\fR
when using such a poppler version\&.
.RE
.PP
\fB\-\-warn\-empty\fR
.RS 4
Print a warning to
\fIstderr\fR
if a PDF contains no searchable text\&. This is the case for PDFs that consist only of images, for example scanned documents\&.
.RE
.PP
\fB\-\-unac\fR
.RS 4
Remove accents and ligatures from both the search pattern and the PDF documents\&. This is useful if you want to search for a word containing "ae", but the PDF uses the single character "æ" instead\&. See
\fBunac(3)\fR
and
\fBunaccent(1)\fR
for details\&.
.sp
\fBThis option is experimental and only available if pdfgrep is compiled with unac support\&.\fR
.RE
.SH "EXIT STATUS"
.sp
Normally, the exit status is 0 if at least one match is found, 1 if no match is found and 2 if an error occurred\&. But if the \fB\-\-quiet\fR or \fB\-q\fR option is used and a match was found, \fBpdfgrep\fR will return 0 regardless of errors\&.
.SH "ENVIRONMENT VARIABLES"
.sp
The behavior of \fBpdfgrep\fR is affected by the following environment variable\&.
.PP
\fBGREP_COLORS\fR
.RS 4
Specifies the colors and other attributes used to highlight various parts of the output\&. The syntax and values are like
\fBGREP_COLORS\fR
of
\fBgrep\fR\&. See
\fIgrep\fR(1) for more details\&. Currently only the capabilities
\fBmt\fR,
\fBms\fR,
\fBmc\fR,
\fBfn\fR,
\fBln\fR
and
\fBse\fR
are used by
\fBpdfgrep\fR, where
\fBmt\fR,
\fBms\fR
and
\fBmc\fR
have the same effect\&.
.RE
.SH "FILES"
.PP
\fB${XDG_CACHE_HOME}/pdfgrep/\fR*
.RS 4
Cache files written and used when
\fB\-\-cache\fR
is enabled\&. At most 200 cache entries older than a day are retained\&.
.RE
.SH "EXAMPLES"
.PP
\fBPrint the first ten lines matching \fR\fB\fIpattern\fR\fR\fB and print their page number:\fR
.RS 4
.sp
.if n \{\
.RS 4
.\}
.nf
pdfgrep \-n \-\-max\-count 10 pattern foo\&.pdf
.fi
.if n \{\
.RE
.\}
.RE
.PP
\fBSearch all \&.pdf files whose names begin with \fR\fB\fIfoo\fR\fR\fB recursively in the current directory:\fR
.RS 4
.sp
.if n \{\
.RS 4
.\}
.nf
pdfgrep \-r \-\-include "foo*\&.pdf" pattern
.fi
.if n \{\
.RE
.\}
.RE
.PP
\fBSearch all PDFs in the current directory for \fR\fB\fIfoo\fR\fR\fB that also contain \fR\fB\fIbar\fR\fR\fB:\fR
.RS 4
.sp
.if n \{\
.RS 4
.\}
.nf
pdfgrep \-Z \-\-files\-with\-matches "bar" *\&.pdf | xargs \-0 pdfgrep \-H foo
.fi
.if n \{\
.RE
.\}
.RE
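.PP
\fBSearch only pages 2\-3, 5 and 7\-10 of \fR\fB\fIfoo\&.pdf\fR\fR\fB and print the page number of each match (a sketch; the file name is a placeholder and the file is assumed to have at least ten pages):\fR
.RS 4
.sp
.if n \{\
.RS 4
.\}
.nf
pdfgrep \-n \-\-page\-range=2\-3,5,7\-10 pattern foo\&.pdf
.fi
.if n \{\
.RE
.\}
.RE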
.PP
\fBSearch all \&.pdf files that are smaller than 12M recursively in the current directory:\fR
.RS 4
.sp
.if n \{\
.RS 4
.\}
.nf
find \&. \-name "*\&.pdf" \-size \-12M \-print0 | xargs \-0 pdfgrep pattern
.fi
.if n \{\
.RE
.\}
.sp
Note that in contrast to the previous examples, this task could not be solved with pdfgrep alone, but the Unix tools
\fBfind(1)\fR
and
\fBxargs(1)\fR
had to be used\&. That\(cqs because pdfgrep itself doesn\(cqt include options to exclude files by their size\&. But as you see, it doesn\(cqt have to!
.RE
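.PP
\fBTest in a shell script whether \fR\fB\fIfoo\&.pdf\fR\fR\fB contains \fR\fB\fIpattern\fR\fR\fB (a sketch that relies only on the exit status documented under EXIT STATUS; the file name and messages are placeholders):\fR
.RS 4
.sp
.if n \{\
.RS 4
.\}
.nf
if pdfgrep \-q pattern foo\&.pdf; then
    echo "match found"
else
    echo "no match, or an error occurred"
fi
.fi
.if n \{\
.RE
.\}
.RE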
.SH "BUGS"
.SS "Reporting Bugs"
.sp
Bugs can be reported either to the mailing list (pdfgrep\-users@pdfgrep\&.org) or to the bug tracker on GitLab (https://gitlab\&.com/pdfgrep/pdfgrep/issues)\&.
.SH "AUTHORS"
.sp
\fBpdfgrep\fR is maintained by Hans\-Peter Deifel\&.
.sp
See the \fIAUTHORS\fR file in the source for a full list of contributors\&.
.SH "SEE ALSO"
.sp
grep(1), pcre(3), regex(7)
.sp
See pdfgrep\(cqs website https://pdfgrep\&.org for further information, downloads and the git repository\&.