% HTML2MARKDOWN(1) Pandoc User Manuals
% John MacFarlane and Recai Oktas
% January 8, 2008

# NAME

html2markdown - converts HTML to markdown-formatted text

# SYNOPSIS

html2markdown [*pandoc-options*] [\-- *special-options*] [*input-file* or
*URL*]

# DESCRIPTION

`html2markdown` converts *input-file* or *URL* (or text
from STDIN) from HTML to markdown-formatted plain text. 
If a URL is specified, `html2markdown` uses an available program
(e.g. wget, w3m, lynx or curl) to fetch its contents.  Output is sent
to STDOUT unless an output file is specified using the `-o`
option.

`html2markdown` uses the character encoding specified in the
"Content-type" meta tag.  If this is not present, or if input comes
from STDIN, UTF-8 is assumed.  A character encoding may be specified
explicitly using the `-e` special option.
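
The fallback rule above can be sketched in a few lines of shell. This is only an illustration, not the script's actual code: `detect_charset` is a hypothetical helper, and the `sed` pattern assumes the classic `charset=NAME` form inside the "Content-type" meta tag.

```shell
# Hypothetical helper: print the charset named in an HTML file's
# Content-type meta tag, falling back to UTF-8.
detect_charset() {
    # No file argument means input comes from STDIN: assume UTF-8.
    if [ $# -eq 0 ]; then
        echo UTF-8
        return
    fi
    # Take the first charset=NAME found in the file, if any.
    found=$(sed -n 's/.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt]=\([A-Za-z0-9._:-]\{1,\}\).*/\1/p' "$1" | head -n 1)
    echo "${found:-UTF-8}"
}
```

The result could then be handed to `iconv`, for example `iconv -f "$(detect_charset page.html)" -t UTF-8 page.html`, where `page.html` is a placeholder file name.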

# OPTIONS

`html2markdown` is a wrapper for `pandoc`, so all of
`pandoc`'s options may be used.  See `pandoc`(1) for
a complete list.  The following options are most relevant:

-s, \--standalone
:   Include title, author, and date information (if present) at the
    top of markdown output.

-o *FILE*, \--output=*FILE*
:   Write output to *FILE* instead of STDOUT. 

\--strict
:   Use strict markdown syntax, with no extensions or variants.

\--reference-links
:   Use reference-style links, rather than inline links, in the markdown
    output.

-R, \--parse-raw
:   Parse untranslatable HTML codes as raw HTML.

\--no-wrap
:   Disable text wrapping in output.  (Default is to wrap text.)

\--sanitize-html
:   Sanitize HTML using a whitelist.  Unsafe tags are replaced by HTML
    comments; unsafe attributes are omitted.

-H *FILE*, \--include-in-header=*FILE*
:   Include contents of *FILE* at the end of the header.  Implies
    `-s`.

-B *FILE*, \--include-before-body=*FILE*
:   Include contents of *FILE* at the beginning of the document body.

-A *FILE*, \--include-after-body=*FILE*
:   Include contents of *FILE* at the end of the document body.

-C *FILE*, \--custom-header=*FILE*
:   Use contents of *FILE* as the document header (overriding the
    default header, which can be printed using `pandoc -D markdown`).
    Implies `-s`.

# SPECIAL OPTIONS

In addition, the following special options may be used.  The special
options must be separated from the `html2markdown` command and any
regular `pandoc` options by the delimiter `--`, as in

    html2markdown -o foo.txt -- -g 'curl -u bar:baz' -e latin1  \
    www.foo.com
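
To illustrate how a wrapper might separate the two groups of arguments at the `--` delimiter, here is a rough sketch.  `split_args` is a hypothetical helper, not part of `html2markdown`, and it joins arguments with spaces, so quoting of arguments that contain spaces is lost.

```shell
# Hypothetical helper: print the arguments before the first "--" on one
# line and the arguments after it on a second line.
split_args() {
    before=""
    after=""
    seen=no
    for arg in "$@"; do
        if [ "$seen" = no ] && [ "$arg" = "--" ]; then
            seen=yes                               # delimiter found, switch groups
        elif [ "$seen" = no ]; then
            before="$before${before:+ }$arg"       # regular pandoc options
        else
            after="$after${after:+ }$arg"          # special options and input
        fi
    done
    printf '%s\n%s\n' "$before" "$after"
}
```

Called as `split_args -o foo.txt -- -e latin1 www.foo.com`, the first line holds the `pandoc` options and the second holds the special options together with the input argument.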

-e *encoding*, \--encoding=*encoding* 
:   Assume the character encoding *encoding* in reading HTML.
    (Note: *encoding* will be passed to `iconv`; a list of
    available encodings may be obtained using `iconv -l`.)
    If this option is not specified and input is not from
    STDIN, `html2markdown` will try to extract the character encoding
    from the "Content-type" meta tag.  If no character encoding is
    specified in this way, or if input is from STDIN, UTF-8 will be
    assumed.

-g *command*, \--grabber=*command*
:   Use *command* to fetch the contents of a URL.  (By default,
    `html2markdown` searches for an available program or text-based
    browser to fetch the contents of a URL.)
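
The default search for a fetching program could look roughly like the sketch below.  It is an assumption-laden illustration rather than the script's real logic: `find_grabber` is a hypothetical name, the candidate order simply follows the programs named in the DESCRIPTION, and the flags are ones that make each program write the raw page source to STDOUT.

```shell
# Hypothetical helper: print a command line for the first fetcher found
# on PATH, or return a non-zero status if none is installed.
find_grabber() {
    for candidate in "wget -q -O -" "w3m -dump_source" "lynx -source" "curl -s -L"; do
        program=${candidate%% *}                   # first word is the program name
        if command -v "$program" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

A caller might then run `grabber=$(find_grabber) && $grabber "$url"`, with `$url` standing for the address to fetch.  Passing `-g` bypasses any such search.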

# SEE ALSO

`pandoc`(1), `iconv`(1)