author | Andrew Shadura <andrewsh@debian.org> | 2017-08-30 15:33:58 +0300 |
---|---|---|
committer | Andrew Shadura <andrewsh@debian.org> | 2017-08-30 15:33:58 +0300 |
commit | ec06787d2bdda608b8b24ca8b29d9edbea81ac6a (patch) | |
tree | b535394603d7bf1b87e2c2dc83883ae61c449e91 /README.md |
Import original source of HTML-Restrict 2.2.4
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 349 |
1 files changed, 349 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..0796222
--- /dev/null
+++ b/README.md
@@ -0,0 +1,349 @@
+# NAME
+
+HTML::Restrict - Strip unwanted HTML tags and attributes
+
+# VERSION
+
+version 2.2.4
+
+# SYNOPSIS
+
+    use HTML::Restrict;
+
+    my $hr = HTML::Restrict->new();
+
+    # use default rules to start with (strip away all HTML)
+    my $processed = $hr->process(' <b>i am bold</b> ');
+
+    # $processed now equals: 'i am bold'
+
+    # Now, a less restrictive example:
+    use HTML::Restrict;
+
+    my $hr = HTML::Restrict->new(
+        rules => {
+            b   => [],
+            img => [qw( src alt / )]
+        }
+    );
+
+    my $html = q[<body><b>hello</b> <img src="pic.jpg" alt="me" id="test" /></body>];
+    my $processed = $hr->process( $html );
+
+    # $processed now equals: <b>hello</b> <img src="pic.jpg" alt="me" />
+
+# DESCRIPTION
+
+This module uses [HTML::Parser](https://metacpan.org/pod/HTML::Parser) to strip HTML from text in a restrictive
+manner. By default all HTML is restricted. You may alter the default
+behaviour by supplying your own tag rules.
+
+# CONSTRUCTOR AND STARTUP
+
+## new()
+
+Creates and returns a new HTML::Restrict object.
+
+    my $hr = HTML::Restrict->new();
+
+HTML::Restrict doesn't require any params to be passed to new. If your goal is
+to remove all HTML from text, then no further setup is required. Just pass
+your text to the process() method and you're done:
+
+    my $plain_text = $hr->process( $html );
+
+If you need to set up specific rules, have a look at the params which
+HTML::Restrict recognizes:
+
+- `rules => \%rules`
+
+    Sets the rules which will be used to process your data. By default all HTML
+    tags are off limits. Use this argument to define the HTML elements and
+    corresponding attributes you'd like to use. Essentially, consider the default
+    behaviour to be:
+
+        rules => {}
+
+    Rules should be passed as a HASHREF of allowed tags. Each hash value should
+    represent the allowed attributes for the listed tag. For example, if you want
+    to allow a fair amount of HTML, you can try something like this:
+
+        my %rules = (
+            a       => [qw( href target )],
+            b       => [],
+            caption => [],
+            center  => [],
+            em      => [],
+            i       => [],
+            img     => [qw( alt border height width src style )],
+            li      => [],
+            ol      => [],
+            p       => [qw(style)],
+            span    => [qw(style)],
+            strong  => [],
+            sub     => [],
+            sup     => [],
+            table   => [qw( style border cellspacing cellpadding align )],
+            tbody   => [],
+            td      => [],
+            tr      => [],
+            u       => [],
+            ul      => [],
+        );
+
+        my $hr = HTML::Restrict->new( rules => \%rules );
+
+    Or, to allow only bolded text:
+
+        my $hr = HTML::Restrict->new( rules => { b => [] } );
+
+    Allow bolded text, images and some (but not all) image attributes:
+
+        my %rules = (
+            b   => [ ],
+            img => [qw( src alt width height border / )]
+        );
+        my $hr = HTML::Restrict->new( rules => \%rules );
+
+    Since [HTML::Parser](https://metacpan.org/pod/HTML::Parser) treats a closing slash as an attribute, you'll need to
+    add "/" to your list of allowed attributes if you'd like your tags to retain
+    closing slashes. For example:
+
+        my $hr = HTML::Restrict->new( rules => { hr => [] } );
+        $hr->process( "<hr />"); # returns: <hr>
+
+        my $hr = HTML::Restrict->new( rules => { hr => [qw( / )] } );
+        $hr->process( "<hr />"); # returns: <hr />
+
+    HTML::Restrict strips away any tags and attributes which are not explicitly
+    allowed. It also rebuilds your explicitly allowed tags and places their
+    attributes in the order in which they appear in your rules.
+
+    So, if you define the following rules:
+
+        my %rules = (
+            ...
+            img => [qw( src alt title width height id / )]
+            ...
+        );
+
+    then your image tags will all be built like this:
+
+        <img src=".." alt="..." title="..." width="..." height="..." id=".." />
+
+    This gives you greater consistency in your tag layout. If you don't care about
+    attribute order you don't need to pay any attention to this, but you should be
+    aware that your elements are being reconstructed rather than just stripped
+    down.
+
+    As of 2.1.0, you can also specify a regex to be tested against the attribute
+    value. This feature should be considered experimental for the time being:
+
+        my $hr = HTML::Restrict->new(
+            rules => {
+                iframe => [
+                    qw( width height allowfullscreen ),
+                    {   src         => qr{^http://www\.youtube\.com},
+                        frameborder => qr{^(0|1)$},
+                    }
+                ],
+                img => [ qw( alt ), { src => qr{^/my/images/} }, ],
+            },
+        );
+
+        my $html = '<img src="http://www.example.com/image.jpg" alt="Alt Text">';
+        my $processed = $hr->process( $html );
+
+        # $processed now equals: <img alt="Alt Text">
+
+- `trim => [0|1]`
+
+    By default all leading and trailing spaces will be removed when text is
+    processed. Set this value to 0 in order to disable this behaviour.
+
+- `uri_schemes => [undef, 'http', 'https', 'irc', ... ]`
+
+    As of version 1.0.3, URI scheme checking is performed on all href and src tag
+    attributes. The following schemes are allowed out of the box. No action is
+    required on your part:
+
+        [ undef, 'http', 'https' ]
+
+    (undef represents relative URIs). These restrictions have been put in place to
+    prevent XSS in the form of:
+
+        <a href="javascript:alert(document.cookie)">click for cookie!</a>
+
+    See [URI](https://metacpan.org/pod/URI) for more detailed info on scheme parsing. If, for example, you
+    wanted to filter out every scheme barring SSL, you would do it like this:
+
+        uri_schemes => ['https']
+
+    This feature is new in 1.0.3. Prior to this, there was no scheme checking
+    at all. Moving forward, you'll need to whitelist explicitly all URI schemes
+    which are not supported by default. This is in keeping with the whitelisting
+    behaviour of this module and is also the safest possible approach. Keep in
+    mind that changes to uri\_schemes are not additive, so you'll need to include
+    the defaults in any changes you make, should you wish to keep them:
+
+        # defaults + irc + mailto
+        uri_schemes => [ undef, 'http', 'https', 'irc', 'mailto' ]
+
+- allow\_declaration => \[0|1\]
+
+    Set this value to true if you'd like to allow/preserve DOCTYPE declarations in
+    your content. Useful when cleaning up your own static files or templates. This
+    feature is off by default.
+
+        my $html = q[<!doctype html><body>foo</body>];
+
+        my $hr = HTML::Restrict->new( allow_declaration => 1 );
+        $html = $hr->process( $html );
+        # $html is now: "<!doctype html>foo"
+
+- allow\_comments => \[0|1\]
+
+    Set this value to true if you'd like to allow/preserve HTML comments in your
+    content. Useful when cleaning up your own static files or templates. This
+    feature is off by default.
+
+        my $html = q[<body><!-- comments! -->foo</body>];
+
+        my $hr = HTML::Restrict->new( allow_comments => 1 );
+        $html = $hr->process( $html );
+        # $html is now: "<!-- comments! -->foo"
+
+- replace\_img => \[0|1|CodeRef\]
+
+    Set the value to true if you'd like to have img tags replaced with
+    `[IMAGE: ...]` containing the alt attribute text. If you set it to a
+    code reference, you can provide your own replacement (which may
+    even contain HTML).
+
+        sub replacer {
+            my ($tagname, $attr, $text) = @_; # from HTML::Parser
+            return qq{<a href="$attr->{src}">IMAGE: $attr->{alt}</a>};
+        }
+
+        my $hr = HTML::Restrict->new( replace_img => \&replacer );
+
+    This attribute will only take effect if the img tag is not included
+    in the allowed HTML.
+
+- strip\_enclosed\_content => \[ 'script', 'style', ... \]
+
+    The default behaviour up to 1.0.4 was to preserve the content between script
+    and style tags, even when the tags themselves were being deleted. So, you'd be
+    left with a bunch of JavaScript or CSS, just with the enclosing tags missing.
+    This is almost never what you want, so starting at 1.0.5 the default is to
+    remove any script or style info which is enclosed in these tags, unless they
+    have specifically been whitelisted in the rules. This is a sane default
+    when cleaning up content submitted via a web form. However, if you're using
+    HTML::Restrict to purge your own HTML you can be more restrictive.
+
+        # strip the head section, in addition to JS and CSS
+        my $html = '<html><head>...</head><body>...<script>JS here</script>foo';
+
+        my $hr = HTML::Restrict->new(
+            strip_enclosed_content => [ 'script', 'style', 'head' ]
+        );
+
+        $html = $hr->process( $html );
+        # $html is now '<html><body>...foo';
+
+    The caveat here is that HTML::Restrict will not try to fix broken HTML. In the
+    above example, if you have any opening script, style or head tags which don't
+    also include matching closing tags, all following content will be stripped
+    away, regardless of any parent tags.
+
+    Keep in mind that changes to strip\_enclosed\_content are not additive, so if you
+    are adding additional tags you'll need to include the entire list of tags whose
+    enclosed content you'd like to remove. This feature strips script and style
+    tags by default.
+
+# SUBROUTINES/METHODS
+
+## process( $html )
+
+This is the method which does the real work. It parses your data, removes any
+tags and attributes which are not specifically allowed and returns the
+resulting text. Requires and returns a SCALAR.
+
+## get\_rules
+
+Accessor which returns a hash ref of the current rule set.
+
+## get\_uri\_schemes
+
+Accessor which returns an array ref of the current valid uri schemes.
+
+# CAVEATS
+
+Please note that all tag and attribute names passed via the rules param must be
+supplied in lower case.
+
+    # correct
+    my $hr = HTML::Restrict->new( rules => { body => ['onload'] } );
+
+    # throws a fatal error
+    my $hr = HTML::Restrict->new( rules => { Body => ['onLoad'] } );
+
+# MOTIVATION
+
+There are already several modules on the CPAN which accomplish much of the same
+thing, but after doing a lot of poking around, I was unable to find a solution
+with a simple setup which I was happy with.
+
+The most common use case might be stripping HTML from user submitted data
+completely or allowing just a few tags and attributes to be displayed. With
+the exception of URI scheme checking, this module doesn't do any validation on
+the actual content of the tags or attributes. If this is a requirement, you
+can either mess with the parser object, post-process the text yourself or have
+a look at one of the more feature-rich modules in the SEE ALSO section below.
+
+My aim here is to keep things easy and, hopefully, cover a lot of the less
+complex use cases with just a few lines of code and some brief documentation.
+The idea is to be up and running quickly.
+
+# SEE ALSO
+
+[HTML::TagFilter](https://metacpan.org/pod/HTML::TagFilter), [HTML::Defang](https://metacpan.org/pod/HTML::Defang), [MojoMojo::Declaw](https://metacpan.org/pod/MojoMojo::Declaw), [HTML::StripScripts](https://metacpan.org/pod/HTML::StripScripts),
+[HTML::Detoxifier](https://metacpan.org/pod/HTML::Detoxifier), HTML::Sanitizer, [HTML::Scrubber](https://metacpan.org/pod/HTML::Scrubber)
+
+# ACKNOWLEDGEMENTS
+
+Thanks to Raybec Communications [http://www.raybec.com](http://www.raybec.com) for funding my
+work on this module and for releasing it to the world.
+
+Thanks also to the following for patches, bug reports and assistance:
+
+Mark Jubenville (ioncache)
+
+Duncan Forsyth
+
+Rick Moore
+
+Arthur Axel 'fREW' Schmidt
+
+perlpong
+
+David Golden
+
+Graham TerMarsch
+
+Dagfinn Ilmari Mannsåker
+
+Graham Knop
+
+Carwyn Ellis
+
+# AUTHOR
+
+Olaf Alders <olaf@wundercounter.com>
+
+# COPYRIGHT AND LICENSE
+
+This software is copyright (c) 2013-2017 by Olaf Alders.
+
+This is free software; you can redistribute it and/or modify it under
+the same terms as the Perl 5 programming language system itself.
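
The README's note that `uri_schemes` is not additive can be illustrated with a short script. This is a minimal sketch, assuming HTML::Restrict is installed; the sample markup and the `irc.example.org` URL are made up for illustration, and only the constructor params and methods named in the README (`rules`, `uri_schemes`, `process`, `get_uri_schemes`) are used.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Restrict;

# Hypothetical input for this sketch: one irc link and one javascript: link.
my $html = q[<p><a href="irc://irc.example.org/chan">chat</a> ]
         . q[<a href="javascript:alert(1)">bad</a></p>];

# uri_schemes replaces the default list rather than extending it, so the
# defaults (undef for relative URIs, http, https) are restated here
# alongside the extra schemes.
my $hr = HTML::Restrict->new(
    rules       => { a => [qw( href )], p => [] },
    uri_schemes => [ undef, 'http', 'https', 'irc', 'mailto' ],
);

# Print the sanitized markup.
print $hr->process($html), "\n";

# get_uri_schemes returns an array ref; render undef as the string 'undef'
# so the list can be printed without warnings.
print join( ', ', map { defined $_ ? $_ : 'undef' } @{ $hr->get_uri_schemes } ), "\n";
```

Passing the string `'undef'` instead of the bare `undef` value would register a scheme literally named "undef" and would not cover relative URIs, which is why the list above uses the unquoted form.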