Re: HTML Strippers

John Labovitz (johnl@ora.com)
Wed, 26 Apr 1995 01:46:40 +0500


rmesa@best.com (Robert A. Mesa) said:

> Is there a utility to strip away HTML tags?

if you can't find anything else, the following
perl script (which i call 'unhtml') will work ok:

#!/usr/bin/perl

# no "$* = 1" here: [^>] already matches newlines, and $* is gone from modern perl
undef($/); # slurp mode: read the whole input in one go
$_ = <>; # read in entire file
s/<[^>]+>//g; # remove <...>'s in the entire string, even ones spanning lines
print; # print the file

this would be run like:

unhtml file.html >file.txt

it's not by any means perfect -- a '>' inside a quoted
attribute value ends the match early, so the rest of
that tag is left in the output, and nothing is done
with entities (like &amp;).
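
a rough sketch of how those two gaps might be narrowed, if you need it.
the sub name 'unhtml' and the sample string are just for illustration,
and only four entities are handled; treat it as a starting point, not
a real html parser:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): strip tags while
# allowing '>' inside quoted attribute values, then decode a few entities.
sub unhtml {
    my ($html) = @_;

    # A tag is '<', then one or more of: a double-quoted string, a
    # single-quoted string, or any character other than a quote or '>',
    # and finally the closing '>'.
    $html =~ s/<(?:"[^"]*"|'[^']*'|[^'">])+>//g;

    # Decode only the four most common entities. &amp; goes last so that
    # "&amp;lt;" ends up as the literal text "&lt;" rather than "<".
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&quot;/"/g;
    $html =~ s/&amp;/&/g;

    return $html;
}

# Sample input (made up for this example).
print unhtml(qq{<a href="a>b">Tom &amp; Jerry</a>\n});   # prints "Tom & Jerry"
```

to use it on files like the script above, replace the last line with
a slurp of the input (local $/; my $text = <>;) and print unhtml($text).
a bare '<' in running text (as in "a < b") will still confuse it.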

another option, especially if you want the html
rendered as formatted text, is to use the lynx
browser in 'dump' mode:

% lynx -dump file.html >file.txt

hope this helps.

--
John Labovitz
Technical Services Manager, Global Network Navigator <http://gnn.com/>
O'Reilly & Associates, Sebastopol, California, USA (+1 707 829 0515)