Announcing fwais.pl v0.9: an HTTP to WAIS CGI script

Jonny Goldman (jonny@Synopsys.COM)
Wed, 27 Apr 94 09:55:10 PDT


I'd like to formally announce "[nph-]fwais.pl, an HTTP to WAIS CGI script.

This is a perl script that uses the functionality of the "waisq" unix shell
level interface to perform WAIS style queries using an HTTP/HTML+ (forms)
compatible browser (like Mosaic 2.0).

The gateway allows the searching of multiple sources in a single query,
simple management of the selectable sources, and the ability to search
new WAIS sources found as a result of searching a WAIS source server. It
supports multiple-type results and attempts to highlight (bold) matched
keywords in TEXT documents.

This is my first general release of the code. It works for me, but it
might not work for you. Please send me any fixes you might come up with.

Documentation is in the code. 8^)

- Jonny G

# This is a shell archive. Remove anything before this line, then
# unpack it by saving it in a file and typing "sh file". (Files
# unpacked will be owned by you and have default permissions.)
#
# This archive contains:
# fwais.pl

echo x - fwais.pl
cat > "fwais.pl" << '//E*O*F fwais.pl//'
#!/usr/local/bin/perl
#
# [nph-]fwais.pl -- WAIS search/retrieval interface
#
# Copyright Jonny Goldman April 25 1994
#
# Copyright Notice and Limitation of Liability:
#
# As with most freely distributed code on the net, this is provided
# as-is, with no warranty or support. This code may be modified and
# freely distributed provided this notice is retained in-tact on all
# copies or derived works. Commercial use of this code is freely
# granted! All provisions of the GNU copyleft apply, but if there are
# any conflicts between the above notice and the GNU agreement, the
# above notice prevails.
#
# Above all, the "<author...>" tag must not be removed from the HTML
# documents created by this script. If you add something to this,
# add your name to the tag, but DO NOT DELETE MY NAME.
#
# Under no conditions shall I be held liable for damages due to use of
# this code.
#
# loosely based on wais.pl from Tony Sanders <sanders@bsdi.com>
#
# Documentation: (yes, that's right, self documenting code!)
#
# This script is designed to work with the NCSA CGI (Common Gateway Interface)
# It uses a GET method form, so all the input to this script is
# in the environment variable "QUERY_STRING"
#
# If you name this script as nph-<whatever>, it will return
# more appropriate error codes.
#
# This script has three modes of operation:

# 1. No input (QUERY_STRING emtpy, but really no Sources specified)
# with no input, the script builds a dialog form with all sources in
# the $waisd directory displayed. The user then selects source and
# sets the keywords, and go.
#
# 2. keyword search (keywords=...)
# with a list of sources and keywords, the script performs a search and
# returns a list of results. Multitype results are supported. HREFs
# are built with this script and sufficient state to retrieve the
# document.
#
# A special case of this is when the source has an @ in it. This is
# interpreted as a database spec, and a temporary source is created from
# it. This is how you can search a WSRC that you find from searching
# some other source (like the directory-of-servers).
#
# 3. retrieval (keywords_used=...)
# with a list of keywords_used, the script retrieves a document.
# Depending on the $type, it either builds an HTML result (for TEXT and
# WSRC) or it returns the content-type associated with the $type, and
# the raw bytes.
#
# If you must test this, take a look at the conversation it has with the HTTP server
# and set the QUERY_STRING environment variable accordingly.
#
# For the future:
#
# Although I believe this does just about everything you would ever want to do,
# in a WAIS search, there are still a few things left for the adventursome
# perl hackers to add:
#
# 1. Relevance Feedback:
# This one's the hardest of the bunch - how would you express a
# relevant document using an HTML form?
#
# Volunteers?
#
# - Jonny G

$fwais_version = "0.9, Jonny Goldman Sun Apr 25 1994";

# common configuration variables

# this is where you put waisq
$waisq = "/usr/local/bin/waisq";

# this is the WAIS source directory
$waisd = "/usr/local/wais/wais-sources/";

# this fully qualifies a host name from a WAIS source file (if needed)
$defaultdomain = "synopsys.com";

# these variables are supplied by the http server
# and are used to construct the form
$servername = $ENV{'SERVER_NAME'};
$portname = $ENV{'SERVER_PORT'};
$scriptname = $ENV{'SCRIPT_NAME'};

# when run from a shell
if ($scriptname eq "") {
$scriptname = $0;
}

# according to the CGI, if the script begins with "nph-"
# then it must generate its own HTTP headers
if ($scriptname =~ /nph-/) {
$nph = 1;
} else {
$nph = 0;
}

# this controls whether the sources are initially selected
# if the search is performed. This value is set to 1 if a Source
# is provided in the environment
$sources = 0;

# these are the names for the temporary files
$tname = "fwais.$$";
$tmpname = "/tmp/$tname.tmp";
$stmpname = "$tname.src";
$outfile = "/tmp/$tname.out";
$errfile = "/tmp/$tname.err";

# these associative arrays are used during retrieval to map WAIS types
# to MIME types. Add pairs as you see fit.
# if a WAIS type cannot be found, the retrieval engine returns
# Content-type: application/binary and the raw bytes

%typemap = ("TEXT", "text/html", # TEXT is really a special case
"PS", "application/postscript",
"HTML", "text/html",
"GIF", "image/gif",
"JPEG", "image/jpeg",
"MPEG", "image/mpeg");

%iconmap = ("TEXT", "/icons/text.xbm",
"PS", "/icons/text.xbm",
"HTML", "/icons/text.xbm",
"WSRC", "/icons/index.xbm",
"GIF", "/icons/image.xbm",
"JPEG", "/icons/image.xbm",
"MPEG", "/icons/movie.xbm");

# get the various parameters from the environment

@query_strings = split("&", $ENV{"QUERY_STRING"});

foreach $q (@query_strings) {
($attr, $val) = split("=", $q);
$query{$attr} = $query{$attr}." ".$val;
}

if ($query{"Source"} ne "") {
$sources = 1;
@srcs = split(' ',$query{"Source"});
} else {
opendir(WAISD, "$waisd");
while($ent = readdir(WAISD)) {
if ($ent =~ s/\.src$//) {
push(@srcs, $ent);
}
}
}

if ($query{"Numres"}) {
$numres = $query{"Numres"};
} else {
$numres = 40;
}

@srcs = sort(@srcs);

$code_src = '';

if ($#srcs == 0 && ($srcs[0] =~ /(.*)@(.*)%3A(.*)/)) {
local($db, $host, $port) = ($1,$2,$3);
$code_src = $srcs[0];
$code_src =~ s/%3A/:/g;
$code_src =~ s/%2F/\//g;
$db =~ s/%2F/\//g;
&write_src("/tmp/$tname.src", $db, $host, $port);
$srcs[0] = "$tname";
}

if ($query{"keywords_used"} ne "") {
$keys = $query{"keywords_used"};
$keys =~ s/%22/"/g; # "
$keys =~ s/%27/"/g; # "
$keys =~ s/%3F/?/g;
$keys =~ s/%26/&/g;
$keys =~ s/%3D/=/g;
$keys =~ s/\+/%%%%/g;
$keys =~ s/^\s//;

@keywords_used = split('%%%%', $keys);

$keys =~ s/%%%%/ /g;

# fix phrases

@kwds = @keywords_used;
@kwds2 = ();

for ($i = 0; $i <= $#kwds; $i++) {
$w = $kwds[$i];
if ($w =~ /^"/) { #"
for($j = $i+1, $w2 = $kwds[$j];
($j <= $#kwds);
$j++, $w2 = $kwds[$j]) {
if ($w2 =~ /"$/) { #"
$w = $w." ".$w2;
$i = $j+1;
break;
}
}
}
$w =~ s/"//g; #"
push(@kwds2, $w);
}

@keywords_used = @kwds2;

$ipname = $query{"hostname"};
$ipname =~ s/^\s//;
$tcpport = $query{"port"};
$tcpport =~ s/^\s//;
$database = $query{"database"};
$database =~ s/^\s//;
$docurl = $query{"docid"};
$docurl =~ s/^\s//;
$type = $query{"type"};
$type =~ s/^\s//;
$title = $query{"headline"};
$title =~ s/^\s//;
$title =~ s/%20/ /g;

} else {
$keys = $query{"keywords"};
$keys =~ s/\+/ /g;
$keys =~ s/%22/\\"/g; #"
$keys =~ s/%3F/?/g;
$keys =~ s/%26/&/g;
$keys =~ s/%3D/=/g;

if ($sources) {
$title = join(" ", @srcs);
} else {
$title = "Multiple Sources";
}
}

# why not?

sub min {
local($a, $b) = @_;
if ($a < $b) {
return $a;
}
else {
return $b;
}
}

# functions to deal with any's

sub anytostring {
local($any) = @_;
$res = '';
if ($any =~ /:bytes #\((.*)\)(.*)\)/ && ($string = $1)) {
@chars = split(' ', $string);
foreach $c (@chars) {
$res = $res.sprintf("%c", $c);
}
}
$res;
}

sub stringtoany {
local($str) = @_;
$len = length($str);
$res = sprintf("(:any :size %d :bytes #( ", $len);
for ($i = 0; $i < $len; $i++) {
$res = $res.sprintf("%d ", ord(substr($str,$i,1)));
}
$res = $res.") )";
$res;
}

# convert a date

sub datefrom {
local($date) = @_;
if (length($date) == 6) {
return substr($date,2,2)."/".substr($date,4,2)."/".substr($date,0,2);
} else {
return substr($date,4,2)."/".substr($date,6,2)."/".substr($date,0,4);
}
}

# write a docid given the URL encoding

sub print_docurl {
local($database, $docid) = @_;
$docid =~ s/%20/ /g;
print TMP " (:doc-id \n";
print TMP " :original-database ".&stringtoany($database)."\n";
print TMP " :original-local-id ".&stringtoany($docid)."\n";
print TMP " )\n";
}

# write the question that waisq will read.
# You need to do this to handle querying multiple sources.

sub write_q {
local($tmpname,$keys, $docurl, $type, @srcs) = @_;
open(TMP, ">$tmpname");

print TMP "(:question\n :version 2\n";
if ($keys ne "") {
print TMP " :seed-words \"$keys\"\n";
}
if ($docurl ne "") {
print TMP " :result-documents \n";
print TMP " ( (:document-id :document (:document :doc-id\n";
&print_docurl($database, $docurl);
print TMP " :number-of-bytes -1 :type \"$type\"\n";
printf TMP " :source (:source-id :filename \"$stmpname\") ) ) )\n";
} else {
print TMP " :sources (\n";
foreach $src (@srcs) {
print TMP " (:source-id\n";
print TMP " :filename \"$src.src\"\n";
print TMP " )\n";
}
}
print TMP " ) )\n";
close (TMP);
}

# write a source file

sub write_src {
local($stmpname, $database, $ipname, $tcpport) = @_;
if ($stmpname =~ /\//) {
open (SRC, ">$stmpname");
} else {
open (SRC, ">/tmp/$stmpname");
}
print SRC "(:source :version 3 \n";
print SRC " :database-name \"$database\"";
if ($ipname ne "" && $ipname ne "localhost") {
print SRC " :ip-name \"$ipname\" :tcp-port $tcpport";
}
print SRC ")\n";
close(SRC);
}

# parse a WAIS source file

%srcports = ();
%srcdbs = ();
%srcipnames = ();

sub getwaissrc {
local ($sourcename) = @_;
local ($ipname, $databasename, $tcpport);
if (!($sourcename =~ /^\//)) {
$sourcename = "$waisd/$sourcename";
}
if($srcipnames{$sourcename} eq '') {
$srcipnames{$sourcename} = "localhost";
open(SRC, "$sourcename");
while(<SRC>) {
/:ip-name\s"(.*)"/ && ($srcipnames{$sourcename} = $1);
/:database-name\s"(.*)"/ && ($srcdbs{$sourcename} = $1);
/:tcp-port\s"(.*)"/ && ($srcports{$sourcename} = $1);
/:tcp-port(.*)/ && ($srcports{$sourcename} = $1);
}
close(SRC);
}
return($srcipnames{$sourcename},
$srcdbs{$sourcename},
$srcports{$sourcename});
}

# encode a localDocID for a URL

sub ldocid2url {
local ($docid) = @_;
$docid =~ s/ /%20/g;
return $docid;
}

# encode a local DB name for a URL

sub db2url {
local ($db) = @_;
$db =~ s/ /%20/g;
$db =~ s#/#%2F#g;
return $db;
}

# turn a WAIS DocID into a URL
# (doesn't somebody have a library routine for this!!!)

sub docid2surl {
local ($ipname, $databasename, $tcpport) = &getwaissrc($sourcename);
local ($url) = &ldocid2url($docid);
local ($dbname) = &db2url($databasename);
local ($lname);
if ($ipname eq "localhost") {
$lname = '';
} else {
if (!($ipname =~ /\./)) {
$ipname = $ipname.".".$defaultdomain;
}
$lname = $ipname.":".$tcpport;
}
$sourcename =~ s/\.src$//;
local($kws) = $keys;
$kws =~ s/^\s//;
$kws =~ s/ /\+/g;
$kws =~ s/\\"/%22/g; # "
$kws =~ s/\?/%3F/g;
$kws =~ s/&/%26/g;
$kws =~ s/=/%3D/g;
local($hl) = $headline;
$hl =~ s/ /%20/g;

$res = sprintf("http://$servername:$portname/$scriptname?keywords_used=%s&hostname=%s&port=%d&database=%s&docid=%s&type=%s&headline=%s",
$kws, $ipname, $tcpport, $database, $url, $type,$hl);
return $res;
}

# dump a file to stdout. If $tr is 1, the <> are translated

sub dumpfile {
local($filename, $tr) = @_;
open(FD, "$filename");
if($tr) {
while (<FD>) {
s/>/&gt\;/g;
s/</&lt\;/g;
print $_;
}
} else {
while (<FD>) {
print $_;
}
}
close(FD);
}

# write HTTP headers

sub http_headers {
local($code, $msg) = @_;
local($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) =
localtime();

print "HTTP/1.0 $code $msg\n";
print "Date: ".$year."/".$mon."/".$mday."\n";
print "Server: ".$query{'$SERVER_SOFTWARE'}."\n";
print "CGI-Script: FWAIS/$fwais_version\n";
print "MIME-version: 1.0\n";
}

# check error output for trouble. Returns 1 if there was an error.

sub check_err {
local($res) = 0;
open(ERR, "$errfile");
while (<ERR>) {
if (/Missing DocID in request|Could not find Source/) {
s/done//g;
print STDERR "\n$scriptname: URL not found: $docurl $_\n";
if ($nph) {
&http_headers(404, "Not Found");
}
print "Content-type: text/html\n\n<HEAD><TITLE>404 Not Found</TITLE></HEAD>\n";
print "<BODY><H1>404 Not Found</H1>\n";
print "The requested URL was not found on this server.<P>\n";
print "Server: <B>$ipname</B> Port:<B> $tcpport</B><P>\n";
print "Docid:<B> $docurl</B><P>\n";
print "Error getting document:<P>\n <B>$_</B><P>\n";
print "(This is usually a bad DocID,";
print "or the server has deleted the document since you ran the search)<p>\n";
print "</BODY>\n";
$res = 1;
}
}
close(ERR);
unlink($errfile);
return $res;
}

sub printform {
local($size, $src) = @_;
print "<hr><form method=\"GET\" action=\"http://$servername:$portname$scriptname>\n";
print "A search of the \n";
if ($code_src ne '') {
print "<SELECT SIZE=2 NAME=\"Source\">\n";
print "<OPTION SELECTED> $code_src\n";
} else {
print "<SELECT MULTIPLE SIZE=$size NAME=\"Source\">\n";
foreach $src (@srcs) {
if($sources) {
print "<OPTION SELECTED> $src\n";
} else {
print "<OPTION> $src\n";
}
}
}
print "</SELECT>\n";
print " database";
print "s" if $#srcs > 0;
print "<p> with the words ";
print "<input Name=keywords value=\"$skeys\" size=40><p>";
print "Maximum Results: ";
print "<SELECT SIZE=1 NAME=\"Numres\">\n";
for $i (10,20,40,100,200) {
if ($numres == $i) {
print "<OPTION SELECTED> $i\n";
} else {
print "<OPTION> $i\n";
}
}
print "</SELECT><p>\n";
print "<PRE> <input TYPE=\"submit\" value=\"Submit\"> ";
print "<input TYPE=\"reset\" value=\"Clear Form\"></PRE><p> ";
print "For a new search enter keyword(s) and hit <b>return</b> ";
print "or press the \"Submit\" button<p>\n";
print "(meta-click or shift-click to select multiple sources)<p>\n";
print "</form>\n";
}

# process a retrieval request

sub getdoc {
&write_src($stmpname, $database, $ipname, $tcpport);
&write_q($tmpname, "", $docurl, $type, $stmpname);
if ($type eq "" || $type eq "TEXT" || $type eq "WSRC") {
$exres = system("$waisq -s /tmp/ -f $tmpname -v 1 >> $outfile 2> $errfile");
if (! &check_err) {
if ($nph) {
&http_headers(200, "OK");
}
print "Content-type: text/html\n\n";
print "<HEAD>\n<TITLE>", $title, "</TITLE>\n</HEAD>\n";
print "<BODY>\n<H1>", $title, "</H1>\n";
print "<H2>Keywords:</H2><P>\n<H3>$keys</H3>\n<hr>\n";
open (OUT, "$outfile");
if ($title =~ /^Catalog for database:/) {
local($type,$headline,$docid,@types);
$sourcename = "/tmp/$stmpname.src";
while(<OUT>) {
(/^Document \# (\d+) Type: (.*)/ && (@types=split(',', $2)));
(/^Headline: (.*)/ && ($headline = $1));
if (/^DocID: (.*)/ && ($docid = $1)) {
$date = 0;
&docdone;
}
}
} elsif ($type eq "WSRC") {
local($ff,$tcport) = (0,210);
while(<OUT>) {
if($ff) {
$desc .= $_;
/"$/ && ($ff = 1); #"
} else {
/:ip-name\s"(.*)"/ && ($ipname = $1);
/:database-name\s"(.*)"/ && ($databasename = $1);
/:tcp-port\s"(.*)"/ && ($tcpport = $1);
/:tcp-port\s(\d+)/ && ($tcpport = $1);
/:description\s"(.*)/ && ($ff = 1) && ($desc = $1); #"
}
}
$src = $databasename."@".$ipname.":".$tcport;
&printform(2, $src);
} else { # its a text file.
print "<PRE>";
while(<OUT>) {
s/</&lt;/g; s/>/&gt;/g;
foreach $w (@keywords_used) {
if ($w ne "" && $w ne "?") {
local($dlms) = '\s!-/:-@[-`'; # ]
s#([$dlms])($w)([$dlms])#$1<b>$2</b>$3#ig;
s#(^$w)([$dlms])#<b>$1</b>$2#ig;
s#([$dlms])($w$)#$1<b>$2</b>#ig;
}
}
print $_;
}
print "</PRE>\n";
}
close(OUT);
print "</BODY>\n";
}
print "<HR><B>FWAIS: HTTP to WAIS gateway by";
print "<address>Jonny Goldman &lt;jonny@synopsys.com&gt;</address></B><p>\n";
}
elsif ($typemap{$type} ne "") {
# this is where we can do more with content-type!
$exres = system("($waisq -s /tmp/ -f $tmpname -v 1 >> $outfile) 2> $errfile");
if (! &check_err) {
if ($nph) {
&http_headers(200, "OK");
}
print "Content-type: $typemap{$type}\n";
print "Content-length: ". (-s $outfile)."\n\n";
&dumpfile($outfile, 0);
}
}
else {
# this is where we can do more with content-type!
$exres = system("($waisq -s /tmp/ -f $tmpname -v 1 >> $outfile) 2> $errfile");
if (! &check_err) {
if ($nph) {
&http_headers(200, "OK");
}
print "Content-type: application/binary\n";
print "Content-length: ". (-s $outfile)."\n\n";
&dumpfile($outfile, 0);
}
}
unlink("/tmp/$stmpname");
unlink ("$outfile");
unlink ("$tmpname");
}

# process the result docid once you've got it

sub docdone {
local($href);
if ($docid =~ /^\(:any/) {
$docid = &anytostring($docid);
}
# create the URL from the docid
$_ = $docid;
/(\d+)\s(\d+)\s(.*)/ && ($fname = $3);
$_ = $fname;
s#\\"#"#;
if ($headline =~ /<TITLE>/) {
$headline =~ s/<TITLE>//;
$headline =~ s/<\/TITLE>//;
}
$headline =~ s#\\"#"#g;
print "\n<UL><LI><b>$headline</b> ";
if ($date) {
print "Date: ".&datefrom($date)." ";
}
print "Score: $score, Lines: $lines, Bytes: $bytes\n" if $score;
print "</UL>\n<DL><DT><DD>";
foreach $type (@types) {
$href = &docid2surl;
if ($iconmap{$type} ne "") {
$image = $iconmap{$type};
} else {
$image = "/icons/binary.xbm";
}
print "\n <IMG SRC=\"".$image."\" alt=\"\"><A HREF=\"".$href."\">$type</A>";
}
print "\n</DL>";
$score = $headline = $lines = $bytes = $type = $date = '';
@types = ();
}

# Process a WAIS search

sub do_wais {
local($size) = &min($#srcs+1, 5);
local($skeys) = $keys;
$skeys =~ s/\\"/'/g; #'"
$skeys =~ s/%27/'/g; #'
$skeys =~ s/^\s//;
if ($nph) {
&http_headers(200, "OK");
}
if ($code_src ne '') {
$title = $code_src;
$sourcename = "/tmp/$tname.src";
}

print "Content-type: text/html\n\n";
print "<HEAD>\n<TITLE>Search of ", $title, "</TITLE>\n</HEAD>\n";
print "<BODY>\n<H1>Search of ", $title, "</H1>\n";
print "<i>Note: This service can only be used from a forms-";
print "capable browser.</i><p>\n";

&printform($size, @srcs);

if($sources) { # Sources were set in the environment
print "<HR><DL>\n";
&write_q($tmpname,$keys,"","",@srcs); # do the search
if ($code_src) {
open (WAISQ, "-|") || system ("cat $tmpname | $waisq -m $numres -c /tmp -g 2>&1");
} else {
open (WAISQ, "-|") || system ("cat $tmpname | $waisq -m $numres -c $waisd -g 2>&1");
}
local($score, $headline, $lines, $bytes, $type, $date);
unlink ("$outfile");
while (<WAISQ>) {
if (/Found (\d+) items/ && ($hits = $1)) {
if ($hits == 0) {
print "Nothing found.\n";
} else {
print "Found <b>$hits</b> result";
print "s" if $hits > 1;
}
print ".<p>\n";
}
/:original-local-id "(.*)"/ && ($docid = $1);
/:original-local-id (\(:any.*\))/ && ($docid = &anytostring($1));
/:original-database "(.*)"/ && ($database = $1);
/:original-database (\(:any.*\))/ && ($database = &anytostring($1));
/:score\s+(\d+)/ && ($score = $1);
/:number-of-lines\s+(\d+)/ && ($lines = $1);
/:number-of-bytes\s+(\d+)/ && ($bytes = $1);
/:type "(.*)"/ && (push(@types,$1));
/:headline "(.*)"/ && ($headline = $1); # XXX
if ($code_src eq '') {
/:filename "(.*)"/ && ($sourcename = $1);
}
/:date "(\d+)"/ && ($date = $1, &docdone);
}
close(WAISQ);

unlink($tmpname);

print "</DL>\n";

}

print "<HR><B>FWAIS: HTTP to WAIS gateway by";
print "<address>Jonny Goldman &lt;jonny@synopsys.com&gt;</address></B><p>\n";
print "</BODY>\n";
}

# here's where we actually choose what to do.

if ($#keywords_used > 0 || $keywords_used[0] ne "") {
# if keywords_used is not empty, then retrieve
&getdoc;
} else { # otherwise search
&do_wais;
}

# just in case...

unlink("/tmp/$tname.src");

exit 0;
//E*O*F fwais.pl//

exit 0