Semantic Tagging in Web Objects (was: Comments on HTML+ Request For Comments)

Steve Waterbury (waterbug@epims1.gsfc.nasa.gov)
Tue, 1 Feb 1994 06:02:04 +0500


Jonathan Abbey wrote:

> Definite agreement on the semantic markings.. this is one of the single
> most important things that we should be attending to now.. devising ways
> to support things like the Interpedia project from within the WWW framework.
>
> I would actually hope to see a richer set of semantic tags..
> DOCUMENT_TYPE is essential (and it's good to see it here), but I tend
> to think that KEYWORDS is inadequate. What about some kind of
> hierarchical categorization coding, like dewey decimal or library of
> congress numbers?

I don't think it's a good idea to burden HTML+ with semantic tagging.
Categorizations and their cousins, attributes, generally need separate
support. I have been in this discussion before (on the other side, in
fact!) and, IMHO, orthogonality is needed here.

I think it would be cleaner and more flexible, and would preserve the
focus of HTML+, to do semantic tagging with SGML tags from outside the
HTML+ tag set. These semantic tags would be invisible to an HTML+
browser, but would be known to a set of specialized indexing engines,
SGML editor/parsers, knowbots, etc., whose purpose in life would be to
record:

1. the locations {URLs/URNs} of all "objects" that contain certain tags
2. the tag contents for those tags in those objects

and to maintain indexes of them on semantic data servers specialized
to the various semantic domains (granularity TBD).
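
As a rough sketch of what such an indexing engine might record, assuming
Python's standard html.parser as a stand-in for a real SGML parser: the
HTML_TAGS set, the partno and ohms tags, and the example.org URL are all
hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: a tiny stand-in for the HTML+ tag set. Any tag outside it is
# treated as a semantic tag that a browser would simply ignore.
HTML_TAGS = {"html", "head", "title", "body", "h1", "p", "a"}


class SemanticTagCollector(HTMLParser):
    """Collects (tag, contents) pairs for tags outside HTML_TAGS."""

    def __init__(self):
        super().__init__()
        self.found = []     # list of (tag, contents) pairs
        self._open = None   # semantic tag currently being read, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag not in HTML_TAGS and self._open is None:
            self._open = tag
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.found.append((tag, "".join(self._text).strip()))
            self._open = None


def index_object(index, url, markup):
    """Record (1) the object's location and (2) its tag contents."""
    collector = SemanticTagCollector()
    collector.feed(markup)
    collector.close()
    for tag, contents in collector.found:
        index[tag].append((url, contents))


# Hypothetical object carrying two semantic tags alongside ordinary markup.
index = defaultdict(list)
index_object(
    index,
    "http://example.org/parts/1.html",
    "<h1>Resistor</h1><partno>RN55</partno><ohms>4700</ohms>",
)
```

The index maps each semantic tag name to the (location, contents) pairs
found for it, which is the minimum a semantic data server would need to
answer queries by tag.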

This would enable direct querying to find the set of objects on the
net with a specified tag and with the contents of that tag containing
a certain string or a value within a certain range of values, etc.
Of course the objects retrieved can be arbitrary: documents, binaries,
images, product catalogs or specific "data sheets", organizational
directories, technical standards, specifications, whatever. The
specialized semantic data servers would be the grandchildren of whois++,
X.500, WAIS, and SQL servers.
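
The two kinds of query mentioned above (contents containing a string,
contents within a range of values) might look roughly like this. It is
only a sketch: the INDEX layout, the ohms and maker tags, and the
example.org URLs are assumptions made for illustration, not an existing
server interface.

```python
# Hypothetical index: tag name -> list of (url, contents) pairs.
INDEX = {
    "ohms": [
        ("http://example.org/parts/1.html", "4700"),
        ("http://example.org/parts/2.html", "100"),
        ("http://example.org/parts/3.html", "unknown"),
    ],
    "maker": [
        ("http://example.org/parts/1.html", "Acme Resistor Co."),
    ],
}


def find_containing(index, tag, text):
    """URLs of objects whose `tag` contents contain `text` (case-insensitive)."""
    needle = text.lower()
    return [url for url, contents in index.get(tag, [])
            if needle in contents.lower()]


def find_in_range(index, tag, low, high):
    """URLs of objects whose `tag` contents parse as a number in [low, high]."""
    hits = []
    for url, contents in index.get(tag, []):
        try:
            value = float(contents)
        except ValueError:
            continue  # non-numeric contents cannot match a range query
        if low <= value <= high:
            hits.append(url)
    return hits
```

For example, find_in_range(INDEX, "ohms", 1000, 10000) selects only the
first object, and the entry whose contents are "unknown" is skipped
rather than treated as an error.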

The BIG project, and there is a lot of work being done on this as we
speak, is to achieve consensus on technically sound information models
for the various semantic domains. Of course, the mapping of the
information models of various sorts into DTDs is non-trivial, but I
believe it is technically feasible, and I would rate it much easier
than the original modeling task itself.

As for categorizations, I think it is a fallacy to believe that
universal consensus on them is either possible or necessary.
Categorization schemes become important only for access to otherwise
uncharacterized objects, but are not nearly as important when direct
query access to the objects' attributes is available.

Categories will always be with us, but to be created properly, they
must derive from a consensual set of attributes (semantic tags) --
i.e., they should sit on top of the information models, and will
probably come in several different flavors for each semantic domain.
Even within a domain, different groups like to slice things a little
differently ("around here we call that an _extra_ large!!")
... but that's no problem -- as long as the basic information models
and their attribute sets are agreed to, the categorization-du-jour
can be selected by the end-user.
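
To illustrate categories sitting on top of an agreed attribute set, here
is a minimal sketch in which two groups slice the same attribute
differently and the end-user picks the scheme. The length_mm attribute,
the cut points, the scheme names, and the URLs are all invented for the
example.

```python
from bisect import bisect_right

# Agreed attribute set (hypothetical): every object carries "length_mm".
OBJECTS = {
    "http://example.org/parts/1.html": {"length_mm": 40},
    "http://example.org/parts/2.html": {"length_mm": 90},
    "http://example.org/parts/3.html": {"length_mm": 150},
}

# Each scheme is (ascending cut points, labels), with one more label
# than cut points. The groups differ only in where they draw the lines.
SCHEMES = {
    "group_a": ([50, 100], ["small", "large", "extra large"]),
    "group_b": ([80], ["standard", "extra large"]),
}


def categorize(attributes, scheme_name):
    """Derive a category label from the shared attribute under one scheme."""
    cuts, labels = SCHEMES[scheme_name]
    return labels[bisect_right(cuts, attributes["length_mm"])]
```

The same 90 mm object comes out as "large" under group_a and as
"extra large" under group_b, yet both groups can still query it by the
underlying attribute value.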

Steve Waterbury

=====================================================================
Stephen C. Waterbury            Phone: 301-286-7557
NASA Parts Project Office       FAX:   301-286-1695
Code 310.A                      email: waterbug@epims1.gsfc.nasa.gov
NASA/GSFC                       "Sometimes you're the windshield;
Greenbelt, MD 20771              sometimes you're the bug."
=====================================================================