Re: Client-side highlighting; tag proposal

Phillip M. Hallam-Baker (hallam@dxal18.cern.ch)
Tue, 14 Mar 1995 12:13:29 +0500


>As a result, I am now looking at a way of specifying both the start and
>ends of highlighted region separately from the document body, e.g. using
>a single element in the document head, e.g. something like:
>
> <highlight from=3096 until=4013>

I would like to second this proposal as being much more flexible all round.
In fact I would like to suggest that we have a completely separate annotations
section, because of the need to handle multiple annotations on the same
document.

Let us consider scenarios:

1) Simple Annotation, a group of text is highlighted. Note that cut n' paste is
a special case of such annotation.

2) Group annotation, multiple users add multiple annotations to the same
document. These annotations may overlap.

3) A user is editing program code and running a compiler over it. The compiler
spits out annotations on the source. It is MUCH easier to handle such
annotations entirely separately because the compiler element that handles the
error reporting probably has access only to a token stream, not the original
text. In addition there is the intermediate edit problem, where the user carries
on editing.

4) A filter produces annotation on a document, e.g. converting text to hypertext.
It is most convenient to do this in two stages, first building the annotations,
then doing a merge.
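
Scenarios 2 and 4 can be sketched together: the annotations are built as a
separate list and merged with the text only in a second stage. Because
annotations may overlap, the merge below does not try to nest tags; it cuts the
text at every annotation boundary and reports which annotations cover each run.
This is only an illustration, not part of the proposal: the Annotation record,
its label field and the merge function are hypothetical names, and offsets are
assumed to be character offsets into plain text.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int   # character offset into the text
    length: int  # number of characters covered
    label: str   # hypothetical: who or what produced the annotation


def merge(text, annotations):
    """Split text into runs, each paired with the labels that cover it."""
    # Stage 2 of the two-stage approach: every annotation boundary,
    # clamped to the text, becomes a cut point.
    cuts = {0, len(text)}
    for a in annotations:
        cuts.add(max(0, min(len(text), a.start)))
        cuts.add(max(0, min(len(text), a.start + a.length)))
    points = sorted(cuts)

    runs = []
    for lo, hi in zip(points, points[1:]):
        # An annotation covers a run if the run lies entirely inside it.
        active = sorted(a.label for a in annotations
                        if a.start <= lo and hi <= a.start + a.length)
        runs.append((text[lo:hi], active))
    return runs


# Stage 1: two users annotate overlapping regions of the same text.
text = "the quick brown fox"
notes = [Annotation(4, 11, "alice"), Annotation(10, 9, "bob")]
for run, labels in merge(text, notes):
    print(repr(run), labels)
```

The overlapping region ("brown" here) comes out as a single run carrying both
labels, which is the case that inline markup handles badly.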

There are two distinct types of annotation:

Simple highlight
Hypertext link

It is essential that hypertext links be allowed. This seems to point to using
two tags, e.g. <ANN> and <ANNANCHOR HREF="" KEY="" START="" LENGTH="">.

On the positioning problem there are two approaches, using the parse tree and
using absolute byte offsets. I would propose we combine both. Clients should be
able to handle a byte offset from within an element. This is mainly for ease of
annotation building tools. Given a choice of where to put the complexity, it's
better to load it onto browser writers than onto tool writers. This is because
a browser is inevitably a large group effort, whereas tool building should be
feasible by `privateers'.

The normal method for specifying an annotation would be as a character offset
from the character following the close angle bracket of a tag. Note that
character does not imply byte since we have to consider UTF. The simplest
convention would be to give an offset relative to the body. This allows
annotations to be added into the head element thus allowing one pass parsers to
work:-

START=/body/345

Does someone know the HyTime mechanism for this???
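
On the point that a character offset is not a byte offset: a small sketch of
the conversion a client would need if it stores the body as encoded bytes. It
assumes UTF-8 as the encoding (the text above only says "UTF") and treats one
Unicode code point as one character; char_to_byte_offset is a hypothetical
helper name.

```python
def char_to_byte_offset(text: str, char_offset: int,
                        encoding: str = "utf-8") -> int:
    """Return the byte position that corresponds to a character offset."""
    if not 0 <= char_offset <= len(text):
        raise ValueError("character offset outside the text")
    # Encode only the prefix; its encoded length is the byte offset.
    return len(text[:char_offset].encode(encoding))


body = "naïve café menu"
# 'naïve ' is 6 characters but 7 bytes in UTF-8, so the two offsets differ.
print(char_to_byte_offset(body, 6))
```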

Support for fully implemented trees would be very useful, i.e. to offset from
the second level 2 heading within the third H1:

START=/h1.3/h2.2/23
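
A minimal sketch of how a client might split such a START value into tree
steps and a trailing character offset. The grammar is inferred from the two
examples only (name, optional ".n" occurrence count, numeric offset last);
parse_start is a hypothetical name, and treating a step with no ".n" as the
first occurrence is an assumption.

```python
import re

# One path step: an element name, optionally followed by ".n".
STEP = re.compile(r"^([A-Za-z][A-Za-z0-9]*)(?:\.(\d+))?$")


def parse_start(value):
    """Split '/h1.3/h2.2/23' into ([('h1', 3), ('h2', 2)], 23)."""
    parts = [p for p in value.split("/") if p]

    # A purely numeric last component is the character offset;
    # with no such component the offset is 0.
    offset = 0
    if parts and parts[-1].isdigit():
        offset = int(parts.pop())

    steps = []
    for part in parts:
        m = STEP.match(part)
        if m is None:
            raise ValueError(f"bad path step: {part!r}")
        # Assumption: a missing '.n' means the first such element.
        steps.append((m.group(1).lower(), int(m.group(2) or 1)))
    return steps, offset


print(parse_start("/body/345"))
print(parse_start("/h1.3/h2.2/23"))
```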

I prefer using LENGTH instead of END since it's easier to calculate and shorter.
It might be useful to allow either END or LENGTH.

If no offset is given it should default to 0. If no end point is defined (i.e.
no length or end) it should default to the close tag of the structure defined in
the start. This allows easy identification of sections.
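
These defaults can be sketched as a small function over the text content of
the element named in START. It is an illustration under assumptions:
resolve_range is a hypothetical name, END is assumed to be a character offset
within the same element, and a range running past the close tag is clipped to
it.

```python
def resolve_range(element_text, offset=0, length=None, end=None):
    """Return (start, stop) character positions within element_text."""
    if length is not None and end is not None:
        raise ValueError("give LENGTH or END, not both")
    if not 0 <= offset <= len(element_text):
        raise ValueError("offset outside the element")

    if length is not None:
        stop = offset + length
    elif end is not None:
        stop = end
    else:
        # No end point given: run to the close tag of the element.
        stop = len(element_text)

    # Never run past the close tag or before the start.
    stop = max(offset, min(stop, len(element_text)))
    return offset, stop


heading = "Introduction"
print(resolve_range(heading))              # whole element
print(resolve_range(heading, 5, length=3))
```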

The tree based annotation would be most useful in collaborative work tool
environments. I know we can't build these on HTTP/1.0 but I do not accept as an
argument that we should only think about our current needs. The IETF standard
process has a lead time of about two years. We will be needing the more
sophisticated feature set long before we will get agreement on HTTP 3.0.

I don't think the programming demands would be too onerous. Basically it's an
addition into the FSR and tag translation components of the SGML module. It's
not that hard a job to do both tree based and absolute offset based annotation.

We should also consider (yes there is more!) adding annotation TEXT into the
body of a document. This could be displayed by callouts, i.e.

<ANN START="">This is annotation text</ANN>

And why not allow annotation on other documents? In Hyper-G annotation and
documents are entirely separate. Why not have a model in which an annotated link
may be made to another document? This is a powerful feature and very easy to
implement. Essentially it means that the page one travels from can annotate the
next. The simplest use of this would be a link to an annotated copy of a
document, i.e. one clicks on the error log of a compilation and gets returned
the source code annotated with errors. There are a wide range of other uses:

1) An annotated index to a Web is created. This has its own previous/next
operations which may be very different to the previous/next operations of the
documents themselves. Consider searching for the occurrences of "frying pan" in
a large database. It turns up 60 odd references to hypertexts on the Web. It is
helpful for the index to be able to annotate the location of the search item
and also provide a previous/next facility. This cannot be stored in the
documents themselves because they have no knowledge of being part of a search
operation for frying pans.

2) Judge Lance Ito has decided to go 100% electronic. He is reviewing his
transcript of the O.J. Simpson trial which is being produced in real time. CNN
wish to provide an annotated commentary of this transcript. There are two
models, either the transcript and annotation are fed into a junction box and the
result served or the browser independently collects both the transcript and
annotation.

The second model is vastly more powerful. It allows annotations to be performed
in batch on realtime events. Consider that the annotations are issued once an
hour. A reader does not want to have the annotated feed separate from the
realtime feed. CNN do not want the hassle of providing a realtime server. They
provide only annotation so that is what they want to distribute.

In a charging model this is very important. The transcript feed might cost $10
an hour while the opinions of CNN may be worth only a few cents. Alice, who is
an O.J. Simpson trial junkie, subscribes to both the CNN and ABC annotation
feeds but does not want to pay two lots of $10 for the transcript itself. This is much
more important when one considers that Alice is also an IRC junkie and wants to
sit on an IRC/WWW transcript annotation channel in addition.
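
Going back to use 1) above, the index case can be sketched as follows: the
index, not the documents, holds the list of hits as (document, start, length)
records, and previous/next is just a walk along that list. The document names
and the find_hits and neighbours helpers are hypothetical, and the documents
are plain strings here rather than parsed hypertext.

```python
import re


def find_hits(documents, phrase):
    """Return (name, start, length) for each occurrence of phrase."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    hits = []
    for name, text in documents.items():
        for m in pattern.finditer(text):
            hits.append((name, m.start(), m.end() - m.start()))
    return hits


def neighbours(hits, i):
    """Previous and next hit for hit number i, or None at either end."""
    prev_hit = hits[i - 1] if i > 0 else None
    next_hit = hits[i + 1] if i + 1 < len(hits) else None
    return prev_hit, next_hit


documents = {
    "a.html": "Heat the frying pan.",
    "b.html": "A Frying Pan is hot.",
}
hits = find_hits(documents, "frying pan")
print(hits)
print(neighbours(hits, 0))
```

The previous/next order here belongs to the search result and crosses document
boundaries, which is why it cannot live inside either document.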

Summary :-
* Need links and annotations
* Start, end and length attributes, using tree structure of text with offsets
* Normally stored inside the Head element.
* May apply to documents referenced FROM a document.
* Should consider extremes of the model to get the right structure.
* Easy to implement.

* Someone should look at HyTime and see IF it's useful and grab the
good ideas.

Phill