Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site ucbvax.ARPA
Path: utzoo!linus!decvax!decwrl!ucbvax!fair
From: f...@ucbvax.ARPA (Erik E. Fair)
Subject: Information Overload and What We Can Do About It
Message-ID: <10381@ucbvax.ARPA>
Date: Sat, 14-Sep-85 08:58:18 EDT
Article-I.D.: ucbvax.10381
Posted: Sat Sep 14 08:58:18 1985
Date-Received: Sun, 15-Sep-85 05:16:46 EDT
Organization: University of California at Berkeley
Lines: 200
Summary: information structuring and filtering mechanisms needed

Have you ever wondered why the notesfiles people are so smug about
the superiority of their system over netnews?

Or why `rn' has been such a big hit with the USENET user community?
(of course, if you're using it, you probably know, but bear with me
for the moment anyway).

The USENET user community as a whole is suffering from information
overload; that is, there are more items coursing the paths of the
network than any single individual can read in a reasonable period
of time.

As the volume of messages in the newsgroups that I choose to read
increases, there are two steps I can take to be more efficient:

1) I can arrange to read netnews at a higher baud rate
	(instead of 1200 baud, how about 9600 or 19200?).
	This will allow me to make my article selections faster,
	and hopefully be able to handle more articles per unit time
	than I did at 1200 baud.

2) I can prioritize the list of newsgroups that I read and
	remove some newsgroups from the bottom of the list,
	until the volume is manageable again.

However, these traditional mechanisms for limiting time spent reading
netnews are no longer sufficient, because they're not specific enough.
What I need now is a set of automatic structuring and filtering
mechanisms for articles.

Remember my original questions about notesfiles & rn? The reason that
these two user interfaces are popular is that in addition to providing
the usual amenities (screen oriented interface &c), they also structure
the information presented to the user, and `rn' provides the first of
many possible filtering mechanisms for removing from view articles that
the user is not interested in.

If you were to grep for the Subject line in any high-volume newsgroup,
my observation is that you would find 80% or more of the articles are
responses, rather than original articles. To the notesfiles user, the
`base note' (the first article) and all the responses appear as one
item in the presentation menu.

It is considerably more daunting to hit `=' in rn, in a newsgroup you
haven't read in many weeks and see the list of hundreds of individual
articles that have accumulated. Fortunately, `rn' provides you with the
facility to `kill' (remove from the list of unread articles) all of the
articles with a specific subject (including the `Re:' subjects). This
brings us to:

	I N F O R M A T I O N   S T R U C T U R E

Right now (with the exception of rn & notes) netnews articles are
presented to the user in the order they arrived on the system. This is
not optimal. To create structure in the way that netnews articles are
presented, we can start (as rn does) with the Subject line, and follow
that along, presenting articles whose subjects match. This gives us the
thread of a discussion.

However, since responses can and frequently do arrive on a system out
of order, we should sort by date of submission (i.e. the contents of
the `Date:' field). This will give us the discussion in the
chronological order in which it occurred.
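As a minimal sketch of that ordering (in modern Python, purely
illustrative; nothing like this existed in the 1985 news software),
collating on the normalized subject and then the submission date
might look like:

```python
from email.utils import parsedate_to_datetime  # parses RFC-822 dates like netnews `Date:' lines

def base_subject(subject):
    # Strip any number of leading `Re:' markers so follow-ups
    # collate with the original article.
    s = subject.strip()
    while s.lower().startswith("re:"):
        s = s[3:].lstrip()
    return s

def thread_order(articles):
    # articles: list of {"Subject": ..., "Date": ...} header dicts.
    # Sort first by the normalized subject (the discussion), then by
    # date of submission, so each thread reads chronologically.
    return sorted(articles,
                  key=lambda a: (base_subject(a["Subject"]),
                                 parsedate_to_datetime(a["Date"])))
```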

There is even more information in the header that we can use to order
the articles into a discussion more accurately than with `Subject:'
and `Date:'.  I mean the `References:' line.

Presently, the only use that any of the user interfaces make of this
field is for finding the `parent' article of the current article (that
is, the article to which the current article is a response).

We can use this information for following discussions by building the
tree that discussions form:

			  a
			 /|\
			b c d
			 / \
			e   f
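To make this concrete, here is a minimal sketch (modern Python,
illustrative only) of recovering such a tree from `References:'
lines, assuming each line lists ancestor Message-IDs oldest first,
so that the last entry is the parent:

```python
def build_tree(articles):
    # articles: dict mapping Message-ID -> list of referenced
    # Message-IDs (the `References:' line, oldest first).
    # Returns ({parent_id: [child ids]}, [root ids]).
    children, roots = {}, []
    for mid, refs in articles.items():
        if refs:
            # The last reference is the immediate parent.
            children.setdefault(refs[-1], []).append(mid)
        else:
            roots.append(mid)  # an original article starts a tree
    return children, roots
```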

If this information is put into a database that is easily used by the
various user-interfaces, the following things are possible:

1) accurate ordering and presentation of the discussions that take
	place on the network

2) differentiation between the various sub-branches of the tree of
	discussion (one branch goes off discussing foo from foobar,
	the other discussing bar from foobar)

3) change of subjects to reflect actual message content to facilitate #2,
	without affecting #1 (i.e. no more `Re: foo (really bar)')

4) delay posting of responses until the user has read the entire
	tree (or at least as much of it as is online at his site).
	We have a problem with users asking a trivial question, to
	which everyone knows the answer (and everyone immediately
	responds!). If the user-interface holds the followup until the
	user has read all the articles in the tree, and asks again
	whether the submitted response is still appropriate, the
	incidence of this problem should drop significantly. This
	should also cause a drop in network traffic.

5) lessen the necessity of including the text of the article to which
	one is responding. (the `parent' command of vnews, and ^P in
	rn also provide some of this functionality).

It is this particular structure that makes the netnews data storage
structure superior to notesfiles.

However, we still have the problem of too much information to read and
understand, which leads into:

		F I L T E R I N G   M E C H A N I S M S 

As I mentioned, rn lets you remove from view articles whose subjects
you are not interested in. However, given the proclivity of users to
change the subject line for a less than titanic change of subject
(in which you probably still have no interest), rn's current mechanism
for killing discussions misses the mark. Given the database described
above, rn would never miss.

A subject, however, is not the only criterion that you might wish to
filter with. Consider the following information that might be useful
to filter by:

author		(also known as the `bozo' filter)
site		(they're all bozos on that bus)
date		(kill articles that are four days old)
time		(kill articles composed between 0000 and 0600?)
transit-time	(kill articles that took more than x days to get here)
length		(anything too small or too big)
newsgroups	(in a multiple group posting,
		  skip if `net.flame' is one of the other groups)
keywords	(suppose that postnews mungs up a set of keywords
		  from the body of the article when it was first posted...)
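A sketch of what a generalized kill-file built on these criteria
might look like (modern Python, illustrative only; the header names
are real netnews headers, but the particular patterns are made up):

```python
import re

# Each rule pairs a header name with a pattern; an article is
# suppressed if any rule matches.  These patterns are hypothetical.
KILL_RULES = [
    ("From",       re.compile(r"bozo", re.I)),      # the `bozo' filter
    ("Path",       re.compile(r"!bozovax!")),       # all bozos on that bus
    ("Newsgroups", re.compile(r"\bnet\.flame\b")),  # skip net.flame cross-posts
]

def killed(headers):
    # headers: dict of header name -> value for one article.
    return any(pat.search(headers.get(field, ""))
               for field, pat in KILL_RULES)
```

The same rule table, consulted with the sense of the test inverted,
serves for article *selection* as well as de-selection.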

Consider also that any of these criteria can be used for article
selection (i.e. to *find* articles) as well as in article de-selection.

Finally, one more mechanism: we use moderators as a filtering
mechanism, in that they select appropriate articles to broadcast to the
network.  In our electronic publishing medium, they are the editors.

With the appropriate statistical information gathered by the
user-interfaces on the system, other users on your system can act as
editors for you. Ideally, I should be able to tell the user-interface,
`show me all the articles that John Smith <jsmith> thought were
interesting'. In this way, John Smith becomes my editor. Alternately,
`show me everything that John Smith and Jane A. Nonymous did not look
at' should also be a valid filter.
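A sketch of the idea (modern Python, illustrative; it assumes a
hypothetical per-reader log, which the user-interface would have to
keep, of the article IDs each person marked interesting):

```python
def chosen_by(log, editor, articles):
    # `show me all the articles that John Smith thought were interesting'
    return [a for a in articles if a in log.get(editor, set())]

def untouched_by(log, editors, articles):
    # `show me everything that these readers did not look at'
    seen = set().union(*(log.get(e, set()) for e in editors))
    return [a for a in articles if a not in seen]
```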

		W H A T   D O   W E   D O   N O W ?

The structuring of netnews articles should be easy to implement; all of
the necessary hooks are there, we're just not using the information
contained in the header as yet. Clearly this is a database function
that should go into rnews and expire for update & maintenance, rather
than in the user-interfaces.

The more mundane filtering mechanisms that I suggested should also be
relatively easy to implement, given `rn' as a base. The `other local
users as editors' idea will take some work.

With the volume of network traffic increasing, there is no doubt in my
mind that we will face a trial by fire (site death by network byte?).
However, I think that the mechanisms I have outlined, coupled with
sensible naming of groups (and management of that namespace as a whole)
will `save' the network that we know as USENET. The key is getting this
software implemented, and distributed network wide as soon as possible,
so that the peak of the deluge of information will be that much sooner,
and that much lower, than if we do nothing.

	your comments and observations are solicited,

	Erik E. Fair	ucbvax!fair	f...@ucbarpa.BERKELEY.EDU

	S U G G E S T E D   R E A D I N G S

DRAGONMAIL: A Prototype Conversation-Based Mail System
	Douglas E. Comer, Larry L. Peterson, Purdue University
	SLC USENIX Conference Proceedings, June 1984, p. 42

The Readers Workbench -  A System for Computer Assisted Reading
	Evan L. Ivie, Brigham Young University
	SLC USENIX Conference Proceedings, June 1984, p. 270

Structuring Computer-Mediated Communication Systems
	to Avoid Information Overload

	Starr Roxanne Hiltz, Murray Turoff
	CACM, July 1985, Vol 28, #7, p. 680

Conversation-Based Mail
	DRAFT TR August 26, 1985

	Douglas E. Comer, Purdue University
	Larry L. Peterson, University of Arizona

Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 alpha 5/22/85; site cbosgd.UUCP
Path: utzoo!linus!gatech!cbosgd!mark
From: m...@cbosgd.UUCP (Mark Horton)
Subject: Re: Information Overload and What We Can Do About It
Message-ID: <1482@cbosgd.UUCP>
Date: Sun, 15-Sep-85 17:28:07 EDT
Article-I.D.: cbosgd.1482
Posted: Sun Sep 15 17:28:07 1985
Date-Received: Mon, 16-Sep-85 20:44:44 EDT
References: <10381@ucbvax.ARPA>
Organization: AT&T Bell Laboratories, Columbus, Oh
Lines: 14

These are some good points.  I'd like to expand on one of them.

If, within a newsgroup, for each article you form
	concat(references, message-id)
and sort by the result, you'll have all the discussions in order.
Ties are broken by date of submission.  There is no need to look
at the subject line anymore (is there?).
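As a sketch (modern Python, illustrative only), that key is easy to
form, provided the References line lists ancestors oldest-first:

```python
def discussion_key(headers):
    # concat(references, message-id): every article in one discussion
    # shares a common prefix, and each parent sorts just before its
    # own children.
    refs = headers.get("References", "").split()
    return " ".join(refs + [headers["Message-ID"]])

def sorted_discussion(articles):
    # Ties (siblings citing the same parent) could further be broken
    # by date of submission, as suggested above.
    return sorted(articles, key=discussion_key)
```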

Also, it sure would be nice if the discussions (identifiable by
having the same prefix in the above concatenation) were grouped,
so that when you asked "what's next" it showed one line per
conversation, possibly with an article count.


Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site sdcrdcf.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!mhuxt!houxm!vax135!cornell!uw-beaver!tektronix!hplabs!sdcrdcf!lwall
From: lw...@sdcrdcf.UUCP (Larry Wall)
Subject: Re: Information Overload and What We Can Do About It
Message-ID: <2355@sdcrdcf.UUCP>
Date: Wed, 18-Sep-85 18:44:29 EDT
Article-I.D.: sdcrdcf.2355
Posted: Wed Sep 18 18:44:29 1985
Date-Received: Mon, 23-Sep-85 00:31:04 EDT
References: <10381@ucbvax.ARPA> <28f87e10.1de6@apollo.uucp>
Reply-To: lw...@sdcrdcf.UUCP (Larry Wall)
Organization: System Development Corp. R+D, Santa Monica
Lines: 15

I'd love to make rn run off of a multi-key dbm file.  Who'll rewrite inews?
I wouldn't mind, but I don't think I have the time.  I can't even keep up
with my mail on rn, and I'm also maintaining patch and warp.  In a few days
I plan to post an automatic Configure script generator, and that will take
all the more time.  Every now and then I have to do some "real" work too.

Is this multi-key dbms that was mentioned public-domain, portable, reliable,
and efficient?  How much memory would it steal from a process on a dinky
machine?  How does it do on disk space?  Does it rely on "holes" in files?
How can I get a copy?

Partly yours,

Larry Wall

			  SCO's Case Against IBM

November 12, 2003 - Jed Boal from Eyewitness News KSL 5 TV provides an
overview on SCO's case against IBM. Darl McBride, SCO's president and CEO,
talks about the lawsuit's impact and attacks. Jason Holt, student and 
Linux user, talks about the benefits of code availability and the merits 
of the SCO vs IBM lawsuit. See SCO vs IBM.

Note: The materials and information included in these Web pages are not to
be used for any other purpose other than private study, research, review
or criticism.