Re: caching dilemma

Shel Kaphan (sjk@netcom.com)
Sat, 27 May 1995 14:12:28 +0500


>However, I also think it is worth considering for browser writers that
>history stacks (that can be re-viewed with browser navigation
>controls) are in a class of their own when it comes to caching.

I agree; I recently said the same thing in another thread (Client
handling of Expires:) on www-talk. There, I said that

|In my opinion, for this problem to be really solved, a client should
|maintain two stores:
|
| a) a resource content cache that handles Expires, for reducing network
| traffic when a link is clicked for a second time
| b) a `contents that were previously displayed' store, for use by the
| history function and `back/forward' buttons.
|
|Of course, stores a) and b) can share memory for most information.
|Typically, store b) will only be able to hold information for the
|recent history.

It would be wrong to call store b) a (special kind of) cache. Calling
it a `history log' would be more appropriate.
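
As a rough illustration of the two-store split quoted above, here is a
minimal Python sketch. The class and method names (ResourceCache,
HistoryLog, Response) are made up for this example and do not come from
any browser; expiry is modelled as a plain timestamp rather than a
parsed Expires: header.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Response:
    body: str
    expires: float  # absolute expiry time, seconds since the epoch


class ResourceCache:
    """Store a): consulted when a link is followed; honours Expires."""

    def __init__(self) -> None:
        self._entries: Dict[str, Response] = {}

    def put(self, url: str, response: Response) -> None:
        self._entries[url] = response

    def get(self, url: str, now: float) -> Optional[Response]:
        entry = self._entries.get(url)
        if entry is None or entry.expires <= now:
            return None  # missing or expired: the caller must refetch
        return entry


class HistoryLog:
    """Store b): what was previously displayed; never checks Expires."""

    def __init__(self) -> None:
        self._pages: List[Response] = []

    def record(self, response: Response) -> int:
        self._pages.append(response)
        return len(self._pages) - 1  # position in the history

    def show(self, position: int) -> Response:
        # Back/forward navigation shows what was on screen, stale or not.
        return self._pages[position]
```

Both stores hold a reference to the same Response object here, which is
one way to read "can share memory for most information".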

Good decomposition of the problem. I'll go along with this.

Shel Kaphan writes:
>Kee Hinckley writes:
> > Automatic reloading of a page in my history stack seems rather
> > user-unfriendly.

Yes. The main requirement for `history browsing' is that it is fast,
not that it provides up to date results.

One HTTP-spec related issue here is that the current draft HTTP spec
encourages writers of forms whose response messages can change through
time, e.g. a search form on a dynamic database, to set the expires:
field to a date in the past. From section 7.1.8 of the draft:

# If a resource is dynamic by nature,
# as is the case with many data-producing processes, copies of that
# resource should be given an appropriate Expires value which
# reflects that dynamism.

Thus, if a properly programmed (expires header generating) dynamic
search form is accessed with a browser that *does* automatically
reload expired responses in the history, browsing a 20-link search
result will be both slow and resource-intensive.

And with the "other kind" of browser, the current behavior is to
display error messages as the user revisits expired pages in the history
stack. I think we now agree (you and I anyway) that both behaviors are
wrong.

The browser author has almost no choice but to make the history
function ignore the expires: field.

And the CGI script writer has almost no choice but NOT TO USE expires:
if such "Data Missing" error messages would confuse users.

One could argue that the HTTP spec is broken because of this; a
history function that would ignore expires only for search scripts and
the like, not for normal dynamic information, would be preferable.
But currently, there is no safe way of telling the difference between
`search' and `non-search'.

> > I expect history loading to be fast and not go off over
> > the net. I guess I could see it as a user-specified option, but...
> >
>I definitely see your point -- as I see it we're talking about a
>"lesser of evils" situation. When you "back up" to an expired page,
>there are only three things I can think of that could happen:
>1. you see the expired document.
>2. you see an error message and (if you interpret the message correctly)
> you can reload the page manually
>3. the browser reloads the page behind your back.

Well, 3. usually involves animating icons and flashing http transaction
progress messages, so 3. will never be completely `behind your back'
if you pay attention to the screen.

>Well, as Laurie Anderson would put it, "?Que es mas macho?"
>I guess I'd pick door number 1 -- but only for the case where you view
>the page with browser navigation commands, not explicit links.

I agree 1. is best, but of course only for `history browsing'. There
is a subtle point here, however: as the `history log' store b) I
talked about earlier cannot be infinite, the browser is sometimes
forced to do 3. to satisfy a history browsing effect (2. is not really
an option IMO).

The point is that the user never knows beforehand if 1. or 3. will be
done for an older item in the history list, and this is bad if the
item was the result of a non-idempotent POST operation (i.e. a form
submission that `did' something, like order a pizza). If 3. is done
on such an item, this means reposting the form; and this means (unless
the form author is paranoid, and luckily many are) inadvertently
ordering a second pizza. Thus, not having enough RAM in your computer
will be bad for your health :)

There's a solution to this: if the browser needs to flush something
from its cache (which contains the union of pages from the "resource
cache" and "history log"), it should first try flushing pages from the
resource cache. If there are none left to flush (i.e. if the history
log has become dominant), then the history log should be flushed
"oldest first" (or possibly LRU), and the user should not be allowed
to revisit that page, or should only be allowed to visit it with a
warning and then an explicit reload. I.e. this is the same as the behavior
I'm complaining about above, but with the big difference that it is
under proper cache control logic, and so would apply only to the
least-likely-to-be-visited pages in the history log. I would not even
object to having all record of the oldest pages simply removed from
the history.
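
The flushing order described above could look roughly like the
following Python sketch. PageStore, its method names, and the
count-based capacity are assumptions made for illustration; a real
browser would budget bytes and share entries between the two stores.
The sketch uses plain "oldest first" for both stores.

```python
from collections import OrderedDict
from typing import Optional


class PageStore:
    """Hypothetical combined store: resource cache plus history log."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity       # max pages held in total (assumed unit)
        self.resource = OrderedDict()  # url -> body, oldest first
        self.history = OrderedDict()   # history id -> body, oldest first
        self.flushed = set()           # history ids whose content was dropped

    def _make_room(self) -> None:
        while len(self.resource) + len(self.history) >= self.capacity:
            if self.resource:
                # Prefer to drop resource-cache pages first.
                self.resource.popitem(last=False)
            elif self.history:
                # Only then drop the oldest history page.
                hid, _ = self.history.popitem(last=False)
                self.flushed.add(hid)
            else:
                break

    def add_resource(self, url: str, body: str) -> None:
        self.resource.pop(url, None)
        self._make_room()
        self.resource[url] = body

    def add_history(self, hid: int, body: str) -> None:
        self._make_room()
        self.history[hid] = body

    def go_back(self, hid: int) -> Optional[str]:
        # None means: do not refetch silently; warn the user and
        # require an explicit reload (or drop the entry altogether).
        return self.history.get(hid)
```

The point of returning None instead of refetching is that a flushed
entry may be the result of a POST, so reloading it behind the user's
back would mean reposting the form.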

This is an important problem that, in my opinion, can only be solved
by putting extra stuff in the HTTP-spec. A paranoid form author can
provide a 70% solution to this problem within the current HTTP-spec,
but nothing beyond that.
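
The post does not say what a paranoid form author actually does. One
possible guard, sketched below in Python purely as an assumption, is to
embed a one-time token in each rendered form (for instance in a hidden
field) and refuse any submission whose token has already been used.
OrderForm and its methods are hypothetical names.

```python
import uuid
from typing import Set


class OrderForm:
    """Hypothetical server-side guard against accidental reposts."""

    def __init__(self) -> None:
        self._unused_tokens: Set[str] = set()
        self.orders_placed = 0

    def render(self) -> str:
        """Issue a fresh token to embed in the form sent to the browser."""
        token = uuid.uuid4().hex
        self._unused_tokens.add(token)
        return token

    def submit(self, token: str) -> bool:
        """Place the order once per token; a repost of the same form is refused."""
        if token not in self._unused_tokens:
            return False
        self._unused_tokens.remove(token)
        self.orders_placed += 1
        return True
```

This only covers the server's side: the user still cannot tell
beforehand whether going back will repost, which is why it is at best a
partial solution.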

For a further discussion of this problem, see my article `HTTP and
statefull services' in the www-talk archive.

Shel Kaphan writes:
>I realize these considerations may have no role in the HTTP spec,

The more I think about it, the more I am convinced that these
considerations _do_ have a role in the HTTP spec:

1) parts of the solution to these problems involve HTTP extensions.

2) Also, as long as we only have the HTTP spec and the HTML spec to
specify the behavior of browsers, the HTTP spec is the most likely
place to solve this problem, even though the issue goes beyond data
transfer.

I've been meaning to submit some report/proposal to the http-wg
mailing list about this, but I have not yet had the time to write one.
If anyone wants to help put together such a report, please mail
me.

Count me in.

>however I feel there are serious problems in this area, which can only
>be resolved by coordinating the behavior of browsers and servers.

I agree there are serious problems. Besides browser and server
authors, CGI script authors are also involved.

I feel www-talk would be a good place to discuss these problems and
possible solutions. I have tried to get a discussion going a number
of times, but so far, little has happened :(

Well, I think we have the basis for a pretty concrete proposal above,
and would be happy to work with you to put it in the right form and
before a forum that can act on it.

[...]
>Another thing that might help: perhaps there should be a way for
>servers to "force" the URL (the *name*) handled by clients to something other
>than the requested URL.

I believe the redirection (3xx) codes in the HTTP spec could be used
(abused?) for this purpose.

[...]
>To explain this a little more, if there were two GET requests, one for
>/cgi-bin/food/hamburgers and one for /cgi-bin/food/french-fries, which
>would result in a single page that ought to be cached as one page,
>then the server ought to be able to say, "you asked for
>/food/french-fries, but the page is called /food/generic-junk-food",
>and to have the browser use that info to uniquely identify a cache
>entry and update it with the newly fetched data. This might not help
>to avoid fetching documents extra times, but it would help on cache
>coherence if the intent was to display a dynamically generated document.

I don't think this would help with cache coherence at all, for proper
definitions of `coherence'. There is no reason for the `cache for
real browsing commands' ever to become incoherent (contain expired
page contents). It seems to me you are proposing to automatically
update old versions of a page in the `history log'. If new contents
for that URL are received, `history browsing' back would then display
the new, changed price of hamburgers, I assume.

Hmmm -- I think perhaps I didn't explain this very clearly.
Now, I think all this is unneeded if you have 'expires' working
properly; nonetheless I should explain what I meant:

Let's consider another example, this time a more realistic one.
Actually this occurred to me because of another browser, uh, mis-feature.
This is the way Lynx (and possibly other browsers) improperly displays hidden
fields on forms. My workaround for that was to put what I would have
put in hidden fields into the URL -- i.e. encode the same information
that would have been in a hidden field into PATH_INFO instead.

So suppose you have two ways of viewing a certain page. One way
involves a form submission which changes the state somewhere in
CGI-land, and would display a new "shopping basket" for the user.
The purpose of the form was to add something to the shopping basket and
then display the contents. So the URL might contain something like
/cgi-bin/shopping-basket/product-code=WHOPPER+FRIES+SHAKE.
The product code is in the URL because of the hidden field bug.

Later on, the user wants simply to view the shopping basket, so they
click on a link to /cgi-bin/shopping-basket.

In the current world, the two pages would be cached separately, so if
the buggy browser ignores the expires field (and you should probably
ignore this discussion if that can be made to work...) then even after you
change the state of the shopping basket by submitting the form,
following the link to /cgi-bin/shopping-basket might well show obsolete
data, *even if it's really the same page*. The problem is that the
caching is not based on the identity of the displayed page, but on the
identity of the requesting URL. So I am suggesting some additional
header field which, if present (and it's of course optional!) should be
used as the document identifier in the client's cache, instead of the
requestor's URL.
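
To make the suggestion concrete, here is a small Python sketch of a
client cache keyed on an optional identity header. The header name
"X-Document-Name" is invented for this example and is not taken from
any spec; header lookup is case-sensitive here for brevity, which real
HTTP header handling is not.

```python
from typing import Dict, Mapping

# Hypothetical header name, used only for this illustration.
IDENTITY_HEADER = "X-Document-Name"


def cache_key(request_url: str, headers: Mapping[str, str]) -> str:
    """Use the server-supplied document name if present, else the request URL."""
    return headers.get(IDENTITY_HEADER, request_url)


def store(cache: Dict[str, str], request_url: str,
          headers: Mapping[str, str], body: str) -> None:
    """File the response under its document identity, replacing any older copy."""
    cache[cache_key(request_url, headers)] = body
```

With this, a response to
/cgi-bin/shopping-basket/product-code=WHOPPER+FRIES+SHAKE that names
itself /cgi-bin/shopping-basket overwrites the entry a later click on
the plain /cgi-bin/shopping-basket link would look up, so the stale
basket is no longer shown. Responses without the header behave exactly
as before.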

While such a scheme would be great for some applications, it should
not be the default, or it should at least be possible for _the service
author_ to switch off. I can imagine plenty of cases where the user
wants to see _the old_ version of the page (e.g. the chess board 3
moves ago, the gold price 10 minutes ago), if at all possible.

Koen.

It would not only be switch-off-able, but you wouldn't get this
behavior at all unless you put in the extra, optional, header field.

--Shel
sjk@amazon.com, or sjk@netcom.com.