Re: Accurate user-based log file analysis

Brian Behlendorf
Mon, 17 Jul 1995 17:59:15 -0700 (PDT)

On Mon, 17 Jul 1995, Terry Myerson wrote:
> The first thing Interse' market focus does is group the requests on
> "Differentiating Characteristics." (DC's). These DC's are entries within
> the log files that will be constant throughout a user session, but different
> among absolutely different sessions.

Could you elaborate on these DC's? What can you key off of except
hostnames from CLFF data?
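To make the question concrete: in standard Common Log Format data, the remote host is essentially the only field that even loosely identifies a visitor. A minimal sketch of grouping a request stream that way (the regex and helper here are my own illustration, not Interse's actual method):

```python
import re
from collections import defaultdict

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)$'
)

def group_by_host(lines):
    """Group CLF entries by remote host -- the only candidate
    "differentiating characteristic" visible in plain CLF data."""
    groups = defaultdict(list)
    for line in lines:
        m = CLF.match(line)
        if m:
            groups[m.group('host')].append(m.groupdict())
    return groups
```

The point being: once two visitors share a host (a proxy, a timesharing machine), this grouping has nothing left to key off of.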

> Next, we walk through the request stream within each DC group. New sessions
> are demarcated when objects are requested and not cached, when they should be,

So a request for a previously fetched item that resulted in a 200 instead
of a 304 is considered a new user? That doesn't compute.

> and there is a large time gap in the request stream within a DC group.

Which might be sufficient for lightly loaded sites, but on sites with many
simultaneous visitors coming from behind large proxies, those visitors are
indistinguishable.
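The time-gap heuristic itself is easy enough to sketch; the 30-minute threshold below is my own assumption, since the thread doesn't say what gap Interse uses:

```python
from datetime import datetime, timedelta

# Hypothetical idle threshold -- the actual value Interse uses isn't stated.
GAP = timedelta(minutes=30)

def split_sessions(timestamps):
    """Split one host's sorted request timestamps into "sessions"
    wherever consecutive requests are more than GAP apart.

    Failure mode: many users arriving through one busy proxy never
    leave a gap, so they all collapse into a single long session."""
    sessions = []
    current = []
    for t in timestamps:
        if current and t - current[-1] > GAP:
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions
```

On a heavily loaded proxy the inter-request gaps almost never exceed any sane threshold, which is exactly the indistinguishability problem above.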

It sounds like you can count "sessions" to within 10% accuracy, but that's
much different from counting "users". One person visiting 20 times is
largely indistinguishable from 20 people visiting once.

> >There's going to be a whole lotta hits coming from Vienna, Virginia,
> >White Plains, NY, and Columbus, Ohio!
> Indeed, the online services do lead all other organizations in bringing users
> to the web. Of course, this software will confirm whether this is true of your
> web site's user community.

I think you're missing the point - there is very often (most often?) *no*
connection between the physical location of a web visitor and the
location listed on their Internic registration. Heuristics can only go
so far to assuage this. And when comes on line, half the traffic
could be coming from Redmond, Washington. Perhaps the team at GVU who
did the most recent internet survey could provide some analysis of the
where-people-are-really-located vs. where-the-nic-says-they-are question.

> We've busted our butts to put together a software package which can answer these
> questions, conveniently and cost-effectively.

I don't doubt you spent a lot of effort on this, and that there is a need for
this. You know the line: "there are lies, there are damn lies, and then
there are statistics". It's very important to get the answers to these
questions *right*, and not base them on assumptions and heuristics which just
aren't true, and then make promises about what the numbers mean.

I'd welcome comments on some thoughts on this topic I collected
a little while ago. It's at


--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
http://www.[hyperreal,organic].com/