Why Agent Logs Undercount The Blind and Others (was: Basic style sheet question)

by Kynn Bartlett <kynn-hwg(at)idyllmtn.com>

 Date:  Wed, 31 Mar 1999 13:41:00 -0800
 To:  hwg-theory(at)hwg.org
[This message was originally posted to the hwg-plus-stylesheets
mailing list; I'm resending it here to hwg-theory as it's more
appropriate for this list.  --Kynn]

At 03:56 p.m. 03/31/99 -0500, Ann Navarro wrote:
>At 12:12 PM 3/31/99 -0800, Kynn Bartlett wrote:
>>:( Me too...oftentimes the statistics that people are provided with
>>are deceptive and VASTLY UNDERCOUNT people with disabilities and
>>other folks.

>Perhaps you could expand on *how* such filtered statistics can undercount
>those individuals? The mechanism that returns such a result isn't
>necessarily apparent to some developers. 

Hi, Ann, thanks for asking.  I'm going to send this reply both to
this list (hwg-stylesheets) and hwg-theory; followups should go to
hwg-theory as it's not really about stylesheets anymore.  For sub
info, see http://www.hwg.org/lists/hwg-theory/

Agent logs vastly undercount the disabled and other people who might
not be using images because of how the typical user agent log is
implemented.  On almost every system, the default operation is as
follows:

* A request comes in from a browser, for a file.

* The web server records one entry in the access log per file
  requested.  This includes the address of the remote machine (with
  the web browser on it), and the name of the file.

* The web server also records -- if agent logs are turned on --
  the value of the "user agent string" provided by the browser,
  as part of the HTTP header.  Usually this will be Mozilla 3.0
  or 4.0 (compatible; "something something"), if it's not just
  Mozilla (Netscape) itself.

The problem comes from the fact that one agent log line is recorded
per FILE requested, and the agent string is NOT stored with the name
of the file being accessed.  Let's look at the implications of this.
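
To make the mechanism concrete, here is a minimal Python sketch of a
server that keeps the two logs separately.  It is only an
illustration: the function name, the addresses, and the file paths are
made up, and no real server is implemented this way line for line.

```python
# Minimal sketch of a server keeping separate access and agent logs.
# All names here are illustrative, not any real server's API.

access_log = []  # one entry per requested file: (remote address, file name)
agent_log = []   # one entry per requested file: the user agent string only


def record_request(remote_addr, path, user_agent):
    """Record a single file request the way the steps above describe."""
    access_log.append((remote_addr, path))
    # The agent log line carries no file name, so a hit on an image
    # looks exactly like a hit on an HTML page.
    agent_log.append(user_agent)


ie = "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
record_request("192.0.2.10", "/index.html", ie)          # hypothetical page
record_request("192.0.2.10", "/images/banner.gif", ie)   # hypothetical image

print(agent_log)
```

Both requests leave identical lines in the agent log, so nothing in
that log tells you which line was a page and which was a picture.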

Let's say you have a web site with 3 frames.  In the left frame
there are 6 images (a navigation bar); in the upper right frame
there is one image (a banner); and in the lower right, big frame
you have your content, which includes 4 other pictures.  You have
one stylesheet for the navigation bar and a separate one for the
content window, but not one for the simple banner window.

I come to your site using Internet Explorer 4.0.  My browser 
requests the following files:

* The base frameset
  * The left frame
    * The 6 images in the left frame
    * The external stylesheet for the left frame
  * The banner frame
    * The banner image
  * The content frame
    * The 4 images in the content frame
    * The external stylesheet for the content frame

So your webserver dutifully marks down:

Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)
Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)

(That's 17 times, one for each file used.)

Now let's say that instead, I came to look at your site using Opera
3.2, with the images turned off.  (This is my usual mode of surfing;
I only load images if I think there's something worth seeing.)  You
_can_ turn frames off in Opera, but let's assume I have them on.
Also, Opera doesn't do CSS in version 3.2, only 3.5 onward.

Therefore, my browser requests the following files:

* The base frameset
  * The left frame
  * The banner frame
  * The content frame

The server writes the following in the agent log file:

Mozilla/3.0 (compatible; Opera/3.0; Windows 95/NT4) 3.2
Mozilla/3.0 (compatible; Opera/3.0; Windows 95/NT4) 3.2
Mozilla/3.0 (compatible; Opera/3.0; Windows 95/NT4) 3.2
Mozilla/3.0 (compatible; Opera/3.0; Windows 95/NT4) 3.2

If I click on my toggle bar to load images, it will pick up the 11
image files, and add an extra 11 lines to the agent log file; if
I don't do this, they won't be added.  Let's pretend I don't do this
(you designed your website properly, using ALT text and whatnot.)

Now, finally I go and look at it in lynx:

* The base frameset

Hopefully, you had a useable NOFRAMES section (one that didn't just
insult me and/or my browser!).  But only one file was downloaded
using lynx, and so your agent log contains:

Lynx/2.6  libwww-FM/2.14
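
The three visits above can be tallied with a small Python sketch.
The FRAMES table is just my encoding of the example site, and the
function and its parameters are hypothetical names chosen for
illustration.

```python
# Hypothetical model of the example site: each frame lists how many
# images it contains and whether it has an external stylesheet.
FRAMES = {
    "left": {"images": 6, "stylesheet": True},
    "banner": {"images": 1, "stylesheet": False},
    "content": {"images": 4, "stylesheet": True},
}


def files_requested(frames=True, images=True, css=True):
    """Count the files one visit requests, given what the browser loads."""
    count = 1  # the base frameset is always fetched
    if not frames:
        return count  # a non-frames browser reads the NOFRAMES section
    for frame in FRAMES.values():
        count += 1  # the frame's own HTML document
        if images:
            count += frame["images"]
        if css and frame["stylesheet"]:
            count += 1
    return count


hits = {
    "MSIE 4.01": files_requested(),
    "Opera 3.2 (images off)": files_requested(images=False, css=False),
    "Lynx 2.6": files_requested(frames=False),
}
for browser, n in hits.items():
    print(browser, n)
```

Each visit is one person looking at one site, yet the number of agent
log lines it leaves behind depends entirely on what the browser
chooses to fetch.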

So now you run your log analysis program.  You have 17 hits from
MSIE, 4 hits from Opera, and 1 hit from lynx.  This comes out to the
following usage stats:

77% Internet Explorer
18% Opera
 4% Lynx

Therefore, you conclude that only 4% of your users are using lynx.

What's wrong here?

Statistics lie.  Or at least, misapplied statistics lie.  That 4%
figure actually represents ONE THIRD OF YOUR SAMPLE.  As many people
in the example above used Lynx (1 person) as used MSIE (1 person)
or Opera (1 person)!  But as you can see, they were vastly undercounted
in the user agent stats!
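
The arithmetic is easy to check.  This Python sketch computes the
share of hits next to the share of visitors for the example; the
visitor counts assume exactly one person per browser, as in the
scenario above, and the helper name is my own.

```python
# Hits per browser from the example, and the actual number of visitors.
hits = {"Internet Explorer": 17, "Opera": 4, "Lynx": 1}
visitors = {"Internet Explorer": 1, "Opera": 1, "Lynx": 1}


def shares(counts):
    """Return each entry's percentage of the total."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


hit_share = shares(hits)
visitor_share = shares(visitors)
for name in hits:
    print(f"{name}: {hit_share[name]:.1f}% of hits, "
          f"{visitor_share[name]:.1f}% of visitors")
```

The per-hit shares are the whole-number figures quoted above before
the fractions are dropped; the per-visitor shares are an even three-way
split.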

The situation described above is _very_ common.  This is (or was, if
they've changed it recently) the default configuration in which Apache
and other major servers are shipped.  This is the way my webserver is
configured.

Now, you _can_ reconfigure your log files, if you're willing to (a)
write your own processing scripts for them, and (b) mess with the
webserver configuration.  Neither of these is particularly easy, but
it's doable.  The most obvious thing to do is to log the name of the
file with the agent string; then you could filter out those hits that
are not to HTML pages.  (Note that this will still undercount browsers
that don't use frames!)
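
Here is a rough Python sketch of that filtering step.  It assumes a
made-up log layout with the requested path and the agent string on
each line, separated by a tab; the paths are invented to match the
example site, and a real processing script would have to parse
whatever format your server is actually configured to write.

```python
from collections import Counter

# Hypothetical log that stores the requested file next to the agent
# string, one tab-separated line per request.  Paths are made up.
IE = "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
OPERA = "Mozilla/3.0 (compatible; Opera/3.0; Windows 95/NT4) 3.2"
LYNX = "Lynx/2.6  libwww-FM/2.14"

PAGES = ["/index.html", "/nav.html", "/banner.html", "/content.html"]
EXTRAS = [f"/img/{i}.gif" for i in range(11)] + ["/nav.css", "/content.css"]

log_lines = (
    [f"{path}\t{IE}" for path in PAGES + EXTRAS]
    + [f"{path}\t{OPERA}" for path in PAGES]
    + [f"/index.html\t{LYNX}"]
)


def html_hits_by_agent(lines):
    """Count only requests for HTML pages, grouped by agent string."""
    counts = Counter()
    for line in lines:
        path, agent = line.split("\t", 1)
        if path.endswith((".html", ".htm")):
            counts[agent] += 1
    return counts


counts = html_hits_by_agent(log_lines)
for agent, n in counts.items():
    print(n, agent)
```

With images and stylesheets filtered out, MSIE and Opera come out
even, but lynx still shows up with a single hit against their four
each, because it never asked for the three frame documents.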

Someday soon I hope to put up a reference page on why this problem
exists and how to correct it; I'll try to remember to drop a note
to anyone who's interested, when I get it done.


--
Kynn Bartlett  <kynn(at)idyllmtn.com>                   http://www.kynn.com/
Chief Technologist, Idyll Mountain Internet      http://www.idyllmtn.com/
Professional ALT-text author                     http://www.kynn.com/+alt
Spring 1999 Virtual Dog Show!                     http://www.dogshow.com/
WWTBLD?  Validate your HTML!                     http://validator.w3.org/

HWG hwg-theory mailing list archives, maintained by Webmasters @ IWA