Access and Accountability: Tracking and Reporting Web Site Usage Data
R. Daniel Lineberger
Professor of Horticulture and Webmaster of Aggie Horticulture
dan-lineberger@tamu.edu; http://aggie-horticulture.tamu.edu
Jerry Parsons
Professor and Extension Vegetable Specialist
jerryparsons@tamu.edu; http://www.plantanswers.com
Department of Horticultural Sciences
Texas A&M University and Texas Agricultural Extension Service
College Station, TX 77843-2133
Abstract
The World Wide Web has been adopted by most land grant universities and state agencies as the preferred mode of information delivery to their clientele. Publications are still useful, but increasing costs of preparation, printing and distribution make them the choice for a decreasing number of applications. Printed documents were visible measures of demand for information. Those that answered clientele needs were always out of print, and others accumulated in publication rooms, their pages turning yellow. Analysis of Web site access statistics provides a similar barometer of program health. We have tracked access statistics for the Aggie Horticulture Web site since its inception in October, 1994 and will discuss how we have altered program emphasis, site management, and contributor expectations during the growth from circa 10,000 hits/month to over 2,000,000 hits/month. We have "interpreted" these data to Extension and College administrators and have been successful in garnering increased support to fund needed improvements to the Web site. Important components of these analyses include being able to distinguish user sessions from "hits," location of the users, and type of information requested.
Introduction
Designing, preparing and delivering effective programming has never been enough for the Extension Service. Administrators have always demanded an estimate of the impact of our programs: some tangible way of determining who came, how they learned, and whether or not they used the information we presented in some pattern-changing manner. These data have always been elusive, have often been mere "guesses," and sometimes have been merely our grateful clientele telling us what they thought we wanted to hear!
As the World Wide Web has evolved into a major component of all Extension programs, we are faced with yet another dilemma in interpreting the massive amount of data that can be collected so effortlessly by our Web server log files. The log file for Aggie Horticulture for April, 2000 was 232 MB in a simple flat text file. How does one find meaning in those data in a realistic time frame with reasonable effort?
Data Collection
Aggie Horticulture is the comprehensive Web site of the Texas Horticulture program serving the needs of the teaching, research and Extension missions of faculty and specialists at College Station, various research and Extension centers, and some county offices. The site has grown considerably from its genesis in October, 1994 to nearly 2 GB of information. The current server is a 600 MHz Pentium III operating under Windows NT 4.0 serving out of Netscape Enterprise Server 3.61 and has been in place since October, 1999. It is mirrored on a 300 MHz Pentium II with the same operating system and server software specs, the mirror serving as a "live backup" rather than an active mirror site.
Netscape Enterprise server 3.61 has the capability to write a log file in common log format containing the following data:
- Client hostname
- Authenticate user name
- System date
- Full request
- Status
- Content length
- HTTP header, "referer"
- HTTP header, "user-agent"
- Method
- URI (the URL path)
- Query string of the URI (anything in the URL after the question mark)
- Protocol
On our system, we log by client IP address rather than forcing client hostname lookup, and we only log the following:
- Client hostname
- Authenticate user name
- System date
- Full request
- Status
- Content length
Server operation is much more efficient without client hostname lookup (Lipschutz et al.).
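The six logged fields correspond to one line of a common log format file. As a minimal sketch of how such a line can be split into those fields, the Python below parses a single entry; the sample line, its address, and its path are hypothetical, and the regular expression assumes the standard common log format layout rather than any Netscape-specific variant.

```python
import re
from datetime import datetime

# Common log format: host ident authuser [date] "request" status bytes
# (assumed layout; the second field is the rarely used ident value)
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<length>\d+|-)$'
)

def parse_clf_line(line):
    """Return a dict of the six logged fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["date"] = datetime.strptime(record["date"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" content length means no body was sent
    record["length"] = 0 if record["length"] == "-" else int(record["length"])
    return record

# Hypothetical log entry, for illustration only
sample = ('192.0.2.10 - - [15/Apr/2000:10:12:01 -0500] '
          '"GET /example/page.html HTTP/1.0" 200 5120')
record = parse_clf_line(sample)
print(record["host"], record["status"], record["length"])
```

Because the host field holds an IP address when hostname lookup is disabled, any conversion to client names has to happen later, at analysis time.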
Web logs are removed while the server is stopped so that a new log file is written upon restarting the server. Logs are analyzed biweekly and monthly using the current version of Analog (Analog 4.1/Win32 as of May, 2000). Analog is an extremely efficient freeware log analyzer written by Stephen Turner at the University of Cambridge Statistical Laboratory ( http://www.statslab.cam.ac.uk/~sret1/analog/ ). Analog analyzed the two biweekly server logs for April, 2000 (a total of 232 MB) and prepared the summary html report in 35 seconds on a 600 MHz Pentium III (Windows NT 4.0) equipped with 256 MB RAM.
The two biweekly server logs are analyzed together monthly using WebTrends. Analysis of the two biweekly logs (a total of 232 MB) took approximately 12 hours on a 300 MHz Pentium II. WebTrends required about 70 MB of virtual memory over and above the 132 MB of system memory (only 40 MB of which was unoccupied at the time the program was started). The WebTrends log analyzer is set to perform reverse domain lookup to convert IP addresses to client names.
Self-accesses were excluded from the log analysis. All log analyses since October, 1994 are available for public inspection from a link on the Aggie Horticulture homepage to the Web statistics page at: http://aggie-horticulture.tamu.edu/webstat/webstat.html.
Interpretation of Web Server Statistics
By any measure, the increased activity of the Aggie Horticulture Web site has exceeded our expectations. In its first 4 months of operation, Aggie Horticulture served just over 11,000 files (hits). In April, 2000 the site accumulated over 2.3 million hits!
In April, 1995 the sum of the .net and .com clients totaled 12% of site activity (.edu was 56%). In the early stages of the development of the Web, we were a bunch of academics talking to each other. In April, 2000 the sum of .net and .com clients was over 86% (.edu was 11%). We clearly are not talking to ourselves anymore.
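The domain percentages above come from the log analyzer's reports. As a rough sketch of the underlying arithmetic, the Python below tallies requests by top-level domain once IP addresses have been resolved to client names; the function name `tld_shares` and the host names are hypothetical, and addresses that could not be resolved are grouped separately.

```python
from collections import Counter

def tld_shares(hostnames):
    """Return the percentage of requests per top-level domain."""
    counts = Counter()
    for host in hostnames:
        last_label = host.rsplit(".", 1)[-1].lower()
        # A numeric last label means the IP address was never resolved
        key = "unresolved" if last_label.isdigit() else "." + last_label
        counts[key] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tld: 100.0 * n / total for tld, n in counts.items()}

# Hypothetical resolved client names, one per request
hosts = ["a.example.com", "b.example.net", "c.example.edu", "192.0.2.7"]
shares = tld_shares(hosts)
print(shares)
```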
Sorting through the Data
One of the most important sections of the Web log analysis is the directory report. The directory report sorts the data by top level directory and can be used to estimate the relative interest in different subsections of the Web site.
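A minimal sketch of what a directory report computes is shown below in Python: each requested path is assigned to its top-level directory and the hits are counted. The directory names are hypothetical, and this is an illustration of the idea rather than the method used by Analog or WebTrends.

```python
from collections import Counter

def top_level_directory(path):
    """Map a requested URL path to its top-level directory ("/" for root files)."""
    path = path.split("?", 1)[0]          # drop any query string
    segments = path.split("/")            # "/a/b.html" -> ["", "a", "b.html"]
    # Everything between the leading slash and the final segment is a directory
    directories = [s for s in segments[1:-1] if s]
    return "/" + directories[0] + "/" if directories else "/"

def directory_report(paths):
    """Return (directory, hits) pairs sorted by descending hit count."""
    return Counter(top_level_directory(p) for p in paths).most_common()

# Hypothetical request paths, for illustration only
paths = ["/plants/a.html", "/plants/img/b.gif", "/kids/", "/index.html"]
report = directory_report(paths)
print(report)
```

Counting by top-level directory works best when each contributor's material lives under its own directory, so that the report maps directly onto information providers.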
Aggie Horticulture represents the information of several specialists and county agents. Some provide the information to us for coding for the Web, and others have learned to write html code or have learned to use editing software and format the material for Web delivery themselves. One of the more meaningful uses of our Web statistics is that the person responsible for providing the information can follow the increased use of his or her materials, can report this increased use in accountability reports, and can compare the use of their information to that of other providers.
Hits versus User Sessions
One compelling reason to spend $500 on a commercial software package is the ability to extract "value added" information from the numbers. The ability to place cookies for specific client tracking, to accumulate "click throughs" on advertising banners, and to determine the most common entering and exiting points all yield powerful data.
Perhaps the most powerful capability from an Extension point of view is the ability to calculate "user sessions." A user session is defined as "A session of activity (all hits) for one user of a Web site. By default, a user session is terminated when a user is inactive for more than 30 minutes." (4). Analog computes a somewhat different number, the "number of distinct hosts served," which does not account for repeated visits by the same client. The user session is the virtual analog of an office visit or a phone call to obtain information and is believed by some to be a more accurate estimator of impact than "hits," since hits can be inflated by the incorporation of non-information laden images (bullets, sprites and other graphical elements) into the Web site design. No fewer than 12 information providers are currently monitoring and reporting on their Web statistics, and some, like Jerry Parsons, are widely distributing a very detailed report as will be seen later.
Regardless of the pitfalls associated with using "hits" or "user sessions" to make interpretations of program strength, the fact remains that people (including administrators) now understand what they are and are paying increasing attention to them as a measure of information delivery.
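The difference between hits, distinct hosts, and user sessions can be sketched in a few lines of Python. The example below applies the 30-minute inactivity rule quoted above, under the simplifying assumption that each client address is one user; it is not WebTrends' actual algorithm, and the hosts and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def count_user_sessions(hits, timeout=SESSION_TIMEOUT):
    """Count sessions from (host, timestamp) pairs.

    A host starts a new session on its first hit, or on any hit that
    follows more than `timeout` of inactivity from that same host.
    """
    last_seen = {}
    sessions = 0
    for host, when in sorted(hits, key=lambda hit: hit[1]):
        previous = last_seen.get(host)
        if previous is None or when - previous > timeout:
            sessions += 1
        last_seen[host] = when
    return sessions

# Hypothetical hits: host A returns after a 45-minute gap
hits = [
    ("192.0.2.1", datetime(2000, 4, 15, 9, 0)),
    ("192.0.2.1", datetime(2000, 4, 15, 9, 5)),
    ("192.0.2.2", datetime(2000, 4, 15, 9, 10)),
    ("192.0.2.1", datetime(2000, 4, 15, 9, 50)),
]
total_hits = len(hits)
distinct_hosts = len({host for host, _ in hits})
user_sessions = count_user_sessions(hits)
print(total_hits, distinct_hosts, user_sessions)
```

In this sample the same four hits yield two distinct hosts but three user sessions, which illustrates why a distinct-host count misses repeat visits. Identifying users by address alone will merge many users who share one proxy address, so the result is an estimate.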
Gleaning out Other Gems
Most Requested Pages
WebTrends summarizes the hits, user sessions and time viewed statistics for the most frequently accessed pages on the Web site. These data can be used for "data mining" by helping to get a picture of the general types of information most commonly used on the Web site. As an aggregate, for example, these data tell us that Aggie Horticulture is used more by home owners than by producers, and that our wildflower information, home landscaping information, and children's programs are the backbone of our Web site.
Top Entry Pages
Top entry pages are the "windows" to our information. In our case, the wildflower information, our PLANTanswers archives, and our Extension publications are the top entry points.
Most Active Cities
We are not able to determine where our users live by looking at the "most active cities," merely where their Internet service provider is located. However, one can deduce that America Online is an important force on the Web as related to users of Aggie Horticulture, since the number of logins from Reston, Virginia represented about 25% of the user sessions in April, 2000.
Links to the Web Site as An Indicator of Use
While imitation may be the most sincere form of flattery, having someone link to your information is certainly flattering. One can approximate the number of links to their Web site by using the free utility, http://www.linkpopularity.com. Aggie Horticulture is well ahead of other comparable Web sites, including the horticulture Web sites at Ohio State University and the University of Florida.
Results from a tool like LinkPopularity must be interpreted with some caution, but they can provide an additional measure of effectiveness, especially when used in a peer-comparative sense.
Web Log Analysis Does Not Address Impact!
In the early days of the Web, much of the activity involved exchanges within the .edu domain. We were comparing ourselves with other similar institutions, and the first group to develop a "new" way of displaying Extension information was emulated and linked to extensively. I remember having a conversation with a colleague about why I was investing so much time and energy putting information on our Web site, when that information was already there at various other places on the Web and all we had to do was to link to it. My response was that if we wanted to be "information providers" in the tradition of what our Extension Horticulture Program had always stood for, then we had to assume responsibility for formatting our own information for Web delivery; no one else would do it for us!
As early as 1996, the true nature of the Web began to emerge. The .net and .com domains began to grow explosively, and consumers and agricultural producers were being connected in increasing numbers. Our clientele had the capability for reaching our Web sites, but how could we document that they were using the information and not just casually browsing?
It is very difficult to determine whether clients are using the information you provide without asking them directly. This is where an online survey is critical. In June, 1999 we started a procedure that will become an annual event. We substituted a user questionnaire for our home page, and "forced" users to view the form for 1 week as they connected to Aggie Horticulture. The survey instrument was patterned after a general user survey designed by Howard Ladewig for use by Extension specialists in evaluating program effectiveness. Users had the option of skipping the survey and going to the homepage instead by clicking an appropriately placed link. Data were accumulated directly into a FileMaker Pro database.
Only about 16% of the hits received by the homepage turned into completed surveys. This underestimates the actual percentage response, since many of the hits were likely a result of computers whose browsers were set to Aggie Horticulture as the browser homepage. In this year's survey, we will substitute the survey form for our "top entry pages" since a fairly small percentage of our hits actually come through our main homepage (7.3%).
Important points derived from the user survey were:
- 61% of the respondents were Texas residents
- the average number of weekly visits was over 4
- 73-82% of the respondents graded the Web site as good or outstanding on accuracy, timeliness, clarity, problem solving, and overall rating
- 91% of the respondents indicated they would use the information obtained.
The fact that 91% of those responding to our survey told us they would use the information we provided is the best indication that we have that our Web site is working.
Summary
If it is true that Web statistics fall within the same category as "lies and damn lies," then we must be careful about interpreting them at face value. We certainly can inflate the numbers by using non-information laden graphical elements (bullets, sprites, etc.) liberally. If we use them as our only measure of effectiveness, then we probably fall into the same trap. If we use them as a way to build upon our strengths and correct some of our weaknesses, then they will have served a useful purpose. If we use them as a vehicle for getting more people involved in information providing, then they will have value. If we use them as a way of showing our supervisors and others in a position of authority that the Web has arrived as a bona fide delivery vehicle for Extension programs, then Web statistics will have been extremely useful.
References
1. Lipschutz, R. P., L. Gilbert, K. Heard, J. Kent, M. Nguyen, K. Smith and A. Soofi. 1997. Mastering Netscape SuiteSpot 3 Servers. Sybex Network Press, San Francisco, 1027 Pp.
2. Turner, S. 2000. Analog Web Log Analysis Program. http://www.statslab.cam.ac.uk/~sret1/analog/
3. WebTrends Log Analyzer 4.2. 1998. http://www.webtrends.com
4. Anonymous. 1998. WebTrends Log Analyzer (user's guide). WebTrends Corporation, 183 Pp.
5. Anonymous. http://www.linkpopularity.com/
Quotable Quotes
- "There are three kinds of lies: lies, damn lies, and statistics." So reportedly said Benjamin Disraeli, prime minister of Great Britain from 1874 to 1880, as quoted in Mark Twain's autobiography.
- "Why Web usage statistics are (worse than) meaningless"
http://www.cranfield.ac.uk/docs/stats
- "Web statistics may give the user a false sense of knowledge which is worse than being knowingly ignorant."
Jeff Goldberg, Cranfield Computer Centre
- "Log file analysis is perhaps best viewed as an art disguised as a science."
Susan Haigh and Janette Megarity, Network Notes #57, ISSN 1201-4338, Information Technology Services, National Library of Canada
http://www.nlc-bnc.ca/pubs/netnotes/notes57.htm