Re: Log web page TITLE to access.log

Amos Jeffries Wed, 04 Jan 2012 14:43:49 -0800

On Wed, 28 Dec 2011 13:02:14 +0300, bsl wrote:

Hello.

I want to add the page title to the squid log in order to view the user's surfing history.

Thanks to Henrik Nordstrom and his reply in 2006 about this :-)

http://www2.tr.squid-cache.org/mail-archive/squid-dev/200603/0009.html

Following his idea I parse web page content in function sendMoreData
of client side routines (client_side_reply.cc)
I found the page title and log it to access.log using new logformat
token (for example "<tp").

But I have a problem:
The page title is not always logged.
For example I visit www.godaddy.com - I see its page title in the log.
I visit www.nasa.gov - I don't see the title in the log :(
What did I do wrong? Maybe not all pages are given to the client through
the client_side_reply::sendMoreData function?

Doing things with the body in the middle is not as easy as you seem to think...


1) the body is often compressed for transfer.

2) the body may be missing entirely on HTTP/1.1 revalidation transfers.

3) Consider what you would have to do to display the TITLE tag when the response body is bytes 20-50 of a 150 byte compressed object. Those bytes could very well be the "<title>hello</title>" part of an HTML page, but to decompress it while printing the log line is a difficult problem.

4) The <title> and </title> may be split between packets, either each in separate packets, or a packet boundary inside the tag itself (ie "<ti" then "tle>" as two packets). Squid does its best not to buffer and delay the body contents. This type of response will not be detected by your scanner routines.

5) There are some compression types (SDCH done by Chrome for example) which are binary diff patches on top of a particular representation of an object, which itself may be a result of applying a series of previous patches.

6) There are objects transferred which are not HTML but which contain TITLE tag look-alikes. XML, JSON, and AJAX responses for example. All of these will add false entries in your log unless you are careful to check for content types.

  ** www.nasa.gov is sending out XML objects which contain several nested copies of an HTML page. TITLE appears multiple times. They have the nasty bug of calling it "text/html" though, so even content type checks will fail here.

7) TITLE can contain anything. Including binary codes. This will screw up your log unless you have URL-encoding defined as the quoting style.

  ** www.godaddy.com pages wrap their title in binary bytes.
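
To illustrate point (4), here is a minimal sketch of a scanner that keeps a small amount of state between chunks, so a tag split across a packet boundary is still found. It is not Squid code: the class name TitleScanner and the two size limits are made up for the example. It assumes the bytes it is fed are already decompressed, and it only matches a bare <title> tag (no attributes).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Squid. Collects the first
// <title>...</title> from a body that arrives in arbitrary chunks.
class TitleScanner {
public:
    TitleScanner() : inTitle_(false), done_(false), gaveUp_(false), scanned_(0) {}

    // Feed the next chunk of body bytes.
    // Returns true once a complete title has been captured.
    bool feed(const char *data, std::size_t len) {
        if (done_ || gaveUp_)
            return done_;
        pending_.append(data, len);
        if (!inTitle_) {
            const std::size_t open = findNoCase(pending_, "<title>");
            if (open == std::string::npos) {
                // Keep only a tail long enough to hold a split "<title>".
                if (pending_.size() > 6)
                    pending_.erase(0, pending_.size() - 6);
                scanned_ += len;
                gaveUp_ = scanned_ > kMaxScan;  // stop scanning large bodies
                return false;
            }
            pending_.erase(0, open + 7);        // drop everything up to the tag end
            inTitle_ = true;
        }
        const std::size_t close = findNoCase(pending_, "</title>");
        if (close == std::string::npos) {
            gaveUp_ = pending_.size() > kMaxTitle;  // refuse to buffer without limit
            return false;
        }
        title_ = pending_.substr(0, close);
        pending_.clear();
        done_ = true;
        return true;
    }

    bool done() const { return done_; }
    const std::string &title() const { return title_; }

private:
    // Case-insensitive search; 'needle' must be given in lower case.
    static std::size_t findNoCase(const std::string &hay, const std::string &needle) {
        for (std::size_t i = 0; i + needle.size() <= hay.size(); ++i) {
            std::size_t j = 0;
            while (j < needle.size() &&
                   std::tolower(static_cast<unsigned char>(hay[i + j])) == needle[j])
                ++j;
            if (j == needle.size())
                return i;
        }
        return std::string::npos;
    }

    static const std::size_t kMaxScan = 65536;  // assumed limit for the example
    static const std::size_t kMaxTitle = 4096;  // assumed limit for the example
    std::string pending_;
    std::string title_;
    bool inTitle_, done_, gaveUp_;
    std::size_t scanned_;
};
```

Feeding it "<html><head><TI", "TLE>Hel", "lo</ti" and "tle></head>" one after another captures "Hello" on the last call. The captured bytes are raw, so point (7) about quoting still applies when writing them to the log.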

I suggest making an eCAP adapter instead of a patch against Squid. Squid is being architected in such a way as to make access to the body content through eCAP/ICAP easy. They still receive body data in snippets as described in (4), but have the option of buffering it if they need to. You can also do things like scan the first N bytes then speedily skip the rest of the object by instructing Squid to bypass the scanner for the rest of the object.

NOTE: there is a registered header "Title:" you can log. Or, if missing, add with an adapter scanning for details in the body.
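
As a rough sketch of logging that header, a squid.conf fragment along these lines should work with the stock logformat codes. The format name "titled" and the log path are placeholders chosen for the example. The "#" modifier asks for URL-quoted output, which covers the concern in (7).

```
# "titled" is an arbitrary format name; the path is an assumption.
# %{Title}<h logs the Title: reply header, "#" selects URL quoting.
logformat titled %ts.%03tu %>a %rm %ru %#{Title}<h
access_log /var/log/squid/access.log titled
```

Responses without a Title: header get "-" in that column.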


Thanks for any ideas.

I made the following changes: (squid 3.1.10, freebsd 8.2 stable, amd64)


AccessLogEntry.h:
+ added char *title; to AccessLogEntry class definition (public
section, line 54);

access_log.cc:

+ added LFT_REPLY_PAGE_TITLE to end of enum logformat_bcode_t definition

+ added element "<tp" for LFT_REPLY_PAGE_TITLE to struct
logformat_token_table
+ added new case to function accessLogCustom():
     case LFT_REPLY_PAGE_TITLE:
       if (al->title) {
          out = al->title;
          quote = 1;
          dofree = 1;
       }
       break;

client_side_reply.cc:

In function sendMoreData() line 2078 I added block for parsing buffer:

  if (http->al.title == NULL) {
    // search TITLE tag
    const char *tag1 = "<title>";
    const char *tag2 = "</title>";
    // search open tag in buf (length in result.length minus length of tag)
    char *ans1 = strstr(buf, (char *)tag1, result.length-7);
    if (ans1) {
      // search close tag in rest of buffer
      char *ans2 = strstr(ans1+7, (char *)tag2, result.length - (ans1-buf)-7);
      if (ans2) {
         int titlelen = ans2 - ans1 - 7;  // title length
         http->al.title = (char *)xcalloc(titlelen + 1,1);
         xstrncpy(http->al.title, &ans1[7], titlelen);
      }
    }
  }

  Realisation of strstr function:
  char * strstr (char *haystack, char *needle, int strlen)


What you define here is an implementation of strnstr(), *not* strstr().

Your search is also case-sensitive. HTML tags are case agnostic by definition. <TITLE> and <Title> are two common variations you will miss.
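
A length-bounded, case-insensitive search could look roughly like the sketch below. The name findTagNoCase is invented for the example. It is not a libc or Squid function, and it is only meant to show the idea of comparing byte by byte through tolower() without reading past the given length.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical helper: find 'tag' inside the first 'bufLen' bytes of 'buf',
// ignoring ASCII case. 'buf' does not need to be NUL-terminated.
// Returns a pointer to the match, or NULL when there is none.
const char *findTagNoCase(const char *buf, std::size_t bufLen, const char *tag)
{
    const std::size_t tagLen = std::strlen(tag);
    if (tagLen == 0 || tagLen > bufLen)
        return NULL;
    // Stop early enough that the whole tag always fits inside the buffer.
    for (std::size_t i = 0; i + tagLen <= bufLen; ++i) {
        std::size_t j = 0;
        while (j < tagLen &&
               std::tolower(static_cast<unsigned char>(buf[i + j])) ==
               std::tolower(static_cast<unsigned char>(tag[j])))
            ++j;
        if (j == tagLen)
            return buf + i;
    }
    return NULL;
}
```

With this, "<title>", "<TITLE>" and "<Title>" are all found by a single call with the tag given as "<title>".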


Amos
