On Wed, 28 Dec 2011 13:02:14 +0300, bsl wrote:
Hello.

I want to add the page title to the Squid log so that I can view users' surfing history.
Thanks to Henrik Nordstrom and his 2006 reply about this :-)

http://www2.tr.squid-cache.org/mail-archive/squid-dev/200603/0009.html

Following his idea, I parse the web page content in the sendMoreData function
of the client-side routines (client_side_reply.cc).
I find the page title and log it to access.log using a new logformat
token (for example "<tp").

But I have a problem:
The page title is not always logged.
For example, when I visit www.godaddy.com I see its page title in the log.
When I visit www.nasa.gov I don't see the title in the log :(
What did I do wrong? Maybe not all pages are delivered to the client through
the client_side_reply::sendMoreData function?


Doing things with the body in the middle is not as easy as you seem to think...

1) the body is often compressed for transfer.

2) the body may be missing entirely on HTTP/1.1 revalidation transfers.

3) Consider what you would have to do to display the TITLE tag when the response body is bytes 20-50 of a 150 byte compressed object. Those bytes could very well be the "<title>hello</title>" part of an HTML page, but to decompress it while printing the log line is a difficult problem.

4) The <title> and </title> may be split between packets, either with each tag in a separate packet, or with a packet boundary inside the tag itself (i.e. "<ti" then "tle>" as two packets). Squid does its best not to buffer and delay the body contents. This type of response will not be detected by your scanner routines.

5) There are some compression types (SDCH done by Chrome for example) which are binary diff patches on top of a particular representation of an object, which itself may be a result of applying a series of previous patches.

6) There are non-HTML objects transferred which contain TITLE tag look-alikes. XML, JSON, and AJAX responses for example. All of these will add false entries in your log unless you are careful to check for content types.
  ** www.nasa.gov is sending out XML objects which contain several nested copies of an HTML page. TITLE appears multiple times. They have the nasty bug of calling it "text/html" though, so even content type checks will fail here.

7) TITLE can contain anything. Including binary codes. This will screw your log unless you have URL-encoding defined as the quoting style.
  ** www.godaddy.com pages wrap their title in binary bytes.
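
To illustrate point (4) only, here is a minimal sketch of a scanner that survives a tag being split across chunks by carrying a short tail of the previous chunk forward. The class name ChunkedTitleScanner and its methods are hypothetical, not anything in Squid. It assumes an uncompressed body, matches only a bare lowercase "<title>" with no attributes, and does not cap how much title text it will hold, so it does nothing for points (1) to (3) and (5) to (7).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Squid: finds the first <title>...</title>
// in a body that arrives in arbitrary chunks.
class ChunkedTitleScanner {
public:
    // Feed the next chunk of body bytes. Returns true once a title is known.
    bool feed(const char *data, std::size_t len) {
        assert(data != nullptr || len == 0);
        if (found_)
            return true;
        if (len)
            pending_.append(data, len);

        const std::string open = "<title>";
        const std::string close = "</title>";

        if (!inTitle_) {
            const std::size_t pos = pending_.find(open);
            if (pos == std::string::npos) {
                // Keep only the tail that could still be the start of "<title>".
                const std::size_t keep = open.size() - 1;
                if (pending_.size() > keep)
                    pending_.erase(0, pending_.size() - keep);
                return false;
            }
            // Drop everything up to and including the opening tag.
            pending_.erase(0, pos + open.size());
            inTitle_ = true;
        }

        const std::size_t end = pending_.find(close);
        if (end == std::string::npos)
            return false; // closing tag not seen yet; keep accumulating
        title_ = pending_.substr(0, end);
        pending_.clear();
        found_ = true;
        return true;
    }

    const std::string &title() const { return title_; }

private:
    std::string pending_; // unprocessed bytes carried between chunks
    std::string title_;
    bool inTitle_ = false;
    bool found_ = false;
};
```

Before the opening tag is seen the carry-over is at most six bytes, so the body itself is not delayed; only the scanner's private copy is kept.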



I suggest making an eCAP adapter instead of a patch against Squid. Squid is being designed in such a way as to make access to the body content through eCAP/ICAP easy. Adapters still receive body data in snippets as described in (4), but have the option of buffering it if they need to. You can also do things like scan the first N bytes and then speedily skip the rest of the object by instructing Squid to bypass the scanner for the rest of the object.
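
The "scan the first N bytes, then skip" idea might look roughly like the sketch below. It deliberately uses no eCAP calls: ScanState, BoundedTitleScan and consume() are hypothetical names, and how an adapter would act on the Bypass result depends on the adapter API, which is not shown. It also assumes an uncompressed body and a bare lowercase "<title>".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result codes; a real adapter would map Bypass onto whatever
// "stop sending me this body" mechanism its API provides.
enum class ScanState { NeedMore, Found, Bypass };

// Hypothetical helper: buffers at most `limit` body bytes while looking for
// a title, then tells the caller to stop feeding it.
class BoundedTitleScan {
public:
    explicit BoundedTitleScan(std::size_t limit) : limit_(limit) {}

    ScanState consume(const char *data, std::size_t len) {
        assert(data != nullptr || len == 0);
        if (state_ != ScanState::NeedMore)
            return state_; // decision already made; ignore further data

        // Copy only as much as the remaining byte budget allows.
        const std::size_t take = std::min(len, limit_ - buf_.size());
        if (take)
            buf_.append(data, take);

        const std::string open = "<title>";
        const std::string close = "</title>";
        const std::size_t b = buf_.find(open);
        if (b != std::string::npos) {
            const std::size_t e = buf_.find(close, b + open.size());
            if (e != std::string::npos) {
                title_ = buf_.substr(b + open.size(), e - b - open.size());
                state_ = ScanState::Found;
            }
        }
        // Budget exhausted without a complete title: give up on this object.
        if (state_ == ScanState::NeedMore && buf_.size() >= limit_)
            state_ = ScanState::Bypass;
        return state_;
    }

    const std::string &title() const { return title_; }

private:
    std::size_t limit_;
    std::string buf_;
    std::string title_;
    ScanState state_ = ScanState::NeedMore;
};
```

Because the whole prefix is buffered, a tag split across snippets is still found, at the cost of holding up to N bytes per transaction.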

NOTE: there is a registered header "Title:" you can log. Or, if it is missing, add it with an adapter that scans the body for the details.
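
As a rough squid.conf sketch of logging that header with no source patch: the format name "titles" and the log path are placeholders I made up, and you should check the logformat documentation for your Squid version before relying on the exact codes. The "#" modifier asks for URL-encoded output, which is the quoting style point (7) calls for.

```
# hypothetical format name and log path
logformat titles %ts.%03tu %>a %rm %ru %#{Title}<h
access_log /var/log/squid/titles.log titles
```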


Thanks for any ideas.

I made the following changes: (squid 3.1.10, freebsd 8.2 stable, amd64)

AccessLogEntry.h:
+ added char *title; to AccessLogEntry class definition (public section, line 54);

access_log.cc:
+ added LFT_REPLY_PAGE_TITLE to end of enum logformat_bcode_t definition
+ added element "<tp" for LFT_REPLY_PAGE_TITLE to struct logformat_token_table
+ added new case to function accessLogCustom():
     case LFT_REPLY_PAGE_TITLE:
       if (al->title) {
         out = al->title;
         quote = 1;
         dofree = 1;
       }
       break;

client_side_reply.cc:
In function sendMoreData() (line 2078) I added a block to parse the buffer:
  if (http->al.title == NULL) {
    // search TITLE tag
    const char *tag1 = "<title>";
    const char *tag2 = "</title>";
    // search open tag in buf (length in result.length minus length of tag)
    char *ans1 = strstr(buf, (char *)tag1, result.length-7);
    if (ans1) {
      // search close tag in rest of buffer
      char *ans2 = strstr(ans1+7, (char *)tag2, result.length - (ans1-buf)-7);
      if (ans2) {
         int titlelen = ans2 - ans1 - 7;  // title length
         http->al.title = (char *)xcalloc(titlelen + 1,1);
         xstrncpy(http->al.title, &ans1[7], titlelen);
      }
    }
  }

  Implementation of the strstr function:
  char * strstr (char *haystack, char *needle, int strlen)

What you define here is an implementation of strnstr(), *not* strstr().

Your search is also case-sensitive. HTML tags are case-insensitive by definition. <TITLE> and <Title> are two common variations you will miss.
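
A length-bounded, case-insensitive search could be sketched like this. The name findNoCase is made up for illustration and is not a Squid or libc function. Unlike strnstr() it does not stop at a NUL byte; it treats the buffer as exactly haystackLen raw bytes, which seems closer to what a body buffer is.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns a pointer to the first case-insensitive match
// of the NUL-terminated `needle` within the first `haystackLen` bytes of
// `haystack`, or nullptr if there is none.
const char *findNoCase(const char *haystack, std::size_t haystackLen,
                       const char *needle)
{
    assert(haystack != nullptr && needle != nullptr);
    const char *end = haystack + haystackLen;
    const char *hit = std::search(
        haystack, end, needle, needle + std::strlen(needle),
        [](char a, char b) {
            // Cast through unsigned char so bytes >= 0x80 are safe for tolower.
            return std::tolower(static_cast<unsigned char>(a)) ==
                   std::tolower(static_cast<unsigned char>(b));
        });
    return hit == end ? nullptr : hit;
}
```

Called as findNoCase(buf, result.length, "<title>"), this would match <TITLE> and <Title> as well. It still will not match an opening tag that carries attributes.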

Amos
