Extract data from URL before normalization

Alexander Fahlke Tue, 20 Sep 2011 11:08:30 -0700

Hi!

I am trying to write additional metadata to my CrawlDB. This data has to be
extracted from the URLs BEFORE they get normalized (via regex
urlnormalizer).


Here are some sample URLs:

1. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@002@@ZF
2. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@101@@ZF
3. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@024@@ZF

These three URLs differ only in the SERVERID. Both the n-param and the
SERVERID-param are normalized to prevent duplicates.
I want to save the n-param before normalization as metadata for later use
(to create canonical URLs; the n-param controls the page menu).
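
The extraction itself is the simple part. As a minimal sketch in plain Java,
with no Nutch classes involved, this is roughly what I want to run on each URL
before it is normalized. The class name RawQueryParam and the method rawParam
are hypothetical names used only for illustration. The value is deliberately
kept percent-encoded, as in the metadata I want to store.

```java
import java.util.Optional;

public class RawQueryParam {

    /**
     * Returns the still percent-encoded value of the first query parameter
     * called {@code name}, or an empty Optional if it is not present.
     */
    public static Optional<String> rawParam(String url, String name) {
        // Drop any fragment first so a '?' inside it is not mistaken for the query.
        int hash = url.indexOf('#');
        if (hash >= 0) {
            url = url.substring(0, hash);
        }
        int q = url.indexOf('?');
        if (q < 0) {
            return Optional.empty();
        }
        String query = url.substring(q + 1);
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                // A parameter without '=' is treated as having an empty value.
                return Optional.of(eq < 0 ? "" : pair.substring(eq + 1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String url = "https://www.example.com/customers/cards/index.php"
                + "?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@002@@ZF";
        // Prints the raw n-param: %2Fcustomers%2Fcards%2F
        System.out.println(rawParam(url, "n").orElse("(none)"));
    }
}
```

What this sketch does not answer is where in the crawl such a call would have
to sit, which is the actual question.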


I have two problems with this.

First: At which point in the crawl are the URLs normalized? I was
not able to figure that out, since URLs get normalized in so many places.
Second: How do I correctly add custom metadata to my CrawlDB, and WHEN should
I do this?
    e.g. save Metadata: _pst_: success(1), lastModified=0,XNPARAMETER=%2Fcustomers%2Fcards%2F


Any ideas where I could extract the value BEFORE normalization while also
having access to the specific CrawlDatum to add the metadata?

BR
-- 
Alexander Fahlke
Software Development
www.informera.de
