Incoming URLs are first normalized and then filtered in ParseOutputFormat. 
Parse filters are executed in the parser implementation, which runs in the mapper.
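The extraction itself is simple enough to do anywhere you still have the raw URL. Here is a rough sketch in plain Java (the class and method names are made up for illustration, this is not Nutch API) that pulls the still-encoded n-param out of the raw URL before any normalizer touches it; from a custom plugin you could then put that value into the CrawlDatum metadata:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: captures the raw n-param
// from a URL before normalization strips it.
public class NParamExtractor {

    // Matches the still-encoded n-param in the raw query string.
    private static final Pattern N_PARAM = Pattern.compile("[?&]n=([^&]+)");

    /** Returns the raw (still URL-encoded) value of the n-param, or null. */
    public static String extractNParam(String url) {
        Matcher m = N_PARAM.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String raw = "https://www.example.com/customers/cards/index.php"
                   + "?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@002@@ZF";
        String n = extractNParam(raw);
        System.out.println(n);                              // raw, encoded value
        System.out.println(URLDecoder.decode(n, "UTF-8"));  // decoded path
    }
}
```

The same regex could live inside a custom URL filter or scoring plugin; the point is only that it must run on the URL as fetched, before the regex urlnormalizer rewrites the query string.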

On Tuesday 20 September 2011 20:07:58 Alexander Fahlke wrote:
> Hi!
> 
> I am trying to write additional metadata to my CrawlDB. This data has to be
> extracted from the URLs BEFORE they get normalized (via regex
> urlnormalizer).
> 
> Here are some sample URLs:
> 
> 1. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@002@@ZF
> 2. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@101@@ZF
> 3. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@024@@ZF
> 
> You can see three slightly different URLs; the only difference is the
> SERVERID. The n-param and SERVERID-param are normalized to prevent
> duplicates.
> I want to save the n-param before normalization as metadata for later use
> (to create canonical URLs; the n-param controls the page menu).
> 
> 
> There are some problems I have with this.
> 
> First: At which point in the crawl are the URLs getting normalized? I was
> not able to figure that out, since there are so many places where URLs get
> normalized.
> Second: How do I correctly add custom metadata to my CrawlDB, and WHEN
> should I do this?
>     e.g. save metadata: _pst_: success(1),
> lastModified=0,XNPARAMETER=%2Fcustomers%2Fcards%2F
> 
> 
> Any ideas where it would be possible to extract the parameter BEFORE
> normalization, and also get access to the specific CrawlDatum to add metadata?
> 
> BR

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
