Incoming URLs are first normalized and then filtered in ParseOutputFormat. Parse filters are executed in the parser implementation, which runs in the mapper.
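Because normalization rewrites the URL before it reaches the CrawlDB, anything that wants the original n-param has to read it from the raw URL string up front. A minimal sketch of that extraction step (plain Java, no Nutch API involved; the class and method names here are hypothetical, not part of Nutch):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: pull a query parameter out of a raw URL string
// before any URL normalizer has a chance to rewrite or strip it.
public class ParamExtractor {

    // Returns the still percent-encoded value of the named query
    // parameter, or null if the parameter is absent.
    static String rawParam(String url, String name) {
        int q = url.indexOf('?');
        if (q < 0) return null;
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                return eq < 0 ? "" : pair.substring(eq + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "https://www.example.com/customers/cards/index.php"
                   + "?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@002@@ZF";
        String raw = rawParam(url, "n");
        // Keep the encoded form as the metadata value, or decode it:
        System.out.println(raw);
        System.out.println(URLDecoder.decode(raw, StandardCharsets.UTF_8));
    }
}
```

The extracted value could then be put into the CrawlDatum's metadata map by whichever plugin runs before normalization in your setup; storing the percent-encoded form avoids re-encoding it later when building canonical URLs.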
On Tuesday 20 September 2011 20:07:58 Alexander Fahlke wrote:
> Hi!
>
> I am trying to write additional metadata to my CrawlDB. This data has to be
> extracted from the URLs BEFORE they get normalized (via regex
> urlnormalizer).
>
> Here are some sample URLs:
>
> 1. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@002@@ZF
> 2. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@101@@ZF
> 3. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@024@@ZF
>
> You can see three slightly different URLs. The only difference is the
> SERVERID. The n-param and SERVERID-param are getting normalized to prevent
> duplicates.
> I want to save the n-param before normalization as metadata for later use
> (to create canonical URLs; the n-param controls the page menu).
>
> There are some problems I have with this.
>
> First: At which point in the crawl are the URLs getting normalized? I was
> not able to figure that out, since there are so many places where URLs get
> normalized.
> Second: How do I correctly add a custom metadata entry to my CrawlDB, and
> WHEN should I do this?
> e.g. save metadata: _pst_: success(1),
> lastModified=0, XNPARAMETER=%2Fcustomers%2Fcards%2F
>
> Any ideas where it would be possible to extract BEFORE normalization and
> also get access to the specific CrawlDatum to add metadata?
>
> BR

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

