Hi! I am trying to write additional metadata to my CrawlDB. This data has to be extracted from the URLs BEFORE they get normalized (via regex urlnormalizer).
Here are some sample URLs:

1. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@002@@ZF
2. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@101@@ZF
3. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@024@@ZF

These three URLs differ only in the SERVERID parameter. The n parameter and the SERVERID parameter both get normalized to prevent duplicates. I want to save the n parameter as it was before normalization as metadata for later use (to create canonical URLs; the n parameter controls the page menu).

I have two problems with this.

First: At which point in the crawl do the URLs get normalized? I was not able to figure that out, since URLs get normalized in so many places.

Second: How do I correctly add custom metadata to my CrawlDB, and WHEN should I do this? For example, the saved metadata could look like this:

Metadata: _pst_: success(1), lastModified=0,XNPARAMETER=%2Fcustomers%2Fcards%2F

Any ideas where I could extract the value BEFORE normalization and also get access to the specific CrawlDatum to add the metadata?

BR

-- 
Alexander Fahlke
Software Development
www.informera.de
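
P.S. To make the extraction step concrete, here is a minimal sketch in plain Java of what I would like to run on the un-normalized URL. It is not Nutch code: the class name RawParamExtractor and the method rawParam are hypothetical names chosen for illustration, and where to call something like this inside the crawl is exactly the open question above. It returns the value still percent-encoded, which is the form I would like to store as XNPARAMETER.

```java
import java.util.Optional;

// Hypothetical helper (not part of Nutch): pulls the raw, still
// percent-encoded value of one query parameter out of a URL string.
final class RawParamExtractor {

    private RawParamExtractor() {}

    static Optional<String> rawParam(String url, String name) {
        // Drop a fragment first so a '?' or '&' inside it is ignored.
        int hash = url.indexOf('#');
        if (hash >= 0) {
            url = url.substring(0, hash);
        }
        int q = url.indexOf('?');
        if (q < 0) {
            return Optional.empty(); // no query string at all
        }
        String query = url.substring(q + 1);
        // Plain string splitting on purpose: nothing is decoded or rewritten.
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(name)) {
                return Optional.of(pair.substring(eq + 1));
            }
        }
        return Optional.empty(); // parameter not present
    }
}
```

Calling RawParamExtractor.rawParam(url, "n") on any of the three sample URLs above yields %2Fcustomers%2Fcards%2F, regardless of the SERVERID value.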

