Hi Marco, On Wed, Dec 12, 2012 at 4:07 PM, Marco Crivellaro <[email protected]> wrote:
> I don't really need to run javascript server side, I'd just need to replace > the havascript with a canonical link. Sounds reasonable > I believe I should use regexp normalize but it doesn't seem to work. Is > this correct? Well what have you been doing? Have you been trying to write rules for regex-normalize? What do they look like? What results are you getting? > Is there any way I can test how the crawled content looks like once > normalized? You can look at the ParserChecker tool [0] this will give you all outlinks for a given URL. I think this should enable you to see how such links have been normalized on the fly. I would also advise you to have a look at the parse-js plugin. Not many people are keen on it as its consistency seems rather unpredictable but I would certainly give it a shot. [0] http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java -- Lewis

