Hi Marco,

On Wed, Dec 12, 2012 at 4:07 PM, Marco Crivellaro <[email protected]> wrote:

> I don't really need to run JavaScript server side, I'd just need to replace
> the JavaScript with a canonical link.

Sounds reasonable.

> I believe I should use regexp normalize but it doesn't seem to work. Is
> this correct?

Well, what have you tried so far? Have you written any rules for
regex-normalize? What do they look like, and what results are you getting?
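
One way to rule out a bad regex is to try the pattern and substitution
on their own before putting them into the regex-normalize rules. Below
is a minimal sketch using plain java.util.regex. The link format
(javascript:openPage('...')), the class name and the sample URLs are
all made up for illustration, since I don't know what your pages
contain. It only checks the regex itself, not whether Nutch actually
hands such links to the normalizer.

```java
import java.util.regex.Pattern;

public class NormalizeRuleSketch {

    // Hypothetical pattern: assumes links of the form
    // javascript:openPage('/some/path'). Adjust to the real markup.
    static final Pattern JS_LINK =
        Pattern.compile("^javascript:openPage\\('([^']*)'\\)$");

    // Hypothetical substitution: keep only the captured path.
    static final String SUBSTITUTION = "$1";

    // Applies the pattern/substitution pair to a single URL string.
    // URLs that do not match the pattern are returned unchanged.
    static String normalize(String url) {
        return JS_LINK.matcher(url).replaceAll(SUBSTITUTION);
    }

    public static void main(String[] args) {
        System.out.println(normalize("javascript:openPage('/products/42')"));
        System.out.println(normalize("http://example.com/about"));
    }
}
```

If the output is what you expect here but not in the crawl, the
problem is more likely in how the rule is configured or loaded than in
the regex.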

> Is there any way I can test what the crawled content looks like once
> normalized?

You can look at the ParserChecker tool [0]. It will give you all the
outlinks for a given URL, which I think should let you see how such
links have been normalized on the fly.

I would also advise you to have a look at the parse-js plugin. Not
many people are keen on it, as its behaviour seems rather
unpredictable, but I would certainly give it a shot.

[0] http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java



-- 
Lewis
