Have you already tried switching off the regexp in crawl-urlfilter.txt?

If you use
bin/nutch crawl...
for crawling, then crawl-urlfilter.txt is the file that must be changed.

Compare the other lines, too; see "# skip everything else" and "# accept anything else".
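For example, the stock filter file ships with rules along these lines (the exact patterns may differ between Nutch versions, so treat this as an illustration rather than your exact file): the character-class rule drops any URL containing ?, *, !, @, or =, which would explain query-style links disappearing. Commenting that rule out, and making sure the final rule accepts rather than skips everything else, lets such URLs through:

```text
# skip URLs containing certain characters as probable queries, etc.
# (comment this out to keep links with ? and =)
# -[?*!@=]

# accept anything else
+.
```

Remember that bin/nutch crawl reads crawl-urlfilter.txt, so editing only regex-urlfilter.txt has no effect in that mode.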

On 31.08.2010 at 10:32, jitendra rajput wrote:
Hi,

I am trying to write an XpathBasedLinkExtractor which extracts links from an
HTML page using XPaths.
But all of the extracted links that contain characters like [? , = ] are
being filtered out, and I am not able to nail down where this is happening.
They are not going into the segments.
I have also commented out the regular expression -[...@=] in
regex-urlfilter.txt, but it still shows the same behaviour.

Can anyone give me an idea about this? Where am I going wrong? I have been
stuck on this since yesterday.

Any help would be highly appreciated.

Thanks
Jitendra
