Did you already try switching off the regexp in crawl-urlfilter.txt?

If you crawl with bin/nutch crawl ..., it is crawl-urlfilter.txt that must be changed, not only regex-urlfilter.txt.

Compare the other lines, too: see "# skip everything else" and "# accept anything else".
On 31.08.2010 10:32, jitendra rajput wrote:
Hi,
I am trying to write an XpathBasedLinkExtractor which extracts links out of an
HTML page using XPaths.
But all the extracted links that contain characters like [? , =] are
being filtered out, and I am not able to nail down where this is happening.
They are not going into the segments.
I have also commented out the regular expression -[...@=] in
regex-urlfilter.txt, but it still shows the same behaviour.
Can anyone give me an idea about this? Where am I going wrong? I have been
stuck on this since yesterday.
Any help would be highly appreciated.
Thanks
Jitendra