Hi Leo,
From the times both the fetching and parsing took, I suspect that maybe
Nutch didn't actually fetch the URL; however, this may not be the case, as I
have nothing to benchmark it against. Unfortunately, on this occasion the URL
http://wiki.apache.org actually redirects to
Hi Leo, hi Lewis,
> From the times both the fetching and parsing took, I suspect that maybe
> Nutch didn't actually fetch the URL,

This may be the reason. Empty segments may break some of the crawler steps.
But if I'm not wrong, it looks like the updatedb command is not quite correct:
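[The command Sebastian quoted was lost in the archive. For general reference, the step-by-step crawl cycle in Nutch 1.x usually looks like the sketch below; `urls`, `crawldb`, and `segments` are placeholder paths, not the exact paths from this thread.]

```shell
# One iteration of the Nutch 1.x crawl cycle, run step by step
bin/nutch inject crawldb urls            # seed the crawldb from the urls dir
bin/nutch generate crawldb segments      # create a new fetch segment
s1=`ls -d segments/2* | tail -1`         # pick up the newest segment
bin/nutch fetch $s1                      # fetch the generated URLs
bin/nutch parse $s1                      # parse the fetched content
bin/nutch updatedb crawldb $s1           # fold the results back into the crawldb
```

Note that updatedb takes the crawldb first and then the segment(s) to merge in; passing an empty or missing segment here is one way the later steps can break.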
Hi Lewis,
Will try your suggestion shortly, but I'm still puzzled why the crawl
command works. Isn't it using the same filters, etc.?
Cheers,
Leo
On Thu, 2011-07-21 at 20:55 +0100, lewis john mcgibbney wrote:
Hi Leo,
From the times both the fetching and parsing took, I suspecting that
maybe
Hi Lewis,
Following are the things I tried and the relevant source/logs:
1. Ran 'crawl' without a trailing / in the URL http://www.seek.com.au ;
result OK.
2. Ran 'crawl' with a trailing / in the URL http://www.seek.com.au/ ;
result OK.
3. Had a look at regex-urlfilter.txt and the relevant entries
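[For comparison, the stock regex-urlfilter.txt ends with a catch-all accept rule, which is why a trailing slash makes no difference to filtering. A minimal sketch of the relevant default entries (the shipped file contains more rules):]

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.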
Hi Sebastian,
I think the problem is with the fetch not returning any results. I
checked your suggestion, but it did not work.
Cheers,
Leo
On Thu, 2011-07-21 at 22:16 +0200, Sebastian Nagel wrote:
Hi Leo, hi Lewis,
From the times both the fetching and parsing took, I suspecting that
Found the issue! plugin.xml defined an extension id which didn't match the
id inside the mimeType="application/xhtml+xml" tag in parse-plugins.xml,
i.e. the two ids below should match.
plugin.xml:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="food"
   name="Food Parser"
   version="1.0.0"
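[To illustrate the kind of mismatch described above, here is a small, hypothetical Python check. The real plugin.xml and parse-plugins.xml carry more elements and attributes; the "food" id is taken from the fragment above, and the simplified XML strings are assumptions, not the actual files.]

```python
import xml.etree.ElementTree as ET

# Simplified plugin.xml: the plugin declares id="food".
plugin_xml = """<?xml version="1.0" encoding="UTF-8"?>
<plugin id="food" name="Food Parser" version="1.0.0"/>"""

# Simplified parse-plugins.xml: the mimeType mapping must reference
# the same id, or the parser is never invoked for that content type.
parse_plugins_xml = """<parse-plugins>
  <mimeType name="application/xhtml+xml">
    <plugin id="food"/>
  </mimeType>
</parse-plugins>"""

plugin_id = ET.fromstring(plugin_xml).get("id")
mapped_ids = [p.get("id")
              for p in ET.fromstring(parse_plugins_xml).iter("plugin")]

# The fix in this thread amounted to making this condition hold.
print(plugin_id in mapped_ids)  # prints True when the ids line up
```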