Re: skipping invalid segments nutch 1.3

2011-07-21 Thread lewis john mcgibbney
Hi Leo, From the times both the fetching and parsing took, I suspect that maybe Nutch didn't actually fetch the URL; however, this may not be the case, as I have nothing to benchmark it against. Unfortunately, on this occasion the URL http://wiki.apache.org actually redirects to

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Sebastian Nagel
Hi Leo, hi Lewis, From the times both the fetching and parsing took, I suspect that maybe Nutch didn't actually fetch the URL, This may be the reason. Empty segments may break some of the crawler steps. But if I'm not wrong, it looks like the updatedb-command is not quite correct:

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Leo Subscriptions
Hi Lewis, Will try your suggestion shortly, but I am still puzzled why the crawl command works. Isn't it using the same filter, etc.? Cheers, Leo On Thu, 2011-07-21 at 20:55 +0100, lewis john mcgibbney wrote: Hi Leo, From the times both the fetching and parsing took, I suspect that maybe

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Leo Subscriptions
Hi Lewis, Following are the things I tried and the relevant source/logs: 1. ran 'crawl' without ending / in the url http://www.seek.com.au ; Result OK 2. ran 'crawl' with ending / in the url http://www.seek.com.au/ ; Result OK 3. Had a look at the regex-urlfilter.txt and the relevant entries

Re: skipping invalid segments nutch 1.3

2011-07-21 Thread Leo Subscriptions
Hi Sebastian, I think the problem is with the fetch not returning any results. I checked your suggestion, but it did not work. Cheers, Leo On Thu, 2011-07-21 at 22:16 +0200, Sebastian Nagel wrote: Hi Leo, hi Lewis, From the times both the fetching and parsing took, I suspect that

Re: Configuration issue: Custom parser not being recognised.

2011-07-21 Thread amrutbudi...@gmail.com
Found the issue! The extension id defined in plugin.xml didn't match the id inside the mimeType="application/xhtml+xml" tag in parse-plugins.xml, i.e. the ids highlighted below should match. plugin.xml: <?xml version="1.0" encoding="UTF-8"?> <plugin id="food" name="Food Parser." version="1.0.0"
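
A mismatch of this kind can be spotted mechanically. The following is only a minimal sketch, not taken from the thread: the two XML fragments are cut-down, hypothetical stand-ins for plugin.xml and parse-plugins.xml, the id "org.example.food.parser" and the class name are invented for illustration, and the check simply reports any id referenced under a mimeType entry that is not declared anywhere in plugin.xml. Which attribute Nutch actually compares is not confirmed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down plugin.xml (ids and class name are illustrative only).
PLUGIN_XML = """<plugin id="food" name="Food Parser" version="1.0.0">
  <extension id="org.example.food.parser" name="FoodParser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="FoodParser" class="org.example.food.FoodParser"/>
  </extension>
</plugin>
"""

# Hypothetical, cut-down parse-plugins.xml referencing the same id.
PARSE_PLUGINS_XML = """<parse-plugins>
  <mimeType name="application/xhtml+xml">
    <plugin id="org.example.food.parser"/>
  </mimeType>
</parse-plugins>
"""


def declared_ids(plugin_xml):
    """Collect every id attribute declared anywhere in plugin.xml."""
    root = ET.fromstring(plugin_xml)
    return {el.get("id") for el in root.iter() if el.get("id")}


def referenced_ids(parse_plugins_xml, mime_type):
    """List the plugin ids referenced under the given mimeType entry."""
    root = ET.fromstring(parse_plugins_xml)
    return [
        plugin.get("id")
        for mime in root.iter("mimeType")
        if mime.get("name") == mime_type
        for plugin in mime.iter("plugin")
    ]


def unmatched_ids(plugin_xml, parse_plugins_xml, mime_type):
    """Return referenced ids that plugin.xml never declares."""
    known = declared_ids(plugin_xml)
    return [i for i in referenced_ids(parse_plugins_xml, mime_type) if i not in known]


if __name__ == "__main__":
    missing = unmatched_ids(PLUGIN_XML, PARSE_PLUGINS_XML, "application/xhtml+xml")
    print("unmatched ids:", missing)
```

An empty list means every id referenced for that MIME type is declared in plugin.xml; any entry in the list is a candidate for the kind of mismatch described above.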