Re: skipping invalid segments nutch 1.3
Hi Leo,

From the times both the fetching and parsing took, I suspect that maybe Nutch didn't actually fetch the URL; however, this may not be the case as I have nothing to benchmark it against. Unfortunately, on this occasion the URL http://wiki.apache.org actually redirects to http://wiki.apache.org/general/ so I'm going to post my log output from the last URL you specified in an attempt to clear this one up. The following confirms that your observation is accurate: not only does this produce invalid segments, but nothing is fetched in the process. Therefore the reason we are getting the "skipping invalid segment" message is that we are not actually fetching any content.

My initial thought was that your urlfilters were not set properly, and I think this is part of the problem. Please follow the syntax very carefully and it will work perfectly for you, as follows:

regex-urlfilter.txt
--
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# crawl URLs in the following domains.
+^http://([a-z0-9]*\.)*seek.com.au/

# accept anything else
#+.
--

seed file
--
http://www.seek.com.au
--

It sounds really trivial, but I think that the trailing '/' in your seed file may have been making all of the difference. Please try, test with readdb and readseg, and comment back (there is also a sketch of a filter check after my log output below). Sorry for the delayed posts on this one; I have not had much time to get to it. Hope all goes to plan. Evidence can be seen below:

lewis@lewis-01:~/ASF/branch-1.4/runtime/local$ bin/nutch readdb crawldb -stats
CrawlDb statistics start: crawldb
Statistics for CrawlDb: crawldb
TOTAL urls: 48
retry 0: 48
min score: 0.017
avg score: 0.041125
max score: 1.175
status 1 (db_unfetched): 47
status 2 (db_fetched): 1
CrawlDb statistics: done

On Thu, Jul 21, 2011 at 3:30 AM, Leo Subscriptions <llsub...@zudiewiener.com> wrote:

Following are the suggested commands and the results. As suggested, I left the redirect setting at 0, as 'crawl' works without any issues. The problem only occurs when running the individual commands.

--- nutch-site.xml ---
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>listers spider</value>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
    <description>If true, HTTP will log more verbosely.</description>
  </property>
</configuration>
---

--- Individual commands and results ---
llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
Injector: starting at 2011-07-21 12:24:52
Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
Injector: urlDir: /home/llist/nutchData/seed/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments -topN 100
Generator: starting at 2011-07-21 12:25:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/llist/nutchData/crawl/segments/20110721122519
Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110721122519
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-07-21 12:26:36
Fetcher: segment: /home/llist/nutchData/crawl/segments/20110721122519
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=1
fetching http://wiki.apache.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0,
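As promised, here is a quick way to check the filter rules themselves without running a full crawl. The URLFilterChecker tool reads URLs on stdin and prints '+' for accepted and '-' for rejected. A minimal sketch, assuming the stock org.apache.nutch.net.URLFilterChecker class and its -allCombined option are present in this build (worth confirming against your copy):

--
# test the configured filter chain against the seed URL, with and without the trailing slash
echo "http://www.seek.com.au/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
echo "http://www.seek.com.au" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
--

If the second line comes back with '-', that confirms the trailing-slash theory.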
Re: skipping invalid segments nutch 1.3
Hi Leo, hi Lewis,

> From the times both the fetching and parsing took, I suspect that maybe Nutch didn't actually fetch the URL,

This may be the reason. Empty segments may break some of the crawler steps. But if I'm not wrong, it looks like the updatedb command is not quite correct:

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110721122519
CrawlDb update: starting at 2011-07-21 12:28:03
CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
CrawlDb update: segments: [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text, file:/home/llist/nutchData/crawl/segments/20110721122519/content, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse, file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch, file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
CrawlDb update: additions allowed: true

With -dir pointing at a single segment, its subdirectories (parse_text, content, etc.) get treated as segments, which is why they show up in the list above. As with other commands reading segments, there are two ways to pass segments as arguments: 1) all segments enumerated, or 2) via -dir, the parent directory of all segments. See:

% $NUTCH_HOME/bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
  crawldb          CrawlDb to update
  -dir segments    parent directory containing all segments to update from
  seg1 seg2 ...    list of segment names to update from

Try your updatedb command without -dir; it should work.

Sebastian
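Concretely, either of these invocations should satisfy the usage text above (a sketch built from the paths in the quoted command, not run here):

--
# enumerate the segment by name (the "without -dir" form suggested above)
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110721122519

# or point -dir at the parent directory that contains all segments
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments
--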
Re: skipping invalid segments nutch 1.3
Hi Lewis,

Will try your suggestion shortly, but am still puzzled why the crawl command works. Isn't it using the same filter, etc.?

Cheers,
Leo

On Thu, 2011-07-21 at 20:55 +0100, lewis john mcgibbney wrote:
> Hi Leo, From the times both the fetching and parsing took, I suspect that maybe Nutch didn't actually fetch the URL, [...]
Re: skipping invalid segments nutch 1.3
Hi Lewis,

Following are the things I tried and the relevant source/logs:

1. Ran 'crawl' without the ending / in the URL (http://www.seek.com.au); result OK.
2. Ran 'crawl' with the ending / in the URL (http://www.seek.com.au/); result OK.
3. Had a look at regex-urlfilter.txt; the relevant entries are as follows:

--- regex-urlfilter.txt ---
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
---

4. I think you are correct in that fetch does not actually fetch anything. Following are the relevant sections from hadoop.log: first the log when 'crawl' was running, and then the log for 'inject, generate, fetch'. The rest of the log up to the fetch is pretty much identical. One thing I did notice is that the QueueFeeder returns 10 records for 'crawl' and 1 record for 'fetch'.

--- hadoop.log for 'crawl' ---
2011-07-22 10:02:27,226 INFO crawl.Generator - Generator: finished at 2011-07-22 10:02:27, elapsed: 00:00:03
2011-07-22 10:02:27,227 WARN fetcher.Fetcher - Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
2011-07-22 10:02:27,228 INFO fetcher.Fetcher - Fetcher: starting at 2011-07-22 10:02:27
2011-07-22 10:02:27,228 INFO fetcher.Fetcher - Fetcher: segment: /home/llist/nutchData/crawl/segments/20110722100225
2011-07-22 10:02:27,910 INFO fetcher.Fetcher - Fetcher: threads: 10
2011-07-22 10:02:27,918 INFO fetcher.Fetcher - QueueFeeder finished: total 10 records + hit by time limit :0
2011-07-22 10:02:27,926 INFO fetcher.Fetcher - fetching http://www.seek.com.au/sales-jobs
2011-07-22 10:02:27,940 INFO http.Http - http.proxy.host = null
2011-07-22 10:02:27,940 INFO http.Http - http.proxy.port = 8080
2011-07-22 10:02:27,940 INFO http.Http - http.timeout = 1
2011-07-22 10:02:27,940 INFO http.Http - http.content.limit = 65536
2011-07-22 10:02:27,940 INFO http.Http - http.agent = listers spider/Nutch-1.3
2011-07-22 10:02:27,940 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2011-07-22 10:02:28,929 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=9
2011-07-22 10:02:29,929 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=9
2011-07-22 10:02:30,930 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:31,930 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:32,931 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:33,931 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:34,932 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:35,091 INFO fetcher.Fetcher - fetching http://www.seek.com.au/mining-resources-energy-jobs/
2011-07-22 10:02:35,933 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:36,933 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:37,933 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:38,934 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:39,934 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:40,363 INFO fetcher.Fetcher - fetching http://www.seek.com.au/marketing-communications-jobs/
2011-07-22 10:02:40,934 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7
etc.
---

--- hadoop.log for 'fetch' ---
2011-07-22 10:14:37,645 INFO crawl.Generator - Generator: finished at 2011-07-22 10:14:37, elapsed: 00:00:03
2011-07-22 10:16:46,088 WARN fetcher.Fetcher - Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
2011-07-22 10:16:46,089 INFO fetcher.Fetcher - Fetcher: starting at 2011-07-22 10:16:46
2011-07-22 10:16:46,089 INFO fetcher.Fetcher - Fetcher: segment: /home/llist/nutchData/crawl/segments/20110722101436
2011-07-22 10:16:46,720 INFO fetcher.Fetcher - Fetcher: threads: 10
2011-07-22 10:16:46,741 INFO plugin.PluginRepository - Plugins: looking in: /usr/share/nutch/runtime/local/plugins
2011-07-22 10:16:46,746 INFO fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
2011-07-22 10:16:46,815 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
---

Cheers,
Leo

On Fri, 2011-07-22 at 09:51 +1000, Leo Subscriptions wrote:
> Hi Lewis, Will try your suggestion shortly, [...]
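A quick way to confirm that the 1-record segment really contains no fetched content is to list it with readseg. A minimal sketch using the paths from the messages above; the exact columns of the -list output (generated/fetched/parsed counts) may vary by version:

--
# list generated/fetched/parsed counts for every segment under the parent directory
/usr/share/nutch/runtime/local/bin/nutch readseg -list -dir /home/llist/nutchData/crawl/segments
--

A segment showing 1 URL generated but 0 fetched would match the "skipping invalid segments" symptom exactly.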
Re: skipping invalid segments nutch 1.3
Hi Sebastian,

I think the problem is with the fetch not returning any results. I checked your suggestion, but it did not work.

Cheers,
Leo

On Thu, 2011-07-21 at 22:16 +0200, Sebastian Nagel wrote:
> Hi Leo, hi Lewis, [...] Try your updatedb command without -dir; it should work. [...]
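One more thing worth ruling out: both hadoop.log excerpts repeat the warning that the 'http.agent.name' value should be listed first in 'http.robots.agents'. A sketch of a nutch-site.xml addition using the 'listers spider' agent name from the earlier config; this silences the warning and makes the robots.txt matching explicit, though it is not certain to be the cause of the empty fetch:

--- nutch-site.xml (addition) ---
<property>
  <name>http.robots.agents</name>
  <!-- agent strings checked against robots.txt rules, comma-separated, our own agent name first -->
  <value>listers spider,*</value>
</property>
---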
Re: Configuration issue: Custom parser not being recognised.
Found the issue! plugin.xml defined an extension id which didn't match the id inside the mimeType="application/xhtml+xml" tag in parse-plugins.xml, i.e. the elements marked with a "must match" comment below (bold-highlighted in the original message) should be identical.

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="food" name="Food Parser." version="1.0.0" provider-name="amrut">
   <runtime>
      <library name="food.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="com.amrut.parser.TDRParser" name="TDR Parser" point="org.apache.nutch.parse.Parser">
      <!-- must match -->
      <implementation id="com.amrut.parser.TDRParser" class="com.amrut.parser.TDRParser">
         <parameter name="contentType" value="application/xhtml+xml"/>
      </implementation>
   </extension>
</plugin>

parse-plugins.xml:

<?xml version="1.0" encoding="UTF-8"?>
<parse-plugins>
   <mimeType name="application/xhtml+xml">
      <!-- must match -->
      <plugin id="food" />
   </mimeType>
   <aliases>
      <!-- must match -->
      <alias name="food" extension-id="com.amrut.parser.TDRParser" />
      <alias name="parse-tika" extension-id="org.apache.nutch.parse.tika.TikaParser" />
      <alias name="parse-ext" extension-id="ExtParser" />
      <alias name="parse-html" extension-id="org.apache.nutch.parse.html.HtmlParser" />
      <alias name="parse-js" extension-id="JSParser" />
      <alias name="feed" extension-id="org.apache.nutch.parse.feed.FeedParser" />
      <alias name="parse-swf" extension-id="org.apache.nutch.parse.swf.SWFParser" />
      <alias name="parse-zip" extension-id="org.apache.nutch.parse.zip.ZipParser" />
   </aliases>
</parse-plugins>
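Once the ids line up (and the plugin id is included in the plugin.includes property of nutch-site.xml), the choice of parser can be checked from the command line. A sketch with a placeholder URL; parsechecker ships with Nutch 1.x builds, but confirm the command exists in your bin/nutch:

--
# fetch and parse a single URL, dumping the parse status, metadata and extracted text
# (the URL below is a placeholder; substitute a page served as application/xhtml+xml)
bin/nutch parsechecker -dumpText http://www.example.com/page.xhtml
--

If the custom parser runs, its output should appear in the dumped parse text and metadata instead of the parse-tika result.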