Please follow the step-by-step tutorial; it's all explained there: http://wiki.apache.org/nutch/NutchTutorial
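For reference, the "separate crawl cycle commands" discussed below map roughly onto the sketch that follows. It is a minimal sketch, assuming the Nutch 1.4 local runtime; the crawl directory name is a placeholder, and depth 3 / topN 100 simply mirror the values used in the thread:

    #!/bin/sh
    # A hypothetical equivalent of "bin/nutch crawl urls -depth 3 -topN 100",
    # broken into the individual crawl cycle commands.
    bin/nutch inject crawl/crawldb urls
    for i in 1 2 3; do                             # one pass per -depth level
      bin/nutch generate crawl/crawldb crawl/segments -topN 100
      SEGMENT=`ls -d crawl/segments/* | tail -1`   # the segment just generated
      bin/nutch fetch $SEGMENT
      bin/nutch parse $SEGMENT
      bin/nutch updatedb crawl/crawldb $SEGMENT
    done
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    # Index into Solr; note there is deliberately no solrdedup step here.
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*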
On Tuesday 15 May 2012 13:40:26 Tolga wrote:
> I'm a little confused. How can I not use the crawl command and execute
> the separate crawl cycle commands at the same time?
>
> Regards,
>
> On 5/11/12 9:40 AM, Markus Jelsma wrote:
> > Ah, that means don't use the crawl command and do a little shell
> > scripting to execute the separate crawl cycle commands; see the Nutch
> > wiki for examples. And don't do solrdedup. Search the Solr wiki for
> > deduplication.
> >
> > cheers
> >
> > On Fri, 11 May 2012 07:39:36 +0300, Tolga <[email protected]> wrote:
> >> Hi,
> >>
> >> How exactly do I "omit solrdedup and use Solr's internal
> >> deduplication" instead? I don't even know what any of that means :D
> >> I've just used bin/nutch crawl urls -solr http://localhost:8983/solr/
> >> -depth 3 -topN 100 to get the error. Do I have to use all the steps?
> >>
> >> Regards,
> >>
> >> On 05/10/2012 11:38 PM, Markus Jelsma wrote:
> >>> thanks
> >>>
> >>> This is a known issue:
> >>> https://issues.apache.org/jira/browse/NUTCH-1100
> >>>
> >>> I have not been able to find the bug, nor do I know how to reproduce
> >>> it from scratch. If you have a public site with which we can
> >>> reproduce it, please comment on the Jira ticket. Make sure you use
> >>> either the default config or as little changed as possible, a seed
> >>> URL, and the exact crawl & dedup steps to reproduce.
> >>>
> >>> If you find it we might fix it. In any case we need to replace the
> >>> dedup command with a more scalable tool; the current one is not.
> >>>
> >>> In the meantime you can omit solrdedup and use Solr's internal
> >>> deduplication instead; it works similarly and uses the same
> >>> signature algorithm as Nutch. Please consult the Solr wiki page on
> >>> deduplication.
> >>>
> >>> Good luck
> >>>
> >>> On Thu, 10 May 2012 22:54:37 +0300, Tolga <[email protected]> wrote:
> >>>> Hi Markus,
> >>>>
> >>>> On 05/10/2012 09:42 AM, Markus Jelsma wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On Thu, 10 May 2012 09:10:04 +0300, Tolga <[email protected]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> This will sound like a duplicate, but actually it differs from
> >>>>>> the other one. Please bear with me. Following
> >>>>>> http://wiki.apache.org/nutch/NutchTutorial, I first issued the
> >>>>>> command
> >>>>>>
> >>>>>> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
> >>>>>>
> >>>>>> Then I got the message
> >>>>>>
> >>>>>> Exception in thread "main" java.io.IOException: Job failed!
> >>>>>>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> >>>>>>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
> >>>>>>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
> >>>>>>   at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
> >>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>>>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> >>>>>
> >>>>> Please include the relevant part of the log. This can be a known
> >>>>> issue.
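The Solr-side deduplication Markus points to above lives in solrconfig.xml. A minimal sketch per the Solr wiki page on deduplication; the chain name "dedup" and the field names "signature" and "content" are illustrative and must match your schema:

    <updateRequestProcessorChain name="dedup">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">signature</str>
        <!-- true: documents with an identical signature overwrite each other -->
        <bool name="overwriteDupes">true</bool>
        <str name="fields">content</str>
        <!-- the fuzzy signature algorithm ported from Nutch -->
        <str name="signatureClass">solr.processor.TextProfileSignature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

The chain still has to be attached to the /update request handler (the update.chain default parameter in Solr 3.x), and schema.xml needs an indexed "signature" field to hold the computed value.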
> >>>>
> >>>> This is an excerpt from hadoop.log:
> >>>>
> >>>> 2012-05-10 22:26:30,349 INFO crawl.Crawl - crawl started in: crawl-20120510222629
> >>>> 2012-05-10 22:26:30,350 INFO crawl.Crawl - rootUrlDir = urls
> >>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - threads = 10
> >>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - depth = 3
> >>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - solrUrl=http://localhost:8983/solr/
> >>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - topN = 100
> >>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: starting at 2012-05-10 22:26:30
> >>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: crawlDb: crawl-20120510222629/crawldb
> >>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: urlDir: urls
> >>>> 2012-05-10 22:26:30,809 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> >>>> 2012-05-10 22:26:34,173 INFO plugin.PluginRepository - Plugins: looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Plugins:
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> >>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Extension-Points:
> >>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> >>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
> >>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> >>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
> >>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> >>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> >>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
> >>>> 2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> >>>> 2012-05-10 22:26:35,439 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> >>>> 2012-05-10 22:26:36,434 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> >>>> 2012-05-10 22:26:36,710 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >>>> 2012-05-10 22:26:37,542 INFO crawl.Injector - Injector: finished at 2012-05-10 22:26:37, elapsed: 00:00:06
> >>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: starting at 2012-05-10 22:26:37
> >>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> >>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: filtering: true
> >>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: normalizing: true
> >>>> 2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: topN: 100
> >>>> 2012-05-10 22:26:37,552 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> >>>> 2012-05-10 22:26:37,820 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >>>> 2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> >>>> 2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> >>>> 2012-05-10 22:26:37,856 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> >>>> ...
> >>>> ...
> >>>> INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=221
> >>>> 2012-05-10 22:36:26,336 INFO solr.SolrIndexer - SolrIndexer: finished at 2012-05-10 22:36:26, elapsed: 00:00:05
> >>>> 2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
> >>>> 2012-05-10 22:36:26,339 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
> >>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
> >>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=74
> >>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
> >>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=0
> >>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
> >>>> INFO: [] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2} hits=220 status=0 QTime=9
> >>>> 2012-05-10 22:36:27,656 WARN mapred.LocalJobRunner - job_local_0020
> >>>> java.lang.NullPointerException
> >>>>   at org.apache.hadoop.io.Text.encode(Text.java:388)
> >>>>   at org.apache.hadoop.io.Text.set(Text.java:178)
> >>>>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
> >>>>   at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
> >>>>   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >>>>   at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >>>>
> >>>>>> I issued the commands
> >>>>>>
> >>>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> >>>>>>
> >>>>>> and
> >>>>>>
> >>>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*
> >>>>>>
> >>>>>> separately, after which I got no errors. When I browsed to
> >>>>>> http://localhost:8983/solr/admin and attempted a search, I got the
> >>>>>> error
> >>>>>>
> >>>>>> HTTP ERROR 400
> >>>>>>
> >>>>>> Problem accessing /solr/select. Reason:
> >>>>>>     undefined field text
> >>>>>
> >>>>> But this is a Solr thing; you have no field named text. Resolve
> >>>>> this in Solr or on the Solr mailing list.
> >>>>>
> >>>>>> Powered by Jetty://
> >>>>>>
> >>>>>> What am I doing wrong?
> >>>>>>
> >>>>>> Regards,
> >>>>
> >>>> Regards,

--
Markus Jelsma - CTO - Openindex
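On the "undefined field text" error near the end of the thread: Solr's stock example schema makes a field named "text" the default search field, while the Nutch schema defines no such field, so a bare q=... query fails. A hedged fix, assuming the stock Nutch 1.4 schema.xml where the page body lives in a field named "content":

    <!-- schema.xml: search Nutch's content field when no field is named -->
    <defaultSearchField>content</defaultSearchField>

Alternatively, name the field explicitly in the query, e.g. http://localhost:8983/solr/select?q=content:nutch.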

