Hey Lewis,

Thanks for the quick reply. Looks like I am tangled now =)

I tried the tutorial mentioned at http://wiki.apache.org/nutch/RunningNutchAndSolr

For me, step 3 is not working. Two of the directories that should be there after step 3 completes are not created:

crawl/crawldb - created
crawl/linkdb - not created
crawl/segments - not created

Also, I changed the URL to http://nutch.apache.org, but I still get the same log message: "Generator: 0 records selected for fetching, exiting ..."

Looks like I am missing some key step =(.

-param

On 7/12/11 1:37 PM, "lewis john mcgibbney" <[email protected]> wrote:

> Hi,
>
> I think you are maybe getting tangled here. Please see the following
> tutorial for Nutch 1.3 [1]
>
> Please also note that the URL you provided is the old Nutch site and now
> redirects to http://nutch.apache.org
>
> [1] http://wiki.apache.org/nutch/RunningNutchAndSolr
>
> On Tue, Jul 12, 2011 at 5:23 PM, Sethi, Parampreet <[email protected]> wrote:
>
>> Thanks for updating the tutorial. I tried my setup, and the crawl command is
>> running, but none of the pages are being crawled.
>> I created a urls directory inside the local folder and added a new file named
>> nutch containing the URL, as mentioned in the tutorial.
>>
>> (I also tried a file named urls inside the nutch/runtime/local directory. The
>> content of the urls file is http://lucene.apache.org/nutch/ )
>>
>> Here's the log:
>>
>> us137390:local parampreetsethi$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>> solrUrl is not set, indexing will be skipped...
>> crawl started in: crawl
>> rootUrlDir = urls
>> threads = 10
>> depth = 3
>> solrUrl=null
>> topN = 50
>> Injector: starting at 2011-07-12 12:22:12
>> Injector: crawlDb: crawl/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: finished at 2011-07-12 12:22:15, elapsed: 00:00:03
>> Generator: starting at 2011-07-12 12:22:15
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: filtering: true
>> Generator: normalizing: true
>> Generator: topN: 50
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: 0 records selected for fetching, exiting ...
>> Stopping at depth=0 - no more URLs to fetch.
>> No URLs to fetch - check your seed list and URL filters.
>> crawl finished: crawl
>>
>> Please help.
>>
>> Thanks
>> Param
>>
>> On 7/12/11 5:52 AM, "Julien Nioche" <[email protected]> wrote:
>>
>>> On 12 July 2011 10:30, Julien Nioche <[email protected]> wrote:
>>>
>>>>>>> There seems to be no crawl-urlfilter file indeed. Don't know why it's
>>>>>>> gone since the crawl command is still there. You can find the file in
>>>>>>> the 1.2 release:
>>>>>>> http://svn.apache.org/viewvc/nutch/branches/branch-1.2/conf/
>>>>>>
>>>>>> Crawl-urlfilter has been removed purposefully as it did not add anything
>>>>>> to the other url filters (automaton | regex) in terms of functionality.
>>>>>> By default the urlfilters contain (+.) which IIRC was what the
>>>>>> Crawl-urlfilter used to do.
>>>>>>
>>>>>
>>>>> That's reasonable. But now new users are unaware and don't know what to
>>>>> do with this error message.
>>>>>
>>>>
>>>> Yep, the tutorial needs updating indeed
>>>>
>>>
>>> done
>>>
>>>>>>>> Thanks for a quick reply.
>>>>>>>>
>>>>>>>> I searched in the nutch directory but still do not see that file :(.
>>>>>>>> Here's the complete file list inside the runtime/local/conf directory.
>>>>>>>>
>>>>>>>> us137390:conf parampreetsethi$ pwd
>>>>>>>> /Users/parampreetsethi/Documents/workspace/nutch/runtime/local/conf
>>>>>>>> us137390:conf parampreetsethi$ ls -t
>>>>>>>> automaton-urlfilter.txt  domain-urlfilter.txt  nutch-default.xml  prefix-urlfilter.txt  solrindex-mapping.xml
>>>>>>>> configuration.xsl        httpclient-auth.xml   nutch-site.xml     regex-normalize.xml   subcollections.xml
>>>>>>>> domain-suffixes.xml      log4j.properties      parse-plugins.dtd  regex-urlfilter.txt   suffix-urlfilter.txt
>>>>>>>> domain-suffixes.xsd      nutch-conf.xsl        parse-plugins.xml  schema.xml            tika-mimetypes.xml
>>>>>>>>
>>>>>>>> By the way, I tried deploying the code by checking out from svn
>>>>>>>> repository, but could not build it. I was getting following error:
>>>>>>>>
>>>>>>>> resolve-default:
>>>>>>>> [ivy:resolve] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
>>>>>>>> [ivy:resolve] :: loading settings :: file = /Users/parampreetsethi/Documents/workspace/nutch/ivy/ivysettings.xml
>>>>>>>> [ivy:resolve]
>>>>>>>> [ivy:resolve] :: problems summary ::
>>>>>>>> [ivy:resolve] :::: WARNINGS
>>>>>>>> [ivy:resolve] module not found: org.apache.gora#gora-core;0.2-incubating
>>>>>>>> [ivy:resolve] ==== local: tried
>>>>>>>> [ivy:resolve] /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-core/0.2-incubating/ivys/ivy.xml
>>>>>>>> [ivy:resolve] -- artifact org.apache.gora#gora-core;0.2-incubating!gora-core.jar:
>>>>>>>> [ivy:resolve] /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-core/0.2-incubating/jars/gora-core.jar
>>>>>>>> [ivy:resolve] module not found: org.apache.gora#gora-sql;0.2-incubating
>>>>>>>> [ivy:resolve] ==== local: tried
>>>>>>>> [ivy:resolve] /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-sql/0.2-incubating/ivys/ivy.xml
>>>>>>>> [ivy:resolve] -- artifact org.apache.gora#gora-sql;0.2-incubating!gora-sql.jar:
>>>>>>>> [ivy:resolve] /Users/parampreetsethi/.ivy2/local/org.apache.gora/gora-sql/0.2-incubating/jars/gora-sql.jar
>>>>>>>> [ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
>>>>>>>> [ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
>>>>>>>> [ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
>>>>>>>> [ivy:resolve] :: org.apache.gora#gora-core;0.2-incubating: not found
>>>>>>>> [ivy:resolve] :: org.apache.gora#gora-sql;0.2-incubating: not found
>>>>>>>> [ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
>>>>>>>> [ivy:resolve]
>>>>>>>> [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
>>>>>>>>
>>>>>>>> BUILD FAILED
>>>>>>>> /Users/parampreetsethi/Documents/workspace/nutch/build.xml:458: impossible to resolve dependencies:
>>>>>>>> resolve failed - see output for details
>>>>>>>>
>>>>>>>> -param
>>>>>>>>
>>>>>>>> On 7/11/11 5:56 PM, "Jerry E. Craig, Jr." <[email protected]> wrote:
>>>>>>>>> Look down a little further for the
>>>>>>>>>
>>>>>>>>> or
>>>>>>>>> runtime/local/bin/nutch (version >= 1.3)
>>>>>>>>>
>>>>>>>>> If you download the bin then it's in the runtime directory.
>>>>>>>>>
>>>>>>>>> Jerry E. Craig, Jr.
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Sethi, Parampreet [mailto:[email protected]]
>>>>>>>>> Sent: Monday, July 11, 2011 2:51 PM
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: Nutch Novice help
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> Sorry for such a naïve question. I downloaded the nutch 1.3 binary today
>>>>>>>>> and am trying to set it up as mentioned in the Tutorial at
>>>>>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>>>>>>
>>>>>>>>> However, I am not able to find crawl-urlfilter.txt inside the conf
>>>>>>>>> directory. Is there any other place where I should look for this file?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Param
>>>>
>>>> --
>>>> *
>>>> *Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
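
A note on the "No URLs to fetch - check your seed list and URL filters" message in the log: one possible cause is that a rule in regex-urlfilter.txt rejects the seed URL before the Generator ever sees it. The sketch below is a rough Python imitation of that kind of filter, not Nutch's own code. It assumes rules are "+regex" or "-regex" lines, that the first rule whose regex is found anywhere in the URL decides, and that a URL matching no rule is dropped. The names load_rules and accepts and both rule sets are made up for illustration, and Python's re module is only an approximation of the Java regex syntax Nutch uses.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First rule whose regex is found in the URL decides; no match means reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

# Illustrative rule set ending in the catch-all "+." mentioned in the thread.
PERMISSIVE = load_rules("""
# skip URLs containing characters that usually indicate queries
-[?*!@=]
# accept anything else
+.
""")

# Hypothetical rule set restricted to a different domain.
RESTRICTIVE = load_rules("""
+^http://([a-z0-9]*\\.)*example\\.org/
-.
""")

if __name__ == "__main__":
    for name, rules in (("permissive", PERMISSIVE), ("restrictive", RESTRICTIVE)):
        for url in ("http://nutch.apache.org/", "http://nutch.apache.org/?x=1"):
            print(name, url, "accepted" if accepts(rules, url) else "rejected")
```

If a seed URL comes out "rejected" under rules copied from your own conf/regex-urlfilter.txt, that would be consistent with the Generator selecting 0 records, though it does not rule out other causes such as an empty or misplaced seed file.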

