One more thing: make sure your seed URLs are in UNIX format (line endings). All my test seeds worked, but the real seed list failed, with 0 records produced by the generator. It turned out the real seed list was in DOS format. Running dos2unix on it and re-uploading it to HDFS fixed my problem.
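To illustrate the fix: a quick way to detect and strip DOS line endings before injecting. This is a sketch using a sample file named `urls` (matching the seed file mentioned later in this thread); `tr` is a portable fallback where dos2unix isn't installed.

```shell
# Write a sample seed list with DOS (CRLF) line endings, reproducing the symptom.
printf 'http://www.uni-kassel.de\r\nhttp://portal.uni-kassel.de\r\n' > urls

# file(1) flags the problem: it reports "with CRLF line terminators" if present.
file urls

# Strip the carriage returns (portable alternative to dos2unix).
tr -d '\r' < urls > urls.unix && mv urls.unix urls

# Verify: no CR characters should remain.
if grep -q "$(printf '\r')" urls; then echo "still DOS"; else echo "clean"; fi
```

After converting, re-upload the cleaned file to HDFS (e.g. with `hadoop fs -put`) before running the injector again.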
On 2011-05-18, at 05:25, Marek Bachmann <[email protected]> wrote:

> Thank you very much for the help.
> I checked ALL files in nutch/conf for any further expressions that
> would exclude my URLs.
> I found nothing like this.
> In fact, as I mentioned before, the ./nutch crawl command works fine on
> exactly the same input data.
>
> Once again, if I delete all entries in my crawl directory and then run:
>
> ./nutch crawl seedUrls/ -dir crawl -threads 30 -depth 10
>
> crawl started in: crawl
> rootUrlDir = seedUrls
> threads = 30
> depth = 10
> indexer=lucene
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seedUrls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20110518111323
> Generator: done.
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20110518111323
> Fetcher: threads: 30
> QueueFeeder finished: total 10 records + hit by time limit :0
> fetching http://portal.uni-kassel.de/
> fetching http://www.studentenwerk-kassel.de/
> fetching http://www.asta-kassel.de/
> fetching http://www.uni-kassel.de/fb16
> fetching http://www.uni-kassel.de/
> fetching http://www.uni-kassel.de/uni/studium/
> fetching http://www.uni-kassel.de/uni/fachbereiche/
> fetching http://www.uni-kassel.de/uni/
> fetching http://www.uni-kassel.de/uni/forschung/
> fetching http://www.cs.uni-kassel.de/
>
> But if I try it manually (after deleting the crawldb once again):
>
> ./nutch inject crawl/crawldb seedUrls/
>
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: seedUrls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
>
> ./nutch generate crawl/crawldb/ crawl/segments
>
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
>
> So my conclusion is that the crawl command does the URL injecting in some
> other way? I just don't get why it works with the crawl command but
> doesn't work when injecting manually. Any further suggestions on where I could
> find my mistake would be great :-)
>
> On 16.05.2011 16:14, Markus Jelsma wrote:
>> I see that too, and it shouldn't dump an exception if there's nothing in the
>> CrawlDB.
>> This is, however, not your problem, it seems. If you inject but there's
>> nothing in the CrawlDB, then you have some filters running that skip your
>> seed URLs.
>> Check your domain filter settings or other URL filter settings, depending on
>> the plugins you defined.
>>
>> On Monday 16 May 2011 15:56:26, Marek Bachmann wrote:
>>> Hello people,
>>>
>>> I was trying to do a manual crawl as described in the Nutch tutorial
>>> at http://wiki.apache.org/nutch/NutchTutorial
>>>
>>> First of all: if I do a crawl with the same seed URLs using the "nutch
>>> crawl" command, everything works fine.
>>>
>>> Here's what I was trying to do:
>>>
>>> 1.) Trying to create a new crawlDB with:
>>>
>>> ./nutch inject crawl/crawldb seedUrls
>>>
>>> The directory crawl was empty, and in the directory seedUrls there is
>>> one file "urls" with this content:
>>> http://www.uni-kassel.de
>>> http://portal.uni-kassel.de
>>> http://www.asta-kassel.de
>>> http://www.uni-kassel.de/fb16
>>> http://www.cs.uni-kassel.de
>>> http://www.studentenwerk-kassel.de
>>>
>>> The command runs without any error:
>>> ./nutch inject crawl/crawldb seedUrls
>>> Injector: starting
>>> Injector: crawlDb: crawl/crawldb
>>> Injector: urlDir: seedUrls
>>> Injector: Converting injected urls to crawl db entries.
>>> Injector: Merging injected urls into crawl db.
>>> Injector: done
>>>
>>> After that, a new directory with the name crawldb exists in crawl/
>>>
>>> 2.) Trying to generate new segments:
>>>
>>> ./nutch generate crawl/crawldb/ crawl/segments -noFilter
>>> Generator: Selecting best-scoring urls due for fetch.
>>> Generator: starting
>>> Generator: filtering: false
>>> Generator: normalizing: true
>>> Generator: jobtracker is 'local', generating exactly one partition.
>>> Generator: 0 records selected for fetching, exiting ...
>>>
>>> So I am wondering why the generator does not create segments. It says
>>> that it had 0 records selected for fetching. It seems to me that the
>>> injector hadn't injected the URLs into the db.
>>>
>>> When I run:
>>> ./nutch readdb crawl/crawldb/ -stats
>>>
>>> It outputs:
>>> CrawlDb statistics start: crawl/crawldb/
>>> Statistics for CrawlDb: crawl/crawldb/
>>> Exception in thread "main" java.lang.NullPointerException
>>>         at org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352)
>>>         at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502)
>>>
>>> Does anybody have an idea what I am doing wrong?
>>>
>>> Is there any possibility to get more verbose output / logging from the
>>> commands?
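On the verbose-logging question above: in a Nutch 1.x checkout the command-line tools log through log4j, and the run output normally lands in logs/hadoop.log. One way to get more detail is to raise the level in conf/log4j.properties; this is a sketch, and the exact logger category names may differ between Nutch versions:

```
# conf/log4j.properties -- raise Nutch's own classes to DEBUG
# (assumption: category names as in Nutch 1.x; adjust for your version)
log4j.logger.org.apache.nutch=DEBUG
log4j.logger.org.apache.hadoop=WARN
```

Then watch logs/hadoop.log while inject and generate run; with DEBUG enabled, the URL filter and normalizer plugins generally leave more of a trace there, which can show why seed URLs are being dropped.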

