One more thing. Make sure your seed URLs are in UNIX format (line endings).
All my test seeds worked, but the real seed list failed with 0 records produced
by the generator. It turned out the real seed list was in DOS format.
Running dos2unix on it and re-uploading it to HDFS fixed my problem.
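A minimal sketch of the check and the fix (file names are just examples; `tr -d '\r'` does the same conversion as dos2unix if the latter is not installed):

```shell
# Simulate a DOS-format seed list: lines end with CRLF instead of LF.
printf 'http://www.uni-kassel.de\r\nhttp://portal.uni-kassel.de\r\n' > urls.dos

# Strip the carriage returns (equivalent to: dos2unix urls.dos).
tr -d '\r' < urls.dos > urls

# Verify no CR bytes remain in the converted file.
od -c urls | head -n 2

# Then replace the old copy on HDFS before re-running the inject step:
# hadoop fs -rm seedUrls/urls && hadoop fs -put urls seedUrls/urls
```

With CRLF endings the injector sees each URL with a trailing `\r`, which the URL filters reject, so the generator ends up with 0 records even though the inject step itself reports no error.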

On 2011-05-18, at 05:25, Marek Bachmann <[email protected]> wrote:

> Thank you very much for the help.
> I checked ALL files in nutch/conf for any further expressions that 
> would exclude my URLs.
> I found nothing like that.
> In fact, as I mentioned before, the ./nutch crawl command works fine on 
> exactly the same input data.
>
> Once again: if I delete all entries in my crawl directory and then run:
>
> ./nutch crawl seedUrls/ -dir crawl -threads 30 -depth 10
>
>   crawl started in: crawl
>   rootUrlDir = seedUrls
>   threads = 30
>   depth = 10
>   indexer=lucene
>   Injector: starting
>   Injector: crawlDb: crawl/crawldb
>   Injector: urlDir: seedUrls
>   Injector: Converting injected urls to crawl db entries.
>   Injector: Merging injected urls into crawl db.
>   Injector: done
>   Generator: Selecting best-scoring urls due for fetch.
>   Generator: starting
>   Generator: filtering: true
>   Generator: normalizing: true
>   Generator: jobtracker is 'local', generating exactly one partition.
>   Generator: Partitioning selected urls for politeness.
>   Generator: segment: crawl/segments/20110518111323
>   Generator: done.
>   Fetcher: Your 'http.agent.name' value should be listed first in
>   'http.robots.agents' property.
>   Fetcher: starting
>   Fetcher: segment: crawl/segments/20110518111323
>   Fetcher: threads: 30
>   QueueFeeder finished: total 10 records + hit by time limit :0
>   fetching http://portal.uni-kassel.de/
>   fetching http://www.studentenwerk-kassel.de/
>   fetching http://www.asta-kassel.de/
>   fetching http://www.uni-kassel.de/fb16
>   fetching http://www.uni-kassel.de/
>   fetching http://www.uni-kassel.de/uni/studium/
>   fetching http://www.uni-kassel.de/uni/fachbereiche/
>   fetching http://www.uni-kassel.de/uni/
>   fetching http://www.uni-kassel.de/uni/forschung/
>   fetching http://www.cs.uni-kassel.de/
>
> But if I try it manually (after deleting the crawldb once again):
>
> ./nutch inject crawl/crawldb seedUrls/
>
>   Injector: starting
>   Injector: crawlDb: crawl/crawldb
>   Injector: urlDir: seedUrls
>   Injector: Converting injected urls to crawl db entries.
>   Injector: Merging injected urls into crawl db.
>   Injector: done
>
> ./nutch generate crawl/crawldb/ crawl/segments
>
>   Generator: Selecting best-scoring urls due for fetch.
>   Generator: starting
>   Generator: filtering: true
>   Generator: normalizing: true
>   Generator: jobtracker is 'local', generating exactly one partition.
>   Generator: 0 records selected for fetching, exiting ...
>
>
> So my conclusion is that the crawl command does the URL injection in some 
> other way? I just don't get why it works with the crawl command but 
> doesn't when I inject manually. Any further suggestions on where I could 
> look for my mistake would be great :-)
>
>
> On 16.05.2011 16:14, Markus Jelsma wrote:
>> I see that too; it shouldn't dump an exception if there's nothing in the
>> CrawlDB. This is, however, not your problem, it seems. If you inject but
>> there's nothing in the CrawlDB, then you have some filters running that skip
>> your seed URLs. Check your domain filter settings or other URL filter
>> settings, depending on the plugins you defined.
>>
>> On Monday 16 May 2011 15:56:26 Marek Bachmann wrote:
>>> Hello people,
>>>
>>> I was trying to do a manual crawl as described in the Nutch tutorial
>>> at http://wiki.apache.org/nutch/NutchTutorial
>>>
>>> First of all: if I do a crawl with the same seed URLs using the "nutch
>>> crawl" command, everything works fine.
>>>
>>> Here's what I was trying to do:
>>>
>>> 1.) Trying to create a new crawlDB with:
>>>
>>>      ./nutch inject crawl/crawldb seedUrls
>>>
>>>          The directory crawl was empty, and the directory seedUrls
>>> contains one file "urls" with this content:
>>>              http://www.uni-kassel.de
>>>              http://portal.uni-kassel.de
>>>              http://www.asta-kassel.de
>>>              http://www.uni-kassel.de/fb16
>>>              http://www.cs.uni-kassel.de
>>>              http://www.studentenwerk-kassel.de
>>>
>>>      The command runs without any error:
>>>      ./nutch inject crawl/crawldb seedUrls
>>>      Injector: starting
>>>      Injector: crawlDb: crawl/crawldb
>>>      Injector: urlDir: seedUrls
>>>      Injector: Converting injected urls to crawl db entries.
>>>      Injector: Merging injected urls into crawl db.
>>>      Injector: done
>>>
>>>      After that a new directory with the name crawldb exists in crawl/
>>>
>>> 2.) Trying to generate new segments:
>>>
>>>      ./nutch generate crawl/crawldb/ crawl/segments -noFilter
>>>      Generator: Selecting best-scoring urls due for fetch.
>>>      Generator: starting
>>>      Generator: filtering: false
>>>      Generator: normalizing: true
>>>      Generator: jobtracker is 'local', generating exactly one partition.
>>>      Generator: 0 records selected for fetching, exiting ...
>>>
>>> So I am wondering why the generator does not create segments. It says
>>> that 0 records were selected for fetching. It seems to me that the
>>> injector hadn't actually injected the URLs into the db.
>>>
>>> When I run:
>>>      ./nutch readdb crawl/crawldb/ -stats
>>>
>>> It outputs:
>>>      CrawlDb statistics start: crawl/crawldb/
>>>      Statistics for CrawlDb: crawl/crawldb/
>>>      Exception in thread "main" java.lang.NullPointerException
>>>          at
>>> org.apache.nutch.crawl.CrawlDbReader.processStatJob(CrawlDbReader.java:352)
>>>          at
>>> org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:502)
>>>
>>> Does anybody have an idea what I am doing wrong?
>>>
>>> Is there any possibility to get more verbose output / logging from the
>>> commands?
>
