Check logs/hadoop.log for connection timeout errors.
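For example, something like the commands below will give a quick count and
show the affected URLs. This is only a rough sketch: the exact wording of
the log lines depends on the protocol plugin, so treat the grep patterns as
guesses rather than canonical Nutch output.

# count lines that look like fetch timeouts
grep -ic 'timed out' logs/hadoop.log
grep -c 'SocketTimeoutException' logs/hadoop.log

# eyeball the URLs involved and the reported reasons
grep -i 'failed with' logs/hadoop.log | less

If timeouts do turn up, raising http.timeout in nutch-site.xml (the default
in nutch-default.xml is 10000 ms) is the first thing to try, given that
Apache can be slow generating those big directory index pages.
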
On Tuesday 17 August 2010 14:07:22 Bill Arduino wrote:
> There are 128 entries in urls/nutch formatted like so:
>
> http://server.example.com/docs/DF-09/
> http://server.example.com/docs/DF-10/
> http://server.example.com/docs/EG-02/
> http://server.example.com/docs/EG-03/
> http://server.example.com/docs/EG-04/
>
> There are 428 directories in http://server.example.com/docs. I only
> wanted to start out with a small number to reduce the wait times while
> configuring. I am wondering if it is timing out waiting for apache to
> generate the index page and just taking whatever it gets before moving
> on. Maybe I should increase my wait times...
>
> On Tue, Aug 17, 2010 at 4:56 AM, Markus Jelsma <markus.jel...@buyways.nl> wrote:
> > Well, the CrawlDB tells us you only got ~9000 URLs in total. Perhaps
> > the seeding didn't go too well? Make sure that all your Apache
> > directory listings are injected into the CrawlDB. If you then
> > generate, fetch, parse and update the DB, you should have all URLs
> > in your DB.
> >
> > How many directory listing pages do you have anyway?
> >
> > On Tuesday 17 August 2010 03:52:31 Bill Arduino wrote:
> > > Thanks for your reply, Markus.
> > >
> > > I ran the command several times. Each subsequent run finished in a
> > > few seconds with only this output:
> > >
> > > crawl started in: crawl
> > > rootUrlDir = urls
> > > threads = 100
> > > depth = 5
> > > indexer=lucene
> > > topN = 5000
> > > Injector: starting
> > > Injector: crawlDb: crawl/crawldb
> > > Injector: urlDir: urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: done
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: starting
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: topN: 5000
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=0 - no more URLs to fetch.
> > > No URLs to fetch - check your seed list and URL filters.
> > > crawl finished: crawl
> > >
> > > The query shows all URLs fetched:
> > >
> > > # bin/nutch readdb crawl/crawldb/ -stats
> > > CrawlDb statistics start: crawl/crawldb/
> > > Statistics for CrawlDb: crawl/crawldb/
> > > TOTAL urls:     8795
> > > retry 0:        8795
> > > min score:      0.0090
> > > avg score:      0.028536895
> > > max score:      13.42
> > > status 2 (db_fetched):  8795
> > > CrawlDb statistics: done
> > >
> > > I have tried deleting the crawl dir and starting from scratch with
> > > the same results. I'm at a loss. I've been over all of the values
> > > in nutch-default.xml but I can't really see anything that seems
> > > wrong.
> > >
> > > On Mon, Aug 16, 2010 at 6:05 PM, Markus Jelsma <markus.jel...@buyways.nl> wrote:
> > > > Hi,
> > > >
> > > > Quite hard to debug, but let's try to make this a lucky guess:
> > > > how many times did you crawl? If you have all the Apache
> > > > directory listing pages injected by seeding, you'll only need
> > > > one generate command. But, depending on different settings, you
> > > > might need to fetch and parse multiple times.
> > > >
> > > > Also, you can check how many URLs are yet to be fetched by using
> > > > the readdb command:
> > > >
> > > > # bin/nutch readdb crawl/crawldb/ -stats
> > > >
> > > > Cheers,
> > > >
> > > > -----Original message-----
> > > > From: Bill Arduino <robots...@gmail.com>
> > > > Sent: Mon 16-08-2010 23:11
> > > > To: user@nutch.apache.org
> > > > Subject: Not getting all documents
> > > >
> > > > Hi all,
> > > >
> > > > I have set up Nutch 1.1 and supplied it a list of URLs in a
> > > > urls/nutch flat file. Each line is a dir on the same server like
> > > > so:
> > > >
> > > > http://myserver.mydomain.com/docs/SC-09
> > > > http://myserver.mydomain.com/docs/SC-10
> > > >
> > > > In each of these dirs are anywhere from 1 to 15,000 PDF files.
> > > > The index is dynamically generated by apache for each dir. In
> > > > total there are 1.2 million PDF files I need to index.
> > > >
> > > > Running the command:
> > > >
> > > > bin/nutch crawl urls -dir crawl -depth 5 -topN 50000
> > > >
> > > > seems to work and I get data that I can search, but I know I am
> > > > not getting all of the PDFs fetched or indexed. If I do this:
> > > >
> > > > grep pdf logs/hadoop.log | grep fetching | wc -l
> > > > 12386
> > > >
> > > > I know there are 276,867 PDFs in the URLs I provided in the
> > > > nutch file, yet it fetched only 12,386 of them.
> > > >
> > > > I'm not sure on the -topN parameter, but it seems to run the
> > > > same no matter what I put in it. I have these settings in my
> > > > nutch-site.xml:
> > > >
> > > > file.content.limit        -1
> > > > http.content.limit        -1
> > > > fetcher.threads.fetch     100
> > > > fetcher.threads.per.host  100
> > > >
> > > > PDF parser is working. I also have this in nutch-site:
> > > >
> > > > <!-- plugin properties -->
> > > > <property>
> > > >   <name>plugin.includes</name>
> > > >   <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
> > > > </property>
> > > >
> > > > Any ideas?
> > > > Thanks!
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
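To expand on the earlier suggestion to inject, generate, fetch, parse and
update step by step: running the tools individually instead of the
all-in-one crawl command makes it much easier to see where URLs get lost.
A rough sketch of one cycle (directory names follow the ones you already
use; the generate/fetch/parse/updatedb loop usually has to be repeated
several times before every listing page and PDF has been fetched):

bin/nutch inject crawl/crawldb urls

# one generate/fetch/parse/update pass; repeat until generate
# reports 0 records selected for fetching
bin/nutch generate crawl/crawldb crawl/segments -topN 50000
segment=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $segment -threads 100
# skip the separate parse step if fetcher.parse is true in your config
bin/nutch parse $segment
bin/nutch updatedb crawl/crawldb $segment

# check progress: db_unfetched counts discovered URLs not yet fetched
bin/nutch readdb crawl/crawldb -stats
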
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350