Check logs/hadoop.log for connection timeout errors.
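For example, something like the commands below will give a quick count and
show the affected URLs. This is only a rough sketch: the exact wording of
the log lines depends on the protocol plugin, so treat the grep patterns as
guesses rather than canonical Nutch output.

# count lines that look like fetch timeouts
grep -ic 'timed out' logs/hadoop.log
grep -c 'SocketTimeoutException' logs/hadoop.log

# eyeball the URLs involved and the reported reasons
grep -i 'failed with' logs/hadoop.log | less

If timeouts do turn up, raising http.timeout in nutch-site.xml (the default
in nutch-default.xml is 10000 ms) is the first thing to try, given that
Apache can be slow generating those big directory index pages.
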
On Tuesday 17 August 2010 14:07:22 Bill Arduino wrote:
> There are 128 entries in urls/nutch formatted like so:
>
> http://server.example.com/docs/DF-09/
> http://server.example.com/docs/DF-10/
> http://server.example.com/docs/EG-02/
> http://server.example.com/docs/EG-03/
> http://server.example.com/docs/EG-04/
>
> There are 428 directories in http://server.example.com/docs. I only
> wanted to start out with a small number to reduce the wait times while
> configuring. I am wondering if it is timing out waiting for apache to
> generate the index page and just taking whatever it gets before moving
> on. Maybe I should increase my wait times...
>
> On Tue, Aug 17, 2010 at 4:56 AM, Markus Jelsma <markus.jel...@buyways.nl> wrote:
> > Well, the CrawlDB tells us you only got ~9000 URLs in total. Perhaps
> > the seeding didn't go too well? Make sure that all your Apache
> > directory listings are injected into the CrawlDB. If you then
> > generate, fetch, parse and update the DB, you should have all URLs
> > in your DB.
> >
> > How many directory listing pages do you have anyway?
> >
> > On Tuesday 17 August 2010 03:52:31 Bill Arduino wrote:
> > > Thanks for your reply, Markus.
> > >
> > > I ran the command several times. Each subsequent run finished in a
> > > few seconds with only this output:
> > >
> > > crawl started in: crawl
> > > rootUrlDir = urls
> > > threads = 100
> > > depth = 5
> > > indexer=lucene
> > > topN = 5000
> > > Injector: starting
> > > Injector: crawlDb: crawl/crawldb
> > > Injector: urlDir: urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: done
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: starting
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: topN: 5000
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=0 - no more URLs to fetch.
> > > No URLs to fetch - check your seed list and URL filters.
> > > crawl finished: crawl
> > >
> > > The query shows all URLs fetched:
> > >
> > > # bin/nutch readdb crawl/crawldb/ -stats
> > > CrawlDb statistics start: crawl/crawldb/
> > > Statistics for CrawlDb: crawl/crawldb/
> > > TOTAL urls:     8795
> > > retry 0:        8795
> > > min score:      0.0090
> > > avg score:      0.028536895
> > > max score:      13.42
> > > status 2 (db_fetched):  8795
> > > CrawlDb statistics: done
> > >
> > > I have tried deleting the crawl dir and starting from scratch with
> > > the same results. I'm at a loss. I've been over all of the values
> > > in nutch-default.xml but I can't really see anything that seems
> > > wrong.
> > >
> > > On Mon, Aug 16, 2010 at 6:05 PM, Markus Jelsma <markus.jel...@buyways.nl> wrote:
> > > > Hi,
> > > >
> > > > Quite hard to debug, but let's try to make this a lucky guess:
> > > > how many times did you crawl? If you have all the Apache
> > > > directory listing pages injected by seeding, you'll only need
> > > > one generate command. But, depending on different settings, you
> > > > might need to fetch and parse multiple times.
> > > >
> > > > Also, you can check how many URLs are yet to be fetched by using
> > > > the readdb command:
> > > >
> > > > # bin/nutch readdb crawl/crawldb/ -stats
> > > >
> > > > Cheers,
> > > >
> > > > -----Original message-----
> > > > From: Bill Arduino <robots...@gmail.com>
> > > > Sent: Mon 16-08-2010 23:11
> > > > To: user@nutch.apache.org
> > > > Subject: Not getting all documents
> > > >
> > > > Hi all,
> > > >
> > > > I have set up Nutch 1.1 and supplied it a list of URLs in a
> > > > urls/nutch flat file. Each line is a dir on the same server like
> > > > so:
> > > >
> > > > http://myserver.mydomain.com/docs/SC-09
> > > > http://myserver.mydomain.com/docs/SC-10
> > > >
> > > > In each of these dirs are anywhere from 1 to 15,000 PDF files.
> > > > The index is dynamically generated by apache for each dir. In
> > > > total there are 1.2 million PDF files I need to index.
> > > >
> > > > Running the command:
> > > >
> > > > bin/nutch crawl urls -dir crawl -depth 5 -topN 50000
> > > >
> > > > seems to work and I get data that I can search, but I know I am
> > > > not getting all of the PDFs fetched or indexed. If I do this:
> > > >
> > > > grep pdf logs/hadoop.log | grep fetching | wc -l
> > > > 12386
> > > >
> > > > I know there are 276,867 PDFs in the URLs I provided in the
> > > > nutch file, yet it fetched only 12,386 of them.
> > > >
> > > > I'm not sure on the -topN parameter, but it seems to run the
> > > > same no matter what I put in it. I have these settings in my
> > > > nutch-site.xml:
> > > >
> > > > file.content.limit        -1
> > > > http.content.limit        -1
> > > > fetcher.threads.fetch     100
> > > > fetcher.threads.per.host  100
> > > >
> > > > PDF parser is working. I also have this in nutch-site:
> > > >
> > > > <!-- plugin properties -->
> > > > <property>
> > > >   <name>plugin.includes</name>
> > > >   <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
> > > > </property>
> > > >
> > > > Any ideas?
> > > > Thanks!
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
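To expand on the earlier suggestion to inject, generate, fetch, parse and
update step by step: running the tools individually instead of the
all-in-one crawl command makes it much easier to see where URLs get lost.
A rough sketch of one cycle (directory names follow the ones you already
use; the generate/fetch/parse/updatedb loop usually has to be repeated
several times before every listing page and PDF has been fetched):

bin/nutch inject crawl/crawldb urls

# one generate/fetch/parse/update pass; repeat until generate
# reports 0 records selected for fetching
bin/nutch generate crawl/crawldb crawl/segments -topN 50000
segment=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $segment -threads 100
# skip the separate parse step if fetcher.parse is true in your config
bin/nutch parse $segment
bin/nutch updatedb crawl/crawldb $segment

# check progress: db_unfetched counts discovered URLs not yet fetched
bin/nutch readdb crawl/crawldb -stats
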
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350