Also, doing it the tutorial way only gives me around 125 links at a time. Previously, the queue was in the 5k range. As far as I can tell, the only differences were not using the -noParsing flag and not using the -filter and -normalize flags on updatedb.
On Fri, Sep 23, 2011 at 9:15 AM, Bai Shen <[email protected]> wrote:
> I looked at the tutorial, and it's doing pretty much the same thing as the
> Lucid link I referenced earlier. It just leaves out the noParsing flag and
> also swaps the updatedb and parse commands. Does the order make a
> difference?
>
> On Fri, Sep 23, 2011 at 5:45 AM, lewis john mcgibbney <[email protected]> wrote:
>> Hi Bai,
>>
>> I hope the various comments have helped you somewhat; however, I have
>> another small one as well. Please see below.
>>
>> On Thu, Sep 22, 2011 at 6:08 PM, Bai Shen <[email protected]> wrote:
>>> I'm using 1.3. This is a new setup, so I'm running the latest versions.
>>>
>>> I did inject the URLs already. It's just that the part I was having
>>> issues with was the fetch, etc. I'm using the steps at Lucid Imagination »
>>> Using Nutch with Solr <http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/>,
>>> except that I already had Nutch set up and configured.
>>>
>>> When did noParsing change? I noticed that the Nutch wiki is out of date,
>>> so I'm not sure what the current setups are.
>>
>> You will find the official Nutch tutorial and command-line options (for
>> what you require) up to date; these can be found on the wiki. If you have
>> anything to add, please do.
>>
>>> The log data made some mention of Hadoop, but I don't remember what it
>>> was. I'll see if it happens again and post the message.
>>>
>>> On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <[email protected]> wrote:
>>>> Hi Bai,
>>>>
>>>> You haven't mentioned which Nutch version you're using... it would be
>>>> good if you could.
>>>>
>>>> You haven't injected any seed URLs into your crawldb. From memory I
>>>> think the -topN parameter should be passed to the generate command.
>>>>
>>>> Just to note, it is not necessary to set noParsing while executing the
>>>> fetch command. This is already default behaviour. Not sure why your
>>>> machine is churning, but this shouldn't be happening. Do you have any
>>>> log data to suggest why this is the case?
>>>>
>>>> On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <[email protected]> wrote:
>>>>> So I was able to get Nutch up and working using the crawl command. I
>>>>> set my depth and topN and it ran and indexed the pages for me.
>>>>>
>>>>> But now I'm trying to split out the separate pieces in order to
>>>>> distribute them and add my own parser. I'm running the following:
>>>>>
>>>>> bin/nutch generate crawl/crawldb crawl/segments
>>>>> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
>>>>> bin/nutch fetch $SEGMENT -noParsing
>>>>> bin/nutch parse $SEGMENT
>>>>> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
>>>>>
>>>>> I don't see any way to determine how deep to crawl. Is this possible,
>>>>> or do I have to manually manage the db? And if so, how do I do that?
>>>>>
>>>>> And as a side note, why does Nutch invoke Hadoop during the fetch
>>>>> command even though I have noParsing set? After fetching my links, my
>>>>> machine churns for around twenty minutes before finally ending, even
>>>>> though all the fetch threads completed already.
>>>>>
>>>>> Thanks.
>>>>
>>>> --
>>>> *Lewis*
>>
>> --
>> *Lewis*
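[Editor's note: the generate/fetch/parse/updatedb cycle being discussed can be sketched as a small shell helper. This is a minimal sketch, assuming the Nutch 1.3 CLI as quoted in the thread; the `crawl_cycle` function name, the `NUTCH` variable, and the `crawl/` directory layout are illustrative, not Nutch conventions. Per Lewis's comments, `-topN` goes on `generate`, and `-noParsing` is omitted from `fetch` since parse-free fetching is the default.]

```shell
# One split-out crawl cycle (sketch, assuming Nutch 1.3's commands).
NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher script
CRAWLDB=crawl/crawldb
SEGDIR=crawl/segments

crawl_cycle() {
  topn="$1"
  # -topN on generate bounds how many URLs this cycle will fetch
  "$NUTCH" generate "$CRAWLDB" "$SEGDIR" -topN "$topn"
  # Segment names are timestamps, so lexical sort finds the newest one
  SEGMENT="$SEGDIR/$(ls "$SEGDIR" | sort | tail -1)"
  "$NUTCH" fetch "$SEGMENT"            # no -noParsing needed; it's the default
  "$NUTCH" parse "$SEGMENT"
  "$NUTCH" updatedb "$CRAWLDB" "$SEGMENT"   # add -filter/-normalize here if wanted
}
```

There is no separate depth setting for the split-out commands: depth is simply how many times the cycle is repeated (e.g. `for i in 1 2 3; do crawl_cycle 1000; done` approximates a depth-3 crawl), since each `updatedb` feeds newly discovered links back into the crawldb for the next `generate`.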

