Excellent Bai, thanks for pointing this out. It has been fixed.
On Fri, Sep 23, 2011 at 8:03 PM, Bai Shen <[email protected]> wrote:
> You need to change the two code blocks underneath that as well. They still
> show the update before the parse.
>
> bin/nutch fetch $s2
> bin/nutch updatedb crawldb $s2
> bin/nutch parse $s2
>
> On Fri, Sep 23, 2011 at 10:01 AM, lewis john mcgibbney <[email protected]> wrote:
>> This has been fixed.
>>
>> Thanks for raising and looking into this, guys.
>>
>> On Fri, Sep 23, 2011 at 2:32 PM, Markus Jelsma <[email protected]> wrote:
>>> Hmm, the wiki tutorial seems wrong. You must parse before updating any DB.
>>>
>>> On Friday 23 September 2011 15:15:45 Bai Shen wrote:
>>>> I looked at the tutorial, and it's doing pretty much the same thing as the
>>>> lucid link I referenced earlier. It just leaves out the noParsing and also
>>>> swaps the updatedb and parse commands. Does the order make a difference?
>>>>
>>>> On Fri, Sep 23, 2011 at 5:45 AM, lewis john mcgibbney <[email protected]> wrote:
>>>>> Hi Bai,
>>>>>
>>>>> I hope various comments have helped you somewhat; however, I have another
>>>>> small one as well. Please see below.
>>>>>
>>>>> On Thu, Sep 22, 2011 at 6:08 PM, Bai Shen <[email protected]> wrote:
>>>>>> I'm using 1.3. This is a new setup, so I'm running the latest versions.
>>>>>>
>>>>>> I did inject the urls already. It's just that the part I was having
>>>>>> issues with was the fetch, etc. I'm using the steps at Lucid Imagination »
>>>>>> Using Nutch with Solr
>>>>>> <http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/>, except
>>>>>> that I already had Nutch set up and configured.
>>>>>>
>>>>>> When did noParsing change? I noticed that the Nutch wiki is out of date,
>>>>>> so I'm not sure what the current setups are.
>>>>>
>>>>> You will find the official Nutch tutorial and command line options (for
>>>>> what you require) up to date; these can be found on the wiki. If you have
>>>>> anything to add, please do.
>>>>>
>>>>>> The log data made some mention of hadoop, but I don't remember what it
>>>>>> was. I'll see if it happens again and post the message.
>>>>>>
>>>>>> On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <[email protected]> wrote:
>>>>>>> Hi Bai,
>>>>>>>
>>>>>>> You haven't mentioned which Nutch version you're using... it would be
>>>>>>> good if you could.
>>>>>>>
>>>>>>> You haven't injected any seed URLs into your crawldb. From memory I
>>>>>>> think the -topN parameter should be passed to the generate command.
>>>>>>>
>>>>>>> Just to note, it is not necessary to set noParsing while executing the
>>>>>>> fetch command. This is already default behaviour. Not sure why your
>>>>>>> machine is churning, but this shouldn't be happening. Do you have any
>>>>>>> log data to suggest why this is the case?
>>>>>>>
>>>>>>> On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <[email protected]> wrote:
>>>>>>>> So I was able to get Nutch up and working using the crawl command. I
>>>>>>>> set my depth and topN and it ran and indexed the pages for me.
>>>>>>>>
>>>>>>>> But now I'm trying to split out the separate pieces in order to
>>>>>>>> distribute them and add my own parser. I'm running the following.
>>>>>>>>
>>>>>>>> bin/nutch generate crawl/crawldb crawl/segments
>>>>>>>> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
>>>>>>>> bin/nutch fetch $SEGMENT -noParsing
>>>>>>>> bin/nutch parse $SEGMENT
>>>>>>>> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
>>>>>>>>
>>>>>>>> I don't see any way to determine how deep to crawl. Is this possible,
>>>>>>>> or do I have to manually manage the db? And if so, how do I do that?
>>>>>>>>
>>>>>>>> And as a side note, why does Nutch invoke hadoop during the fetch
>>>>>>>> command even though I have noParsing set? After fetching my links, my
>>>>>>>> machine churns for around twenty minutes before finally ending, even
>>>>>>>> though all the fetch threads completed already.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>
>>>>>>> --
>>>>>>> *Lewis*
>>>>>
>>>>> --
>>>>> *Lewis*
>>>
>>> --
>>> Markus Jelsma - CTO - Openindex
>>> http://www.linkedin.com/in/markus17
>>> 050-8536620 / 06-50258350
>>
>> --
>> *Lewis*

--
*Lewis*
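To pull the thread's conclusion together: the corrected cycle runs parse before updatedb, as Markus points out, and "depth" is simply the number of times the generate/fetch/parse/updatedb round is repeated. Here is a rough sketch of such a loop, assuming a Nutch 1.3 local runtime and a crawldb already seeded via `bin/nutch inject`; the `DEPTH` and `TOPN` values are purely illustrative.

```shell
#!/bin/sh
# Sketch of the corrected Nutch fetch cycle: parse BEFORE updatedb.
# Assumes Nutch 1.3 and a crawldb already seeded with `bin/nutch inject`.
DEPTH=3     # hypothetical: "depth" = number of rounds of this loop
TOPN=1000   # hypothetical: cap on URLs per segment, passed to generate

for i in $(seq 1 "$DEPTH"); do
  # Generate a fetch list; -topN goes to generate, not to fetch.
  bin/nutch generate crawl/crawldb crawl/segments -topN "$TOPN"

  # Pick the segment that generate just created (the newest directory).
  SEGMENT=crawl/segments/$(ls -tr crawl/segments | tail -1)

  # Parsing during fetch is off by default in 1.3, so no -noParsing flag.
  bin/nutch fetch "$SEGMENT"

  # Parse first, then fold the parsed results back into the crawldb.
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT" -filter -normalize
done
```

The `ls -tr ... | tail -1` idiom is the same one Bai used: it sorts the segment directories by modification time and takes the most recent, which is always the segment the preceding generate step produced.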

