Excellent Bai, thanks for pointing this out. It has been fixed.
On Fri, Sep 23, 2011 at 8:03 PM, Bai Shen <[email protected]> wrote:
> You need to change the two code blocks underneath that as well. They still
> show the update before the parse.
>
> bin/nutch fetch $s2
> bin/nutch updatedb crawldb $s2
> bin/nutch parse $s2
>
> On Fri, Sep 23, 2011 at 10:01 AM, lewis john mcgibbney <[email protected]> wrote:
>> This has been fixed.
>>
>> Thanks for raising and looking into this, guys.
>>
>> On Fri, Sep 23, 2011 at 2:32 PM, Markus Jelsma <[email protected]> wrote:
>>> Hmm, the wiki tutorial seems wrong. You must parse before updating any DB.
>>>
>>> On Friday 23 September 2011 15:15:45 Bai Shen wrote:
>>>> I looked at the tutorial, and it's doing pretty much the same thing as the
>>>> lucid link I referenced earlier. It just leaves out the noParsing and also
>>>> swaps the updatedb and parse commands. Does the order make a difference?
>>>>
>>>> On Fri, Sep 23, 2011 at 5:45 AM, lewis john mcgibbney <[email protected]> wrote:
>>>>> Hi Bai,
>>>>>
>>>>> I hope various comments have helped you somewhat; however, I have another
>>>>> small one as well. Please see below.
>>>>>
>>>>> On Thu, Sep 22, 2011 at 6:08 PM, Bai Shen <[email protected]> wrote:
>>>>>> I'm using 1.3. This is a new setup, so I'm running the latest versions.
>>>>>>
>>>>>> I did inject the urls already. It's just that the part I was having
>>>>>> issues with was the fetch, etc. I'm using the steps at Lucid Imagination »
>>>>>> Using Nutch with Solr
>>>>>> <http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/>, except
>>>>>> that I already had Nutch set up and configured.
>>>>>>
>>>>>> When did noParsing change? I noticed that the Nutch wiki is out of date,
>>>>>> so I'm not sure what the current setups are.
>>>>>
>>>>> You will find the official Nutch tutorial and command line options (for
>>>>> what you require) up to date; these can be found on the wiki. If you have
>>>>> anything to add, please do.
>>>>>
>>>>>> The log data made some mention of hadoop, but I don't remember what it
>>>>>> was. I'll see if it happens again and post the message.
>>>>>>
>>>>>> On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <[email protected]> wrote:
>>>>>>> Hi Bai,
>>>>>>>
>>>>>>> You haven't mentioned which Nutch version you're using... it would be
>>>>>>> good if you could.
>>>>>>>
>>>>>>> You haven't injected any seed URLs into your crawldb. From memory I
>>>>>>> think the -topN parameter should be passed to the generate command.
>>>>>>>
>>>>>>> Just to note, it is not necessary to set noParsing while executing the
>>>>>>> fetch command. This is already default behaviour. Not sure why your
>>>>>>> machine is churning, but this shouldn't be happening. Do you have any
>>>>>>> log data to suggest why this is the case?
>>>>>>>
>>>>>>> On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <[email protected]> wrote:
>>>>>>>> So I was able to get Nutch up and working using the crawl command. I
>>>>>>>> set my depth and topN and it ran and indexed the pages for me.
>>>>>>>>
>>>>>>>> But now I'm trying to split out the separate pieces in order to
>>>>>>>> distribute them and add my own parser. I'm running the following.
>>>>>>>>
>>>>>>>> bin/nutch generate crawl/crawldb crawl/segments
>>>>>>>> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
>>>>>>>> bin/nutch fetch $SEGMENT -noParsing
>>>>>>>> bin/nutch parse $SEGMENT
>>>>>>>> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
>>>>>>>>
>>>>>>>> I don't see any way to determine how deep to crawl. Is this possible,
>>>>>>>> or do I have to manually manage the db? And if so, how do I do that?
>>>>>>>>
>>>>>>>> And as a side note, why does Nutch invoke hadoop during the fetch
>>>>>>>> command even though I have noParsing set? After fetching my links, my
>>>>>>>> machine churns for around twenty minutes before finally ending, even
>>>>>>>> though all the fetch threads completed already.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>
>>>>>>> --
>>>>>>> *Lewis*
>>>>>
>>>>> --
>>>>> *Lewis*
>>>
>>> --
>>> Markus Jelsma - CTO - Openindex
>>> http://www.linkedin.com/in/markus17
>>> 050-8536620 / 06-50258350
>>
>> --
>> *Lewis*

--
*Lewis*
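To pull the thread's conclusion together: the corrected cycle runs parse before updatedb, as Markus points out, and "depth" is simply the number of times the generate/fetch/parse/updatedb round is repeated. Here is a rough sketch of such a loop, assuming a Nutch 1.3 local runtime and a crawldb already seeded via `bin/nutch inject`; the `DEPTH` and `TOPN` values are purely illustrative.

```shell
#!/bin/sh
# Sketch of the corrected Nutch fetch cycle: parse BEFORE updatedb.
# Assumes Nutch 1.3 and a crawldb already seeded with `bin/nutch inject`.
DEPTH=3     # hypothetical: "depth" = number of rounds of this loop
TOPN=1000   # hypothetical: cap on URLs per segment, passed to generate

for i in $(seq 1 "$DEPTH"); do
  # Generate a fetch list; -topN goes to generate, not to fetch.
  bin/nutch generate crawl/crawldb crawl/segments -topN "$TOPN"

  # Pick the segment that generate just created (the newest directory).
  SEGMENT=crawl/segments/$(ls -tr crawl/segments | tail -1)

  # Parsing during fetch is off by default in 1.3, so no -noParsing flag.
  bin/nutch fetch "$SEGMENT"

  # Parse first, then fold the parsed results back into the crawldb.
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT" -filter -normalize
done
```

The `ls -tr ... | tail -1` idiom is the same one Bai used: it sorts the segment directories by modification time and takes the most recent, which is always the segment the preceding generate step produced.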

