Also, doing it the tutorial way only gives me around 125 links at a time. Previously, the queue was in the 5k range. As far as I can tell, the only differences were not using the -noParsing flag and not using the -filter and -normalize flags on updatedb.
On Fri, Sep 23, 2011 at 9:15 AM, Bai Shen <[email protected]> wrote:
> I looked at the tutorial, and it's doing pretty much the same thing as the
> Lucid link I referenced earlier. It just leaves out the noParsing flag and
> also swaps the updatedb and parse commands. Does the order make a
> difference?
>
> On Fri, Sep 23, 2011 at 5:45 AM, lewis john mcgibbney <[email protected]> wrote:
>> Hi Bai,
>>
>> I hope the various comments have helped you somewhat; however, I have
>> another small one as well. Please see below.
>>
>> On Thu, Sep 22, 2011 at 6:08 PM, Bai Shen <[email protected]> wrote:
>>> I'm using 1.3. This is a new setup, so I'm running the latest versions.
>>>
>>> I did inject the URLs already. It's just that the part I was having
>>> issues with was the fetch, etc. I'm using the steps at Lucid Imagination »
>>> Using Nutch with Solr <http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/>,
>>> except that I already had Nutch set up and configured.
>>>
>>> When did noParsing change? I noticed that the Nutch wiki is out of date,
>>> so I'm not sure what the current setups are.
>>
>> You will find the official Nutch tutorial and command-line options (for
>> what you require) up to date; these can be found on the wiki. If you have
>> anything to add, please do.
>>
>>> The log data made some mention of Hadoop, but I don't remember what it
>>> was. I'll see if it happens again and post the message.
>>>
>>> On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <[email protected]> wrote:
>>>> Hi Bai,
>>>>
>>>> You haven't mentioned which Nutch version you're using... it would be
>>>> good if you could.
>>>>
>>>> You haven't injected any seed URLs into your crawldb. From memory I
>>>> think the -topN parameter should be passed to the generate command.
>>>>
>>>> Just to note, it is not necessary to set noParsing while executing the
>>>> fetch command. This is already default behaviour. Not sure why your
>>>> machine is churning, but this shouldn't be happening. Do you have any
>>>> log data to suggest why this is the case?
>>>>
>>>> On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <[email protected]> wrote:
>>>>> So I was able to get Nutch up and working using the crawl command. I
>>>>> set my depth and topN and it ran and indexed the pages for me.
>>>>>
>>>>> But now I'm trying to split out the separate pieces in order to
>>>>> distribute them and add my own parser. I'm running the following:
>>>>>
>>>>> bin/nutch generate crawl/crawldb crawl/segments
>>>>> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
>>>>> bin/nutch fetch $SEGMENT -noParsing
>>>>> bin/nutch parse $SEGMENT
>>>>> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
>>>>>
>>>>> I don't see any way to determine how deep to crawl. Is this possible,
>>>>> or do I have to manually manage the db? And if so, how do I do that?
>>>>>
>>>>> And as a side note, why does Nutch invoke Hadoop during the fetch
>>>>> command even though I have noParsing set? After fetching my links, my
>>>>> machine churns for around twenty minutes before finally ending, even
>>>>> though all the fetch threads completed already.
>>>>>
>>>>> Thanks.
>>>>
>>>> --
>>>> *Lewis*
>>
>> --
>> *Lewis*
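[Editor's note: the generate/fetch/parse/updatedb cycle being discussed can be sketched as a small shell helper. This is a minimal sketch, assuming the Nutch 1.3 CLI as quoted in the thread; the `crawl_cycle` function name, the `NUTCH` variable, and the `crawl/` directory layout are illustrative, not Nutch conventions. Per Lewis's comments, `-topN` goes on `generate`, and `-noParsing` is omitted from `fetch` since parse-free fetching is the default.]

```shell
# One split-out crawl cycle (sketch, assuming Nutch 1.3's commands).
NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher script
CRAWLDB=crawl/crawldb
SEGDIR=crawl/segments

crawl_cycle() {
  topn="$1"
  # -topN on generate bounds how many URLs this cycle will fetch
  "$NUTCH" generate "$CRAWLDB" "$SEGDIR" -topN "$topn"
  # Segment names are timestamps, so lexical sort finds the newest one
  SEGMENT="$SEGDIR/$(ls "$SEGDIR" | sort | tail -1)"
  "$NUTCH" fetch "$SEGMENT"            # no -noParsing needed; it's the default
  "$NUTCH" parse "$SEGMENT"
  "$NUTCH" updatedb "$CRAWLDB" "$SEGMENT"   # add -filter/-normalize here if wanted
}
```

There is no separate depth setting for the split-out commands: depth is simply how many times the cycle is repeated (e.g. `for i in 1 2 3; do crawl_cycle 1000; done` approximates a depth-3 crawl), since each `updatedb` feeds newly discovered links back into the crawldb for the next `generate`.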

