Re: Continuous crawling

庄名洲 Tue, 29 Nov 2011 02:01:45 -0800

I'd like to know the details of continuous crawling, too.
Could anyone forward me the original email? I'm new here. Thanks to all
of you.
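
For anyone else joining late: the return-code check described in the quoted thread below (exit code 0 means move the segment to the next queue, anything else means report an error) could be sketched roughly like this. It is only an illustration in Python, not the actual shell script discussed; `run_step` and the two callbacks are hypothetical names, and the `bin/nutch fetch` command line in the comment is an assumed example invocation.

```python
import subprocess
import sys


def run_step(cmd, on_success, on_failure):
    """Run one crawl step, wait for it to finish, and branch on its exit code.

    cmd        -- command line as a list of strings
    on_success -- called with no arguments when the exit code is 0
    on_failure -- called with the non-zero exit code otherwise
    """
    result = subprocess.run(cmd)  # blocks until the job completes
    if result.returncode == 0:
        on_success()
    else:
        on_failure(result.returncode)
    return result.returncode


# Hypothetical usage (the segment path and handlers are placeholders):
#   run_step(["bin/nutch", "fetch", "crawl/segments/SEGMENT_DIR"],
#            on_success=lambda: print("move segment to next queue"),
#            on_failure=lambda rc: print("fetch failed, exit code", rc))
```

The callbacks stand in for "move it to another queue" and "report via mail"; how those are implemented is left open here.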


2011/11/29 庄名洲 <[email protected]>

> "No agent is listed in the http.agent.name property."
> I ran into this before.
> Just rebuild with ant.
> You may also need .patch files to fix the source. Good luck.
>
>
> 2011/11/29 Bai Shen <[email protected]>
>
>> I've changed nutch to use the pseudo-distributed mode, but it keeps
>> erroring out that no agent is listed in the http.agent.name property.  I
>> copied over my conf directory from local, but that didn't fix it.  What am
>> I missing?
>>
>> On Mon, Nov 28, 2011 at 9:23 AM, Julien Nioche <
>> [email protected]> wrote:
>>
>> > Simply run Nutch in pseudo-distributed mode. If you have no idea of what
>> > this means, then it would be a good idea to have a look at
>> > http://hadoop.apache.org/common/docs/stable/single_node_setup.html and in
>> > particular the section mentioning http://localhost:50030/jobtracker.jsp
>> >
>> > On 28 November 2011 14:09, Bai Shen <[email protected]> wrote:
>> >
>> > > We looked at the hadoop reporter and aren't sure how to access it with
>> > > nutch.  Is there a certain way it works?  Can you give me an example?
>> > > Thanks.
>> > >
>> > > On Mon, Nov 21, 2011 at 3:11 PM, Markus Jelsma
>> > > <[email protected]> wrote:
>> > >
>> > > > > On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma
>> > > > > <[email protected]> wrote:
>> > > > >
>> > > > > > > Interesting. How do you tell if the segments have been
>> > > > > > > fetched, etc?
>> > > > > >
>> > > > > > after a job the shell script waits for its completion and return
>> > > > > > code. If it returns 0 all is fine and we move it to another queue.
>> > > > > > If != 0 then there's an error and reports via mail.
>> > > > > >
>> > > > > > Ah, okay. I didn't realize it was returning an error code.
>> > > > > >
>> > > > > > > How do you know if there are any urls that had problems?
>> > > > > >
>> > > > > > Hadoop reporter shows statistics. There are always many errors for
>> > > > > > many reasons. This is normal because we crawl everything.
>> > > > >
>> > > > > How are you running Hadoop reporter?
>> > > >
>> > > > You'll get it for free when operating a Hadoop cluster.
>> > > >
>> > > > > > > Or fetch jobs that errored out, etc.
>> > > > > >
>> > > > > > The non-zero return code.
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Open Source Solutions for Text Engineering
>> >
>> > http://digitalpebble.blogspot.com/
>> > http://www.digitalpebble.com
>> >
>>
>
>
>
> --
> *Best Regards :-)*
> *mingzhou zhuang
> Department of Computer Science & Technology,Tsinghua University, Beijing,
> China*
>
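
On the http.agent.name error quoted above: before rebuilding, it may be worth confirming that the property really is set in the conf file that gets packaged. Nutch conf files use the Hadoop `<configuration>/<property>` layout, so a quick check could look like the sketch below. The helper names and the sample value `MyTestCrawler` are made up for illustration; in practice you would read your own nutch-site.xml instead of the inline sample.

```python
import xml.etree.ElementTree as ET


def get_property(conf_xml, name):
    """Return the value of a Hadoop-style <property>, or None if it is absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def agent_name_is_set(conf_xml):
    """True if http.agent.name is present and has a non-empty value."""
    return bool(get_property(conf_xml, "http.agent.name"))


# Illustrative sample only; the agent name here is a placeholder value.
SAMPLE = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>"""
```

If the check passes on the file in your conf directory but the job still complains, that would be consistent with the configuration being packaged at build time, which is presumably why rebuilding with ant helped.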



-- 
*Best Regards :-)*
*mingzhou zhuang
Department of Computer Science & Technology,Tsinghua University, Beijing,
China*
