I am using bin/crawl - I'll change the timeLimitFetch to something a bit
higher.
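Concretely, that is just the one variable near the top of bin/crawl. A minimal sketch of the change (360 is only an example value, in minutes):

```shell
# In bin/crawl: raise the fetch time limit (value is in minutes;
# the shipped script sets 180).
timeLimitFetch=360

# The script later passes it through to the fetch step, roughly:
fetchFlag="-D fetcher.timelimit.mins=$timeLimitFetch"
echo "$fetchFlag"
```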

Thanks!

On Sun, Mar 17, 2013 at 5:07 PM, feng lu <[email protected]> wrote:

> yes, the property is fetcher.timelimit.mins. If you do not set this
> property, the QueueFeeder will not filter the URLs, and the log output may
> look like this:
>
> QueueFeeder finished: total 36651 records + hit by time limit :0
>
> Do you use the bin/crawl command script? It sets the time limit for
> fetching to 180 minutes. The code looks like this:
>
> # time limit for fetching
> timeLimitFetch=180
>
> # fetching the segment
> echo "Fetching : $SEGMENT"
> $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
> $CRAWL_PATH/segments/$SEGMENT -noParsing -threads $numThreads
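>
> For completeness: this property can also be set persistently in
> conf/nutch-site.xml instead of being passed on the command line. A sketch
> (360 is only an example value, in minutes; -1 disables the limit):
>
> ```xml
> <property>
>   <name>fetcher.timelimit.mins</name>
>   <value>360</value>
>   <description>Stop fetching this many minutes after the fetch job
>   starts; -1 means no time limit.</description>
> </property>
> ```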
>
>
>
>
>
> On Sun, Mar 17, 2013 at 7:47 PM, Amit Sela <[email protected]> wrote:
>
> > By the fetcher.limit property, do you mean fetcher.timelimit.mins?
> > Because I have it set to the default (-1) - no time limit.
> >
> >
> > On Sat, Mar 16, 2013 at 5:12 PM, feng lu <[email protected]> wrote:
> >
> > > Hi Amit
> > >
> > > <<
> > > I also note that the total hit by time limit here is 50927 but the
> > > job counters show 7493.
> > > >>
> > >
> > > These two time limits are both set by the fetcher.limit property. One
> > > is used in the QueueFeeder class: the QueueFeeder stops loading data
> > > once the current time is past the time limit, which is why the total
> > > hit by time limit is 50927. The other is used in the FetchItemQueues
> > > class: once the current time is past the time limit and the feeder has
> > > stopped, the queues are emptied, which is why the job counter for the
> > > time limit is 7493. That is why the two are not equal.
> > >
> > >
> > > <<
> > > Summing all of these numbers does equal the total map input.
> > > >>
> > >
> > > Do you set the "fetcher.follow.outlinks.depth" property? When
> > > fetcher.parse is true and this value is greater than 0, the fetcher
> > > will extract outlinks and follow them until the desired depth is
> > > reached.
> > >
> > > Another reason is that when a page redirects to another page, the
> > > fetcher will add the new redirect target to the fetch queues, so the
> > > map input is not equal to the sum of all the numbers.
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Sat, Mar 16, 2013 at 8:03 AM, Lewis John Mcgibbney <
> > > [email protected]> wrote:
> > >
> > > > Hi Amit,
> > > >
> > > > I know this thread is a bit old now, however it is also something
> > > > which bugged me when I was looking into something else (InjectorJob
> > > > counters).
> > > >
> > > > On Tue, Mar 5, 2013 at 3:16 AM, Amit Sela <[email protected]> wrote:
> > > >
> > > > >
> > > > > And summing all counters does not equal the total map input...
> > > > >
> > > > > Summing all of these numbers does equal the total map input. I
> > > > > also note that the total hit by time limit here is 50927 but the
> > > > > job counters show 7493.
> > > > >
> > > > >
> > > > Basically, the easiest way to see and generally understand counters
> > > > is to run the Nutch application within your Hadoop cluster (if no
> > > > cluster is available, use pseudo-distributed mode) and use the Hadoop
> > > > web interface. You will clearly see all counters associated with the
> > > > job and you can take it from there.
> > > > I like the notion of creating custom counters to obtain specific
> > > > metrics, but this is solely driven by user requirements.
> > > > Do you want to learn more about counters? Look into the code.
> > > > Do you want to know more about Nutch counters, or make the counters
> > > > more explicit? Then consider opening a Jira issue and we can discuss
> > > > this in more detail.
> > > > With regards to the Fetcher, there are many possible areas where
> > > > counters are (and could be) really useful... as I said, though, this
> > > > is only driven by user requirements.
> > > >
> > >
> > >
> > >
> > > --
> > > Don't Grow Old, Grow Up... :-)
> > >
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
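
P.S. for anyone finding this thread in the archives: the two time-limit
checks feng lu describes (one in QueueFeeder, one in FetchItemQueues) can be
illustrated with a small self-contained sketch. This is not the actual Nutch
code, just an approximation of the behavior as described above; the function
names are made up:

```shell
# Simplified sketch (NOT real Nutch code) of the two deadline checks.
timelimit=-1   # absolute deadline; -1 mirrors fetcher.timelimit.mins=-1

# Check 1 (QueueFeeder-style): stop feeding new records once the deadline
# has passed; records never fed count toward "hit by time limit".
feeder_should_stop() {  # $1 = current time
  [ "$timelimit" -ne -1 ] && [ "$1" -ge "$timelimit" ]
}

# Check 2 (FetchItemQueues-style): drop already-queued items only once the
# deadline has passed AND the feeder has stopped; those drops are what the
# job counter reports, so the two tallies differ.
queues_should_empty() {  # $1 = current time, $2 = "true" if feeder finished
  [ "$timelimit" -ne -1 ] && [ "$2" = "true" ] && [ "$1" -ge "$timelimit" ]
}

timelimit=1000
feeder_should_stop 500 && echo yes || echo no         # prints no
queues_should_empty 1500 true && echo yes || echo no  # prints yes
```

The extra "feeder finished" condition in the second check is the reason the
two counters report different numbers.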
