I am using bin/crawl - I'll change the timeLimitFetch to something a bit higher.
Thanks!

On Sun, Mar 17, 2013 at 5:07 PM, feng lu <[email protected]> wrote:

> yes, the property is fetcher.timelimit.mins. if you do not set this property,
> the QueueFeeder will not filter the url and log output may look like this
>
> QueueFeeder finished: total 36651 records + hit by time limit :0
>
> Do you use the bin/crawl command script? it will set the time limit for
> fetching to 180. the code is like this
>
> # time limit for fetching
> timeLimitFetch=180
>
> # fetching the segment
> echo "Fetching : $SEGMENT"
> $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $CRAWL_PATH/segments/$SEGMENT -noParsing -threads $numThreads
>
> On Sun, Mar 17, 2013 at 7:47 PM, Amit Sela <[email protected]> wrote:
>
> > By fetcher.limit property, do you mean fetcher.timelimit.mins? because I
> > have it set on default (-1) - no time limit.
> >
> > On Sat, Mar 16, 2013 at 5:12 PM, feng lu <[email protected]> wrote:
> >
> > > Hi Amit
> > >
> > > <<
> > > I also note
> > > that the total hit by time limit here is 50927 but the job counters show
> > > 7493.
> > > >>
> > >
> > > These two time limits are both set by the fetcher.limit property. One is
> > > used in the QueueFeeder class, indicating that the QueueFeeder should
> > > finish loading data if the current time is larger than the time limit. So
> > > the total hit by time limit is 50927. Another is used in the
> > > FetchItemQueues class, indicating that it checks the time, and if the
> > > current time is larger than the time limit and the feeder has stopped, it
> > > empties the queues. So here the job counter of time limit is 7493. They
> > > are not equal.
> > >
> > > <<
> > > Summing all of these numbers does equal the total map input.
> > > >>
> > >
> > > do you set the property "fetcher.follow.outlinks.depth"? when
> > > fetcher.parse is true and this value is greater than 0 the fetcher will
> > > extract outlinks and follow them until the desired depth is reached.
> > >
> > > Another reason is that when a page is redirected to another page, the
> > > fetcher will add the new redirect page to the fetch queues, so you can see
> > > that map input is not equal to the sum of all the numbers.
> > >
> > > On Sat, Mar 16, 2013 at 8:03 AM, Lewis John Mcgibbney <[email protected]> wrote:
> > >
> > > > Hi Amit,
> > > >
> > > > I know this thread is a bit old now, however it is also something which
> > > > bugged me when I was looking into something else (InjectorJob counters).
> > > >
> > > > On Tue, Mar 5, 2013 at 3:16 AM, Amit Sela <[email protected]> wrote:
> > > >
> > > > > And summing all counters does not equal the total map input...
> > > > >
> > > > > Summing all of these numbers does equal the total map input. I also
> > > > > note that the total hit by time limit here is 50927 but the job
> > > > > counters show 7493.
> > > >
> > > > Basically, the easiest way to see and generally understand counters is
> > > > to run the Nutch application within your Hadoop cluster (if no cluster
> > > > is available then use pseudo mode) and use the web application interface
> > > > to Hadoop. You will clearly see all counters associated with the job and
> > > > you can take it from there.
> > > > I like the notion of creating custom counters to obtain specific metrics
> > > > but this is solely driven by user requirements.
> > > > Do you want to learn more about counters? Look into the code.
> > > > Do you want to know more about Nutch counters, or make the counters more
> > > > explicit? Then consider opening a Jira issue and we can discuss this in
> > > > more detail.
> > > > With regards to the Fetcher, there are many possible areas where
> > > > counters are (and could be) really useful... as I said though this is
> > > > only driven by user requirements.
> > >
> > > --
> > > Don't Grow Old, Grow Up... :-)
>
> --
> Don't Grow Old, Grow Up... :-)
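
For reference, a minimal sketch of the kind of change meant by raising timeLimitFetch, based only on the bin/crawl excerpt quoted above. The TIME_LIMIT_FETCH environment variable and the value 360 are assumptions made for illustration and are not part of the stock script; the sketch only builds and prints the option string rather than calling nutch.

```shell
#!/bin/sh
# Sketch only: mirrors the timeLimitFetch variable from the quoted bin/crawl excerpt.
# TIME_LIMIT_FETCH is a hypothetical override, not something the stock script reads.

# time limit for fetching, in minutes (-1 means no time limit, per the thread)
timeLimitFetch="${TIME_LIMIT_FETCH:-360}"

# option string that the quoted script passes to the fetch step
fetchTimeOpt="-D fetcher.timelimit.mins=$timeLimitFetch"

echo "fetch time limit option: $fetchTimeOpt"
```

With a longer limit, fewer records should show up under "hit by time limit" in the QueueFeeder log line, provided the fetch finishes within the new window.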

