Re: no digest field avaliable

2013-07-03 Thread Christian Nölle
On 03.07.2013 23:12, Sebastian Nagel wrote: with Nutch 1.7 and Solr 3.6.2 and same for [1], [2] the digest field appears and is filled well. OK, so no further configuration is necessary? Nutch will generate this out of the box? Just tested it with a freshly compiled 1.7 - no digest once

Re: no digest field avaliable

2013-07-03 Thread Sebastian Nagel
Hi Christian, with Nutch 1.7 and Solr 3.6.2 and same for [1], [2] the digest field appears and is filled well. Sebastian On 07/03/2013 09:14 AM, Christian Nölle wrote: > On 02.07.2013 22:29, Sebastian Nagel wrote: > >>> no field "digest" showing up in the indexchecker >> That's correct to som

RE: Nutch scalability tests

2013-07-03 Thread Markus Jelsma
How many different hosts do you crawl? I see one reducer and only one queue, and Nutch queues by domain or host. URLs of the same host will always end up in the same queue, so Nutch will only crawl a lot and very fast if there's a large number of queues to process. The only thing you can do then is increase the
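To illustrate the point about queues, here is a minimal Python sketch. It is not Nutch code: the helper `group_by_host` and the example URLs are hypothetical, and it only shows why a seed list with few distinct hosts yields few queues.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URLs into one list per host name (illustrative, not Nutch's fetcher)."""
    queues = defaultdict(list)
    for url in urls:
        # hostname is None for malformed URLs; fall back to an empty key
        host = urlsplit(url).hostname or ""
        queues[host].append(url)
    return dict(queues)


# Hypothetical URLs: two share a host, so only two queues result
urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/",
]
queues = group_by_host(urls)
```

With only a handful of hosts in the seed list, the number of keys in `queues` stays small no matter how many URLs are fetched, which is consistent with a single busy queue.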

Re: Nutch scalability tests

2013-07-03 Thread h b
Hi, I reran this job. I had 5 URLs in my seed, and the first pass of fetch fetched about 230 pages in 20 minutes. Then I ran a second pass of fetch, and it has been running for over 3.5 hours. Again, it is still the one reducer doing all the work, and its jobtracker has nothing in its log yet. 20/20

Re: Integration of Apache-nutch and eclipse.

2013-07-03 Thread Tejas Patil
Have you looked at http://wiki.apache.org/nutch/RunNutchInEclipse ? It has recently been updated and has worked for several people on the user group. It has some cool screenshots which should make your life easy when setting up Nutch with Eclipse. On Wed, Jul 3, 2013 at 12:39 AM, Ramakrishna wrote: > G

Re: Nutch scalability tests

2013-07-03 Thread Tejas Patil
The steps you performed are right. Did you get the log for that one "hardworking" reducer? It will give us a hint as to why the job took so long. Ideally you should get logs for every job and its attempts. If you cannot get the log for that reducer, then I feel that your cluster is having some problem and thi

Re: Stepwise nutch execution order

2013-07-03 Thread Tejas Patil
The correct order is: inject, then loop (generate, fetch, parse, updatedb), then solr. The Nutch tutorial [0] and the crawl script [1] use the same order. [0] : http://wiki.apache.org/nutch/NutchTutorial [1] : http://svn.apache.org/viewvc/nutch/trunk/src/bin/crawl?view=markup On Wed, Jul 3, 2013 at
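The order given in this reply can be sketched in Python as follows. This is only an illustration of the sequence: `crawl_steps` and its `rounds` parameter are hypothetical names, the step names are taken from the message, and no actual Nutch command-line arguments are shown.

```python
def crawl_steps(rounds):
    """Return the step names in the order described: inject once,
    then generate/fetch/parse/updatedb per round, then solr at the end."""
    steps = ["inject"]
    for _ in range(rounds):
        # parse comes before updatedb within each round
        steps += ["generate", "fetch", "parse", "updatedb"]
    steps.append("solr")
    return steps


# Hypothetical example: two crawl rounds
plan = crawl_steps(2)
```

The point of the sketch is the position of `parse` ahead of `updatedb` inside the loop, which is the difference between this order and the one asked about in the original question.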

Re: Nutch scalability tests

2013-07-03 Thread h b
Hi Tejas, looks like we were typing at the same time. So anyway, my job ended fine. Just to be sure what I am doing is right, I have cleared the db and started another round. If I stumble again, I will respond on this thread. On Wed, Jul 3, 2013 at 8:43 AM, Tejas Patil wrote: > > The seco

Stepwise nutch execution order

2013-07-03 Thread h b
In most documents and email lists, I have seen that the order of crawl for nutch-solr is: inject, then loop (generate, fetch, updatedb, parse), then solr. When I follow this path I always see that Solr has 0 docs; even if I run solr inside the loop, I still get 0 docs in Solr. However, if I switch the o

Integration of Apache-nutch and eclipse.

2013-07-03 Thread Ramakrishna
Guys, I'm extremely sorry for asking the same question again. Even after reading many documents, I didn't get how to integrate Nutch and Eclipse. I have apache-nutch-2.2 and eclipse-juno. Please tell me step by step how to integrate Eclipse and Nutch, with documentation. If possible pl

Re: Nutch scalability tests

2013-07-03 Thread h b
Spoke too soon, the fetch completed in 21 min. On Wed, Jul 3, 2013 at 8:32 AM, h b wrote: > oh and yes, generate.max.count is set to 5000 > > > On Wed, Jul 3, 2013 at 8:29 AM, h b wrote: > >> I dropped my webpage database, restarted with 5 seed urls. First fetch >> completed in a few seconds.

Re: Nutch scalability tests

2013-07-03 Thread Tejas Patil
> The second run, still shows 1 reduce running, although it shows as 100% complete, so my thought is it is writing out to the disk, though it has been about 30+ minutes. > This one reducer's log on the jobtracker however, is empty. This is weird. There can be an explanation for the first line: The data

Re: Nutch scalability tests

2013-07-03 Thread h b
oh and yes, generate.max.count is set to 5000 On Wed, Jul 3, 2013 at 8:29 AM, h b wrote: > I dropped my webpage database, restarted with 5 seed urls. First fetch > completed in a few seconds. The second run, still shows 1 reduce running, > although it shows as 100% complete, so my thought is it

Re: Nutch scalability tests

2013-07-03 Thread h b
I dropped my webpage database, restarted with 5 seed urls. First fetch completed in a few seconds. The second run, still shows 1 reduce running, although it shows as 100% complete, so my thought is it is writing out to the disk, though it has been about 30+ minutes. Again, I had 80 reducers, when I

Re: Number of mappers in a distributed mode

2013-07-03 Thread Lewis John Mcgibbney
Please look for mapred-site.xml in the Hadoop conf directory. You can specify mapred.reduce.tasks and set an int for this value. You will need to restart the jobtracker for this to kick in, I would imagine. On Wednesday, July 3, 2013, Sznajder ForMailingList < bs4mailingl...@gmail.com> wrote: > Hi > > Wh
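As a rough sketch of what such an entry looks like, the Python below renders a Hadoop-style `<property>` element for `mapred.reduce.tasks`. The helper `hadoop_property` is hypothetical and the value 6 is an assumption chosen for illustration; the resulting element would go inside the `<configuration>` element of mapred-site.xml.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name, value):
    """Render one <property> element in the Hadoop configuration file format."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(prop, encoding="unicode")


# The value 6 is an assumed example, not a recommendation
snippet = hadoop_property("mapred.reduce.tasks", 6)
```

Generating the element this way keeps the name and value properly escaped; pasting it by hand into the file works just as well.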

Number of mappers in a distributed mode

2013-07-03 Thread Sznajder ForMailingList
Hi, When running Nutch in distributed mode, I see on my Hadoop jobtracker ( http://host:50300/jobtracker.jsp ) only 2 mappers running. I wanted to set the number of mappers to the number of nodes I have on the cluster (6). Looking in the wiki, I found documents speaking about hadoop-site

RE: [ANNOUNCE] Apache Nutch v2.2.1 Released

2013-07-03 Thread Markus Jelsma
Great news, thanks Lewis! -Original message- From: Lewis John Mcgibbney Sent: Tuesday 2nd July 2013 18:32 To: user@nutch.apache.org; d...@nutch.apache.org Subject: [ANNOUNCE] Apache Nutch v2.2.1 Released Good Afternoon Everyone, The Apache Nutch PMC are very pleased to announce the immed

Re: no digest field avaliable

2013-07-03 Thread Christian Nölle
On 02.07.2013 22:29, Sebastian Nagel wrote: Which Solr version is used? Sorry, forgot about that bit: 3.6.1 -- -c

Re: no digest field avaliable

2013-07-03 Thread Christian Nölle
On 02.07.2013 22:29, Sebastian Nagel wrote: no field "digest" showing up in the indexchecker That's correct to some extent. The class of indexchecker is called IndexingFiltersChecker and it shows the fields added by the configured IndexingFilters. The field digest is added as a field by the c