Re: INTEGRATION OF NUTCH AND SOLR

2013-07-02 Thread Lewis John Mcgibbney
Hi Avilash, it is extremely difficult to comment here. We need information on what's actually happening; your description is a bit of a black box. Can you please look in hadoop.log and the Solr logs as well? This will give you an indication of how many documents were written to Solr. Thank you
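For example, a quick check along those lines (a sketch, assuming the default log4j setup writing to $NUTCH/logs/hadoop.log):

    # look for the indexer's own log lines, which report document counts
    grep -i "SolrIndexer" logs/hadoop.log
    # or scan everything Solr-related near the end of the run
    grep -i solr logs/hadoop.log | tail -20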

INTEGRATION OF NUTCH AND SOLR

2013-07-02 Thread Avilash Kumar
Hi, I followed the tutorial on the Nutch website. I am using Nutch 1.6 with Solr 3.6. Everything went well till the end, but when I passed a query it gave me no results. Need help. Thanks and regards, Avilash
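For reference, the tutorial flow for this setup boils down to something like the following sketch (the seed directory, crawl directory, depth/topN values, and Solr URL are examples and must match your installation):

    # crawl a few levels deep and index the results straight into Solr
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50 \
        -solr http://localhost:8983/solr/

If this completes but queries return nothing, hadoop.log should show whether any documents were actually sent to Solr.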

Re: Nutch scalability tests

2013-07-02 Thread Lewis John Mcgibbney
Hi, On Tue, Jul 2, 2013 at 3:53 PM, h b wrote: > So, I tried this with the generate.max.count property set to 5000, rebuild > ant; ant jar; ant job and reran fetch. > It still appears the same, first 79 reducers zip through and the last one > is crawling, literally... > Sorry I should have been

Re: Nutch scalability tests

2013-07-02 Thread h b
So, I tried this with the generate.max.count property set to 5000, rebuilt (ant; ant jar; ant job) and reran fetch. It still appears the same: the first 79 reducers zip through and the last one is crawling, literally... As for the logs, I mentioned on one of my earlier threads that when I run from the d

Re: Nutch scalability tests

2013-07-02 Thread Lewis John Mcgibbney
Hi, please try *http://s.apache.org/mo*, specifically the generate.max.count property. Many, many URLs are unfetched here... Look into the logs and see what is going on. This is really quite bad, and there is most likely one or a small number of reasons which ultimately determine why so many URLs are unf
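The property goes in nutch-site.xml; a minimal sketch (5000 is only an example value, and generate.count.mode controls whether the cap applies per host or per domain):

    <property>
      <name>generate.max.count</name>
      <value>5000</value>
      <!-- max URLs per host/domain in one fetch list; -1 means no limit -->
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>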

Re: Nutch scalability tests

2013-07-02 Thread h b
Hi, I seeded 4 URLs, all in the same domain. I am running fetch with 20 threads and 80 numTasks. The reducer is stuck on the last reduce. I ran a dump of the readdb to see the status, and I see 122K of the total 133K URLs are 'status_unfetched'. This is after 12 hours. The delay between fetches is
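For context, those status counts come from the readdb tool, e.g. (crawl/crawldb is an example path):

    # summary of URL statuses in the crawl db
    bin/nutch readdb crawl/crawldb -stats
    # or dump the individual records for closer inspection
    bin/nutch readdb crawl/crawldb -dump crawldb_dump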

RE: no digest field available

2013-07-02 Thread Markus Jelsma
I've got a version of the indexchecker that does that, as well as providing a telnet server. I was just thinking of opening an issue about that this afternoon! -Original message- > From:Sebastian Nagel > Sent: Tuesday 2nd July 2013 22:29 > To: user@nutch.apache.org > Subject: Re: no digest field available

Re: no digest field available

2013-07-02 Thread Sebastian Nagel
Hi Christian, > no field "digest" showing up in the indexchecker That's correct, to some extent. The indexchecker class is called IndexingFiltersChecker, and it shows the fields added by the configured IndexingFilters. The field digest is added by the class IndexerMapReduce. The digest
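For reference, the checker runs the configured indexing filters against a single URL (the URL below is an example), so fields added later by IndexerMapReduce, such as digest, will not appear in its output:

    bin/nutch indexchecker http://www.example.com/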

Re: [ANNOUNCE] Apache Nutch v2.2.1 Released

2013-07-02 Thread Julien Nioche
Great stuff! Thanks, Lewis. On 2 July 2013 17:32, Lewis John Mcgibbney wrote: > Good Afternoon Everyone, > > The Apache Nutch PMC are very pleased to announce the immediate release of > Apache Nutch v2.2.1. We advise all current users and developers of the 2.X > series to upgrade to this release ASAP

Re: Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Julien Nioche
Neither. You leave it in $NUTCH/conf and compile a job file with 'ant job', which gets used from runtime/deploy/bin. BTW, new users should at least do the basic Hadoop tutorial. On 2 July 2013 16:23, Lewis John Mcgibbney wrote: > I'm assuming that if you're running on an established hadoop cluster y
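A sketch of that workflow, assuming a stock Nutch 1.x source checkout (if runtime/deploy is stale, 'ant runtime' rebuilds it as well):

    # edit conf/nutch-site.xml in the source tree, then rebuild the job file
    cd $NUTCH
    ant job
    # commands run from here submit the rebuilt .job to the Hadoop cluster
    runtime/deploy/bin/nutch readdb crawl/crawldb -stats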

Re: Running multiple nutch jobs to fetch a same site with millions of pages

2013-07-02 Thread alxsss
You can decrease fetcher.server.delay. Another way is to split the storage table and run many instances of Nutch. However, if you do not own the server where the crawled domain is hosted, you could be blocked, since frequent requests might be interpreted as a DoS attack. HTH. Alex. -Origina
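A sketch of that override in nutch-site.xml (the value is illustrative; lower is faster but less polite to the server):

    <property>
      <name>fetcher.server.delay</name>
      <value>1.0</value>
      <!-- seconds to wait between successive requests to the same server -->
    </property>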

[ANNOUNCE] Apache Nutch v2.2.1 Released

2013-07-02 Thread Lewis John Mcgibbney
Good Afternoon Everyone, The Apache Nutch PMC are very pleased to announce the immediate release of Apache Nutch v2.2.1. We advise all current users and developers of the 2.X series to upgrade to this release ASAP. Apache Nutch is an open source web-search software project. Stemming from Apache L

[RESULT] WAS Re: [VOTE] Apache Nutch 2.2.1 RC#1

2013-07-02 Thread Lewis John Mcgibbney
Sorry team, this should have been a [RESULT] thread. Thanks Lewis On Tue, Jul 2, 2013 at 9:08 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > In the famous words of Truman good morning, good afternoon, good evening > and good night... to all Nutch'ers!!! > > I would like to bring

Re: [VOTE] Apache Nutch 2.2.1 RC#1

2013-07-02 Thread Lewis John Mcgibbney
In the famous words of Truman: good morning, good afternoon, good evening and good night... to all Nutch'ers!!! I would like to bring this thread to a close and formally end the voting. The vote tally is as follows: [ ] +1, let's get it released!!! Markus Jelsma Tejas Patil Chris A Mattmann Feng Lu R

Re: no digest field available

2013-07-02 Thread Christian Nölle
On 02.07.2013 17:19, Lewis John Mcgibbney wrote: Which version of Nutch are you using, please? Sorry, totally forgot to mention. Tested with 1.5.1 and 1.7. -- -c

Re: Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Lewis John Mcgibbney
I'm assuming that if you're running on an established Hadoop cluster, you will wish to keep it over there. Lewis On Tuesday, July 2, 2013, Sznajder ForMailingList wrote: > Thanks > Can I copy this file to my $NUTCH/conf directory, or must I keep it in the > $HADOOP/conf directory? > > Benjamin > > >

Re: no digest field available

2013-07-02 Thread Lewis John Mcgibbney
Which version of Nutch are you using, please? On Tuesday, July 2, 2013, Christian Nölle wrote: > Hi everybody, > > I have a problem concerning solrdedup. We have a field digest in Solr, and the solrindex-mapping for digest is fine as well, but there is no field "digest" showing up in the indexchecker and thus

Re: Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Sznajder ForMailingList
Thanks. Can I copy this file to my $NUTCH/conf directory, or must I keep it in the $HADOOP/conf directory? Benjamin On Tue, Jul 2, 2013 at 5:10 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > In mapred-site.xml. > It is your MapReduce configuration override. > hth > > On Tuesday, J

Re: Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Lewis John Mcgibbney
In mapred-site.xml. It is your MapReduce configuration override. hth On Tuesday, July 2, 2013, Sznajder ForMailingList wrote: > Thanks a lot Markus! > > Where do we define this parameter, please? > > Benjamin > > > On Tue, Jul 2, 2013 at 4:28 PM, Markus Jelsma wrote: > >> Hi, >> >> Increase your m

Re: Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Sznajder ForMailingList
Thanks a lot Markus! Where do we define this parameter, please? Benjamin On Tue, Jul 2, 2013 at 4:28 PM, Markus Jelsma wrote: > Hi, > > Increase your memory in the task trackers by setting your Xmx in > mapred.map.child.java.opts. > > Cheers > > > > -Original message- > > From:Sznajder

RE: Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Markus Jelsma
Hi, Increase your memory in the task trackers by setting your Xmx in mapred.map.child.java.opts. Cheers -Original message- > From:Sznajder ForMailingList > Sent: Tuesday 2nd July 2013 15:25 > To: user@nutch.apache.org > Subject: Distributed mode and java/lang/OutOfMemoryError > >
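A sketch of that override in mapred-site.xml (the heap size is only an example; the analogous reduce-side property can be raised the same way):

    <property>
      <name>mapred.map.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>
    <property>
      <name>mapred.reduce.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>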

Distributed mode and java/lang/OutOfMemoryError

2013-07-02 Thread Sznajder ForMailingList
Hi, I am running Nutch 1.7 on a cluster of 6 nodes. I attempted to launch the bin/crawl script in this configuration and I am getting a very strange error (one I did not get in local mode): 13/07/02 16:04:23 INFO fetcher.Fetcher: Fetcher Timelimit set for : 1372781063368 13/07/02 16:04:2

RE: Nutch scalability tests

2013-07-02 Thread Markus Jelsma
Hi, Nutch can easily scale to many, many billions of records; it just depends on how many nodes you have and how powerful they are. Crawl speed is not very relevant, as it is always very fast; the problem usually is updating the databases. If you spread your data over more machines you will increase your

no digest field available

2013-07-02 Thread Christian Nölle
Hi everybody, I have a problem concerning solrdedup. We have a field digest in Solr, and the solrindex-mapping for digest is fine as well, but there is no field "digest" showing up in the indexchecker and thus not in our Solr when performing a real crawl. Is there anything missing? Are we missing a cru
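For reference, the dedup step in question is invoked against the Solr core like this (the URL is an example):

    bin/nutch solrdedup http://localhost:8983/solr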