Re: Release planning

2011-01-04 Thread Mattmann, Chris A (388J)
(cc to d...@nutch since you are addressing devs too) Hey Andrzej: > > As you probably know, there are currently two active lines of > development for Nutch: > [...snip...] > > Regarding branch-1.2 (which is a maintenance branch after release 1.2) > there have been pretty much no updates there, if

Re: Exception on segment merging

2011-01-04 Thread Markus Jelsma
Use the hadoop.tmp.dir setting in nutch-site.xml to point to a disk where plenty of space is available. > Other users have previously reported similar problems which were due to a > lack of space on disk as suggested by this > > *Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException
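A minimal sketch of such an override, assuming the standard nutch-site.xml override file; the path is a placeholder, not taken from the thread:

```xml
<!-- nutch-site.xml (sketch): relocate Hadoop's temp/spill directory -->
<property>
  <name>hadoop.tmp.dir</name>
  <!-- /data/hadoop-tmp is a placeholder; use any volume with ample free space -->
  <value>/data/hadoop-tmp</value>
</property>
```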

Re: Exception on segment merging

2011-01-04 Thread Julien Nioche
Other users have previously reported similar problems which were due to a lack of space on disk as suggested by this *Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_32_0/outp

Re: Release planning

2011-01-04 Thread Julien Nioche
+1 from me. Today I committed a bunch of patches which were in 1.2 but not in 1.3 (just one last one to do), but I haven't compared with 2.0. Having a release based on 1.3 would be great, as it would be a nice transition towards 2.0 (delegated indexing/search, dependency management with Ivy, separati

Release planning

2011-01-04 Thread Andrzej Bialecki
Hi users & devs, As you probably know, there are currently two active lines of development for Nutch: * Nutch trunk, a.k.a. Nutch 2.0: this is based on a completely redesigned storage layer that uses Apache Gora, which in turn can use various storage implementations such as HBase, Cassandra,

Re: Exception on segment merging

2011-01-04 Thread alxsss
Which command did you use? Merging segments is very expensive in terms of resources, so I try to avoid merging them. -Original Message- From: Marseld Dedgjonaj To: user Sent: Tue, Jan 4, 2011 7:12 am Subject: FW: Exception on segment merging I looked in the hadoop log and some more details a
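For reference, segment merging in Nutch 1.x is typically driven by the mergesegs command; the paths and slice size below are placeholders for illustration, not values taken from the thread:

```
# Sketch (Nutch 1.x): merge all segments under crawl/segments,
# writing output slices of at most 50000 URLs into crawl/merged
bin/nutch mergesegs crawl/merged -dir crawl/segments -slice 50000
```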

Re: unnecessary results in search

2011-01-04 Thread alxsss
Hello, Thank you for your response. Let me give you more detail on the issue that I have. First, definitions. Let's say I have my own domain that I host on a dedicated server; call it mydomain.com. Next, call the following subdomains: answers.mydomain.com, mail.mydomain.com, maps.mydomain.com a
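As an aside, if the goal were to keep subdomains out of the crawl altogether (rather than filtering at search time), a regex-urlfilter.txt along these lines could do it; mydomain.com stands in for the real domain and the patterns are purely illustrative:

```
# Illustrative regex-urlfilter.txt fragment (first matching rule wins)
# accept the bare domain and www
+^https?://(www\.)?mydomain\.com/
# reject everything else, including other subdomains
-.
```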

FW: Exception on segment merging

2011-01-04 Thread Marseld Dedgjonaj
I looked in the hadoop log and some more details about the exception are there. Please help me figure out what to check for this error. Here are the details: 2011-01-04 07:40:23,999 INFO segment.SegmentMerger - Slice size: 5 URLs. 2011-01-04 07:40:36,563 INFO segment.SegmentMerger - Slice size: 5 URLs. 20

Re: unnecessary results in search

2011-01-04 Thread Julien Nioche
Alex, See http://wiki.apache.org/solr/FieldCollapsing for implementing this in SOLR. Since indexing and searching are delegated to SOLR as of Nutch 1.3 and 2.0, this won't be implemented directly in Nutch. HTH Julien On 4 January 2011 00:10, wrote: > Hello, > > I used nutch-1.2 to index a
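As a rough illustration of field collapsing (assuming the index carries a per-document field such as site, which is not confirmed in the thread), a Solr request might look like:

```
# Illustrative only: collapse results so each site contributes one document
http://localhost:8983/solr/select?q=mydomain&group=true&group.field=site&group.limit=1
```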

Which parse-plugins.xml is being used?

2011-01-04 Thread Steve Cohen
I am using a hadoop cluster, so I am putting my conf files into the nutch source conf directory and building a nutch job file. I am then putting the job file on the classpath. I thought it was working fine, since it seems to be reading the regex-urlfilter.txt from there. However, I am getting messa
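One quick way to check this (a sketch; the job filename is a placeholder) is to list the conf files actually packaged into the job archive, since a .job file is just a zip:

```
# Sketch: confirm which config files were bundled into the job file
unzip -l nutch-1.2.job | grep -E 'parse-plugins\.xml|regex-urlfilter\.txt'
```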

Exception on segment merging

2011-01-04 Thread Marseld Dedgjonaj
Hello everybody, I have configured nutch-1.2 to crawl all urls of a specific website. It runs fine for a while, but now that the number of indexed urls has grown to more than 30,000, I get an exception on segment merging. Has anybody seen this kind of error? The exception is shown below.

Re: unnecessary results in search

2011-01-04 Thread Gora Mohanty
On Tue, Jan 4, 2011 at 5:40 AM, wrote: > Hello, > > I used nutch-1.2 to index a few domains. I noticed that nutch correctly > crawled all sub-pages of domains. By sub-pages I mean the followings, for > example for a domain mydomain.com all links inside it like > mydomain.com/show/photos/1 and e