RE: Language specifications

2010-04-22 Thread Arkadi.Kosmynin
Hi Joshua, > -Original Message- > From: Joshua J Pavel [mailto:jpa...@us.ibm.com] > Sent: Friday, 23 April 2010 6:57 AM > To: nutch-user@lucene.apache.org > Subject: Language specifications > > > Alternate question... thanks to everyone who has tried to help me > through > the hadoop/AIX

RE: Is there some arbitrary limit on content stored for use by summaries?

2010-04-21 Thread Arkadi.Kosmynin
Hi Tim, I would think that this parameter is related to the problem you describe, but the default value should allow indexing pages of the size you mention. Did you change this parameter? Regards, Arkadi indexer.max.tokens 1 The maximum number of tokens that will be indexed for
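For reference, this setting is normally overridden in conf/nutch-site.xml. A minimal sketch of such an override (the value 50000 is only an illustration, not a recommendation; check nutch-default.xml for the shipped default):

  <property>
    <name>indexer.max.tokens</name>
    <value>50000</value>
    <description>The maximum number of tokens that will be indexed for a
    single field in a document.</description>
  </property>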

RE: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-20 Thread Arkadi.Kosmynin
What is in your regex-urlfilter.txt? > -Original Message- > From: joshua paul [mailto:jos...@neocodesoftware.com] > Sent: Wednesday, 21 April 2010 9:44 AM > To: nutch-user@lucene.apache.org > Subject: nutch says No URLs to fetch - check your seed list and URL > filters when trying to index
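For anyone seeing the same "No URLs to fetch" message: if no '+' rule in regex-urlfilter.txt matches your seed URLs, they are all filtered out and nothing is fetched. A minimal sketch of a site-restricted filter file (the fmforums.com pattern is only an illustration taken from the thread subject):

  # skip file:, ftp: and mailto: URLs
  -^(file|ftp|mailto):
  # skip URLs containing characters that usually mark queries or sessions
  -[?*!@=]
  # accept pages from the target site; URLs matched by no rule are rejected
  +^http://([a-z0-9-]*\.)*fmforums\.com/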

RE: Question about crawler.

2010-04-20 Thread Arkadi.Kosmynin
Hi Phil, > -Original Message- > From: Phil Barnett [mailto:ph...@philb.us] > Sent: Wednesday, 21 April 2010 8:39 AM > To: nutch-user@lucene.apache.org > Subject: Question about crawler. > > Is there some place to tell why the crawler has rejected a page? I'm > trying > to get 1.1 working

RE: Hadoop Disk Error

2010-04-20 Thread Arkadi.Kosmynin
1 or even 2 GB is far from impressive. Why don't you switch hadoop.tmp.dir to a place with, say, 50GB free? Your task may succeed on Windows just because the temp space limit is different there. From: Joshua J Pavel [mailto:jpa...@us.ibm.com] Sent: Wednesday, 21 April 2010 3:40 AM To: nut

RE: fetch depth

2010-04-19 Thread Arkadi.Kosmynin
Hi Fernando, Crawling is done in iterations. At each iteration, the next portion of URLs selected for fetching is fetched. It is normal that only your seed URLs are fetched at the first iteration. See an example of a crawling script here: http://wiki.apache.org/nutch/Crawl Regards, Arkadi > -Ori

RE: Hadoop Disk Error

2010-04-19 Thread Arkadi.Kosmynin
Are you sure that you have enough space in the temporary directory used by Hadoop? From: Joshua J Pavel [mailto:jpa...@us.ibm.com] Sent: Tuesday, 20 April 2010 6:42 AM To: nutch-user@lucene.apache.org Subject: Re: Hadoop Disk Error Some more information, if anyone can help: If I turn fetcher.p

RE: Running out of disk space during segment merger

2010-04-09 Thread Arkadi.Kosmynin
Hi Yves, I am glad it helped. Wish you success. Regards, Arkadi > -Original Message- > From: Yves Petinot [mailto:y...@snooth.com] > Sent: Saturday, 10 April 2010 12:56 AM > To: nutch-user@lucene.apache.org > Subject: Re: Running out of disk space during segment merger > > Arkadi, > >

RE: Nutch segment merge is very slow

2010-04-05 Thread Arkadi.Kosmynin
Hi, > -Original Message- > From: Susam Pal [mailto:susam@gmail.com] > Sent: Tuesday, 6 April 2010 12:18 AM > To: nutch-user@lucene.apache.org > Subject: Re: Nutch segment merge is very slow > > On Mon, Apr 5, 2010 at 5:27 PM, > wrote: > > > Hi > > > > I'm using Nutch crawler in my p

RE: [VOTE] Nutch to become a top-level project (TLP)

2010-04-01 Thread Arkadi.Kosmynin
[x] +1 - yes, I vote for the proposal > -Original Message- > From: Andrzej Bialecki [mailto:a...@getopt.org] > Sent: Friday, 2 April 2010 4:24 AM > To: nutch-user@lucene.apache.org > Subject: [VOTE] Nutch to become a top-level project (TLP) > > Hi all, > > According to an earlier [DISCUS

RE: Is it necce necessary to restart Servlet/JSP container after recrawl?

2010-03-29 Thread Arkadi.Kosmynin
Try "touching" your nutch/WEB-INF/web.xml file. This should restart Tomcat Nutch application without restarting Tomcat itself. > -Original Message- > From: 段军义 [mailto:duanju...@1218.com.cn] > Sent: Monday, 29 March 2010 11:51 AM > To: nutch-user@lucene.apache.org > Subject: Is it necce n

RE: Running out of disk space during segment merger

2010-03-26 Thread Arkadi.Kosmynin
There are two solutions: 1. Write a lightweight version of the segment merger, which should not be hard if you are familiar with Hadoop. 2. Don't merge segments. If you have a reasonable number of segments, even in the hundreds, Nutch can still handle this. Regards, Arkadi > -Original Message-

RE: Running out of disk space during segment merger

2010-03-25 Thread Arkadi.Kosmynin
Hi Yves, Yes, what you got is a "normal" result. This issue is discussed every few months on this list. To my mind, the segment merger is too general. It assumes that the segments are at arbitrary stages of completion and works on this assumption. But this is not a common case at all. Mostly,

RE: invertlinks: Input path does not exist

2010-03-21 Thread Arkadi.Kosmynin
Hi Patricio, It seems to be quite a lot, but whether it is enough depends on your data size. Regards, Arkadi > -Original Message- > From: Patricio Galeas [mailto:pgal...@yahoo.de] > Sent: Sunday, March 21, 2010 1:40 AM > To: nutch-user@lucene.apache.org > Subject: AW: invertlinks: Inpu

RE: invertlinks: Input path does not exist

2010-03-18 Thread Arkadi.Kosmynin
I had similar problems caused by a lack of space in the temp directory. To solve them, I edited hadoop-site.xml and set hadoop.tmp.dir to a directory with plenty of space. > -Original Message- > From: kevin chen [mailto:kevinc...@bdsing.com] > Sent: Friday, March 19, 2010 1:42 PM > To: nutch-user@luc
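A minimal sketch of such an override in hadoop-site.xml (the path below is just a placeholder; point it at any local directory with plenty of free space):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/nutch-tmp</value>
    <description>Base directory for temporary files created by Hadoop jobs.</description>
  </property>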

Announcing release of Arch - an extension of Nutch for intranet search

2010-03-17 Thread Arkadi.Kosmynin
Hello, I have been reading this list for quite a while. This was frustrating at times because very often I thought, "If only I could release Arch now, I could help this..., and this..., and this..." But, it was not ready. Now it is ready and I am more than happy to release it. I hope it will

RE: BOOST documents at indexing

2009-10-15 Thread Arkadi.Kosmynin
Yes. See Nutch scoring filters. Arkadi > -Original Message- > From: BELLINI ADAM [mailto:mbel...@msn.com] > Sent: Friday, October 16, 2009 3:33 AM > To: nutch-user@lucene.apache.org > Subject: BOOST documents at indexing > > > hi, > > could some one tell me if it's possible to boost c

RE: nutch-1.0.war deploying error

2009-10-12 Thread Arkadi.Kosmynin
Hi, It looks like you have to upgrade your JVM. Arkadi > -Original Message- > From: nikinch [mailto:maill...@qwamci.com] > Sent: Tuesday, October 13, 2009 1:20 AM > To: nutch-user@lucene.apache.org > Subject: nutch-1.0.war deploying error > > > Hello > > I have been playing around wi

RE: Plugin development

2009-08-02 Thread Arkadi.Kosmynin
> -Original Message- > From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul > Tomblin > Sent: Friday, July 31, 2009 9:49 PM > To: nutch-user@lucene.apache.org > Subject: Re: Plugin development > > On Fri, Jul 31, 2009 at 4:33 AM, Alexander > Aristov wrote: > > What do you

RE: Why did my crawl fail?

2009-07-26 Thread Arkadi.Kosmynin
Sorry, I think you misunderstood me. I meant that no content was fetched on that iteration for the segment that does not have parse_data. > -Original Message- > From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul > Tomblin > Sent: Monday, July 27, 2009 11:12 AM > To: n

RE: Why did my crawl fail?

2009-07-26 Thread Arkadi.Kosmynin
This is a very interesting issue. I guess that the absence of parse_data means that no content has been fetched. Am I wrong? This happened in my crawls a few times. Theoretically (I am guessing again), this may happen if all URLs selected for fetching on this iteration are either blocked by the fil

RE: Merge taking forever

2009-06-04 Thread Arkadi.Kosmynin
Hi John, > -Original Message- > From: John Martyniak [mailto:j...@beforedawnsolutions.com] > Sent: Thursday, June 04, 2009 10:12 PM > To: nutch-user@lucene.apache.org > Subject: Re: Merge taking forever > > Thanks for all of the input, I was leaning towards setting up hadoop > cluster for

RE: Merge taking forever

2009-06-04 Thread Arkadi.Kosmynin
Hi Andrzej, > -Original Message- > From: Andrzej Bialecki [mailto:a...@getopt.org] > Sent: Thursday, June 04, 2009 9:47 PM > To: nutch-user@lucene.apache.org > Subject: Re: Merge taking forever > > Bartosz Gadzimski wrote: > > As Arkadi said, your hdd is to slow for 2 x quad core processo

RE: Merge taking forever

2009-06-03 Thread Arkadi.Kosmynin
Hi John, > -Original Message- > From: John Martyniak [mailto:j...@beforedawnsolutions.com] > Sent: Thursday, June 04, 2009 12:30 PM > To: nutch-user@lucene.apache.org > Subject: Re: Merge taking forever > > Hi Arkadi, > > Thanks for the info, that does sound like a good feature to have,

RE: Merge taking forever

2009-06-03 Thread Arkadi.Kosmynin
Hi John, This was my experience, too. If I've interpreted the source code correctly, the time in merging is spent on sorting, which is required because the segments are assumed to be "random", possibly containing duplicated URLs. The sort process groups URLs together and makes it possible to choose the on

Minimizing Nutch memory requirements

2009-05-24 Thread Arkadi.Kosmynin
Hello, I am having memory problems while trying to crawl a local website. I give Nutch 1GB, but still can't finish the crawl. In order to solve this problem, I want to try to keep the segment sizes limited. The sizes of produced segments vary. The first 3-5 levels are small. This is understan

Seemingly abnormal temp space use by segment merger

2009-05-12 Thread Arkadi.Kosmynin
Hello, I am trying to merge 20 segments, total size 13GB, using Nutch 1.0 segment merger on a single computer. I have 100GB free in temp partition. Still, Nutch runs out of free space on the device. This does not seem right. Is there anything I can do to reduce the use of temp space? Perhaps s

RE: Nutch fetching skipped files

2008-04-03 Thread Arkadi.Kosmynin
Hello Vineet, Try using regex-urlfilter instead of crawl-urlfilter. Regards, Arkadi > -Original Message- > From: Vineet Garg [mailto:[EMAIL PROTECTED] > Sent: Wednesday, April 02, 2008 10:34 PM > To: nutch-user@lucene.apache.org > Subject: Nutch fetching skipped files > > Hi, > I am usi

RE: Custom fields

2008-03-31 Thread Arkadi.Kosmynin
Evgeny, I doubt that it is possible to add a custom field before the crawling process. However, your problem has a solution: write an indexing plugin that will be called at the indexing stage. You can easily add a custom field at this point. You will have to put your URLs and categories into a databas

RE: fetcher failing with outofmemory exception

2008-02-08 Thread Arkadi.Kosmynin
I don't know if you are using any custom plugins at the fetching stage. I don't even know if this is possible (I don't need it). But I have had a similar experience with indexing. After a few thousand pages, Nutch would start complaining about a lack of memory. The culprit was my plugin that created

Applying patch NUTCH-573 ("multiple domains search") - which exactly Nutch version?

2008-01-16 Thread Arkadi.Kosmynin
Hi, Can anyone please point me to a version of Nutch (sources) compatible with this patch? I've tried to apply it to 0.9.0 available on a mirror (http://apache.wildit.net.au/lucene/nutch/), but patching fails. Nor could I apply it to 0.8.1. Is anyone actively using this patch? Is it stable?