Hi Joshua,
> -Original Message-
> From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
> Sent: Friday, 23 April 2010 6:57 AM
> To: nutch-user@lucene.apache.org
> Subject: Language specifications
>
>
> Alternate question... thanks to everyone who has tried to help me
> through
> the hadoop/AIX
Hi Tim,
I would think that this parameter is related to the problem you describe, but
the default value should allow indexing pages of the size you mention. Did you
change this parameter?
Regards,
Arkadi
indexer.max.tokens
10000
The maximum number of tokens that will be indexed for a single field in a document.
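If you do need to raise it, an override in conf/nutch-site.xml would look
roughly like this (the value below is only an illustration, not a
recommendation):

  <property>
    <name>indexer.max.tokens</name>
    <value>50000</value>
  </property>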
What is in your regex-urlfilter.txt?
> -Original Message-
> From: joshua paul [mailto:jos...@neocodesoftware.com]
> Sent: Wednesday, 21 April 2010 9:44 AM
> To: nutch-user@lucene.apache.org
> Subject: nutch says No URLs to fetch - check your seed list and URL
> filters when trying to index
Hi Phil,
> -Original Message-
> From: Phil Barnett [mailto:ph...@philb.us]
> Sent: Wednesday, 21 April 2010 8:39 AM
> To: nutch-user@lucene.apache.org
> Subject: Question about crawler.
>
> Is there some place to tell why the crawler has rejected a page? I'm
> trying
> to get 1.1 working
1 or even 2 GB of free space is far from impressive. Why don't you switch
hadoop.tmp.dir to a place with, say, 50 GB free? Your task may succeed on
Windows simply because the temp space limit is different there.
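A hadoop.tmp.dir override in conf/hadoop-site.xml would look roughly like
this (the path is only a placeholder for whatever large volume you have):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/path/to/big/volume/hadoop-tmp</value>
  </property>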
From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
Sent: Wednesday, 21 April 2010 3:40 AM
To: nut
Hi Fernando
Crawling is done in iterations. In each iteration, the next batch of URLs
selected for fetching is fetched. It is normal that only your seed URLs are
fetched in the first iteration. See an example crawl script here:
http://wiki.apache.org/nutch/Crawl
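The core of that script is just a loop over generate/fetch/updatedb; a
stripped-down sketch, with paths and the number of iterations as assumptions
(see the wiki page for the full version):

  bin/nutch inject crawl/crawldb urls
  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    # add "bin/nutch parse $segment" here if fetcher.parse is false
    bin/nutch updatedb crawl/crawldb $segment
  done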
Regards,
Arkadi
> -Ori
Are you sure that you have enough space in the temporary directory used by
Hadoop?
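A quick way to check is to look at the free space on the partition that holds
hadoop.tmp.dir, for example (the path is just the usual default, adjust it to
your configuration):

  df -h /tmp/hadoop-$USER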
From: Joshua J Pavel [mailto:jpa...@us.ibm.com]
Sent: Tuesday, 20 April 2010 6:42 AM
To: nutch-user@lucene.apache.org
Subject: Re: Hadoop Disk Error
Some more information, if anyone can help:
If I turn fetcher.p
Hi Yves,
I am glad it helped. Wish you success.
Regards,
Arkadi
> -Original Message-
> From: Yves Petinot [mailto:y...@snooth.com]
> Sent: Saturday, 10 April 2010 12:56 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: Running out of disk space during segment merger
>
> Arkadi,
>
>
Hi,
> -Original Message-
> From: Susam Pal [mailto:susam@gmail.com]
> Sent: Tuesday, 6 April 2010 12:18 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: Nutch segment merge is very slow
>
> On Mon, Apr 5, 2010 at 5:27 PM,
> wrote:
>
> > Hi
> >
> > I'm using Nutch crawler in my p
[x] +1 - yes, I vote for the proposal
> -Original Message-
> From: Andrzej Bialecki [mailto:a...@getopt.org]
> Sent: Friday, 2 April 2010 4:24 AM
> To: nutch-user@lucene.apache.org
> Subject: [VOTE] Nutch to become a top-level project (TLP)
>
> Hi all,
>
> According to an earlier [DISCUS
Try "touching" your nutch/WEB-INF/web.xml file. This should restart Tomcat
Nutch application without restarting Tomcat itself.
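For example (the webapps path is an assumption, adjust it to your Tomcat
layout):

  touch $CATALINA_HOME/webapps/nutch/WEB-INF/web.xml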
> -Original Message-
> From: 段军义 [mailto:duanju...@1218.com.cn]
> Sent: Monday, 29 March 2010 11:51 AM
> To: nutch-user@lucene.apache.org
> Subject: Is it necce n
There are two solutions:
1. Write a lightweight version of the segment merger, which should not be hard
if you are familiar with Hadoop.
2. Don't merge segments. If you have a reasonable number of segments, even in
the hundreds, Nutch can still handle this.
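For option 2, the indexer can simply be pointed at all segments at once,
something like this (paths are assumptions, and a linkdb must already have
been built with invertlinks):

  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*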
Regards,
Arkadi
> -Original Message-
Hi Yves,
Yes, what you got is a "normal" result. This issue comes up every few months
on this list. To my mind, the segment merger is too general: it assumes that
the segments are at arbitrary stages of completion, and it works on that
assumption. But this is not a common case at all. Mostly,
Hi Patricio,
It seems to be quite a lot, but whether it is enough depends on your data size.
Regards,
Arkadi
> -Original Message-
> From: Patricio Galeas [mailto:pgal...@yahoo.de]
> Sent: Sunday, March 21, 2010 1:40 AM
> To: nutch-user@lucene.apache.org
> Subject: AW: invertlinks: Inpu
I had similar problems caused by lack of space in the temp directory. To solve
it, I edited hadoop-site.xml and set hadoop.tmp.dir to a directory with plenty
of space.
> -Original Message-
> From: kevin chen [mailto:kevinc...@bdsing.com]
> Sent: Friday, March 19, 2010 1:42 PM
> To: nutch-user@luc
Hello,
I have been reading this list for quite a while. This was frustrating at times
because very often I thought, "If only I could release Arch now, I could help
this..., and this..., and this..." But, it was not ready. Now it is ready and I
am more than happy to release it.
I hope it will
Yes. See Nutch scoring filters.
Arkadi
> -Original Message-
> From: BELLINI ADAM [mailto:mbel...@msn.com]
> Sent: Friday, October 16, 2009 3:33 AM
> To: nutch-user@lucene.apache.org
> Subject: BOOST documents at indexing
>
>
> hi,
>
> could some one tell me if it's possible to boost c
Hi,
It looks like you have to upgrade your JVM.
Arkadi
> -Original Message-
> From: nikinch [mailto:maill...@qwamci.com]
> Sent: Tuesday, October 13, 2009 1:20 AM
> To: nutch-user@lucene.apache.org
> Subject: nutch-1.0.war deploying error
>
>
> Hello
>
> I have been playing around wi
> -Original Message-
> From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul
> Tomblin
> Sent: Friday, July 31, 2009 9:49 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Plugin development
>
> On Fri, Jul 31, 2009 at 4:33 AM, Alexander
> Aristov wrote:
> > What do you
Sorry, I think you misunderstood me. I meant that no content was fetched in
that iteration for the segment that does not have parse_data.
> -Original Message-
> From: ptomb...@gmail.com [mailto:ptomb...@gmail.com] On Behalf Of Paul
> Tomblin
> Sent: Monday, July 27, 2009 11:12 AM
> To: n
This is a very interesting issue. I guess that the absence of parse_data means
that no content has been fetched. Am I wrong?
This happened in my crawls a few times. Theoretically (I am guessing again)
this may happen if all URLs selected for fetching in this iteration are either
blocked by the fil
Hi John,
> -Original Message-
> From: John Martyniak [mailto:j...@beforedawnsolutions.com]
> Sent: Thursday, June 04, 2009 10:12 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Merge taking forever
>
> Thanks for all of the input, I was leaning towards setting up hadoop
> cluster for
Hi Andrzej,
> -Original Message-
> From: Andrzej Bialecki [mailto:a...@getopt.org]
> Sent: Thursday, June 04, 2009 9:47 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Merge taking forever
>
> Bartosz Gadzimski wrote:
> > As Arkadi said, your hdd is to slow for 2 x quad core processo
Hi John,
> -Original Message-
> From: John Martyniak [mailto:j...@beforedawnsolutions.com]
> Sent: Thursday, June 04, 2009 12:30 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Merge taking forever
>
> Hi Arkadi,
>
> Thanks for the info, that does sound like a good feature to have,
Hi John,
This was my experience, too. If I've interpreted the source code correctly, the
time in merging is spent on sorting, which is required because the segments are
assumed to be "random", possibly containing duplicated URLs. The sort process
groups URLs together and allows to choose the on
Hello,
I am having memory problems while trying to crawl a local website. I give Nutch
1 GB, but it still can't finish the crawl. To work around this, I want to try
to keep the segment sizes limited.
The sizes of produced segments vary. The first 3-5 levels are small. This is
understan
Hello,
I am trying to merge 20 segments, 13 GB in total, using the Nutch 1.0 segment
merger on a single computer. I have 100 GB free in the temp partition. Still,
Nutch runs out of free space on the device. This does not seem right.
Is there anything I can do to reduce the use of temp space? Perhaps s
Hello Vineet,
Try using regex-urlfilter instead of crawl-urlfilter.
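For reference, conf/regex-urlfilter.txt takes one rule per line, '+' to accept
and '-' to reject, and the first matching rule wins; for example (the domain is
only a placeholder):

  # skip URLs ending in common binary suffixes
  -\.(gif|jpg|png|zip|gz)$
  # accept everything under the site being crawled
  +^http://([a-z0-9]*\.)*example\.com/
  # reject everything else
  -.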
Regards,
Arkadi
> -Original Message-
> From: Vineet Garg [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 02, 2008 10:34 PM
> To: nutch-user@lucene.apache.org
> Subject: Nutch fetching skipped files
>
> Hi,
> I am usi
Evgeny,
I doubt that it is possible to add a custom field before the crawling process.
However, your problem has a solution: write an indexing plugin that will be
called at the indexing stage. You can easily add a custom field at this point.
You will have to put your URLs and categories into a database
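A minimal sketch of such a plugin, written against the Nutch 1.x
IndexingFilter interface (exact method signatures vary between Nutch versions,
and the in-memory map standing in for the URL-to-category database is
hypothetical):

  // Sketch only: adds a "category" field at indexing time.
  package org.example.nutch;

  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  public class CategoryIndexingFilter implements IndexingFilter {

    private Configuration conf;

    // Hypothetical stand-in for the URL -> category lookup; a real plugin
    // would load this from the file or database named in the configuration.
    private final Map<String, String> categories = new HashMap<String, String>();

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      String category = categories.get(url.toString());
      if (category != null) {
        doc.add("category", category); // becomes a searchable index field
      }
      return doc;
    }

    // Declared by the interface in some 1.x versions; harmless otherwise.
    public void addIndexBackendOptions(Configuration conf) {
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    public Configuration getConf() {
      return conf;
    }
  }

The plugin also needs the usual plugin.xml descriptor declaring an extension of
the indexing-filter extension point, and its id has to be added to
plugin.includes in nutch-site.xml.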
I don't know if you are using any custom plugins at the fetching stage. I don't
even know if this is possible (I don't need it). But I have had a similar
experience with indexing. After a few thousand pages, Nutch would start
complaining about lack of memory. The culprit was my plugin that created
Hi,
Can anyone please point me to a version of Nutch (sources) compatible
with this patch? I've tried to apply it to 0.9.0 available on a mirror
(http://apache.wildit.net.au/lucene/nutch/), but patching fails. Nor
could I apply it to 0.8.1.
Is anyone actively using this patch? Is it stable?