remove fields

2009-11-26 Thread Fadzi Ushewokunze
Hi all, there are 4 document fields in my index that I am not indexing anymore; I also have 4 new fields I need to add to my index, so I created a new indexing filter. How can I add these new fields while preserving the removed fields in the existing docs? At the moment when I run bin/index

Re: 100 fetches per second?

2009-11-26 Thread Otis Gospodnetic
I think in the end what Ken Krugler did with Bixo (limiting crawl time) and what Julien added in https://issues.apache.org/jira/browse/NUTCH-770 (plus https://issues.apache.org/jira/browse/NUTCH-769) are solutions to this problem, in addition to what Andrzej described below. Can you try

Encoding the content got from Fetcher

2009-11-26 Thread Santiago Pérez
Hej, I am a newbie in Nutch and I need some help with a problem because I cannot find clear documentation. In the crawling process, when each FetcherThread gets the content, it is formatted in a way which deletes the newline characters (\n) and transforms useful characters in Spanish

Re: 100 fetches per second?

2009-11-26 Thread MilleBii
Yep, I will try right after this run ends... which is likely tomorrow by the sound of it. Still, how come there is a factor 6+ difference from one run to the next? Timing hosts blocking the queue maybe, but the probability of getting one in the queue cannot be so different from one run to the next.

Broken segments ?

2009-11-26 Thread Mischa Tuffield
Hello All, I was wondering if there is any way to check the integrity of a segment? As it stands, I can't create the index I want due to a number of my segments freaking out like below: Is there any way to check if my segments are OK? I guess I could always re-fetch them if need be.

Re: Broken segments ?

2009-11-26 Thread Andrzej Bialecki
Mischa Tuffield wrote: Hello All, http://people.apache.org/~hossman/#threadhijack When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track

Re: Encoding the content got from Fetcher

2009-11-26 Thread fadzi
Hi, have you tried changing this property: parser.character.encoding.default? Hej, I am a newbie in Nutch and I need some help with a problem because I cannot find clear documentation. In the crawling process, when each FetcherThread gets the content, it is formatted in a

add parse-wml plugin to Nutch!

2009-11-26 Thread yangfeng
Hi, I have to add a parse-wml plugin to Nutch. If one has already been written, please give me some advice. Thanks!

Re: 100 fetches per second?

2009-11-26 Thread MilleBii
Interesting updates on the current run of 450K URLs: + 30 minutes @ 3 Mbit/s + drop to 1 Mbit/s (1/X shape) + gradual improvement to 1.5 Mbit/s and steady for 7 hours + sudden drop to 0.9 Mbit/s and steady for 4 hours + up to 1.7 Mbit/s for 1 hour + staircasing down to 0.5 Mbit/s by steps of 1 hour

Re: Nutch near future - strategic directions

2009-11-26 Thread Sami Siren
Andrzej Bialecki wrote: Sami Siren wrote: Lots of good thoughts and ideas, easy to agree with. Something for the ease of use category: -allow running on top of plain vanilla hadoop What does it mean plain vanilla here? Do you mean the current DB implementation? That's the idea, we should