Feature idea - Indexing Text Lengths

2006-05-07 Thread Douglas Brunner
Sorry I can't give more than an idea; I'm not a Java developer, but I think the idea could prove useful. I'm not completely sure how the spider works while indexing, but I've noticed that when indexing a site like w3schools.com, they have a lot of keywords listed in their side menus. So, if I just

Re: Feature idea - Indexing Text Lengths

2006-05-07 Thread Jérôme Charron
Sorry I can't give more than an idea; I'm not a Java developer, but I think the idea could prove useful. The idea is to limit the length of sentences that get entered into the index. So, after parsing a page, any words that don't make what appears to be a complete sentence get ignored. Douglas,

generate.max.per.host is per reduce task

2006-05-07 Thread Chris Schneider
Gang, I just noticed that the generate.max.per.host property is only enforced on a per reduce task basis during the first generate job (see Generator.Selector.reduce for details). At a minimum, it should probably be documented this way in nutch-default.xml.template. Thoughts? - Chris --
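To make "enforced on a per reduce task basis" concrete, here is a toy model in Python. It is not Nutch's Generator code: the function name `select_per_task` and the data layout are hypothetical, and whether URLs for one host can actually land in different reduce tasks depends on how the job partitions its keys, which this excerpt does not show. The sketch only illustrates what happens if each task counts hosts independently.

```python
from collections import defaultdict

def select_per_task(urls_by_task, max_per_host):
    """Apply the host cap independently inside each simulated reduce task.

    urls_by_task: list of tasks, each a list of (host, url) pairs.
    Hypothetical toy model, not the Nutch implementation.
    """
    selected = []
    for task_urls in urls_by_task:
        seen = defaultdict(int)  # host counts are local to this task only
        for host, url in task_urls:
            if seen[host] < max_per_host:
                seen[host] += 1
                selected.append((host, url))
    return selected

# Assumed scenario: one host's URLs are split across two tasks.
tasks = [
    [("a.com", "/1"), ("a.com", "/2"), ("a.com", "/3")],
    [("a.com", "/4"), ("a.com", "/5")],
]
picked = select_per_task(tasks, max_per_host=2)
```

In this assumed scenario each task admits up to two URLs for the host, so the combined selection holds more than the nominal cap. If all of a host's URLs reach the same task, the per-task cap and a global cap coincide.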

Re: generate.max.per.host is per reduce task

2006-05-07 Thread Doug Cutting
Chris Schneider wrote: I just noticed that the generate.max.per.host property is only enforced on a per reduce task basis during the first generate job (see Generator.Selector.reduce for details). At a minimum, it should probably be documented this way in nutch-default.xml.template. Yes, but

http chunked content

2006-05-07 Thread Stefan Groschupf
Hi, it looks like the http protocol plugin does not handle chunked content. :( The method readChunkedContent is never used, and readPlainContent does not handle chunked content. As far as I know, a lot of HTTP servers respond with chunked content, at least all that return dynamically generated
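For readers unfamiliar with the format, the following is a minimal sketch of decoding an HTTP/1.1 chunked body: each chunk is a hexadecimal size line, CRLF, that many bytes of data, and another CRLF, with a zero-size chunk marking the end. The function name `decode_chunked` is an illustration only; it is not the plugin's readChunkedContent, and it ignores trailer headers.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode a complete chunked message body held in memory.

    Illustrative sketch only; trailer headers after the last chunk are ignored.
    """
    out = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing chunk-size line")
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk reached
        chunk = body[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        out += chunk
        pos += size
        if body[pos:pos + 2] != b"\r\n":
            raise ValueError("chunk data not terminated by CRLF")
        pos += 2
    return bytes(out)

decoded = decode_chunked(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")
```

A real fetcher would read from a stream rather than a buffer and would also have to respect content-size limits, which this sketch leaves out.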

Re: Feature idea - Indexing Text Lengths

2006-05-07 Thread Douglas Brunner
Thanks for the link; it was an interesting read. Seems like they're overcomplicating things a bit. To me it's just a matter of counting how long a sentence is. If you look at most web pages, the sentences in their side columns are usually filler, and short, while the sentences in the main content
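The counting idea could be sketched roughly as follows. This is an assumption-laden illustration, not anything from Nutch: the function name `keep_long_sentences`, the five-word threshold, and the premise that the parser emits each page block (menu item, paragraph) on its own line are all hypothetical.

```python
import re

def keep_long_sentences(text, min_words=5):
    """Keep only sentences with at least min_words words.

    Assumes one page block per line; sentence splitting is naive
    (a sentence ends at '.', '!' or '?' followed by whitespace).
    """
    kept = []
    for block in text.splitlines():
        for sentence in re.split(r"(?<=[.!?])\s+", block.strip()):
            if len(sentence.split()) >= min_words:
                kept.append(sentence)
    return kept

page_text = (
    "HTML Tutorial\n"
    "CSS Tutorial\n"
    "This tutorial explains how HTML elements are nested inside each other. Try it!\n"
)
indexable = keep_long_sentences(page_text)
```

With this input the two short menu entries and "Try it!" fall below the threshold, and only the long sentence remains. A fixed word count would also drop short but meaningful text such as headings, so the threshold is a trade-off rather than a safe default.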