Sorry I can't give more than an idea, I'm not a Java developer, but I think the
idea could prove useful.
I'm not completely sure how the spider works while indexing, but I've noticed
that when indexing a site like w3schools.com, they have a lot of keywords listed
in their side menus. So, if I just
The idea is to limit the length of sentences that get entered into the
index. So, after parsing a page, any words that don't make up what appears to
be a complete sentence get ignored.
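A minimal sketch of that kind of filter, just to make the idea concrete (hypothetical code, not anything that exists in Nutch; the five-word threshold is an arbitrary guess):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter: keep only text fragments that look like full sentences. */
public class SentenceLengthFilter {

  // Arbitrary assumption: fragments shorter than this many words are treated
  // as menu/navigation filler and skipped.
  private static final int MIN_WORDS = 5;

  /** Splits parsed page text on sentence punctuation and drops short fragments. */
  public static List<String> filter(String pageText) {
    List<String> kept = new ArrayList<String>();
    for (String fragment : pageText.split("[.!?\\n]+")) {
      String trimmed = fragment.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      int words = trimmed.split("\\s+").length;
      if (words >= MIN_WORDS) {
        kept.add(trimmed);  // looks like a real sentence: index it
      }
      // otherwise ignore it (likely a side-menu keyword or link label)
    }
    return kept;
  }
}
```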
Douglas,
Gang,
I just noticed that the generate.max.per.host property is only
enforced on a per reduce task basis during the first generate job
(see Generator.Selector.reduce for details). At a minimum, it should
probably be documented this way in nutch-default.xml.template.
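To make the consequence concrete, a simplified sketch (not the actual Generator.Selector code): each reduce task keeps its own per-host count, so the cap bounds each task's share rather than the global total, and a host whose URLs are spread across N reducers can be selected up to N * generate.max.per.host times.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch only: each reduce task keeps its own counts, so the
// cap below is enforced per reducer, not across the whole generate job.
public class PerHostCapSketch {

  private final int maxPerHost;  // corresponds to generate.max.per.host
  private final Map<String, Integer> countsInThisTask = new HashMap<String, Integer>();

  public PerHostCapSketch(int maxPerHost) {
    this.maxPerHost = maxPerHost;
  }

  // Called for each candidate URL that THIS reduce task sees.
  public boolean select(String host) {
    int seen = countsInThisTask.containsKey(host) ? countsInThisTask.get(host) : 0;
    if (maxPerHost > 0 && seen >= maxPerHost) {
      return false;  // cap reached, but only within this task
    }
    countsInThisTask.put(host, seen + 1);
    return true;
  }
}
```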
Thoughts?
- Chris
--
Chris Schneider wrote:
I just noticed that the generate.max.per.host property is only enforced
on a per reduce task basis during the first generate job (see
Generator.Selector.reduce for details). At a minimum, it should probably
be documented this way in nutch-default.xml.template.
Yes, but
Hi,
Looks like the http protocol plugin does not handle chunked content. :(
The method readChunkedContent is never used, and readPlainContent does
not handle chunked content.
As far as I know, a lot of HTTP servers respond with chunked content, at
least all that return dynamically generated
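For reference, decoding a chunked body roughly amounts to the following (a generic sketch, not the plugin's actual readChunkedContent; trailer headers and error handling are simplified):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Rough sketch of decoding a Transfer-Encoding: chunked body. */
public class ChunkedReader {

  /** Reads chunks until the terminating zero-length chunk and returns the body. */
  public static byte[] readChunked(InputStream in) throws IOException {
    ByteArrayOutputStream body = new ByteArrayOutputStream();
    while (true) {
      String sizeLine = readLine(in);
      // The chunk size is hex, possibly followed by ";extension".
      int semi = sizeLine.indexOf(';');
      int size = Integer.parseInt(
          (semi >= 0 ? sizeLine.substring(0, semi) : sizeLine).trim(), 16);
      if (size == 0) {
        break;  // last chunk; optional trailer headers would follow
      }
      byte[] chunk = new byte[size];
      int read = 0;
      while (read < size) {
        int n = in.read(chunk, read, size - read);
        if (n < 0) throw new IOException("EOF inside chunk");
        read += n;
      }
      body.write(chunk, 0, read);
      readLine(in);  // consume the CRLF that follows the chunk data
    }
    return body.toByteArray();
  }

  /** Reads a CRLF-terminated line as single-byte characters. */
  private static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.read()) != -1 && c != '\n') {
      if (c != '\r') sb.append((char) c);
    }
    return sb.toString();
  }
}
```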
Thanks for the link, it was an interesting read. Seems like they're
overcomplicating things a bit. To me it's just a matter of counting how long a
sentence is: if you look at most web pages, the sentences in their side columns
are usually filler, and short, while the sentences in the main content