The http.content.limit property is applied to the raw content as fetched over
the wire, so if a page has a lot of non-text at the top, the truncated content
may contain no usable text at all. It would be better to truncate in an
indexing filter, but that requires some coding.
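
For reference, a minimal sketch of the override being discussed, as it would
go in nutch-site.xml. The value shown is only an illustration, not a
recommendation. The limit is in bytes of fetched content, and a negative
value turns truncation off.

```xml
<!-- nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- illustrative value: bytes of raw HTTP content kept per page -->
  <value>65536</value>
</property>
```

Since the cut happens before parsing, the number of bytes kept says nothing
about how much text the parser ends up with.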
-----Original message-----
> From:Lewis John Mcgibbney <[email protected]>
> Sent: Friday 10th January 2014 16:36
> To: [email protected]
> Subject: Re: Content Field
>
> Hi Luis and d_k,
>
> On Fri, Jan 10, 2014 at 3:10 PM, <[email protected]> wrote:
>
> >
> > One way is to use a copyField [0] in Solr and limit its length using the
> > maxChars attribute, then search on the original text and return the copied
> > field, although I'm not sure how useful it will be for the end user.
> >
> Yes, you could use this. However, if you know that you DO NOT require
> anything over a certain character threshold (and that this is NOT going to
> come back and bite you in the future), then I would suggest using the
> http.content.limit property override in nutch-site.xml.
> This will limit the webpage content you fetch, parse and send to be
> indexed. It would be more efficient, as opposed to fetching it, parsing it
> and NOT using it later on... the latter seems a bit of a waste of time and
> resources.
> hth
> Lewis
>
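
For completeness, the copyField approach from the quoted message could look
roughly like this in the Solr schema.xml. This is only a sketch: the
destination field name content_preview, its type and the 300 character limit
are assumptions for illustration, and the source field is assumed to be
called content.

```xml
<!-- schema.xml: hypothetical stored-only field holding the shortened copy -->
<field name="content_preview" type="string" indexed="false" stored="true"/>

<!-- copy at most 300 characters of the source field into it at index time -->
<copyField source="content" dest="content_preview" maxChars="300"/>
```

Queries would still run against the full content field, with content_preview
requested in the field list of the response.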