Try to familiarise yourself with as many properties in nutch-default.xml as
you can.

The one in particular which is probably causing you bother is
http.content.limit, which truncates fetched content above a default limit.
If you increase this value you will most likely solve the problem. Please
get back to us and tell us the outcome.
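
As a rough sketch of what I mean (assuming your pages are fetched over HTTP
and that you put local overrides in conf/nutch-site.xml rather than editing
nutch-default.xml directly), something like the following should raise the
limit. The value 1048576 is only an illustrative figure, so pick whatever
suits your largest pages:

```xml
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Size limit in bytes for content downloaded over HTTP.
         1048576 (1 MB) is an example value, not a recommendation.
         A negative value such as -1 should disable truncation. -->
    <value>1048576</value>
  </property>
</configuration>
```

Since the truncation happens when the page is fetched, you will most likely
need to re-fetch and re-index the affected pages before the full content
shows up in Solr.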

On Wed, Nov 9, 2011 at 5:09 AM, jepse <[email protected]> wrote:

> Hi there,
>
> i'm new to nutch. But with all those tutorials i found a good start. Since
> i have set up a nutch-solr environment, i can crawl and parse pages and
> index them with solr. Also, i modified the schema.xml to index the crawled
> content.
>
> When i query solr, i can see a new field called "content". This field
> provides the filtered (no tags etc) content of the parsed page. So far so
> good! But when i crawl a large page, e.g. a board article, nutch indexes
> just a part of the content.
>
> But when i run the HTML-Parser Plugin in standalone, i receive the full
> content.
>
> Now my question: what configuration do i have to provide, to index the
> fully parsed content of a page?
>
> thanks for help
> Jepse
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Content-field-does-not-provied-fully-parsed-text-Why-tp3493471p3493471.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
*Lewis*
