Try to familiarise yourself with as many properties in nutch-default.xml as you can.
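As a rough sketch of how such an override usually looks: rather than editing nutch-default.xml itself, the property is copied into conf/nutch-site.xml and changed there. The fragment below does this for http.content.limit, the cap on how many bytes of a page are kept when it is fetched over HTTP. The value -1 is an assumption about what suits your crawl. As far as I recall the shipped default is 65536 bytes and a negative value turns truncation off, but please check the description in your own copy of nutch-default.xml.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of values in nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Assumed value: a negative number should disable truncation.
         A larger positive byte count (e.g. 1048576) raises the cap instead. -->
    <value>-1</value>
  </property>
</configuration>
```

Pages fetched before the change stay truncated in their segments, so they need to be fetched and parsed again before being re-indexed into Solr.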
The one in particular that is probably causing you bother is http.content.limit, which truncates fetched content above a default limit. If you increase this value you will most likely solve the problem. Please get back to us and tell us the outcome.

On Wed, Nov 9, 2011 at 5:09 AM, jepse <[email protected]> wrote:

> Hi there,
>
> i'm new to nutch. But with all those tutorials i found a good start. Since i have set up a nutch-solr environment, i can crawl and parse pages an index them with solr. Also, i modified the schema.xml to index the crawled content.
>
> When i query solr, i can see a new field called "content". This field provides the filtered (no tags etc) content of the parsed page. So far so good! But when i crawl a large page, i. e. a board article, nutch index just a part of the content.
>
> But when i run the HTML-Parser Plugin in standalone, i receive the full content.
>
> Now my question: what configuration do i have to provide, to index the fully parsed content of a page?
>
> thanks for help
> Jepse
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Content-field-does-not-provied-fully-parsed-text-Why-tp3493471p3493471.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
*Lewis*

