Hi there,

I'm new to Nutch, but the tutorials I found gave me a good start. I have
set up a Nutch-Solr environment, and I can crawl and parse pages and index
them with Solr. I also modified the schema.xml to index the crawled
content.

When I query Solr, I can see a new field called "content". This field
provides the filtered (no tags etc.) content of the parsed page. So far so
good! But when I crawl a large page, e.g. a board article, Nutch indexes
just a part of the content.

But when I run the HTML parser plugin standalone, I receive the full
content.

Now my question: what configuration do I have to provide to index the fully
parsed content of a page?

Thanks for your help,
Jepse

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Content-field-does-not-provied-fully-parsed-text-Why-tp3493471p3493471.html
Sent from the Nutch - User mailing list archive at Nabble.com.
