Re: why nutch 1.4 don't set the origin html content field in solrindexer

Cube Agen Wed, 28 Dec 2011 06:57:06 -0800

Thanks, that is my question.

If I want to make an HTML snapshot, how should I do it? Should I modify
SolrIndexer and IndexerMapReduce?



2011/12/28 Marek Bachmann <[email protected]>

> Hey ho,
>
> I think the question was why only the PARSED content is in the content
> field.
>
> As I understand it, Cube wants the raw page content to be
> stored and/or indexed.
>
> Cube, what do you need the raw content for? It is possible to add it
> to Solr, even to index it in the content field. But I am not sure
> whether that makes sense, because I don't know what you want to do. :)
>
> On 28.12.2011 15:35, Markus Jelsma wrote:
> > Check your Solr schema; it's likely set not to store.
> >
> >> When I use the solrindex command to post the crawled content, I find
> >> that the content field holds the parsed text. Why is there no raw
> >> content field?
>
>
