Hi Kneerosh,
Golden rule of posting: which Nutch and which Solr versions are you using?
Add index-more to your plugin configuration and it will get you two out of
three (content type and last modified).
As for author: if the markup is there, it is trivial.
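
A rough sketch of the mapping, not a definitive config: it assumes a Nutch 1.x
release in which index-more writes the fields "type" and "lastModified" (older
releases used other names such as "primaryType", so check MoreIndexingFilter in
your version), and it assumes your Solr schema declares content_type and
last_modified.

```xml
<!-- solrindex-mapping.xml (fragment) -->
<!-- Source names are assumptions; verify them against the index-more plugin in your Nutch release. -->
<field dest="content_type" source="type"/>
<field dest="last_modified" source="lastModified"/>
```

If the source name does not match what the indexing filter emits, the field simply never reaches Solr.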
Lewis


On Tuesday, May 7, 2013, kneerosh <[email protected]> wrote:
> Hi,
>
>   I'm crawling some sites using Nutch and indexing in Solr. I get only the
> host, tstamp, content, url in Solr.
> I also wanted content type, last modified and author.
>
> For this I changed nutch-site.xml and added index-more to plugin.includes.
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> Then in solrindex-mapping.xml, I've set
>   <field dest="content_type" source="primaryType"/>
>   <field dest="last_modified" source="date"/>
>
> But I just don't get content_type; I'm expecting text/html.
>
> How do I do this?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Passing-content-type-last-modified-from-nutch-to-solr-tp4061288.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

-- 
*Lewis*
