Pull request #205 was recently merged into the master branch for Nutch 1.x in 
fulfillment of NUTCH-1129, "microdata for Nutch 1.x".

I am new to Nutch and Solr and have just started crawling and indexing a few 
select websites. Using the built-in HTML parsing/indexing, I am getting 
searchable fields like url, content, host, sometimes a title, and a few other 
indexing-related fields like digest, boost, segment, and tstamp. That said, I 
realized very quickly that I need better results. While exploring the source of 
the websites, I noticed references to schema.org and got excited by what I saw. 
That’s how I stumbled upon NUTCH-1129.

I’ve built apache-nutch-1.15-SNAPSHOT, which includes the Any23 parser/indexer.

Q: Now what? How do I gain the Any23 microdata parsing / indexing capabilities 
introduced by NUTCH-1129?
Q: Do I replace parse-(html | tika)|index-(basic | anchor) in plugin.includes 
with something like parse-(html | tika | any23)|index-(basic | anchor | any23)?
Q: How do I expose the discovered microdata structure / items to a downstream 
system such as Solr? For example, what are the microdata items, and do I need 
to map them to Solr fields in solrindex-mapping.xml?

I’d also be interested to learn how to point Nutch at a specific URL and see 
what microdata it extracts (best case), then learn how to carry that through 
Nutch and finally into Solr.

Thanks for any guidance.
