In the absence of an Any23 plugin for Nutch to parse the metadata, I would probably parse the file manually using external APIs and add the desired fields to the index, much like what Florian does in this tutorial: http://florianhartl.com/nutch-plugin-tutorial.html

I must say, though, that I am really new to Nutch and to Java programming itself, as I come from a Python and C background, so what I have in mind could be ugly, unscalable and naive from a Java perspective.
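To make "parse the file manually" a bit more concrete, here is a rough sketch using only the JDK. It is not a Nutch plugin and not Any23: the class name MicrodataSketch and the method extractFields are made up for illustration, and the regular expressions only handle double-quoted attributes, so a real implementation would want a proper HTML parser. It simply collects itemprop values (for example "description" or "icon") into a map of field name to value, which is roughly what would then be added to the index document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch, Tika or Any23.
class MicrodataSketch {

    // Matches an opening tag that carries itemprop="..." plus the text that follows it.
    // Assumption: attribute values are double-quoted.
    private static final Pattern TAG = Pattern.compile(
            "<(\\w+)\\b([^>]*\\bitemprop\\s*=\\s*\"([^\"]+)\"[^>]*)>([^<]*)",
            Pattern.CASE_INSENSITIVE);

    // Attributes that hold the property value on meta, link/a and img elements.
    private static final Pattern VALUE_ATTR = Pattern.compile(
            "\\b(content|href|src)\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns itemprop name -> value, keeping the first value seen for each name.
    static Map<String, String> extractFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            String name = tag.group(3).trim();
            Matcher attr = VALUE_ATTR.matcher(tag.group(2));
            // Prefer content/href/src; otherwise fall back to the element's text.
            String value = attr.find() ? attr.group(2) : tag.group(4).trim();
            if (!value.isEmpty()) {
                fields.putIfAbsent(name, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<head><meta itemprop=\"description\" content=\"A test page\">"
                + "<link itemprop=\"icon\" href=\"/favicon.ico\"></head>";
        // Each entry here is a candidate extra index field.
        extractFields(html).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In a plugin along the lines of the tutorial, the resulting map would presumably be filled in at parse time and copied into the index document by an indexing filter. I have left that wiring out because I am not sure of the exact Nutch interfaces.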
Prasanna

On Wed, Jul 4, 2012 at 4:21 PM, Lewis John Mcgibbney <[email protected]> wrote:
> 1) Do you have any thoughts on how to store the extracted microdata in
> such a way? We can do it in the metadata but I am not yet sure about
> the size that the metadata would need to grow to in order to accommodate
> a whole host of triples/structured content?
> 2) For me the problem is visualizing how to make the triple model fit
> into the current model which is present with most other metadata... I
> need to be honest and say that I am kinda lost here
>
> Lewis
>
> On Wed, Jul 4, 2012 at 5:52 PM, Prasanna. Suman
> <[email protected]> wrote:
> > Hi Lewis,
> > Thanks for working on this. Basically I want to be able to parse and
> > index HTML5 microdata and RDF to be able to display rich snippets in my
> > search results. So, for instance, alongside "content", "title" and
> > "url", I want to be able to store, index and display "description" or
> > "icon", which a lot of sites provide through microdata.
> >
> > Prasanna
> >
> > On Wed, Jul 4, 2012 at 6:32 AM, Lewis John Mcgibbney
> > <[email protected]> wrote:
> >>
> >> Hi Prasanna,
> >>
> >> I began working on this and have a half-baked patch in my local
> >> checkout of Nutch trunk.
> >>
> >> Basically my own idea of the plugin was for it to be a Tika-wrapped
> >> parser or else an HtmlParseFilter; however, the problem I was having
> >> was when trying to visualize what we do with the extracted content.
> >> The triple model does not fit well with Solr, so I was therefore
> >> thinking along the lines of having a separate indexing filter which
> >> simply fires the extracted triple content to a triple store... Jena
> >> TDB?
> >>
> >> What was your query about? I am interested, but I also think a better
> >> job could be made of the plugin if I got some input from others with
> >> use case(s).
> >>
> >> Best
> >> Lewis
> >>
> >> On Wed, Jul 4, 2012 at 12:09 AM, Prasanna. Suman
> >> <[email protected]> wrote:
> >> > Is Any23 already integrated into Tika as planned? If not, is it on
> >> > the way?
> >> >
> >> > --
> >> > Prasanna Suman
> >> >
> >> > #"Any program is only as good as it is useful." - Linus Torvalds
> >>
> >> --
> >> Lewis
> >
> > --
> > Prasanna Suman
> >
> > #"Any program is only as good as it is useful." - Linus Torvalds
>
> --
> Lewis

--
Prasanna Suman

#"Any program is only as good as it is useful." - Linus Torvalds

