Re: Nutch Any23 plugin

Prasanna. Suman Wed, 04 Jul 2012 13:51:57 -0700

In the absence of an Any23 Nutch plugin to parse metadata, I would probably
parse the file manually using external APIs and add the desired fields to the
index, much like what Florian does in this tutorial:
http://florianhartl.com/nutch-plugin-tutorial.html . I must say, though, that
I am really new to Nutch and to Java programming itself, as I come from a
Python and C background, so what I am thinking could be really ugly,
unscalable and naive from a Java perspective.
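
A minimal sketch of the manual parsing idea above, in plain Java with no Nutch
or Any23 calls. It assumes the page exposes its microdata as
<meta itemprop="..." content="..."> tags with the attributes in exactly that
order; the class name MicrodataFields and the sample HTML are hypothetical, and
a regex is only a stand-in for a real HTML parser. In an actual plugin the
resulting map would still have to be copied into the index document by an
indexing filter, roughly as the tutorial does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Any23.
public class MicrodataFields {

    // Assumption: only <meta itemprop="name" content="value"> is handled,
    // with double-quoted attributes in this order. Anything else is ignored.
    private static final Pattern META_ITEMPROP = Pattern.compile(
            "<meta\\s+itemprop=\"([^\"]+)\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Collects every itemprop into a multi-valued field map, keeping the
    // order in which the property names first appear in the page.
    public static Map<String, List<String>> extract(String html) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        Matcher m = META_ITEMPROP.matcher(html);
        while (m.find()) {
            fields.computeIfAbsent(m.group(1), k -> new ArrayList<>())
                  .add(m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample page for illustration only.
        String html = "<html><head>"
                + "<meta itemprop=\"description\" content=\"A sample page\">"
                + "<meta itemprop=\"icon\" content=\"http://example.com/icon.png\">"
                + "</head><body></body></html>";
        System.out.println(extract(html));
    }
}
```

Keeping a list of values per property name means repeated properties are not
lost, which is closer to how multi-valued index fields behave than a plain
name-to-string map would be.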


Prasanna

On Wed, Jul 4, 2012 at 4:21 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> 1) Do you have any thoughts on how to store the extracted microdata in
> such a way? We can do it in the metadata, but I am not yet sure about
> the size the metadata would need to grow to in order to accommodate a
> whole host of triples/structured content.
> 2) For me the problem is visualizing how to make the triple model fit
> into the current model, which is present with most other metadata... I
> need to be honest and say that I am kinda lost here.
>
> Lewis
>
> On Wed, Jul 4, 2012 at 5:52 PM, Prasanna. Suman
> <[email protected]> wrote:
> > Hi Lewis,
> > Thanks for working on this. Basically I want to be able to parse and
> > index HTML5 microdata and RDF to be able to display rich snippets in my
> > search results. So, for instance, alongside "content", "title" and "url",
> > I want to be able to store, index and display "description" or "icon",
> > which a lot of sites provide through microdata.
> >
> > Prasanna
> >
> >
> > On Wed, Jul 4, 2012 at 6:32 AM, Lewis John Mcgibbney
> > <[email protected]> wrote:
> >>
> >> Hi Prasanna,
> >>
> >> I began working on this and have a half-baked patch in my local
> >> checkout of Nutch trunk.
> >>
> >> Basically my own idea of the plugin was for it to be a Tika-wrapped
> >> parser or else an HtmlParseFilter; however, the problem I was
> >> having was trying to visualize what we do with the extracted
> >> content.
> >> The triple model does not fit well with Solr, so I was therefore
> >> thinking along the lines of having a separate IndexingFilter which
> >> simply fires the extracted triple content to a triple store... Jena
> >> TDB?
> >>
> >> What was your query about? I am interested, but I also think a better
> >> job could be made of the plugin if I got some input from others with
> >> use case(s).
> >>
> >> Best
> >> Lewis
> >>
> >> On Wed, Jul 4, 2012 at 12:09 AM, Prasanna. Suman
> >> <[email protected]> wrote:
> >> > Is Any23 already integrated into Tika as planned? If not, is it on the
> >> > way?
> >> >
> >> > --
> >> > --
> >> > --
> >> > Prasanna Suman
> >> >
> >> > #"Any program is only as good as it is useful." - Linus Torvalds
> >>
> >>
> >>
> >> --
> >> Lewis
> >
> >
> >
> >
> > --
> > --
> > --
> > Prasanna Suman
> >
> > #"Any program is only as good as it is useful." - Linus Torvalds
> >
> >
> >
>
>
>
> --
> Lewis
>



-- 
-- 
-- 
Prasanna Suman

#"Any program is only as good as it is useful." - Linus Torvalds
