Re: Indexing Tika xmpDM properties

André Ricardo Tue, 17 Aug 2010 12:14:24 -0700

Hello Julien,

Thank you for your help. Using an IndexingFilter, I am now indexing the Tika
properties :)


But now I can't get Nutch's search.jsp to query the indexed fields, for
example "album:dirty". I've tried both of the search methods described at
http://wiki.apache.org/nutch/HowToMakeCustomSearch#Now.2C_how_do_I_search_my_indexed_data.3F

This is the output of the explain page:
page

   - genre = Rock
   - lastModified = 1203475391000
   - segment = 20100817195518
   - album = Dirty Wings
   - digest = c1c813ff5b309e081f02c14ef026d14c
   - tstamp = 20100817185549496
   - url = http://www.joshwoodward.com/mp3/JoshWoodward-IWantToDestroySomethingBeautiful.mp3
   - title = I Want To Destroy Something Beautiful
   - boost = 0.11624764
   - artist = Josh Woodward
   - contentLength = 4431809

score for query: dirty
(...)

0.19958043 = (MATCH) fieldWeight(album:dirty in 33), product of:

   - 1.0 = tf(termFreq(album:dirty)=1)
   - 2.5546296 = idf(docFreq=14, maxDocs=71)
   - 0.078125 = fieldNorm(field=album, doc=33)
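
The three factors in the explain output multiply to the reported score. A
minimal sketch of that arithmetic, assuming the index uses the tf and idf
formulas of Lucene's DefaultSimilarity (tf = sqrt(termFreq),
idf = 1 + ln(maxDocs / (docFreq + 1))). The class and method names are made
up for illustration, and fieldNorm is taken as given from the output rather
than recomputed:

```java
// Illustrative only: re-derives the explain numbers with plain doubles.
// Lucene itself works in float, so the last digits can differ slightly.
class FieldWeightCheck {

    // Assumed DefaultSimilarity term-frequency factor: sqrt(termFreq).
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // Assumed DefaultSimilarity inverse document frequency:
    // 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // fieldWeight is the product of tf, idf and the stored field norm.
    static double fieldWeight(int termFreq, int docFreq, int maxDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // Values copied from the explain output for album:dirty in doc 33.
        System.out.println("idf         = " + idf(14, 71));
        System.out.println("fieldWeight = " + fieldWeight(1, 14, 71, 0.078125));
    }
}
```

Under these assumptions the computed values agree with the explain page to
about six decimal places, which confirms that the match comes from the
"album" field alone.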

It's looking in the field "album", but how do I query Nutch to look only in
that field, for example to list all albums with "dirty" in the name?

Also, how does the Creative Commons CCQueryFilter work? I tried searching for
"cc:by" (http://localhost:8080/nutch-1.1/search.jsp?lang=en&query=cc%3Aby) but
could not list all CC works with "by" in the license.


Thanks again,
André Ricardo



On Thu, Aug 12, 2010 at 8:29 PM, Julien Nioche <
[email protected]> wrote:

> Hi Andre,
>
>
> > I was able to see that Tika nicely identified everything I want, such as
> > artist, album and genre, using xmpDM etc.
> >
> >
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - Getting text...
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - Getting title...
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - Getting links...
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - found 0 outlinks in http://www.joshwoodward.com/mp3/JoshWoodward-Stickybee.mp3
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - xmpDM:releaseDate: 2007
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - title: Stickybee
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - samplerate: 44100
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - xmpDM:album: Dirty Wings
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - xmpDM:artist: Josh Woodward
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - Author: Josh Woodward
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - channels: 2
> > 2010-08-12 18:59:00,656 TRACE tika.TikaParser - xmpDM:genre: Rock
> > 2010-08-12 18:59:00,657 TRACE tika.TikaParser - xmpDM:audioSampleRate: 44100
> > 2010-08-12 18:59:00,657 TRACE tika.TikaParser - xmpDM:logComment: XXXCommentshttp://www.joshwoodward.com/
> > 2010-08-12 18:59:00,657 TRACE tika.TikaParser - Content-Type: audio/mpeg
> > 2010-08-12 18:59:00,657 TRACE tika.TikaParser - version: MPEG 3 Layer III Version 1
> >
> >
> > How can I index these fields in the same way the Creative Commons parser
> > does? Shouldn't "nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));"
> > do just that?
> >
>
> The TikaParser stores the metadata returned by Tika in the ParseMetadata.
> It's not up to the parser to decide what should be indexed. This is the job
> of the IndexingFilters. What you need to do is create a new plugin
> containing an implementation of an IndexingFilter, which will inspect the
> parse metadata and generate the fields accordingly. Have a look at the
> index-* plugins or creativecommons to see examples of IndexingFilters.
>
> HTH
>
> Julien Nioche
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>
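
The approach Julien describes (inspect the parse metadata, then emit index
fields) can be sketched in plain Java. This is only an illustration of the
mapping step, not Nutch code: ordinary maps stand in for Nutch's parse
metadata and index document, the class and method names are hypothetical, and
the choice of stripping the "xmpDM:" prefix for album, artist and genre is an
assumption based on the log and explain output in this thread. A real plugin
would put the same logic inside its IndexingFilter implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the core of an indexing filter. Plain maps
// replace Nutch's parse metadata and index document so the sketch runs
// without Nutch on the classpath.
class XmpDmFieldMapper {

    // Assumed metadata prefix, as seen in the TikaParser trace log.
    static final String PREFIX = "xmpDM:";

    // Assumed selection of properties to index, as seen on the explain page.
    static final List<String> WANTED = List.of("album", "artist", "genre");

    // Copies the selected xmpDM properties into index fields named after
    // the property without its prefix; missing or empty values are skipped.
    static Map<String, String> toIndexFields(Map<String, String> parseMetadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : WANTED) {
            String value = parseMetadata.get(PREFIX + name);
            if (value != null && !value.isEmpty()) {
                fields.put(name, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample metadata copied from the trace log quoted above.
        Map<String, String> md = new LinkedHashMap<>();
        md.put("xmpDM:album", "Dirty Wings");
        md.put("xmpDM:artist", "Josh Woodward");
        md.put("xmpDM:genre", "Rock");
        md.put("samplerate", "44100");
        System.out.println(toIndexFields(md));
    }
}
```

Keeping the list of wanted properties explicit means that unrelated metadata
such as samplerate or channels never reaches the index by accident.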
