I'll give it a shot this week for the 1.2 branch, and for trunk if it isn't too different. It shouldn't be too hard, and Julien's explanation of how to read the configuration makes a lot of sense.
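To make the two options from the quoted thread concrete, here is a rough sketch. It is not the actual patch and it does not touch the Nutch API. The class name, the method names, and the field names `type`, `primaryType` and `subType` are all made up for illustration, and the order of the split values is arbitrary.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, only meant to illustrate the two proposed options.
class MimeTypeFields {

    // Returns the position of the '/' separator, or -1 if the string
    // does not look like "primary/sub".
    static int separator(String mimeType) {
        int slash = mimeType.indexOf('/');
        return (slash > 0 && slash < mimeType.length() - 1) ? slash : -1;
    }

    // Option 1: one multi-valued field, with a switch to disable the split.
    static List<String> singleField(String mimeType, boolean split) {
        List<String> values = new ArrayList<>();
        values.add(mimeType);                              // full type, e.g. text/plain
        int slash = separator(mimeType);
        if (split && slash != -1) {
            values.add(mimeType.substring(0, slash));      // primary type, e.g. text
            values.add(mimeType.substring(slash + 1));     // sub type, e.g. plain
        }
        return values;
    }

    // Option 2: up to three distinct single-valued fields (assumed names).
    static Map<String, String> distinctFields(String mimeType) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("type", mimeType);
        int slash = separator(mimeType);
        if (slash != -1) {
            fields.put("primaryType", mimeType.substring(0, slash));
            fields.put("subType", mimeType.substring(slash + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(singleField("text/plain", true));    // [text/plain, text, plain]
        System.out.println(singleField("text/plain", false));   // [text/plain]
        System.out.println(distinctFields("application/xml"));  // {type=application/xml, primaryType=application, subType=xml}
    }
}
```

With the split disabled, a facet on the single field would only show full types. With distinct fields, each facet would stay clean on its own.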

On Wednesday 08 September 2010 16:37:29 Mattmann, Chris A (388J) wrote:
> Hi Markus,
>
> > Interesting! But can the mime extractor return more than one type for a
> > given file in Nutch?
>
> Sure, Nutch metadata is a named Field->multi-value structure so a file (or
> piece of content) can certainly have more than 1 type.
>
> > I see, but in that case it would be helpful if the canonical, top and sub
> > types have their own field which would also give more meaning to the
> > whole. The way it works now results in a real nasty mess when faceting on
> > the type field.
>
> I hear ya! Though I guess it's a mess from your perspective. From mine, it
> is nice to be able to see things like:
>
> Mime Type:
> text (720)
> plain (77)
> text/plain (250)
> xml (235)
> ...
>
> Faceting using the primary and sub types works fine for me.
>
> > What would be a good (configurable) improvement? Just adding the option
> > to disable the split? Or also add an option that spits out up to three
> > distinct fields?
>
> I think that both of your suggestions are great improvements and we can
> include a patch to make each configurable.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350