Hi Thomas,

There was a fix in Tika to support *.hf files and it was for 0.9. I saw it here:
http://mail-archives.apache.org/mod_mbox/tika-dev/201103.mbox/%3c129269711.10858.1299771239436.javamail.tom...@hel.zones.apache.org%3E
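
Until a Tika release with that fix is on the File Manager's classpath, one possible stopgap is to teach the mime repository about the .h5 extension directly. The fragment below is only a sketch, written in the same mime-info format as the crawler's policy/mimetypes.xml quoted further down. It maps a plain *.h5 glob to application/x-hdf. Two things here are assumptions to check against your install: that the File Manager can be pointed at a custom mime-types file, and that application/x-hdf is the type you want.

```xml
<!-- Sketch only: assumes the File Manager loads a custom Tika mime-info file -->
<mime-info>
  <!-- application/x-hdf is the type mentioned in the original question -->
  <mime-type type="application/x-hdf">
    <!-- plain extension glob, so no isregex attribute is needed -->
    <glob pattern="*.h5"/>
  </mime-type>
</mime-info>
```

If that works, the File Manager and the crawler would still disagree on the type name (product/hdf5 on the crawler side), so it may be worth settling on one type for both files.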

The cas-filemanager 0.3 version is using tika-core-0.8.jar, and I'm not sure whether the latest OODT version is using Tika 0.9 or above.

On Tue, Feb 21, 2012 at 4:20 AM, Thomas Bennett <[email protected]> wrote:

> Hi,
>
> I see that the file manager extracts the mime type from the product
> references that are passed to it via the XML-RPC ingestProduct call.
>
> I'm ingesting hdf5 files (ext .h5) into my archive.
>
> I've captured the methodCall and here is the actual parameter that is
> passed to the File Manager on a successful ingest.
>
> <member>
>   <name>references</name>
>   ...
>   <member>
>     <name>mimeType</name>
>     <value>application/octet-stream</value>
>   </member>
>   <member>
>     <name>origReference</name>
>     <value>file:/var/kat/data/1329472755.h5</value>
>   </member>
>   ...
> </member>
>
> As you can see the mimeType is detected as application/octet-stream.
>
> This mimeType is auto-detected by the CAS-Crawler (I'm using the
> AutoDetectProductCrawler crawlerId).
>
> However, I configure the Crawler policy/mimetypes.xml:
>
> <mime-info>
>   <mime-type type="product/hdf5">
>     <glob pattern="\d{10}\.h5$" isregex="true"/>
>   </mime-type>
> </mime-info>
>
> and policy/mime-extractor-map.xml:
>
> <cas:mimetypemap xmlns:cas="http://oodt.jpl.nassa.gov/1.0/cas"
>     magic="true or false"
>     mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
>   <mime type="product/hdf5">
>     <extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
>       <config file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
>       <preCondComparators>
>         <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
>       </preCondComparators>
>     </extractor>
>   </mime>
> </cas:mimetypemap>
>
> The AutoDetectProductCrawler now uses this to detect the file and extract
> the metadata.
> However, when it comes to MimeType detection, this is done in
> the following line of code in
> org.apache.oodt.cas.filemgr.structs.Reference.java:
>
> > try {
> >     this.mimeType = mimeTypeRepository
> >             .getMimeType(new URL(origRef));
> > } catch (MalformedURLException e) {
> >     e.printStackTrace();
> > }
>
> So the mime-type is actually detected by the Tika library. Woot! So Tika
> does not seem to know about .h5 files and that they are hdf5 files.
>
> Forcing a MimeType to be "application/x-hdf" in the MetaData results in
> the mimetype being appended.
>
> MimeTypeapplication/x-hdfapplication/octet-stream applicationoctet-stream
>
> So my question: Is this okay? Do I live with the application/octet-stream?
> Any recommendations on how to fix this?
>
> Cheers,
> Tom

-- 
-Sheryl
