Hi Thomas,

There was a fix in Tika to support *.hf files, and it was targeted at 0.9.
I saw it here:
http://mail-archives.apache.org/mod_mbox/tika-dev/201103.mbox/%3c129269711.10858.1299771239436.javamail.tom...@hel.zones.apache.org%3E

cas-filemanager 0.3 uses tika-core-0.8.jar, so it likely predates that
fix. I'm not sure whether the latest OODT version uses Tika 0.9 or above.
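
If it helps, here is a rough sketch for checking which Tika jar a given
deployment actually loads. It uses only reflection, so it also runs when
Tika is not on the classpath. The class name TikaVersionCheck is made up
for illustration, and I'm assuming the jar's manifest carries an
Implementation-Version entry, which is not guaranteed.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of OODT or Tika.
class TikaVersionCheck {

    // Returns the Implementation-Version recorded in the manifest of the jar
    // that provides the given class, or empty if the class cannot be loaded
    // or the manifest has no version entry.
    static Optional<String> implementationVersion(String className) {
        try {
            Class<?> cls = Class.forName(className);
            Package pkg = cls.getPackage();
            return pkg == null
                    ? Optional.empty()
                    : Optional.ofNullable(pkg.getImplementationVersion());
        } catch (ClassNotFoundException | LinkageError e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // org.apache.tika.mime.MimeTypes lives in tika-core.
        String found = implementationVersion("org.apache.tika.mime.MimeTypes")
                .orElse("unknown (class not found, or no version in manifest)");
        System.out.println("tika-core version: " + found);
    }
}
```

Run it with the same classpath as the File Manager. Checking the jar
names in the lib directory is a simpler alternative if the manifest turns
out to be empty.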

On Tue, Feb 21, 2012 at 4:20 AM, Thomas Bennett <[email protected]> wrote:

> Hi,
>
> I see that the file manager extracts the mime type from the product
> references that are passed to it via the XML-RPC ingestProduct call.
>
> I'm ingesting hdf5 files (ext .h5) into my archive.
>
> I've captured the methodCall, and here is the actual parameter that is
> passed to the File Manager on a successful ingest.
>
> <member>
>     <name>references</name>
>        ...
>                         <member>
>                             <name>mimeType</name>
>                             <value>application/octet-stream</value>
>                         </member>
>                         <member>
>                             <name>origReference</name>
>                             <value>file:/var/kat/data/1329472755.h5</value>
>                         </member>
>        ...
> </member>
>
> As you can see the mimeType is detected as application/octet-stream.
>
> This mimeType is auto-detected by the CAS-Crawler (I'm using the
> AutoDetectProductCrawler crawlerId).
>
> However, I configured the crawler's policy/mimetypes.xml:
>
> <mime-info>
> <mime-type type="product/hdf5">
>  <glob pattern="\d{10}\.h5$" isregex="true"/>
> </mime-type>
> </mime-info>
>
> and policy/mime-extractor-map.xml:
>
> <cas:mimetypemap xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas"
> magic="true or false"
> mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
>  <mime type="product/hdf5">
> <extractor
> class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
>  <config
> file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
> <preCondComparators>
>  <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
> </preCondComparators>
>  </extractor>
> </mime>
> </cas:mimetypemap>
>
> The AutoDetectProductCrawler now uses this to detect the file and extract
> the metadata. However, when it comes to MimeType detection, this is done in
> the following line of code in
> org.apache.oodt.cas.filemgr.structs.Reference.java:
>
>
>         try {
>             this.mimeType = mimeTypeRepository
>                     .getMimeType(new URL(origRef));
>         } catch (MalformedURLException e) {
>             e.printStackTrace();
>         }
>
> So the mime-type is actually detected by the Tika library. Woot! But Tika
> does not seem to know that .h5 files are hdf5 files.
>
> Forcing a MimeType to be "application/x-hdf" in the MetaData results in
> the mimetype being appended.
>
> MimeType: application/x-hdf, application/octet-stream, application, octet-stream
>
> So my question: Is this okay? Do I live with the application/octet-stream?
> Any recommendations on how to fix this?
>
> Cheers,
> Tom
>
>
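On the regex in your mimetypes.xml: one way to sanity-check the pattern
outside the crawler is plain java.util.regex. This is only a sketch.
GlobRegexCheck is a made-up name, and I'm assuming the pattern is applied
with unanchored (find-style) matching, which I have not verified in the
Tika or OODT source.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of OODT or Tika.
class GlobRegexCheck {

    // Pattern copied from the <glob> entry in policy/mimetypes.xml.
    static final Pattern HDF5_NAME = Pattern.compile("\\d{10}\\.h5$");

    // Assumption: unanchored matching (find), not full-string matching.
    static boolean looksLikeHdf5(String nameOrUrl) {
        return HDF5_NAME.matcher(nameOrUrl).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHdf5("1329472755.h5"));                    // true
        System.out.println(looksLikeHdf5("file:/var/kat/data/1329472755.h5")); // true
        System.out.println(looksLikeHdf5("notes.txt"));                        // false
    }
}
```

Note that Reference.java hands the full URL to the repository. If the
library matched the whole string instead (Matcher.matches()), the URL form
above would not match this pattern, so that may be worth checking too.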


-- 
-Sheryl
