Forwarding this message, originally posted on the tika-user mailing list,
as there has been no response or activity on that list.

 

From: Jana, Kumar Raja [mailto:kj...@ptc.com] 
Sent: Tuesday, January 27, 2009 8:49 PM
To: tika-u...@lucene.apache.org
Subject: Customizing Tika to parse MSProject Files

 

Hi,

 

I am trying to customize Tika-0.3-dev to parse MSProject and related
files using MPXJ <http://mpxj.sourceforge.net/>  libraries. 

 

For this, I've made the following changes in
org/apache/tika/tika-config.xml and
org/apache/tika/mime/tika-mimetypes.xml.

 

In org/apache/tika/mime/tika-mimetypes.xml:

(Added) 

  <mime-type type="application/MSProject">

    <glob pattern="*.mpp" />

    <glob pattern="*.mpd" />

    <glob pattern="*.mpx" />

  </mime-type>

 

 

In org/apache/tika/tika-config.xml:

(Added)

        <parser name="parse-msproject"
class="test.parser.microsoft.MSProjectParser">

                <mime>application/MSProject</mime>

        </parser>

 

 

I then tried to parse three different files (an .mpp, an .mpd and an .mpx
file). This is what happened:

1.     The .mpp file's mime type was identified as
application/x-tika-msoffice 

2.     The .mpd file's mime type was identified as parse-msproject

3.     The .mpx file's mime type was identified as application/xml

 

The first case seems to be caused by the following entry in
tika-mimetypes.xml:

  <mime-type type="application/x-tika-msoffice">

    <magic>

      <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8" />

    </magic>

  </mime-type>
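
A note on case 1: the eight bytes in this magic entry are the signature of
the OLE2 compound document container, which, as far as I know, .mpp files
share with .doc, .xls and .ppt files. That would explain why content-based
detection alone cannot tell them apart. Below is a minimal Java sketch for
checking whether a file starts with that signature. The class and method
names are made up for illustration and are not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not a Tika class.
final class Ole2MagicCheck {
    // The eight-byte value from the magic entry: 0xd0cf11e0a1b11ae1
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the given bytes begin with the OLE2 signature.
    static boolean startsWithOle2Magic(byte[] header) {
        if (header == null || header.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads up to eight bytes from the stream and checks them.
    static boolean startsWithOle2Magic(InputStream in) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                break; // stream ended before eight bytes were available
            }
            read += n;
        }
        return read == header.length && startsWithOle2Magic(header);
    }
}
```

If an .mpp file passes this check, then removing or changing the magic entry
would presumably also affect detection of the other Office formats that
share the same header.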

 

I could not find any mention of x-tika-msoffice in the code, other than
that it is configured to use OfficeParser, and even there nothing is done
for such mime types. 

Can I safely comment out the above entry to make it possible for my
customized Tika to parse *.mpp files? Will there be any side effects?

Or is there a workaround to make it possible for Tika to detect *.mpp
files in spite of the entry in the configuration? (perhaps by changing
the magic match value to something else)
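
One possible workaround, sketched under assumptions: leave the magic entry
alone and refine the detected type by file extension afterwards. The helper
below is hypothetical and is not a Tika API. It only shows the decision
rule, using the type names from the configuration above.

```java
import java.util.Locale;

// Hypothetical post-detection step, not part of Tika.
final class MsProjectTypeOverride {
    static final String GENERIC_OFFICE = "application/x-tika-msoffice";
    static final String MS_PROJECT = "application/MSProject";

    // Returns application/MSProject when the generic Office type was
    // detected and the file name ends in .mpp; otherwise returns the
    // detected type unchanged.
    static String refine(String detectedType, String fileName) {
        if (fileName == null) {
            return detectedType;
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (GENERIC_OFFICE.equals(detectedType) && lower.endsWith(".mpp")) {
            return MS_PROJECT;
        }
        return detectedType;
    }
}
```

For example, refine("application/x-tika-msoffice", "plan.mpp") yields
application/MSProject, while a .doc file keeps the generic type.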

 

I could not figure out why *.mpx files are being treated as XML files
rather than being parsed with the MPXJ libraries. Can someone please help
me out with this? (I am using the standard Tika configuration files with
the above-mentioned changes.)

 

Are there any parser libraries available from Apache or in Tika for parsing
MSProject files? Or would it be possible for the Tika dev team to integrate
the MPXJ libraries into Tika?

 

 

Thanks,

Kumar

 
