Forwarding this message, originally posted to the tika-user mailing list, as there has been no response or activity on that list.
From: Jana, Kumar Raja [mailto:kj...@ptc.com]
Sent: Tuesday, January 27, 2009 8:49 PM
To: tika-u...@lucene.apache.org
Subject: Customizing Tika to parse MSProject Files

Hi,

I am trying to customize Tika-0.3-dev to parse MSProject and related files using the MPXJ <http://mpxj.sourceforge.net/> libraries. For this, I have made the following changes in org/apache/tika/tika-config.xml and org/apache/tika/mime/tika-mimetypes.xml.

In org/apache/tika/mime/tika-mimetypes.xml (added):

    <mime-type type="application/MSProject">
      <glob pattern="*.mpp" />
      <glob pattern="*.mpd" />
      <glob pattern="*.mpx" />
    </mime-type>

In org/apache/tika/tika-config.xml (added):

    <parser name="parse-msproject" class="test.parser.microsoft.MSProjectParser">
      <mime>application/MSProject</mime>
    </parser>

I then tried to parse three different files (an mpp, an mpd and an mpx file). This is what happened:

1. The mpp file's mime type was identified as application/x-tika-msoffice
2. The mpd file's mime type was identified as parse-msproject
3. The mpx file's mime type was identified as application/xml

The first case seems to be caused by the following entry in tika-mimetypes.xml:

    <mime-type type="application/x-tika-msoffice">
      <magic>
        <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8" />
      </magic>
    </mime-type>

I could not find any mention of x-tika-msoffice in the code, other than that it is mapped to OfficeParser in the configuration, and even there nothing is done for such mime types. Can I safely comment out the above entry so that my customized Tika can parse *.mpp files? Will there be any side effects? Or is there a workaround that lets Tika detect *.mpp files in spite of this entry in the configuration (perhaps by changing the magic match value to something else)?

I also could not figure out why *.mpx files are parsed as XML files rather than with the MPXJ libraries. Can someone please help me out with this?

(I am using the standard Tika configuration files with the above-mentioned changes.)

Are there any parser libraries available with Apache or Tika for parsing MSProject files? Or is it possible for the Tika dev team to integrate the MPXJ libraries into Tika?

Thanks,
Kumar
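
For reference, the magic entry quoted above matches the 8-byte signature that opens every OLE2 compound document. MPP files are stored in that container format, which is presumably why the mpp file was reported as application/x-tika-msoffice regardless of the new glob patterns. The sketch below is not Tika code; the class and method names are made up for illustration. It only performs the same byte comparison by hand, so a file can be checked before deciding whether to change the magic entry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or MPXJ.
public class Ole2SignatureCheck {

    // The 8 bytes from the x-tika-msoffice magic entry: 0xd0cf11e0a1b11ae1
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xd0, (byte) 0xcf, 0x11, (byte) 0xe0,
        (byte) 0xa1, (byte) 0xb1, 0x1a, (byte) 0xe1
    };

    // True if the given header bytes start with the OLE2 signature.
    static boolean hasOle2Signature(byte[] header) {
        if (header.length < OLE2_MAGIC.length) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(header, OLE2_MAGIC.length);
        return Arrays.equals(prefix, OLE2_MAGIC);
    }

    // Reads up to the first 8 bytes of a stream (fewer if the stream is shorter).
    static byte[] readHeader(InputStream in) throws IOException {
        byte[] buf = new byte[OLE2_MAGIC.length];
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                break;
            }
            off += n;
        }
        return Arrays.copyOf(buf, off);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Ole2SignatureCheck file1 [file2 ...]
        for (String name : args) {
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                boolean ole2 = hasOle2Signature(readHeader(in));
                System.out.println(name + ": " + (ole2 ? "OLE2 signature" : "no OLE2 signature"));
            }
        }
    }
}
```

Running this against the mpp, mpd and mpx test files shows which of them a magic rule on these 8 bytes would claim. Other Office formats stored in the same container carry the identical signature, so this check alone cannot tell an mpp file apart from them.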