One more thing.
I added a new type, as shown below.

  <mime-type type="application/x-hwp-test">
    <magic priority="50">
      <match value="HWP Document File" type="string" offset="0"/>
    </magic>
  </mime-type>

As you can see, I used "HWP Document File" instead of "HWP Document
File V", because that is the starting signature of the HWP 5.0 file format.
However, 'java -jar tika-app-1.10.jar --detect test_5.0.hwp' still returns
'application/x-tika-msoffice'.
I tried both tika-mimetypes.xml and custom-mimetypes.xml, raising and
lowering the priority, and adding '<glob pattern="*.hwp"/>'.
Could this be caused by the bug I filed?
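
To check whether that signature really sits at offset 0, the first bytes of
the file can be inspected directly. Below is a minimal Python sketch, not
anything from Tika itself: the helper names are hypothetical, and the only
outside fact it relies on is the standard OLE2 container signature
(D0 CF 11 E0 A1 B1 1A E1). If the file turns out to begin with that
signature, a string match at offset 0 cannot succeed, which would be
consistent with the 'application/x-tika-msoffice' result.

```python
# Hypothetical helpers for inspecting the start of a file; not part of Tika.

# First 8 bytes of every OLE2 (Compound File Binary) container.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# The text signature used in the <match> rule above.
HWP_TEXT_SIGNATURE = b"HWP Document File"


def classify_header(header: bytes) -> str:
    """Report which of the two known signatures the header starts with."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(HWP_TEXT_SIGNATURE):
        return "hwp-text"
    return "unknown"


def read_header(path: str, length: int = 32) -> bytes:
    """Read the first `length` bytes of the file at `path`."""
    with open(path, "rb") as handle:
        return handle.read(length)


def describe(path: str) -> str:
    """Return a hex dump of the header together with its classification."""
    header = read_header(path)
    return f"{header.hex(' ')} -> {classify_header(header)}"
```

For example, print(describe("test_5.0.hwp")) shows the leading bytes and
whether they look like an OLE2 container or the plain text signature.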

On Thu, Sep 3, 2015 at 6:06 PM, Mungeol Heo <[email protected]> wrote:
>>> That means that the HWP file is based on the OLE2 file format, but that 
>>> no-one has told Tika about that, so detection isn't working properly. If 
>>> you could create a new bug in JIRA for this, and upload a very small HWP 
>>> file (ideally just a few KB), we can get that fixed
>
> I created a bug in JIRA which is 
> https://issues.apache.org/jira/browse/TIKA-1728
>
> On Wed, Sep 2, 2015 at 7:57 PM, Allison, Timothy B. <[email protected]> 
> wrote:
>> Great.  In the meantime, if you could open a JIRA issue and attach some 
>> example files (including the different versions), it might be helpful for 
>> the community to take a look.
>>
>> Thank you!
>>
>> -----Original Message-----
>> From: Mungeol Heo [mailto:[email protected]]
>> Sent: Tuesday, September 01, 2015 9:02 PM
>> To: [email protected]
>> Subject: Re: Does tika support "HWP"?
>>
>> Thank you for your reply.
>> I will try to write a customized parser for HWP file.
>> And if my code is "pretty enough", I will consider contributing it.
>> Again, thank you.
>>
>> On Tue, Sep 1, 2015 at 7:58 PM, Nick Burch <[email protected]> wrote:
>>> On Tue, 1 Sep 2015, Mungeol Heo wrote:
>>>>>
>>>>> java -jar tika-app-1.10.jar --list-supported-types | grep hwp
>>>>> application/x-hwp
>>>
>>>
>>> That means the mime type has been defined in some way
>>>
>>>>> java -jar tika-app-1.10.jar --detect sample.hwp
>>>>> application/x-tika-msoffice
>>>
>>>
>>> That means that the HWP file is based on the OLE2 file format, but
>>> that no-one has told Tika about that, so detection isn't working
>>> properly. If you could create a new bug in JIRA for this, and upload a
>>> very small HWP file (ideally just a few KB), we can get that fixed
>>>
>>>> And another thing is, there is no 'application/x-hwp' in the
>>>> supported formats list which are mentioned at
>>>> 'http://tika.apache.org/1.10/formats.html' page.
>>>
>>>
>>> That means there is no parser available for HWP, and you'd need to
>>> write + contribute one
>>>
>>>> So, does tika support "HWP"?
>>>
>>>
>>> Depends on your definition of "supports"!
>>>
>>> Nick
