On Thu, 9 Jun 2011, Fernando Arreola wrote:
> I read through the 5 minute quick start tutorial and started following the
> steps detailed there. I noticed that the tika-mimetypes.xml file already has
> an entry which contains the afm and pfb file types.
> <mime-type type="application/x-font-type1">
>   <glob pattern="*.pfa"/>
>   <glob pattern="*.pfb"/>
>   <glob pattern="*.pfm"/>
>   <glob pattern="*.afm"/>
> </mime-type>
I have a feeling that .pfa and .pfb are the fonts themselves, and the .pfm
and .afm files are metadata about them. Can anyone confirm? If so, we
should split this entry into two.
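If it helps, a split might look roughly like the sketch below. This is only
an illustration: it assumes the guess above about .pfm/.afm being metadata
turns out to be right, and the second type name is a placeholder made up for
the example, not an agreed or registered name.

```xml
<!-- Sketch only: assumes .pfa/.pfb are the fonts themselves -->
<mime-type type="application/x-font-type1">
  <glob pattern="*.pfa"/>
  <glob pattern="*.pfb"/>
</mime-type>

<!-- Hypothetical type name, used purely for illustration -->
<mime-type type="application/x-font-type1-metrics">
  <glob pattern="*.pfm"/>
  <glob pattern="*.afm"/>
</mime-type>
```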
> Both seem to work, at least for the .pfb files, which brings me to my next
> question. I have about 9 different .afm files which I downloaded from an
> Adobe site. When I run tika on these files, one is recognized appropriately
> ("x-font-type1" in the original version and "x-font-afm" in the updated
> version), but the rest are only recognized as "text/plain". I haven't really
> added a real parser; I basically copied the one from the tutorial and
> changed the supported type to be the corresponding mime type.
Detection is separate from the parser, so I wouldn't expect some files to be
detected but not others. My best guess is that you still have the old
mimetype file around somewhere?
When you get your font parser working, it would be great if you could post
it to JIRA as an enhancement to Tika!
Nick