> Do the characters come in the same place in every file? Is there a common
> header / sequence of bytes before the text? Can you spot the offset of the
> text elsewhere in the file?
>
> If you can't find a library for the files, and you can't find a spec for
> it, then you'll have to start reverse engineering the file format...
>
> Nick
>
The magic MIME type characters occur within the first six bytes (12 hex digits) of the file, yes. The user-entered text is scattered throughout the file. My files average about 300K and don't exceed 1MB in size.

I'm no expert at decoding file formats, but I did review about a dozen PRT files created with different versions of the program and by different companies. The closest thing I can find to a common header or sequence of bytes is a run of six 33 bytes and nine 00 bytes just before each text field. Viewing the file in Visual Studio, a sample header in hex would be

[33 33 33 33 33 33 E3 3F 00 00 00 00 00 00 00 00 00 01 1F 1B 00]

which is immediately followed by the bytes of the text; for instance, [43 48 45 43 4B 45 44] is the text "CHECKED". The header always stays the same except for variation in the three bytes just before the text, which are [1F 1B 00] in my example above.

So there does seem to be a pattern here to work with. I'm just not sure which tools will help me manipulate this pattern. Got any ideas?

Note: I do have access to some commercial text extractors that can extract text from this file type, but I'm looking to contribute to a Tika solution.
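
A quick way to check whether the pattern described above holds across a set of sample files is a short throwaway script. The sketch below is only a prototype, not a Tika parser. It assumes that the fixed part of the marker is exactly the 18 bytes quoted above (six 33, E3 3F, nine 00, then 01), that exactly three variable bytes follow, and that each text field is a run of printable ASCII. Those are guesses from a dozen samples, not a format specification, and the names `MARKER`, `PATTERN` and `extract_text_fields` are made up for illustration.

```python
import re

# Fixed part of the marker as observed in the samples:
# six 0x33 bytes, E3 3F, nine 0x00 bytes, then 0x01.
MARKER = bytes.fromhex("333333333333E33F" + "00" * 9 + "01")

# Assumption: three variable bytes follow the marker, then the text field
# as a run of printable ASCII characters (0x20 to 0x7E).
PATTERN = re.compile(re.escape(MARKER) + rb"(.{3})([\x20-\x7e]+)", re.DOTALL)


def extract_text_fields(data: bytes):
    """Return a list of (offset, variable_bytes, text) for each marker match."""
    results = []
    for match in PATTERN.finditer(data):
        variable_bytes = match.group(1)
        text = match.group(2).decode("ascii")
        results.append((match.start(), variable_bytes, text))
    return results
```

To try it on a real file, read the whole file as bytes (for example `open(path, "rb").read()`, which is reasonable for files under 1MB) and pass the result to `extract_text_fields`. Printing the three variable bytes next to each extracted string may help show whether they encode a length, a type, or something else.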
