On Thu, 16 Jun 2011, Troy Witthoeft wrote:
Thanks to your pointers, I did notice that there is a common delimiter [0A
00] that follows the ASCII text.

0x0a is \n
0x00 is null

So your strings are usually terminated with a newline, but always with a null. I'd suggest you use the \n to decide when to output a new paragraph in the XHTML, and stop on a 0x00.
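
As a rough sketch of that splitting rule, something like the following might do. ParagraphSplitter is a made-up helper (it is not in Tika or POI), and ISO-8859-1 is only a placeholder assumption until the real encoding is known:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not part of Tika or POI.
final class ParagraphSplitter {

    // Starts a new paragraph at each 0x0A and stops at the first 0x00.
    static List<String> split(byte[] data) {
        List<String> paragraphs = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0x0A || data[i] == 0x00) {
                if (i > start) {
                    // Assumption: single-byte text; swap the charset once the encoding is known
                    paragraphs.add(new String(data, start, i - start, StandardCharsets.ISO_8859_1));
                }
                if (data[i] == 0x00) {
                    return paragraphs; // null terminator: nothing after it belongs to this string
                }
                start = i + 1;
            }
        }
        if (start < data.length) {
            // No terminator seen: keep whatever text is left
            paragraphs.add(new String(data, start, data.length - start, StandardCharsets.ISO_8859_1));
        }
        return paragraphs;
    }
}
```

Each returned string would then become one p element. Empty paragraphs are skipped here, which is a choice on my part rather than anything the format requires.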

And, yes, one of the hex pairs preceding the ASCII text always translates to a larger decimal value when the text is longer. (Typically, the hex value equals the number of text characters +2)

So the length includes the terminator bytes (the newline and the null), which keeps things simple.

What I'd suggest you try next is getting/creating some files with accents in them. Start with some Western European accents (those that are in ISO-8859-1) and see how they come out. Then, try some Eastern European letters and see. Finally, some East Asian text. That should give an idea of the encoding of the text, and if it changes. (For example, some formats have a code before the text which says if it's 8-bit or 16-bit text)
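
To know what to look for in the hex dump, it can help to print the same sample text under a few candidate encodings and compare the bytes with what the file contains. A small sketch, where EncodingProbe is just an illustrative name and the three charsets are guesses at likely candidates:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper: shows how one string looks as bytes in different encodings.
final class EncodingProbe {

    // Returns the encoded bytes as space-separated upper-case hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "caf\u00E9"; // "cafe" with an e-acute
        System.out.println("ISO-8859-1: " + hex(sample, StandardCharsets.ISO_8859_1)); // 63 61 66 E9
        System.out.println("UTF-8:      " + hex(sample, StandardCharsets.UTF_8));      // 63 61 66 C3 A9
        System.out.println("UTF-16LE:   " + hex(sample, StandardCharsets.UTF_16LE));   // 63 00 61 00 66 00 E9 00
    }
}
```

If the accented letter shows up as a single E9 in the file, that points towards a single-byte encoding; C3 A9 points towards UTF-8; alternating 00 bytes suggest 16-bit text.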

Anyone have any experience parsing patterns such as these?

It shouldn't be too bad. You'll want something like:


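byte[] prefix = new byte[] { 0x33, 0x33 /* etc */ };
int pos = 0;
int read;
while( (read = inp.read()) > -1) {
   // read() returns 0-255, so mask the (signed) prefix byte before comparing
   if(read == (prefix[pos] & 0xFF)) {
     pos++;
     if(pos == prefix.length) {
       // found it! reset so we can look for the next match afterwards
       pos = 0;
       int length = inp.read();   // length of the text, including terminator
       int unknown = inp.read();  // purpose not yet known
       byte[] text = new byte[length];
       IOUtils.readFully(inp, text);

       // turn it into a string, removing null termination
       // assumes it's found to be utf-8
       String str = new String(text, 0, text.length - 1, "UTF-8");
       xhtmlHandler.startElement("p");
       xhtmlHandler.characters(str);
       xhtmlHandler.endElement("p");
     }
   } else {
     // mismatch: start over, but this byte may itself begin a new match
     pos = (read == (prefix[0] & 0xFF)) ? 1 : 0;
   }
}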

Take a look at IOUtils in Tika (and various util classes in POI) for help with bits of this. Note - code was typed straight into an email client as a guide, it may well need a little bit of work...!

Nick
