Nick,

Thanks for the starter code. I'm trying to finish up the parser.
I made some changes and brought it in line with the other Tika parser
examples I have seen.
I've looked over IOUtils, however I'm a bit rusty on my Java.  By rusty I
mean inept.
The following code iterates through the bytes and manages to find the
first bit of user text.
Printing to System.err confirms this.
Unfortunately, after finding the first bit of text, it throws an exception
on line 71 (java.lang.ArrayIndexOutOfBoundsException: 2).
I've attached an example PRT file.


Note: I found a simpler prefix that delineates the start of user text.
[0x01] [0x1F]




//Imports

public class PRTParser implements Parser {

        private static final Set<MediaType> SUPPORTED_TYPES =
                        Collections.singleton(MediaType.application("prt"));
        public static final String PRT_MIME_TYPE = "application/prt";

        public Set<MediaType> getSupportedTypes(ParseContext context) {
                return SUPPORTED_TYPES;
        }
        public void parse(
                        InputStream stream, ContentHandler handler,
                        Metadata metadata, ParseContext context)
                        throws IOException, SAXException, TikaException {

                byte[] prefix = new byte[] {0x01, 0x1F};
                int pos = 0;
                int read;
                while ((read = stream.read()) > -1) {
                    if (read == prefix[pos]) {
                        pos++;
                        if (pos == prefix.length) {
                            // found it!
                            int length = stream.read();
                            int unknown = stream.read();
                            byte[] text = new byte[length];
                            IOUtils.readFully(stream, text);

                            // turn it into a string, removing null termination
                            // assumes it's found to be utf-8
                            String str = new String(text, 0, text.length - 1, "UTF-8");
                            System.err.println(str); //<- DEBUGGING
                            XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
                            xhtml.startElement("p");
                            xhtml.characters(str);
                            xhtml.endElement("p");

                        }
                    } else {
                        pos = 0;
                    }
                }
        }

        /**
         * @deprecated This method will be removed in Apache Tika 1.0.
         */
        public void parse(
                        InputStream stream, ContentHandler handler,
                        Metadata metadata)
                        throws IOException, SAXException, TikaException {
                parse(stream, handler, metadata, new ParseContext());
        }
}
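
One possible reading of the ArrayIndexOutOfBoundsException: 2, offered as a
guess rather than a confirmed diagnosis: in the loop above, pos is never set
back to 0 after a full prefix match, so the next byte read is compared against
prefix[2], which does not exist. The sketch below is a standalone version of
the same scan (plain JDK, no Tika classes) that resets pos after each match.
The names PrtTextScanner and extractText are made up for illustration, and the
layout after the prefix (one length byte, one unknown byte, then
null-terminated UTF-8 text) is only what the parser above already assumes, not
a documented PRT format.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: scans a stream for [0x01][0x1F] blocks.
final class PrtTextScanner {
    private static final byte[] PREFIX = {0x01, 0x1F};

    static List<String> extractText(InputStream stream) throws IOException {
        DataInputStream in = new DataInputStream(stream);
        List<String> found = new ArrayList<>();
        int pos = 0; // how many prefix bytes have matched so far
        int read;
        while ((read = in.read()) > -1) {
            if (read != PREFIX[pos]) {
                // Mismatch: this byte may itself be the start of a new prefix.
                pos = (read == PREFIX[0]) ? 1 : 0;
                continue;
            }
            pos++;
            if (pos < PREFIX.length) {
                continue; // prefix only partly matched so far
            }
            pos = 0; // full match: reset so the next comparison uses PREFIX[0]

            int length = in.read();  // assumed: one-byte text length
            int unknown = in.read(); // assumed: one byte of unknown meaning
            if (length < 0 || unknown < 0) {
                break; // stream ended right after the prefix
            }
            if (length == 0) {
                continue; // nothing to read for this block
            }
            byte[] text = new byte[length];
            try {
                in.readFully(text);
            } catch (EOFException e) {
                break; // truncated block at the end of the stream
            }
            // Drop the trailing null terminator if there is one.
            int end = (text[length - 1] == 0) ? length - 1 : length;
            found.add(new String(text, 0, end, StandardCharsets.UTF_8));
        }
        return found;
    }

    // Convenience overload for in-memory data.
    static List<String> extractText(byte[] data) {
        try {
            return extractText(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x01, 0x1F, 0x03, 0x00, 'H', 'i', 0x00};
        System.out.println(extractText(sample));
    }
}
```

In the Tika parser itself the equivalent change would be a single pos = 0;
inside the "found it!" branch. Checking the length byte for -1 before
allocating the array also avoids a NegativeArraySizeException if the stream
ends immediately after a prefix.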

Attachment: TikaTest.prt
Description: Binary data
