parsing html as a string from getDocumentText

Sam Fisher Wed, 12 Mar 2008 10:28:40 -0700

Hi All,

Having played around with plain text files in UIMA, I'm now feeding an HTML file to the Document Analyzer. The JCas holds the contents of this file, both markup and text, as a single text string. After reading through the MarkMail archives, I decided to try the Jericho HTML parser to extract the plain text content from the HTML string (e.g. String theHtml = jcas.getDocumentText()). I'm probably not using Jericho correctly, because the output of the parser is the same as what went in (not stripped down to only the text content).
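To make the goal concrete, here is a minimal sketch of the kind of extraction I am after. It uses the JDK's built-in Swing HTML parser rather than Jericho, so it is a swapped-in technique and says nothing about the Jericho API itself. The class name HtmlTextSketch and the method extractText are hypothetical names for illustration; in an annotator the input would presumably be the string returned by jcas.getDocumentText().

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: strips markup from an HTML string using the JDK's
// Swing HTML parser (not Jericho).
final class HtmlTextSketch {

    // Returns the text runs found in the HTML, joined by single spaces.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once per run of character data;
        // tags are reported through other callbacks, which are ignored here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the document;
            // the input is already a decoded Java String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling HtmlTextSketch.extractText(theHtml) on a simple page should give back the text without any tags; exact whitespace between runs may differ from what a browser would show.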

So that I bark up the right tree: does the CAS force some kind of encoding, such as UTF-8, that might cause the parser to be blind to the markup tags in the HTML string? This seems ridiculous, but I thought I'd ask.


Has anyone had success using Jericho with UIMA?


Many thanks,

Sam
