A question about HTML reader component

Chengmin Ding Fri, 24 Aug 2007 09:35:17 -0700

Hi, Folks,

We have been using UIMA to mine data points from some documents in plain
text format, and our AEs worked fine. Recently, however, those documents
have been delivered in HTML format (i.e. with a bunch of HTML tags mixed in)
and our AEs can no longer mine the data correctly. Our question is whether
there is any HTML Collection Reader component or library already available,
so that we do not need to reinvent the wheel.


We tried an HTMLCommon collection reader, but it looks like it cannot parse
a table correctly. It often adds many blank lines between table cells/rows,
which confuses our AEs.

Any help is highly appreciated.

Thanks

-Chengmin
