Thank you, Pablo, for the prompt reply. I will check out the w3 community project and possibly participate in it. I think this HTML detagging function is very useful and deserves more participation.

-Chengmin

On 8/24/07, Pablo Duboue <[EMAIL PROTECTED]> wrote:
>
> Hi Chengmin,
>
> The blank lines you refer to are easy to remove and are there by
> design. The detagger has a list of "non-paragraph separating tags",
> any other tag is supposed to delimit chunks of text, thus the added
> blank lines. But there is no reason that behavior can't be
> parameterized.
>
> If you want to join the (IBM internal) project, please stop by the
> Community Source w3 site.
>
> Best regards,
>
> Pablo
>
> On 8/24/07, Chengmin Ding <[EMAIL PROTECTED]> wrote:
> > Hi, Folks,
> >
> > We have been using UIMA to mine data points from some documents in plain
> > text format and our AE worked fine. But recently those documents are
> > delivered in HTML format (i.e. with a bunch of HTML tags mixed in) and our
> > AEs can no longer mine the data correctly. Our question is if whether there
> > is any HTML Collection Reader component or library already available so we
> > do not need to reinvent the wheel?
> >
> > We tried an HTMLCommon collection reader but looks like it cannot parse a
> > table correctly. It often adds many blank lines between tables cells/rows
> > which confuses our AE.
> >
> > Any of your help is highly appreciated.
> >
> > Thanks
> >
> > -Chengmin
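
For readers of the archive, the behavior Pablo describes (a list of "non-paragraph separating tags", with every other tag delimiting a chunk of text) could be sketched roughly as below. This is only an illustration in Python using the standard `html.parser` module, not the actual detagger code: the names `Detagger`, `detag`, `inline_tags`, `separator`, and the default tag list are all assumptions made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical default list; the real detagger's list is not given in this thread.
DEFAULT_INLINE_TAGS = frozenset({"a", "b", "i", "em", "strong", "span", "font"})


class Detagger(HTMLParser):
    """Strip tags; any tag not in inline_tags ends the current text chunk."""

    def __init__(self, inline_tags=DEFAULT_INLINE_TAGS, separator="\n\n"):
        super().__init__(convert_charrefs=True)
        self.inline_tags = inline_tags
        self.separator = separator
        self.chunks = [[]]  # each chunk is a list of text pieces

    def _boundary(self, tag):
        # Start a new chunk only if the current one already holds text,
        # so nested tags such as <table><tr><td> do not pile up empty chunks.
        if tag not in self.inline_tags and self.chunks[-1]:
            self.chunks.append([])

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        text = " ".join(data.split())  # crude whitespace normalization
        if text:
            self.chunks[-1].append(text)

    def text(self):
        return self.separator.join(" ".join(c) for c in self.chunks if c)


def detag(html, inline_tags=DEFAULT_INLINE_TAGS, separator="\n\n"):
    parser = Detagger(inline_tags, separator)
    parser.feed(html)
    parser.close()
    return parser.text()


table = "<table><tr><td>Name</td><td>Age</td></tr><tr><td>Ann</td><td>30</td></tr></table>"

# Default settings: every cell becomes its own chunk, separated by a blank line.
print(detag(table))

# Parameterized: treat <td> as non-separating and use a single newline,
# so each table row comes out as one line.
print(detag(table, inline_tags=DEFAULT_INLINE_TAGS | {"td"}, separator="\n"))
```

With the default settings each table cell ends up in its own blank-line-separated chunk, which is the effect reported above. Adding `td` to the non-separating list and using a single newline as the separator keeps each row on one line, which may be easier for a downstream AE to handle.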
