Re: A question about HTML reader component

Chengmin Ding Fri, 24 Aug 2007 11:00:36 -0700

Thank you, Pablo, for the prompt reply. I will check out the w3
community project and possibly participate in it. I think this HTML
detagging function is a very useful one and deserves more participation.


-Chengmin

On 8/24/07, Pablo Duboue <[EMAIL PROTECTED]> wrote:
>
> Hi Chengmin,
>
> The blank lines you refer to are easy to remove and are there by
> design. The detagger has a list of "non-paragraph separating tags";
> any other tag is supposed to delimit chunks of text, thus the added
> blank lines. But there is no reason that behavior can't be
> parameterized.
>
> If you want to join the (IBM internal) project, please stop by the
> Community Source w3 site.
>
> Best regards,
>
> Pablo
>
> On 8/24/07, Chengmin Ding <[EMAIL PROTECTED]> wrote:
> > Hi, Folks,
> >
> > We have been using UIMA to mine data points from some documents in plain
> > text format and our AE worked fine. But recently those documents are
> > delivered in HTML format (i.e. with a bunch of HTML tags mixed in) and our
> > AEs can no longer mine the data correctly. Our question is whether there
> > is any HTML Collection Reader component or library already available so we
> > do not need to reinvent the wheel?
> >
> > We tried an HTMLCommon collection reader but it looks like it cannot parse a
> > table correctly. It often adds many blank lines between table cells/rows,
> > which confuses our AE.
> >
> > Any help is highly appreciated.
> >
> > Thanks
> >
> > -Chengmin
> >
>
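
For reference, the detagging behavior Pablo describes above (a list of
"non-paragraph separating tags", with every other tag delimiting a chunk
of text) could be sketched roughly as follows. This is only an
illustration in Python using the standard library's html.parser, not the
HTMLCommon reader's actual code: the INLINE_TAGS list, the Detagger class,
the detag function, and the separator parameter are all hypothetical
names chosen for this sketch.

```python
from html.parser import HTMLParser

# Hypothetical "non-paragraph separating" tags; the real detagger's list
# is not shown in this thread.
INLINE_TAGS = frozenset({"a", "b", "i", "em", "strong", "span", "font", "u"})


class Detagger(HTMLParser):
    """Strip tags, starting a new text chunk at every non-inline tag."""

    def __init__(self, separator="\n\n", inline_tags=INLINE_TAGS):
        super().__init__(convert_charrefs=True)
        self.separator = separator      # text placed between chunks
        self.inline_tags = inline_tags
        self.chunks = [[]]              # each chunk is a list of text pieces

    def _maybe_break(self, tag):
        # Any tag outside the inline list closes the current chunk,
        # unless that chunk has not received any text yet.
        if tag not in self.inline_tags and self.chunks[-1]:
            self.chunks.append([])

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.chunks[-1].append(data)

    def text(self):
        # Collapse whitespace inside each chunk and drop empty chunks.
        cleaned = (" ".join("".join(chunk).split()) for chunk in self.chunks)
        return self.separator.join(c for c in cleaned if c)


def detag(html, separator="\n\n"):
    parser = Detagger(separator=separator)
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = ("<table><tr><td>Name</td><td>Age</td></tr>"
              "<tr><td>Ann <b>Lee</b></td><td>41</td></tr></table>")
    print(detag(sample))                  # blank line between cells
    print(detag(sample, separator="\n"))  # one cell per line
```

With the default separator every table cell ends up separated by a blank
line, which matches the symptom in the original question. Passing
separator="\n" is one way the behavior could be parameterized so that an
AE expecting compact text is not confused.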
