Hi Jaya,
> On May 24, 2018, at 1:34 PM, Johnson, Jaya <[email protected]> wrote:
>
> No I don’t care about that – I have something like
> List of underwriters….some text
> <TABLE>
> <TR><TD>…..
> Etc etc
> <TABLE>
> <>
> I want to get all of that – so can I look for, say, all TABLE tags and the
> content in them, plus a few words before each TABLE tag? I can do the parsing etc.
In that case you should be able to use your own content handler (which will get
a stream of SAX events), and process the elements as they come in. E.g.
something like...
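// MyTableAwareContentHandler is your own org.xml.sax.ContentHandler implementation.
File document = new File(target);
Parser parser = new AutoDetectParser();
ContentHandler handler = new MyTableAwareContentHandler();
Metadata metadata = new Metadata();
// try-with-resources closes the stream even if parsing throws.
try (InputStream stream = new FileInputStream(document)) {
    parser.parse(stream, handler, metadata, new ParseContext());
}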
where ContentHandler is org.xml.sax.ContentHandler.
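
For what it's worth, a minimal sketch of such a handler might look like the following. Everything inside the class (the Table holder, WORDS_BEFORE, getTables) is a hypothetical illustration, not part of Tika. It assumes only that the handler receives lowercase XHTML element names (table, tr, th, td) as SAX events, and it ignores the structure of nested tables. The main method uses the JDK's own SAX parser on a made-up snippet purely as a stand-in for Tika, so the sketch runs without any extra jars.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects each table's cell text plus a few words preceding it. */
class MyTableAwareContentHandler extends DefaultHandler {
    private static final int WORDS_BEFORE = 5; // assumption: amount of leading context to keep

    /** One extracted table: the words just before it, and its rows of cell text. */
    static class Table {
        final String context;
        final List<List<String>> rows = new ArrayList<>();
        Table(String context) { this.context = context; }
    }

    private final List<Table> tables = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Table current;     // non-null while inside an outermost table
    private List<String> row;  // row currently being filled
    private int depth;         // table nesting depth

    List<Table> getTables() { return tables; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = name(localName, qName);
        if (name.equals("table")) {
            if (depth++ == 0) {
                // Text seen since the previous table is the leading context.
                current = new Table(lastWords(text.toString(), WORDS_BEFORE));
                tables.add(current);
            }
        } else if (current != null && name.equals("tr")) {
            row = new ArrayList<>();
            current.rows.add(row);
        } else if (current != null && (name.equals("td") || name.equals("th"))) {
            text.setLength(0); // start collecting this cell's text
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = name(localName, qName);
        if (current != null && (name.equals("td") || name.equals("th"))) {
            if (row == null) { // tolerate a cell that appears without an enclosing tr
                row = new ArrayList<>();
                current.rows.add(row);
            }
            row.add(text.toString().trim());
        } else if (name.equals("table") && depth > 0 && --depth == 0) {
            current = null;
            row = null;
            text.setLength(0);
        }
        text.append(' '); // keep words from adjacent elements from running together
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Non-namespace-aware parsers leave localName empty, so fall back to qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    private static String lastWords(String s, int n) {
        String[] words = s.trim().split("\\s+");
        int from = Math.max(0, words.length - n);
        return String.join(" ", Arrays.copyOfRange(words, from, words.length));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input; with Tika, the same events would come from parser.parse(...).
        String xhtml = "<html><body><p>List of underwriters for the deal</p>"
                + "<table><tr><th>Name</th><th>Share</th></tr>"
                + "<tr><td>Acme</td><td>60%</td></tr></table></body></html>";
        MyTableAwareContentHandler handler = new MyTableAwareContentHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        for (Table t : handler.getTables()) {
            System.out.println(t.context + " -> " + t.rows);
        }
    }
}
```

After parsing, getTables() gives you one entry per top-level table, which you can then post-process however you like.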
— Ken
>
> Thanks.
> From: Ken Krugler [mailto:[email protected]]
> Sent: Thursday, May 24, 2018 4:09 PM
> To: [email protected]
> Subject: Re: Extract HTML objects using TIKA
>
> Hi Jaya,
>
> On May 24, 2018, at 12:42 PM, Johnson, Jaya <[email protected]> wrote:
>
>
> I was wondering if it is possible to extract all tables from an HTML
> document using Tika. Is there anything out of the box, or would one have to
> write something?
>
> Tika will call the content handler you provide with the standard set of table
> elements. From DefaultHtmlMapper.java:
>
> ….
> put("TABLE", "table");
> put("THEAD", "thead");
> put("TBODY", "tbody");
> put("TR", "tr");
> put("TH", "th");
> put("TD", "td");
> ….
>
> But often when people ask about extracting tables, they’re actually
> interested in getting structured data (column names, data types, etc). And
> that’s something Tika doesn’t automagically do for you.
>
> It would be interesting to create such a thing (similar to what we did for
> Boilerpipe) for use with Tika. E.g. see https://github.com/seagatesoft/sde
>
> — Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> Custom big data solutions & training
> Flink, Solr, Hadoop, Cascading & Cassandra
>
--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378