No, I don’t care about that. I have something like:
List of underwriters….some text
<TABLE>
<TR><TD>…..
Etc etc
</TABLE>

I want to get all of that. So can I look for, say, the content of all TABLE 
tags, plus a few words before each TABLE tag? I can do the parsing etc. myself.

Thanks.
From: Ken Krugler [mailto:[email protected]]
Sent: Thursday, May 24, 2018 4:09 PM
To: [email protected]
Subject: Re: Extract HTML objects using TIKA

Hi Jaya,

On May 24, 2018, at 12:42 PM, Johnson, Jaya <[email protected]> wrote:


I was wondering whether it is possible to extract all tables from an HTML 
document using Tika. Is there anything out of the box, or would one have to 
write something?

Tika will call the content handler you provide with the standard set of table 
elements. From DefaultHtmlMapper.java:

        ….
        put("TABLE", "table");
        put("THEAD", "thead");
        put("TBODY", "tbody");
        put("TR", "tr");
        put("TH", "th");
        put("TD", "td");
        ….
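
As a rough illustration of the above, here is a minimal sketch of a SAX handler that keeps each top-level table together with the few words of text that came before it. It assumes the handler sees lowercase XHTML element names ("table", "tr", "th", "td"), as in the mapping above. The names TableCollector, Table, LEAD_IN_WORDS and parseXhtml are hypothetical, and the demo helper feeds well-formed XHTML through the JDK's own SAX parser rather than through Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects every top-level table plus the words just before it.
class TableCollector extends DefaultHandler {
    static final int LEAD_IN_WORDS = 10; // assumed size of "a few words"

    // One captured table: the text seen just before it, and its rows of cell text.
    static class Table {
        final String leadIn;
        final List<List<String>> rows = new ArrayList<>();
        Table(String leadIn) { this.leadIn = leadIn; }
    }

    final List<Table> tables = new ArrayList<>();
    private final StringBuilder outside = new StringBuilder(); // text since the last table
    private final StringBuilder cell = new StringBuilder();
    private Table current;
    private List<String> row;
    private int depth;      // <table> nesting; nested tables are folded into the outer cell
    private boolean inCell;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        if (name.equals("table")) {
            if (depth++ == 0) {
                current = new Table(lastWords(outside, LEAD_IN_WORDS));
                outside.setLength(0);
            }
        } else if (depth == 1 && name.equals("tr")) {
            row = new ArrayList<>();
        } else if (depth == 1 && row != null && (name.equals("td") || name.equals("th"))) {
            inCell = true;
            cell.setLength(0);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (name.equals("table")) {
            if (--depth == 0) {
                tables.add(current);
                current = null;
            }
        } else if (depth == 1 && inCell && (name.equals("td") || name.equals("th"))) {
            row.add(cell.toString().trim());
            inCell = false;
        } else if (depth == 1 && row != null && name.equals("tr")) {
            current.rows.add(row);
            row = null;
        } else if (depth == 0) {
            outside.append(' '); // crude word break at element boundaries outside tables
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth == 0) outside.append(ch, start, length);
        else if (inCell) cell.append(ch, start, length);
    }

    // Returns the last n whitespace-separated words of the text.
    private static String lastWords(CharSequence text, int n) {
        String[] words = text.toString().trim().split("\\s+");
        int from = Math.max(0, words.length - n);
        return String.join(" ", Arrays.copyOfRange(words, from, words.length));
    }

    // Demo helper (not Tika): runs the handler over well-formed XHTML with the JDK parser.
    static List<Table> parseXhtml(String xhtml) {
        TableCollector handler = new TableCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.tables;
    }
}
```

With Tika, the same handler instance would presumably be passed as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext), for example on an AutoDetectParser. That wiring is not exercised in this sketch.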

But often when people ask about extracting tables, they’re actually interested 
in getting structured data (column names, data types, etc). And that’s 
something Tika doesn’t automagically do for you.

It would be interesting to create such a thing (similar to what we did for 
Boilerpipe) for use with Tika. E.g. see https://github.com/seagatesoft/sde

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra

