No, I don't care about that. I have something like:

List of underwriters….some text <TABLE> <TR><TD>….. Etc etc </TABLE>

I want to get all of that. So can I look for, say, all TABLE tags and the content in them, plus a few words before each TABLE tag? I can do the parsing etc. Thanks.

From: Ken Krugler [mailto:[email protected]]
Sent: Thursday, May 24, 2018 4:09 PM
To: [email protected]
Subject: Re: Extract HTML objects using TIKA

Hi Jaya,

On May 24, 2018, at 12:42 PM, Johnson, Jaya <[email protected]> wrote:

I was wondering if it was possible to extract all tables from an HTML document using TIKA. Is there anything out of the box, or would one have to write something?

Tika will call the content handler you provide with the standard set of table elements. From DefaultHtmlMapper.java:

….
put("TABLE", "table");
put("THEAD", "thead");
put("TBODY", "tbody");
put("TR", "tr");
put("TH", "th");
put("TD", "td");
….

But often when people ask about extracting tables, they're actually interested in getting structured data (column names, data types, etc). And that's something Tika doesn't automagically do for you.

It would be interesting to create such a thing (similar to what we did for Boilerpipe) for use with Tika. E.g. see https://github.com/seagatesoft/sde

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
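For what it's worth, the "table content plus a few words before the TABLE tag" idea from this thread could be sketched as a SAX handler along these lines. This is only a rough sketch, not something Tika ships: the class name TableCollector, the nested Hit type, and the five-word lead-in window are all made up for illustration. It ignores nested tables, and it assumes each run of text arrives in a single characters() call, which SAX does not guarantee. The demo below feeds it well-formed XHTML through the JDK's own SAX parser so it runs without Tika on the classpath; with Tika the same handler would presumably be handed to the HTML parser's parse(stream, handler, metadata, context) call instead.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects each table's cell text plus a few words preceding it. */
class TableCollector extends DefaultHandler {
    /** One extracted table: the words seen just before it and its rows of cell text. */
    static final class Hit {
        final String leadIn;
        final List<List<String>> rows = new ArrayList<>();
        Hit(String leadIn) { this.leadIn = leadIn; }
    }

    private static final int LEAD_IN_WORDS = 5; // arbitrary window size for this sketch

    final List<Hit> hits = new ArrayList<>();
    private final Deque<String> recentWords = new ArrayDeque<>();
    private final StringBuilder cell = new StringBuilder();
    private Hit current;      // non-null while inside a table
    private List<String> row; // non-null while inside a tr
    private boolean inCell;

    // Namespace-aware parsers fill localName; others only fill qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty() ? qName : localName).toLowerCase();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        switch (name(localName, qName)) {
            case "table":
                current = new Hit(String.join(" ", recentWords));
                recentWords.clear();
                break;
            case "tr":
                row = new ArrayList<>();
                break;
            case "td":
            case "th":
                inCell = true;
                cell.setLength(0);
                break;
            default:
                break;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (name(localName, qName)) {
            case "td":
            case "th":
                if (row != null) row.add(cell.toString().trim());
                inCell = false;
                break;
            case "tr":
                if (current != null && row != null) current.rows.add(row);
                row = null;
                break;
            case "table":
                if (current != null) hits.add(current);
                current = null;
                break;
            default:
                break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        if (inCell) { cell.append(text); return; }
        if (current != null) return; // text inside a table but outside any cell
        // Outside tables: keep a sliding window of the most recent words.
        for (String w : text.trim().split("\\s+")) {
            if (w.isEmpty()) continue;
            recentWords.addLast(w);
            if (recentWords.size() > LEAD_IN_WORDS) recentWords.removeFirst();
        }
    }

    /** Parses well-formed XHTML with the JDK SAX parser and returns the tables found. */
    static List<Hit> extract(String xhtml) throws Exception {
        TableCollector handler = new TableCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.hits;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><body><p>Some intro. List of underwriters</p>"
                + "<table><tr><th>Name</th><th>Role</th></tr>"
                + "<tr><td>Acme</td><td>Lead</td></tr></table></body></html>";
        for (Hit h : extract(doc)) {
            System.out.println(h.leadIn + " -> " + h.rows);
        }
    }
}
```

Each Hit keeps the lead-in words and the rows as plain strings, so turning that into structured data (column names, types) is still left to the caller, as Ken points out above.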
