Hi Benjamin,
It sounds like you want to use the IdentityHtmlMapper (so no HTML elements get
transformed), and your own content handler, so that you get all of the tag
start/end SAX events. So something like...
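    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlMapper;
    import org.apache.tika.parser.html.HtmlParser;
    import org.apache.tika.parser.html.IdentityHtmlMapper;

    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    // Use the identity mapper so that no HTML elements are dropped or renamed.
    parseContext.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);
    // parse() can throw IOException, SAXException, and TikaException.
    new HtmlParser().parse(myInputStream, myContentHandler, metadata, parseContext);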
Here myContentHandler is an instance of a custom class that extends
org.xml.sax.helpers.DefaultHandler (similar to ToTextContentHandler in Tika).
It will get called with all of the SAX events, in particular startElement(),
endElement(), and characters().
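As a rough sketch of what such a handler could look like: the class below
records the text of each heading element together with its level. The class
name HeadingOutlineHandler, the getOutline() method, and the choice to treat
only h1 through h6 as "section titles" are my own assumptions for
illustration, not something Tika provides. It only depends on the SAX classes
in the JDK, and it matches element names case-insensitively in case the
parser reports them in a different case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects "hN: title" entries for h1..h6 elements. */
class HeadingOutlineHandler extends DefaultHandler {

    private final List<String> outline = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private int level = 0; // 0 means "not currently inside a heading"

    // Returns 1..6 for h1..h6 (any case), otherwise 0.
    private static int headingLevel(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (name == null) {
            return 0;
        }
        name = name.toLowerCase(Locale.ENGLISH);
        if (name.length() == 2 && name.charAt(0) == 'h') {
            char c = name.charAt(1);
            if (c >= '1' && c <= '6') {
                return c - '0';
            }
        }
        return 0;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int l = headingLevel(localName, qName);
        if (l > 0) {
            // Entering a heading: remember its level and start collecting text.
            level = l;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (level > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only close the heading when its own end tag arrives, so inline
        // children such as <b> or <a> inside the heading are included.
        if (level > 0 && headingLevel(localName, qName) == level) {
            outline.add("h" + level + ": " + text.toString().trim());
            level = 0;
        }
    }

    /** Headings in document order, e.g. "h2: Some section title". */
    public List<String> getOutline() {
        return outline;
    }
}
```

You would pass an instance of this as myContentHandler and call getOutline()
after parse() returns. Capturing the body text under each heading would be a
matter of also collecting characters() between headings.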
-- Ken
> From: Sznajder ForMailingList
> Sent: August 17, 2015 8:51:03am PDT
> To: [email protected]
> Subject: Extracting the structure of an HTML Document
>
> Hi
>
> I am a new user of Tika.
>
> I am handling HTML documents, and I have succeeded in parsing them into a
> "clean" text string.
>
> However, I am interested in getting the structure of the documents: what are
> the different sections, what are the titles of these sections, etc.
>
> Is there a way to do that with Tika?
>
> Thanks!
>
> Benjamin
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr