-----Original Message-----
From: Ken Krugler [mailto:[email protected]]
Sent: Tue 2/14/2017 5:09 PM
To: [email protected]
Subject: Re: How to keep all HTML link when doing file content extraction?
> On Feb 14, 2017, at 4:35pm, Zhang, Lisheng <[email protected]>
> wrote:
>
>
> Hi, we have been using Tika for some time and it has been very helpful,
> thanks a lot!
>
> So far, when Tika extracts text it discards HTML links and keeps only the
> words. This is good for search indexing, but in a new application we need to
> keep the whole HTML link when extracting text from a binary file such as an
> MS DOC. I could not find a simple way to do that; could you point me to a
> suitable API or doc?
One example is in the Bixo web mining toolkit.
See https://github.com/bixo/bixo/tree/master/src/main/java/bixo/parser for all
the related files.
Specifically there's
https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java,
which runs the parse in a thread (so that if it hangs it doesn't kill the
hadoop job).
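
A minimal sketch of the "run the parse in a thread" idea, using only
java.util.concurrent. This is not Bixo's actual code: the class name TimedParse
and the helper runWithTimeout are hypothetical, and the Callable stands in for
whatever wraps the real parse() call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedParse {
    /** Runs the task on a worker thread; returns fallback if it does not finish in time. */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMs, T fallback) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a hung parse must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; a parser that ignores interrupts keeps running
            // in the background, but the caller is no longer blocked by it.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a real parse: it simply returns a string.
        String result = runWithTimeout(() -> "parsed text", 1000, "timed out");
        System.out.println(result);
    }
}
```

Exceptions thrown by the task itself surface as an ExecutionException from
future.get(), so the caller can tell a failed parse from a hung one.
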
TikaCallable calls the Tika parse() method with an
org.apache.tika.sax.TeeContentHandler that sends SAX events to both the regular
content extraction handler and (typically) the SimpleLinkExtractor class (in
the same package).
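
To illustrate the tee pattern described above, here is a rough sketch that uses
only the SAX classes shipped with the JDK, so it runs without Tika on the
classpath. All class names in it (TeeLinkDemo, Tee, TextCollector,
LinkCollector) are hypothetical: Tee plays the role of Tika's
TeeContentHandler, TextCollector the regular content extraction handler, and
LinkCollector a link extractor such as SimpleLinkExtractor. The assumption is
that the parser emits XHTML-style SAX events, so links show up as <a> elements
with an href attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TeeLinkDemo {
    /** Collects character data, like a plain-text extraction handler. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Collects the href attribute of every <a> element. */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Forwards the SAX events used in this sketch to every delegate. */
    static class Tee extends DefaultHandler {
        private final DefaultHandler[] delegates;

        Tee(DefaultHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (DefaultHandler d : delegates) d.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            for (DefaultHandler d : delegates) d.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (DefaultHandler d : delegates) d.characters(ch, start, length);
        }
    }

    /** Parses well-formed XHTML with the JDK SAX parser (a stand-in for Tika's parse()). */
    static void parse(String xhtml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
    }

    public static void main(String[] args) throws Exception {
        TextCollector text = new TextCollector();
        LinkCollector links = new LinkCollector();
        parse("<p>See <a href=\"http://tika.apache.org/\">Apache Tika</a> docs.</p>",
                new Tee(text, links));
        System.out.println(text.text);
        System.out.println(links.links);
    }
}
```

With Tika itself, the same shape would presumably be a TeeContentHandler
wrapping the two handlers and passed to parse(), so one pass over the document
feeds both the text and the link collection.
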
- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
Thanks very much for such timely help, I will study and test it.