Re: How to keep all HTML link when doing file content extraction?

Ken Krugler Tue, 14 Feb 2017 17:10:18 -0800

> On Feb 14, 2017, at 4:35pm, Zhang, Lisheng <[email protected]> 
> wrote:
> 
> 
> Hi, we have been using Tika for some time, and it is very helpful, thanks a lot!
> 
> So far, when Tika extracts text it throws away HTML links and keeps only the 
> words. This is good for search indexing, but in a new application we need to 
> keep the whole HTML link when extracting text from a binary file like MS DOC. 
> I could not find a simple way to do that; could you provide a pointer to a 
> suitable API or doc?


One example is in the Bixo web mining toolkit.

See https://github.com/bixo/bixo/tree/master/src/main/java/bixo/parser for all 
the related files.

Specifically there’s 
https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java, 
which runs the parse in a separate thread (so that if it hangs it doesn’t kill 
the Hadoop job).

It calls the Tika parse() method with an org.apache.tika.sax.TeeContentHandler 
that sends SAX events to both the regular content extraction handler and 
(typically) the SimpleLinkExtractor class (in the same package).
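
To make the tee idea concrete, here is a minimal sketch that uses only the JDK's SAX classes, so it runs without Tika or Bixo on the classpath. It is not the Bixo code: TeeSketch, TextCollector, LinkCollector and Tee are hypothetical names, with Tee standing in for TeeContentHandler and LinkCollector for a link extractor such as SimpleLinkExtractor. It also assumes well-formed XHTML as input, which is roughly what Tika emits as SAX events.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TeeSketch {

    /** Collects character data, like a plain-text extraction handler. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Collects href values of <a> elements (hypothetical link extractor). */
    static class LinkCollector extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default JDK parser is not namespace-aware, so match on qName.
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Forwards element and character events to every delegate handler. */
    static class Tee extends DefaultHandler {
        private final ContentHandler[] delegates;

        Tee(ContentHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            for (ContentHandler h : delegates) {
                h.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            for (ContentHandler h : delegates) {
                h.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler h : delegates) {
                h.characters(ch, start, length);
            }
        }
    }

    /** Parses well-formed XHTML once and feeds the events to the given handler. */
    static void parse(String xhtml, ContentHandler handler) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new RuntimeException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        TextCollector text = new TextCollector();
        LinkCollector links = new LinkCollector();
        String doc = "<html><body><p>See <a href=\"http://example.com/doc\">the doc</a> now.</p></body></html>";
        parse(doc, new Tee(text, links));
        System.out.println(text.text);
        System.out.println(links.links);
    }
}
```

With Tika itself, the equivalent would be to pass new TeeContentHandler(textHandler, linkHandler) as the handler argument of parse(). Tika also ships org.apache.tika.sax.LinkContentHandler, which may be enough as the link-collecting side if you don't want to pull in Bixo, though I'd check how it behaves on your MS DOC files first.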

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
