> On Feb 14, 2017, at 4:35pm, Zhang, Lisheng <[email protected]> wrote:
>
> Hi,
>
> We have been using Tika for some time, and it is very helpful, thanks a lot!
>
> So far, when Tika extracts text, it throws away HTML links and keeps only
> the words. This is good for search indexing, but in a new application we
> need to keep the whole HTML link when extracting text from a binary file
> like MS DOC. I could not find a simple way to do that. Could you provide a
> pointer to a suitable API or doc?

One example is in the Bixo web mining toolkit. See https://github.com/bixo/bixo/tree/master/src/main/java/bixo/parser for all the related files.

Specifically, there's https://github.com/bixo/bixo/blob/master/src/main/java/bixo/parser/TikaCallable.java, which runs the parse in a thread (so that if it hangs it doesn't kill the Hadoop job). It calls the Tika parse() method with an org.apache.tika.sax.TeeContentHandler that sends SAX events to both the regular content extraction handler and (typically) the SimpleLinkExtractor class (in the same package).

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
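
As a rough illustration of the tee approach described above, here is a minimal sketch that uses only the JDK's SAX parser on a hand-written XHTML string, standing in for the XHTML SAX events that Tika's parse() emits. The class names (TextCollector, LinkCollector, TeeHandler, TeeLinkSketch) are hypothetical and are not the Bixo or Tika classes. In a real setup you would presumably pass Tika's own TeeContentHandler, wrapping your two handlers, to parse() instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keeps character data only, like plain text extraction. */
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

/** Hypothetical handler: records {href, anchor text} for every <a> element. */
class LinkCollector extends DefaultHandler {
    final List<String[]> links = new ArrayList<>();
    private String href;
    private StringBuilder anchor;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("a".equalsIgnoreCase(qName)) {
            href = atts.getValue("href");
            anchor = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (anchor != null) {
            anchor.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("a".equalsIgnoreCase(qName) && anchor != null) {
            if (href != null) {
                links.add(new String[] {href, anchor.toString().trim()});
            }
            href = null;
            anchor = null;
        }
    }
}

/** Hypothetical stand-in for a tee handler: forwards each event to all delegates. */
class TeeHandler extends DefaultHandler {
    private final ContentHandler[] delegates;

    TeeHandler(ContentHandler... delegates) {
        this.delegates = delegates;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (ContentHandler h : delegates) h.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        for (ContentHandler h : delegates) h.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler h : delegates) h.characters(ch, start, length);
    }
}

class TeeLinkSketch {
    /** Parses an XHTML string and feeds the SAX events to the given handler. */
    static void parse(String xhtml, ContentHandler handler) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        TextCollector text = new TextCollector();
        LinkCollector links = new LinkCollector();
        String xhtml = "<html><body><p>See the <a href=\"http://tika.apache.org/\">"
                + "Tika site</a> for details.</p></body></html>";
        parse(xhtml, new TeeHandler(text, links));
        System.out.println(text.text);
        for (String[] link : links.links) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

The point of the tee is that a single parse pass feeds both consumers, so the plain text used for indexing and the link list stay in sync without parsing the document twice.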
