Your question is better suited to the Tika mailing list, so please ask there. It would also help if you shared more of the code you are currently using.
Here is the documentation on how to get a different output format:
https://tika.apache.org/1.8/examples.html#Parsing_using_the_Auto-Detect_Parser

> On 04.09.2019 at 08:29, Khare, Kushal (MIND)
> <kushal.kh...@mind-infotech.com> wrote:
>
> I already spent a lot of time reading on the internet about the same, when I
> was finished with all the trials and solutions, then only I posted my query
> here.
> I know time zones are different and you people are busy, I totally understand
> it & highly appreciate your efforts!
>
> Regarding my file formats, I read about them, tried various ways to handle
> it. Did not worked.
> I even tried to customise my Tika, but did not got any help on internet how
> to proceed with it. It would be great if anyone of you could tell me about
> customising/configuring Tika config file - where to keep that, it's entry,
> etc.
> I also tried the below code from Tika's Documentation to convert into XHTML
> first and then considering the required content, but there was a namespace
> error.
>
> ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
>
> Error : org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not
> declared
>
> Will customizing Tika (configuring Tika parsers) be any good for my
> requirement? If yes, then how to deal with that?
>
> Cheers!
> -----Original Message-----
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: 04 September 2019 11:17
> To: solr-user@lucene.apache.org
> Subject: Re: Skip Headers & Footers while text extraction using Apache Tika
> parsing for PPT & PDF formats
>
> People here are in different timezones, have their normal jobs for which they
> are actually paid to provide answers to questions as those one below etc.
> There are also a wide number of resources out on the Internet.
>
> It can also not harm to read more about the formats that you are processing
> and understand them.
>
> PDF is a problematic format as headers and footers are not specified per se
> as headers and footers in the document, but only as drawing instructions on
> the page. There is no chance for a software to find them based on the
> structure. However, depending on your documents you can identify them based
> on regex pattern and remove them from the content. It may be helpful to
> configure Tika to output in HTML format and then try to identify
> header/footer.
>
> Removing headers and footers for ppt are probably not supported yet by Tika
> (you may ask on their mailing list). So a similar approach as for the PDFs
> could be applicable. Alternatively you can check in the Apache POI mailing
> list because Tika uses internally for Ms Office formats Apache POI.
>
> Feel free to contribute a solution to those problems to the Apache Tika
> project.
>
>> On 04.09.2019 at 05:42, Khare, Kushal (MIND)
>> <kushal.kh...@mind-infotech.com> wrote:
>>
>> Guys, could I get any help ? Or it's useless posting queries over here ?
>>
>> On Sep 3, 2019 4:00 PM, "Khare, Kushal (MIND)"
>> <kushal.kh...@mind-infotech.com> wrote:
>> Hello, mates !
>> I am extracting content from my documents using Apache Tika.
>> I need to exclude the headers & footers of the documents. I have already
>> done this for Word & Excel format using OfficeParseConfig, but need to
>> implement the same for PPT & PDF.
>> How to achieve that ?
>>
>>
>> ________________________________
>>
>> The information contained in this electronic message and any attachments to
>> this message are intended for the exclusive use of the addressee(s) and may
>> contain proprietary, confidential or privileged information. If you are not
>> the intended recipient, you should not disseminate, distribute or copy this
>> e-mail. Please notify the sender immediately and destroy all copies of this
>> message and any attachments. WARNING: Computer viruses can be transmitted
>> via email. The recipient should check this email and any attachments for the
>> presence of viruses. The company accepts no liability for any damage caused
>> by any virus/trojan/worms/malicious code transmitted by this email.
>> www.motherson.com
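
For the regex approach to PDF headers and footers suggested earlier in this thread, a minimal sketch could look like the following. It only post-processes text that has already been extracted (for example by Tika) and uses plain Java, no Tika classes. The class name HeaderFooterFilter, the method strip and both patterns are hypothetical and chosen for illustration only; the real patterns depend entirely on what your documents print at the top and bottom of each page.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: drops lines of already-extracted text that look like
// page headers or footers. It does not call Tika itself.
final class HeaderFooterFilter {

    // Assumed example patterns; replace them with ones that match your documents.
    private static final List<Pattern> NOISE = List.of(
            // "Page 3" or "Page 3 of 9", ignoring case and surrounding whitespace
            Pattern.compile("\\s*Page \\d+( of \\d+)?\\s*", Pattern.CASE_INSENSITIVE),
            // A fixed running header (made-up company name)
            Pattern.compile("\\s*ACME Corp - Confidential\\s*"));

    // Returns the text without any line that fully matches one of the patterns.
    static String strip(String text) {
        return text.lines()
                .filter(line -> NOISE.stream().noneMatch(p -> p.matcher(line).matches()))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String extracted = String.join("\n",
                "ACME Corp - Confidential",
                "Quarterly results were strong.",
                "Page 1 of 2",
                "ACME Corp - Confidential",
                "Outlook remains positive.",
                "Page 2 of 2");
        System.out.println(strip(extracted));
    }
}
```

Run on the sample input in main, this prints only the two body sentences. Matching whole lines rather than substrings keeps the filter from deleting body text that merely mentions a page number, at the cost of missing headers that share a line with real content.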