Re: Skip Headers & Footers while text extraction using Apache Tika parsing for PPT & PDF formats

Jörn Franke Tue, 03 Sep 2019 23:36:15 -0700

Your question is better suited to the Tika mailing list, so it is best to ask 
there. Please also share more of the code you are currently using.


Here is the documentation on how to get a different output format:
https://tika.apache.org/1.8/examples.html#Parsing_using_the_Auto-Detect_Parser
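
The regex-based cleanup suggested in the quoted message below could look 
roughly like the following. This is only a minimal sketch that uses nothing 
but the Java standard library: it assumes you already have the text of one 
page as a String (for example, taken from Tika's XHTML output). The class 
name HeaderFooterStripper and both patterns are hypothetical and would have 
to be adapted to whatever your own headers and footers look like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class HeaderFooterStripper {

    // Hypothetical patterns for illustration only; replace them with
    // expressions that match the headers/footers in your own documents.
    private static final List<Pattern> BOILERPLATE = Arrays.asList(
        Pattern.compile("^\\s*Page \\d+ of \\d+\\s*$"),
        Pattern.compile("^\\s*ACME Corp - Confidential\\s*$")
    );

    /** Returns the page text without the lines that match a boilerplate pattern. */
    static String stripPage(String pageText) {
        List<String> kept = new ArrayList<>();
        // \R matches any line break sequence (Java 8 and later).
        for (String line : pageText.split("\\R")) {
            boolean boilerplate = false;
            for (Pattern p : BOILERPLATE) {
                if (p.matcher(line).matches()) {
                    boilerplate = true;
                    break;
                }
            }
            if (!boilerplate) {
                kept.add(line);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String page = "ACME Corp - Confidential\n"
                    + "Quarterly results were stable.\n"
                    + "Page 3 of 12";
        System.out.println(stripPage(page)); // prints "Quarterly results were stable."
    }
}
```

Matching whole lines keeps the risk of deleting body text low, but it only 
works when the header/footer wording is predictable. If it varies from 
document to document, a fixed pattern list like this will not be enough.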

> On 04.09.2019 at 08:29, Khare, Kushal (MIND) 
> <kushal.kh...@mind-infotech.com> wrote:
> 
> I had already spent a lot of time reading about this on the internet. I 
> posted my query here only after I had exhausted all the trials and 
> solutions I could find.
> I know time zones are different and you people are busy; I totally 
> understand that and highly appreciate your efforts!
> 
> Regarding my file formats, I read about them and tried various ways to 
> handle them. None of them worked.
> I even tried to customise Tika, but could not find any help on the internet 
> on how to proceed. It would be great if any of you could tell me about 
> customising/configuring the Tika config file: where to keep it, its entries, 
> etc.
> I also tried the code below from Tika's documentation to convert to XHTML 
> first and then pick out the required content, but there was a namespace 
> error.
> 
> ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
> 
> Error : org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not 
> declared
> 
> Will customizing Tika (configuring Tika parsers) be any good for my 
> requirement? If yes, then how to deal with that?
> 
> Cheers!
> -----Original Message-----
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: 04 September 2019 11:17
> To: solr-user@lucene.apache.org
> Subject: Re: Skip Headers & Footers while text extraction using Apache Tika 
> parsing for PPT & PDF formats
> 
> People here are in different time zones and have regular jobs; answering 
> questions like the one below is not what they are paid for. There is also a 
> wide range of resources available on the Internet.
> 
> It also cannot hurt to read more about the formats you are processing and 
> to understand them.
> 
> PDF is a problematic format because headers and footers are not marked as 
> such in the document; they exist only as drawing instructions on the page. 
> Software therefore cannot find them from the document structure. However, 
> depending on your documents, you can identify them with regex patterns and 
> remove them from the content. It may help to configure Tika to output HTML 
> and then try to identify the header/footer there.
> 
> Removing headers and footers from PPT is probably not supported by Tika yet 
> (you may ask on their mailing list), so an approach similar to the one for 
> PDFs could be applicable. Alternatively, you can ask on the Apache POI 
> mailing list, because Tika uses Apache POI internally for MS Office formats.
> 
> Feel free to contribute a solution to those problems to the Apache Tika 
> project.
> 
>> On 04.09.2019 at 05:42, Khare, Kushal (MIND) 
>> <kushal.kh...@mind-infotech.com> wrote:
>> 
>> Guys, could I get any help? Or is it useless posting queries here?
>> 
>> On Sep 3, 2019 4:00 PM, "Khare, Kushal (MIND)" 
>> <kushal.kh...@mind-infotech.com> wrote:
>> Hello, mates!
>> I am extracting content from my documents using Apache Tika.
>> I need to exclude the headers & footers of the documents. I have already 
>> done this for Word & Excel formats using OfficeParserConfig, but need to 
>> implement the same for PPT & PDF.
>> How can I achieve that?
>> 
>> 
>> ________________________________
>> 
>> The information contained in this electronic message and any attachments to 
>> this message are intended for the exclusive use of the addressee(s) and may 
>> contain proprietary, confidential or privileged information. If you are not 
>> the intended recipient, you should not disseminate, distribute or copy this 
>> e-mail. Please notify the sender immediately and destroy all copies of this 
>> message and any attachments. WARNING: Computer viruses can be transmitted 
>> via email. The recipient should check this email and any attachments for the 
>> presence of viruses. The company accepts no liability for any damage caused 
>> by any virus/trojan/worms/malicious code transmitted by this email. 
>> www.motherson.com
> 
