Thanks for your reply. I am calling Apache Tika in Java code like this:
public String extractPDFText(String faInputFileName) throws IOException, TikaException {
    // Handler for body text of the PDF article
    BodyContentHandler handler = new BodyContentHandler();
    // Metadata of the article
    Metadata metadata = new Metadata();
    // Input file path
    FileInputStream inputstream = new FileInputStream(new File(faInputFileName));
    // Parser context. It is used to parse InputStream
    ParseContext pcontext = new ParseContext();
    try {
        // Parsing the document using PDF parser from Tika. Case statement
        // will be added for handling other file types.
        PDFParser pdfparser = new PDFParser();
        // Do the parsing by calling the parse function of pdfparser
        pdfparser.parse(inputstream, handler, metadata, pcontext);
    } catch (Exception e) {
        System.out.println("Exception caught:");
    }
    // Convert the body handler to string and return the string to the
    // calling function
    return handler.toString();
}
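
A possible explanation worth checking (not confirmed for these particular files): the no-argument BodyContentHandler constructor bounds its internal buffer at 100,000 characters and throws a SAXException once that limit is reached. Because the catch block above prints only a fixed message, that exception would go unnoticed and the method would return the text collected so far. The sketch below reproduces the effect with JDK classes only; LimitedTextHandler and WriteLimitDemo are hypothetical stand-ins written for illustration, not Tika classes.

```java
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a write-limited text handler; not Tika's own class.
class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit; // a negative value means "no limit" in this sketch

    LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit >= 0 && text.length() + length > writeLimit) {
            // Keep whatever still fits, then signal that the limit was hit.
            text.append(ch, start, writeLimit - text.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

class WriteLimitDemo {
    // Feeds `chunks` blocks of `chunkSize` characters into the handler and,
    // like the method above, swallows any exception before returning the text.
    static String extract(int writeLimit, int chunks, int chunkSize) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chunk = new char[chunkSize];
        Arrays.fill(chunk, 'x');
        try {
            for (int i = 0; i < chunks; i++) {
                handler.characters(chunk, 0, chunkSize);
            }
        } catch (Exception e) {
            // Printing the message makes the cause of the truncation visible.
            System.out.println("Exception caught: " + e.getMessage());
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        // 30 chunks of 5,000 characters are 150,000 characters of input.
        System.out.println(extract(100_000, 30, 5_000).length()); // prints 100000
        System.out.println(extract(-1, 30, 5_000).length());      // prints 150000
    }
}
```

If this is what happens here, constructing the handler as new BodyContentHandler(-1) disables the limit in Tika, and printing e.getMessage() in the catch block would show whether the limit was actually hit.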
Regards,
On Thu, Jun 8, 2017 at 4:29 PM, Nick Burch <[email protected]> wrote:
> On Thu, 8 Jun 2017, [email protected] wrote:
>
>> My Tika code is not extracting the full body text of larger PDF files.
>>
>> Files larger than 1 MB and around 20 pages long are only partially extracted.
>> Is there any limit on the input PDF file size in Tika?
>>
>
> How are you calling Apache Tika? Direct java calls to TikaConfig +
> AutoDetectParser? Using the Tika facade class? Using the Tika App on the
> command line? Tika Server? Other?
>
> Nick
>