Re: Problem when extracting text from a pdf file

Maruan Sahyoun Tue, 20 May 2014 23:52:32 -0700

Dear Mehmet,

did you supply the correct PDFs? I can manually copy and paste text from both, 
as well as extract the text from both using PDFBox.


BR

Maruan Sahyoun

On 20.05.2014 at 11:56, Mehmet Ali Abdulhayoglu 
<[email protected]> wrote:

> Dear Maruan,
> 
> Thanks for your reply. Below you can find the links to the PDF files. As 
> you state, I can manually copy and paste the text from the first PDF (dnm1), 
> while this is not possible for the second one (dnm2), which shows that the 
> latter contains no real text.
> 
> Are there any other ways to extract text from PDFs like dnm2?
> 
> dnm1.pdf:
> http://www.researchgate.net/publication/8333207_Olfactory_learning-induced_increase_in_spine_density_along_the_apical_dendrites_of_CA1_hippocampal_neurons/file/79e41503b71b66dabb.pdf
> 
> dnm2.pdf:
> http://www.researchgate.net/publication/222569912_Relationship_between_intercepted_radiation_net_photosynthesis_respiration_and_rate_of_stem_volume_growth_of_Pinus_taeda_and_Pinus_elliottii_stands_of_different_densities/file/9fcfd5064592b3d098.pdf
> 
> Regards,
> Mehmet
> 
> 
> 
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:[email protected]] 
> Sent: Friday 16 May 2014 10:20 AM
> To: [email protected]
> Subject: Re: Problem when extracting text from a pdf file
> 
> Hi Mehmet,
> 
> it could well be that text extraction works for one PDF and not for another, 
> because the second might not contain real text: what you see on screen is 
> drawn instead. As the attachments didn't make it through because of 
> restrictions on the mailing list, could you upload them to a public location 
> so we can take a look at the files and give an answer specific to your case?
> 
> BR
> 
> Maruan Sahyoun
> 
> On 14.05.2014 at 16:31, Mehmet Ali Abdulhayoglu 
> <[email protected]> wrote:
> 
>> Dear all,
>> 
>> As part of my research, I am trying to convert PDF files to text files. I 
>> have tried both iText and PDFBox, but I encounter the same issue with each.
>> 
>> When I try extracting text from the dnm1.pdf file (attached), both approaches 
>> work well. However, when I apply them to dnm2.pdf, they fail.
>> 
>> I get a text file full of NULL values. Is this normal for such 
>> differently shaped PDFs, or am I missing something else?
>> 
>> Thanks in advance.
>> 
>> Regards,
>> Mehmet
>> 
>> 
>> My code:
>> 
>> package retrievingfulltetxsfromweb;
>> 
>> import java.io.File;
>> import java.io.FileInputStream;
>> import org.apache.pdfbox.cos.COSDocument;
>> import org.apache.pdfbox.pdfparser.PDFParser;
>> import org.apache.pdfbox.pdmodel.PDDocument;
>> import org.apache.pdfbox.util.PDFTextStripper;
>> 
>> public class PdfBox {
>> 
>>     // Extract the text of pages 1 to 5 of a PDF document and print it
>>     public PdfBox(String fileName) {
>>         File file = new File(fileName);
>>         if (!file.isFile()) {
>>             System.err.println("File " + fileName + " does not exist.");
>>             return;
>>         }
>>         COSDocument cosDoc = null;
>>         PDDocument pdDoc = null;
>>         try {
>>             // Parse the file once and wrap the result in a PDDocument
>>             PDFParser parser = new PDFParser(new FileInputStream(file));
>>             parser.parse();
>>             cosDoc = parser.getDocument();
>>             pdDoc = new PDDocument(cosDoc);
>>             PDFTextStripper pdfStripper = new PDFTextStripper();
>>             pdfStripper.setStartPage(1);
>>             pdfStripper.setEndPage(5);
>>             String parsedText = pdfStripper.getText(pdDoc);
>>             System.out.println(parsedText);
>>         } catch (Exception e) {
>>             System.err.println("An exception occurred in parsing the PDF document: "
>>                     + e.getMessage());
>>         } finally {
>>             try {
>>                 // Closing the PDDocument also closes its COSDocument
>>                 if (pdDoc != null) {
>>                     pdDoc.close();
>>                 } else if (cosDoc != null) {
>>                     cosDoc.close();
>>                 }
>>             } catch (Exception e) {
>>                 e.printStackTrace();
>>             }
>>         }
>>     }
>> 
>>     public static void main(String[] args) {
>>         new PdfBox("C:/dnm1.pdf");
>>     }
>> 
>> }
>
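
A note on the symptom described in this thread: when a page's visible text is drawn rather than stored as text, or the font has no usable character mapping, extraction can return empty or unreadable output such as the NULL values mentioned above. The following is only a minimal sketch of one way to detect that case on an already extracted string, so such files can be set aside (for OCR, for example). The class name, method names, and the 0.5 threshold are hypothetical and are not part of PDFBox or iText.

```java
final class ExtractedTextCheck {

    // Hypothetical threshold: flag the output when fewer than half of its
    // non-whitespace characters are letters or digits.
    static final double MIN_READABLE_RATIO = 0.5;

    // Share of non-whitespace characters that are letters or digits.
    // Returns 0.0 for empty or whitespace-only input.
    static double readableRatio(String text) {
        int counted = 0;
        int readable = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            counted++;
            if (Character.isLetterOrDigit(cp)) {
                readable++;
            }
        }
        return counted == 0 ? 0.0 : (double) readable / counted;
    }

    // True when the extracted string looks like it holds no real text.
    static boolean looksLikeMissingText(String text) {
        return readableRatio(text) < MIN_READABLE_RATIO;
    }

    public static void main(String[] args) {
        String nulOutput = "\0\0\0\0 \0\0\0";
        String normalOutput = "Olfactory learning increases spine density.";
        System.out.println(looksLikeMissingText(nulOutput));
        System.out.println(looksLikeMissingText(normalOutput));
    }
}
```

The string returned by the text stripper could be passed to looksLikeMissingText before it is written to disk. The threshold is a guess and would need tuning on real documents, since tables or formula-heavy pages can also score low.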
