RE: Problem when extracting text from a pdf file

Mehmet Ali Abdulhayoglu Tue, 20 May 2014 02:57:08 -0700

Dear Maruan,

Thanks for your reply. Below you can find the links to the PDF files. 
As you suggested, I can manually copy and paste the text from the first PDF 
(dnm1), while this is not possible for the second one (dnm2), which indicates 
that the latter contains no real text.


Are there any other ways to extract text from PDFs like dnm2?

dnm1.pdf:
http://www.researchgate.net/publication/8333207_Olfactory_learning-induced_increase_in_spine_density_along_the_apical_dendrites_of_CA1_hippocampal_neurons/file/79e41503b71b66dabb.pdf

dnm2.pdf:
http://www.researchgate.net/publication/222569912_Relationship_between_intercepted_radiation_net_photosynthesis_respiration_and_rate_of_stem_volume_growth_of_Pinus_taeda_and_Pinus_elliottii_stands_of_different_densities/file/9fcfd5064592b3d098.pdf

Regards,
Mehmet




-----Original Message-----
From: Maruan Sahyoun [mailto:[email protected]] 
Sent: Friday 16 May 2014 10:20 AM
To: [email protected]
Subject: Re: Problem when extracting text from a pdf file

Hi Mehmet,

It could well be that text extraction works for one PDF and not for another, 
because the second might not contain real text: what you see on screen may 
simply be drawn. Since the attachments didn't make it through because of 
restrictions on the mailing list, could you upload them to a public location 
so we can take a look at the files and give an answer more specific to your case?
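
A quick way to tell the two cases apart in code is to inspect the string the extractor returns before writing it to a file. The following is only a minimal sketch of such a check: the class name `ExtractionCheck`, the method names, and the 0.9 threshold are illustrative assumptions, not part of PDFBox or iText. It treats output that is empty or consists mostly of NUL and other control characters as "no real text".

```java
// Hypothetical helper (not a PDFBox or iText API): a heuristic for deciding
// whether extracted text is usable or mostly NUL/control characters.
final class ExtractionCheck {

    // Fraction of characters that are whitespace or non-control characters.
    static double printableRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int printable = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || !Character.isISOControl(c)) {
                printable++;
            }
        }
        return (double) printable / text.length();
    }

    // True if the text is non-blank and at least `threshold` of it is printable.
    // The threshold is an assumed tuning parameter, e.g. 0.9.
    static boolean looksLikeRealText(String text, double threshold) {
        if (text == null || text.trim().isEmpty()) {
            return false; // trim() also strips NUL characters
        }
        return printableRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        String good = "Olfactory learning-induced increase in spine density";
        String bad = "\0\0\0\0\0\0\0\0";
        System.out.println(looksLikeRealText(good, 0.9));
        System.out.println(looksLikeRealText(bad, 0.9));
    }
}
```

A file that fails such a check most likely holds page content that is drawn rather than stored as text, so the extractor has nothing meaningful to return.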

BR

Maruan Sahyoun

On 14.05.2014 at 16:31, Mehmet Ali Abdulhayoglu 
<[email protected]> wrote:

> Dear all,
>  
> As part of my research, I am trying to convert PDF files to text files. I 
> have tried both iText and PDFBox, but I encounter the same issue with each.
>  
> When I extract text from dnm1.pdf (attached), both approaches work 
> well. However, when I apply them to dnm2.pdf, they fail.
>  
> I get a text file full of NULL values. Is this normal for PDFs of such a 
> different kind, or am I missing something?
>  
> Thanks in advance.
>  
> Regards,
> Mehmet
>  
>  
> My code:
>  
> package retrievingfulltetxsfromweb;
>  
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.util.PDFTextStripper;
>  
> public class PdfBox {
>  
>     // Extract the text of pages 1 to 5 of a PDF document and print it.
>     public PdfBox(String fileName) {
>         File file = new File(fileName);
>         if (!file.isFile()) {
>             System.err.println("File " + fileName + " does not exist.");
>             return;
>         }
>         PDDocument pdDoc = null;
>         try {
>             // PDDocument.load parses the file, so no separate PDFParser is needed
>             pdDoc = PDDocument.load(file);
>             PDFTextStripper pdfStripper = new PDFTextStripper();
>             pdfStripper.setStartPage(1);
>             pdfStripper.setEndPage(5);
>             String parsedText = pdfStripper.getText(pdDoc);
>             System.out.println(parsedText);
>         } catch (IOException e) {
>             System.err.println("An exception occurred in parsing the PDF document. "
>                     + e.getMessage());
>         } finally {
>             try {
>                 if (pdDoc != null) {
>                     // Closing the PDDocument also closes the underlying COSDocument
>                     pdDoc.close();
>                 }
>             } catch (IOException e) {
>                 e.printStackTrace();
>             }
>         }
>     }
>  
>     public static void main(String[] args) {
>         new PdfBox("C:/dnm1.pdf");
>     }
> }
