Dear Mehmet,

did you supply the correct PDFs? I can manually copy and paste text from both, and I can also extract the text from both using PDFBox.
BR
Maruan Sahyoun

On 20.05.2014 at 11:56, Mehmet Ali Abdulhayoglu <[email protected]> wrote:

> Dear Maruan,
>
> Thanks for your reply. Below you can find the links to the PDF files.
> As you state, I can manually copy and paste the text from the first PDF
> (dnm1), while this is not possible for the second one (dnm2), which
> suggests that the latter contains no real text.
>
> Are there any other ways to extract text from PDFs like dnm2?
>
> dnm1.pdf:
> http://www.researchgate.net/publication/8333207_Olfactory_learning-induced_increase_in_spine_density_along_the_apical_dendrites_of_CA1_hippocampal_neurons/file/79e41503b71b66dabb.pdf
>
> dnm2.pdf:
> http://www.researchgate.net/publication/222569912_Relationship_between_intercepted_radiation_net_photosynthesis_respiration_and_rate_of_stem_volume_growth_of_Pinus_taeda_and_Pinus_elliottii_stands_of_different_densities/file/9fcfd5064592b3d098.pdf
>
> Regards,
> Mehmet
>
> -----Original Message-----
> From: Maruan Sahyoun [mailto:[email protected]]
> Sent: Friday 16 May 2014 10:20 AM
> To: [email protected]
> Subject: Re: Problem when extracting text from a pdf file
>
> Hi Mehmet,
>
> it could well be that text extraction works for one PDF and doesn't for
> another, as the second might not contain real text; what you see on screen
> may simply be drawn. As the attachments didn't make it through because of
> restrictions on the mailing list, could you upload them to a public
> location so we can take a look at the files and the answer can be more
> specific for your case?
>
> BR
>
> Maruan Sahyoun
>
> On 14.05.2014 at 16:31, Mehmet Ali Abdulhayoglu <[email protected]> wrote:
>
>> Dear all,
>>
>> As part of my research, I am trying to convert PDF files to text files.
>> I have tried both iText and PDFBox, but I encounter the same issue with
>> each.
>>
>> When I try extracting text from the dnm1.pdf file (attached), both
>> approaches work well. However, when I apply them to dnm2.pdf, they fail.
>>
>> I get a text file full of NULL values. Is this normal for such
>> differently shaped PDFs, or am I missing something else?
>>
>> Thanks in advance.
>>
>> Regards,
>> Mehmet
>>
>>
>> My code:
>>
>> package retrievingfulltetxsfromweb;
>>
>> import connectingurl.PlacesApi;
>>
>> import java.io.File;
>> import java.io.FileInputStream;
>> import java.io.IOException;
>> import org.apache.pdfbox.cos.COSDocument;
>> import org.apache.pdfbox.pdfparser.PDFParser;
>> import org.apache.pdfbox.pdmodel.PDDocument;
>> import org.apache.pdfbox.util.PDFTextStripper;
>>
>> public class PdfBox {
>>
>>     // Extract text from PDF Document
>>     public PdfBox(String fileName) {
>>         //PDFParser parser = new PDFParser();
>>         String parsedText = null;
>>         PDFTextStripper pdfStripper = null;
>>         PDDocument pdDoc = null;
>>         COSDocument cosDoc = null;
>>         File file = new File(fileName);
>>         if (!file.isFile()) {
>>             System.err.println("File " + fileName + " does not exist.");
>>             //return null;
>>         }
>>         try {
>>             PDFParser parser = new PDFParser(new FileInputStream(file));
>>         } catch (IOException e) {
>>             System.err.println("Unable to open PDF Parser. "
>>                     + e.getMessage());
>>             //return null;
>>         }
>>         try {
>>             PDFParser parser = new PDFParser(new FileInputStream(file));
>>             parser.parse();
>>             cosDoc = parser.getDocument();
>>             pdfStripper = new PDFTextStripper();
>>             pdDoc = new PDDocument(cosDoc);
>>             pdfStripper.setStartPage(1);
>>             pdfStripper.setEndPage(5);
>>             parsedText = pdfStripper.getText(pdDoc);
>>             System.out.println(parsedText);
>>         } catch (Exception e) {
>>             System.err.println("An exception occurred in parsing the PDF Document. "
>>                     + e.getMessage());
>>         } finally {
>>             try {
>>                 if (cosDoc != null)
>>                     cosDoc.close();
>>                 if (pdDoc != null)
>>                     pdDoc.close();
>>             } catch (Exception e) {
>>                 e.printStackTrace();
>>             }
>>         }
>>         //return parsedText;
>>     }
>>
>>     public static void main(String args[]) {
>>         PdfBox pdf = new PdfBox("C:/dnm1.pdf");
>>         // System.out.println(pdftoText("C:/dnm1.pdf"));
>>     }
>>
>> }
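The symptom reported in this thread is an output file full of NULL values. When converting many PDFs in a batch, it may help to flag such results automatically instead of inspecting each text file by hand. The following is only a minimal sketch, not part of PDFBox or iText: the class name ExtractionCheck, both method names, and the 0.9 threshold are hypothetical choices made for illustration. It takes the string returned by a text extractor and checks what share of its characters are not control characters.

```java
// Hypothetical helper, not part of PDFBox or iText.
final class ExtractionCheck {

    // Fraction of characters that are not control characters.
    // Whitespace such as newlines and tabs counts as printable;
    // NUL (\u0000) and other control characters do not.
    static double printableRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int printable = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || !Character.isISOControl(c)) {
                printable++;
            }
        }
        return (double) printable / text.length();
    }

    // True if the text is non-blank and mostly printable.
    // minRatio is an assumed cut-off; tune it for your own corpus.
    static boolean looksLikeRealText(String text, double minRatio) {
        return text != null
                && !text.trim().isEmpty()
                && printableRatio(text) >= minRatio;
    }

    public static void main(String[] args) {
        String good = "Olfactory learning-induced increase in spine density\n";
        String bad = "\u0000\u0000\u0000\u0000";
        System.out.println("good: " + looksLikeRealText(good, 0.9));
        System.out.println("bad:  " + looksLikeRealText(bad, 0.9));
    }
}
```

A file that fails this check is a candidate for closer inspection, for example to see whether copy and paste works in a PDF viewer, as discussed above.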

