Yes, I can access the file using the CLI.

On Fri, Sep 28, 2018 at 1:24 PM kathleen li <kathleenli...@gmail.com> wrote:
> The error message is “file not found”
> Are you able to use the following command line to access the file with the
> user you submitted the job?
> hdfs dfs -ls /tmp/sample.pdf
>
> Sent from my iPhone
>
> On Sep 28, 2018, at 12:10 PM, Joel D <games2013....@gmail.com> wrote:
>
> I'm trying to extract text from pdf files in hdfs using pdfBox.
>
> However it throws an error:
>
> "Exception in thread "main" org.apache.spark.SparkException: ...
> java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
> (No such file or directory)"
>
> What am I missing? Should I be working with PortableDataStream instead of
> the string part of:
>
> val files: RDD[(String, PortableDataStream)]?
>
> def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession:
> SparkSession) = {
>   val file: File = new File(fileNameFromRDD._1.drop(5))
>   val document = PDDocument.load(file); //It throws an error here.
>
>   if (!document.isEncrypted()) {
>     val stripper = new PDFTextStripper()
>     val text = stripper.getText(document)
>     println("Text:" + text)
>   }
>   document.close()
> }
>
> //This is where I call the above pdf to text converter method.
> val files =
> sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
>
> files.foreach(println)
>
> files.foreach(f => println(f._1))
>
> files.foreach(fileStream => pdfRead(fileStream, sparkSession))
>
> Thanks.
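
One possible reading of the stack trace: java.io.File only resolves paths on the local filesystem, so dropping the "hdfs:" prefix leaves "/nnAlias:8020/tmp/sample.pdf", which does not exist on local disk even though "hdfs dfs -ls" succeeds. Below is a minimal sketch of reading through the PortableDataStream instead. It assumes PDFBox 2.x (where PDDocument.load accepts an InputStream and PDFTextStripper lives in org.apache.pdfbox.text); the names PdfTextSketch and pdfText are hypothetical, and it has not been run against this cluster.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

object PdfTextSketch {
  // Hypothetical helper: reads the PDF through the HDFS-backed stream
  // rather than java.io.File, which only sees the local filesystem.
  def pdfText(entry: (String, PortableDataStream)): Option[String] = {
    val in = entry._2.open() // DataInputStream over the HDFS file
    try {
      val document = PDDocument.load(in) // assumption: PDFBox 2.x
      try {
        // Same behaviour as the original: skip encrypted documents.
        if (!document.isEncrypted()) Some(new PDFTextStripper().getText(document))
        else None
      } finally document.close()
    } finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-text-sketch").getOrCreate()
    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
    // foreach runs on the executors, so the println output goes to executor logs.
    files.foreach(f => pdfText(f).foreach(text => println("Text:" + text)))
    spark.stop()
  }
}
```

The helper takes no SparkSession, so nothing non-serializable is captured by the closure passed to foreach.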