Yes, I can access the file using the CLI.
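
Since the file is visible through the CLI, the path handed to java.io.File looks like the more likely problem. Here is a small sketch of what drop(5) does to the key that binaryFiles returns. It is plain JVM code and needs no Spark; the object name PathDemo is only for illustration.

```scala
import java.io.File
import java.net.URI

object PathDemo {
  // Key that binaryFiles returns for the file in this thread.
  val key = "hdfs://nnAlias:8020/tmp/sample.pdf"

  // What the posted code passes to java.io.File:
  // the key minus its first five characters ("hdfs:").
  val localGuess: String = key.drop(5)

  // Parsing the key as a URI keeps scheme, host, port and path separate.
  val uri = new URI(key)

  def main(args: Array[String]): Unit = {
    println(localGuess)     // //nnAlias:8020/tmp/sample.pdf
    println(uri.getScheme)  // hdfs
    println(uri.getHost)    // nnAlias
    println(uri.getPort)    // 8020
    println(uri.getPath)    // /tmp/sample.pdf

    // java.io.File only looks at the local filesystem of the JVM it runs in,
    // so the NameNode host and port end up treated as a directory name.
    println(new File(localGuess).exists())
  }
}
```

On a Unix-like system java.io.File collapses the doubled slash, which would explain the /nnAlias:8020/tmp/sample.pdf path in the exception. Even with the host and port stripped correctly, java.io.File would still look on the local disk rather than in HDFS.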

On Fri, Sep 28, 2018 at 1:24 PM kathleen li <kathleenli...@gmail.com> wrote:

> The error message is “file not found”.
> Are you able to use the following command line to access the file with the
> user you submitted the job as?
> hdfs dfs -ls /tmp/sample.pdf
>
> On Sep 28, 2018, at 12:10 PM, Joel D <games2013....@gmail.com> wrote:
>
> I'm trying to extract text from PDF files in HDFS using PDFBox.
>
> However, it throws an error:
>
> "Exception in thread "main" org.apache.spark.SparkException: ...
> java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf
> (No such file or directory)"
>
> What am I missing? Should I be working with PortableDataStream instead of
> the string part of:
>
> val files: RDD[(String, PortableDataStream)]?
>
> def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
>   val file: File = new File(fileNameFromRDD._1.drop(5))
>   val document = PDDocument.load(file) // It throws an error here.
>   if (!document.isEncrypted()) {
>     val stripper = new PDFTextStripper()
>     val text = stripper.getText(document)
>     println("Text:" + text)
>   }
>   document.close()
> }
>
>
> // This is where I call the above pdf to text converter method.
> val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
> files.foreach(println)
> files.foreach(f => println(f._1))
> files.foreach(fileStream => pdfRead(fileStream, sparkSession))
>
> Thanks.

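One possible way around the FileNotFoundException quoted above is to read through the PortableDataStream half of each pair instead of building a java.io.File from the string half. The sketch below assumes PDFBox 2.x, where PDDocument.load accepts an InputStream and PDFTextStripper lives in org.apache.pdfbox.text. The object name, the application name, and the Option return type are illustrative choices, not part of the original code.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

object PdfTextFromHdfs {
  // Reads the PDF through the stream Spark provides, so the bytes come
  // from HDFS rather than from the executor's local disk.
  def pdfRead(stream: PortableDataStream): Option[String] = {
    val in = stream.open() // DataInputStream over the HDFS file
    try {
      val document = PDDocument.load(in)
      try {
        if (document.isEncrypted()) None
        else Some(new PDFTextStripper().getText(document))
      } finally document.close()
    } finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-text").getOrCreate()
    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // Extraction runs on the executors; only (path, text) pairs
    // are collected back to the driver for printing.
    val texts = files.map { case (path, stream) => (path, pdfRead(stream)) }
    texts.collect().foreach { case (path, text) =>
      println(s"$path: ${text.getOrElse("<encrypted, skipped>")}")
    }

    spark.stop()
  }
}
```

The SparkSession is deliberately not passed into pdfRead here: the function runs inside an RDD closure on the executors, and it does not need the session to read the stream.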