Hi, I have put a million files into a HAR archive on HDFS. I'd like to iterate over their file paths and read them (basically they are PDFs, and I want to transform them into text with Apache PDFBox).
My first attempt was to list them with the hadoop command `hdfs dfs -ls har:///user/<my_user>/har/pdf.har`, and this works fine. However, when I try to replicate this in Spark, I get an error:

```
val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)
val hdfs = FileSystem.get(hconf)
val test = hdfs.listFiles(new Path("har:///user/<my_user>/har/pdf.har"), false)

java.lang.IllegalArgumentException: Wrong FS: har:/user/<my_user>/har/pdf.har, expected: hdfs://<my_cluster>:<my_port>
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:661)
```

On the other hand, I was able to use `sc.textFile` without any problem:

```
val test = sc.textFile("har:///user/<my_user>/har/pdf.har").count
80000000
```

--------------------------------------------------------------

1) Is this easily solvable?
2) Do I need to implement my own pdfFile reader, inspired by textFile?
3) If not, is HAR the best way to go? I have been looking at Avro too.

Thanks for any advice,

-- Nicolas
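P.S. A sketch of what I plan to try next, on the assumption that the `Wrong FS` error comes from `FileSystem.get(hconf)` returning the cluster's default (hdfs://) filesystem rather than a HarFileSystem; resolving the filesystem from the `har://` path itself should avoid that (variable names like `harPath` are mine, untested):

```
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.deploy.SparkHadoopUtil

val hconf = SparkHadoopUtil.get.newConfiguration(sc.getConf)

// Resolve the filesystem from the path's scheme (har://) instead of
// taking the default filesystem with FileSystem.get(hconf).
val harPath = new Path("har:///user/<my_user>/har/pdf.har")
val harFs: FileSystem = harPath.getFileSystem(hconf)

// Recursively list the files inside the archive.
val it = harFs.listFiles(harPath, true)
while (it.hasNext) println(it.next().getPath)
```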
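P.P.S. Regarding question 2, the direction I had in mind rather than a custom pdfFile reader, assuming `sc.binaryFiles` accepts a `har://` URI the same way `textFile` does (untested sketch, PDFBox 2.x API):

```
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Each record is (path, PortableDataStream); open the stream and let
// PDFBox extract the text of the PDF.
val texts = sc.binaryFiles("har:///user/<my_user>/har/pdf.har")
  .map { case (path, stream) =>
    val in = stream.open()
    try {
      val doc = PDDocument.load(in)
      try (path, new PDFTextStripper().getText(doc))
      finally doc.close()
    } finally in.close()
  }
```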