Hello, I have managed to speed up the read stage when loading CSV files using the classic newAPIHadoopFile method. The issue is that I would like to use the spark-csv package, and it seems that it does not take the LZO index file into account, so the reads are not splittable.
// Using the classic method the read is fully parallelized (splittable)
sc.newAPIHadoopFile("/user/sy/data.csv.lzo", .... ).count

// When spark-csv is used the file is read from only one node (no splittable reads)
sqlContext.read.format("com.databricks.spark.csv")
  .options(Map("path" -> "/user/sy/data.csv.lzo", "header" -> "true", "inferSchema" -> "false"))
  .load()
  .count()

Does anyone know if this is currently supported?
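In case it helps the discussion, below is a minimal workaround sketch: do the splittable read yourself through hadoop-lzo's LzoTextInputFormat (which consults the .index file and produces one split per block), then hand the resulting RDD[String] to spark-csv for parsing. This assumes hadoop-lzo is on the classpath and that your spark-csv version exposes the CsvParser.csvRdd entry point; I have not verified exactly which version added that method, so treat it as an assumption.

import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat   // from the hadoop-lzo library
import com.databricks.spark.csv.CsvParser

// Splittable read: LzoTextInputFormat reads data.csv.lzo.index and
// creates one task per indexed block instead of one task for the file.
val lines = sc.newAPIHadoopFile(
    "/user/sy/data.csv.lzo",
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text]
  ).map { case (_, text) => text.toString }

// Hand the already-parallelized lines to spark-csv for CSV parsing.
// (Assumes CsvParser.csvRdd exists in your spark-csv version; header
// handling on an RDD input may also need checking.)
val df = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(false)
  .csvRdd(sqlContext, lines)

df.count()

Compared with the plain DataFrameReader path, this keeps the block-level parallelism from the index file while still using spark-csv's parser, at the cost of one extra map over the raw lines.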