We ran into a problem loading a huge number of small files (hundreds of thousands) from HDFS in Spark: our jobs kept failing over time.
This forced us to write our own loader that combines files using Hadoop's CombineFileInputFormat. It significantly reduced the number of mappers, from 100000 down to about 3000. We have released it as an open-source library: https://github.com/RetailRocket/SparkMultiTool

Example:

import ru.retailrocket.spark.multitool.Loaders

val sessions = Loaders.combineTextFile(sc, "file:///test/*")
// or
val sessions = Loaders.combineTextFile(sc, conf.weblogs(), size = 256, delim = "\n")
// where size is the split size in megabytes and delim is the line-break character

println(sessions.count())

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Solution-for-small-files-in-HDFS-tp15477.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
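For readers who want to see the underlying idea without pulling in the library, a similar effect can be sketched with Hadoop's built-in CombineTextInputFormat (available in the mapreduce API since Hadoop 2.x) through SparkContext.newAPIHadoopFile. This is a sketch, not the SparkMultiTool implementation; the input path and the 256 MB split cap below are assumptions for illustration, and `sc` is an existing SparkContext:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

// Hypothetical input directory with many small files
val path = "hdfs:///test/*"

// Cap each combined split at 256 MB (the property takes a value in bytes),
// so many small files are packed into one input split / mapper task.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

val sessions = sc
  .newAPIHadoopFile(path,
    classOf[CombineTextInputFormat],
    classOf[LongWritable],   // byte offset key, ignored below
    classOf[Text])           // line contents
  .map { case (_, line) => line.toString }

println(sessions.count())
```

The number of resulting partitions (and hence mapper tasks) is roughly total input size divided by the configured split cap, which is how a run over hundreds of thousands of small files collapses into a few thousand tasks.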