We ran into a problem loading a huge number of small files (hundreds of thousands) from HDFS into Spark: our jobs kept failing after a while.

This forced us to write our own loader that combines small files using Hadoop's
CombineFileInputFormat.
It significantly reduced the number of mappers, from about 100000 down to about 3000.

We have released it as an open-source library:

https://github.com/RetailRocket/SparkMultiTool

Example:

import ru.retailrocket.spark.multitool.Loaders

val sessions = Loaders.combineTextFile(sc, "file:///test/*")
// or: val sessions = Loaders.combineTextFile(sc, conf.weblogs(), size = 256, delim = "\n")
// where size is the split size in megabytes and delim is the record delimiter

println(sessions.count())
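
For reference, the same combining idea can be sketched directly against the stock
Hadoop API, which is roughly what the loader builds on. This is only a minimal
sketch, assuming Hadoop 2.x; the path and the 256 MB split cap below are
illustrative, not values from the library:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

// copy the context's Hadoop configuration and cap each combined split at ~256 MB
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

// CombineTextInputFormat packs many small files into each split, so the number
// of tasks is driven by the split size rather than by the file count
val lines = sc.newAPIHadoopFile(
  "hdfs:///test/*",
  classOf[CombineTextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf
).map(_._2.toString)   // convert Text to String right away, since Hadoop reuses the objects

println(lines.count())

The library's combineTextFile call shown above packages this kind of setup (plus the
split size and delimiter options) behind a single method.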




