You could use combineTextFile from
https://github.com/RetailRocket/SparkMultiTool
It combines input files before the mappers by means of Hadoop's
CombineFileInputFormat. In our case it reduced the number of mappers from
about 100,000 to approximately 3,000 and made the job significantly faster.
Example:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import ru.retailrocket.spark.multitool.Loaders._

object Tst {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)
    val path = "file:///test/*"
    // combineTextFile merges many small files into larger input splits,
    // so far fewer map tasks are launched
    val sessions = sc.combineTextFile(path)
    // or: val sessions = sc.combineTextFile(path, size = 256, delim = "\n")
    // where size is the split size in megabytes and delim is the line break string
    println(sessions.count())
    sc.stop()
  }
}
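If you want to check how many splits were actually produced (just an
illustrative sanity check using the standard RDD API, not something from
SparkMultiTool itself), you can print the partition count; it should be
close to the number of map tasks you see in the UI:

    // number of partitions = number of input splits / map tasks
    println(sessions.partitions.length)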