You could use combineTextFile from
https://github.com/RetailRocket/SparkMultiTool
It combines input files before the mappers by means of Hadoop's
CombineFileInputFormat. In our case it reduced the number of mappers from
about 100,000 to approximately 3,000 and made the job significantly faster.
Example:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import ru.retailrocket.spark.multitool.Loaders._

object Tst {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("My App")
    val sc = new SparkContext(conf)
    val path = "file:///test/*"
    // combineTextFile merges many small files into larger input splits,
    // so far fewer map tasks are launched
    val sessions = sc.combineTextFile(path)
    // or: val sessions = sc.combineTextFile(path, size = 256, delim = "\n")
    // where size is the split size in megabytes and delim is the line break string
    println(sessions.count())
    sc.stop()
  }
}
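If you want to check how many splits were actually produced (just an
illustrative sanity check using the standard RDD API, not something from
SparkMultiTool itself), you can print the partition count; it should be
close to the number of map tasks you see in the UI:

    // number of partitions = number of input splits / map tasks
    println(sessions.partitions.length)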