Spark is a framework that makes it easy to do things in parallel; it should definitely help with your case.
import bz2

def read_file(path):
    # the logs are bzip2-compressed, so decompress while reading
    with bz2.open(path, "rt") as f:
        return f.readlines()

filesRDD = sc.parallelize(path_to_files, N)
lines = filesRDD.flatMap(read_file)

Then you could do other transforms on lines (a fuller end-to-end sketch follows below the quoted message).

On Tue, Feb 10, 2015 at 12:32 PM, soupacabana <eiersalat...@gmail.com> wrote:
> Hi all,
>
> I have the following use case:
> One job consists of reading 500-2000 small bzipped logs that sit on an
> NFS share.
> (Small means the zipped logs are between 0 and 100KB; the average file
> size is 20KB.)
>
> We read the log lines, do some transformations, and write them to one
> output file.
>
> When we do it in pure Python (running the Python script on one core):
> - the time for 500 bzipped log files (6.5MB altogether) is about 5 seconds.
> - the time for 2000 bzipped log files (25MB altogether) is about 20 seconds.
>
> Because there will be many such jobs, I was thinking of trying Spark for
> that purpose.
> My preliminary findings and my questions:
>
> * Even just counting the number of log lines with Spark is about 10 times
> slower than the entire transformation done by the Python script.
> * sc.textFile(list_of_filenames) appears not to perform well on small
> files. Why?
> * sc.wholeTextFiles(path_to_files) performs better than sc.textFile, but
> does not support bzipped files. However, even wholeTextFiles does not come
> close to the speed of the Python script.
>
> * The initialization of a SparkContext takes about 4 seconds. Sending a
> Spark job to a cluster takes even longer. Is there a way to decrease this
> initialization phase?

The JVM takes about 4 seconds to start up, but a task takes only about 0.1
seconds to launch.

> * Is my use case actually an appropriate use case for Spark?
>
> Many thanks for your help and comments!
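Since each of your jobs ends in a single output file, here is a minimal end-to-end sketch of the approach above. The file list, transform_line, and the output path are hypothetical placeholders for your own paths and transformation; coalesce(1) collapses everything into one partition so Spark writes a single part file:

import bz2
from pyspark import SparkContext

sc = SparkContext(appName="small-log-merge")

# hypothetical list of input logs on the NFS share
path_to_files = ["/nfs/logs/a.log.bz2", "/nfs/logs/b.log.bz2"]

def read_file(path):
    # decompress one bzipped log and return its lines
    with bz2.open(path, "rt") as f:
        return f.readlines()

def transform_line(line):
    # placeholder for your real transformation
    return line.strip().upper()

(sc.parallelize(path_to_files, 8)
   .flatMap(read_file)
   .map(transform_line)
   .coalesce(1)                     # one partition -> one output part file
   .saveAsTextFile("/nfs/logs/merged"))

Note that saveAsTextFile writes a directory containing a part-00000 file rather than a single flat file, and that keeping one SparkContext alive across many such jobs avoids paying the ~4 second JVM startup for each one.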