I had a similar use case before. I found:

1. textFile() produced one partition per file, which can leave you with a very large number of tiny partitions. Calling coalesce() without a shuffle helped.
2. If you use persist(), count() will do the I/O and put the data into the cache; later transformations then compute from the in-memory cache, which can be much faster. And, in general, small files hurt I/O performance. (A short sketch combining both points appears at the end of this message, after the quoted thread.)

On Tue, Feb 10, 2015 at 12:52 PM, Davies Liu <dav...@databricks.com> wrote:
> Spark is a framework that makes it very easy to do things in parallel; it will definitely help your case.
>
> import bz2
>
> def read_file(path):
>     # decompress one bzipped log and return its lines
>     with bz2.BZ2File(path) as f:
>         return f.readlines()
>
> filesRDD = sc.parallelize(path_to_files, N)
> lines = filesRDD.flatMap(read_file)
>
> Then you could do other transformations on lines.
>
> On Tue, Feb 10, 2015 at 12:32 PM, soupacabana <eiersalat...@gmail.com> wrote:
> > Hi all,
> >
> > I have the following use case:
> > One job consists of reading 500-2000 small bzipped logs that sit on an NFS mount.
> > (Small means the zipped logs are between 0 and 100 KB; the average file size is 20 KB.)
> >
> > We read the log lines, do some transformations, and write them to one output file.
> >
> > When we do it in pure Python (running the Python script on one core):
> > - the time for 500 bzipped log files (6.5 MB altogether) is about 5 seconds.
> > - the time for 2000 bzipped log files (25 MB altogether) is about 20 seconds.
> >
> > Because there will be many such jobs, I was thinking of trying Spark for that purpose.
> > My preliminary findings and my questions:
> >
> > * Even just counting the number of log lines with Spark is about 10 times slower than the entire transformation done by the Python script.
> > * sc.textFile(list_of_filenames) appears not to perform well on small files. Why?
> > * sc.wholeTextFiles(path_to_files) performs better than sc.textFile, but does not support bzipped files. Even so, wholeTextFiles does not come close to the speed of the Python script.
> >
> > * Initializing a SparkContext takes about 4 seconds. Sending a Spark job to a cluster takes even longer. Is there a way to shorten this initialization phase?
>
> The JVM takes about 4 seconds to start up, but a task takes only about 0.1 second to start.
>
> > * Is my use case actually an appropriate use case for Spark?
> >
> > Many thanks for your help and comments!
> >
> > --
> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-very-small-files-appropriate-use-case-tp21583.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
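For completeness, here is the small sketch referenced in point 2 above, combining both points. It assumes sc is an existing SparkContext (as in the quoted snippet); the file list, the target partition count of 8, the placeholder transformation, and the output path are only illustrative and not from this thread:

    # Minimal sketch of points 1 and 2 above; sc is an existing SparkContext.
    from pyspark import StorageLevel

    path_to_files = ["logs/a.bz2", "logs/b.bz2"]  # placeholder list of bzipped log paths

    # textFile accepts a comma-separated list of paths and gives one partition
    # per small file, so 2000 files means 2000 tiny partitions.
    lines = sc.textFile(",".join(path_to_files))

    # Point 1: collapse the tiny partitions without triggering a shuffle.
    lines = lines.coalesce(8, shuffle=False)

    # Point 2: cache, then force the NFS I/O once with count(); later
    # transformations read from the in-memory cache instead of re-reading files.
    lines.persist(StorageLevel.MEMORY_ONLY)
    print(lines.count())

    # Placeholder transformation, then write a single output file.
    result = lines.map(lambda line: line.strip())
    result.coalesce(1).saveAsTextFile("transformed_logs")

For inputs of only a few tens of MB, the JVM startup and task scheduling overhead mentioned in the quoted messages may still dominate the total runtime.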