I had a similar use case before. I found:

1. textFile() produced one partition per file, which can result in a very
large number of partitions. I found that calling coalesce() without a
shuffle helped (see the sketch below).

2. If you call persist() first, count() will do the I/O and put the data
into the cache. Later transformations then compute from the in-memory
cache, which can be much faster.

And, in general, small files hurt I/O performance.
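
A minimal sketch of both points in PySpark (the partition count of 32 is
just an illustrative value, not something tuned for your data):

lines = sc.textFile(",".join(path_to_files))  # one partition per file
lines = lines.coalesce(32)  # no shuffle by default; merges the many tiny partitions
lines.persist()             # keep the lines in memory once computed
lines.count()               # the first action does the I/O and fills the cache
# later transformations now read from the in-memory cache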

On Tue, Feb 10, 2015 at 12:52 PM, Davies Liu <dav...@databricks.com> wrote:

> Spark is a framework for doing things in parallel very easily; it will
> definitely help with your use case.
>
> import bz2
>
> def read_file(path):
>     # the logs are bzip2-compressed; decompress while reading
>     with bz2.open(path, 'rt') as f:
>         return f.readlines()
>
> filesRDD = sc.parallelize(path_to_files, N)
> lines = filesRDD.flatMap(read_file)
>
> Then you could do other transforms on lines.
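>
> For example (transform_line is just a hypothetical placeholder for your
> per-line transformation, and the output path is made up):
>
> out = lines.map(transform_line)
> out.coalesce(1).saveAsTextFile("/path/to/output")  # one partition -> a single part file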
>
> On Tue, Feb 10, 2015 at 12:32 PM, soupacabana <eiersalat...@gmail.com>
> wrote:
> > Hi all,
> >
> > I have the following use case:
> > One job consists of reading 500-2000 small bzipped logs that are on an
> > NFS share. (Small means the zipped logs are between 0 and 100 KB; the
> > average file size is 20 KB.)
> >
> > We read the log lines, do some transformations, and write them to one
> > output file.
> >
> > When we do it in pure Python (running the Python script on one core):
> > - the time for 500 bzipped log files (6.5 MB altogether) is about 5 seconds.
> > - the time for 2000 bzipped log files (25 MB altogether) is about 20 seconds.
> >
> > Because there will be many such jobs, I was thinking of trying Spark for
> > that purpose.
> > My preliminary findings and my questions:
> >
> > * Even just counting the number of log lines with Spark is about 10 times
> > slower than the entire transformation done by the Python script.
> > * sc.textFile(list_of_filenames) appears not to perform well on small
> > files. Why is that?
> > * sc.wholeTextFiles(path_to_files) performs better than sc.textFile, but
> > it does not support bzipped files. And even wholeTextFiles does not come
> > close to the speed of the Python script.
> >
> > * The initialization of a SparkContext takes about 4 seconds. Sending a
> > Spark job to a cluster takes even longer. Is there a way to shorten this
> > initialization phase?
>
> The JVM takes about 4 seconds to start up, but a task takes only about
> 0.1 second to start.
>
> > * Is my use case actually an appropriate use case for Spark?
> >
> > Many thanks for your help and comments!