Spark makes it easy to parallelize this kind of work, so it should
help with your use case.

import bz2

def read_file(path):
    # a plain open() would return the raw compressed bytes;
    # bz2.open in text mode decompresses and decodes transparently
    with bz2.open(path, "rt") as f:
        return f.readlines()

filesRDD = sc.parallelize(path_to_files, N)
lines = filesRDD.flatMap(read_file)

Then you could do other transforms on lines.
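To see why the reader needs the bz2 module: opening a .bz2 file with plain open() yields compressed bytes, not log lines. A minimal local check of the reader, with a made-up file name and contents just for illustration (no Spark needed for this part):

```python
import bz2
import os
import tempfile

# Hypothetical setup: write one small bzip2-compressed log to disk
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.log.bz2")
with bz2.open(path, "wt") as f:
    f.write("line one\nline two\n")

def read_file(path):
    # text mode ("rt") decompresses and decodes in one step
    with bz2.open(path, "rt") as f:
        return f.readlines()

lines = read_file(path)
print(lines)  # ['line one\n', 'line two\n']
```

The same function works unchanged as the flatMap argument above, since each worker just receives a path string.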

On Tue, Feb 10, 2015 at 12:32 PM, soupacabana <eiersalat...@gmail.com> wrote:
> Hi all,
>
> I have the following use case:
> One job consists of reading from 500-2000 small bzipped logs that are on an
> nfs.
> (Small means, that the zipped logs are between 0-100KB, average file size is
> 20KB.)
>
> We read the log lines, do some transformations, and write them to one output
> file.
>
> When we do it in pure Python (running the Python script on one core):
> -the time for 500 bzipped log files (6.5MB altogether) is about 5 seconds.
> -the time for 2000 bzipped log files (25MB altogether) is about 20 seconds.
>
> Because there will be many such jobs, I was thinking of trying Spark for
> that purpose.
> My preliminary findings and my questions:
>
> *Even only counting the number of log lines with Spark is about 10 times
> slower than the entire transformation done by the Python script.
> *sc.textFile(list_of_filenames) appears not to perform well on small files.
> Why?
> *sc.wholeTextFiles(path_to_files) performs better than sc.textFile, but does
> not support bzipped files. However, wholeTextFiles also does not come close
> to the speed of the Python script.
>
> *The initialization of a Spark Context takes about 4 seconds. Sending a
> Spark job to a cluster takes even longer. Is there a way to decrease this
> initialization phase?

The JVM takes about 4 seconds to start up, but each task takes only about
0.1 seconds to launch.

> *Is my use case actually an appropriate use case for Spark?
>
> Many thanks for your help and comments!
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-very-small-files-appropriate-use-case-tp21583.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
