Maybe it's just my phone, but I don't see any code. On Sep 22, 2015 11:46 AM, "juljoin" <juliende...@hotmail.com> wrote:
> Hello,
>
> I am trying to figure Spark out and I still have some problems with its
> speed that I can't work out. In short, I wrote two programs that loop
> through a 3.8 GB file and filter each line depending on whether a certain
> word is present.
>
> I wrote a single-threaded Python program doing the job and I obtain:
> - for the 3.8 GB file:
>     lines found: 82100
>     in: *10.54 seconds*
> - no filter, just looping through the file:
>     in: 1.65 seconds
>
> The Spark app doing the same, executed on 8 threads, gives:
> - for the 3.8 GB file:
>     lines found: 82100
>     in: *18.27 seconds*
> - for a 38 MB file:
>     lines found: 821
>     in: 2.53 seconds
>
> I must be doing something wrong to obtain a result twice as slow on 8
> threads as on 1 thread.
>
> 1. First, I thought it might be because of Spark's set-up cost. But for
> smaller files it only takes 2 seconds, which makes this option improbable.
> 2. Looping through the file takes 1.65 seconds (thank you, SSD ^_^);
> processing takes up the other 9 seconds (for the Python app).
> -> This is why I thought splitting the work across processes would
> definitely speed it up.
>
> Note: Increasing the number of threads in Spark improves the speed (from
> 57 seconds with 1 thread to 18 seconds with 8 threads). But still, there
> is a big difference in performance between plain Python and Spark; it
> must be my doing!
>
> Can someone point out what I am doing wrong? That would be greatly
> appreciated :) I am new to all this big data stuff.
>
> *Here is the code for the Spark app:*
>
> *And the python code:*
>
> Thank you for reading up to this point :)
>
> Have a nice day!
>
> - Julien
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-job-in-local-is-slower-than-regular-1-thread-Python-program-tp24771.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
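For reference, since the poster's code did not come through, here is a minimal sketch of what the two programs described above might look like. This is not the original code; the function names, the word being searched for, and the file path are illustrative assumptions.

```python
def count_matching_lines(path, word):
    """Single-threaded version: stream the file line by line and count
    the lines that contain `word` (the ~10-second approach)."""
    count = 0
    with open(path, "r", errors="replace") as f:
        for line in f:
            if word in line:
                count += 1
    return count


# A PySpark local-mode equivalent might look like this (requires pyspark;
# shown as a comment since it needs a Spark installation to run):
#
#   from pyspark import SparkContext
#   sc = SparkContext("local[8]", "filter-app")
#   count = sc.textFile(path).filter(lambda line: word in line).count()
#
# Even in local mode, Spark pays per-record overhead (task scheduling,
# shipping each line to Python worker processes and back), which can easily
# dominate a predicate as cheap as `word in line` and explain why 8 Spark
# threads lose to one plain-Python loop.
```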
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org