Hello,

I am trying to learn Spark and I am running into a performance problem I
can't explain. In short, I wrote two programs that loop through a 3.8 GB
file and filter each line depending on whether a certain word is
present.

I wrote a single-threaded Python program doing the job and obtained:
    - for the 3.8 GB file:
         lines found: 82100
         in: 10.54 seconds
    - no filter, just looping through the file:
         in: 1.65 seconds

The Spark app doing the same, executed with 8 threads, gives:
    - for the 3.8 GB file:
         lines found: 82100
         in: 18.27 seconds
    - for a 38 MB file:
         lines found: 821
         in: 2.53 seconds

I must be doing something wrong to get a result almost twice as slow on 8
threads as on 1 thread.

1. First, I thought it might be the setup cost of Spark. But for smaller
files it only takes about 2 seconds, which makes that explanation
unlikely.
2. Looping through the file takes 1.65 seconds (thank you, SSD ^_^ );
the filtering takes the other ~9 seconds (in the Python app).
-> This is why I thought splitting the work across processes would
definitely speed it up.

Note: increasing the number of threads in Spark does improve the speed
(from 57 seconds with 1 thread to 18 seconds with 8 threads). But there is
still a big performance gap between plain Python and Spark, so it must be
something I am doing!

Can someone point out what I am doing wrong? That would be greatly
appreciated :) I am new to all this big data stuff.



Here is the code for the Spark app:





And the python code:
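(Again lost in the archive; a single-threaded sketch doing the same filter, with the path and word as placeholders:)

```python
def count_matches(path, word):
    """Count the lines of the file at `path` that contain `word`."""
    count = 0
    with open(path) as f:
        for line in f:          # stream the file line by line
            if word in line:
                count += 1
    return count

# Usage (placeholder path and keyword):
#   print("lines found: %d" % count_matches("data.txt", "keyword"))
```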



Thank you for reading up to this point :)

Have a nice day!


- Julien





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-job-in-local-is-slower-than-regular-1-thread-Python-program-tp24771.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
