Hello,

I am trying to learn Spark and I am running into a performance problem I
can't explain. In short, I wrote two programs that loop through a 3.8 GB
file and filter each line depending on whether a certain word is
present.

I wrote a single-threaded Python program doing the job and obtained:
    - for the 3.8 GB file:
         lines found: 82100
         in: 10.54 seconds
    - no filter, just looping through the file:
         in: 1.65 seconds

The Spark app doing the same, executed with 8 threads, gives:
    - for the 3.8 GB file:
         lines found: 82100
         in: 18.27 seconds
    - for a 38 MB file:
         lines found: 821
         in: 2.53 seconds

I must be doing something wrong to get a result almost twice as slow on 8
threads as on 1 thread.

1. First, I thought it might be the setup cost of Spark. But for smaller
files it only takes about 2 seconds, which makes that explanation
unlikely.
2. Looping through the file takes 1.65 seconds (thank you, SSD ^_^ );
the filtering takes the other ~9 seconds (in the Python app).
-> This is why I thought splitting the work across processes would
definitely speed it up.

Note: increasing the number of threads in Spark does improve the speed
(from 57 seconds with 1 thread to 18 seconds with 8 threads). But there is
still a big performance gap between plain Python and Spark, so it must be
something I am doing!

Can someone point out what I am doing wrong? That would be greatly
appreciated :) I am new to all this big data stuff.



Here is the code for the Spark app:





And the python code:
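(Again lost in the archive; a single-threaded sketch doing the same filter, with the path and word as placeholders:)

```python
def count_matches(path, word):
    """Count the lines of the file at `path` that contain `word`."""
    count = 0
    with open(path) as f:
        for line in f:          # stream the file line by line
            if word in line:
                count += 1
    return count

# Usage (placeholder path and keyword):
#   print("lines found: %d" % count_matches("data.txt", "keyword"))
```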



Thank you for reading up to this point :)

Have a nice day!


- Julien





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-job-in-local-is-slower-than-regular-1-thread-Python-program-tp24771.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
