Maybe it's just my phone, but I don't see any code. On Sep 22, 2015 11:46 AM, "juljoin" <juliende...@hotmail.com> wrote:
> Hello,
>
> I am trying to figure Spark out and I still have some problems with its
> speed that I can't work out. In short, I wrote two programs that loop
> through a 3.8 GB file and filter each line depending on whether a certain
> word is present.
>
> I wrote a single-threaded Python program doing the job and I obtain:
> - for the 3.8 GB file:
>     lines found: 82100
>     in: *10.54 seconds*
> - no filter, just looping through the file:
>     in: 1.65 seconds
>
> The Spark app doing the same, executed on 8 threads, gives:
> - for the 3.8 GB file:
>     lines found: 82100
>     in: *18.27 seconds*
> - for a 38 MB file:
>     lines found: 821
>     in: 2.53 seconds
>
> I must be doing something wrong to obtain a result twice as slow on 8
> threads as on 1 thread.
>
> 1. First, I thought it might be because of Spark's set-up cost. But for
> smaller files it only takes 2 seconds, which makes this option improbable.
> 2. Looping through the file takes 1.65 seconds (thank you, SSD ^_^);
> processing takes up the other 9 seconds (for the Python app).
> -> This is why I thought splitting the work across processes would
> definitely speed it up.
>
> Note: Increasing the number of threads in Spark improves the speed (from
> 57 seconds with 1 thread to 18 seconds with 8 threads). But still, there
> is a big difference in performance between plain Python and Spark; it
> must be my doing!
>
> Can someone point out what I am doing wrong? That would be greatly
> appreciated :) I am new to all this big data stuff.
>
> *Here is the code for the Spark app:*
>
> *And the python code:*
>
> Thank you for reading up to this point :)
>
> Have a nice day!
>
> - Julien
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-job-in-local-is-slower-than-regular-1-thread-Python-program-tp24771.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
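For reference, since the poster's code did not come through, here is a minimal sketch of what the two programs described above might look like. This is not the original code; the function names, the word being searched for, and the file path are illustrative assumptions.

```python
def count_matching_lines(path, word):
    """Single-threaded version: stream the file line by line and count
    the lines that contain `word` (the ~10-second approach)."""
    count = 0
    with open(path, "r", errors="replace") as f:
        for line in f:
            if word in line:
                count += 1
    return count


# A PySpark local-mode equivalent might look like this (requires pyspark;
# shown as a comment since it needs a Spark installation to run):
#
#   from pyspark import SparkContext
#   sc = SparkContext("local[8]", "filter-app")
#   count = sc.textFile(path).filter(lambda line: word in line).count()
#
# Even in local mode, Spark pays per-record overhead (task scheduling,
# shipping each line to Python worker processes and back), which can easily
# dominate a predicate as cheap as `word in line` and explain why 8 Spark
# threads lose to one plain-Python loop.
```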
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org