It's entirely plausible to beat Spark in performance on tiny data sets like
this. That's not really what it has been optimized for.
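
For reference, here's a minimal sketch of the kind of job described below
(not the original poster's code; the file path, search word, and thread
count are placeholders I made up), once as plain single-threaded Python and
once as PySpark in local mode:

    import time

    from pyspark import SparkConf, SparkContext

    PATH, WORD = "data.txt", "spark"   # placeholder values

    # Plain Python baseline: one pass over the file, one thread.
    start = time.time()
    with open(PATH) as f:
        count = sum(1 for line in f if WORD in line)
    print("python: %d lines in %.2fs" % (count, time.time() - start))

    # PySpark equivalent: "local[8]" runs 8 worker threads in one JVM.
    conf = SparkConf().setMaster("local[8]").setAppName("word-filter")
    sc = SparkContext(conf=conf)
    start = time.time()
    count = sc.textFile(PATH).filter(lambda line: WORD in line).count()
    print("spark:  %d lines in %.2fs" % (count, time.time() - start))
    sc.stop()

Keep in mind that with the RDD API each line has to cross from the JVM to a
Python worker process so the lambda can run there; that per-record overhead
can easily dwarf a simple substring test on a local file, which is consistent
with the numbers below.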

On Tuesday, September 22, 2015, juljoin <juliende...@hotmail.com>
wrote:

> Hello,
>
> I am trying to figure Spark out, and there are some problems with its
> speed that I can't work out. In short, I wrote two programs that loop
> through a 3.8 GB file and filter each line depending on whether a certain
> word is present.
>
> I wrote a single-threaded Python program doing the job and I obtain:
>      - for the 3.8 GB file:
>          lines found: 82100
>          in: 10.54 seconds
>      - no filter, just looping through the file:
>          in: 1.65 seconds
>
> The Spark app doing the same thing, executed on 8 threads, gives:
>      - for the 3.8 GB file:
>          lines found: 82100
>          in: 18.27 seconds
>      - for a 38 MB file:
>          lines found: 821
>          in: 2.53 seconds
>
> I must be doing something wrong to get a result twice as slow on 8 threads
> as on 1 thread.
>
> 1. First, I thought it might be because of Spark's set-up cost. But for
> smaller files it only takes 2 seconds, which makes that explanation
> unlikely.
> 2. Looping through the file takes 1.65 seconds (thank you, SSD ^_^ );
> processing takes up the other 9 seconds (for the Python app).
> -> This is why I thought splitting the work across processes would
> definitely speed it up.
>
> Note: Increasing the number of threads in Spark improves the speed (from 57
> seconds with 1 thread to 18 seconds with 8 threads). But there is still a
> big difference in performance between plain Python and Spark; it must be
> something I am doing!
>
> Can someone point out what I am doing wrong? That would be greatly
> appreciated :) I am new to all this big data stuff.
>
>
>
> *Here is the code for the Spark app:*
>
>
>
>
>
> *And the python code:*
>
>
>
> Thank you for reading up to this point :)
>
> Have a nice day!
>
>
> - Julien
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-job-in-local-is-slower-than-regular-1-thread-Python-program-tp24771.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
