Re: Why does spark take so much time for simple task without calculation?
Hi xiefeng,

Even if your RDDs are tiny and reduced to one partition, there is always orchestration overhead (sending tasks to the executor(s), collecting results, etc.); these things are not free. If you need fast, [near] real-time processing, look towards Spark Streaming.

Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski

On Mon, Sep 5, 2016, at 04:36, xiefeng wrote:
> The spark context will be reused, so the spark context initialization won't
> affect the throughput test.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628p27657.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
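[A back-of-the-envelope sketch of the point above. The ~11 ms per-job figure is an assumption for illustration, not a measured value: if every action pays a fixed driver-side orchestration cost, that cost alone caps job throughput, no matter how many workers are attached.]

```java
public class OverheadCap {
    // If each Spark action carries a fixed driver-side cost (scheduling,
    // task serialization, result collection), that cost alone bounds the
    // achievable job throughput, independent of the number of workers.
    static double maxJobsPerSec(double fixedOverheadMs) {
        return 1000.0 / fixedOverheadMs;
    }

    public static void main(String[] args) {
        // An assumed ~11 ms of orchestration per job would explain a
        // ceiling of roughly 90 jobs/sec, close to the numbers reported
        // later in this thread.
        System.out.printf("cap: ~%.0f jobs/sec%n", maxJobsPerSec(11.0));
    }
}
```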
Re: Why does spark take so much time for simple task without calculation?
Hi,

I think you can check the Spark history server to figure out where the time was spent.

2016-09-05 10:36 GMT+08:00 xiefeng <fx...@statestreet.com>:
> The spark context will be reused, so the spark context initialization won't
> affect the throughput test.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628p27657.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
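[For readers who haven't set up the history server: a minimal sketch, assuming a standard Spark standalone install run from the Spark home directory. `spark.eventLog.enabled` and `spark.eventLog.dir` are the standard configuration keys; the `/tmp/spark-events` path is just an example.]

```shell
# In conf/spark-defaults.conf, enable event logging so finished
# applications appear in the history server UI:
#   spark.eventLog.enabled  true
#   spark.eventLog.dir      file:/tmp/spark-events

# Create the event log directory used above:
mkdir -p /tmp/spark-events

# Start the history server (web UI on port 18080 by default):
./sbin/start-history-server.sh
```

The per-job and per-stage timelines there break down how long scheduling, task deserialization, and execution each took.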
Re: Why does spark take so much time for simple task without calculation?
The spark context will be reused, so the spark context initialization won't affect the throughput test.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628p27657.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Why does spark take so much time for simple task without calculation?
My detailed test process:

1. In initialization, it creates 100 string RDDs and distributes them to the Spark workers:

    for (int i = 1; i <= numOfRDDs; i++) {
        JavaRDD<String> rddData = sc.parallelize(Arrays.asList(Integer.toString(i))).coalesce(1);
        rddData.cache().count();
        simpleRDDs.put(Integer.toString(i), rddData);
    }

2. In JMeter, configure 100 threads looping 100 times; each thread sends a GET request using its number as the RDD id.

3. This function simply returns the RDD's string; note that the map simpleRDDs is initialized up front with the 100 RDDs:

    public static String simpleRDDTest(String keyOfRDD) {
        JavaRDD<String> rddData = simpleRDDs.get(keyOfRDD);
        return rddData.first();
    }

4. I tested three cases with different numbers of workers. In each case I ran the test several times to get a stable throughput. The throughput in all three cases varies between 85-95/sec; there is no significant difference between worker counts.

5. I think this result means that even with no calculation, throughput is capped by Spark job initialization and dispatch, and adding more workers can't improve it. Can anyone explain this?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628p27656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
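[A toy model of the hypothesis in point 5, in plain Java so it runs without a cluster. The 10 ms per-job dispatch cost is a made-up stand-in, not a measured Spark figure: if the driver admits jobs one at a time and each pays a fixed dispatch cost, throughput is capped regardless of worker count, since workers only affect the (here, zero) compute part.]

```java
public class DispatchSim {
    // Toy model of a driver scheduling loop: jobs are dispatched serially,
    // each paying a fixed cost, simulated here by sleeping. With zero
    // compute per job, this serial cost is the throughput ceiling.
    static double jobsPerSec(int jobs, long dispatchMs) throws InterruptedException {
        long start = System.nanoTime();
        for (int i = 0; i < jobs; i++) {
            Thread.sleep(dispatchMs); // stand-in for scheduling/serialization work
        }
        double elapsedSec = (System.nanoTime() - start) / 1e9;
        return jobs / elapsedSec;
    }

    public static void main(String[] args) throws InterruptedException {
        // 20 jobs at 10 ms dispatch each: at most ~100 jobs/sec,
        // however many workers exist.
        System.out.printf("observed: ~%.0f jobs/sec%n", jobsPerSec(20, 10));
    }
}
```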
RE: Why does spark take so much time for simple task without calculation?
Hi Aliaksandr,

Thank you very much for your answer. In my test the Spark context is reused: it is initialized once when I start the application and is not initialized again during the throughput test. And when I increase the number of workers, the throughput doesn't increase.

I read the link you posted, but it only describes command-line tools being faster than a Hadoop cluster; I didn't find the key point that explains my question. If Spark context initialization doesn't affect my test case, what else could? Does job initialization or dispatch take time?

Thank you!

-Original Message-
From: Bedrytski Aliaksandr [mailto:sp...@bedryt.ski]
Sent: Wednesday, August 31, 2016 8:45 PM
To: Xie, Feng
Cc: user@spark.apache.org
Subject: Re: Why does spark take so much time for simple task without calculation?

Hi xiefeng,

Spark Context initialization takes some time and the tool does not really shine for small data computations:
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

But, when working with terabytes (petabytes) of data, those 35 seconds of initialization don't really matter.

Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski

On Wed, Aug 31, 2016, at 11:45, xiefeng wrote:
> I install a spark standalone and run the spark cluster (one master and one
> worker) on a Windows 2008 server with 16 cores and 24GB memory.
>
> I have done a simple test: just create a string RDD and simply return it.
> I use JMeter to test throughput, but the highest is around 35/sec. I think
> Spark is powerful at distributed calculation, but why is the throughput so
> limited in such a simple test scenario that only contains simple task
> dispatch and no calculation?
>
> 1. In JMeter I tested both 10 threads and 100 threads; there is little
> difference, around 2-3/sec.
> 2. I tested both caching and not caching the RDD; there is little difference.
> 3. During the test, the CPU and memory usage are low.
>
> Below is my test code:
>
> @RestController
> public class SimpleTest {
>     @RequestMapping(value = "/SimpleTest", method = RequestMethod.GET)
>     @ResponseBody
>     public String testProcessTransaction() {
>         return SparkShardTest.simpleRDDTest();
>     }
> }
>
> final static Map<String, JavaRDD<String>> simpleRDDs = initSimpleRDDs();
> public static Map<String, JavaRDD<String>> initSimpleRDDs()
> {
>     Map<String, JavaRDD<String>> result = new ConcurrentHashMap<String, JavaRDD<String>>();
>     JavaRDD<String> rddData = JavaSC.parallelize(data);
>     rddData.cache().count(); // this cache improves throughput by 1-2/sec
>     result.put("MyRDD", rddData);
>     return result;
> }
>
> public static String simpleRDDTest()
> {
>     JavaRDD<String> rddData = simpleRDDs.get("MyRDD");
>     return rddData.first();
> }
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Why does spark take so much time for simple task without calculation?
Hi xiefeng,

Spark Context initialization takes some time and the tool does not really shine for small data computations:
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

But, when working with terabytes (petabytes) of data, those 35 seconds of initialization don't really matter.

Regards,
--
Bedrytski Aliaksandr
sp...@bedryt.ski

On Wed, Aug 31, 2016, at 11:45, xiefeng wrote:
> I install a spark standalone and run the spark cluster (one master and one
> worker) on a Windows 2008 server with 16 cores and 24GB memory.
>
> I have done a simple test: just create a string RDD and simply return it.
> I use JMeter to test throughput, but the highest is around 35/sec. I think
> Spark is powerful at distributed calculation, but why is the throughput so
> limited in such a simple test scenario that only contains simple task
> dispatch and no calculation?
>
> 1. In JMeter I tested both 10 threads and 100 threads; there is little
> difference, around 2-3/sec.
> 2. I tested both caching and not caching the RDD; there is little difference.
> 3. During the test, the CPU and memory usage are low.
>
> Below is my test code:
>
> @RestController
> public class SimpleTest {
>     @RequestMapping(value = "/SimpleTest", method = RequestMethod.GET)
>     @ResponseBody
>     public String testProcessTransaction() {
>         return SparkShardTest.simpleRDDTest();
>     }
> }
>
> final static Map<String, JavaRDD<String>> simpleRDDs = initSimpleRDDs();
> public static Map<String, JavaRDD<String>> initSimpleRDDs()
> {
>     Map<String, JavaRDD<String>> result = new ConcurrentHashMap<String, JavaRDD<String>>();
>     JavaRDD<String> rddData = JavaSC.parallelize(data);
>     rddData.cache().count(); // this cache improves throughput by 1-2/sec
>     result.put("MyRDD", rddData);
>     return result;
> }
>
> public static String simpleRDDTest()
> {
>     JavaRDD<String> rddData = simpleRDDs.get("MyRDD");
>     return rddData.first();
> }
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Why does spark take so much time for simple task without calculation?
I installed Spark standalone and run the Spark cluster (one master and one worker) on a Windows 2008 server with 16 cores and 24GB memory.

I have done a simple test: just create a string RDD and simply return it. I use JMeter to test throughput, but the highest is around 35/sec. I think Spark is powerful at distributed calculation, but why is the throughput so limited in such a simple test scenario that only contains simple task dispatch and no calculation?

1. In JMeter I tested both 10 threads and 100 threads; there is little difference, around 2-3/sec.
2. I tested both caching and not caching the RDD; there is little difference.
3. During the test, the CPU and memory usage are low.

Below is my test code:

@RestController
public class SimpleTest {
    @RequestMapping(value = "/SimpleTest", method = RequestMethod.GET)
    @ResponseBody
    public String testProcessTransaction() {
        return SparkShardTest.simpleRDDTest();
    }
}

final static Map<String, JavaRDD<String>> simpleRDDs = initSimpleRDDs();

public static Map<String, JavaRDD<String>> initSimpleRDDs()
{
    Map<String, JavaRDD<String>> result = new ConcurrentHashMap<String, JavaRDD<String>>();
    JavaRDD<String> rddData = JavaSC.parallelize(data);
    rddData.cache().count(); // this cache improves throughput by 1-2/sec
    result.put("MyRDD", rddData);
    return result;
}

public static String simpleRDDTest()
{
    JavaRDD<String> rddData = simpleRDDs.get("MyRDD");
    return rddData.first();
}

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
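[One way to read observation 1 above, as an illustrative model rather than a measurement. JMeter drives a closed loop: each thread waits for its response before sending the next request, so delivered throughput is the smaller of the offered load (threads / latency) and the server's capacity. The ~90 jobs/sec capacity and 100 ms latency below are assumed numbers chosen to match the thread's figures.]

```java
public class ClosedLoopModel {
    // Closed-loop (JMeter-style) throughput: offered load is threads/latency,
    // but delivered throughput is capped by the server's job-dispatch capacity.
    // Once the cap is reached, adding client threads changes nothing.
    static double throughput(int threads, double latencyMs, double serverCapPerSec) {
        double offered = threads * 1000.0 / latencyMs;
        return Math.min(offered, serverCapPerSec);
    }

    public static void main(String[] args) {
        // With an assumed ~90 jobs/sec dispatch cap, 10 and 100 JMeter
        // threads deliver the same throughput, matching the "little
        // difference" between 10 and 100 threads reported above.
        System.out.println(throughput(10, 100.0, 90.0));  // 90.0
        System.out.println(throughput(100, 100.0, 90.0)); // 90.0
    }
}
```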