Re: Why does spark take so much time for simple task without calculation?

2016-09-09 Thread Bedrytski Aliaksandr
Hi xiefeng,

Even if your RDDs are tiny and reduced to one partition, there is always
orchestration overhead (sending tasks to executors, collecting results,
etc.); these things are not free.

If you need fast, [near] real-time processing, look towards
spark-streaming.

Regards,
-- 
  Bedrytski Aliaksandr
  sp...@bedryt.ski

On Mon, Sep 5, 2016, at 04:36, xiefeng wrote:
> The spark context will be reused, so the spark context initialization
> won't
> affect the throughput test.
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread 刘虓
Hi,
I think you can refer to the Spark History Server to figure out where the
time was spent.

2016-09-05 10:36 GMT+08:00 xiefeng <fx...@statestreet.com>:

> The spark context will be reused, so the spark context initialization won't
> affect the throughput test.
>


Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread xiefeng
The spark context will be reused, so the spark context initialization won't
affect the throughput test.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628p27657.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread xiefeng
My detailed test process:
1. In initialization, it creates 100 string RDDs and distributes them
across the Spark workers:

    for (int i = 1; i <= numOfRDDs; i++) {
        JavaRDD<String> rddData =
            sc.parallelize(Arrays.asList(Integer.toString(i))).coalesce(1);
        rddData.cache().count();
        simpleRDDs.put(Integer.toString(i), rddData);
    }
2. In JMeter, configure 100 threads looping 100 times; each thread sends
a GET request using its thread number as the RDD id.

3. This function simply returns the RDD's string; note that the map
simpleRDDs is initialized up front with the 100 RDDs:

    public static String simpleRDDTest(String keyOfRDD) {
        JavaRDD<String> rddData = simpleRDDs.get(keyOfRDD);
        return rddData.first();
    }
 
4. Test three cases with different numbers of workers. I ran the test
several times to get a stable throughput; in all three cases it varies
between 85-95/sec, with no significant difference between worker counts.

5. I think this means that even with no calculation, throughput is limited
by Spark job initialization and dispatch, and adding more workers can't
improve it. Can anyone explain this?
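The observed ceiling is consistent with a fixed, serialized per-job cost on the driver: if every job pays roughly 11 ms of scheduling/dispatch time that cannot be parallelized, throughput tops out near 1/overhead no matter how many workers exist. A rough model (plain Python, not Spark; the 11 ms figure is an assumed illustration, not a measured Spark constant):

```python
# Rough model: each Spark job pays a fixed, serialized driver-side
# scheduling cost; workers only parallelize the compute portion.
def max_throughput(overhead_s: float, compute_s: float, workers: int) -> float:
    """Jobs/sec ceiling when overhead_s is serialized in the driver."""
    # The driver can start at most 1/overhead_s jobs per second;
    # the workers bound the compute side at workers/compute_s.
    driver_cap = 1.0 / overhead_s
    worker_cap = workers / compute_s if compute_s > 0 else float("inf")
    return min(driver_cap, worker_cap)

# With ~11 ms of scheduling overhead and near-zero compute, the ceiling
# stays around 90 jobs/sec whether there is 1 worker or 8.
for w in (1, 2, 8):
    print(w, round(max_throughput(0.011, 0.0, w), 1))
```

Under this assumption, 1/0.011 s ≈ 91 jobs/sec, which matches the measured 85-95/sec regardless of worker count.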







RE: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread Xie, Feng
Hi Aliaksandr,

Thank you very much for your answer.
In my test the Spark context is reused: it is initialized once when I
start the application and is not re-initialized during the throughput
test. Yet when I increase the number of workers, the throughput doesn't
increase.
I read the link you posted, but it only describes command-line tools being
faster than a Hadoop cluster; I didn't find the key point that explains my
question.
If Spark context initialization doesn't affect my test case, what else
could? Does job initialization or dispatch take time? Thank you!
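One way to answer this empirically is to run jobs of increasing input size and fit time(n) ≈ a + b·n; the intercept a estimates the fixed per-job (initialization/dispatch) cost. A sketch of the fitting step in plain Python, using synthetic timings in place of real Spark measurements (the 11 ms / 2 µs numbers are assumptions for illustration):

```python
# Estimate the fixed per-job overhead as the intercept of a linear fit
# time(n) = a + b*n over jobs of increasing input size n.
def fit_overhead(sizes, times):
    """Ordinary least-squares line fit; returns (intercept_a, slope_b)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times)) \
        / sum((x - mean_x) ** 2 for x in sizes)
    a = mean_y - b * mean_x
    return a, b

# Synthetic timings: 11 ms fixed overhead + 2 us per record (assumed).
sizes = [1, 10, 100, 1000, 10000]
times = [0.011 + 2e-6 * s for s in sizes]
a, b = fit_overhead(sizes, times)
print(f"estimated fixed overhead ~ {a * 1000:.1f} ms")
```

In a real test the timings would come from wrapping each Spark action (e.g. count()) with wall-clock measurements; a stubbornly non-zero intercept is the dispatch cost the question asks about.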





Re: Why does spark take so much time for simple task without calculation?

2016-08-31 Thread Bedrytski Aliaksandr
Hi xiefeng,

Spark Context initialization takes some time and the tool does not
really shine for small data computations:
http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

But, when working with terabytes (petabytes) of data, those 35 seconds
of initialization don't really matter. 
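To make the amortization concrete (plain Python; the job durations below are assumed for illustration, only the 35-second startup figure comes from the message above):

```python
# Fixed startup cost amortized over job duration: for long batch jobs
# the startup fraction of total runtime becomes negligible.
startup_s = 35.0  # one-time SparkContext initialization, as cited above
for job_s in (1, 60, 3600, 6 * 3600):  # assumed job durations
    frac = startup_s / (startup_s + job_s)
    print(f"{job_s:>6}s job -> startup is {frac:.1%} of total runtime")
```

For a one-second request the startup dominates; for an hour-long batch job it is under one percent, which is the regime Spark is built for.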

Regards,

-- 
  Bedrytski Aliaksandr
  sp...@bedryt.ski




Why does spark take so much time for simple task without calculation?

2016-08-31 Thread xiefeng
I installed Spark standalone and run the cluster (one master and one
worker) on a Windows 2008 server with 16 cores and 24 GB of memory.

I have done a simple test: just create a string RDD and return it. I use
JMeter to test throughput, but the highest I get is around 35/sec. I think
Spark is powerful at distributed calculation, so why is the throughput so
limited in such a simple scenario, which contains only task dispatch and
no calculation?

1. In JMeter I tested both 10 threads and 100 threads; there is little
difference, around 2-3/sec.
2. I tested both caching and not caching the RDD; there is little
difference.
3. During the test, CPU and memory usage stay low.

Below is my test code:

@RestController
public class SimpleTest {
    @RequestMapping(value = "/SimpleTest", method = RequestMethod.GET)
    @ResponseBody
    public String testProcessTransaction() {
        return SparkShardTest.simpleRDDTest();
    }
}

final static Map<String, JavaRDD<String>> simpleRDDs = initSimpleRDDs();

public static Map<String, JavaRDD<String>> initSimpleRDDs() {
    Map<String, JavaRDD<String>> result =
        new ConcurrentHashMap<String, JavaRDD<String>>();
    JavaRDD<String> rddData = JavaSC.parallelize(data);
    rddData.cache().count(); // this cache improves throughput by 1-2/sec
    result.put("MyRDD", rddData);
    return result;
}

public static String simpleRDDTest() {
    JavaRDD<String> rddData = simpleRDDs.get("MyRDD");
    return rddData.first();
}




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
