Re: When will Spark Streaming support the Kafka simple consumer API?

2015-02-05 Thread Xuelin Cao
https://issues.apache.org/jira/browse/SPARK-4964 Can you elaborate on why you have to use SimpleConsumer in your environment? TD On Wed, Feb 4, 2015 at 7:44 PM, Xuelin Cao wrote: Hi, In our environment, Kafka can only

When will Spark Streaming support the Kafka simple consumer API?

2015-02-04 Thread Xuelin Cao
Hi, In our environment, Kafka can only be used with the simple consumer API, as the Storm spout does. Also, I found there are suggestions that the Kafka connector of Spark should not be used in production http://markmail.org/message/2lb776ta5sq6lgtw because it is based on the high-level
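For context, the main burden the SimpleConsumer API shifts onto the application is offset management: the app, not the broker, tracks and commits its read position. A minimal sketch of that contract, with an in-memory list standing in for a partition log (all names are illustrative, not real Kafka API calls):

```python
# Toy model of manual offset management, the core obligation of Kafka's
# SimpleConsumer API. The "log" is just a Python list; nothing here talks
# to a real broker.
class ManualOffsetConsumer:
    def __init__(self, log):
        self.log = log      # stand-in for one partition's message log
        self.offset = 0     # application-managed read position

    def poll(self, max_messages):
        """Fetch up to max_messages starting at the tracked offset."""
        return self.log[self.offset:self.offset + max_messages]

    def commit(self, n):
        """Advance the offset only after n messages are safely processed."""
        self.offset += n
```

If the process crashes before `commit`, the next `poll` re-reads the same messages, which is exactly the at-least-once bookkeeping a SimpleConsumer-based receiver has to implement itself.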

Can Spark provide an option to start the reduce stage early?

2015-02-02 Thread Xuelin Cao
In Hadoop MR, there is an option *mapred.reduce.slowstart.completed.maps* which can be used to start the reduce stage when X% of the mappers have completed. By doing this, the data shuffling process can run in parallel with the map process. In a large multi-tenancy cluster, this option is usually tuned
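For reference, this is how the Hadoop MR knob is set in `mapred-site.xml` (in MRv2 the property was renamed `mapreduce.job.reduce.slowstart.completedmaps`); the value is the fraction of completed maps, not a percentage:

```xml
<!-- mapred-site.xml: launch reducers once 5% of map tasks have finished,
     so shuffle overlaps with the remaining maps -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.05</value>
</property>
```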

Will Spark SQL support a vectorized query engine someday?

2015-01-19 Thread Xuelin Cao
Hi, Correct me if I'm wrong. It looks like the current version of Spark SQL is a *tuple-at-a-time* engine. Basically, each time the physical operator produces a tuple by recursively calling child.execute(). There are papers that illustrate the benefits of a vectorized query engine. And
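The distinction being asked about can be sketched in a few lines. This is not Spark SQL's internals, just the evaluation pattern: tuple-at-a-time pays per-row dispatch cost (a virtual `next()` call per tuple in a real engine), while vectorized evaluation hands fixed-size batches of column values between operators and runs a tight loop inside each batch:

```python
# Illustrative only: same computation (double every value and sum),
# evaluated tuple-at-a-time vs. in batches.
def tuple_at_a_time(column):
    total = 0
    for value in column:        # one per-tuple call in a real engine
        total += value * 2
    return total

def vectorized(column, batch_size=1024):
    total = 0
    for start in range(0, len(column), batch_size):
        batch = column[start:start + batch_size]    # one call per batch
        total += sum(v * 2 for v in batch)          # tight inner loop
    return total
```

The results are identical; the win in a real engine comes from amortizing interpretation overhead over the batch and enabling cache-friendly, SIMD-able inner loops.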

When will Spark SQL support building DB indexes natively?

2014-12-17 Thread Xuelin Cao
Hi, In the Spark SQL help document, it says: Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL. - Block level bitmap indexes and virtual columns (used to build indexes). For our
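To make the quoted feature concrete: a block-level bitmap index records, per distinct value, which blocks may contain it, so a scan for one value can skip every block whose bit is unset. A minimal sketch of the idea (purely illustrative, not Spark SQL code):

```python
# Toy block-level bitmap index: map each value to the set of block ids
# that contain it, then scan only those blocks.
from collections import defaultdict

def build_block_bitmap(data, block_size):
    bitmaps = defaultdict(set)          # value -> block ids containing it
    for i, value in enumerate(data):
        bitmaps[value].add(i // block_size)
    return bitmaps

def scan_for(data, block_size, bitmaps, target):
    hits = []
    for block in sorted(bitmaps.get(target, ())):   # skip absent blocks
        start = block * block_size
        for i, v in enumerate(data[start:start + block_size], start):
            if v == target:
                hits.append(i)
    return hits
```

The payoff grows with selectivity: if a value appears in few blocks, most of the table is never touched, which is the benefit an in-memory columnar scan only partially replicates.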

Why does Executor Deserialize Time take more than 300ms?

2014-11-22 Thread Xuelin Cao
In our experimental cluster (1 driver, 5 workers), we tried the simplest example: sc.parallelize(Range(0, 100), 2).count. In the event log, we found the executor takes too much time on deserialization, about 300~500ms, while the execution time is only 1ms. Our servers have a 2.3 GHz CPU
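For anyone profiling this locally: "Executor Deserialize Time" covers decoding the task binary (closure plus partition info) on the executor. A rough way to see what raw deserialization of a small object should cost is a serialize/deserialize round trip; the sketch below uses Python's pickle as a stand-in (Spark itself uses Java serialization for task binaries), so the absolute numbers are only indicative:

```python
# Measure just the deserialization leg of a round trip for a small object,
# as a crude baseline for what decoding a tiny task payload costs.
import pickle
import time

def round_trip_nanos(obj):
    blob = pickle.dumps(obj)
    t0 = time.perf_counter_ns()
    result = pickle.loads(blob)
    elapsed = time.perf_counter_ns() - t0
    return result, elapsed
```

On any modern machine this is microseconds for a payload the size of a trivial task, which is why hundreds of milliseconds of reported deserialize time usually point at fixed per-task or per-executor overhead (classloading, JIT warm-up) rather than the bytes themselves.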

Re: Why does Executor Deserialize Time take more than 300ms?

2014-11-22 Thread Xuelin Cao
Thanks Imran. The problem is, *every time* I run the same task, the deserialization time is around 300~500ms. I don't know if this is a normal case.

Why does Executor Deserialize Time take more than 300ms?

2014-11-21 Thread Xuelin Cao
In our experimental cluster (1 driver, 5 workers), we tried the simplest example: sc.parallelize(Range(0, 100), 2).count. In the event log, we found the executor takes too much time on deserialization, about 300~500ms, while the execution time is only 1ms. Our servers have a 2.3 GHz CPU * 24