Re: Hbase in spark

2016-02-26 Thread Ted Malaska
Yes, and I have used HBASE-15271 and successfully loaded over 20 billion
records into HBase, even with node failures.
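
Below is a rough sketch of the long-standing MR-style bulk load route from
Spark (shuffle/sort the cells, write HFiles with HFileOutputFormat2, then hand
them to the region servers). It is only an illustration, not the HBASE-15271
code path; the table name, column family, and paths are placeholders.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("BulkLoadSketch"))
val hbaseConf = HBaseConfiguration.create()
val tableName = TableName.valueOf("my_table")   // placeholder table name
val family = Bytes.toBytes("cf")                // placeholder column family

val conn = ConnectionFactory.createConnection(hbaseConf)
val table = conn.getTable(tableName)
val regionLocator = conn.getRegionLocator(tableName)

// Build (rowKey, KeyValue) pairs and sort by row key; HFiles must be written
// in sorted order.
val cells = sc.textFile("hdfs:///input/rows.csv")   // placeholder input: "rowKey,value"
  .map(_.split(","))
  .map { case Array(rowKey, value) =>
    (new ImmutableBytesWritable(Bytes.toBytes(rowKey)),
     new KeyValue(Bytes.toBytes(rowKey), family, Bytes.toBytes("q"), Bytes.toBytes(value)))
  }
  .sortByKey()

// Configure the HFile writer against the table's current region boundaries.
val job = Job.getInstance(hbaseConf)
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

cells.saveAsNewAPIHadoopFile(
  "/tmp/hfiles",                                  // placeholder staging directory
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat2],
  job.getConfiguration)

// Move the generated HFiles into the regions.
new LoadIncrementalHFiles(hbaseConf)
  .doBulkLoad(new Path("/tmp/hfiles"), conn.getAdmin, table, regionLocator)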

On Fri, Feb 26, 2016 at 11:55 AM, Ted Yu  wrote:

> In HBase, there is the hbase-spark module, which supports bulk load.
> This module is to be backported in the upcoming 1.3.0 release.
>
> There is some pending work, such as HBASE-15271 .
>
> FYI
>
> On Fri, Feb 26, 2016 at 8:50 AM, Renu Yadav  wrote:
>
>> Has anybody implemented bulk load into hbase using spark?
>>
>> I need help to optimize its performance.
>>
>> Please help.
>>
>>
>> Thanks & Regards,
>> Renu Yadav
>>
>
>


Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
I've always used HBaseTestingUtility and never really had much trouble. I
use that for all my unit testing between Spark and HBase.
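
For reference, a minimal sketch of that pattern (the table and column family
names here are placeholders):

import org.apache.hadoop.hbase.HBaseTestingUtility
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

// Spin up an in-process HBase (and ZooKeeper) mini cluster for the test.
val utility = new HBaseTestingUtility()
utility.startMiniCluster()
utility.createTable(Bytes.toBytes("t1"), Bytes.toBytes("cf"))

// The important part: take the HBase configuration from the utility, so that
// anything Spark runs talks to the mini cluster's randomly assigned ZooKeeper
// port rather than the default localhost:2181.
val hbaseConf = utility.getConfiguration

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
try {
  // ... run Spark code against hbaseConf here ...
} finally {
  sc.stop()
  utility.shutdownMiniCluster()
}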

Here are some code examples if you're interested:

--Main HBase-Spark Module
https://github.com/apache/hbase/tree/master/hbase-spark

--Unit tests that cover all the basic connections
https://github.com/apache/hbase/blob/master/hbase-spark/src/test/scala/org/apache/hadoop/hbase/spark/HBaseContextSuite.scala

--If you want to look at the old stuff before it went into HBase
https://github.com/cloudera-labs/SparkOnHBase

Let me know if that helps

On Wed, Aug 26, 2015 at 5:40 AM, Ted Yu yuzhih...@gmail.com wrote:

 Can you log the contents of the Configuration you pass from Spark ?
 The output would give you some clue.
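
 A minimal sketch of such a dump (the two ZooKeeper keys are usually the
 interesting ones when you see a connection refused on localhost:2181):

 import scala.collection.JavaConverters._
 import org.apache.hadoop.conf.Configuration

 // Print every entry of the Configuration before handing it to Spark; check
 // hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort in the output.
 def dumpConf(conf: Configuration): Unit =
   conf.iterator().asScala.foreach(e => println(s"${e.getKey} = ${e.getValue}"))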

 Cheers



 On Aug 26, 2015, at 2:30 AM, Furkan KAMACI furkankam...@gmail.com wrote:

 Hi Ted,

 I'll check the ZooKeeper connection, but another test method which runs on
 HBase without Spark works without any error. The HBase version is
 0.98.8-hadoop2 and I use Spark 1.3.1.

 Kind Regards,
 Furkan KAMACI
 On 26 Aug 2015 12:08, Ted Yu yuzhih...@gmail.com wrote:

 The connection failure was to zookeeper.

 Have you verified that localhost:2181 can serve requests ?
 What version of hbase was Gora built against ?

 Cheers



 On Aug 26, 2015, at 1:50 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:

 Hi,

 I start an Hbase cluster for my test class. I use that helper class:


 https://github.com/apache/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/util/HBaseClusterSingleton.java

 and use it like this:

 private static final HBaseClusterSingleton cluster =
 HBaseClusterSingleton.build(1);

 I retrieve configuration object as follows:

 cluster.getConf()

 and I use it at Spark as follows:

 sparkContext.newAPIHadoopRDD(conf, MyInputFormat.class, clazzK,
 clazzV);

 When I run my test there is no need to start up an HBase cluster, because
 Spark will connect to my dummy cluster. However, when I run my test method
 it throws an error:

 2015-08-26 01:19:59,558 INFO [Executor task launch
 worker-0-SendThread(localhost:2181)] zookeeper.ClientCnxn
 (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to
 server localhost/127.0.0.1:2181. Will not attempt to authenticate using
 SASL (unknown error)

 2015-08-26 01:19:59,559 WARN [Executor task launch
 worker-0-SendThread(localhost:2181)] zookeeper.ClientCnxn
 (ClientCnxn.java:run(1089)) - Session 0x0 for server null, unexpected
 error, closing socket connection and attempting reconnect
 java.net.ConnectException: Connection refused at
 sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at
 org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
 HBase tests which do not run on Spark work well. When I check the logs
 I see that the cluster and Spark are started up correctly:

 2015-08-26 01:35:21,791 INFO [main] hdfs.MiniDFSCluster
 (MiniDFSCluster.java:waitActive(2055)) - Cluster is active

 2015-08-26 01:35:40,334 INFO [main] util.Utils
 (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on
 port 56941.
 I realized that when I start up HBase from the command line, my test method
 for Spark connects to it!

 So, does it mean that it doesn't care about the conf I passed to it? Any
 ideas about how to solve it?




Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
Where is the input format class? Whenever I use the search on your
GitHub it says "We couldn't find any issues matching 'GoraInputFormat'".



On Wed, Aug 26, 2015 at 9:48 AM, Furkan KAMACI furkankam...@gmail.com
wrote:

 Hi,

 Here is the MapReduceTestUtils.testSparkWordCount()


 https://github.com/kamaci/gora/blob/master/gora-core/src/test/java/org/apache/gora/mapreduce/MapReduceTestUtils.java#L108

 Here is SparkWordCount


 https://github.com/kamaci/gora/blob/8f1acc6d4ef6c192e8fc06287558b7bc7c39b040/gora-core/src/examples/java/org/apache/gora/examples/spark/SparkWordCount.java

 Lastly, here is GoraSparkEngine:


 https://github.com/kamaci/gora/blob/master/gora-core/src/main/java/org/apache/gora/spark/GoraSparkEngine.java

 Kind Regards,
 Furkan KAMACI

 On Wed, Aug 26, 2015 at 4:40 PM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 Where can I find the code for MapReduceTestUtils.testSparkWordCount?

 On Wed, Aug 26, 2015 at 9:29 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:

 Hi,

 Here is the test method I've ignored due to the Connection Refused
 failure:


 https://github.com/kamaci/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/mapreduce/TestHBaseStoreWordCount.java#L65

 I've implemented a Spark backend for Apache Gora as a GSoC project, and
 this is the latest obstacle that I need to solve. If you can help me, you
 are welcome.

 Kind Regards,
 Furkan KAMACI

 On Wed, Aug 26, 2015 at 3:45 PM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 I've always used HBaseTestingUtility and never really had much
 trouble. I use that for all my unit testing between Spark and HBase.

 Here are some code examples if you're interested:

 --Main HBase-Spark Module
 https://github.com/apache/hbase/tree/master/hbase-spark

 --Unit tests that cover all the basic connections

 https://github.com/apache/hbase/blob/master/hbase-spark/src/test/scala/org/apache/hadoop/hbase/spark/HBaseContextSuite.scala

 --If you want to look at the old stuff before it went into HBase
 https://github.com/cloudera-labs/SparkOnHBase

 Let me know if that helps

 On Wed, Aug 26, 2015 at 5:40 AM, Ted Yu yuzhih...@gmail.com wrote:

 Can you log the contents of the Configuration you pass from Spark ?
 The output would give you some clue.

 Cheers



 On Aug 26, 2015, at 2:30 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:

 Hi Ted,

 I'll check the ZooKeeper connection, but another test method which runs on
 HBase without Spark works without any error. The HBase version is
 0.98.8-hadoop2 and I use Spark 1.3.1.

 Kind Regards,
 Furkan KAMACI
 On 26 Aug 2015 12:08, Ted Yu yuzhih...@gmail.com wrote:

 The connection failure was to zookeeper.

 Have you verified that localhost:2181 can serve requests ?
 What version of hbase was Gora built against ?

 Cheers



 On Aug 26, 2015, at 1:50 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:

 Hi,

 I start an Hbase cluster for my test class. I use that helper class:


 https://github.com/apache/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/util/HBaseClusterSingleton.java

 and use it like this:

 private static final HBaseClusterSingleton cluster =
 HBaseClusterSingleton.build(1);

 I retrieve configuration object as follows:

 cluster.getConf()

 and I use it at Spark as follows:

 sparkContext.newAPIHadoopRDD(conf, MyInputFormat.class, clazzK,
 clazzV);

 When I run my test there is no need to start up an HBase cluster,
 because Spark will connect to my dummy cluster. However, when I run my
 test method it throws an error:

 2015-08-26 01:19:59,558 INFO [Executor task launch
 worker-0-SendThread(localhost:2181)] zookeeper.ClientCnxn
 (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to
 server localhost/127.0.0.1:2181. Will not attempt to authenticate
 using SASL (unknown error)

 2015-08-26 01:19:59,559 WARN [Executor task launch
 worker-0-SendThread(localhost:2181)] zookeeper.ClientCnxn
 (ClientCnxn.java:run(1089)) - Session 0x0 for server null, unexpected
 error, closing socket connection and attempting reconnect
 java.net.ConnectException: Connection refused at
 sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at
 org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
 HBase tests which do not run on Spark work well. When I check the
 logs I see that the cluster and Spark are started up correctly:

 2015-08-26 01:35:21,791 INFO [main] hdfs.MiniDFSCluster
 (MiniDFSCluster.java:waitActive(2055)) - Cluster is active

 2015-08-26 01:35:40,334 INFO [main] util.Utils
 (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' 
 on
 port 56941.
 I realized that when I start up HBase from the command line, my test
 method for Spark connects to it!

 So, does it mean that it doesn't care about the conf I passed to it?
 Any ideas about how to solve it?








Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
Where can I find the code for MapReduceTestUtils.testSparkWordCount?

On Wed, Aug 26, 2015 at 9:29 AM, Furkan KAMACI furkankam...@gmail.com
wrote:

 Hi,

 Here is the test method I've ignored due to the Connection Refused
 failure:


 https://github.com/kamaci/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/mapreduce/TestHBaseStoreWordCount.java#L65

 I've implemented a Spark backend for Apache Gora as a GSoC project, and this
 is the latest obstacle that I need to solve. If you can help me, you are
 welcome.

 Kind Regards,
 Furkan KAMACI

 On Wed, Aug 26, 2015 at 3:45 PM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 I've always used HBaseTestingUtility and never really had much trouble.
 I use that for all my unit testing between Spark and HBase.

 Here are some code examples if you're interested:

 --Main HBase-Spark Module
 https://github.com/apache/hbase/tree/master/hbase-spark

 --Unit tests that cover all the basic connections

 https://github.com/apache/hbase/blob/master/hbase-spark/src/test/scala/org/apache/hadoop/hbase/spark/HBaseContextSuite.scala

 --If you want to look at the old stuff before it went into HBase
 https://github.com/cloudera-labs/SparkOnHBase

 Let me know if that helps

 On Wed, Aug 26, 2015 at 5:40 AM, Ted Yu yuzhih...@gmail.com wrote:

 Can you log the contents of the Configuration you pass from Spark ?
 The output would give you some clue.

 Cheers



 On Aug 26, 2015, at 2:30 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:

 Hi Ted,

 I'll check the ZooKeeper connection, but another test method which runs on
 HBase without Spark works without any error. The HBase version is
 0.98.8-hadoop2 and I use Spark 1.3.1.

 Kind Regards,
 Furkan KAMACI
 On 26 Aug 2015 12:08, Ted Yu yuzhih...@gmail.com wrote:

 The connection failure was to zookeeper.

 Have you verified that localhost:2181 can serve requests ?
 What version of hbase was Gora built against ?

 Cheers



 On Aug 26, 2015, at 1:50 AM, Furkan KAMACI furkankam...@gmail.com
 wrote:

 Hi,

 I start an Hbase cluster for my test class. I use that helper class:


 https://github.com/apache/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/util/HBaseClusterSingleton.java

 and use it like this:

 private static final HBaseClusterSingleton cluster =
 HBaseClusterSingleton.build(1);

 I retrieve configuration object as follows:

 cluster.getConf()

 and I use it at Spark as follows:

 sparkContext.newAPIHadoopRDD(conf, MyInputFormat.class, clazzK,
 clazzV);

 When I run my test there is no need to start up an HBase cluster, because
 Spark will connect to my dummy cluster. However, when I run my test method
 it throws an error:

 2015-08-26 01:19:59,558 INFO [Executor task launch
 worker-0-SendThread(localhost:2181)] zookeeper.ClientCnxn
 (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to
 server localhost/127.0.0.1:2181. Will not attempt to authenticate
 using SASL (unknown error)

 2015-08-26 01:19:59,559 WARN [Executor task launch
 worker-0-SendThread(localhost:2181)] zookeeper.ClientCnxn
 (ClientCnxn.java:run(1089)) - Session 0x0 for server null, unexpected
 error, closing socket connection and attempting reconnect
 java.net.ConnectException: Connection refused at
 sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at
 org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
 at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
 HBase tests which do not run on Spark work well. When I check the
 logs I see that the cluster and Spark are started up correctly:

 2015-08-26 01:35:21,791 INFO [main] hdfs.MiniDFSCluster
 (MiniDFSCluster.java:waitActive(2055)) - Cluster is active

 2015-08-26 01:35:40,334 INFO [main] util.Utils
 (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on
 port 56941.
 I realized that when I start up HBase from the command line, my test
 method for Spark connects to it!

 So, does it mean that it doesn't care about the conf I passed to it?
 Any ideas about how to solve it?






Re: Re: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-13 Thread Ted Malaska
Cool, seems like the designs are very close.

Here is my latest blog post on my work with HBase and Spark.  Let me know if you
have any questions.  There should be two more posts next month, talking
about bulk load through Spark (HBASE-14150), which is committed, and Spark SQL
(HBASE-14181), which should be done next week.

http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

On Wed, Aug 12, 2015 at 12:18 AM, Yan Zhou.sc yan.zhou...@huawei.com
wrote:

 We are using MR-based bulk loading on Spark.



 For filter pushdown, Astro does partition pruning, scan range pruning, and
 uses Gets as much as possible.



 Thanks,





 *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
 *Sent:* August 12, 2015 9:14
 *To:* Yan Zhou.sc
 *Cc:* dev@spark.apache.org; Bing Xiao (Bing); Ted Yu; user
 *Subject:* RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro



 There are a number of ways to bulk load.

 There is bulk put, partition bulk put, MR bulk load, and now HBASE-14150,
 which is Spark shuffle bulk load.

 Let me know if I have missed a bulk loading option.  All of these are possible
 with the new hbase-spark module.

 As for the filter push down discussion in the past email: you will note in
 14181 that the filter push down will also limit the scan range or drop the
 scan altogether for Gets.

 Ted Malaska

 On Aug 11, 2015 9:06 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 No, the Astro bulk loader does not use its own shuffle. But the map/reduce-side
 processing is somewhat different from HBase's bulk loader that is used by
 many HBase apps, I believe.



 *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
 *Sent:* Wednesday, August 12, 2015 8:56 AM
 *To:* Yan Zhou.sc
 *Cc:* dev@spark.apache.org; Ted Yu; Bing Xiao (Bing); user
 *Subject:* RE: Re: Re: Package Release Announcement: Spark SQL on HBase
 Astro



 The bulk load code is 14150 if you are interested.  Let me know how it can be
 made faster.

 It's just a Spark shuffle and writing HFiles.  Unless Astro wrote its
 own shuffle, the times should be very close.

 On Aug 11, 2015 8:49 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Ted,



 Thanks for pointing out more details of HBase-14181. I am afraid I may
 still need to learn more before I can make very accurate and pointed
 comments.



 As for filter push down, Astro has a powerful approach to break down
 arbitrarily complex logic expressions comprising AND/OR/IN/NOT to generate
 partition-specific predicates to be pushed down to HBase. This may not be a
 significant performance improvement if the filter logic is simple and/or the
 processing is IO-bound, but could be so for online ad-hoc analysis.



 For UDFs, Astro supports them both inside and outside of the HBase custom filter.



 For secondary indexes, Astro does not support them now. With the probable
 support by HBase in the future (thanks to Ted Yu's comments a while ago), we
 could add this support along with its specific optimizations.



 For bulk load, Astro has a much faster way to load the tabular data, we
 believe.



 Right now, Astro’s filter pushdown is through HBase built-in filters and
 custom filter.



 As for HBase-14181, I see some overlaps with Astro. Both have dependencies
 on Spark SQL, both support Spark DataFrame as an access interface, and both
 support predicate pushdown.

 Astro is not designed for MR (or Spark’s equivalent) access though.



 If HBase-14181 is shooting for access to HBase data through a subset of
 DataFrame functionalities like filter, projection, and other map-side ops,
 would it be feasible to decouple it from Spark?

 My understanding is that 14181 does not run the Spark execution engine at all,
 but will make use of Spark DataFrame semantics and/or logical planning to pass
 a logical (sub-)plan to HBase. If true, it might be desirable to directly
 support DataFrame in HBase.



 Thanks,





 *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
 *Sent:* Wednesday, August 12, 2015 7:28 AM
 *To:* Yan Zhou.sc
 *Cc:* user; dev@spark.apache.org; Bing Xiao (Bing); Ted Yu
 *Subject:* RE: Re: Re: Package Release Announcement: Spark SQL on HBase
 Astro



 Hey Yan,

 I've been the one building out this spark functionality in hbase so maybe
 I can help clarify.

 The hbase-spark module is just focused on making spark integration with
 hbase easy and out of the box for both spark and spark streaming.

 I, and I believe the HBase team, have no desire to build a SQL engine in
 HBase.  This JIRA comes the closest to that line.  The main thing here is
 filter push down logic for basic SQL operations like =, <, and >.
 User-defined functions and secondary indexes are not in my scope.

 Another main goal of the hbase-spark module is to allow a user to
 do anything they did with MR/HBase now with Spark/HBase.  Things like bulk
 load.

 Let me know if you have any questions.

 Ted Malaska

 On Aug 11, 2015 7:13 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 We have not “formally” published any numbers yet. A good reference

RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Malaska
The bulk load code is 14150 if you are interested.  Let me know how it can be
made faster.

It's just a Spark shuffle and writing HFiles.  Unless Astro wrote its own
shuffle, the times should be very close.
On Aug 11, 2015 8:49 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Ted,



 Thanks for pointing out more details of HBase-14181. I am afraid I may
 still need to learn more before I can make very accurate and pointed
 comments.



 As for filter push down, Astro has a powerful approach to break down
 arbitrarily complex logic expressions comprising AND/OR/IN/NOT to generate
 partition-specific predicates to be pushed down to HBase. This may not be a
 significant performance improvement if the filter logic is simple and/or the
 processing is IO-bound, but could be so for online ad-hoc analysis.



 For UDFs, Astro supports them both inside and outside of the HBase custom filter.



 For secondary indexes, Astro does not support them now. With the probable
 support by HBase in the future (thanks to Ted Yu's comments a while ago), we
 could add this support along with its specific optimizations.



 For bulk load, Astro has a much faster way to load the tabular data, we
 believe.



 Right now, Astro’s filter pushdown is through HBase built-in filters and
 custom filter.



 As for HBase-14181, I see some overlaps with Astro. Both have dependencies
 on Spark SQL, both support Spark DataFrame as an access interface, and both
 support predicate pushdown.

 Astro is not designed for MR (or Spark’s equivalent) access though.



 If HBase-14181 is shooting for access to HBase data through a subset of
 DataFrame functionalities like filter, projection, and other map-side ops,
 would it be feasible to decouple it from Spark?

 My understanding is that 14181 does not run the Spark execution engine at all,
 but will make use of Spark DataFrame semantics and/or logical planning to pass
 a logical (sub-)plan to HBase. If true, it might be desirable to directly
 support DataFrame in HBase.



 Thanks,





 *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
 *Sent:* Wednesday, August 12, 2015 7:28 AM
 *To:* Yan Zhou.sc
 *Cc:* user; dev@spark.apache.org; Bing Xiao (Bing); Ted Yu
 *Subject:* RE: Re: Re: Package Release Announcement: Spark SQL on HBase
 Astro



 Hey Yan,

 I've been the one building out this spark functionality in hbase so maybe
 I can help clarify.

 The hbase-spark module is just focused on making spark integration with
 hbase easy and out of the box for both spark and spark streaming.

 I, and I believe the HBase team, have no desire to build a SQL engine in
 HBase.  This JIRA comes the closest to that line.  The main thing here is
 filter push down logic for basic SQL operations like =, <, and >.
 User-defined functions and secondary indexes are not in my scope.

 Another main goal of the hbase-spark module is to allow a user to
 do anything they did with MR/HBase now with Spark/HBase.  Things like bulk
 load.

 Let me know if you have any questions.

 Ted Malaska

 On Aug 11, 2015 7:13 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 We have not “formally” published any numbers yet. A good reference is a
 slide deck we posted for the meetup in March, or better yet, interested
 parties can run performance comparisons by themselves for now.



 As for the status quo of Astro, we have been focusing on fixing bugs
 (a UDF-related bug in some coprocessor/custom filter combos) and adding
 support for querying string columns in HBase as integers from Astro.



 Thanks,



 *From:* Ted Yu [mailto:yuzhih...@gmail.com]
 *Sent:* Wednesday, August 12, 2015 7:02 AM
 *To:* Yan Zhou.sc
 *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
 *Subject:* Re: Re: Re: Package Release Announcement: Spark SQL on HBase
 Astro



 Yan:

 Where can I find performance numbers for Astro (it's close to middle of
 August) ?



 Cheers



 On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc yan.zhou...@huawei.com
 wrote:

 Finally I can take a look at HBASE-14181 now. Unfortunately there is no
 design doc mentioned. Superficially it is very similar to Astro, with the
 difference that this is part of the HBase client library, while Astro works
 as a Spark package and so will evolve and function more closely with Spark
 SQL/DataFrame instead of HBase.



 In terms of architecture, my take is loosely-coupled query engines on top
 of KV store vs. an array of query engines supported by, and packaged as
 part of, a KV store.



 Functionality-wise the two could be close but Astro also supports Python
 as a result of tight integration with Spark.

 It will be interesting to see performance comparisons when HBase-14181 is
 ready.



 Thanks,





 *From:* Ted Yu [mailto:yuzhih...@gmail.com]
 *Sent:* Tuesday, August 11, 2015 3:28 PM
 *To:* Yan Zhou.sc
 *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
 *Subject:* Re: Re: Package Release Announcement: Spark SQL on HBase Astro



 HBase will not have a query engine.



 It will provide

RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Malaska
There are a number of ways to bulk load.

There is bulk put, partition bulk put, MR bulk load, and now HBASE-14150,
which is Spark shuffle bulk load.

Let me know if I have missed a bulk loading option.  All of these are possible
with the new hbase-spark module.

As for the filter push down discussion in the past email: you will note in
14181 that the filter push down will also limit the scan range or drop the
scan altogether for Gets.

Ted Malaska
On Aug 11, 2015 9:06 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 No, the Astro bulk loader does not use its own shuffle. But the map/reduce-side
 processing is somewhat different from HBase's bulk loader that is used by
 many HBase apps, I believe.



 *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
 *Sent:* Wednesday, August 12, 2015 8:56 AM
 *To:* Yan Zhou.sc
 *Cc:* dev@spark.apache.org; Ted Yu; Bing Xiao (Bing); user
 *Subject:* RE: Re: Re: Package Release Announcement: Spark SQL on HBase
 Astro



 The bulk load code is 14150 if you are interested.  Let me know how it can be
 made faster.

 It's just a Spark shuffle and writing HFiles.  Unless Astro wrote its
 own shuffle, the times should be very close.

 On Aug 11, 2015 8:49 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Ted,



 Thanks for pointing out more details of HBase-14181. I am afraid I may
 still need to learn more before I can make very accurate and pointed
 comments.



 As for filter push down, Astro has a powerful approach to break down
 arbitrarily complex logic expressions comprising AND/OR/IN/NOT to generate
 partition-specific predicates to be pushed down to HBase. This may not be a
 significant performance improvement if the filter logic is simple and/or the
 processing is IO-bound, but could be so for online ad-hoc analysis.



 For UDFs, Astro supports them both inside and outside of the HBase custom filter.



 For secondary indexes, Astro does not support them now. With the probable
 support by HBase in the future (thanks to Ted Yu's comments a while ago), we
 could add this support along with its specific optimizations.



 For bulk load, Astro has a much faster way to load the tabular data, we
 believe.



 Right now, Astro’s filter pushdown is through HBase built-in filters and
 custom filter.



 As for HBase-14181, I see some overlaps with Astro. Both have dependencies
 on Spark SQL, both support Spark DataFrame as an access interface, and both
 support predicate pushdown.

 Astro is not designed for MR (or Spark’s equivalent) access though.



 If HBase-14181 is shooting for access to HBase data through a subset of
 DataFrame functionalities like filter, projection, and other map-side ops,
 would it be feasible to decouple it from Spark?

 My understanding is that 14181 does not run the Spark execution engine at all,
 but will make use of Spark DataFrame semantics and/or logical planning to pass
 a logical (sub-)plan to HBase. If true, it might be desirable to directly
 support DataFrame in HBase.



 Thanks,





 *From:* Ted Malaska [mailto:ted.mala...@cloudera.com]
 *Sent:* Wednesday, August 12, 2015 7:28 AM
 *To:* Yan Zhou.sc
 *Cc:* user; dev@spark.apache.org; Bing Xiao (Bing); Ted Yu
 *Subject:* RE: Re: Re: Package Release Announcement: Spark SQL on HBase
 Astro



 Hey Yan,

 I've been the one building out this spark functionality in hbase so maybe
 I can help clarify.

 The hbase-spark module is just focused on making spark integration with
 hbase easy and out of the box for both spark and spark streaming.

 I, and I believe the HBase team, have no desire to build a SQL engine in
 HBase.  This JIRA comes the closest to that line.  The main thing here is
 filter push down logic for basic SQL operations like =, <, and >.
 User-defined functions and secondary indexes are not in my scope.

 Another main goal of the hbase-spark module is to allow a user to
 do anything they did with MR/HBase now with Spark/HBase.  Things like bulk
 load.

 Let me know if you have any questions.

 Ted Malaska

 On Aug 11, 2015 7:13 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 We have not “formally” published any numbers yet. A good reference is a
 slide deck we posted for the meetup in March, or better yet, interested
 parties can run performance comparisons by themselves for now.



 As for the status quo of Astro, we have been focusing on fixing bugs
 (a UDF-related bug in some coprocessor/custom filter combos) and adding
 support for querying string columns in HBase as integers from Astro.



 Thanks,



 *From:* Ted Yu [mailto:yuzhih...@gmail.com]
 *Sent:* Wednesday, August 12, 2015 7:02 AM
 *To:* Yan Zhou.sc
 *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
 *Subject:* Re: Re: Re: Package Release Announcement: Spark SQL on HBase
 Astro



 Yan:

 Where can I find performance numbers for Astro (it's close to middle of
 August) ?



 Cheers



 On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc yan.zhou...@huawei.com
 wrote:

 Finally I can take a look at HBASE-14181 now. Unfortunately there is no
 design

RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro

2015-08-11 Thread Ted Malaska
Hey Yan,

I've been the one building out this spark functionality in hbase so maybe I
can help clarify.

The hbase-spark module is just focused on making spark integration with
hbase easy and out of the box for both spark and spark streaming.

I, and I believe the HBase team, have no desire to build a SQL engine in
HBase.  This JIRA comes the closest to that line.  The main thing here is
filter push down logic for basic SQL operations like =, <, and >.
User-defined functions and secondary indexes are not in my scope.

Another main goal of the hbase-spark module is to allow a user to
do anything they did with MR/HBase now with Spark/HBase.  Things like bulk
load.

Let me know if you have any questions.

Ted Malaska
On Aug 11, 2015 7:13 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 We have not “formally” published any numbers yet. A good reference is a
 slide deck we posted for the meetup in March, or better yet, interested
 parties can run performance comparisons by themselves for now.



 As for the status quo of Astro, we have been focusing on fixing bugs
 (a UDF-related bug in some coprocessor/custom filter combos) and adding
 support for querying string columns in HBase as integers from Astro.



 Thanks,



 *From:* Ted Yu [mailto:yuzhih...@gmail.com]
 *Sent:* Wednesday, August 12, 2015 7:02 AM
 *To:* Yan Zhou.sc
 *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
 *Subject:* Re: Re: Re: Package Release Announcement: Spark SQL on HBase
 Astro



 Yan:

 Where can I find performance numbers for Astro (it's close to middle of
 August) ?



 Cheers



 On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc yan.zhou...@huawei.com
 wrote:

 Finally I can take a look at HBASE-14181 now. Unfortunately there is no
 design doc mentioned. Superficially it is very similar to Astro, with the
 difference that this is part of the HBase client library, while Astro works
 as a Spark package and so will evolve and function more closely with Spark
 SQL/DataFrame instead of HBase.



 In terms of architecture, my take is loosely-coupled query engines on top
 of KV store vs. an array of query engines supported by, and packaged as
 part of, a KV store.



 Functionality-wise the two could be close but Astro also supports Python
 as a result of tight integration with Spark.

 It will be interesting to see performance comparisons when HBase-14181 is
 ready.



 Thanks,





 *From:* Ted Yu [mailto:yuzhih...@gmail.com]
 *Sent:* Tuesday, August 11, 2015 3:28 PM
 *To:* Yan Zhou.sc
 *Cc:* Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
 *Subject:* Re: Re: Package Release Announcement: Spark SQL on HBase Astro



 HBase will not have a query engine.



 It will provide better support for query engines.



 Cheers


 On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:

 Ted,



 I’m in China now, and seem to be having difficulty accessing Apache Jira.
 Anyway, it appears to me that HBASE-14181
 https://issues.apache.org/jira/browse/HBASE-14181 attempts to support
 Spark DataFrame inside HBase.

 If true, one question to me is whether HBase is intended to have a
 built-in query engine or not, or whether it will stick with the current
 approach of a k-v store with some built-in processing capabilities in the
 form of coprocessors, custom filters, etc., which allows for loosely-coupled
 query engines built on top of it.



 Thanks,



 *From:* Ted Yu [mailto:yuzhih...@gmail.com]
 *Sent:* August 11, 2015 8:54
 *To:* Bing Xiao (Bing)
 *Cc:* dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
 *Subject:* Re: Package Release Announcement: Spark SQL on HBase Astro



 Yan / Bing:

 Mind taking a look at HBASE-14181
 https://issues.apache.org/jira/browse/HBASE-14181 'Add Spark DataFrame
 DataSource to HBase-Spark Module' ?



 Thanks



 On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com
 wrote:

 We are happy to announce the availability of the Spark SQL on HBase 1.0.0
 release.
 http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase

 The main features in this package, dubbed “Astro”, include:

 · Systematic and powerful handling of data pruning and
 intelligent scan, based on partial evaluation technique

 · HBase pushdown capabilities like custom filters and coprocessor
 to support ultra low latency processing

 · SQL, Data Frame support

 · More SQL capabilities made possible (Secondary index, bloom
 filter, Primary Key, Bulk load, Update)

 · Joins with data from other sources

 · Python/Java/Scala support

 · Support latest Spark 1.4.0 release



 The tests by the Huawei team and community contributors covered these areas:
 bulk load; projection pruning; partition pruning; partial evaluation; code
 generation; coprocessor; custom filtering; DML; complex filtering on keys
 and non-keys; join/union with non-HBase data; Data Frame; multi-column
 family test.  We will post the test results including

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Thanks Michal,

Just to share what I'm working on in a related topic.  So a long time ago I
built SparkOnHBase and put it into Cloudera Labs at this link:
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

Also, recently I have been working on getting this into HBase core.  It will
hopefully be in HBase core within the next couple of weeks.

https://issues.apache.org/jira/browse/HBASE-13992

Then I'm planning on adding DataFrame and bulk load support through:

https://issues.apache.org/jira/browse/HBASE-14149
https://issues.apache.org/jira/browse/HBASE-14150

Also, if you are interested, this is running today at at least half a dozen
companies with Spark Streaming.  Here is one blog post of a successful
implementation:

http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/

Here is an additional example blog post I put together:

http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

Let me know if you have any questions, also let me know if you want to
connect to join efforts.

Ted Malaska

On Tue, Jul 28, 2015 at 11:59 AM, Michal Haris michal.ha...@visualdna.com
wrote:

 Hi all, for the last couple of months I've been working on large graph
 analytics, and along the way have written from scratch an HBase-Spark
 integration, as none of the ones out there worked either in terms of scale
 or in the way they integrated with the RDD interface. This week I have
 generalised it into an (almost) Spark module, which works with the latest
 Spark and the new HBase API, so... sharing!
 https://github.com/michal-harish/spark-on-hbase


 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR



Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Yup you should be able to do that with the APIs that are going into HBase.

Let me know if you need to chat about the problem and how to implement it
with the HBase apis.

We have tried to cover any possible way to use HBase with Spark.  Let us
know if we missed anything; if we did, we will add it.
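
As a rough illustration of that kind of flow, here is a sketch of a
distributed scan into an RDD via the hbase-spark HBaseContext (the exact
hbaseRDD signature, table name, and column family are assumptions for this
sketch):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ScanSketch"))
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

// Distributed scan of table "t1" into an RDD of (rowKey, Result), which can
// then feed iterative Spark transformations before being written back.
val scan = new Scan().addFamily(Bytes.toBytes("cf"))
val rdd = hbaseContext.hbaseRDD(TableName.valueOf("t1"), scan)

val rowKeys = rdd.map { case (_, result) => Bytes.toString(result.getRow) }
rowKeys.take(10).foreach(println)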

On Tue, Jul 28, 2015 at 12:12 PM, Michal Haris michal.ha...@visualdna.com
wrote:

 Hi Ted, yes, the Cloudera blog and your code were my starting point, but I
 needed something more Spark-centric rather than HBase-centric. Basically doing a
 lot of ad-hoc transformations with RDDs that were based on HBase tables and
 then mutating them after a series of iterative (BSP-like) steps.

 On 28 July 2015 at 17:06, Ted Malaska ted.mala...@cloudera.com wrote:

 Thanks Michal,

 Just to share what I'm working on in a related topic.  So a long time ago
 I built SparkOnHBase and put it into Cloudera Labs at this link:
 http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

 Also, recently I have been working on getting this into HBase core.  It will
 hopefully be in HBase core within the next couple of weeks.

 https://issues.apache.org/jira/browse/HBASE-13992

 Then I'm planning on adding DataFrame and bulk load support through:

 https://issues.apache.org/jira/browse/HBASE-14149
 https://issues.apache.org/jira/browse/HBASE-14150

 Also, if you are interested, this is running today at at least half a
 dozen companies with Spark Streaming.  Here is one blog post of a successful
 implementation:


 http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/

 Here is an additional example blog post I put together:


 http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

 Let me know if you have any questions, also let me know if you want to
 connect to join efforts.

 Ted Malaska

 On Tue, Jul 28, 2015 at 11:59 AM, Michal Haris 
 michal.ha...@visualdna.com wrote:

 Hi all, for the last couple of months I've been working on large graph
 analytics, and along the way have written from scratch an HBase-Spark
 integration, as none of the ones out there worked either in terms of scale
 or in the way they integrated with the RDD interface. This week I have
 generalised it into an (almost) Spark module, which works with the latest
 Spark and the new HBase API, so... sharing!
 https://github.com/michal-harish/spark-on-hbase


 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR





 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR



Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Stuff that people are using is here.

https://github.com/cloudera-labs/SparkOnHBase

The stuff going into HBase is here
https://issues.apache.org/jira/browse/HBASE-13992

If you want to add things to the HBase ticket, let's do it in another JIRA,
like these JIRAs:

https://issues.apache.org/jira/browse/HBASE-14149
https://issues.apache.org/jira/browse/HBASE-14150

This first JIRA is mainly about getting the Spark dependencies and a separate
module set up, so we can start making additional JIRAs to add additional
functionality.

The goal is to have the following in HBase by end of summer:

RDD and DStream Functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame Functions
1. BulkPut
2. BulkGet
6. Distributed Scan
7. BulkLoad

If you think there should be more let me know

Ted Malaska
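
To make the list above concrete, here is a sketch of what the RDD-level
BulkPut looks like against HBaseContext (the exact bulkPut signature and the
table/family names are assumptions for this sketch):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("BulkPutSketch"))
val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

// Hypothetical (rowKey, value) pairs to write to table "t1", family "cf".
val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))

hbaseContext.bulkPut[(String, String)](
  rdd,
  TableName.valueOf("t1"),
  { case (rowKey, value) =>
    new Put(Bytes.toBytes(rowKey))
      .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(value))
  })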


On Tue, Jul 28, 2015 at 12:17 PM, Michal Haris michal.ha...@visualdna.com
wrote:

 Cool, will revisit, is your latest code visible publicly somewhere ?

 On 28 July 2015 at 17:14, Ted Malaska ted.mala...@cloudera.com wrote:

 Yup you should be able to do that with the APIs that are going into HBase.

 Let me know if you need to chat about the problem and how to implement it
 with the HBase apis.

 We have tried to cover any possible way to use HBase with Spark.  Let us
 know if we missed anything; if we did, we will add it.

 On Tue, Jul 28, 2015 at 12:12 PM, Michal Haris 
 michal.ha...@visualdna.com wrote:

 Hi Ted, yes, the Cloudera blog and your code were my starting point, but I
 needed something more Spark-centric rather than HBase-centric. Basically doing a
 lot of ad-hoc transformations with RDDs that were based on HBase tables and
 then mutating them after a series of iterative (BSP-like) steps.

 On 28 July 2015 at 17:06, Ted Malaska ted.mala...@cloudera.com wrote:

 Thanks Michal,

 Just to share what I'm working on in a related topic.  So a long time
 ago I built SparkOnHBase and put it into Cloudera Labs at this link:
 http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

 Also, recently I have been working on getting this into HBase core.  It will
 hopefully be in HBase core within the next couple of weeks.

 https://issues.apache.org/jira/browse/HBASE-13992

 Then I'm planning on adding DataFrame and bulk load support through:

 https://issues.apache.org/jira/browse/HBASE-14149
 https://issues.apache.org/jira/browse/HBASE-14150

 Also, if you are interested, this is running today at at least half a
 dozen companies with Spark Streaming.  Here is one blog post of a successful
 implementation:


 http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/

 Here is an additional example blog post I put together:


 http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

 Let me know if you have any questions, also let me know if you want to
 connect to join efforts.

 Ted Malaska

 On Tue, Jul 28, 2015 at 11:59 AM, Michal Haris 
 michal.ha...@visualdna.com wrote:

 Hi all, for the last couple of months I've been working on large graph
 analytics, and along the way have written from scratch an HBase-Spark
 integration, as none of the ones out there worked either in terms of scale
 or in the way they integrated with the RDD interface. This week I have
 generalised it into an (almost) Spark module, which works with the latest
 Spark and the new HBase API, so... sharing!
 https://github.com/michal-harish/spark-on-hbase


 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR





 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR





 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR



Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Sorry, this is more correct:

RDD and DStream Functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame Functions
1. BulkPut
2. BulkGet
3. Foreach with connection
4. Map with connection
5. Distributed Scan
6. BulkLoad


On Tue, Jul 28, 2015 at 12:23 PM, Ted Malaska ted.mala...@cloudera.com
wrote:

 Stuff that people are using is here.

 https://github.com/cloudera-labs/SparkOnHBase

 The stuff going into HBase is here
 https://issues.apache.org/jira/browse/HBASE-13992

 If you want to add things to the HBase ticket, let's do it in another JIRA,
 like these JIRAs:

 https://issues.apache.org/jira/browse/HBASE-14149
 https://issues.apache.org/jira/browse/HBASE-14150

 This first JIRA is mainly about getting the Spark dependencies and a separate
 module set up, so we can start making additional JIRAs to add additional
 functionality.

 The goal is to have the following in HBase by end of summer:

 RDD and DStream Functions
 1. BulkPut
 2. BulkGet
 3. BulkDelete
 4. Foreach with connection
 5. Map with connection
 6. Distributed Scan
 7. BulkLoad

 DataFrame Functions
 1. BulkPut
 2. BulkGet
 6. Distributed Scan
 7. BulkLoad

 If you think there should be more let me know

 Ted Malaska


 On Tue, Jul 28, 2015 at 12:17 PM, Michal Haris michal.ha...@visualdna.com
  wrote:

 Cool, will revisit, is your latest code visible publicly somewhere ?

 On 28 July 2015 at 17:14, Ted Malaska ted.mala...@cloudera.com wrote:

 Yup you should be able to do that with the APIs that are going into
 HBase.

 Let me know if you need to chat about the problem and how to implement
 it with the HBase apis.

 We have tried to cover any possible way to use HBase with Spark.  Let us
 know if we missed anything; if we did, we will add it.

 On Tue, Jul 28, 2015 at 12:12 PM, Michal Haris 
 michal.ha...@visualdna.com wrote:

 Hi Ted, yes, the Cloudera blog and your code were my starting point, but I
 needed something more Spark-centric rather than HBase-centric. Basically doing a
 lot of ad-hoc transformations with RDDs that were based on HBase tables and
 then mutating them after a series of iterative (BSP-like) steps.

 On 28 July 2015 at 17:06, Ted Malaska ted.mala...@cloudera.com wrote:

 Thanks Michal,

 Just to share what I'm working on in a related topic.  So a long time
 ago I built SparkOnHBase and put it into Cloudera Labs at this link:
 http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

 Also, recently I have been working on getting this into HBase core.  It will
 hopefully be in HBase core within the next couple of weeks.

 https://issues.apache.org/jira/browse/HBASE-13992

 Then I'm planning on adding DataFrame and bulk load support through:

 https://issues.apache.org/jira/browse/HBASE-14149
 https://issues.apache.org/jira/browse/HBASE-14150

 Also, if you are interested, this is running today at at least half a
 dozen companies with Spark Streaming.  Here is one blog post of a successful
 implementation:


 http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/

 Here is an additional example blog post I put together:


 http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

 Let me know if you have any questions, also let me know if you want to
 connect to join efforts.

 Ted Malaska

 On Tue, Jul 28, 2015 at 11:59 AM, Michal Haris 
 michal.ha...@visualdna.com wrote:

 Hi all, for the last couple of months I've been working on large graph
 analytics, and along the way have written from scratch an HBase-Spark
 integration, as none of the ones out there worked either in terms of scale
 or in the way they integrated with the RDD interface. This week I have
 generalised it into an (almost) Spark module, which works with the latest
 Spark and the new HBase API, so... sharing!
 https://github.com/michal-harish/spark-on-hbase


 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR





 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR





 --
 Michal Haris
 Technical Architect
 direct line: +44 (0) 207 749 0229
 www.visualdna.com | t: +44 (0) 207 734 7033
 31 Old Nichol Street
 London
 E2 7HR





Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
100% I would love to do it.  Who is a good person to review the design with?
All I need is a quick chat about the design and approach, and I'll create
the JIRA and push a patch.

Ted Malaska

On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the DataFrame API, and my
 wish would be to be able to apply it on multiple columns at the same time
 and get all these statistics.
 The .describe() function is close to what we want to achieve; maybe we
 could try to enrich its output.
 Anyway, even as a Spark package, if you could package your code for
 DataFrames, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:

 Ha ok !

 Then the generic part would have this signature:

 def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
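
 A minimal sketch of that signature in terms of plain DataFrame operations
 (the ordering by descending count is just one possible choice):

 import org.apache.spark.sql.DataFrame
 import org.apache.spark.sql.functions.{col, desc}

 // One value-count DataFrame per column, instead of grouping on the tuple of
 // all columns at once.
 def countColsByValue(df: DataFrame): Map[String, DataFrame] =
   df.columns.map { c =>
     c -> df.groupBy(col(c)).count().orderBy(desc("count"))
   }.toMap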


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool for
 that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to add it
 to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one column
 gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the same
 kind of output for each column, not considering each n-tuple of column
 values as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com
 :

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))


 df.groupBy("n").count().show()


 // generic
 def countByValueDf(df:DataFrame) = {

   val (h :: r) = df.columns.toList

   df.groupBy(h, r:_*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to get 
 the
 most frequent categorical value on multiple columns would be very useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94



Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I added the following JIRA:

https://issues.apache.org/jira/browse/SPARK-9237

Please help me get it assigned to myself. Thanks.

Ted Malaska

On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska ted.mala...@cloudera.com
wrote:

 Cool, I will make a JIRA after I check in to my hotel, and try to get a
 patch early next week.
 On Jul 21, 2015 5:15 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 yes and freqItems does not give you an ordered count (right ?) + the
 threshold makes it difficult to calibrate it + we noticed some strange
 behaviour when testing it on small datasets.

 2015-07-21 20:30 GMT+02:00 Ted Malaska ted.mala...@cloudera.com:

 Look at the implementation for frequent items.  It is different from a
 true count.
 On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote:

 Is this just frequent items?


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97



 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 100% I would love to do it.  Who is a good person to review the design
 with?  All I need is a quick chat about the design and approach, and I'll
 create the JIRA and push a patch.

 Ted Malaska

 On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the Dataframe API and
 my wish would be to be able to apply it on multiple columns at the same
 time and get all these statistics.
 the .describe() function is close to what we want to achieve, maybe
 we could try to enrich its output.
 Anyway, even as a spark-package, if you could package your code for
 Dataframes, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ha ok !

 Then generic part would have that signature :

 def countColsByValue(df:Dataframe):Map[String /* colname
 */,Dataframe]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool
 for that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com
 wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to
 add it to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one
 column gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the
 same kind of output for each column, not considering each n-tuple of
 column values as the key (which is what the groupBy is doing by
 default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))


 df.groupBy("n").count().show()


 // generic
 def countByValueDf(df:DataFrame) = {

   val (h :: r) = df.columns.toList

   df.groupBy(h, r:_*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to 
 get the
 most frequent categorical value on multiple columns would be very 
 useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94




Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Look at the implementation for frequent items.  It is different from a
true count.
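
For context, a small sketch of the difference (freqItems is the approximate
DataFrameStatFunctions call; the groupBy count is what an exact per-value
count would look like; the column name "n" is a placeholder):

import org.apache.spark.sql.DataFrame

// Approximate: returns the set of items whose frequency exceeds the given
// support, with no counts attached.
def approxFrequent(df: DataFrame): DataFrame =
  df.stat.freqItems(Seq("n"), 0.01)

// Exact: a true count for every distinct value of the column.
def exactCounts(df: DataFrame): DataFrame =
  df.groupBy("n").count()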
On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote:

 Is this just frequent items?


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97



 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 100% I would love to do it.  Who is a good person to review the design
 with?  All I need is a quick chat about the design and approach, and I'll
 create the JIRA and push a patch.

 Ted Malaska

 On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the Dataframe API and my
 wish would be to be able to apply it on multiple columns at the same time
 and get all these statistics.
 the .describe() function is close to what we want to achieve, maybe we
 could try to enrich its output.
 Anyway, even as a spark-package, if you could package your code for
 Dataframes, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com
 :

 Ha ok !

 Then generic part would have that signature :

 def countColsByValue(df:Dataframe):Map[String /* colname */,Dataframe]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool for
 that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to add
 it to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work: the countByValue on one
 column gives you the count for each value seen in that column.
 I would like a generic (multi-column) countByValue to give me the
 same kind of output for each column, not treating each n-tuple of column
 values as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()


 // generic
 def countByValueDf(df:DataFrame) = {

   val (h :: r) = df.columns.toList

   df.groupBy(h, r:_*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to 
 get the
 most frequent categorical value on multiple columns would be very 
 useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Cool, I will make a JIRA after I check in to my hotel, and try to get a
patch out early next week.
On Jul 21, 2015 5:15 PM, Olivier Girardot o.girar...@lateral-thoughts.com
wrote:

 Yes, and freqItems does not give you an ordered count (right?); also, the
 threshold makes it difficult to calibrate, and we noticed some strange
 behaviour when testing it on small datasets.
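
 For reference, a minimal sketch of the freqItems call being compared here, assuming Spark 1.4+'s DataFrameStatFunctions (the column name "n" and the 0.3 support threshold are placeholders, not values from this thread):

 // Approximate heavy hitters: values whose relative frequency exceeds the
 // support threshold come back as one array column per requested column,
 // with no counts and no particular ordering attached.
 val heavyHitters = df.stat.freqItems(Seq("n"), 0.3)
 heavyHitters.show()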

 2015-07-21 20:30 GMT+02:00 Ted Malaska ted.mala...@cloudera.com:

 Look at the implementation for frequent items.  It is different from a
 true count.
 On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote:

 Is this just frequent items?


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97



 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 100%, I would love to do it.  Who is a good person to review the design
 with?  All I need is a quick chat about the design and approach and I'll
 create the JIRA and push a patch.

 Ted Malaska

 On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the DataFrame API, and
 my wish would be to be able to apply it to multiple columns at the same
 time and get all these statistics.
 The .describe() function is close to what we want to achieve; maybe we
 could try to enrich its output.
 Anyway, even as a spark-package, if you could package your code for
 Dataframes, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ha ok !

 Then the generic part would have this signature:

 def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool for
 that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com
 wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to
 add it to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work: the countByValue on one
 column gives you the count for each value seen in that column.
 I would like a generic (multi-column) countByValue to give me the
 same kind of output for each column, not treating each n-tuple of
 column values as the key (which is what the groupBy is doing by
 default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()


 // generic
 def countByValueDf(df:DataFrame) = {

   val (h :: r) = df.columns.toList

   df.groupBy(h, r:_*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to 
 get the
 most frequent categorical value on multiple columns would be very 
 useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94



Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I'm guessing you want something like what I put in this blog post.

http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

This is a very common use case.  If there is a +1 I would love to add it to
dataframes.

Let me know
Ted Malaska

On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work: the countByValue on one column
 gives you the count for each value seen in that column.
 I would like a generic (multi-column) countByValue to give me the same
 kind of output for each column, not treating each n-tuple of column values
 as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()


 // generic
 def countByValueDf(df:DataFrame) = {

   val (h :: r) = df.columns.toList

   df.groupBy(h, r:_*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to get the
 most frequent categorical value on multiple columns would be very useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94



Re: Welcoming some new committers

2015-06-20 Thread Ted Malaska
Super congrats.  Well earned.
On Jun 20, 2015 12:48 PM, Andrew Or and...@databricks.com wrote:

 Welcome!

 2015-06-20 7:30 GMT-07:00 Debasish Das debasish.da...@gmail.com:

 Congratulations to All.

 DB great work in bringing quasi newton methods to Spark !

 On Wed, Jun 17, 2015 at 3:18 PM, Chester Chen ches...@alpinenow.com
 wrote:

 Congratulations to All.

 DB and Sandy, great works !


 On Wed, Jun 17, 2015 at 3:12 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Hey all,

 Over the past 1.5 months we added a number of new committers to the
 project, and I wanted to welcome them now that all of their respective
 forms, accounts, etc are in. Join me in welcoming the following new
 committers:

 - Davies Liu
 - DB Tsai
 - Kousuke Saruta
 - Sandy Ryza
 - Yin Huai

 Looking forward to more great contributions from all of these folks.

 Matei
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org







Re: Spark-Submit issues

2014-11-12 Thread Ted Malaska
Hey this is Ted

Are you using the Shade plugin when you build your jar, and are you using the
bigger (fat) jar?  It looks like the classes are not included in your jar.
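
For a rough illustration only, here is one way to build such a fat jar with sbt-assembly rather than the Maven Shade plugin (the plugin coordinates, Spark version, and "provided" scoping below are assumptions for the sketch, not taken from this thread):

// project/plugins.sbt (assumed plugin coordinates)
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt -- mark Spark itself as provided and bundle the extra
// dependencies your main class actually references, so they end up
// inside the application jar handed to spark-submit.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-flume" % "1.1.0"
)

// Then run `sbt assembly` and pass the resulting fat jar to spark-submit.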

On Wed, Nov 12, 2014 at 2:09 AM, Jeniba Johnson 
jeniba.john...@lntinfotech.com wrote:

 Hi Hari,

 Now I am trying out the same FlumeEventCount example, running it with
 spark-submit instead of run-example. The step I followed is that I have
 exported JavaFlumeEventCount.java into a jar.

 The command used is
 ./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop1.0.4.jar
 --master local --class org.JavaFlumeEventCount  bin/flumeeventcnt2.jar
 localhost 2323

 The output is
 14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1
 blocks
 14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time
 1415795102000

 If I use this command
  ./bin/spark-submit --master local --class org.JavaFlumeEventCount
 bin/flumeeventcnt2.jar  localhost 2323

 Then I get an error
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/examples/streaming/StreamingExamples
 at org.JavaFlumeEventCount.main(JavaFlumeEventCount.java:22)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.examples.streaming.StreamingExamples
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
 ... 8 more


 I just wanted to ask: why is it able to find spark-assembly.jar but not
 spark-examples.jar?
 The next doubt is that while running the FlumeEventCount example through
 run-example

 I get an output as
 Received 4 flume events.

 14/11/12 18:30:14 INFO scheduler.JobScheduler: Finished job streaming job
 1415797214000 ms.0 from job set of time 1415797214000 ms
 14/11/12 18:30:14 INFO rdd.MappedRDD: Removing RDD 70 from persistence list

 But If I run the same program through Spark-Submit

 I get an output as
 14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1
 blocks
 14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time
 1415795102000

 So I need a clarification: in the program the print statement is
 written as "Received n flume events", so how come I am able to see
 "Stream 0 received n blocks"?
 And what is the difference between running the program through spark-submit
 and through run-example?

 Awaiting for your kind reply

 Regards,
 Jeniba Johnson


 
 The contents of this e-mail and any attachment(s) may contain confidential
 or privileged information for the intended recipient(s). Unintended
 recipients are prohibited from taking action on the basis of information in
 this e-mail and using or disseminating the information, and must notify the
 sender and delete it from their system. LT Infotech will not accept
 responsibility or liability for the accuracy or completeness of, or the
 presence of any virus or disabling code in this e-mail



Re: Spark-Submit issues

2014-11-12 Thread Ted Malaska
Other wish include them at the time of execution.  here is an example.

spark-submit --jars
/opt/cloudera/parcels/CDH/lib/zookeeper/zookeeper-3.4.5-cdh5.1.0.jar,/opt/cloudera/parcels/CDH/lib/hbase/lib/guava-12.0.1.jar,/opt/cloudera/parcels/CDH/lib/hbase/lib/protobuf-java-2.5.0.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop-compat.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar,/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar
--class org.apache.spark.hbase.example.HBaseBulkDeleteExample --master yarn
--deploy-mode client --executor-memory 512M --num-executors 4
--driver-java-options
-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/lib/*
SparkHBase.jar t1 c

On Wed, Nov 12, 2014 at 4:25 PM, Hari Shreedharan hshreedha...@cloudera.com
 wrote:

 Yep, you’d need to shade jars to ensure all your dependencies are in the
 classpath.

 Thanks,
 Hari


 On Wed, Nov 12, 2014 at 3:23 AM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 Hey this is Ted

 Are you using the Shade plugin when you build your jar, and are you using the
 bigger (fat) jar?  It looks like the classes are not included in your jar.

 On Wed, Nov 12, 2014 at 2:09 AM, Jeniba Johnson 
 jeniba.john...@lntinfotech.com wrote:

 Hi Hari,

 Now I am trying out the same FlumeEventCount example, running it with
 spark-submit instead of run-example. The step I followed is that I have
 exported JavaFlumeEventCount.java into a jar.

 The command used is
 ./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop1.0.4.jar
 --master local --class org.JavaFlumeEventCount  bin/flumeeventcnt2.jar
 localhost 2323

 The output is
 14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1
 blocks
 14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time
 1415795102000

 If I use this command
  ./bin/spark-submit --master local --class org.JavaFlumeEventCount
 bin/flumeeventcnt2.jar  localhost 2323

 Then I get an error
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/examples/streaming/StreamingExamples
 at org.JavaFlumeEventCount.main(JavaFlumeEventCount.java:22)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at
 org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.examples.streaming.StreamingExamples
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
 ... 8 more


 I just wanted to ask: why is it able to find spark-assembly.jar but not
 spark-examples.jar?
 The next doubt is that while running the FlumeEventCount example through
 run-example

 I get an output as
 Received 4 flume events.

 14/11/12 18:30:14 INFO scheduler.JobScheduler: Finished job streaming
 job 1415797214000 ms.0 from job set of time 1415797214000 ms
 14/11/12 18:30:14 INFO rdd.MappedRDD: Removing RDD 70 from persistence
 list

 But If I run the same program through Spark-Submit

 I get an output as
 14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1
 blocks
 14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time
 1415795102000

 So I need a clarification: in the program the print statement is
 written as "Received n flume events", so how come I am able to see
 "Stream 0 received n blocks"?
 And what is the difference between running the program through spark-submit
 and through run-example?

 Awaiting for your kind reply

 Regards,
 Jeniba Johnson


 
 The contents of this e-mail and any attachment(s) may contain
 confidential or privileged information for the intended recipient(s).
 Unintended recipients are prohibited from taking action on the basis of
 information in this e-mail and using or disseminating the information, and
 must notify the sender and delete it from their system. LT Infotech will
 not accept responsibility or liability for the accuracy or completeness of,
 or the presence of any virus

Re: Compile error when compiling for cloudera

2014-07-17 Thread Ted Malaska
Don't make this change yet.  I have a 1642 that needs to get through around
the same code.

I can make this change after 1642 is through.


On Thu, Jul 17, 2014 at 12:25 PM, Sean Owen so...@cloudera.com wrote:

 CC tmalaska since he touched the line in question. This is a fun one.
 So, here's the line of code added last week:

 val channelFactory = new NioServerSocketChannelFactory
   (Executors.newCachedThreadPool(), Executors.newCachedThreadPool());

 Scala parses this as two statements, one invoking a no-arg constructor
 and one making a tuple for fun. Put it on one line and it's fine.

 It works with newer Netty since there is a no-arg constructor. It
 fails with older Netty, which is what you get with older Hadoop.

 The fix is obvious. I'm away and if nobody beats me to a PR in the
 meantime, I'll propose one as an addendum to the recent JIRA.
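
 For clarity, a sketch of the fix described above: keep the constructor and its argument list together so Scala parses a single two-argument invocation (a minimal sketch only, not the actual patch):

 import java.util.concurrent.Executors
 import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory

 // Written as one expression, this is a single constructor call with two
 // executors, not a no-arg constructor followed by a standalone tuple.
 val channelFactory = new NioServerSocketChannelFactory(
   Executors.newCachedThreadPool(), Executors.newCachedThreadPool())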

 Sean


 On Thu, Jul 17, 2014 at 3:58 PM, Nathan Kronenfeld
 nkronenf...@oculusinfo.com wrote:
  My full build command is:
  ./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly
 
 
  I've changed one line in RDD.scala, nothing else.
 
 
 
  On Thu, Jul 17, 2014 at 10:56 AM, Sean Owen so...@cloudera.com wrote:
 
  This looks like a Jetty version problem actually. Are you bringing in
  something that might be changing the version of Jetty used by Spark?
  It depends a lot on how you are building things.
 
  Good to specify exactly how your'e building here.
 
  On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld
  nkronenf...@oculusinfo.com wrote:
   I'm trying to compile the latest code, with the hadoop-version set for
   2.0.0-mr1-cdh4.6.0.
  
   I'm getting the following error, which I don't get when I don't set
 the
   hadoop version:
  
   [error]
  
 
 /data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:156:
   overloaded method constructor NioServerSocketChannelFactory with
   alternatives:
   [error]   (x$1: java.util.concurrent.Executor,x$2:
   java.util.concurrent.Executor,x$3:
   Int)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
  and
   [error]   (x$1: java.util.concurrent.Executor,x$2:
  
 
 java.util.concurrent.Executor)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
   [error]  cannot be applied to ()
   [error]   val channelFactory = new NioServerSocketChannelFactory
   [error]^
   [error] one error found
  
  
   I don't know flume from a hole in the wall - does anyone know what I
 can
  do
   to fix this?
  
  
   Thanks,
-Nathan
  
  
   --
   Nathan Kronenfeld
   Senior Visualization Developer
   Oculus Info Inc
   2 Berkeley Street, Suite 600,
   Toronto, Ontario M5A 4J5
   Phone:  +1-416-203-3003 x 238
   Email:  nkronenf...@oculusinfo.com
 
 
 
 
  --
  Nathan Kronenfeld
  Senior Visualization Developer
  Oculus Info Inc
  2 Berkeley Street, Suite 600,
  Toronto, Ontario M5A 4J5
  Phone:  +1-416-203-3003 x 238
  Email:  nkronenf...@oculusinfo.com