Re: HBase in Spark
Yes, and I have used HBASE-15271 and successfully loaded over 20 billion records into HBase, even with node failures.

On Fri, Feb 26, 2016 at 11:55 AM, Ted Yu wrote:
> In HBase, there is an hbase-spark module which supports bulk load.
> This module is to be backported in the upcoming 1.3.0 release.
>
> There is some pending work, such as HBASE-15271.
>
> FYI
>
> On Fri, Feb 26, 2016 at 8:50 AM, Renu Yadav wrote:
>
>> Has anybody implemented bulk load into HBase using Spark?
>>
>> I need help to optimize its performance.
>>
>> Please help.
>>
>> Thanks & Regards,
>> Renu Yadav
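For readers wondering what bulk load through the hbase-spark module looks like in code, here is a minimal sketch against HBaseContext.bulkLoad. The table name, column family, qualifier, and staging directory are placeholders, and the signature has moved around between versions, so treat this as illustrative rather than authoritative:

    import org.apache.hadoop.hbase.spark.{HBaseContext, KeyFamilyQualifier}
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("bulk-load-sketch"))
    val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

    // (rowKey, value) records to load; the module shuffles them into
    // HFile order before writing, per the thread's description of 14150
    val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))

    hbaseContext.bulkLoad[(String, String)](
      rdd,
      TableName.valueOf("myTable"),
      t => {
        // each record flat-maps to (row/family/qualifier, cell value) pairs
        val kfq = new KeyFamilyQualifier(
          Bytes.toBytes(t._1), Bytes.toBytes("f"), Bytes.toBytes("q"))
        Iterator((kfq, Bytes.toBytes(t._2)))
      },
      "/tmp/hfile-staging") // HFiles land here; LoadIncrementalHFiles then moves them into the table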
Re: Spark Cannot Connect to HBaseClusterSingleton
I've always used HBaseTestingUtility and never really had much trouble; I use it for all my unit testing between Spark and HBase. Here are some code examples if you're interested:

- The main HBase-Spark module: https://github.com/apache/hbase/tree/master/hbase-spark
- A unit test suite that covers all the basic connections: https://github.com/apache/hbase/blob/master/hbase-spark/src/test/scala/org/apache/hadoop/hbase/spark/HBaseContextSuite.scala
- If you want to look at the old code before it went into HBase: https://github.com/cloudera-labs/SparkOnHBase

Let me know if that helps.

On Wed, Aug 26, 2015 at 5:40 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Can you log the contents of the Configuration you pass from Spark? The output would give you some clue.
> Cheers
>
> On Aug 26, 2015, at 2:30 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>> Hi Ted,
>> I'll check the ZooKeeper connection, but another test method that runs on HBase without Spark works without any error. The HBase version is 0.98.8-hadoop2 and I use Spark 1.3.1.
>> Kind Regards, Furkan KAMACI
>>
>> On Aug 26, 2015, at 12:08, Ted Yu <yuzhih...@gmail.com> wrote:
>>> The connection failure was to ZooKeeper. Have you verified that localhost:2181 can serve requests? What version of HBase was Gora built against?
>>> Cheers
>>>
>>> On Aug 26, 2015, at 1:50 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>>> Hi,
>>>> I start an HBase cluster for my test class using this helper class: https://github.com/apache/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/util/HBaseClusterSingleton.java
>>>> like this:
>>>>   private static final HBaseClusterSingleton cluster = HBaseClusterSingleton.build(1);
>>>> I retrieve the configuration object with cluster.getConf() and use it in Spark as follows:
>>>>   sparkContext.newAPIHadoopRDD(conf, MyInputFormat.class, clazzK, clazzV);
>>>> When I run my test there should be no need to start up a separate HBase cluster, because Spark should connect to my dummy cluster. However, when I run the test method it throws an error:
>>>>   2015-08-26 01:19:59,558 INFO [Executor task launch worker-0-SendThread(localhost:2181)] zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
>>>>   2015-08-26 01:19:59,559 WARN [Executor task launch worker-0-SendThread(localhost:2181)] zookeeper.ClientCnxn (ClientCnxn.java:run(1089)) - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
>>>>   java.net.ConnectException: Connection refused
>>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>>>>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
>>>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
>>>> The HBase tests that do not run on Spark work well. When I check the logs I see that the cluster and Spark start up correctly:
>>>>   2015-08-26 01:35:21,791 INFO [main] hdfs.MiniDFSCluster (MiniDFSCluster.java:waitActive(2055)) - Cluster is active
>>>>   2015-08-26 01:35:40,334 INFO [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkDriver' on port 56941.
>>>> I realized that when I start up an HBase from the command line, my Spark test method connects to it! So does that mean it ignores the conf I passed to it? Any ideas about how to solve this?
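The HBaseTestingUtility pattern mentioned above looks roughly like this. A minimal sketch, assuming a local-mode SparkContext; the table name and column family are made up, and the utility's API differs slightly across HBase versions:

    import org.apache.hadoop.hbase.{HBaseTestingUtility, TableName}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    val htu = new HBaseTestingUtility()
    htu.startMiniCluster() // embedded ZooKeeper, HDFS, and HBase
    htu.createTable(TableName.valueOf("test"), Bytes.toBytes("cf"))

    // The key point: hand htu.getConfiguration (which carries the mini
    // cluster's randomized ZooKeeper port) to whatever reads or writes HBase
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("hbase-test"))
    try {
      // e.g. new HBaseContext(sc, htu.getConfiguration), or pass
      // htu.getConfiguration to newAPIHadoopRDD
    } finally {
      sc.stop()
      htu.shutdownMiniCluster()
    }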
Re: Spark Cannot Connect to HBaseClusterSingleton
Where is the input format class? Whenever I use the search on your GitHub it says "We couldn't find any issues matching 'GoraInputFormat'".

On Wed, Aug 26, 2015 at 9:48 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> Hi,
> Here is MapReduceTestUtils.testSparkWordCount(): https://github.com/kamaci/gora/blob/master/gora-core/src/test/java/org/apache/gora/mapreduce/MapReduceTestUtils.java#L108
> Here is SparkWordCount: https://github.com/kamaci/gora/blob/8f1acc6d4ef6c192e8fc06287558b7bc7c39b040/gora-core/src/examples/java/org/apache/gora/examples/spark/SparkWordCount.java
> Lastly, here is GoraSparkEngine: https://github.com/kamaci/gora/blob/master/gora-core/src/main/java/org/apache/gora/spark/GoraSparkEngine.java
> Kind Regards, Furkan KAMACI
>
> On Wed, Aug 26, 2015 at 4:40 PM, Ted Malaska <ted.mala...@cloudera.com> wrote:
>> Where can I find the code for MapReduceTestUtils.testSparkWordCount?
>>
>> On Wed, Aug 26, 2015 at 9:29 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
>>> Hi,
>>> Here is the test method I've ignored due to the Connection Refused failure: https://github.com/kamaci/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/mapreduce/TestHBaseStoreWordCount.java#L65
>>> I've implemented a Spark backend for Apache Gora as a GSoC project, and this is the latest obstacle I have to solve. If you can help me, you are welcome.
>>> Kind Regards, Furkan KAMACI
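One way to chase the "does it care about the conf I passed" question from this thread is to dump the ZooKeeper settings from the conf object right before handing it to Spark and compare them with the port the mini cluster actually bound. A hedged sketch; cluster, MyInputFormat, clazzK and clazzV are the thread's own placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.HConstants

    val conf: Configuration = cluster.getConf
    // If these print localhost:2181 rather than the mini cluster's random
    // port, the job is picking up a default conf, which would match the
    // symptom that a command-line HBase on 2181 suddenly "works"
    println(conf.get(HConstants.ZOOKEEPER_QUORUM))
    println(conf.get(HConstants.ZOOKEEPER_CLIENT_PORT))

    val rdd = sparkContext.newAPIHadoopRDD(
      conf, classOf[MyInputFormat], clazzK, clazzV)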
Re: Spark Cannot Connect to HBaseClusterSingleton
Where can I find the code for MapReduceTestUtils.testSparkWordCount?

On Wed, Aug 26, 2015 at 9:29 AM, Furkan KAMACI <furkankam...@gmail.com> wrote:
> Hi,
> Here is the test method I've ignored due to the Connection Refused failure: https://github.com/kamaci/gora/blob/master/gora-hbase/src/test/java/org/apache/gora/hbase/mapreduce/TestHBaseStoreWordCount.java#L65
> I've implemented a Spark backend for Apache Gora as a GSoC project, and this is the latest obstacle I have to solve. If you can help me, you are welcome.
> Kind Regards, Furkan KAMACI
Re: Re: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"
Cool, it seems like the designs are very close. Here is my latest blog post on my work with HBase and Spark; let me know if you have any questions. There should be two more posts next month, covering bulk load through Spark (HBASE-14150, which is committed) and Spark SQL (HBASE-14181, which should be done next week).
http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

On Wed, Aug 12, 2015 at 12:18 AM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
> We are using MR-based bulk loading on Spark. For filter pushdown, Astro does partition pruning and scan range pruning, and uses Gets as much as possible.
> Thanks,
>
> From: Ted Malaska [mailto:ted.mala...@cloudera.com]
> Sent: August 12, 2015 9:14
> To: Yan Zhou.sc
> Cc: dev@spark.apache.org; Bing Xiao (Bing); Ted Yu; user
> Subject: RE: Re: Re: Package Release Announcement: Spark SQL on HBase Astro
>
>> There are a number of ways to bulk load: bulk put, partition bulk put, MR bulk load, and now HBASE-14150, which is Spark-shuffle bulk load. Let me know if I have missed a bulk loading option; all of these are possible with the new hbase-spark module. As for the filter pushdown discussion in the earlier email: you will note in 14181 that the filter pushdown will also limit the scan range, or drop the scan altogether in favor of Gets.
>> Ted Malaska
>>
>> On Aug 11, 2015 9:06 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
>>> No, the Astro bulk loader does not use its own shuffle, but its map/reduce-side processing is somewhat different from HBase's bulk loader that many HBase apps use, I believe.
>>>
>>> From: Ted Malaska / Sent: August 12, 2015 8:56:
>>>> The bulk load code is HBASE-14150, if you are interested. Let me know how it can be made faster. It's just a Spark shuffle and writing HFiles; unless Astro wrote its own shuffle, the times should be very close.
>>>>
>>>> On Aug 11, 2015 8:49 PM, Yan Zhou.sc wrote:
>>>>> Ted,
>>>>> Thanks for pointing out more details of HBASE-14181. I am afraid I may still need to learn more before I can make very accurate and pointed comments.
>>>>> As for filter pushdown, Astro has a powerful approach: it breaks down arbitrarily complex logic expressions comprising AND/OR/IN/NOT to generate partition-specific predicates that are pushed down to HBase. This may not be a significant performance improvement if the filter logic is simple and/or the processing is IO-bound, but it could be for online ad-hoc analysis.
>>>>> For UDFs, Astro supports them both inside and outside the HBase custom filter. For secondary indexes, Astro does not support them now; with probable support by HBase in the future (thanks to Ted Yu's comments a while ago), we could add that support along with its specific optimizations. For bulk load, Astro has a much faster way to load tabular data, we believe. Right now, Astro's filter pushdown is through HBase built-in filters and a custom filter.
>>>>> As for HBASE-14181, I see some overlap with Astro: both depend on Spark SQL, both support the Spark DataFrame as an access interface, and both support predicate pushdown. Astro is not designed for MR (or Spark's equivalent) access, though. If HBASE-14181 is shooting for access to HBase data through a subset of DataFrame functionality like filter, projection, and other map-side ops, would it be feasible to decouple it from Spark? My understanding is that 14181 does not run the Spark execution engine at all, but will make use of Spark DataFrame semantics and/or logical planning to pass a logical (sub-)plan to HBase. If true, it might be desirable to directly support DataFrame in HBase.
>>>>> Thanks,
>>>>>
>>>>> From: Ted Malaska / Sent: August 12, 2015 7:28:
>>>>>> Hey Yan,
>>>>>> I've been the one building out this Spark functionality in HBase, so maybe I can help clarify. The hbase-spark module is just focused on making Spark integration with HBase easy and out of the box, for both Spark and Spark Streaming. I, and I believe the HBase team, have no desire to build a SQL engine in HBase; this JIRA comes the closest to that line. The main thing here is filter pushdown logic for basic SQL operations like =, <, and >. User-defined functions and secondary indexes are not in my scope. Another main goal of the hbase-spark module is to allow a user to do anything they did with MR/HBase now with Spark/HBase, things like bulk load.
>>>>>> Let me know if you have any questions.
>>>>>> Ted Malaska
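To make the 14181-style pushdown concrete: the idea is that a DataFrame filter on the row key translates into scan-range pruning or Gets. A sketch along the lines of the examples published for the hbase-spark module; the table name and column mapping are invented, and the option keys may differ by version, so take it as an assumption-laden illustration:

    // Register an HBase-backed DataFrame via the hbase-spark DefaultSource
    val df = sqlContext.read
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", "t1")
      .option("hbase.columns.mapping",
        "KEY_FIELD STRING :key, A_FIELD STRING c:a, B_FIELD STRING c:b")
      .load()

    // An equality predicate on the key column can become a Get; a range
    // predicate narrows the scan instead of scanning the whole table
    df.filter("KEY_FIELD = 'row005'").show()
    df.filter("KEY_FIELD >= 'row005' AND KEY_FIELD <= 'row010'").show()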
RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"
The bulk load code is HBASE-14150, if you are interested. Let me know how it can be made faster. It's just a Spark shuffle and writing HFiles; unless Astro wrote its own shuffle, the times should be very close.

On Aug 11, 2015 7:13 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
> We have not "formally" published any numbers yet. A good reference is a slide deck we posted for the meetup in March, or better yet, interested parties can run performance comparisons by themselves for now. As for the status quo of Astro, we have been focusing on fixing bugs (a UDF-related bug in some coprocessor/custom filter combos) and adding support for querying string columns in HBase as integers from Astro.
> Thanks,
>
> From: Ted Yu [mailto:yuzhih...@gmail.com]
> Sent: August 12, 2015 7:02
> To: Yan Zhou.sc
> Cc: Bing Xiao (Bing); dev@spark.apache.org; u...@spark.apache.org
> Subject: Re: Re: Re: Package Release Announcement: Spark SQL on HBase Astro
>
>> Yan: Where can I find performance numbers for Astro (it's close to the middle of August)?
>> Cheers
>>
>> On Tue, Aug 11, 2015 at 3:58 PM, Yan Zhou.sc wrote:
>>> Finally I can take a look at HBASE-14181 now. Unfortunately there is no design doc mentioned. Superficially it is very similar to Astro, with the difference that it is part of the HBase client library, while Astro works as a Spark package and so will evolve and function more closely with Spark SQL/DataFrame instead of HBase. In terms of architecture, my take is loosely-coupled query engines on top of a KV store vs. an array of query engines supported by, and packaged as part of, a KV store. Functionality-wise the two could be close, but Astro also supports Python as a result of its tight integration with Spark. It will be interesting to see performance comparisons when HBASE-14181 is ready.
>>> Thanks,
RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"
There are a number of ways to bulk load: bulk put, partition bulk put, MR bulk load, and now HBASE-14150, which is Spark-shuffle bulk load. Let me know if I have missed a bulk loading option; all of these are possible with the new hbase-spark module. As for the filter pushdown discussion in the earlier email: you will note in 14181 that the filter pushdown will also limit the scan range, or drop the scan altogether in favor of Gets.
Ted Malaska

On Aug 11, 2015 9:06 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
> No, the Astro bulk loader does not use its own shuffle, but its map/reduce-side processing is somewhat different from HBase's bulk loader that many HBase apps use, I believe.
RE: Re: Re: Package Release Announcement: Spark SQL on HBase "Astro"
Hey Yan,
I've been the one building out this Spark functionality in HBase, so maybe I can help clarify. The hbase-spark module is just focused on making Spark integration with HBase easy and out of the box, for both Spark and Spark Streaming. I, and I believe the HBase team, have no desire to build a SQL engine in HBase; this JIRA comes the closest to that line. The main thing here is filter pushdown logic for basic SQL operations like =, <, and >. User-defined functions and secondary indexes are not in my scope. Another main goal of the hbase-spark module is to allow a user to do anything they did with MR/HBase now with Spark/HBase, things like bulk load.
Let me know if you have any questions.
Ted Malaska

On Tue, Aug 11, 2015 at 3:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> HBase will not have a query engine. It will provide better support to query engines.
> Cheers
>
> On Aug 10, 2015, at 11:11 PM, Yan Zhou.sc <yan.zhou...@huawei.com> wrote:
>> Ted,
>> I'm in China now and seem to be having difficulty accessing the Apache JIRA. Anyway, it appears to me that HBASE-14181 (https://issues.apache.org/jira/browse/HBASE-14181) attempts to support Spark DataFrame inside HBase. If true, one question to me is whether HBase is intended to have a built-in query engine or not, or whether it will stick with the current approach: a KV store with some built-in processing capabilities in the form of coprocessors, custom filters, etc., which allows for loosely-coupled query engines to be built on top of it.
>> Thanks,
>>
>> From: Ted Yu [mailto:yuzhih...@gmail.com]
>> Sent: August 11, 2015 8:54
>> To: Bing Xiao (Bing)
>> Cc: dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
>> Subject: Re: Package Release Announcement: Spark SQL on HBase Astro
>>
>>> Yan / Bing:
>>> Mind taking a look at HBASE-14181 (https://issues.apache.org/jira/browse/HBASE-14181), 'Add Spark DataFrame DataSource to HBase-Spark Module'?
>>> Thanks
>>>
>>> On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) <bing.x...@huawei.com> wrote:
>>>> We are happy to announce the availability of the Spark SQL on HBase 1.0.0 release: http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
>>>> The main features in this package, dubbed "Astro", include:
>>>> - Systematic and powerful handling of data pruning and intelligent scan, based on a partial evaluation technique
>>>> - HBase pushdown capabilities like custom filters and coprocessors to support ultra-low-latency processing
>>>> - SQL and DataFrame support
>>>> - More SQL capabilities made possible (secondary index, bloom filter, primary key, bulk load, update)
>>>> - Joins with data from other sources
>>>> - Python/Java/Scala support
>>>> - Support for the latest Spark 1.4.0 release
>>>> The tests by the Huawei team and community contributors covered: bulk load; projection pruning; partition pruning; partial evaluation; code generation; coprocessors; custom filtering; DML; complex filtering on keys and non-keys; join/union with non-HBase data; DataFrame; multi-column-family tests. We will post the test results including ...
Re: Generalised Spark-HBase integration
Thanks Michal,

Just to share what I'm working on in a related topic: a long time ago I built SparkOnHBase and put it into Cloudera Labs: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

More recently I have been working on getting this into HBase core; it will hopefully land there within the next couple of weeks: https://issues.apache.org/jira/browse/HBASE-13992

Then I'm planning on adding DataFrame and bulk load support through https://issues.apache.org/jira/browse/HBASE-14149 and https://issues.apache.org/jira/browse/HBASE-14150

Also, if you are interested, this is running today at at least half a dozen companies with Spark Streaming. Here is one blog post about a successful implementation: http://blog.cloudera.com/blog/2015/03/how-edmunds-com-used-spark-streaming-to-build-a-near-real-time-dashboard/

And here is an additional example blog post I put together: http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

Let me know if you have any questions, and also let me know if you want to connect to join efforts.

Ted Malaska

On Tue, Jul 28, 2015 at 11:59 AM, Michal Haris <michal.ha...@visualdna.com> wrote:
> Hi all, for the last couple of months I've been working on large-scale graph analytics, and along the way I have written an HBase-Spark integration from scratch, as none of the ones out there worked either in terms of scale or in the way they integrated with the RDD interface. This week I have generalised it into an (almost) Spark module, which works with the latest Spark and the new HBase API, so... sharing!: https://github.com/michal-harish/spark-on-hbase
>
> --
> Michal Haris, Technical Architect
> direct line: +44 (0) 207 749 0229
> www.visualdna.com | t: +44 (0) 207 734 7033
> 31 Old Nichol Street, London E2 7HR
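For context on what the SparkOnHBase/HBaseContext API looks like in practice, here is a bulkPut sketch against the hbase-spark flavor of the API. The table name, family, and qualifier are placeholders, and the cloudera-labs version differs slightly (a String table name and an autoFlush flag), so treat this as a sketch under those assumptions:

    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

    val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
    val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))

    // One HBase connection per partition; each record becomes a Put
    hbaseContext.bulkPut[(String, String)](
      rdd,
      TableName.valueOf("t1"),
      t => new Put(Bytes.toBytes(t._1))
        .addColumn(Bytes.toBytes("c"), Bytes.toBytes("a"), Bytes.toBytes(t._2)))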
Re: Generalised Spark-HBase integration
Yup, you should be able to do that with the APIs that are going into HBase. Let me know if you need to chat about the problem and how to implement it with the HBase APIs. We have tried to cover every possible way to use HBase with Spark; let us know if we missed anything, and if we did, we will add it.

On Tue, Jul 28, 2015 at 12:12 PM, Michal Haris <michal.ha...@visualdna.com> wrote:
> Hi Ted, yes, the Cloudera blog and your code were my starting point - but I needed something more Spark-centric rather than HBase-centric. Basically I'm doing a lot of ad-hoc transformations with RDDs that are based on HBase tables, and then mutating them after a series of iterative (BSP-like) steps.
Re: Generalised Spark-HBase integration
Stuff that people are using is here: https://github.com/cloudera-labs/SparkOnHBase
The stuff going into HBase is here: https://issues.apache.org/jira/browse/HBASE-13992
If you want to add things to the HBase ticket, let's do it in another JIRA, like these:
https://issues.apache.org/jira/browse/HBASE-14149
https://issues.apache.org/jira/browse/HBASE-14150
The first JIRA is mainly about getting the Spark dependencies and a separate module set up, so we can then file additional JIRAs for additional functionality. The goal is to have the following in HBase by the end of summer:

RDD and DStream functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame functions
1. BulkPut
2. BulkGet
6. Distributed Scan
7. BulkLoad

If you think there should be more, let me know.
Ted Malaska

On Tue, Jul 28, 2015 at 12:17 PM, Michal Haris <michal.ha...@visualdna.com> wrote:
> Cool, will revisit. Is your latest code visible publicly somewhere?
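Of the functions listed above, Distributed Scan is the one that replaces newAPIHadoopRDD-style access. A minimal sketch using the method the module exposes as hbaseRDD; the table name is a placeholder and hbaseContext is assumed to be an already-constructed HBaseContext:

    import org.apache.hadoop.hbase.TableName
    import org.apache.hadoop.hbase.client.Scan
    import org.apache.hadoop.hbase.util.Bytes

    val scan = new Scan()
    scan.setCaching(100) // rows fetched per RPC

    // Yields an RDD of (row key, Result) pairs, partitioned by region
    val scanRdd = hbaseContext.hbaseRDD(TableName.valueOf("t1"), scan)
    scanRdd.map { case (key, result) => Bytes.toString(key.get()) }
      .take(5)
      .foreach(println)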
Re: Generalised Spark-HBase integration
Sorry, this is more correct:

RDD and DStream functions
1. BulkPut
2. BulkGet
3. BulkDelete
4. Foreach with connection
5. Map with connection
6. Distributed Scan
7. BulkLoad

DataFrame functions
1. BulkPut
2. BulkGet
3. Foreach with connection
4. Map with connection
5. Distributed Scan
6. BulkLoad

On Tue, Jul 28, 2015 at 12:23 PM, Ted Malaska <ted.mala...@cloudera.com> wrote:
> Stuff that people are using is here: https://github.com/cloudera-labs/SparkOnHBase
> The stuff going into HBase is here: https://issues.apache.org/jira/browse/HBASE-13992
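And "Foreach with connection" from the corrected list above looks roughly like this: the function receives an open Connection that is set up and torn down once per partition, so per-record lookups don't each pay connection cost. The table name and the lookup it performs are invented for illustration:

    import org.apache.hadoop.hbase.TableName
    import org.apache.hadoop.hbase.client.{Connection, Get}
    import org.apache.hadoop.hbase.util.Bytes

    hbaseContext.foreachPartition(rdd,
      (it: Iterator[(String, String)], conn: Connection) => {
        val table = conn.getTable(TableName.valueOf("t1"))
        it.foreach { rec =>
          // e.g. read the current row before deciding what to write
          val result = table.get(new Get(Bytes.toBytes(rec._1)))
          println(result.isEmpty)
        }
        table.close()
      })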
Re: countByValue on dataframe with multiple columns
100%, I would love to do it. Who's a good person to review the design with? All I need is a quick chat about the design and approach, and I'll create the JIRA and push a patch.
Ted Malaska

On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
> Hi Ted,
> The TopNList would be great to see directly in the DataFrame API, and my wish would be to be able to apply it on multiple columns at the same time and get all these statistics. The .describe() function is close to what we want to achieve; maybe we could try to enrich its output. Anyway, even as a Spark package, if you could package your code for DataFrames, that would be great.
> Regards, Olivier
>
> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>> Ha, ok! Then the generic part would have this signature:
>>   def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>> +1 for more work (blog / API) on data quality checks.
>> Cheers, Jonathan
>> TopCMSParams and some other monoids from Algebird are really cool for that: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>
>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:
>>> I'm guessing you want something like what I put in this blog post: http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>> This is a very common use case. If there is a +1, I would love to add it to DataFrames.
>>> Let me know
>>> Ted Malaska
>>>
>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot wrote:
>>>> Yop, actually the generic part does not work: countByValue on one column gives you the count for each value seen in that column. I would like a generic (multi-column) countByValue to give me the same kind of output for each column, not considering each n-tuple of column values as the key (which is what the groupBy does by default).
>>>> Regards, Olivier
>>>>
>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy:
>>>>> Ahoy!
>>>>> Maybe you can get countByValue by using sql.GroupedData:
>>>>>   // some DF
>>>>>   val df: DataFrame = sqlContext.createDataFrame(
>>>>>     sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>>>>     StructType(List(StructField("n", StringType))))
>>>>>   df.groupBy("n").count().show()
>>>>>   // generic
>>>>>   def countByValueDf(df: DataFrame) = {
>>>>>     val (h :: r) = df.columns.toList
>>>>>     df.groupBy(h, r: _*).count()
>>>>>   }
>>>>>   countByValueDf(df).show()
>>>>> Cheers, Jon
>>>>>
>>>>> On 20 July 2015 at 11:28, Olivier Girardot wrote:
>>>>>> Hi,
>>>>>> Is there any plan to add a countByValue function to the Spark SQL DataFrame API? Even https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78 is using the RDD part right now, but for ML purposes, being able to get the most frequent categorical value on multiple columns would be very useful.
>>>>>> Regards, Olivier Girardot
Re: countByValue on dataframe with multiple columns
I added the following JIRA: https://issues.apache.org/jira/browse/SPARK-9237
Please help me get it assigned to myself. Thanks,
Ted Malaska

On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska <ted.mala...@cloudera.com> wrote:
> Cool, I will make a JIRA after I check in to my hotel, and try to get a patch out early next week.
>
> On Jul 21, 2015 5:15 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>> Yes, and freqItems does not give you an ordered count (right?), plus the threshold makes it difficult to calibrate, and we noticed some strange behaviour when testing it on small datasets.
>>
>> 2015-07-21 20:30 GMT+02:00 Ted Malaska:
>>> Look at the implementation for frequent items. It is different from a true count.
>>>
>>> On Jul 21, 2015 1:19 PM, Reynold Xin <r...@databricks.com> wrote:
>>>> Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
Re: countByValue on dataframe with multiple columns
Look at the implementation for frequent items. It is different from a true count.

On Jul 21, 2015 1:19 PM, Reynold Xin <r...@databricks.com> wrote:
> Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
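The distinction is easy to see side by side: freqItems is an approximate single-pass sketch governed by a support threshold, while an exact ordered count is a groupBy plus sort. A small sketch on the one-column DataFrame df defined earlier in the thread:

    import org.apache.spark.sql.functions.desc

    // Approximate: values whose frequency may exceed the support threshold,
    // returned as a set per column, with no counts and no ordering
    df.stat.freqItems(Array("n"), 0.3).show()

    // Exact: every distinct value with its count, ordered by frequency
    df.groupBy("n").count().orderBy(desc("count")).show()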
Re: countByValue on dataframe with multiple columns
Cool, I will make a JIRA after I check in to my hotel, and try to get a patch early next week. On Jul 21, 2015 5:15 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Yes, and freqItems does not give you an ordered count (right?) + the threshold makes it difficult to calibrate + we noticed some strange behaviour when testing it on small datasets. 2015-07-21 20:30 GMT+02:00 Ted Malaska ted.mala...@cloudera.com: Look at the implementation for frequent items. It is different from a true count. -- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94
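For context, a minimal sketch contrasting the two approaches being compared above (Spark 1.4-era API; the DataFrame df and the column name "category" are hypothetical placeholders):

import org.apache.spark.sql.functions.desc

// Approximate: freqItems returns items whose frequency exceeds the support
// threshold, in no particular order and without the counts themselves.
val frequent = df.stat.freqItems(Seq("category"), 0.01)
frequent.show()

// Exact: a full, ordered value count for the same column.
df.groupBy("category").count().orderBy(desc("count")).show()

This illustrates Olivier's point: freqItems is a single cheap approximate pass, while the exact ordered count requires a full shuffle per column.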
Re: countByValue on dataframe with multiple columns
I'm guessing you want something like what I put in this blog post: http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ This is a very common use case. If there is a +1 I would love to add it to dataframes. Let me know. Ted Malaska On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Yop, actually the generic part does not work: countByValue on one column gives you the count for each value seen in that column. I would like a generic (multi-column) countByValue to give me the same kind of output for each column, not considering each n-tuple of column values as the key (which is what groupBy does by default). Regards, Olivier -- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94
Re: Welcoming some new committers
Super congrats. Well earned. On Jun 20, 2015 12:48 PM, Andrew Or and...@databricks.com wrote: Welcome! 2015-06-20 7:30 GMT-07:00 Debasish Das debasish.da...@gmail.com: Congratulations to all. DB, great work in bringing quasi-Newton methods to Spark! On Wed, Jun 17, 2015 at 3:18 PM, Chester Chen ches...@alpinenow.com wrote: Congratulations to all. DB and Sandy, great work! On Wed, Jun 17, 2015 at 3:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey all, Over the past 1.5 months we added a number of new committers to the project, and I wanted to welcome them now that all of their respective forms, accounts, etc. are in. Join me in welcoming the following new committers: - Davies Liu - DB Tsai - Kousuke Saruta - Sandy Ryza - Yin Huai Looking forward to more great contributions from all of these folks. Matei
Re: Spark-Submit issues
Hey, this is Ted. Are you using the Shade plugin when you build your jar, and are you using the bigger (uber) jar? It looks like classes are not included in your jar. On Wed, Nov 12, 2014 at 2:09 AM, Jeniba Johnson jeniba.john...@lntinfotech.com wrote: Hi Hari, Now I am trying out the same FlumeEventCount example running with spark-submit instead of run-example. The steps I followed: I exported JavaFlumeEventCount.java into a jar. The command used is

./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop1.0.4.jar --master local --class org.JavaFlumeEventCount bin/flumeeventcnt2.jar localhost 2323

The output is

14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1 blocks
14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time 1415795102000

If I use this command

./bin/spark-submit --master local --class org.JavaFlumeEventCount bin/flumeeventcnt2.jar localhost 2323

then I get an error:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/examples/streaming/StreamingExamples
at org.JavaFlumeEventCount.main(JavaFlumeEventCount.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.examples.streaming.StreamingExamples
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
... 8 more

I just wanted to ask: it is able to find spark-assembly.jar, but why not spark-examples.jar? My next doubt: while running the FlumeEventCount example through run-example, I get output such as

Received 4 flume events.
14/11/12 18:30:14 INFO scheduler.JobScheduler: Finished job streaming job 1415797214000 ms.0 from job set of time 1415797214000 ms
14/11/12 18:30:14 INFO rdd.MappedRDD: Removing RDD 70 from persistence list

but if I run the same program through spark-submit, I get output such as

14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1 blocks
14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time 1415795102000

So I need a clarification: since the print statement in the program is written as Received n flume events., how come I am able to see Stream 0 received n blocks? And what is the difference between running the program through spark-submit and run-example? Awaiting your kind reply. Regards, Jeniba Johnson
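For anyone hitting the same NoClassDefFoundError: a minimal sketch of the fat-jar approach Ted suggests, here using sbt-assembly rather than Maven's Shade plugin (project name, versions, and layout are hypothetical placeholders; Maven shade achieves the same thing):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt -- mark Spark itself "provided" so it is not bundled, and let
// everything the application actually needs be packed into one jar
name := "flume-event-count"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-flume" % "1.1.0"
)

Running sbt assembly then produces a single jar that can be passed straight to spark-submit without relying on --jars to pull in example or dependency jars at launch time.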
Re: Spark-Submit issues
Otherwise, include them at the time of execution. Here is an example:

spark-submit \
  --jars /opt/cloudera/parcels/CDH/lib/zookeeper/zookeeper-3.4.5-cdh5.1.0.jar,/opt/cloudera/parcels/CDH/lib/hbase/lib/guava-12.0.1.jar,/opt/cloudera/parcels/CDH/lib/hbase/lib/protobuf-java-2.5.0.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop-compat.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar,/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar \
  --class org.apache.spark.hbase.example.HBaseBulkDeleteExample \
  --master yarn --deploy-mode client \
  --executor-memory 512M --num-executors 4 \
  --driver-java-options -Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/lib/* \
  SparkHBase.jar t1 c

On Wed, Nov 12, 2014 at 4:25 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Yep, you'd need to shade jars to ensure all your dependencies are in the classpath. Thanks, Hari On Wed, Nov 12, 2014 at 3:23 AM, Ted Malaska ted.mala...@cloudera.com wrote: Hey, this is Ted. Are you using the Shade plugin when you build your jar, and are you using the bigger (uber) jar? It looks like classes are not included in your jar.
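As a side note, spark-submit also accepts Spark configuration properties directly via --conf, which is a more direct way to set spark.executor.extraClassPath than passing it as a driver JVM option. A sketch, reusing the paths from the command above purely as placeholders:

spark-submit \
  --conf spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hbase/lib/* \
  --class org.apache.spark.hbase.example.HBaseBulkDeleteExample \
  --master yarn --deploy-mode client \
  SparkHBase.jar t1 c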
Re: Compile error when compiling for cloudera
Don't make this change yet. I have a 1642 that needs to get through around the same code. I can make this change after 1642 is through. On Thu, Jul 17, 2014 at 12:25 PM, Sean Owen so...@cloudera.com wrote: CC tmalaska since he touched the line in question. This is a fun one. So, here's the line of code added last week:

val channelFactory = new NioServerSocketChannelFactory
  (Executors.newCachedThreadPool(), Executors.newCachedThreadPool());

Scala parses this as two statements: one invoking a no-arg constructor, and one making a tuple for fun. Put it on one line and it's fine. It works with newer Netty, since there is a no-arg constructor. It fails with older Netty, which is what you get with older Hadoop. The fix is obvious. I'm away, and if nobody beats me to a PR in the meantime, I'll propose one as an addendum to the recent JIRA. Sean On Thu, Jul 17, 2014 at 3:58 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: My full build command is:

./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly

I've changed one line in RDD.scala, nothing else. On Thu, Jul 17, 2014 at 10:56 AM, Sean Owen so...@cloudera.com wrote: This looks like a Jetty version problem, actually. Are you bringing in something that might be changing the version of Jetty used by Spark? It depends a lot on how you are building things. Good to specify exactly how you're building here. On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: I'm trying to compile the latest code, with the hadoop version set to 2.0.0-mr1-cdh4.6.0. I'm getting the following error, which I don't get when I don't set the hadoop version:

[error] /data/hdfs/1/home/nkronenfeld/git/spark-ndk/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeInputDStream.scala:156: overloaded method constructor NioServerSocketChannelFactory with alternatives:
[error] (x$1: java.util.concurrent.Executor,x$2: java.util.concurrent.Executor,x$3: Int)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory and
[error] (x$1: java.util.concurrent.Executor,x$2: java.util.concurrent.Executor)org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory
[error] cannot be applied to ()
[error] val channelFactory = new NioServerSocketChannelFactory
[error]                          ^
[error] one error found

I don't know flume from a hole in the wall - does anyone know what I can do to fix this? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com
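A minimal sketch of the parse issue Sean describes, mirroring (not quoting) the FlumeInputDStream code, in case the semicolon-inference behaviour is unfamiliar:

import java.util.concurrent.Executors
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory

// Problem: with the argument list on its own line, Scala's semicolon
// inference reads this as TWO statements -- a no-arg
// `new NioServerSocketChannelFactory`, then a standalone tuple expression.
// It only compiles against newer Netty, which happens to have a
// no-arg constructor; older Netty (pulled in by older Hadoop) does not.
val broken = new NioServerSocketChannelFactory
(Executors.newCachedThreadPool(), Executors.newCachedThreadPool())

// Fix: open the parenthesis on the same line so it parses as a single
// two-argument constructor call against any Netty 3.x.
val fixed = new NioServerSocketChannelFactory(
  Executors.newCachedThreadPool(), Executors.newCachedThreadPool())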