On Mon, Oct 10, 2016 at 4:11 PM, Dan Burkert <[email protected]> wrote:
> Hi Ben,
>
> SparkSQL relies on Hive for DDL statements, so having support for this requires adding support to Hive for manipulating Kudu tables. This is something that we would like to do in the long term, but there are no concrete plans (that I know of) to make it happen in the near term.

To be fair, there's https://issues.apache.org/jira/browse/HIVE-12971 with a link to https://github.com/BimalTandel/HiveKudu-Handler, which I think Bimal said he was going to update soon. But we're still far, I think, from any Kudu support in a released version of Hive.

> - Dan

On Thu, Oct 6, 2016 at 4:38 PM, Benjamin Kim <[email protected]> wrote:

Anyone know if the Spark package will ever allow for creating tables in Spark SQL? Such as:

CREATE EXTERNAL TABLE <table-name>
USING org.apache.kudu.spark.kudu
OPTIONS (Map("kudu.master" -> "<kudu-master>", "kudu.table" -> "<table-name>"));

In this way, plain SQL can be used for DDL and DML statements, whether in Spark SQL code or over JDBC to the Spark SQL Thriftserver.

By the way, we are trying to build a DMP in Kudu with a farm of RESTful endpoints to do cookie sync, ad serving, and segmentation data exchange. And the Spark compute cluster and the Kudu cluster will reside on the same racks in the same datacenter.

Thanks,
Ben

On Sep 20, 2016, at 3:02 PM, Jordan Birdsell <[email protected]> wrote:

http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark

On Tue, Sep 20, 2016 at 5:00 PM, Benjamin Kim <[email protected]> wrote:

I see that the API has changed a bit, so my old code doesn't work anymore. Can someone direct me to some code samples?

Thanks,
Ben

On Sep 20, 2016, at 1:44 PM, Todd Lipcon <[email protected]> wrote:

On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim <[email protected]> wrote:

> Now that Kudu 1.0.0 is officially out and ready for production use, where do we find the spark connector jar for this release?

It's available in the official ASF maven repository: https://repository.apache.org/#nexus-search;quick~kudu-spark

<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-spark_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

-Todd

On Jun 17, 2016, at 11:08 AM, Dan Burkert <[email protected]> wrote:

Hi Ben,

To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do not think we support that at this point. I haven't looked deeply into it, but we may hit issues specifying Kudu-specific options (partitioning, column encoding, etc.). Those are probably issues that can be worked through eventually, though. If you are interested in contributing to Kudu, this is an area that could obviously use improvement! Most or all of our Spark features have been completely community driven to date.

> I am assuming that more Spark support along with the semantic changes below will be incorporated into Kudu 0.9.1.

As a rule we do not release new features in patch releases, but the good news is that we are releasing regularly, and our next scheduled release is for the August timeframe (see JD's roadmap email <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E> about what we are aiming to include).
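Worth noting alongside Dan's answer: even without CREATE TABLE support, Spark SQL can query a Kudu table by registering the DataFrame as a temporary table. A minimal, untested sketch against the 1.0.0 connector (Spark 1.6-era API; the master address and table name are illustrative):

import org.apache.kudu.spark.kudu._

// Load the Kudu table through the data source API...
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu-master.example.com:7051",
               "kudu.table" -> "my_table"))
  .kudu

// ...and register it so plain SQL works against it.
df.registerTempTable("my_table")
sqlContext.sql("SELECT count(*) FROM my_table").show()

This only covers the read side; DDL still has to go through the Kudu APIs, as discussed above.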
Also, Cloudera does publish snapshot versions of the Spark connector here <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the jars are available if you don't mind using snapshots.

> Anyone know of a better way to make unique primary keys other than using UUIDs to make every row unique, if there is no unique column (or combination thereof) to use?

Not that I know of. In general it's pretty rare to have a dataset without a natural primary key (even if it's just all of the columns), but in those cases UUID is a good solution.

> This is what I am using. I know auto incrementing is coming down the line (don't know when), but is there a way to simulate this in Kudu using Spark, out of curiosity?

To my knowledge there is no plan to have auto increment in Kudu. Distributed, consistent, auto incrementing counters are a difficult problem, and I don't think there are any known solutions that would be fast enough for Kudu (happy to be proven wrong, though!).

- Dan

On Jun 14, 2016, at 6:08 PM, Dan Burkert <[email protected]> wrote:

I'm not sure exactly what the semantics will be, but at least one of them will be upsert. These modes come from Spark, and they were really designed for file-backed storage, not table storage. We may want to do append = upsert and overwrite = truncate + insert. I think that may match the normal Spark semantics more closely.

- Dan

On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <[email protected]> wrote:

Dan,

Thanks for the information. That would mean both "append" and "overwrite" modes would be combined or not needed in the future.

Cheers,
Ben

On Jun 14, 2016, at 5:57 PM, Dan Burkert <[email protected]> wrote:

Right now append uses an update Kudu operation, which requires the row to already be present in the table. Overwrite maps to insert. Kudu very recently got upsert support baked in, but it hasn't yet been integrated into the Spark connector. So pretty soon these sharp edges will get a lot better, since upsert is the way to go for most Spark workloads.

- Dan

On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <[email protected]> wrote:

I tried to use the "append" mode, and it worked. Over 3.8 million rows in 64s. I would assume that now I can use the "overwrite" mode on existing data. Now, I have to find answers to these questions. What would happen if I "append" to the data in the Kudu table and the data already exists? What would happen if I "overwrite" existing data when the DataFrame has data in it that does not exist in the Kudu table? I need to evaluate the best way to simulate the UPSERT behavior in HBase, because this is what our use case is.

Thanks,
Ben
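For the record, the upsert integration Dan mentions did land around the 1.0.0 connector: KuduContext grew explicit row operations, which covers the HBase-style UPSERT Ben is trying to simulate. A short sketch (table name illustrative):

// Upsert: inserts new keys and updates existing ones, so neither
// "key not found" nor "key already present" errors apply.
kuduContext.upsertRows(df, "my_table")

// insertRows / updateRows / deleteRows are the strict variants.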
On Jun 14, 2016, at 5:05 PM, Benjamin Kim <[email protected]> wrote:

Hi,

Now, I'm getting this error when trying to write to the table:

import scala.collection.JavaConverters._
val key_seq = Seq("my_id")
val key_list = List("my_id").asJava
kuduContext.createTable(tableName, df.schema, key_seq,
  new CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))

df.write
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
  .mode("overwrite")
  .kudu

java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu; sample errors: Not found: key not found (error 0)Not found: key not found (error 0)Not found: key not found (error 0)Not found: key not found (error 0)Not found: key not found (error 0)

Does the key field need to be first in the DataFrame?

Thanks,
Ben

On Jun 14, 2016, at 4:28 PM, Dan Burkert <[email protected]> wrote:

On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <[email protected]> wrote:

> Dan,
>
> Thanks! It got further. Now, how do I set the primary key to be a column (or columns) in the DataFrame and set the partitioning? Is it like this?
>
> kuduContext.createTable(tableName, df.schema, Seq("my_id"),
>   new CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>
> java.lang.IllegalArgumentException: Table partitioning must be specified using setRangePartitionColumns or addHashPartitions

Yep. The `Seq("my_id")` part of that call specifies the set of primary key columns, so in this case you have specified the single PK column "my_id". The `addHashPartitions` call adds hash partitioning to the table, in this case over the column "my_id" (which is good: it must be over one or more PK columns, so here "my_id" is the one and only valid combination). However, `addHashPartitions` also takes the number of buckets as the second param. You shouldn't get the IllegalArgumentException as long as you are specifying either `addHashPartitions` or `setRangePartitionColumns`.

- Dan

On Jun 14, 2016, at 4:07 PM, Dan Burkert <[email protected]> wrote:

Looks like we're missing an import statement in that example. Could you try:

import org.kududb.client._

and try again?

- Dan

On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <[email protected]> wrote:

I encountered an error trying to create a table from a DataFrame, based on the documentation:

<console>:49: error: not found: type CreateTableOptions
       kuduContext.createTable(tableName, df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))

Is there something I'm missing?

Thanks,
Ben
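Pulling the fixes from this exchange together, a create-and-write sketch that should line up with the 0.9-era connector (untested; the names, replica count, and bucket count are illustrative):

import scala.collection.JavaConverters._
import org.kududb.client._  // CreateTableOptions lives here pre-rename

val keyCols = Seq("my_id")  // primary key columns, in order

// Hash-partition on the PK column; the second argument is the bucket count.
val options = new CreateTableOptions()
  .setNumReplicas(1)
  .addHashPartitions(List("my_id").asJava, 4)

kuduContext.createTable(tableName, df.schema, keyCols, options)

// Which of "append"/"overwrite" maps to insert vs. update was still in
// flux at this point in the thread, so treat the mode below as illustrative.
df.write
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
  .mode("append")
  .kudu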
On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <[email protected]> wrote:

It's only in Cloudera's maven repo: https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/

J-D

On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <[email protected]> wrote:

Hi J-D,

I installed Kudu 0.9.0 using CM, but I can't find the kudu-spark jar for spark-shell to use. Can you show me where to find it?

Thanks,
Ben

On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <[email protected]> wrote:

What's in this doc is what's gonna get released: https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark

J-D

On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <[email protected]> wrote:

Will this be documented with examples once 0.9.0 comes out?

Thanks,
Ben

On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <[email protected]> wrote:

It will be in 0.9.0.

J-D

On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <[email protected]> wrote:

Hi Chris,

Will all this effort be rolled into 0.9.0 and be ready for use?

Thanks,
Ben

On May 18, 2016, at 9:01 AM, Chris George <[email protected]> wrote:

There is some code in review that needs some more refinement. It will allow upsert/insert from a DataFrame using the datasource API. It will also allow the creation and deletion of tables from a DataFrame.
http://gerrit.cloudera.org:8080/#/c/2992/

Example usages will look something like:
http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc

-Chris George

On 5/18/16, 9:45 AM, "Benjamin Kim" <[email protected]> wrote:

Can someone tell me what the state is of this Spark work?

Also, does anyone have any sample code on how to update/insert data in Kudu using DataFrames?

Thanks,
Ben

On Apr 13, 2016, at 8:22 AM, Chris George <[email protected]> wrote:

SparkSQL cannot support these types of statements, but we may be able to implement similar functionality through the API.

-Chris

On 4/12/16, 5:19 PM, "Benjamin Kim" <[email protected]> wrote:

It would be nice to adhere to the SQL:2003 standard for an "upsert" if it were to be implemented:

MERGE INTO table_name USING table_reference ON (condition)
WHEN MATCHED THEN
  UPDATE SET column1 = value1 [, column2 = value2 ...]
WHEN NOT MATCHED THEN
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

Cheers,
Ben
On Apr 11, 2016, at 12:21 PM, Chris George <[email protected]> wrote:

I have a WIP kuduRDD that I made a few months ago. I pushed it into gerrit if you want to take a look: http://gerrit.cloudera.org:8080/#/c/2754/
It does push down predicates, which the existing input-format-based RDD does not.

Within the next two weeks I'm planning to implement a datasource for Spark that will have pushdown predicates and insertion/update functionality (I need to look more at the Cassandra and HBase datasources for the best way to do this). I agree that server-side upsert would be helpful. Having a datasource would give us useful data frames and also make Spark SQL usable for Kudu.

My reasoning for having a Spark datasource and not using Impala is:
1. We have had trouble getting Impala to run fast with high concurrency when compared to Spark.
2. We interact with datasources which do not integrate with Impala.
3. We have custom SQL query planners for extended SQL functionality.

-Chris George

On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <[email protected]> wrote:

You guys make a convincing point, although on the upsert side we'll need more support from the servers. Right now all you can do is an INSERT and then, if you get a dup key, do an UPDATE. I guess we could at least add an API on the client side that would manage it, but it wouldn't be atomic.

J-D

On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <[email protected]> wrote:

It's pretty simple, actually. I need to support versioned datasets in a Spark SQL environment. Instead of a hack on top of a Parquet data store, I'm hoping (among other reasons) to be able to use Kudu's write and timestamp-based read operations to support not only appending data, but also updating existing data, and even some schema migration. The most typical use case is a dataset that is updated periodically (e.g., weekly or monthly) in which the preliminary data in the previous window (week or month) is updated with values that are expected to remain unchanged from then on, and a new set of preliminary values for the current window needs to be added/appended.

Using Kudu's Java API and developing additional functionality on top of what Kudu has to offer isn't too much to ask, but the ease of integration with Spark SQL will gate how quickly we would move to using Kudu and how seriously we'd look at alternatives before making that decision.
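To make J-D's INSERT-then-UPDATE fallback concrete, a rough client-side sketch against the 0.8-era Java API (untested; the master address, table, and column names are illustrative, and as J-D notes, the two steps are not atomic):

import org.kududb.client._

val client = new KuduClient.KuduClientBuilder("kudu-master.example.com:7051").build()
val table = client.openTable("my_table")
// Default AUTO_FLUSH_SYNC mode, so apply() returns the per-row status.
val session = client.newSession()

def insertOrUpdate(id: String, value: Long): Unit = {
  val insert = table.newInsert()
  insert.getRow.addString("my_id", id)
  insert.getRow.addLong("value", value)
  if (session.apply(insert).hasRowError) {
    // Assume the failure was "row already present" and retry as an update;
    // a real implementation should inspect the row error before doing this.
    val update = table.newUpdate()
    update.getRow.addString("my_id", id)
    update.getRow.addLong("value", value)
    session.apply(update)
  }
}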
On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <[email protected]> wrote:

Mark,

Thanks for taking some time to reply in this thread, glad it caught the attention of other folks!

On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <[email protected]> wrote:

> > Do they care about being able to insert into Kudu with SparkSQL?
>
> I care about insert into Kudu with Spark SQL. I'm currently delaying a refactoring of some Spark SQL-oriented insert functionality while trying to evaluate what to expect from Kudu. Whether Kudu does a good job supporting inserts with Spark SQL will be a key consideration as to whether we adopt Kudu.

I'd like to know more about why SparkSQL inserts are necessary for you. Is it just that you currently do it that way into some database or Parquet, so with minimal refactoring you'd be able to use Kudu? Would re-writing those SQL lines into Scala and directly using the Java API's KuduSession be too much work?

Additionally, what do you expect to gain from using Kudu vs. your current solution? If it's not completely clear, I'd love to help you think through it.

On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <[email protected]> wrote:

Yup, starting to get a good idea.

What are your DS folks looking for in terms of functionality related to Spark? A SparkSQL integration that's as fully featured as Impala's? Do they care about being able to insert into Kudu with SparkSQL, or just about being able to query real fast? Anything more specific to Spark that I'm missing?

FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all our resources are committed to making things happen in time, and a more fully featured Spark integration isn't in our plans during that period. I'm really hoping someone in the community will help with Spark, the same way we got a big contribution for the Flume sink.

J-D

On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <[email protected]> wrote:

Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, since it's not "production-ready", upper management doesn't want to fully deploy it yet. They just want to keep an eye on it, though. Kudu was so much simpler and easier to use in every aspect compared to HBase.
Impala was great for the report writers and analysts to experiment with for the short time it was up. But, once again, the only blocker was the lack of Spark support for our Data Developers/Scientists, so production-level data population won't happen until then.

I hope this helps you get an idea where I am coming from…

Cheers,
Ben

On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <[email protected]> wrote:

On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <[email protected]> wrote:

> J-D,
>
> The main thing I hear is that Cassandra is being used as an updatable hot data store to ensure that duplicates are taken care of and idempotency is maintained. Whether data was directly retrieved from Cassandra for analytics, reports, or searches, it was not clear what its main use was. Some also just used it as a staging area to populate downstream tables in Parquet format. The last thing I heard was that CQL was terrible, so that rules out much use of direct queries against it.

I'm no C* expert, but I don't think CQL is meant for real analytics, just ease of use instead of plainly using the APIs. Even then, Kudu should beat it easily on big scans. Same for HBase. We've done benchmarks against the latter, not the former.

> As for our company, we have been looking for an updatable data store for a long time that can be quickly queried directly, either using Spark SQL or Impala or some other SQL engine, and still handle TBs or PBs of data without performance degradation and many configuration headaches. For now, we are using HBase to take on this role, with Phoenix as a fast way to directly query the data. I can see Kudu as the best way to fill this gap easily, especially being the closest thing to other relational databases out there in familiarity for the many SQL analytics people in our company. The other alternative would be to go with AWS Redshift for the same reasons, but it would come at a cost, of course. If we went with either solution, Kudu or Redshift, it would get rid of the need to extract from HBase to Parquet tables or export to PostgreSQL to support more of the SQL language used by analysts or the reporting software we use.
Ok, the usual then *smile*. Looks like we're not too far off with Kudu. Have you folks tried Kudu with Impala yet with those use cases?

> I hope this helps.

It does, thanks for the nice reply.

On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <[email protected]> wrote:

Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues who were also there did say that the hype around Spark isn't dying down.

There's definitely an overlap in the use cases that Cassandra, HBase, and Kudu cater to. I wouldn't go as far as saying that C* is just an interim solution for the use case you describe.

Nothing significant happened in Kudu over the past month; it's a storage engine, so things move slowly *smile*. I'd love to see more contributions on the Spark front. I know there's code out there that could be integrated into kudu-spark, it just needs to land in gerrit. I'm sure folks will happily review it.

Do you have relevant experiences you can share? I'd love to learn more about the use cases for which you envision using Kudu as a C* replacement.

Thanks,

J-D

On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <[email protected]> wrote:

Hi J-D,

My colleagues recently came back from Strata in San Jose. They told me that everything was about Spark and that there is a big buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra is just an interim solution as a low-latency, easily queried data store. I was wondering if anything significant has happened in regards to Kudu, especially on the Spark front. Plus, can you come up with your own proposed stack acronym to promote?

Cheers,
Ben

On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <[email protected]> wrote:

Hi Ben,

AFAIK no one in the dev community committed to any timeline. I know of one person on the Kudu Slack who's working on a better RDD, but that's about it.
Regards,

J-D

On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <[email protected]> wrote:

Hi J-D,

Quick question… Is there an ETA for KUDU-1214? I want to target a version of Kudu to begin real testing of Spark against it for our devs. At least, I can tell them what timeframe to anticipate.

Just curious,

Benjamin Kim
Data Solutions Architect

[a•mo•bee] (n.) the company defining digital marketing.

Mobile: +1 818 635 2900
3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com

On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <[email protected]> wrote:

The DStream stuff isn't there at all. I'm not sure if it's needed either.

The kuduRDD is just leveraging the MR input format; ideally we'd use scans directly.

The SparkSQL stuff is there, but it doesn't do any sort of pushdown. It's really basic.

The goal was to provide something for others to contribute to. We have some basic unit tests that others can easily extend. None of us on the team are Spark experts, but we'd be really happy to assist anyone who wants to improve the kudu-spark code.

J-D

On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <[email protected]> wrote:

J-D,

It looks like it fulfills most of the basic requirements (Kudu RDD, Kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL functionality (DataFrames) and doing the documentation, what more needs to be done? Optimizations?

I believe that it's a good place to start using Spark with Kudu and to compare it to HBase with Spark (not clean).

Thanks,
Ben

On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <[email protected]> wrote:

AFAIK no one is working on it, but we did manage to get this in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321

It's a really simple wrapper, and yes, you can use SparkSQL on Kudu, but it will require a lot more work to make it fast/useful.

Hope this helps,

J-D
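For a flavor of what that simple wrapper evolved into: by the 1.0.0 connector, a scan could be expressed directly as an RDD through the KuduContext. A rough sketch (untested; the master address, table, and column names are illustrative):

import org.apache.kudu.spark.kudu.KuduContext

val kuduContext = new KuduContext("kudu-master.example.com:7051")

// Scan two projected columns from the table into an RDD[Row].
val rdd = kuduContext.kuduRDD(sc, "my_table", Seq("my_id", "value"))
rdd.take(5).foreach(println)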
On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <[email protected]> wrote:

I see this KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214> targeted for 0.8.0, but I see no progress on it. When this is complete, will this mean that Spark will be able to work with Kudu both programmatically and as a client via Spark SQL? Or is there more work that needs to be done on the Spark side for it to work?

Just curious.

Cheers,
Ben

--
Todd Lipcon
Software Engineer, Cloudera
