Hi Benjamin,

The Spark connector jar can be found in the Apache Maven repository.
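(Editor's sketch, not part of the original message: with the coordinates below, the connector can also be pulled straight into a shell session. This assumes a Spark 1.6 install built for Scala 2.10 and network access to the Apache release repository; it is a usage sketch, not something from the thread.)

```shell
# Fetch the kudu-spark connector at shell startup instead of installing the
# jar by hand. --repositories adds the Apache release repo as a resolver;
# --packages uses the Maven coordinates from this email.
spark-shell \
  --repositories https://repository.apache.org/releases \
  --packages org.apache.kudu:kudu-spark_2.10:1.0.0
```

The same `--packages` flag works for spark-submit. This is a command/config fragment only; it needs a real Spark installation to run.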
Maven coordinates:

<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-spark_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

<repository>
  <id>apache.releases</id>
  <name>Apache Release Repository</name>
  <url>https://repository.apache.org/releases</url>
</repository>

On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Now that Kudu 1.0.0 is officially out and ready for production use, where
> do we find the Spark connector jar for this release?
>
> Thanks,
> Ben
>
>
> On Jun 17, 2016, at 11:08 AM, Dan Burkert <d...@cloudera.com> wrote:
>
> Hi Ben,
>
> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I
> do not think we support that at this point. I haven't looked deeply into
> it, but we may hit issues specifying Kudu-specific options (partitioning,
> column encoding, etc.). Probably issues that can be worked through
> eventually, though. If you are interested in contributing to Kudu, this is
> an area that could obviously use improvement! Most or all of our Spark
> features have been completely community driven to date.
>
>
>> I am assuming that more Spark support, along with the semantic changes below,
>> will be incorporated into Kudu 0.9.1.
>>
>
> As a rule we do not release new features in patch releases, but the good
> news is that we are releasing regularly, and our next scheduled release is
> for the August timeframe (see JD's roadmap
> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
> email about what we are aiming to include). Also, Cloudera does publish snapshot
> versions of the Spark connector here
> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so
> the jars are available if you don't mind using snapshots.
>
>
>> Anyone know of a better way to make unique primary keys, other than using
>> UUID to make every row unique, if there is no unique column (or combination
>> thereof) to use?
>>
>
> Not that I know of. In general it's pretty rare to have a dataset without
> a natural primary key (even if it's just all of the columns), but in those
> cases UUID is a good solution.
>
>
>> This is what I am using. I know auto-incrementing is coming down the line
>> (don't know when), but is there a way to simulate this in Kudu using Spark,
>> out of curiosity?
>>
>
> To my knowledge there is no plan to have auto-increment in Kudu.
> Distributed, consistent, auto-incrementing counters are a difficult problem,
> and I don't think there are any known solutions that would be fast enough
> for Kudu (happy to be proven wrong, though!).
>
> - Dan
>
>
>>
>> Thanks,
>> Ben
>>
>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com> wrote:
>>
>> I'm not sure exactly what the semantics will be, but at least one of them
>> will be upsert. These modes come from Spark, and they were really designed
>> for file-backed storage and not table storage. We may want to do append =
>> upsert, and overwrite = truncate + insert. I think that may match the
>> normal Spark semantics more closely.
>>
>> - Dan
>>
>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Dan,
>>>
>>> Thanks for the information. That would mean both "append" and
>>> "overwrite" modes would be combined or not needed in the future.
>>>
>>> Cheers,
>>> Ben
>>>
>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com> wrote:
>>>
>>> Right now append uses an update Kudu operation, which requires that the row
>>> already be present in the table. Overwrite maps to insert. Kudu very
>>> recently got upsert support baked in, but it hasn't yet been integrated
>>> into the Spark connector. So pretty soon these sharp edges will get a lot
>>> better, since upsert is the way to go for most Spark workloads.
>>>
>>> - Dan
>>>
>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com>
>>> wrote:
>>>
>>>> I tried to use the "append" mode, and it worked.
>>>> Over 3.8 million rows
>>>> in 64s. I would assume that now I can use the "overwrite" mode on existing
>>>> data. Now, I have to find answers to these questions. What would happen if
>>>> I "append" to the data in the Kudu table if the data already exists? What
>>>> would happen if I "overwrite" existing data when the DataFrame has data in
>>>> it that does not exist in the Kudu table? I need to evaluate the best way
>>>> to simulate the UPSERT behavior in HBase, because this is what our use case
>>>> is.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>>
>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Now, I'm getting this error when trying to write to the table.
>>>>
>>>> import scala.collection.JavaConverters._
>>>> val key_seq = Seq("my_id")
>>>> val key_list = List("my_id").asJava
>>>> kuduContext.createTable(tableName, df.schema, key_seq, new
>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list,
>>>> 100))
>>>>
>>>> df.write
>>>>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>>   .mode("overwrite")
>>>>   .kudu
>>>>
>>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to
>>>> Kudu; sample errors: Not found: key not found (error 0) Not found: key not
>>>> found (error 0) Not found: key not found (error 0) Not found: key not found
>>>> (error 0) Not found: key not found (error 0)
>>>>
>>>> Does the key field need to be first in the DataFrame?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com> wrote:
>>>>
>>>>
>>>>
>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com>
>>>> wrote:
>>>>
>>>>> Dan,
>>>>>
>>>>> Thanks! It got further. Now, how do I set the primary key to be a
>>>>> column (or columns) in the DataFrame and set the partitioning? Is it like this?
>>>>>
>>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new
>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>>>>
>>>>> java.lang.IllegalArgumentException: Table partitioning must be
>>>>> specified using setRangePartitionColumns or addHashPartitions
>>>>>
>>>>
>>>> Yep. The `Seq("my_id")` part of that call is specifying the set of
>>>> primary key columns, so in this case you have specified the single PK
>>>> column "my_id". The `addHashPartitions` call adds hash partitioning to the
>>>> table, in this case over the column "my_id" (which is good, it must be over
>>>> one or more PK columns, so in this case "my_id" is the one and only valid
>>>> combination). However, the call to `addHashPartitions` also takes the
>>>> number of buckets as the second param. You shouldn't get the
>>>> IllegalArgumentException as long as you are specifying either
>>>> `addHashPartitions` or `setRangePartitionColumns`.
>>>>
>>>> - Dan
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote:
>>>>>
>>>>> Looks like we're missing an import statement in that example. Could
>>>>> you try:
>>>>>
>>>>> import org.kududb.client._
>>>>>
>>>>> and try again?
>>>>>
>>>>> - Dan
>>>>>
>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I encountered an error trying to create a table from a DataFrame,
>>>>>> based on the documentation.
>>>>>>
>>>>>> <console>:49: error: not found: type CreateTableOptions
>>>>>> kuduContext.createTable(tableName, df.schema,
>>>>>> Seq("key"), new CreateTableOptions().setNumReplicas(1))
>>>>>>
>>>>>> Is there something I'm missing?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>> It's only in Cloudera's Maven repo:
>>>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi J-D,
>>>>>>>
>>>>>>> I installed Kudu 0.9.0 using CM, but I can't find the kudu-spark jar
>>>>>>> for spark-shell to use. Can you show me where to find it?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>>
>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> What's in this doc is what's gonna get released:
>>>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>
>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <
>>>>>>>> jdcry...@apache.org> wrote:
>>>>>>>>
>>>>>>>> It will be in 0.9.0.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Chris,
>>>>>>>>>
>>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George <
>>>>>>>>> christopher.geo...@rms.com> wrote:
>>>>>>>>>
>>>>>>>>> There is some code in review that needs some more refinement.
>>>>>>>>> It will allow upsert/insert from a dataframe using the datasource
>>>>>>>>> API.
>>>>>>>>> It will also allow the creation and deletion of tables from a
>>>>>>>>> dataframe:
>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>>>>>>>
>>>>>>>>> Example usages will look something like:
>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>>>>>>>
>>>>>>>>> -Chris George
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Can someone tell me what the state of this Spark work is?
>>>>>>>>>
>>>>>>>>> Also, does anyone have any sample code on how to update/insert
>>>>>>>>> data in Kudu using DataFrames?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George <
>>>>>>>>> christopher.geo...@rms.com> wrote:
>>>>>>>>>
>>>>>>>>> SparkSQL cannot support these types of statements, but we may be
>>>>>>>>> able to implement similar functionality through the API.
>>>>>>>>> -Chris
>>>>>>>>>
>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an
>>>>>>>>> "upsert" if it were to be implemented.
>>>>>>>>>
>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>>> WHEN MATCHED THEN
>>>>>>>>> UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>>> WHEN NOT MATCHED THEN
>>>>>>>>> INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George <
>>>>>>>>> christopher.geo...@rms.com> wrote:
>>>>>>>>>
>>>>>>>>> I have a WIP kuduRDD that I made a few months ago. I pushed it
>>>>>>>>> into gerrit if you want to take a look:
>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>>>>>>>> It does pushdown predicates, which the existing input-format-based
>>>>>>>>> RDD does not.
>>>>>>>>>
>>>>>>>>> Within the next two weeks I'm planning to implement a datasource
>>>>>>>>> for Spark that will have pushdown predicates and insertion/update
>>>>>>>>> functionality (need to look more at the Cassandra and HBase datasources
>>>>>>>>> for the best way to do this). I agree that server-side upsert would be
>>>>>>>>> helpful. Having a datasource would give us useful data frames and also
>>>>>>>>> make Spark SQL usable for Kudu.
>>>>>>>>>
>>>>>>>>> My reasoning for having a Spark datasource and not using Impala is:
>>>>>>>>> 1. We have had trouble getting Impala to run fast with high
>>>>>>>>> concurrency when compared to Spark.
>>>>>>>>> 2. We interact with datasources which do not integrate with Impala.
>>>>>>>>> 3. We have custom SQL query planners for extended SQL functionality.
>>>>>>>>>
>>>>>>>>> -Chris George
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> You guys make a convincing point, although on the upsert side
>>>>>>>>> we'll need more support from the servers. Right now all you can do is an
>>>>>>>>> INSERT then, if you get a dup key, do an UPDATE. I guess we could at least
>>>>>>>>> add an API on the client side that would manage it, but it wouldn't be
>>>>>>>>> atomic.
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <
>>>>>>>>> m...@clearstorydata.com> wrote:
>>>>>>>>>
>>>>>>>>>> It's pretty simple, actually. I need to support versioned
>>>>>>>>>> datasets in a Spark SQL environment. Instead of a hack on top of a Parquet
>>>>>>>>>> data store, I'm hoping (among other reasons) to be able to use Kudu's write
>>>>>>>>>> and timestamp-based read operations to support not only appending data, but
>>>>>>>>>> also updating existing data, and even some schema migration.
>>>>>>>>>> The most
>>>>>>>>>> typical use case is a dataset that is updated periodically (e.g., weekly or
>>>>>>>>>> monthly) in which the preliminary data in the previous window (week or
>>>>>>>>>> month) is updated with values that are expected to remain unchanged from
>>>>>>>>>> then on, and a new set of preliminary values for the current window needs
>>>>>>>>>> to be added/appended.
>>>>>>>>>>
>>>>>>>>>> Using Kudu's Java API and developing additional functionality on
>>>>>>>>>> top of what Kudu has to offer isn't too much to ask, but the ease of
>>>>>>>>>> integration with Spark SQL will gate how quickly we would move to using
>>>>>>>>>> Kudu and how seriously we'd look at alternatives before making that
>>>>>>>>>> decision.
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <
>>>>>>>>>> jdcry...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Mark,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it
>>>>>>>>>>> caught the attention of other folks!
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <
>>>>>>>>>>> m...@clearstorydata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I care about insert into Kudu with Spark SQL. I'm currently
>>>>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert functionality
>>>>>>>>>>>> while trying to evaluate what to expect from Kudu. Whether Kudu does a
>>>>>>>>>>>> good job supporting inserts with Spark SQL will be a key consideration as
>>>>>>>>>>>> to whether we adopt Kudu.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary
>>>>>>>>>>> for you. Is it just that you currently do it that way into some database or
>>>>>>>>>>> parquet, so with minimal refactoring you'd be able to use Kudu?
>>>>>>>>>>> Would
>>>>>>>>>>> re-writing those SQL lines into Scala and directly using the Java API's
>>>>>>>>>>> KuduSession be too much work?
>>>>>>>>>>>
>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu vs. your
>>>>>>>>>>> current solution? If it's not completely clear, I'd love to help you think
>>>>>>>>>>> through it.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <
>>>>>>>>>>>> jdcry...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What are your DS folks looking for in terms of functionality
>>>>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully featured as
>>>>>>>>>>>>> Impala's? Do they care about being able to insert into Kudu with SparkSQL,
>>>>>>>>>>>>> or just about being able to query real fast? Anything more specific to
>>>>>>>>>>>>> Spark that I'm missing?
>>>>>>>>>>>>>
>>>>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At
>>>>>>>>>>>>> Cloudera all our resources are committed to making things happen in time,
>>>>>>>>>>>>> and a more fully featured Spark integration isn't in our plans during that
>>>>>>>>>>>>> period. I'm really hoping someone in the community will help with Spark,
>>>>>>>>>>>>> the same way we got a big contribution for the Flume sink.
>>>>>>>>>>>>>
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <
>>>>>>>>>>>>> bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions.
>>>>>>>>>>>>>> But, since it's not "production-ready", upper management doesn't want to
>>>>>>>>>>>>>> fully deploy it yet. They just want to keep an eye on it though. Kudu was
>>>>>>>>>>>>>> so much simpler and easier to use in every aspect compared to HBase.
>>>>>>>>>>>>>> Impala
>>>>>>>>>>>>>> was great for the report writers and analysts to experiment with for the
>>>>>>>>>>>>>> short time it was up. But, once again, the only blocker was the lack of
>>>>>>>>>>>>>> Spark support for our Data Developers/Scientists. So, production-level
>>>>>>>>>>>>>> data population won't happen until then.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <
>>>>>>>>>>>>>> jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <
>>>>>>>>>>>>>> bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The main thing I hear is that Cassandra is being used as an
>>>>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken care of and
>>>>>>>>>>>>>>> idempotency is maintained. Whether data was directly retrieved from
>>>>>>>>>>>>>>> Cassandra for analytics, reports, or searches, it was not clear what its
>>>>>>>>>>>>>>> main use was. Some also just used it as a staging area to populate
>>>>>>>>>>>>>>> downstream tables in parquet format. The last thing I heard was that CQL
>>>>>>>>>>>>>>> was terrible, so that rules out much use of direct queries against it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real
>>>>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs. Even then,
>>>>>>>>>>>>>> Kudu should beat it easily on big scans. Same for HBase. We've done
>>>>>>>>>>>>>> benchmarks against the latter, not the former.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As for our company, we have been looking for an updatable
>>>>>>>>>>>>>>> data store for a long time that can be quickly queried directly, either
>>>>>>>>>>>>>>> using Spark SQL or Impala or some other SQL engine, and still handle TB or
>>>>>>>>>>>>>>> PB of data without performance degradation and many configuration
>>>>>>>>>>>>>>> headaches. For now, we are using HBase to take on this role, with Phoenix
>>>>>>>>>>>>>>> as a fast way to directly query the data. I can see Kudu as the best way to
>>>>>>>>>>>>>>> fill this gap easily, especially being the closest thing to other
>>>>>>>>>>>>>>> relational databases out there in familiarity for the many SQL analytics
>>>>>>>>>>>>>>> people in our company. The other alternative would be to go with AWS
>>>>>>>>>>>>>>> Redshift for the same reasons, but it would come at a cost, of course. If
>>>>>>>>>>>>>>> we went with either solution, Kudu or Redshift, it would get rid of the
>>>>>>>>>>>>>>> need to extract from HBase to parquet tables or export to PostgreSQL to
>>>>>>>>>>>>>>> support more of the SQL language used by analysts or the reporting
>>>>>>>>>>>>>>> software we use.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off
>>>>>>>>>>>>>> with Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>> jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we
>>>>>>>>>>>>>>> like to refer to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My
>>>>>>>>>>>>>>> colleagues who were also there did say that the hype around Spark isn't
>>>>>>>>>>>>>>> dying down.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There's definitely an overlap in the use cases that
>>>>>>>>>>>>>>> Cassandra, HBase, and Kudu cater to. I wouldn't go as far as saying that C*
>>>>>>>>>>>>>>> is just an interim solution for the use case you describe.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month;
>>>>>>>>>>>>>>> it's a storage engine, so things move slowly *smile*. I'd love to see more
>>>>>>>>>>>>>>> contributions on the Spark front. I know there's code out there that could
>>>>>>>>>>>>>>> be integrated into kudu-spark, it just needs to land in gerrit. I'm sure
>>>>>>>>>>>>>>> folks will happily review it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to
>>>>>>>>>>>>>>> learn more about the use cases for which you envision using Kudu as a C*
>>>>>>>>>>>>>>> replacement.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <
>>>>>>>>>>>>>>> bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose.
>>>>>>>>>>>>>>>> They told me that everything was about Spark and there is a big buzz about
>>>>>>>>>>>>>>>> the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that
>>>>>>>>>>>>>>>> Cassandra is just an interim solution as a low-latency, easily queried data
>>>>>>>>>>>>>>>> store. I was wondering if anything significant happened in regard to Kudu,
>>>>>>>>>>>>>>>> especially on the Spark front. Plus, can you come up with your own proposed
>>>>>>>>>>>>>>>> stack acronym to promote?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>>> jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any
>>>>>>>>>>>>>>>> timeline. I know of one person on the Kudu Slack who's working on a better
>>>>>>>>>>>>>>>> RDD, but that's about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <
>>>>>>>>>>>>>>>> b...@amobee.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to
>>>>>>>>>>>>>>>>> target a version of Kudu to begin real testing of Spark against it for our
>>>>>>>>>>>>>>>>> devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just curious,
>>>>>>>>>>>>>>>>> *Benjamin Kim*
>>>>>>>>>>>>>>>>> *Data Solutions Architect*
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Mobile: +1 818 635 2900*
>>>>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>>>> jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's
>>>>>>>>>>>>>>>>> needed either.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format;
>>>>>>>>>>>>>>>>> ideally we'd use scans directly.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The SparkSQL stuff is there, but it doesn't do any sort of
>>>>>>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The goal was to provide something for others to contribute
>>>>>>>>>>>>>>>>> to. We have some basic unit tests that others can easily extend. None of us
>>>>>>>>>>>>>>>>> on the team are Spark experts, but we'd be really happy to assist anyone
>>>>>>>>>>>>>>>>> improving the kudu-spark code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <
>>>>>>>>>>>>>>>>> bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements
>>>>>>>>>>>>>>>>>> (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up more
>>>>>>>>>>>>>>>>>> Spark SQL functionality (DataFrames) and doing the documentation, what more
>>>>>>>>>>>>>>>>>> needs to be done? Optimizations?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I believe that it's a good place to start using Spark
>>>>>>>>>>>>>>>>>> with Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <
>>>>>>>>>>>>>>>>>> jdcry...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get
>>>>>>>>>>>>>>>>>> this in for 0.7.0:
>>>>>>>>>>>>>>>>>> https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes, you can use
>>>>>>>>>>>>>>>>>> SparkSQL on Kudu, but it will require a lot more work to make it
>>>>>>>>>>>>>>>>>> fast/useful.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <
>>>>>>>>>>>>>>>>>> bbuil...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I see this KUDU-1214
>>>>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted
>>>>>>>>>>>>>>>>>>> for 0.8.0, but I see no progress on it. When this is complete, will this
>>>>>>>>>>>>>>>>>>> mean that Spark will be able to work with Kudu both programmatically and as
>>>>>>>>>>>>>>>>>>> a client via Spark SQL? Or is there more work that needs to be done on the
>>>>>>>>>>>>>>>>>>> Spark side for it to work?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Ben
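(Editor's sketch, not part of the thread: the write-mode semantics Dan proposes above, append = upsert and overwrite = truncate + insert, can be illustrated with a toy in-memory "table". Everything below, including the `WriteModes` object, is illustrative only and assumed, not kudu-spark code.)

```scala
// Toy model of the proposed Spark write modes, using a map keyed by
// primary key as the "table". This demonstrates the semantics only:
//   append    = upsert (existing keys updated, new keys inserted)
//   overwrite = truncate + insert (table emptied, then batch inserted)
object WriteModes {
  type Row = Map[String, Any]
  type Table = Map[String, Row] // keyed by primary key

  // append = upsert: map union, where the incoming batch wins on key conflicts
  def append(table: Table, batch: Seq[(String, Row)]): Table =
    table ++ batch

  // overwrite = truncate + insert: discard the old table contents first
  def overwrite(table: Table, batch: Seq[(String, Row)]): Table =
    Map.empty[String, Row] ++ batch
}
```

Note this differs from the 0.9-era behavior described earlier in the thread, where append mapped to an update (requiring the row to exist) and overwrite mapped to a plain insert.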