It's only in Cloudera's maven repo: https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
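For reference, pulling that artifact into a build would look something like the following build.sbt fragment. This is an untested sketch: the resolver URL is the one above, and the coordinates are simply read off the repository path, so adjust them if the published POM differs.

```scala
// build.sbt fragment (sketch): resolve kudu-spark from Cloudera's repo.
// The coordinates below are inferred from the repository path above.
resolvers += "cloudera-repos" at "https://repository.cloudera.com/cloudera/cloudera-repos/"

libraryDependencies += "org.kududb" % "kudu-spark_2.10" % "0.9.0"
```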
J-D

On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Hi J-D,
>
> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for spark-shell to use. Can you show me where to find it?
>
> Thanks,
> Ben
>
> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>
> What's in this doc is what's gonna get released: https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>
> J-D
>
> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Will this be documented with examples once 0.9.0 comes out?
>>
>> Thanks,
>> Ben
>>
>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>
>> It will be in 0.9.0.
>>
>> J-D
>>
>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Hi Chris,
>>>
>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>
>>> Thanks,
>>> Ben
>>>
>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>
>>> There is some code in review that needs some more refinement. It will allow upsert/insert from a dataframe using the datasource api. It will also allow the creation and deletion of tables from a dataframe: http://gerrit.cloudera.org:8080/#/c/2992/
>>>
>>> Example usages will look something like: http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>
>>> -Chris George
>>>
>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>
>>> Can someone tell me what the state is of this Spark work?
>>>
>>> Also, does anyone have any sample code on how to update/insert data in Kudu using DataFrames?
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>
>>> SparkSQL cannot support these types of statements, but we may be able to implement similar functionality through the API.
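Based on the in-review developing.adoc linked in Chris's message, using the datasource from the shell would look roughly like the sketch below. Everything here is an assumption taken from the draft doc: the `org.kududb.spark.kudu` package, the `.kudu` implicit, the `KuduContext` class and its `insertRows` method may all change before release, and the master address and table names are placeholders.

```scala
// Sketch of reading and writing Kudu through the Spark datasource API,
// following the in-review developing.adoc. All names are assumptions
// from that draft; this needs a running Kudu cluster and the kudu-spark jar.
import org.kududb.spark.kudu._

// Read a Kudu table into a dataframe.
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "events"))
  .kudu

// Query it with Spark SQL.
df.registerTempTable("events")
sqlContext.sql("SELECT * FROM events WHERE id >= 5").show()

// Insert the dataframe's rows into another existing Kudu table.
val kuduContext = new KuduContext("kudu-master:7051")
kuduContext.insertRows(df, "events_copy")
```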
>>> -Chris
>>>
>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>
>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it were to be implemented:
>>>
>>> MERGE INTO table_name USING table_reference ON (condition)
>>> WHEN MATCHED THEN
>>>   UPDATE SET column1 = value1 [, column2 = value2 ...]
>>> WHEN NOT MATCHED THEN
>>>   INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>
>>> Cheers,
>>> Ben
>>>
>>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>>
>>> I have a WIP kuduRDD that I made a few months ago. I pushed it into gerrit if you want to take a look: http://gerrit.cloudera.org:8080/#/c/2754/ It does predicate pushdown, which the existing input-format-based RDD does not.
>>>
>>> Within the next two weeks I’m planning to implement a datasource for Spark that will have pushdown predicates and insertion/update functionality (need to look more at the Cassandra and HBase datasources for the best way to do this). I agree that server-side upsert would be helpful. Having a datasource would give us useful data frames and also make Spark SQL usable for Kudu.
>>>
>>> My reasoning for having a Spark datasource and not using Impala is:
>>> 1. We have had trouble getting Impala to run fast with high concurrency when compared to Spark.
>>> 2. We interact with datasources which do not integrate with Impala.
>>> 3. We have custom SQL query planners for extended SQL functionality.
>>>
>>> -Chris George
>>>
>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>>
>>> You guys make a convincing point, although on the upsert side we'll need more support from the servers. Right now all you can do is an INSERT and then, if you get a dup key, do an UPDATE. I guess we could at least add an API on the client side that would manage it, but it wouldn't be atomic.
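J-D's insert-then-update fallback can be sketched in plain Scala. This is an illustration only: the in-memory `Table` class and `DuplicateKeyException` are hypothetical stand-ins for a Kudu table and the client's duplicate-key error; with the real client these would be KuduSession operations, and the gap between the failed insert and the update is exactly the part that is not atomic.

```scala
// Illustration of client-side upsert: try an INSERT, and on a
// duplicate-key failure fall back to an UPDATE. The Table class is a
// stand-in for a Kudu table; nothing here is the real client API.
import scala.collection.mutable

class DuplicateKeyException(key: Int)
  extends RuntimeException(s"duplicate key: $key")

class Table {
  private val rows = mutable.Map[Int, String]()
  def insert(key: Int, value: String): Unit =
    if (rows.contains(key)) throw new DuplicateKeyException(key)
    else rows(key) = value
  def update(key: Int, value: String): Unit = rows(key) = value
  def get(key: Int): Option[String] = rows.get(key)
}

// Not atomic: another writer could insert the same key between the
// failed insert and the compensating update.
def upsert(table: Table, key: Int, value: String): Unit =
  try table.insert(key, value)
  catch { case _: DuplicateKeyException => table.update(key, value) }
```

For example, calling `upsert(t, 1, "a")` and then `upsert(t, 1, "b")` on the same table leaves `t.get(1)` as `Some("b")`, the second call having taken the update path.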
>>>
>>> J-D
>>>
>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>
>>>> It's pretty simple, actually. I need to support versioned datasets in a Spark SQL environment. Instead of a hack on top of a Parquet data store, I'm hoping (among other reasons) to be able to use Kudu's write and timestamp-based read operations to support not only appending data, but also updating existing data, and even some schema migration. The most typical use case is a dataset that is updated periodically (e.g., weekly or monthly) in which the preliminary data in the previous window (week or month) is updated with values that are expected to remain unchanged from then on, and a new set of preliminary values for the current window needs to be added/appended.
>>>>
>>>> Using Kudu's Java API and developing additional functionality on top of what Kudu has to offer isn't too much to ask, but the ease of integration with Spark SQL will gate how quickly we would move to using Kudu and how seriously we'd look at alternatives before making that decision.
>>>>
>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>
>>>>> Mark,
>>>>>
>>>>> Thanks for taking some time to reply in this thread, glad it caught the attention of other folks!
>>>>>
>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>
>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>
>>>>>> I care about insert into Kudu with Spark SQL. I'm currently delaying a refactoring of some Spark SQL-oriented insert functionality while trying to evaluate what to expect from Kudu. Whether Kudu does a good job supporting inserts with Spark SQL will be a key consideration as to whether we adopt Kudu.
>>>>>
>>>>> I'd like to know more about why SparkSQL inserts are necessary for you.
>>>>> Is it just that you currently do it that way into some database or Parquet, so with minimal refactoring you'd be able to use Kudu? Would re-writing those SQL lines into Scala and directly using the Java API's KuduSession be too much work?
>>>>>
>>>>> Additionally, what do you expect to gain from using Kudu vs. your current solution? If it's not completely clear, I'd love to help you think through it.
>>>>>
>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>>> Yup, starting to get a good idea.
>>>>>>>
>>>>>>> What are your DS folks looking for in terms of functionality related to Spark? A SparkSQL integration that's as fully featured as Impala's? Do they care about being able to insert into Kudu with SparkSQL, or just being able to query real fast? Anything more specific to Spark that I'm missing?
>>>>>>>
>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all our resources are committed to making things happen in time, and a more fully featured Spark integration isn't in our plans during that period. I'm really hoping someone in the community will help with Spark, the same way we got a big contribution for the Flume sink.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, since it’s not “production-ready”, upper management doesn’t want to fully deploy it yet. They just want to keep an eye on it though. Kudu was so much simpler and easier to use in every aspect compared to HBase. Impala was great for the report writers and analysts to experiment with for the short time it was up.
>>>>>>>> But, once again, the only blocker was the lack of Spark support for our Data Developers/Scientists. So, production-level data population won’t happen until then.
>>>>>>>>
>>>>>>>> I hope this helps you get an idea of where I am coming from…
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>
>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> J-D,
>>>>>>>>>
>>>>>>>>> The main thing I hear is that Cassandra is being used as an updatable hot data store to ensure that duplicates are taken care of and idempotency is maintained. Whether data was directly retrieved from Cassandra for analytics, reports, or searches, it was not clear what its main use was. Some also just used it as a staging area to populate downstream tables in parquet format. The last thing I heard was that CQL was terrible, so that rules out much use of direct queries against it.
>>>>>>>>>
>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics, just ease of use instead of plainly using the APIs. Even then, Kudu should beat it easily on big scans. Same for HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>
>>>>>>>>> As for our company, we have been looking for an updatable data store for a long time that can be quickly queried directly, either using Spark SQL or Impala or some other SQL engine, and still handle TBs or PBs of data without performance degradation and many configuration headaches. For now, we are using HBase to take on this role, with Phoenix as a fast way to directly query the data.
>>>>>>>>> I can see Kudu as the best way to fill this gap easily, especially being the closest thing to other relational databases out there in familiarity for the many SQL analytics people in our company. The other alternative would be to go with AWS Redshift for the same reasons, but it would come at a cost, of course. If we went with either solution, Kudu or Redshift, it would get rid of the need to extract from HBase to parquet tables or export to PostgreSQL to support more of the SQL language used by analysts or the reporting software we use.
>>>>>>>>>
>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>>
>>>>>>>>> I hope this helps.
>>>>>>>>>
>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues who were also there did say that the hype around Spark isn't dying down.
>>>>>>>>>
>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, HBase, and Kudu cater to. I wouldn't go as far as saying that C* is just an interim solution for the use case you describe.
>>>>>>>>>
>>>>>>>>> Nothing significant happened in Kudu over the past month; it's a storage engine, so things move slowly *smile*. I'd love to see more contributions on the Spark front. I know there's code out there that could be integrated into kudu-spark, it just needs to land in gerrit.
>>>>>>>>> I'm sure folks will happily review it.
>>>>>>>>>
>>>>>>>>> Do you have relevant experiences you can share? I'd love to learn more about the use cases for which you envision using Kudu as a C* replacement.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi J-D,
>>>>>>>>>>
>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They told me that everything was about Spark and there is a big buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra is just an interim solution as a low-latency, easily queried data store. I was wondering if anything significant happened in regards to Kudu, especially on the Spark front. Plus, can you come up with your own proposed stack acronym to promote?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Ben,
>>>>>>>>>>
>>>>>>>>>> AFAIK no one in the dev community has committed to any timeline. I know of one person on the Kudu Slack who's working on a better RDD, but that's about it.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>
>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a version of Kudu to begin real testing of Spark against it for our devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>>
>>>>>>>>>>> Just curious,
>>>>>>>>>>> *Benjamin Kim*
>>>>>>>>>>> *Data Solutions Architect*
>>>>>>>>>>>
>>>>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>>>>>>>
>>>>>>>>>>> *Mobile: +1 818 635 2900*
>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>>>>>>>>
>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed either.
>>>>>>>>>>>
>>>>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use scans directly.
>>>>>>>>>>>
>>>>>>>>>>> The SparkSQL stuff is there, but it doesn't do any sort of pushdown. It's really basic.
>>>>>>>>>>>
>>>>>>>>>>> The goal was to provide something for others to contribute to. We have some basic unit tests that others can easily extend. None of us on the team are Spark experts, but we'd be really happy to assist anyone improving the kudu-spark code.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> J-D,
>>>>>>>>>>>>
>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL functionality (DataFrames) and doing the documentation, what more needs to be done? Optimizations?
>>>>>>>>>>>>
>>>>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ben
>>>>>>>>>>>>
>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>
>>>>>>>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on Kudu, but it will require a lot more work to make it fast/useful.
>>>>>>>>>>>>
>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>
>>>>>>>>>>>> J-D
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I see KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214> targeted for 0.8.0, but I see no progress on it. When this is complete, will this mean that Spark will be able to work with Kudu both programmatically and as a client via Spark SQL? Or is there more work that needs to be done on the Spark side for it to work?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben