Hi J-D, I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for spark-shell to use. Can you show me where to find it?
Thanks,
Ben

> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>
> What's in this doc is what's gonna get released:
> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>
> J-D
>
> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Will this be documented with examples once 0.9.0 comes out?
>
> Thanks,
> Ben
>
>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>
>> It will be in 0.9.0.
>>
>> J-D
>>
>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> Hi Chris,
>>
>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>
>> Thanks,
>> Ben
>>
>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>
>>> There is some code in review that needs some more refinement.
>>> It will allow upsert/insert from a dataframe using the datasource api. It will also allow the creation and deletion of tables from a dataframe.
>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>
>>> Example usages will look something like:
>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>
>>> -Chris George
>>>
>>>
>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>
>>> Can someone tell me what the state is of this Spark work?
>>>
>>> Also, does anyone have any sample code on how to update/insert data in Kudu using DataFrames?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>
>>>> SparkSQL cannot support these types of statements, but we may be able to implement similar functionality through the API.
>>>> -Chris
>>>>
>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>
>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it were to be implemented.
>>>>
>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>   WHEN MATCHED THEN
>>>>     UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>   WHEN NOT MATCHED THEN
>>>>     INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>
>>>>> I have a WIP kuduRDD that I made a few months ago. I pushed it into gerrit if you want to take a look.
>>>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>>>> It does push down predicates, which the existing input-format-based RDD does not.
>>>>>
>>>>> Within the next two weeks I’m planning to implement a datasource for Spark that will have pushdown predicates and insertion/update functionality (need to look more at the Cassandra and HBase datasources for the best way to do this). I agree that server-side upsert would be helpful. Having a datasource would give us useful data frames and also make Spark SQL usable for Kudu.
>>>>>
>>>>> My reasoning for having a Spark datasource and not using Impala is:
>>>>> 1. We have had trouble getting Impala to run fast with high concurrency when compared to Spark.
>>>>> 2. We interact with datasources which do not integrate with Impala.
>>>>> 3. We have custom SQL query planners for extended SQL functionality.
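[To make the datasource discussion above concrete, here is a rough spark-shell sketch of the kind of DataFrame read/write usage being described. The work was still in review at this point, so the package name ("org.kududb.spark.kudu"), option keys, master address, and table name below are all assumptions for illustration, not the final API.]

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // `sc` is the spark-shell SparkContext

// Read a Kudu table into a DataFrame through the datasource API.
val df = sqlContext.read
  .format("org.kududb.spark.kudu")
  .option("kudu.master", "kudu-master.example.com:7051")
  .option("kudu.table", "my_table")
  .load()

// Register it so Spark SQL queries work against it.
df.registerTempTable("my_table")
sqlContext.sql("SELECT * FROM my_table WHERE id > 100").show()

// Write a DataFrame back to Kudu, per the insert/upsert support in review.
df.write
  .format("org.kududb.spark.kudu")
  .option("kudu.master", "kudu-master.example.com:7051")
  .option("kudu.table", "my_table")
  .mode("append")
  .save()
```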
>>>>>
>>>>> -Chris George
>>>>>
>>>>>
>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>>>>
>>>>> You guys make a convincing point, although on the upsert side we'll need more support from the servers. Right now all you can do is an INSERT then, if you get a dup key, do an UPDATE. I guess we could at least add an API on the client side that would manage it, but it wouldn't be atomic.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>> It's pretty simple, actually. I need to support versioned datasets in a Spark SQL environment. Instead of a hack on top of a Parquet data store, I'm hoping (among other reasons) to be able to use Kudu's write and timestamp-based read operations to support not only appending data, but also updating existing data, and even some schema migration. The most typical use case is a dataset that is updated periodically (e.g., weekly or monthly) in which the preliminary data in the previous window (week or month) is updated with values that are expected to remain unchanged from then on, and a new set of preliminary values for the current window needs to be added/appended.
>>>>>
>>>>> Using Kudu's Java API and developing additional functionality on top of what Kudu has to offer isn't too much to ask, but the ease of integration with Spark SQL will gate how quickly we would move to using Kudu and how seriously we'd look at alternatives before making that decision.
>>>>>
>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>> Mark,
>>>>>
>>>>> Thanks for taking some time to reply in this thread, glad it caught the attention of other folks!
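[The non-atomic client-side fallback J-D describes above — INSERT, then on a duplicate key retry as an UPDATE — might look roughly like this with the Kudu Java client driven from Scala. The package/class names follow the Java client of that era (org.kududb.client); the master address, table, and column names are made up for illustration.]

```scala
import org.kududb.client.KuduClient

val client  = new KuduClient.KuduClientBuilder("kudu-master.example.com:7051").build()
val table   = client.openTable("metrics")
val session = client.newSession()

// First attempt: a plain INSERT.
val insert = table.newInsert()
insert.getRow.addLong("id", 42L)
insert.getRow.addDouble("value", 3.14)
val resp = session.apply(insert)

// If the insert failed (e.g. the key already exists), fall back to an
// UPDATE. Not atomic: another writer can slip in between the two operations.
if (resp.hasRowError) {
  val update = table.newUpdate()
  update.getRow.addLong("id", 42L)
  update.getRow.addDouble("value", 3.14)
  session.apply(update)
}

session.close()
client.shutdown()
```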
>>>>>
>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>
>>>>> I care about insert into Kudu with Spark SQL. I'm currently delaying a refactoring of some Spark SQL-oriented insert functionality while trying to evaluate what to expect from Kudu. Whether Kudu does a good job supporting inserts with Spark SQL will be a key consideration in whether we adopt Kudu.
>>>>>
>>>>> I'd like to know more about why SparkSQL inserts are necessary for you. Is it just that you currently do it that way into some database or Parquet, so with minimal refactoring you'd be able to use Kudu? Would rewriting those SQL lines into Scala and directly using the Java API's KuduSession be too much work?
>>>>>
>>>>> Additionally, what do you expect to gain from using Kudu vs. your current solution? If it's not completely clear, I'd love to help you think through it.
>>>>>
>>>>>
>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>> Yup, starting to get a good idea.
>>>>>
>>>>> What are your DS folks looking for in terms of functionality related to Spark? A SparkSQL integration that's as fully featured as Impala's? Do they care about being able to insert into Kudu with SparkSQL, or just being able to query real fast? Anything more specific to Spark that I'm missing?
>>>>>
>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all our resources are committed to making things happen in time, and a more fully featured Spark integration isn't in our plans during that period. I'm really hoping someone in the community will help with Spark, the same way we got a big contribution for the Flume sink.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, since it’s not “production-ready”, upper management doesn’t want to fully deploy it yet. They just want to keep an eye on it, though. Kudu was so much simpler and easier to use in every aspect compared to HBase. Impala was great for the report writers and analysts to experiment with for the short time it was up. But, once again, the only blocker was the lack of Spark support for our Data Developers/Scientists. So, production-level data population won’t happen until then.
>>>>>
>>>>> I hope this helps you get an idea where I am coming from…
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>>
>>>>>
>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> J-D,
>>>>>>
>>>>>> The main thing I hear is that Cassandra is being used as an updatable hot data store to ensure that duplicates are taken care of and idempotency is maintained. Whether data was directly retrieved from Cassandra for analytics, reports, or searches, it was not clear what its main use was. Some also just used it as a staging area to populate downstream tables in Parquet format. The last thing I heard was that CQL was terrible, so that rules out much use of direct queries against it.
>>>>>>
>>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics, just ease of use instead of plainly using the APIs. Even then, Kudu should beat it easily on big scans. Same for HBase. We've done benchmarks against the latter, not the former.
>>>>>>
>>>>>>
>>>>>> As for our company, we have been looking for an updatable data store for a long time that can be quickly queried directly, either using Spark SQL or Impala or some other SQL engine, and still handle TBs or PBs of data without performance degradation and many configuration headaches. For now, we are using HBase to take on this role, with Phoenix as a fast way to directly query the data. I can see Kudu as the best way to fill this gap easily, especially being the closest thing to other relational databases out there in familiarity for the many SQL analytics people in our company. The other alternative would be to go with AWS Redshift for the same reasons, but it would come at a cost, of course. If we went with either solution, Kudu or Redshift, it would get rid of the need to extract from HBase to Parquet tables or export to PostgreSQL to support more of the SQL language used by analysts or the reporting software we use.
>>>>>>
>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>
>>>>>>
>>>>>> I hope this helps.
>>>>>>
>>>>>> It does, thanks for the nice reply.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>
>>>>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues who were also there did say that the hype around Spark isn't dying down.
>>>>>>>
>>>>>>> There's definitely an overlap in the use cases that Cassandra, HBase, and Kudu cater to. I wouldn't go as far as saying that C* is just an interim solution for the use case you describe.
>>>>>>>
>>>>>>> Nothing significant happened in Kudu over the past month; it's a storage engine, so things move slowly *smile*. I'd love to see more contributions on the Spark front. I know there's code out there that could be integrated in kudu-spark, it just needs to land in gerrit. I'm sure folks will happily review it.
>>>>>>>
>>>>>>> Do you have relevant experiences you can share? I'd love to learn more about the use cases for which you envision using Kudu as a C* replacement.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>> Hi J-D,
>>>>>>>
>>>>>>> My colleagues recently came back from Strata in San Jose. They told me that everything was about Spark and there is a big buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra is just an interim solution as a low-latency, easily queried data store. I was wondering if anything significant happened in regards to Kudu, especially on the Spark front. Plus, can you come up with your own proposed stack acronym to promote?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ben
>>>>>>>
>>>>>>>
>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Hi Ben,
>>>>>>>>
>>>>>>>> AFAIK no one in the dev community committed to any timeline. I know of one person on the Kudu Slack who's working on a better RDD, but that's about it.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>>>>> Hi J-D,
>>>>>>>>
>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a version of Kudu to begin real testing of Spark against it for our devs.
>>>>>>>> At least, I can tell them what timeframe to anticipate.
>>>>>>>>
>>>>>>>> Just curious,
>>>>>>>> Benjamin Kim
>>>>>>>> Data Solutions Architect
>>>>>>>>
>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>
>>>>>>>> Mobile: +1 818 635 2900
>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>>>>>
>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed either.
>>>>>>>>>
>>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use scans directly.
>>>>>>>>>
>>>>>>>>> The SparkSQL stuff is there, but it doesn't do any sort of pushdown. It's really basic.
>>>>>>>>>
>>>>>>>>> The goal was to provide something for others to contribute to. We have some basic unit tests that others can easily extend. None of us on the team are Spark experts, but we'd be really happy to assist anyone improving the kudu-spark code.
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>> J-D,
>>>>>>>>>
>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL functionality (DataFrames) and doing the documentation, what more needs to be done? Optimizations?
>>>>>>>>>
>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>
>>>>>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on Kudu, but it will require a lot more work to make it fast/useful.
>>>>>>>>>>
>>>>>>>>>> Hope this helps,
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>> I see this KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214> targeted for 0.8.0, but I see no progress on it. When this is complete, will this mean that Spark will be able to work with Kudu both programmatically and as a client via Spark SQL? Or is there more work that needs to be done on the Spark side for it to work?
>>>>>>>>>>
>>>>>>>>>> Just curious.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
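[For readers following along: "leveraging the MR input format", as mentioned in this thread, amounts to building an RDD with `newAPIHadoopRDD` over Kudu's MapReduce integration. The sketch below is illustrative only: the class names, key/value types, and configuration keys are assumptions, and every row is scanned and shipped to Spark with no predicate pushdown, which is the limitation the later datasource work addresses.]

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.kududb.client.RowResult
import org.kududb.mapreduce.KuduTableInputFormat

// Point the input format at a Kudu master and table (keys are assumptions).
val conf = new Configuration()
conf.set("kudu.mapreduce.master.address", "kudu-master.example.com:7051")
conf.set("kudu.mapreduce.input.table", "my_table")

// Build an RDD over the full table scan; `sc` is the spark-shell SparkContext.
val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[KuduTableInputFormat],
  classOf[NullWritable],
  classOf[RowResult])

// Each record's value is a Kudu RowResult; any filtering happens client-side.
println(rdd.count())
```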