Hi J-D, I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for spark-shell to use. Can you show me where to find it?
Thanks,
Ben

> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>
> What's in this doc is what's gonna get released:
> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>
> J-D
>
> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Will this be documented with examples once 0.9.0 comes out?
>
> Thanks,
> Ben
>
>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>
>> It will be in 0.9.0.
>>
>> J-D
>>
>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> Hi Chris,
>>
>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>
>> Thanks,
>> Ben
>>
>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>
>>> There is some code in review that needs some more refinement.
>>> It will allow upsert/insert from a dataframe using the datasource api. It will also allow the creation and deletion of tables from a dataframe.
>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>
>>> Example usages will look something like:
>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>
>>> -Chris George
>>>
>>>
>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>
>>> Can someone tell me what the state is of this Spark work?
>>>
>>> Also, does anyone have any sample code on how to update/insert data in Kudu using DataFrames?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>
>>>> SparkSQL cannot support these types of statements, but we may be able to implement similar functionality through the API.
>>>> -Chris
>>>>
>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>
>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it were to be implemented.
>>>>
>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>   WHEN MATCHED THEN
>>>>     UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>   WHEN NOT MATCHED THEN
>>>>     INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>
>>>>> I have a WIP kuduRDD that I made a few months ago. I pushed it into gerrit if you want to take a look.
>>>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>>>> It does push down predicates, which the existing input-format-based RDD does not.
>>>>>
>>>>> Within the next two weeks I’m planning to implement a datasource for Spark that will have pushdown predicates and insertion/update functionality (need to look more at the Cassandra and HBase datasources for the best way to do this). I agree that server-side upsert would be helpful. Having a datasource would give us useful data frames and also make Spark SQL usable for Kudu.
>>>>>
>>>>> My reasoning for having a Spark datasource and not using Impala is:
>>>>> 1. We have had trouble getting Impala to run fast with high concurrency when compared to Spark.
>>>>> 2. We interact with datasources which do not integrate with Impala.
>>>>> 3. We have custom SQL query planners for extended SQL functionality.
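[To make the datasource discussion above concrete, here is a rough spark-shell sketch of the kind of DataFrame read/write usage being described. The work was still in review at this point, so the package name ("org.kududb.spark.kudu"), option keys, master address, and table name below are all assumptions for illustration, not the final API.]

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc) // `sc` is the spark-shell SparkContext

// Read a Kudu table into a DataFrame through the datasource API.
val df = sqlContext.read
  .format("org.kududb.spark.kudu")
  .option("kudu.master", "kudu-master.example.com:7051")
  .option("kudu.table", "my_table")
  .load()

// Register it so Spark SQL queries work against it.
df.registerTempTable("my_table")
sqlContext.sql("SELECT * FROM my_table WHERE id > 100").show()

// Write a DataFrame back to Kudu, per the insert/upsert support in review.
df.write
  .format("org.kududb.spark.kudu")
  .option("kudu.master", "kudu-master.example.com:7051")
  .option("kudu.table", "my_table")
  .mode("append")
  .save()
```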
>>>>>
>>>>> -Chris George
>>>>>
>>>>>
>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>>>>
>>>>> You guys make a convincing point, although on the upsert side we'll need more support from the servers. Right now all you can do is an INSERT then, if you get a dup key, do an UPDATE. I guess we could at least add an API on the client side that would manage it, but it wouldn't be atomic.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>> It's pretty simple, actually. I need to support versioned datasets in a Spark SQL environment. Instead of a hack on top of a Parquet data store, I'm hoping (among other reasons) to be able to use Kudu's write and timestamp-based read operations to support not only appending data, but also updating existing data, and even some schema migration. The most typical use case is a dataset that is updated periodically (e.g., weekly or monthly) in which the preliminary data in the previous window (week or month) is updated with values that are expected to remain unchanged from then on, and a new set of preliminary values for the current window needs to be added/appended.
>>>>>
>>>>> Using Kudu's Java API and developing additional functionality on top of what Kudu has to offer isn't too much to ask, but the ease of integration with Spark SQL will gate how quickly we would move to using Kudu and how seriously we'd look at alternatives before making that decision.
>>>>>
>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>> Mark,
>>>>>
>>>>> Thanks for taking some time to reply in this thread, glad it caught the attention of other folks!
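[The non-atomic client-side fallback J-D describes above — INSERT, then on a duplicate key retry as an UPDATE — might look roughly like this with the Kudu Java client driven from Scala. The package/class names follow the Java client of that era (org.kududb.client); the master address, table, and column names are made up for illustration.]

```scala
import org.kududb.client.KuduClient

val client  = new KuduClient.KuduClientBuilder("kudu-master.example.com:7051").build()
val table   = client.openTable("metrics")
val session = client.newSession()

// First attempt: a plain INSERT.
val insert = table.newInsert()
insert.getRow.addLong("id", 42L)
insert.getRow.addDouble("value", 3.14)
val resp = session.apply(insert)

// If the insert failed (e.g. the key already exists), fall back to an
// UPDATE. Not atomic: another writer can slip in between the two operations.
if (resp.hasRowError) {
  val update = table.newUpdate()
  update.getRow.addLong("id", 42L)
  update.getRow.addDouble("value", 3.14)
  session.apply(update)
}

session.close()
client.shutdown()
```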
>>>>>
>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>
>>>>> I care about insert into Kudu with Spark SQL. I'm currently delaying a refactoring of some Spark SQL-oriented insert functionality while trying to evaluate what to expect from Kudu. Whether Kudu does a good job supporting inserts with Spark SQL will be a key consideration in whether we adopt Kudu.
>>>>>
>>>>> I'd like to know more about why SparkSQL inserts are necessary for you. Is it just that you currently do it that way into some database or Parquet, so with minimal refactoring you'd be able to use Kudu? Would rewriting those SQL lines into Scala and directly using the Java API's KuduSession be too much work?
>>>>>
>>>>> Additionally, what do you expect to gain from using Kudu vs. your current solution? If it's not completely clear, I'd love to help you think through it.
>>>>>
>>>>>
>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>> Yup, starting to get a good idea.
>>>>>
>>>>> What are your DS folks looking for in terms of functionality related to Spark? A SparkSQL integration that's as fully featured as Impala's? Do they care about being able to insert into Kudu with SparkSQL, or just being able to query real fast? Anything more specific to Spark that I'm missing?
>>>>>
>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all our resources are committed to making things happen in time, and a more fully featured Spark integration isn't in our plans during that period. I'm really hoping someone in the community will help with Spark, the same way we got a big contribution for the Flume sink.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, since it’s not “production-ready”, upper management doesn’t want to fully deploy it yet. They just want to keep an eye on it, though. Kudu was so much simpler and easier to use in every aspect compared to HBase. Impala was great for the report writers and analysts to experiment with for the short time it was up. But, once again, the only blocker was the lack of Spark support for our Data Developers/Scientists. So, production-level data population won’t happen until then.
>>>>>
>>>>> I hope this helps you get an idea where I am coming from…
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>>
>>>>>
>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>> J-D,
>>>>>>
>>>>>> The main thing I hear is that Cassandra is being used as an updatable hot data store to ensure that duplicates are taken care of and idempotency is maintained. Whether data was directly retrieved from Cassandra for analytics, reports, or searches, it was not clear what its main use was. Some also just used it as a staging area to populate downstream tables in Parquet format. The last thing I heard was that CQL was terrible, so that rules out much use of direct queries against it.
>>>>>>
>>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics, just ease of use instead of plainly using the APIs. Even then, Kudu should beat it easily on big scans. Same for HBase. We've done benchmarks against the latter, not the former.
>>>>>>
>>>>>>
>>>>>> As for our company, we have been looking for an updatable data store for a long time that can be quickly queried directly, either using Spark SQL or Impala or some other SQL engine, and still handle TBs or PBs of data without performance degradation and many configuration headaches. For now, we are using HBase to take on this role, with Phoenix as a fast way to directly query the data. I can see Kudu as the best way to fill this gap easily, especially being the closest thing to other relational databases out there in familiarity for the many SQL analytics people in our company. The other alternative would be to go with AWS Redshift for the same reasons, but it would come at a cost, of course. If we went with either solution, Kudu or Redshift, it would get rid of the need to extract from HBase to Parquet tables or export to PostgreSQL to support more of the SQL language used by analysts or the reporting software we use.
>>>>>>
>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>
>>>>>>
>>>>>> I hope this helps.
>>>>>>
>>>>>> It does, thanks for the nice reply.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Ben
>>>>>>
>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>
>>>>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues who were also there did say that the hype around Spark isn't dying down.
>>>>>>>
>>>>>>> There's definitely an overlap in the use cases that Cassandra, HBase, and Kudu cater to. I wouldn't go as far as saying that C* is just an interim solution for the use case you describe.
>>>>>>>
>>>>>>> Nothing significant happened in Kudu over the past month; it's a storage engine, so things move slowly *smile*. I'd love to see more contributions on the Spark front. I know there's code out there that could be integrated in kudu-spark, it just needs to land in gerrit. I'm sure folks will happily review it.
>>>>>>>
>>>>>>> Do you have relevant experiences you can share? I'd love to learn more about the use cases for which you envision using Kudu as a C* replacement.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>> Hi J-D,
>>>>>>>
>>>>>>> My colleagues recently came back from Strata in San Jose. They told me that everything was about Spark and there is a big buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra is just an interim solution as a low-latency, easily queried data store. I was wondering if anything significant happened in regards to Kudu, especially on the Spark front. Plus, can you come up with your own proposed stack acronym to promote?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ben
>>>>>>>
>>>>>>>
>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Hi Ben,
>>>>>>>>
>>>>>>>> AFAIK no one in the dev community committed to any timeline. I know of one person on the Kudu Slack who's working on a better RDD, but that's about it.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>>>>> Hi J-D,
>>>>>>>>
>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a version of Kudu to begin real testing of Spark against it for our devs.
>>>>>>>> At least, I can tell them what timeframe to anticipate.
>>>>>>>>
>>>>>>>> Just curious,
>>>>>>>> Benjamin Kim
>>>>>>>> Data Solutions Architect
>>>>>>>>
>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>
>>>>>>>> Mobile: +1 818 635 2900
>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>>>>>
>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed either.
>>>>>>>>>
>>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use scans directly.
>>>>>>>>>
>>>>>>>>> The SparkSQL stuff is there, but it doesn't do any sort of pushdown. It's really basic.
>>>>>>>>>
>>>>>>>>> The goal was to provide something for others to contribute to. We have some basic unit tests that others can easily extend. None of us on the team are Spark experts, but we'd be really happy to assist anyone improving the kudu-spark code.
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>> J-D,
>>>>>>>>>
>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL functionality (DataFrames) and doing the documentation, what more needs to be done? Optimizations?
>>>>>>>>>
>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>
>>>>>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on Kudu, but it will require a lot more work to make it fast/useful.
>>>>>>>>>>
>>>>>>>>>> Hope this helps,
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>> I see this KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214> targeted for 0.8.0, but I see no progress on it. When this is complete, will this mean that Spark will be able to work with Kudu both programmatically and as a client via Spark SQL? Or is there more work that needs to be done on the Spark side for it to work?
>>>>>>>>>>
>>>>>>>>>> Just curious.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
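[For readers following along: "leveraging the MR input format", as mentioned in this thread, amounts to building an RDD with `newAPIHadoopRDD` over Kudu's MapReduce integration. The sketch below is illustrative only: the class names, key/value types, and configuration keys are assumptions, and every row is scanned and shipped to Spark with no predicate pushdown, which is the limitation the later datasource work addresses.]

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.kududb.client.RowResult
import org.kududb.mapreduce.KuduTableInputFormat

// Point the input format at a Kudu master and table (keys are assumptions).
val conf = new Configuration()
conf.set("kudu.mapreduce.master.address", "kudu-master.example.com:7051")
conf.set("kudu.mapreduce.input.table", "my_table")

// Build an RDD over the full table scan; `sc` is the spark-shell SparkContext.
val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[KuduTableInputFormat],
  classOf[NullWritable],
  classOf[RowResult])

// Each record's value is a Kudu RowResult; any filtering happens client-side.
println(rdd.count())
```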