It's only in Cloudera's maven repo: https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
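For reference, pulling that artifact into a build would look something like the following build.sbt fragment. This is an untested sketch: the resolver URL is the one above, and the coordinates are simply read off the repository path, so adjust them if the published POM differs.

```scala
// build.sbt fragment (sketch): resolve kudu-spark from Cloudera's repo.
// The coordinates below are inferred from the repository path above.
resolvers += "cloudera-repos" at "https://repository.cloudera.com/cloudera/cloudera-repos/"

libraryDependencies += "org.kududb" % "kudu-spark_2.10" % "0.9.0"
```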
J-D

On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Hi J-D,
>
> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for spark-shell to use. Can you show me where to find it?
>
> Thanks,
> Ben
>
> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>
> What's in this doc is what's gonna get released: https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>
> J-D
>
> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Will this be documented with examples once 0.9.0 comes out?
>>
>> Thanks,
>> Ben
>>
>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>
>> It will be in 0.9.0.
>>
>> J-D
>>
>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Hi Chris,
>>>
>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>
>>> Thanks,
>>> Ben
>>>
>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>
>>> There is some code in review that needs some more refinement. It will allow upsert/insert from a dataframe using the datasource api. It will also allow the creation and deletion of tables from a dataframe: http://gerrit.cloudera.org:8080/#/c/2992/
>>>
>>> Example usages will look something like: http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>
>>> -Chris George
>>>
>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>
>>> Can someone tell me what the state is of this Spark work?
>>>
>>> Also, does anyone have any sample code on how to update/insert data in Kudu using DataFrames?
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>
>>> SparkSQL cannot support these types of statements, but we may be able to implement similar functionality through the API.
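Based on the in-review developing.adoc linked in Chris's message, using the datasource from the shell would look roughly like the sketch below. Everything here is an assumption taken from the draft doc: the `org.kududb.spark.kudu` package, the `.kudu` implicit, the `KuduContext` class and its `insertRows` method may all change before release, and the master address and table names are placeholders.

```scala
// Sketch of reading and writing Kudu through the Spark datasource API,
// following the in-review developing.adoc. All names are assumptions
// from that draft; this needs a running Kudu cluster and the kudu-spark jar.
import org.kududb.spark.kudu._

// Read a Kudu table into a dataframe.
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "events"))
  .kudu

// Query it with Spark SQL.
df.registerTempTable("events")
sqlContext.sql("SELECT * FROM events WHERE id >= 5").show()

// Insert the dataframe's rows into another existing Kudu table.
val kuduContext = new KuduContext("kudu-master:7051")
kuduContext.insertRows(df, "events_copy")
```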
>>> -Chris
>>>
>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>
>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it were to be implemented:
>>>
>>> MERGE INTO table_name USING table_reference ON (condition)
>>> WHEN MATCHED THEN
>>>   UPDATE SET column1 = value1 [, column2 = value2 ...]
>>> WHEN NOT MATCHED THEN
>>>   INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>
>>> Cheers,
>>> Ben
>>>
>>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>>
>>> I have a WIP kuduRDD that I made a few months ago. I pushed it into gerrit if you want to take a look: http://gerrit.cloudera.org:8080/#/c/2754/ It does predicate pushdown, which the existing input-format-based RDD does not.
>>>
>>> Within the next two weeks I’m planning to implement a datasource for Spark that will have pushdown predicates and insertion/update functionality (need to look more at the Cassandra and HBase datasources for the best way to do this). I agree that server-side upsert would be helpful. Having a datasource would give us useful data frames and also make Spark SQL usable for Kudu.
>>>
>>> My reasoning for having a Spark datasource and not using Impala is:
>>> 1. We have had trouble getting Impala to run fast with high concurrency when compared to Spark.
>>> 2. We interact with datasources which do not integrate with Impala.
>>> 3. We have custom SQL query planners for extended SQL functionality.
>>>
>>> -Chris George
>>>
>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>>
>>> You guys make a convincing point, although on the upsert side we'll need more support from the servers. Right now all you can do is an INSERT and then, if you get a dup key, do an UPDATE. I guess we could at least add an API on the client side that would manage it, but it wouldn't be atomic.
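J-D's insert-then-update fallback can be sketched in plain Scala. This is an illustration only: the in-memory `Table` class and `DuplicateKeyException` are hypothetical stand-ins for a Kudu table and the client's duplicate-key error; with the real client these would be KuduSession operations, and the gap between the failed insert and the update is exactly the part that is not atomic.

```scala
// Illustration of client-side upsert: try an INSERT, and on a
// duplicate-key failure fall back to an UPDATE. The Table class is a
// stand-in for a Kudu table; nothing here is the real client API.
import scala.collection.mutable

class DuplicateKeyException(key: Int)
  extends RuntimeException(s"duplicate key: $key")

class Table {
  private val rows = mutable.Map[Int, String]()
  def insert(key: Int, value: String): Unit =
    if (rows.contains(key)) throw new DuplicateKeyException(key)
    else rows(key) = value
  def update(key: Int, value: String): Unit = rows(key) = value
  def get(key: Int): Option[String] = rows.get(key)
}

// Not atomic: another writer could insert the same key between the
// failed insert and the compensating update.
def upsert(table: Table, key: Int, value: String): Unit =
  try table.insert(key, value)
  catch { case _: DuplicateKeyException => table.update(key, value) }
```

For example, calling `upsert(t, 1, "a")` and then `upsert(t, 1, "b")` on the same table leaves `t.get(1)` as `Some("b")`, the second call having taken the update path.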
>>>
>>> J-D
>>>
>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>
>>>> It's pretty simple, actually. I need to support versioned datasets in a Spark SQL environment. Instead of a hack on top of a Parquet data store, I'm hoping (among other reasons) to be able to use Kudu's write and timestamp-based read operations to support not only appending data, but also updating existing data, and even some schema migration. The most typical use case is a dataset that is updated periodically (e.g., weekly or monthly) in which the preliminary data in the previous window (week or month) is updated with values that are expected to remain unchanged from then on, and a new set of preliminary values for the current window needs to be added/appended.
>>>>
>>>> Using Kudu's Java API and developing additional functionality on top of what Kudu has to offer isn't too much to ask, but the ease of integration with Spark SQL will gate how quickly we would move to using Kudu and how seriously we'd look at alternatives before making that decision.
>>>>
>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>
>>>>> Mark,
>>>>>
>>>>> Thanks for taking some time to reply in this thread, glad it caught the attention of other folks!
>>>>>
>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>
>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>
>>>>>> I care about insert into Kudu with Spark SQL. I'm currently delaying a refactoring of some Spark SQL-oriented insert functionality while trying to evaluate what to expect from Kudu. Whether Kudu does a good job supporting inserts with Spark SQL will be a key consideration as to whether we adopt Kudu.
>>>>>
>>>>> I'd like to know more about why SparkSQL inserts are necessary for you.
>>>>> Is it just that you currently do it that way into some database or Parquet, so with minimal refactoring you'd be able to use Kudu? Would re-writing those SQL lines into Scala and directly using the Java API's KuduSession be too much work?
>>>>>
>>>>> Additionally, what do you expect to gain from using Kudu vs. your current solution? If it's not completely clear, I'd love to help you think through it.
>>>>>
>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>>> Yup, starting to get a good idea.
>>>>>>>
>>>>>>> What are your DS folks looking for in terms of functionality related to Spark? A SparkSQL integration that's as fully featured as Impala's? Do they care about being able to insert into Kudu with SparkSQL, or just being able to query real fast? Anything more specific to Spark that I'm missing?
>>>>>>>
>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all our resources are committed to making things happen in time, and a more fully featured Spark integration isn't in our plans during that period. I'm really hoping someone in the community will help with Spark, the same way we got a big contribution for the Flume sink.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, since it’s not “production-ready”, upper management doesn’t want to fully deploy it yet. They just want to keep an eye on it though. Kudu was so much simpler and easier to use in every aspect compared to HBase. Impala was great for the report writers and analysts to experiment with for the short time it was up.
>>>>>>>> But, once again, the only blocker was the lack of Spark support for our Data Developers/Scientists. So, production-level data population won’t happen until then.
>>>>>>>>
>>>>>>>> I hope this helps you get an idea of where I am coming from…
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>
>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> J-D,
>>>>>>>>>
>>>>>>>>> The main thing I hear is that Cassandra is being used as an updatable hot data store to ensure that duplicates are taken care of and idempotency is maintained. Whether data was directly retrieved from Cassandra for analytics, reports, or searches, it was not clear what its main use was. Some also just used it as a staging area to populate downstream tables in parquet format. The last thing I heard was that CQL was terrible, so that rules out much use of direct queries against it.
>>>>>>>>>
>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics, just ease of use instead of plainly using the APIs. Even then, Kudu should beat it easily on big scans. Same for HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>
>>>>>>>>> As for our company, we have been looking for an updatable data store for a long time that can be quickly queried directly, either using Spark SQL or Impala or some other SQL engine, and still handle TBs or PBs of data without performance degradation and many configuration headaches. For now, we are using HBase to take on this role, with Phoenix as a fast way to directly query the data.
>>>>>>>>> I can see Kudu as the best way to fill this gap easily, especially being the closest thing to other relational databases out there in familiarity for the many SQL analytics people in our company. The other alternative would be to go with AWS Redshift for the same reasons, but it would come at a cost, of course. If we went with either solution, Kudu or Redshift, it would get rid of the need to extract from HBase to parquet tables or export to PostgreSQL to support more of the SQL language used by analysts or the reporting software we use.
>>>>>>>>>
>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>>
>>>>>>>>> I hope this helps.
>>>>>>>>>
>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues who were also there did say that the hype around Spark isn't dying down.
>>>>>>>>>
>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, HBase, and Kudu cater to. I wouldn't go as far as saying that C* is just an interim solution for the use case you describe.
>>>>>>>>>
>>>>>>>>> Nothing significant happened in Kudu over the past month; it's a storage engine, so things move slowly *smile*. I'd love to see more contributions on the Spark front. I know there's code out there that could be integrated into kudu-spark, it just needs to land in gerrit.
>>>>>>>>> I'm sure folks will happily review it.
>>>>>>>>>
>>>>>>>>> Do you have relevant experiences you can share? I'd love to learn more about the use cases for which you envision using Kudu as a C* replacement.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi J-D,
>>>>>>>>>>
>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They told me that everything was about Spark and there is a big buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra is just an interim solution as a low-latency, easily queried data store. I was wondering if anything significant happened in regards to Kudu, especially on the Spark front. Plus, can you come up with your own proposed stack acronym to promote?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Ben,
>>>>>>>>>>
>>>>>>>>>> AFAIK no one in the dev community has committed to any timeline. I know of one person on the Kudu Slack who's working on a better RDD, but that's about it.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>
>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a version of Kudu to begin real testing of Spark against it for our devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>>
>>>>>>>>>>> Just curious,
>>>>>>>>>>> *Benjamin Kim*
>>>>>>>>>>> *Data Solutions Architect*
>>>>>>>>>>>
>>>>>>>>>>> [a•mo•bee] *(n.)* the company defining digital marketing.
>>>>>>>>>>>
>>>>>>>>>>> *Mobile: +1 818 635 2900*
>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com
>>>>>>>>>>>
>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed either.
>>>>>>>>>>>
>>>>>>>>>>> The kuduRDD is just leveraging the MR input format; ideally we'd use scans directly.
>>>>>>>>>>>
>>>>>>>>>>> The SparkSQL stuff is there, but it doesn't do any sort of pushdown. It's really basic.
>>>>>>>>>>>
>>>>>>>>>>> The goal was to provide something for others to contribute to. We have some basic unit tests that others can easily extend. None of us on the team are Spark experts, but we'd be really happy to assist anyone improving the kudu-spark code.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> J-D,
>>>>>>>>>>>>
>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL functionality (DataFrames) and doing the documentation, what more needs to be done? Optimizations?
>>>>>>>>>>>>
>>>>>>>>>>>> I believe that it’s a good place to start using Spark with Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ben
>>>>>>>>>>>>
>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>
>>>>>>>>>>>> It's a really simple wrapper, and yes, you can use SparkSQL on Kudu, but it will require a lot more work to make it fast/useful.
>>>>>>>>>>>>
>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>
>>>>>>>>>>>> J-D
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I see KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214> targeted for 0.8.0, but I see no progress on it. When this is complete, will this mean that Spark will be able to work with Kudu both programmatically and as a client via Spark SQL? Or is there more work that needs to be done on the Spark side for it to work?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben