Re: Spark on Kudu

Benjamin Kim Tue, 20 Sep 2016 13:19:15 -0700

Now that Kudu 1.0.0 is officially out and ready for production use, where do we 
find the spark connector jar for this release?


Thanks,
Ben

> On Jun 17, 2016, at 11:08 AM, Dan Burkert <[email protected]> wrote:
> 
> Hi Ben,
> 
> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
> not think we support that at this point.  I haven't looked deeply into it, 
> but we may hit issues specifying Kudu-specific options (partitioning, column 
> encoding, etc.).  Probably issues that can be worked through eventually, 
> though.  If you are interested in contributing to Kudu, this is an area that 
> could obviously use improvement!  Most or all of our Spark features have been 
> completely community driven to date.
>  
> I am assuming that more Spark support along with semantic changes below will 
> be incorporated into Kudu 0.9.1.
> 
> As a rule we do not release new features in patch releases, but the good news 
> is that we are releasing regularly, and our next scheduled release is for the 
> August timeframe (see JD's roadmap 
> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>  email about what we are aiming to include).  Also, Cloudera does publish 
> snapshot versions of the Spark connector here 
> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the 
> jars are available if you don't mind using snapshots.
>  
> Anyone know of a better way to make unique primary keys other than using UUID 
> to make every row unique if there is no unique column (or combination 
> thereof) to use.
> 
> Not that I know of.  In general it's pretty rare to have a dataset without a 
> natural primary key (even if it's just all of the columns), but in those 
> cases UUID is a good solution.
>  
> This is what I am using. I know auto incrementing is coming down the line 
> (don’t know when), but is there a way to simulate this in Kudu using Spark 
> out of curiosity?
> 
> To my knowledge there is no plan to have auto increment in Kudu.  
> Distributed, consistent, auto incrementing counters is a difficult problem, 
> and I don't think there are any known solutions that would be fast enough for 
> Kudu (happy to be proven wrong, though!).
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> I'm not sure exactly what the semantics will be, but at least one of them 
>> will be upsert.  These modes come from spark, and they were really designed 
>> for file-backed storage and not table storage.  We may want to do append = 
>> upsert, and overwrite = truncate + insert.  I think that may match the 
>> normal spark semantics more closely.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <[email protected] 
>> <mailto:[email protected]>> wrote:
>> Dan,
>> 
>> Thanks for the information. That would mean both “append” and “overwrite” 
>> modes would be combined or not needed in the future.
>> 
>> Cheers,
>> Ben
>> 
>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Right now append uses an update Kudu operation, which requires the row 
>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>> into the Spark connector.  So pretty soon these sharp edges will get a lot 
>>> better, since upsert is the way to go for most spark workloads.
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>>> data. Now, I have to find answers to these questions. What would happen if 
>>> I “append” to the data in the Kudu table if the data already exists? What 
>>> would happen if I “overwrite” existing data when the DataFrame has data in 
>>> it that does not exist in the Kudu table? I need to evaluate the best way 
>>> to simulate the UPSERT behavior in HBase because this is what our use case 
>>> is.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> 
>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Now, I’m getting this error when trying to write to the table.
>>>> 
>>>> import scala.collection.JavaConverters._
>>>> val key_seq = Seq(“my_id")
>>>> val key_list = List(“my_id”).asJava
>>>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>>> 
>>>> df.write
>>>>     .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>>>     .mode("overwrite")
>>>>     .kudu
>>>> 
>>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not 
>>>> found (error 0)Not found: key not found (error 0)Not found: key not found 
>>>> (error 0)Not found: key not found (error 0)
>>>> 
>>>> Does the key field need to be first in the DataFrame?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> Dan,
>>>>> 
>>>>> Thanks! It got further. Now, how do I set the Primary Key to be a 
>>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>>> 
>>>>> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
>>>>> 
>>>>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>>>>> using setRangePartitionColumns or addHashPartitions
>>>>> 
>>>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of 
>>>>> primary key columns, so in this case you have specified the single PK 
>>>>> column "my_id".  The `addHashPartitions` call adds hash partitioning to 
>>>>> the table, in this case over the column "my_id" (which is good, it must 
>>>>> be over one or more PK columns, so in this case "my_id" is the one and 
>>>>> only valid combination).  However, the call to `addHashPartition` also 
>>>>> takes the number of buckets as the second param.  You shouldn't get the 
>>>>> IllegalArgumentException as long as you are specifying either 
>>>>> `addHashPartitions` or `setRangePartitionColumns`.
>>>>> 
>>>>> - Dan
>>>>>  
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> Looks like we're missing an import statement in that example.  Could you 
>>>>>> try:
>>>>>> 
>>>>>> import org.kududb.client._
>>>>>> and try again?
>>>>>> 
>>>>>> - Dan
>>>>>> 
>>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> I encountered an error trying to create a table based on the 
>>>>>> documentation from a DataFrame.
>>>>>> 
>>>>>> <console>:49: error: not found: type CreateTableOptions
>>>>>>               kuduContext.createTable(tableName, df.schema, Seq("key"), 
>>>>>> new CreateTableOptions().setNumReplicas(1))
>>>>>> 
>>>>>> Is there something I’m missing?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> It's only in Cloudera's maven repo: 
>>>>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>>>>  
>>>>>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
>>>>>>> 
>>>>>>> J-D
>>>>>>> 
>>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> Hi J-D,
>>>>>>> 
>>>>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar 
>>>>>>> for spark-shell to use. Can you show me where to find it?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> 
>>>>>>>> What's in this doc is what's gonna get released: 
>>>>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>>>>>  
>>>>>>>> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark>
>>>>>>>> 
>>>>>>>> J-D
>>>>>>>> 
>>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <[email protected] 
>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>> 
>>>>>>>>> It will be in 0.9.0.
>>>>>>>>> 
>>>>>>>>> J-D
>>>>>>>>> 
>>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <[email protected] 
>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>> Hi Chris,
>>>>>>>>> 
>>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George 
>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> There is some code in review that needs some more refinement.
>>>>>>>>>> It will allow upsert/insert from a dataframe using the datasource 
>>>>>>>>>> api. It will also allow the creation and deletion of tables from a 
>>>>>>>>>> dataframe
>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/>
>>>>>>>>>> 
>>>>>>>>>> Example usages will look something like:
>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>>>>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
>>>>>>>>>> 
>>>>>>>>>> -Chris George
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <[email protected] 
>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>>>>>> 
>>>>>>>>>> Also, does anyone have any sample code on how to update/insert data 
>>>>>>>>>> in Kudu using DataFrames?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George 
>>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> SparkSQL cannot support these type of statements but we may be able 
>>>>>>>>>>> to implement similar functionality through the api.
>>>>>>>>>>> -Chris
>>>>>>>>>>> 
>>>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <[email protected] 
>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” 
>>>>>>>>>>> if it were to be implemented.
>>>>>>>>>>> 
>>>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>>>>>  WHEN MATCHED THEN
>>>>>>>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>>>>>  WHEN NOT MATCHED THEN
>>>>>>>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George 
>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it 
>>>>>>>>>>>> into gerrit if you want to take a look. 
>>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>>>>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2754/>
>>>>>>>>>>>> It does pushdown predicates which the existing input formatter 
>>>>>>>>>>>> based rdd does not.
>>>>>>>>>>>> 
>>>>>>>>>>>> Within the next two weeks I’m planning to implement a datasource 
>>>>>>>>>>>> for spark that will have pushdown predicates and insertion/update 
>>>>>>>>>>>> functionality (need to look more at cassandra and the hbase 
>>>>>>>>>>>> datasource for best way to do this) I agree that server side 
>>>>>>>>>>>> upsert would be helpful.
>>>>>>>>>>>> Having a datasource would give us useful data frames and also make 
>>>>>>>>>>>> spark sql usable for kudu.
>>>>>>>>>>>> 
>>>>>>>>>>>> My reasoning for having a spark datasource and not using Impala 
>>>>>>>>>>>> is: 1. We have had trouble getting impala to run fast with high 
>>>>>>>>>>>> concurrency when compared to spark 2. We interact with datasources 
>>>>>>>>>>>> which do not integrate with impala. 3. We have custom sql query 
>>>>>>>>>>>> planners for extended sql functionality.
>>>>>>>>>>>> 
>>>>>>>>>>>> -Chris George
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <[email protected] 
>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> You guys make a convincing point, although on the upsert side 
>>>>>>>>>>>> we'll need more support from the servers. Right now all you can do 
>>>>>>>>>>>> is an INSERT then, if you get a dup key, do an UPDATE. I guess we 
>>>>>>>>>>>> could at least add an API on the client side that would manage it, 
>>>>>>>>>>>> but it wouldn't be atomic.
>>>>>>>>>>>> 
>>>>>>>>>>>> J-D
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra 
>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>> It's pretty simple, actually.  I need to support versioned 
>>>>>>>>>>>> datasets in a Spark SQL environment.  Instead of a hack on top of 
>>>>>>>>>>>> a Parquet data store, I'm hoping (among other reasons) to be able 
>>>>>>>>>>>> to use Kudu's write and timestamp-based read operations to support 
>>>>>>>>>>>> not only appending data, but also updating existing data, and even 
>>>>>>>>>>>> some schema migration.  The most typical use case is a dataset 
>>>>>>>>>>>> that is updated periodically (e.g., weekly or monthly) in which 
>>>>>>>>>>>> the the preliminary data in the previous window (week or month) is 
>>>>>>>>>>>> updated with values that are expected to remain unchanged from 
>>>>>>>>>>>> then on, and a new set of preliminary values for the current 
>>>>>>>>>>>> window need to be added/appended.
>>>>>>>>>>>> 
>>>>>>>>>>>> Using Kudu's Java API and developing additional functionality on 
>>>>>>>>>>>> top of what Kudu has to offer isn't too much to ask, but the ease 
>>>>>>>>>>>> of integration with Spark SQL will gate how quickly we would move 
>>>>>>>>>>>> to using Kudu and how seriously we'd look at alternatives before 
>>>>>>>>>>>> making that decision. 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans 
>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>> Mark,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it 
>>>>>>>>>>>> caught the attention of other folks!
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra 
>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>> Do they care being able to insert into Kudu with SparkSQL
>>>>>>>>>>>> 
>>>>>>>>>>>> I care about insert into Kudu with Spark SQL.  I'm currently 
>>>>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert 
>>>>>>>>>>>> functionality while trying to evaluate what to expect from Kudu.  
>>>>>>>>>>>> Whether Kudu does a good job supporting inserts with Spark SQL 
>>>>>>>>>>>> will be a key consideration as to whether we adopt Kudu.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'd like to know more about why SparkSQL inserts in necessary for 
>>>>>>>>>>>> you. Is it just that you currently do it that way into some 
>>>>>>>>>>>> database or parquet so with minimal refactoring you'd be able to 
>>>>>>>>>>>> use Kudu? Would re-writing those SQL lines into Scala and directly 
>>>>>>>>>>>> use the Java API's KuduSession be too much work?
>>>>>>>>>>>> 
>>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu VS your 
>>>>>>>>>>>> current solution? If it's not completely clear, I'd love to help 
>>>>>>>>>>>> you think through it.
>>>>>>>>>>>>  
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans 
>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>>>>> 
>>>>>>>>>>>> What are your DS folks looking for in terms of functionality 
>>>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully featured 
>>>>>>>>>>>> as Impala's? Do they care being able to insert into Kudu with 
>>>>>>>>>>>> SparkSQL or just being able to query real fast? Anything more 
>>>>>>>>>>>> specific to Spark that I'm missing?
>>>>>>>>>>>> 
>>>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At 
>>>>>>>>>>>> Cloudera all our resources are committed to making things happen 
>>>>>>>>>>>> in time, and a more fully featured Spark integration isn't in our 
>>>>>>>>>>>> plans during that period. I'm really hoping someone in the 
>>>>>>>>>>>> community will help with Spark, the same way we got a big 
>>>>>>>>>>>> contribution for the Flume sink. 
>>>>>>>>>>>> 
>>>>>>>>>>>> J-D
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <[email protected] 
>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, 
>>>>>>>>>>>> since it’s not “production-ready”, upper management doesn’t want 
>>>>>>>>>>>> to fully deploy it yet. They just want to keep an eye on it 
>>>>>>>>>>>> though. Kudu was so much simpler and easier to use in every aspect 
>>>>>>>>>>>> compared to HBase. Impala was great for the report writers and 
>>>>>>>>>>>> analysts to experiment with for the short time it was up. But, 
>>>>>>>>>>>> once again, the only blocker was the lack of Spark support for our 
>>>>>>>>>>>> Data Developers/Scientists.  So, production-level data population 
>>>>>>>>>>>> won’t happen until then.
>>>>>>>>>>>> 
>>>>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Ben
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans 
>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim 
>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The main thing I hear that Cassandra is being used as an 
>>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken care 
>>>>>>>>>>>>> of and idempotency is maintained. Whether data was directly 
>>>>>>>>>>>>> retrieved from Cassandra for analytics, reports, or searches, it 
>>>>>>>>>>>>> was not clear as to what was its main use. Some also just used it 
>>>>>>>>>>>>> for a staging area to populate downstream tables in parquet 
>>>>>>>>>>>>> format. The last thing I heard was that CQL was terrible, so that 
>>>>>>>>>>>>> rules out much use of direct queries against it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real 
>>>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs. 
>>>>>>>>>>>>> Even then, Kudu should beat it easily on big scans. Same for 
>>>>>>>>>>>>> HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>>>>>>  
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As for our company, we have been looking for an updatable data 
>>>>>>>>>>>>> store for a long time that can be quickly queried directly either 
>>>>>>>>>>>>> using Spark SQL or Impala or some other SQL engine and still 
>>>>>>>>>>>>> handle TB or PB of data without performance degradation and many 
>>>>>>>>>>>>> configuration headaches. For now, we are using HBase to take on 
>>>>>>>>>>>>> this role with Phoenix as a fast way to directly query the data. 
>>>>>>>>>>>>> I can see Kudu as the best way to fill this gap easily, 
>>>>>>>>>>>>> especially being the closest thing to other relational databases 
>>>>>>>>>>>>> out there in familiarity for the many SQL analytics people in our 
>>>>>>>>>>>>> company. The other alternative would be to go with AWS Redshift 
>>>>>>>>>>>>> for the same reasons, but it would come at a cost, of course. If 
>>>>>>>>>>>>> we went with either solutions, Kudu or Redshift, it would get rid 
>>>>>>>>>>>>> of the need to extract from HBase to parquet tables or export to 
>>>>>>>>>>>>> PostgreSQL to support more of the SQL language using by analysts 
>>>>>>>>>>>>> or the reporting software we use..
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with 
>>>>>>>>>>>>> Kudu. Have you folks tried Kudu with Impala yet with those use 
>>>>>>>>>>>>> cases?
>>>>>>>>>>>>>  
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It does, thanks for nice reply.
>>>>>>>>>>>>>  
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like 
>>>>>>>>>>>>>> to refer to "Impala + Kudu" as Kimpala, but yeah it's not as 
>>>>>>>>>>>>>> sexy. My colleagues who were also there did say that the hype 
>>>>>>>>>>>>>> around Spark isn't dying down.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, 
>>>>>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that C* 
>>>>>>>>>>>>>> is just an interim solution for the use case you describe.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month, it's a 
>>>>>>>>>>>>>> storage engine so things move slowly *smile*. I'd love to see 
>>>>>>>>>>>>>> more contributions on the Spark front. I know there's code out 
>>>>>>>>>>>>>> there that could be integrated in kudu-spark, it just needs to 
>>>>>>>>>>>>>> land in gerrit. I'm sure folks will happily review it.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to 
>>>>>>>>>>>>>> learn more about the use cases for which you envision using Kudu 
>>>>>>>>>>>>>> as a C* replacement.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim 
>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They 
>>>>>>>>>>>>>> told me that everything was about Spark and there is a big buzz 
>>>>>>>>>>>>>> about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I 
>>>>>>>>>>>>>> still think that Cassandra is just an interim solution as a 
>>>>>>>>>>>>>> low-latency, easily queried data store. I was wondering if 
>>>>>>>>>>>>>> anything significant happened in regards to Kudu, especially on 
>>>>>>>>>>>>>> the Spark front. Plus, can you come up with your own proposed 
>>>>>>>>>>>>>> stack acronym to promote?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I 
>>>>>>>>>>>>>>> know of one person on the Kudu Slack who's working on a better 
>>>>>>>>>>>>>>> RDD, but that's about it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <[email protected] 
>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target 
>>>>>>>>>>>>>>> a version of Kudu to begin real testing of Spark against it for 
>>>>>>>>>>>>>>> our devs. At least, I can tell them what timeframe to 
>>>>>>>>>>>>>>> anticipate.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Just curious,
>>>>>>>>>>>>>>> Benjamin Kim
>>>>>>>>>>>>>>> Data Solutions Architect
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Mobile: +1 818 635 2900 <tel:%2B1%20818%20635%202900>
>>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  
>>>>>>>>>>>>>>> www.amobee.com <http://www.amobee.com/>
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's 
>>>>>>>>>>>>>>>> needed either.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format, ideally 
>>>>>>>>>>>>>>>> we'd use scans directly.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of 
>>>>>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The goal was to provide something for others to contribute to. 
>>>>>>>>>>>>>>>> We have some basic unit tests that others can easily extend. 
>>>>>>>>>>>>>>>> None of us on the team are Spark experts, but we'd be really 
>>>>>>>>>>>>>>>> happy to assist one improve the kudu-spark code.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim 
>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu 
>>>>>>>>>>>>>>>> RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring 
>>>>>>>>>>>>>>>> up more Spark SQL functionality (Dataframes) and doing the 
>>>>>>>>>>>>>>>> documentation, what more needs to be done? Optimizations?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I believe that it’s a good place to start using Spark with 
>>>>>>>>>>>>>>>> Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this 
>>>>>>>>>>>>>>>>> in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321 
>>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1321>
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on 
>>>>>>>>>>>>>>>>> Kudu, but it will require a lot more work to make it 
>>>>>>>>>>>>>>>>> fast/useful.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim 
>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>> I see this KUDU-1214 
>>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for 
>>>>>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, 
>>>>>>>>>>>>>>>>> will this mean that Spark will be able to work with Kudu both 
>>>>>>>>>>>>>>>>> programmatically and as a client via Spark SQL? Or is there 
>>>>>>>>>>>>>>>>> more work that needs to be done on the Spark side for it to 
>>>>>>>>>>>>>>>>> work?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
>

Re: Spark on Kudu

Reply via email to