Re: Spark on Kudu

Benjamin Kim Tue, 20 Sep 2016 15:06:02 -0700
Thanks!

> On Sep 20, 2016, at 3:02 PM, Jordan Birdsell <[email protected]> 
> wrote:
> 
> http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark 
> 
> On Tue, Sep 20, 2016 at 5:00 PM Benjamin Kim <[email protected] 
> <mailto:[email protected]>> wrote:
> I see that the API has changed a bit so my old code doesn’t work anymore. Can 
> someone direct me to some code samples?
> 
> Thanks,
> Ben
> 
> 
>> On Sep 20, 2016, at 1:44 PM, Todd Lipcon <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim <[email protected] 
>> <mailto:[email protected]>> wrote:
>> Now that Kudu 1.0.0 is officially out and ready for production use, where do 
>> we find the spark connector jar for this release?
>> 
>> 
>> It's available in the official ASF maven repository:  
>> https://repository.apache.org/#nexus-search;quick~kudu-spark 
>> 
>> <dependency>
>>   <groupId>org.apache.kudu</groupId>
>>   <artifactId>kudu-spark_2.10</artifactId>
>>   <version>1.0.0</version>
>> </dependency>
>> 
>> 
>> -Todd
>>  
>> 
>> 
>>> On Jun 17, 2016, at 11:08 AM, Dan Burkert <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Hi Ben,
>>> 
>>> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I 
>>> do not think we support that at this point.  I haven't looked deeply into 
>>> it, but we may hit issues specifying Kudu-specific options (partitioning, 
>>> column encoding, etc.).  Probably issues that can be worked through 
>>> eventually, though.  If you are interested in contributing to Kudu, this is 
>>> an area that could obviously use improvement!  Most or all of our Spark 
>>> features have been completely community driven to date.
>>>  
>>> I am assuming that more Spark support along with semantic changes below 
>>> will be incorporated into Kudu 0.9.1.
>>> 
>>> As a rule we do not release new features in patch releases, but the good 
>>> news is that we are releasing regularly, and our next scheduled release is 
>>> for the August timeframe (see JD's roadmap 
>>> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>>>  email about what we are aiming to include).  Also, Cloudera does publish 
>>> snapshot versions of the Spark connector here 
>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so 
>>> the jars are available if you don't mind using snapshots.
>>>  
>>> Anyone know of a better way to make unique primary keys other than using 
>>> UUID to make every row unique if there is no unique column (or combination 
>>> thereof) to use?
>>> 
>>> Not that I know of.  In general it's pretty rare to have a dataset without 
>>> a natural primary key (even if it's just all of the columns), but in those 
>>> cases UUID is a good solution.
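
A minimal Scala sketch of the two key strategies mentioned in this exchange, using only `java.util.UUID` from the JDK. The names `SurrogateKeys`, `randomKey`, and `deterministicKey` are hypothetical, and the `\u0001` separator is an assumption about what never occurs in the data. The deterministic variant corresponds to treating "all of the columns" as the natural key.

```scala
import java.nio.charset.StandardCharsets
import java.util.UUID

object SurrogateKeys {
  // Random (version 4) UUID: every call yields a new key, so re-loading
  // the same source row produces a second, distinct row.
  def randomKey(): String = UUID.randomUUID().toString

  // Name-based (version 3) UUID computed from the row's column values:
  // identical values always map to the same key, so a re-load targets
  // the same row instead of duplicating it.
  def deterministicKey(columns: Seq[String]): String = {
    // Separator assumed never to appear inside a column value.
    val joined = columns.mkString("\u0001")
    UUID.nameUUIDFromBytes(joined.getBytes(StandardCharsets.UTF_8)).toString
  }
}
```

Which variant fits depends on whether repeated loads of the same data should create new rows or land on existing ones.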
>>>  
>>> This is what I am using. I know auto incrementing is coming down the line 
>>> (don’t know when), but is there a way to simulate this in Kudu using Spark 
>>> out of curiosity?
>>> 
>>> To my knowledge there is no plan to have auto increment in Kudu.  
>>> Distributed, consistent, auto incrementing counters is a difficult problem, 
>>> and I don't think there are any known solutions that would be fast enough 
>>> for Kudu (happy to be proven wrong, though!).
>>> 
>>> - Dan
>>>  
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> I'm not sure exactly what the semantics will be, but at least one of them 
>>>> will be upsert.  These modes come from spark, and they were really 
>>>> designed for file-backed storage and not table storage.  We may want to do 
>>>> append = upsert, and overwrite = truncate + insert.  I think that may 
>>>> match the normal spark semantics more closely.
>>>> 
>>>> - Dan
>>>> 
>>>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> Dan,
>>>> 
>>>> Thanks for the information. That would mean both “append” and “overwrite” 
>>>> modes would be combined or not needed in the future.
>>>> 
>>>> Cheers,
>>>> Ben
>>>> 
>>>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> Right now append uses an update Kudu operation, which requires the row 
>>>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>>>> into the Spark connector.  So pretty soon these sharp edges will get a 
>>>>> lot better, since upsert is the way to go for most spark workloads.
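
The difference between the three operations named here can be sketched with a plain in-memory map standing in for a table. This is a toy model in Scala, not the Kudu client API; `WriteSemantics` and its error strings are made up for illustration. It shows why an update against a missing row reports "not found" while an upsert succeeds either way.

```scala
object WriteSemantics {
  // Primary key -> value; a stand-in for a table, not a Kudu type.
  type Table = Map[String, String]

  // INSERT: rejected when the key already exists.
  def insert(t: Table, k: String, v: String): Either[String, Table] =
    if (t.contains(k)) Left(s"Already present: $k") else Right(t + (k -> v))

  // UPDATE: rejected when the key is absent.
  def update(t: Table, k: String, v: String): Either[String, Table] =
    if (t.contains(k)) Right(t + (k -> v)) else Left(s"Not found: $k")

  // UPSERT: writes the row whether or not the key exists.
  def upsert(t: Table, k: String, v: String): Table = t + (k -> v)
}
```

For example, `WriteSemantics.update(Map.empty, "a", "1")` returns a `Left`, whereas `WriteSemantics.upsert(Map.empty, "a", "1")` returns a table containing the row.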
>>>>> 
>>>>> - Dan
>>>>> 
>>>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>>>>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>>>>> data. Now, I have to find answers to these questions. What would happen 
>>>>> if I “append” to the data in the Kudu table if the data already exists? 
>>>>> What would happen if I “overwrite” existing data when the DataFrame has 
>>>>> data in it that does not exist in the Kudu table? I need to evaluate the 
>>>>> best way to simulate the UPSERT behavior in HBase because this is what 
>>>>> our use case is.
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Now, I’m getting this error when trying to write to the table.
>>>>>> 
>>>>>> import scala.collection.JavaConverters._
>>>>>> val key_seq = Seq("my_id")
>>>>>> val key_list = List("my_id").asJava
>>>>>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>>>>> 
>>>>>> df.write
>>>>>>     .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>>>>>     .mode("overwrite")
>>>>>>     .kudu
>>>>>> 
>>>>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>>>>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key 
>>>>>> not found (error 0)Not found: key not found (error 0)Not found: key not 
>>>>>> found (error 0)Not found: key not found (error 0)
>>>>>> 
>>>>>> Does the key field need to be first in the DataFrame?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> Dan,
>>>>>>> 
>>>>>>> Thanks! It got further. Now, how do I set the Primary Key to be a 
>>>>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>>>>> 
>>>>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new 
>>>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>>>>>> 
>>>>>>> java.lang.IllegalArgumentException: Table partitioning must be 
>>>>>>> specified using setRangePartitionColumns or addHashPartitions
>>>>>>> 
>>>>>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of 
>>>>>>> primary key columns, so in this case you have specified the single PK 
>>>>>>> column "my_id".  The `addHashPartitions` call adds hash partitioning to 
>>>>>>> the table, in this case over the column "my_id" (which is good, it must 
>>>>>>> be over one or more PK columns, so in this case "my_id" is the one and 
>>>>>>> only valid combination).  However, the call to `addHashPartitions` also 
>>>>>>> takes the number of buckets as the second param.  You shouldn't get the 
>>>>>>> IllegalArgumentException as long as you are specifying either 
>>>>>>> `addHashPartitions` or `setRangePartitionColumns`.
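
To make the role of the bucket count concrete, here is a small Scala model of hash partitioning: each primary-key value is hashed and mapped to one of a fixed number of buckets. It uses Scala's built-in `hashCode` purely for illustration, which is an assumption and not the hash function Kudu uses; `HashBuckets` and `bucketFor` are hypothetical names.

```scala
object HashBuckets {
  // Map a primary-key value to a bucket index in [0, numBuckets).
  // hashCode is an illustrative stand-in for the real hash function.
  def bucketFor(key: String, numBuckets: Int): Int = {
    require(numBuckets > 0, "bucket count must be positive")
    // floorMod keeps the result non-negative even for negative hash codes.
    java.lang.Math.floorMod(key.hashCode, numBuckets)
  }
}
```

The same key always lands in the same bucket, and different keys spread across the buckets, which is why the call needs to be told how many buckets to create.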
>>>>>>> 
>>>>>>> - Dan
>>>>>>>  
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> 
>>>>>>>> Looks like we're missing an import statement in that example.  Could 
>>>>>>>> you try:
>>>>>>>> 
>>>>>>>> import org.kududb.client._
>>>>>>>> and try again?
>>>>>>>> 
>>>>>>>> - Dan
>>>>>>>> 
>>>>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> I encountered an error trying to create a table based on the 
>>>>>>>> documentation from a DataFrame.
>>>>>>>> 
>>>>>>>> <console>:49: error: not found: type CreateTableOptions
>>>>>>>>               kuduContext.createTable(tableName, df.schema, 
>>>>>>>> Seq("key"), new CreateTableOptions().setNumReplicas(1))
>>>>>>>> 
>>>>>>>> Is there something I’m missing?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>> 
>>>>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <[email protected] 
>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>> 
>>>>>>>>> It's only in Cloudera's maven repo: 
>>>>>>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>>>>>> 
>>>>>>>>> J-D
>>>>>>>>> 
>>>>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <[email protected] 
>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>> Hi J-D,
>>>>>>>>> 
>>>>>>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar 
>>>>>>>>> for spark-shell to use. Can you show me where to find it?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <[email protected] 
>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>> 
>>>>>>>>>> What's in this doc is what's gonna get released: 
>>>>>>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>>>>>>> 
>>>>>>>>>> J-D
>>>>>>>>>> 
>>>>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <[email protected] 
>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans 
>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> It will be in 0.9.0.
>>>>>>>>>>> 
>>>>>>>>>>> J-D
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <[email protected] 
>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>> Hi Chris,
>>>>>>>>>>> 
>>>>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George 
>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> There is some code in review that needs some more refinement.
>>>>>>>>>>>> It will allow upsert/insert from a dataframe using the datasource 
>>>>>>>>>>>> api. It will also allow the creation and deletion of tables from a 
>>>>>>>>>>>> dataframe
>>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>>>>>>>>>> 
>>>>>>>>>>>> Example usages will look something like:
>>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>>>>>>>>>>>> 
>>>>>>>>>>>> -Chris George
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <[email protected] 
>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>>>>>>>> 
>>>>>>>>>>>> Also, does anyone have any sample code on how to update/insert 
>>>>>>>>>>>> data in Kudu using DataFrames?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ben
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George 
>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> SparkSQL cannot support these types of statements but we may be 
>>>>>>>>>>>>> able to implement similar functionality through the api.
>>>>>>>>>>>>> -Chris
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <[email protected] 
>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an 
>>>>>>>>>>>>> “upsert” if it were to be implemented.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>>>>>>>  WHEN MATCHED THEN
>>>>>>>>>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>>>>>>>  WHEN NOT MATCHED THEN
>>>>>>>>>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George 
>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it 
>>>>>>>>>>>>>> into gerrit if you want to take a look. 
>>>>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>>>>>>>>>>>> It does pushdown predicates which the existing input formatter 
>>>>>>>>>>>>>> based rdd does not.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Within the next two weeks I’m planning to implement a datasource 
>>>>>>>>>>>>>> for spark that will have pushdown predicates and 
>>>>>>>>>>>>>> insertion/update functionality (need to look more at cassandra 
>>>>>>>>>>>>>> and the hbase datasource for the best way to do this). I agree that 
>>>>>>>>>>>>>> server side upsert would be helpful.
>>>>>>>>>>>>>> Having a datasource would give us useful data frames and also 
>>>>>>>>>>>>>> make spark sql usable for kudu.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> My reasoning for having a spark datasource and not using Impala 
>>>>>>>>>>>>>> is: 1. We have had trouble getting impala to run fast with high 
>>>>>>>>>>>>>> concurrency when compared to spark 2. We interact with 
>>>>>>>>>>>>>> datasources which do not integrate with impala. 3. We have 
>>>>>>>>>>>>>> custom sql query planners for extended sql functionality.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -Chris George
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <[email protected] 
>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> You guys make a convincing point, although on the upsert side 
>>>>>>>>>>>>>> we'll need more support from the servers. Right now all you can 
>>>>>>>>>>>>>> do is an INSERT then, if you get a dup key, do an UPDATE. I 
>>>>>>>>>>>>>> guess we could at least add an API on the client side that would 
>>>>>>>>>>>>>> manage it, but it wouldn't be atomic.
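
The client-side workaround described here (INSERT first, fall back to UPDATE on a duplicate key) can be sketched in Scala against a toy in-memory table. `ToyTable` and `ClientSideUpsert` are hypothetical stand-ins, not Kudu classes, and the Boolean return values are an assumed way of signalling duplicate-key and not-found outcomes.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a table client; not the Kudu API.
final class ToyTable {
  private val rows = mutable.Map.empty[String, String]

  // Returns false when the key already exists (duplicate key).
  def insert(k: String, v: String): Boolean =
    if (rows.contains(k)) false else { rows(k) = v; true }

  // Returns false when the key is absent (not found).
  def update(k: String, v: String): Boolean =
    if (rows.contains(k)) { rows(k) = v; true } else false

  def get(k: String): Option[String] = rows.get(k)
}

object ClientSideUpsert {
  // Try INSERT, and only if that hits a duplicate key, issue an UPDATE.
  // These are two separate operations, so the pair is not atomic: another
  // writer could delete the row between the failed insert and the update.
  def upsert(t: ToyTable, k: String, v: String): Boolean =
    t.insert(k, v) || t.update(k, v)
}
```

The gap between the two calls is the reason server-side support is needed for a true upsert.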
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra 
>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>>wrote:
>>>>>>>>>>>>>> It's pretty simple, actually.  I need to support versioned 
>>>>>>>>>>>>>> datasets in a Spark SQL environment.  Instead of a hack on top 
>>>>>>>>>>>>>> of a Parquet data store, I'm hoping (among other reasons) to be 
>>>>>>>>>>>>>> able to use Kudu's write and timestamp-based read operations to 
>>>>>>>>>>>>>> support not only appending data, but also updating existing 
>>>>>>>>>>>>>> data, and even some schema migration.  The most typical use case 
>>>>>>>>>>>>>> is a dataset that is updated periodically (e.g., weekly or 
>>>>>>>>>>>>>> monthly) in which the preliminary data in the previous 
>>>>>>>>>>>>>> window (week or month) is updated with values that are expected 
>>>>>>>>>>>>>> to remain unchanged from then on, and a new set of preliminary 
>>>>>>>>>>>>>> values for the current window need to be added/appended.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Using Kudu's Java API and developing additional functionality on 
>>>>>>>>>>>>>> top of what Kudu has to offer isn't too much to ask, but the 
>>>>>>>>>>>>>> ease of integration with Spark SQL will gate how quickly we 
>>>>>>>>>>>>>> would move to using Kudu and how seriously we'd look at 
>>>>>>>>>>>>>> alternatives before making that decision. 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans 
>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>>wrote:
>>>>>>>>>>>>>> Mark,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it 
>>>>>>>>>>>>>> caught the attention of other folks!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark 
>>>>>>>>>>>>>> Hamstra<[email protected] 
>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I care about insert into Kudu with Spark SQL.  I'm currently 
>>>>>>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert 
>>>>>>>>>>>>>> functionality while trying to evaluate what to expect from Kudu. 
>>>>>>>>>>>>>>  Whether Kudu does a good job supporting inserts with Spark SQL 
>>>>>>>>>>>>>> will be a key consideration as to whether we adopt Kudu.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary 
>>>>>>>>>>>>>> for you. Is it just that you currently do it that way into some 
>>>>>>>>>>>>>> database or parquet so with minimal refactoring you'd be able to 
>>>>>>>>>>>>>> use Kudu? Would re-writing those SQL lines into Scala and 
>>>>>>>>>>>>>> directly using the Java API's KuduSession be too much work?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu VS your 
>>>>>>>>>>>>>> current solution? If it's not completely clear, I'd love to help 
>>>>>>>>>>>>>> you think through it.
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> What are your DS folks looking for in terms of functionality 
>>>>>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully 
>>>>>>>>>>>>>> featured as Impala's? Do they care about being able to insert into 
>>>>>>>>>>>>>> Kudu with SparkSQL or just being able to query real fast? 
>>>>>>>>>>>>>> Anything more specific to Spark that I'm missing?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At 
>>>>>>>>>>>>>> Cloudera all our resources are committed to making things happen 
>>>>>>>>>>>>>> in time, and a more fully featured Spark integration isn't in 
>>>>>>>>>>>>>> our plans during that period. I'm really hoping someone in the 
>>>>>>>>>>>>>> community will help with Spark, the same way we got a big 
>>>>>>>>>>>>>> contribution for the Flume sink. 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim 
>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>>wrote:
>>>>>>>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. 
>>>>>>>>>>>>>> But, since it’s not “production-ready”, upper management doesn’t 
>>>>>>>>>>>>>> want to fully deploy it yet. They just want to keep an eye on it 
>>>>>>>>>>>>>> though. Kudu was so much simpler and easier to use in every 
>>>>>>>>>>>>>> aspect compared to HBase. Impala was great for the report 
>>>>>>>>>>>>>> writers and analysts to experiment with for the short time it 
>>>>>>>>>>>>>> was up. But, once again, the only blocker was the lack of Spark 
>>>>>>>>>>>>>> support for our Data Developers/Scientists. So, production-level 
>>>>>>>>>>>>>> data population won’t happen until then.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim 
>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The main thing I hear is that Cassandra is being used as an 
>>>>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken 
>>>>>>>>>>>>>>> care of and idempotency is maintained. Whether data was 
>>>>>>>>>>>>>>> directly retrieved from Cassandra for analytics, reports, or 
>>>>>>>>>>>>>>> searches, it was not clear as to what was its main use. Some 
>>>>>>>>>>>>>>> also just used it for a staging area to populate downstream 
>>>>>>>>>>>>>>> tables in parquet format. The last thing I heard was that CQL 
>>>>>>>>>>>>>>> was terrible, so that rules out much use of direct queries 
>>>>>>>>>>>>>>> against it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real 
>>>>>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs. 
>>>>>>>>>>>>>>> Even then, Kudu should beat it easily on big scans. Same for 
>>>>>>>>>>>>>>> HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As for our company, we have been looking for an updatable data 
>>>>>>>>>>>>>>> store for a long time that can be quickly queried directly 
>>>>>>>>>>>>>>> either using Spark SQL or Impala or some other SQL engine and 
>>>>>>>>>>>>>>> still handle TB or PB of data without performance degradation 
>>>>>>>>>>>>>>> and many configuration headaches. For now, we are using HBase 
>>>>>>>>>>>>>>> to take on this role with Phoenix as a fast way to directly 
>>>>>>>>>>>>>>> query the data. I can see Kudu as the best way to fill this gap 
>>>>>>>>>>>>>>> easily, especially being the closest thing to other relational 
>>>>>>>>>>>>>>> databases out there in familiarity for the many SQL analytics 
>>>>>>>>>>>>>>> people in our company. The other alternative would be to go 
>>>>>>>>>>>>>>> with AWS Redshift for the same reasons, but it would come at a 
>>>>>>>>>>>>>>> cost, of course. If we went with either solution, Kudu or 
>>>>>>>>>>>>>>> Redshift, it would get rid of the need to extract from HBase to 
>>>>>>>>>>>>>>> parquet tables or export to PostgreSQL to support more of the 
>>>>>>>>>>>>>>> SQL language used by analysts or the reporting software we 
>>>>>>>>>>>>>>> use.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off 
>>>>>>>>>>>>>>> with Kudu. Have you folks tried Kudu with Impala yet with those 
>>>>>>>>>>>>>>> use cases?
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Ben 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like 
>>>>>>>>>>>>>>>> to refer to "Impala + Kudu" as Kimpala, but yeah it's not as 
>>>>>>>>>>>>>>>> sexy. My colleagues who were also there did say that the hype 
>>>>>>>>>>>>>>>> around Spark isn't dying down.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, 
>>>>>>>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that 
>>>>>>>>>>>>>>>> C* is just an interim solution for the use case you describe.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month, it's 
>>>>>>>>>>>>>>>> a storage engine so things move slowly *smile*. I'd love to 
>>>>>>>>>>>>>>>> see more contributions on the Spark front. I know there's code 
>>>>>>>>>>>>>>>> out there that could be integrated in kudu-spark, it just 
>>>>>>>>>>>>>>>> needs to land in gerrit. I'm sure folks will happily review it.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to 
>>>>>>>>>>>>>>>> learn more about the use cases for which you envision using 
>>>>>>>>>>>>>>>> Kudu as a C* replacement.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim 
>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They 
>>>>>>>>>>>>>>>> told me that everything was about Spark and there is a big 
>>>>>>>>>>>>>>>> buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra, 
>>>>>>>>>>>>>>>> Kafka). I still think that Cassandra is just an interim 
>>>>>>>>>>>>>>>> solution as a low-latency, easily queried data store. I was 
>>>>>>>>>>>>>>>> wondering if anything significant happened in regards to Kudu, 
>>>>>>>>>>>>>>>> especially on the Spark front. Plus, can you come up with your 
>>>>>>>>>>>>>>>> own proposed stack acronym to promote?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. 
>>>>>>>>>>>>>>>>> I know of one person on the Kudu Slack who's working on a 
>>>>>>>>>>>>>>>>> better RDD, but that's about it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim 
>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to 
>>>>>>>>>>>>>>>>> target a version of Kudu to begin real testing of Spark 
>>>>>>>>>>>>>>>>> against it for our devs. At least, I can tell them what 
>>>>>>>>>>>>>>>>> timeframe to anticipate.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Just curious,
>>>>>>>>>>>>>>>>> Benjamin Kim
>>>>>>>>>>>>>>>>> Data Solutions Architect
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Mobile: +1 818 635 2900
>>>>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  | 
>>>>>>>>>>>>>>>>>  www.amobee.com <http://www.amobee.com/>
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's 
>>>>>>>>>>>>>>>>>> needed either.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format, ideally 
>>>>>>>>>>>>>>>>>> we'd use scans directly.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of 
>>>>>>>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The goal was to provide something for others to contribute 
>>>>>>>>>>>>>>>>>> to. We have some basic unit tests that others can easily 
>>>>>>>>>>>>>>>>>> extend. None of us on the team are Spark experts, but we'd 
>>>>>>>>>>>>>>>>>> be really happy to help anyone improve the kudu-spark code.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim 
>>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements 
>>>>>>>>>>>>>>>>>> (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides 
>>>>>>>>>>>>>>>>>> shoring up more Spark SQL functionality (Dataframes) and 
>>>>>>>>>>>>>>>>>> doing the documentation, what more needs to be done? 
>>>>>>>>>>>>>>>>>> Optimizations?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I believe that it’s a good place to start using Spark with 
>>>>>>>>>>>>>>>>>> Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans 
>>>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get 
>>>>>>>>>>>>>>>>>>> this in for 0.7.0: 
>>>>>>>>>>>>>>>>>>> https://issues.cloudera.org/browse/KUDU-1321 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL 
>>>>>>>>>>>>>>>>>>> on Kudu, but it will require a lot more work to make it 
>>>>>>>>>>>>>>>>>>> fast/useful.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim 
>>>>>>>>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>> I see this KUDU-1214 
>>>>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for 
>>>>>>>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, 
>>>>>>>>>>>>>>>>>>> will this mean that Spark will be able to work with Kudu 
>>>>>>>>>>>>>>>>>>> both programmatically and as a client via Spark SQL? Or is 
>>>>>>>>>>>>>>>>>>> there more work that needs to be done on the Spark side for 
>>>>>>>>>>>>>>>>>>> it to work?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>>>> 
>> -- 
>> Todd Lipcon
>> Software Engineer, Cloudera
>