Hi,

Now, I’m getting this error when trying to write to the table.

import scala.collection.JavaConverters._
val key_seq = Seq("my_id")
val key_list = List("my_id").asJava
kuduContext.createTable(tableName, df.schema, key_seq,
  new CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))

df.write
    .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
    .mode("overwrite")
    .kudu

java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu;
sample errors:
  Not found: key not found (error 0)
  Not found: key not found (error 0)
  Not found: key not found (error 0)
  Not found: key not found (error 0)
  Not found: key not found (error 0)

Does the key field need to be first in the DataFrame?
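
As an aside, one hedged guess: if "overwrite" maps to update operations rather than inserts in the 0.9 datasource, then writing to a freshly created (empty) table would fail exactly this way. A minimal sketch of the same write with "append" instead; the mode mapping is an assumption on my part, not something I've confirmed:

// Assumption: "append" issues inserts, while "overwrite" issues updates.
df.write
    .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
    .mode("append")
    .kudu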

Thanks,
Ben

> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com> wrote:
> 
> 
> 
> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Dan,
> 
> Thanks! It got further. Now, how do I set the primary key to one or more
> columns in the DataFrame and set the partitioning? Is it like this?
> 
> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new
> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
> 
> java.lang.IllegalArgumentException: Table partitioning must be specified 
> using setRangePartitionColumns or addHashPartitions
> 
> Yep.  The `Seq("my_id")` part of that call is specifying the set of primary 
> key columns, so in this case you have specified the single PK column "my_id". 
>  The `addHashPartitions` call adds hash partitioning to the table, in this 
> case over the column "my_id" (which is good, it must be over one or more PK 
> columns, so in this case "my_id" is the one and only valid combination).  
> However, the call to `addHashPartitions` also takes the number of buckets as 
> the second param.  You shouldn't get the IllegalArgumentException as long as 
> you are specifying either `addHashPartitions` or `setRangePartitionColumns`.
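> 
> Putting that together, a minimal sketch of the corrected call (the bucket
> count is illustrative; Kudu requires at least 2):
> 
> import scala.collection.JavaConverters._
> import org.kududb.client._
> 
> // addHashPartitions takes a java.util.List of column names plus the
> // number of buckets; 4 here is an arbitrary example value.
> val options = new CreateTableOptions()
>   .setNumReplicas(1)
>   .addHashPartitions(List("my_id").asJava, 4)
> kuduContext.createTable(tableName, df.schema, Seq("my_id"), options)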
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
> 
>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote:
>> 
>> Looks like we're missing an import statement in that example.  Could you try:
>> 
>> import org.kududb.client._
>> and try again?
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> I encountered an error trying to create a table based on the documentation 
>> from a DataFrame.
>> 
>> <console>:49: error: not found: type CreateTableOptions
>>               kuduContext.createTable(tableName, df.schema, Seq("key"), new 
>> CreateTableOptions().setNumReplicas(1))
>> 
>> Is there something I’m missing?
>> 
>> Thanks,
>> Ben
>> 
>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>> 
>>> It's only in Cloudera's maven repo: 
>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>> 
>>> J-D
>>> 
>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Hi J-D,
>>> 
>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
>>> spark-shell to use. Can you show me where to find it?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>> 
>>>> What's in this doc is what's gonna get released: 
>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>> 
>>>> J-D
>>>> 
>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> Will this be documented with examples once 0.9.0 comes out?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>> 
>>>>> It will be in 0.9.0.
>>>>> 
>>>>> J-D
>>>>> 
>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>> Hi Chris,
>>>>> 
>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>> 
>>>>>> There is some code in review that needs some more refinement.
>>>>>> It will allow upsert/insert from a DataFrame using the datasource API.
>>>>>> It will also allow the creation and deletion of tables from a DataFrame.
>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>>>> 
>>>>>> Example usages will look something like:
>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
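>>>>>> 
>>>>>> A rough sketch of the shape of that API (package and method names are
>>>>>> taken from the patch under review and may change before it lands):
>>>>>> 
>>>>>> import org.kududb.spark.kudu._
>>>>>> 
>>>>>> // Read a Kudu table into a DataFrame through the datasource.
>>>>>> val df = sqlContext.read
>>>>>>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>>>>   .kudu
>>>>>> 
>>>>>> // Write a DataFrame back out.
>>>>>> df.write
>>>>>>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>>>>   .mode("append")
>>>>>>   .kudu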
>>>>>> 
>>>>>> -Chris George
>>>>>> 
>>>>>> 
>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>> 
>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>> 
>>>>>> Also, does anyone have any sample code on how to update/insert data in 
>>>>>> Kudu using DataFrames?
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>> 
>>>>>>> SparkSQL cannot support these types of statements, but we may be able
>>>>>>> to implement similar functionality through the API.
>>>>>>> -Chris
>>>>>>> 
>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>>> 
>>>>>>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if 
>>>>>>> it were to be implemented.
>>>>>>> 
>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>  WHEN MATCHED THEN
>>>>>>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>  WHEN NOT MATCHED THEN
>>>>>>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Ben
>>>>>>> 
>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com> wrote:
>>>>>>>> 
>>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it into 
>>>>>>>> gerrit if you want to take a look. 
>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/ 
>>>>>>>> It does push down predicates, which the existing input-format-based
>>>>>>>> RDD does not.
>>>>>>>> 
>>>>>>>> Within the next two weeks I'm planning to implement a datasource for
>>>>>>>> Spark that will have pushdown predicates and insert/update functionality
>>>>>>>> (I need to look more at the Cassandra and HBase datasources for the best
>>>>>>>> way to do this). I agree that server-side upsert would be helpful.
>>>>>>>> Having a datasource would give us useful DataFrames and also make
>>>>>>>> Spark SQL usable for Kudu.
>>>>>>>> 
>>>>>>>> My reasoning for having a Spark datasource and not using Impala:
>>>>>>>> 1. We have had trouble getting Impala to run fast with high concurrency
>>>>>>>>    when compared to Spark.
>>>>>>>> 2. We interact with datasources which do not integrate with Impala.
>>>>>>>> 3. We have custom SQL query planners for extended SQL functionality.
>>>>>>>> 
>>>>>>>> -Chris George
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> You guys make a convincing point, although on the upsert side we'll 
>>>>>>>> need more support from the servers. Right now all you can do is an 
>>>>>>>> INSERT then, if you get a dup key, do an UPDATE. I guess we could at 
>>>>>>>> least add an API on the client side that would manage it, but it 
>>>>>>>> wouldn't be atomic.
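>>>>>>>> 
>>>>>>>> As a sketch, that client-side fallback would look roughly like this
>>>>>>>> (method names approximate the Java client; the duplicate-key check in
>>>>>>>> particular may differ between versions, so treat it as an assumption):
>>>>>>>> 
>>>>>>>> val session = client.newSession()        // assumes an open KuduClient
>>>>>>>> val table = client.openTable(tableName)  // and an existing table
>>>>>>>> val insert = table.newInsert()
>>>>>>>> insert.getRow.addString("my_id", id)
>>>>>>>> val resp = session.apply(insert)
>>>>>>>> // On a duplicate key, retry as an UPDATE. Not atomic: another writer
>>>>>>>> // can slip in between the two operations.
>>>>>>>> if (resp.hasRowError && resp.getRowError.getErrorStatus.isAlreadyPresent) {
>>>>>>>>   val update = table.newUpdate()
>>>>>>>>   update.getRow.addString("my_id", id)
>>>>>>>>   session.apply(update)
>>>>>>>> }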
>>>>>>>> 
>>>>>>>> J-D
>>>>>>>> 
>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>>> It's pretty simple, actually.  I need to support versioned datasets in 
>>>>>>>> a Spark SQL environment.  Instead of a hack on top of a Parquet data 
>>>>>>>> store, I'm hoping (among other reasons) to be able to use Kudu's write 
>>>>>>>> and timestamp-based read operations to support not only appending 
>>>>>>>> data, but also updating existing data, and even some schema migration. 
>>>>>>>>  The most typical use case is a dataset that is updated periodically 
>>>>>>>> (e.g., weekly or monthly) in which the preliminary data in the 
>>>>>>>> previous window (week or month) is updated with values that are 
>>>>>>>> expected to remain unchanged from then on, and a new set of 
>>>>>>>> preliminary values for the current window need to be added/appended.
>>>>>>>> 
>>>>>>>> Using Kudu's Java API and developing additional functionality on top 
>>>>>>>> of what Kudu has to offer isn't too much to ask, but the ease of 
>>>>>>>> integration with Spark SQL will gate how quickly we would move to 
>>>>>>>> using Kudu and how seriously we'd look at alternatives before making 
>>>>>>>> that decision. 
>>>>>>>> 
>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>> Mark,
>>>>>>>> 
>>>>>>>> Thanks for taking some time to reply in this thread, glad it caught 
>>>>>>>> the attention of other folks!
>>>>>>>> 
>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>> 
>>>>>>>> I care about inserting into Kudu with Spark SQL.  I'm currently delaying 
>>>>>>>> a refactoring of some Spark SQL-oriented insert functionality while 
>>>>>>>> trying to evaluate what to expect from Kudu.  Whether Kudu does a good 
>>>>>>>> job supporting inserts with Spark SQL will be a key consideration as 
>>>>>>>> to whether we adopt Kudu.
>>>>>>>> 
>>>>>>>> I'd like to know more about why SparkSQL insert support is necessary for
>>>>>>>> you. Is it just that you currently do it that way into some database or
>>>>>>>> Parquet, so with minimal refactoring you'd be able to use Kudu? Would
>>>>>>>> rewriting those SQL lines in Scala and directly using the Java API's
>>>>>>>> KuduSession be too much work?
>>>>>>>> 
>>>>>>>> Additionally, what do you expect to gain from using Kudu VS your 
>>>>>>>> current solution? If it's not completely clear, I'd love to help you 
>>>>>>>> think through it.
>>>>>>>>  
>>>>>>>> 
>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>> Yup, starting to get a good idea.
>>>>>>>> 
>>>>>>>> What are your DS folks looking for in terms of functionality related 
>>>>>>>> to Spark? A SparkSQL integration that's as fully featured as Impala's? 
>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL or just 
>>>>>>>> being able to query real fast? Anything more specific to Spark that 
>>>>>>>> I'm missing?
>>>>>>>> 
>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera 
>>>>>>>> all our resources are committed to making things happen in time, and a 
>>>>>>>> more fully featured Spark integration isn't in our plans during that 
>>>>>>>> period. I'm really hoping someone in the community will help with 
>>>>>>>> Spark, the same way we got a big contribution for the Flume sink. 
>>>>>>>> 
>>>>>>>> J-D
>>>>>>>> 
>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, 
>>>>>>>> since it’s not “production-ready”, upper management doesn’t want to 
>>>>>>>> fully deploy it yet. They just want to keep an eye on it though. Kudu 
>>>>>>>> was so much simpler and easier to use in every aspect compared to 
>>>>>>>> HBase. Impala was great for the report writers and analysts to 
>>>>>>>> experiment with for the short time it was up. But, once again, the 
>>>>>>>> only blocker was the lack of Spark support for our Data 
>>>>>>>> Developers/Scientists. So, production-level data population won’t 
>>>>>>>> happen until then.
>>>>>>>> 
>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Ben
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>> J-D,
>>>>>>>>> 
>>>>>>>>> The main thing I hear is that Cassandra is being used as an updatable
>>>>>>>>> hot data store to ensure that duplicates are taken care of and
>>>>>>>>> idempotency is maintained. Whether data was directly retrieved from
>>>>>>>>> Cassandra for analytics, reports, or searches was never clear to me,
>>>>>>>>> nor was its main use. Some also just used it as a staging area to
>>>>>>>>> populate downstream tables in Parquet format. The last thing I heard
>>>>>>>>> was that CQL was terrible, so that rules out much use of direct
>>>>>>>>> queries against it.
>>>>>>>>> 
>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real analytics, 
>>>>>>>>> just ease of use instead of plainly using the APIs. Even then, Kudu 
>>>>>>>>> should beat it easily on big scans. Same for HBase. We've done 
>>>>>>>>> benchmarks against the latter, not the former.
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> As for our company, we have been looking for an updatable data store 
>>>>>>>>> for a long time that can be quickly queried directly either using 
>>>>>>>>> Spark SQL or Impala or some other SQL engine and still handle TB or 
>>>>>>>>> PB of data without performance degradation and many configuration 
>>>>>>>>> headaches. For now, we are using HBase to take on this role with 
>>>>>>>>> Phoenix as a fast way to directly query the data. I can see Kudu as 
>>>>>>>>> the best way to fill this gap easily, especially being the closest 
>>>>>>>>> thing to other relational databases out there in familiarity for the 
>>>>>>>>> many SQL analytics people in our company. The other alternative would 
>>>>>>>>> be to go with AWS Redshift for the same reasons, but it would come at 
>>>>>>>>> a cost, of course. If we went with either solution, Kudu or
>>>>>>>>> Redshift, it would get rid of the need to extract from HBase to
>>>>>>>>> Parquet tables or export to PostgreSQL to support more of the SQL
>>>>>>>>> language used by analysts or the reporting software we use.
>>>>>>>>> 
>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off with 
>>>>>>>>> Kudu. Have you folks tried Kudu with Impala yet with those use cases?
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> I hope this helps.
>>>>>>>>> 
>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Ben 
>>>>>>>>> 
>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to 
>>>>>>>>>> refer to "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My 
>>>>>>>>>> colleagues who were also there did say that the hype around Spark 
>>>>>>>>>> isn't dying down.
>>>>>>>>>> 
>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra, 
>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that C* is 
>>>>>>>>>> just an interim solution for the use case you describe.
>>>>>>>>>> 
>>>>>>>>>> Nothing significant happened in Kudu over the past month, it's a 
>>>>>>>>>> storage engine so things move slowly *smile*. I'd love to see more 
>>>>>>>>>> contributions on the Spark front. I know there's code out there that 
>>>>>>>>>> could be integrated in kudu-spark, it just needs to land in gerrit. 
>>>>>>>>>> I'm sure folks will happily review it.
>>>>>>>>>> 
>>>>>>>>>> Do you have relevant experiences you can share? I'd love to learn 
>>>>>>>>>> more about the use cases for which you envision using Kudu as a C* 
>>>>>>>>>> replacement.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> 
>>>>>>>>>> J-D
>>>>>>>>>> 
>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>> Hi J-D,
>>>>>>>>>> 
>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They told 
>>>>>>>>>> me that everything was about Spark and there is a big buzz about the 
>>>>>>>>>> SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think 
>>>>>>>>>> that Cassandra is just an interim solution as a low-latency, easily 
>>>>>>>>>> queried data store. I was wondering if anything significant happened 
>>>>>>>>>> in regards to Kudu, especially on the Spark front. Plus, can you 
>>>>>>>>>> come up with your own proposed stack acronym to promote?
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> Ben
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Ben,
>>>>>>>>>>> 
>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline. I know 
>>>>>>>>>>> of one person on the Kudu Slack who's working on a better RDD, but 
>>>>>>>>>>> that's about it.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> 
>>>>>>>>>>> J-D
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <b...@amobee.com> wrote:
>>>>>>>>>>> Hi J-D,
>>>>>>>>>>> 
>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to target a 
>>>>>>>>>>> version of Kudu to begin real testing of Spark against it for our 
>>>>>>>>>>> devs. At least, I can tell them what timeframe to anticipate.
>>>>>>>>>>> 
>>>>>>>>>>> Just curious,
>>>>>>>>>>> Benjamin Kim
>>>>>>>>>>> Data Solutions Architect
>>>>>>>>>>> 
>>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>>>> 
>>>>>>>>>>> Mobile: +1 818 635 2900
>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200  |  Santa Monica, CA 90405  |  
>>>>>>>>>>> www.amobee.com
>>>>>>>>>>> 
>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's needed 
>>>>>>>>>>>> either.
>>>>>>>>>>>> 
>>>>>>>>>>>> The kuduRDD just leverages the MR input format; ideally we'd use
>>>>>>>>>>>> scans directly.
>>>>>>>>>>>> 
>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of 
>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>> 
>>>>>>>>>>>> The goal was to provide something for others to contribute to. We 
>>>>>>>>>>>> have some basic unit tests that others can easily extend. None of 
>>>>>>>>>>>> us on the team are Spark experts, but we'd be really happy to
>>>>>>>>>>>> assist anyone improving the kudu-spark code.
>>>>>>>>>>>> 
>>>>>>>>>>>> J-D
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>> J-D,
>>>>>>>>>>>> 
>>>>>>>>>>>> It looks like it fulfills most of the basic requirements (kudu 
>>>>>>>>>>>> RDD, kudu DStream) in KUDU-1214. Am I right? Besides shoring up 
>>>>>>>>>>>> more Spark SQL functionality (Dataframes) and doing the 
>>>>>>>>>>>> documentation, what more needs to be done? Optimizations?
>>>>>>>>>>>> 
>>>>>>>>>>>> I believe it's a good place to start using Spark with Kudu and
>>>>>>>>>>>> comparing it to HBase with Spark (which is not as clean).
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ben
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get this in 
>>>>>>>>>>>>> for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321 
>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1321>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL on 
>>>>>>>>>>>>> Kudu, but it will require a lot more work to make it fast/useful.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> J-D
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>>>>>>>>>> I see this KUDU-1214 
>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for 
>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete, will 
>>>>>>>>>>>>> this mean that Spark will be able to work with Kudu both 
>>>>>>>>>>>>> programmatically and as a client via Spark SQL? Or is there more 
>>>>>>>>>>>>> work that needs to be done on the Spark side for it to work?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
