[ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Todd Lipcon
The Apache Kudu team is happy to announce the release of Kudu 1.0.0!

Kudu is an open source storage engine for structured data which supports
low-latency random access together with efficient analytical access
patterns. It is designed within the context of the Apache Hadoop ecosystem
and supports many integrations with other data analytics projects both
inside and outside of the Apache Software Foundation.

This latest version adds several new features, including:

- Removal of multiversion concurrency control (MVCC) history is now
supported. This allows Kudu to reclaim disk space, where previously Kudu
would keep a full history of all changes made to a given table since the
beginning of time.

- Most of Kudu’s command line tools have been consolidated under a new
top-level "kudu" tool. This reduces the number of large binaries
distributed with Kudu and also includes much-improved help output.

- Administrative tools including "kudu cluster ksck" now support running
against multi-master Kudu clusters.

- The C++ client API now supports writing data in AUTO_FLUSH_BACKGROUND
mode. This can provide higher throughput for ingest workloads.
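
To illustrate what this mode buys you, here is a minimal sketch using the
equivalent flush mode in the Java client, callable from Scala (the
announcement refers to the new C++ support; the master address, table, and
column names below are placeholders):

import org.apache.kudu.client.{KuduClient, SessionConfiguration}

val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
val session = client.newSession()
// Buffer operations client-side and flush them in the background,
// rather than round-tripping to the server on every apply().
session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND)

val table = client.openTable("my_table")
val insert = table.newInsert()
insert.getRow.addInt("key", 1)
session.apply(insert) // returns quickly; the write is flushed asynchronously
session.close()       // blocks until all buffered operations are flushed
client.shutdown()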

This release also includes many bug fixes, optimizations, and other
improvements, detailed in the release notes available at:
http://kudu.apache.org/releases/1.0.0/docs/release_notes.html

Download the source release here:
http://kudu.apache.org/releases/1.0.0/

Convenience binary artifacts for the Java client and various Java
integrations (e.g. Spark, Flume) are also now available via the ASF Maven
repository.

Enjoy the new release!

- The Apache Kudu team


Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Jacques Nadeau
Congrats to everyone. This is a great accomplishment!

On Tue, Sep 20, 2016 at 12:11 AM, Todd Lipcon wrote:

> [...]


Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Aminul Islam
Congrats
On Sep 20, 2016 9:35 PM, "Jacques Nadeau" wrote:

> Congrats to everyone. This is a great accomplishment!
>
> [...]


Re: Casual meetup/happy hour at Strata?

2016-09-20 Thread Todd Lipcon
Sounds like there's enough interest (I had a few other people ping me via
Slack).

Any locals have a good suggestion of a spot to meet? Probably somewhere
near the Javits would be easiest for everyone, though I'd be happy to go
elsewhere as well if anyone knows a nice spot that wouldn't be too crowded
during the conference.

-Todd

On Sat, Sep 17, 2016 at 7:12 PM, Clifford Resnick wrote:

> +1. We're just starting with Kudu, but it would be nice to meet other
> users, and a casual Q & A would be great if you're up for it!
>
> On Sep 17, 2016 9:05 PM, Todd Lipcon wrote:
>
> Hey Kudu users,
>
> I'll be in NYC for the last week in September for Strata/Hadoop World. I
> imagine some others might be as well, and wanted to gauge interest in doing
> a casual meetup or happy hour on Wednesday night of that week. No
> presentations or anything, just pick a time and place and whoever's around
> can drop by and put some faces to names.
>
> Let me know if you're interested - if not enough people are around, I'll
> can the idea, but if it seems there are at least a few people in town it
> might be fun.
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Benjamin Kim
This is awesome!!! Great!!!

Do you know if any improvements were also made to the Spark plugin jar?

Thanks,
Ben

> On Sep 20, 2016, at 12:11 AM, Todd Lipcon wrote:
>
> [...]



Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Todd Lipcon
-announce


On Tue, Sep 20, 2016 at 11:34 AM, Benjamin Kim wrote:

> This is awesome!!! Great!!!
>
> Do you know if any improvements were also made to the Spark plugin jar?
>

Looks like a few changes based on the git log:
https://gist.github.com/4fa3ccc3b9be787227fed89c1bd42837

as well as a number of changes to the Java client (which gets pulled into
the Spark jar):
https://gist.github.com/e2a8ca78e51773fabb70aae34207199f


In particular, I think the partition pruning work in the Java client should
reduce the number of Spark partitions if you have predicates on your data
frames (though I haven't personally verified it).
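
A hedged sketch of where that should show up (assumes a spark-shell-style
sqlContext; the master address, table, and "my_id" key column are
placeholders):

import org.apache.kudu.spark.kudu._

val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu-master:7051",
               "kudu.table" -> "my_table"))
  .kudu

// The predicate is pushed down to Kudu; with partition pruning, tablets
// whose key ranges cannot match are skipped, so the scan should create
// fewer Spark partitions.
df.filter(df("my_id") >= 1000).count()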

-Todd



> On Sep 20, 2016, at 12:11 AM, Todd Lipcon wrote:
>
> [...]


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Benjamin Kim
Todd,

Thanks. I’ll look into those.

Cheers,
Ben


> On Sep 20, 2016, at 12:11 AM, Todd Lipcon wrote:
>
> [...]



Re: Spark on Kudu

2016-09-20 Thread Benjamin Kim
Now that Kudu 1.0.0 is officially out and ready for production use, where do we 
find the spark connector jar for this release?

Thanks,
Ben

> On Jun 17, 2016, at 11:08 AM, Dan Burkert wrote:
> 
> Hi Ben,
> 
> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
> not think we support that at this point.  I haven't looked deeply into it, 
> but we may hit issues specifying Kudu-specific options (partitioning, column 
> encoding, etc.).  Probably issues that can be worked through eventually, 
> though.  If you are interested in contributing to Kudu, this is an area that 
> could obviously use improvement!  Most or all of our Spark features have been 
> completely community driven to date.
>  
> I am assuming that more Spark support along with semantic changes below will 
> be incorporated into Kudu 0.9.1.
> 
> As a rule we do not release new features in patch releases, but the good news
> is that we are releasing regularly, and our next scheduled release is for the
> August timeframe (see JD's roadmap email about what we are aiming to
> include).  Also, Cloudera does publish snapshot versions of the Spark
> connector, so the jars are available if you don't mind using snapshots.
>  
> Anyone know of a better way to make unique primary keys other than using UUID 
> to make every row unique if there is no unique column (or combination 
> thereof) to use.
> 
> Not that I know of.  In general it's pretty rare to have a dataset without a 
> natural primary key (even if it's just all of the columns), but in those 
> cases UUID is a good solution.
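
A hedged sketch of that UUID approach on a Spark DataFrame (the column name
"my_id" is a placeholder):

import java.util.UUID
import org.apache.spark.sql.functions.udf

// Assign each row a random UUID to act as a surrogate primary key.
val uuid = udf(() => UUID.randomUUID().toString)
val withKey = df.withColumn("my_id", uuid())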
>  
> This is what I am using. I know auto incrementing is coming down the line 
> (don’t know when), but is there a way to simulate this in Kudu using Spark 
> out of curiosity?
> 
> To my knowledge there is no plan to have auto increment in Kudu.  
> Distributed, consistent, auto incrementing counters is a difficult problem, 
> and I don't think there are any known solutions that would be fast enough for 
> Kudu (happy to be proven wrong, though!).
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 6:08 PM, Dan Burkert wrote:
>> 
>> I'm not sure exactly what the semantics will be, but at least one of them 
>> will be upsert.  These modes come from spark, and they were really designed 
>> for file-backed storage and not table storage.  We may want to do append = 
>> upsert, and overwrite = truncate + insert.  I think that may match the 
>> normal spark semantics more closely.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim wrote:
>> Dan,
>> 
>> Thanks for the information. That would mean both “append” and “overwrite” 
>> modes would be combined or not needed in the future.
>> 
>> Cheers,
>> Ben
>> 
>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert wrote:
>>> 
>>> Right now append uses an update Kudu operation, which requires the row 
>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>> into the Spark connector.  So pretty soon these sharp edges will get a lot 
>>> better, since upsert is the way to go for most spark workloads.
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim wrote:
>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>>> data. Now, I have to find answers to these questions. What would happen if 
>>> I “append” to the data in the Kudu table if the data already exists? What 
>>> would happen if I “overwrite” existing data when the DataFrame has data in 
>>> it that does not exist in the Kudu table? I need to evaluate the best way 
>>> to simulate the UPSERT behavior in HBase because this is what our use case 
>>> is.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> 
>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim wrote:
>>>>
>>>> Hi,
>>>>
>>>> Now, I’m getting this error when trying to write to the table.
>>>>
>>>> import scala.collection.JavaConverters._
>>>> // Key columns: a Seq for the table schema and a Java List for the
>>>> // hash-partitioning option.
>>>> val key_seq = Seq("my_id")
>>>> val key_list = List("my_id").asJava
>>>> kuduContext.createTable(tableName, df.schema, key_seq, new
>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>>>
>>>> df.write
>>>>   .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
>>>>   .mode("overwrite")
>>>>   .kudu
>>>>
>>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to
>>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not
>>>> found (error 0)Not found: key not found (error 0)Not

Re: Spark on Kudu

2016-09-20 Thread Dan Burkert
Hi Benjamin,

The Spark connector jar can be found in the Apache Maven repository.

Maven coordinates:

<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-spark_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

<repository>
  <id>apache.releases</id>
  <name>Apache Release Repository</name>
  <url>https://repository.apache.org/releases</url>
</repository>
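
For sbt users, a hypothetical equivalent (the explicit resolver is only
needed if the artifacts have not yet synced to Maven Central):

resolvers += "Apache Releases" at "https://repository.apache.org/releases"
libraryDependencies += "org.apache.kudu" % "kudu-spark_2.10" % "1.0.0"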



On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim wrote:

> Now that Kudu 1.0.0 is officially out and ready for production use, where
> do we find the spark connector jar for this release?
>
> Thanks,
> Ben
>
>
> On Jun 17, 2016, at 11:08 AM, Dan Burkert wrote:
>
> [...]

Re: Spark on Kudu

2016-09-20 Thread Todd Lipcon
On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim wrote:

> Now that Kudu 1.0.0 is officially out and ready for production use, where
> do we find the spark connector jar for this release?
>
>
It's available in the official ASF Maven repository:
https://repository.apache.org/#nexus-search;quick~kudu-spark


<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-spark_2.10</artifactId>
  <version>1.0.0</version>
</dependency>



-Todd



> On Jun 17, 2016, at 11:08 AM, Dan Burkert wrote:
>
> [...]

Re: Spark on Kudu

2016-09-20 Thread Benjamin Kim
I see that the API has changed a bit, so my old code doesn’t work anymore.
Can someone direct me to some code samples?

Thanks,
Ben

> On Sep 20, 2016, at 1:44 PM, Todd Lipcon wrote:
>
> It's available in the official ASF Maven repository:
> https://repository.apache.org/#nexus-search;quick~kudu-spark
>
> [...]

Re: Spark on Kudu

2016-09-20 Thread Jordan Birdsell
http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
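
That page documents the reworked API. A minimal sketch based on it (assumes
kudu-spark_2.10 1.0.0 on the classpath and a spark-shell-style sqlContext;
the master address and table name are placeholders):

import org.apache.kudu.spark.kudu._

// Read a Kudu table into a DataFrame via the implicit .kudu reader.
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu-master:7051",
               "kudu.table" -> "my_table"))
  .kudu

// Write through the KuduContext; upsertRows avoids the insert-vs-update
// sharp edges discussed earlier in this thread.
val kuduContext = new KuduContext("kudu-master:7051")
kuduContext.upsertRows(df, "my_table")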

On Tue, Sep 20, 2016 at 5:00 PM Benjamin Kim wrote:

> I see that the API has changed a bit so my old code doesn’t work anymore.
> Can someone direct me to some code samples?
>
> Thanks,
> Ben
>
>
> On Sep 20, 2016, at 1:44 PM, Todd Lipcon wrote:
>
> [...]

Re: Spark on Kudu

2016-09-20 Thread Benjamin Kim
Thanks!

> On Sep 20, 2016, at 3:02 PM, Jordan Birdsell wrote:
>
> http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
>
> [...]

Re: [ANNOUNCE] Apache Kudu 1.0.0 release

2016-09-20 Thread Matteo Durighetto
2016-09-20 9:11 GMT+02:00 Todd Lipcon:

> [...]


Really great. Moreover, there is a new operations producer in the
kudu-flume sink: the regexp producer.

https://github.com/cloudera/kudu/blob/master/java/kudu-flume-sink/src/main/java/org/apache/kudu/flume/sink/RegexpKuduOperationsProducer.java

With the regexp producer it is simple to parse records with a regular
expression and write them into Kudu tables:

 * A regular expression serializer that generates one {@link Insert} or
 * {@link Upsert} per {@link Event} by parsing the payload into values using a
 * regular expression. Values are coerced to the proper column types.
 *
 * Example: if the Kudu table has the schema
 *
 * key INT32
 * name STRING
 *
 * and producer.pattern is '(?<key>\\d+),(?<name>\\w+)', then the
 * RegexpKuduOperationsProducer will parse the string
 *
 * |12345,Mike||54321,Todd|
 *
 * into the rows (key=12345, name=Mike) and (key=54321, name=Todd).

We are just testing it, and it's working.
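
For reference, here is a sketch of the Flume agent configuration that wires
this producer into the Kudu sink (the property names are assumptions based
on the kudu-flume-sink source; the agent, channel, master address, and
table names are placeholders):

agent.sinks.kudu-sink.type = org.apache.kudu.flume.sink.KuduSink
agent.sinks.kudu-sink.masterAddresses = kudu-master:7051
agent.sinks.kudu-sink.tableName = my_table
agent.sinks.kudu-sink.channel = memory-channel
agent.sinks.kudu-sink.producer = org.apache.kudu.flume.sink.RegexpKuduOperationsProducer
agent.sinks.kudu-sink.producer.pattern = (?<key>\\d+),(?<name>\\w+)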

Kind Regards

Matteo Durighetto
e-mail: m.durighe...@miriade.it