Thanks!
> On Sep 20, 2016, at 3:02 PM, Jordan Birdsell <jordantbirds...@gmail.com>
> wrote:
>
> http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
>
> On Tue, Sep 20, 2016 at 5:00 PM Benjamin Kim <bbuil...@gmail.com
> <mailto:bbuil...@gmail.com>> wrote:
> I see that the API has changed a bit so my old code doesn’t work anymore. Can
> someone direct me to some code samples?
>
> Thanks,
> Ben
>
>
>> On Sep 20, 2016, at 1:44 PM, Todd Lipcon <t...@cloudera.com
>> <mailto:t...@cloudera.com>> wrote:
>>
>> On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim <bbuil...@gmail.com
>> <mailto:bbuil...@gmail.com>> wrote:
>> Now that Kudu 1.0.0 is officially out and ready for production use, where do
>> we find the spark connector jar for this release?
>>
>>
>> It's available in the official ASF maven repository:
>> https://repository.apache.org/#nexus-search;quick~kudu-spark
>>
>> <dependency>
>> <groupId>org.apache.kudu</groupId>
>> <artifactId>kudu-spark_2.10</artifactId>
>> <version>1.0.0</version>
>> </dependency>
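
For an sbt build, the Maven coordinates above would presumably translate to the line below. This is a sketch derived only from that snippet; the plain `%` is an assumption that follows from the artifact ID already carrying the Scala version suffix.

```scala
// build.sbt: same group, artifact, and version as the Maven snippet above.
// Plain % (not %%) because "_2.10" is already part of the artifact name.
libraryDependencies += "org.apache.kudu" % "kudu-spark_2.10" % "1.0.0"
```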
>>
>>
>> -Todd
>>
>>
>>
>>> On Jun 17, 2016, at 11:08 AM, Dan Burkert <d...@cloudera.com
>>> <mailto:d...@cloudera.com>> wrote:
>>>
>>> Hi Ben,
>>>
>>> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I
>>> do not think we support that at this point. I haven't looked deeply into
>>> it, but we may hit issues specifying Kudu-specific options (partitioning,
>>> column encoding, etc.). Probably issues that can be worked through
>>> eventually, though. If you are interested in contributing to Kudu, this is
>>> an area that could obviously use improvement! Most or all of our Spark
>>> features have been completely community driven to date.
>>>
>>> I am assuming that more Spark support along with semantic changes below
>>> will be incorporated into Kudu 0.9.1.
>>>
>>> As a rule we do not release new features in patch releases, but the good
>>> news is that we are releasing regularly, and our next scheduled release is
>>> for the August timeframe (see JD's roadmap
>>> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>>> email about what we are aiming to include). Also, Cloudera does publish
>>> snapshot versions of the Spark connector here
>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so
>>> the jars are available if you don't mind using snapshots.
>>>
>>> Does anyone know of a better way to make unique primary keys, other than
>>> using a UUID to make every row unique, when there is no unique column (or
>>> combination thereof) to use?
>>>
>>> Not that I know of. In general it's pretty rare to have a dataset without
>>> a natural primary key (even if it's just all of the columns), but in those
>>> cases UUID is a good solution.
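
A minimal sketch of the two options mentioned above, using only `java.util.UUID` from the JDK. `SurrogateKeys`, `randomKey`, and `contentKey` are hypothetical names chosen for illustration; the length-prefixed encoding of the column values is likewise an assumption, used so that different column splits do not produce the same input string.

```scala
import java.nio.charset.StandardCharsets
import java.util.UUID

object SurrogateKeys {
  // Random (version 4) UUID: a fresh key on every call, so loading the
  // same row twice yields two different keys.
  def randomKey(): String = UUID.randomUUID().toString

  // Name-based (version 3) UUID computed from all column values: the same
  // values always map to the same key ("all of the columns" as the key).
  def contentKey(columns: Seq[String]): String = {
    // Prefix each value with its length so ("ab","c") and ("a","bc") differ.
    val encoded = columns.map(c => s"${c.length}:$c").mkString
    UUID.nameUUIDFromBytes(encoded.getBytes(StandardCharsets.UTF_8)).toString
  }
}
```

The content-based variant makes reloads idempotent, at the price of treating two rows with identical values as the same row.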
>>>
>>> This is what I am using. I know auto incrementing is coming down the line
>>> (don’t know when), but is there a way to simulate this in Kudu using Spark
>>> out of curiosity?
>>>
>>> To my knowledge there is no plan to have auto increment in Kudu.
>>> Distributed, consistent, auto-incrementing counters are a difficult problem,
>>> and I don't think there are any known solutions that would be fast enough
>>> for Kudu (happy to be proven wrong, though!).
>>>
>>> - Dan
>>>
>>>
>>> Thanks,
>>> Ben
>>>
>>>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com
>>>> <mailto:d...@cloudera.com>> wrote:
>>>>
>>>> I'm not sure exactly what the semantics will be, but at least one of them
>>>> will be upsert. These modes come from spark, and they were really
>>>> designed for file-backed storage and not table storage. We may want to do
>>>> append = upsert, and overwrite = truncate + insert. I think that may
>>>> match the normal spark semantics more closely.
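
The proposed mapping (append = upsert, overwrite = truncate + insert) can be sketched on a plain Scala `Map` standing in for a table. This only models the semantics under discussion; `SaveModeSketch`, `append`, and `overwrite` are illustrative names, not connector APIs.

```scala
object SaveModeSketch {
  // Primary key -> row payload; a stand-in for a table, not a Kudu type.
  type Table = Map[String, String]

  // "append" treated as upsert: rows with existing keys are replaced,
  // rows with new keys are added, untouched rows stay as they were.
  def append(table: Table, incoming: Table): Table = table ++ incoming

  // "overwrite" treated as truncate + insert: the old contents are
  // discarded and only the incoming rows remain.
  def overwrite(table: Table, incoming: Table): Table = incoming
}
```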
>>>>
>>>> - Dan
>>>>
>>>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Dan,
>>>>
>>>> Thanks for the information. That would mean both “append” and “overwrite”
>>>> modes would be combined or not needed in the future.
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com
>>>>> <mailto:d...@cloudera.com>> wrote:
>>>>>
>>>>> Right now append uses an update Kudu operation, which requires the row
>>>>> already be present in the table. Overwrite maps to insert. Kudu very
>>>>> recently got upsert support baked in, but it hasn't yet been integrated
>>>>> into the Spark connector. So pretty soon these sharp edges will get a
>>>>> lot better, since upsert is the way to go for most spark workloads.
>>>>>
>>>>> - Dan
>>>>>
>>>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com
>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in
>>>>> 64s. I would assume that now I can use the “overwrite” mode on existing
>>>>> data. Now, I have to find answers to these questions. What would happen
>>>>> if I “append” to the data in the Kudu table if the data already exists?
>>>>> What would happen if I “overwrite” existing data when the DataFrame has
>>>>> data in it that does not exist in the Kudu table? I need to evaluate the
>>>>> best way to simulate the UPSERT behavior in HBase because this is what
>>>>> our use case is.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>>
>>>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com
>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Now, I’m getting this error when trying to write to the table.
>>>>>>
>>>>>> import scala.collection.JavaConverters._
>>>>>> val key_seq = Seq("my_id")
>>>>>> val key_list = List("my_id").asJava
>>>>>> kuduContext.createTable(tableName, df.schema, key_seq,
>>>>>>   new CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>>>>>
>>>>>> df.write
>>>>>> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>>>>> .mode("overwrite")
>>>>>> .kudu
>>>>>>
>>>>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to
>>>>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key
>>>>>> not found (error 0)Not found: key not found (error 0)Not found: key not
>>>>>> found (error 0)Not found: key not found (error 0)
>>>>>>
>>>>>> Does the key field need to be first in the DataFrame?
>>>>>>
>>>>>> Thanks,
>>>>>> Ben
>>>>>>
>>>>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com
>>>>>>> <mailto:d...@cloudera.com>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com
>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>> Dan,
>>>>>>>
>>>>>>> Thanks! It got further. Now, how do I set the Primary Key to be a
>>>>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>>>>>
>>>>>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new
>>>>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>>>>>>
>>>>>>> java.lang.IllegalArgumentException: Table partitioning must be
>>>>>>> specified using setRangePartitionColumns or addHashPartitions
>>>>>>>
>>>>>>> Yep. The `Seq("my_id")` part of that call is specifying the set of
>>>>>>> primary key columns, so in this case you have specified the single PK
>>>>>>> column "my_id". The `addHashPartitions` call adds hash partitioning to
>>>>>>> the table, in this case over the column "my_id" (which is good, it must
>>>>>>> be over one or more PK columns, so in this case "my_id" is the one and
>>>>>>> only valid combination). However, the call to `addHashPartitions` also
>>>>>>> takes the number of buckets as the second param. You shouldn't get the
>>>>>>> IllegalArgumentException as long as you are specifying either
>>>>>>> `addHashPartitions` or `setRangePartitionColumns`.
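
To illustrate what the bucket count in the second parameter controls, here is a small sketch that assigns key values to buckets. It uses Scala's `MurmurHash3` purely as a stand-in; the hash function Kudu actually uses is not stated in this thread, and `HashBucketSketch` and `bucketFor` are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object HashBucketSketch {
  // Maps a primary-key value to one of numBuckets buckets.
  // MurmurHash3 is an assumption for illustration, not Kudu's hash function.
  def bucketFor(key: String, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    // floorMod keeps the result in [0, numBuckets) even for negative hashes.
    Math.floorMod(MurmurHash3.stringHash(key), numBuckets)
  }
}
```

With 100 buckets, as in the call discussed above, every key lands in exactly one of 100 partitions, and the same key always lands in the same one.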
>>>>>>>
>>>>>>> - Dan
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ben
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com
>>>>>>>> <mailto:d...@cloudera.com>> wrote:
>>>>>>>>
>>>>>>>> Looks like we're missing an import statement in that example. Could
>>>>>>>> you try:
>>>>>>>>
>>>>>>>> import org.kududb.client._
>>>>>>>> and try again?
>>>>>>>>
>>>>>>>> - Dan
>>>>>>>>
>>>>>>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com
>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>> I encountered an error trying to create a table based on the
>>>>>>>> documentation from a DataFrame.
>>>>>>>>
>>>>>>>> <console>:49: error: not found: type CreateTableOptions
>>>>>>>> kuduContext.createTable(tableName, df.schema,
>>>>>>>> Seq("key"), new CreateTableOptions().setNumReplicas(1))
>>>>>>>>
>>>>>>>> Is there something I’m missing?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org
>>>>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>
>>>>>>>>> It's only in Cloudera's maven repo:
>>>>>>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com
>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>> Hi J-D,
>>>>>>>>>
>>>>>>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar
>>>>>>>>> for spark-shell to use. Can you show me where to find it?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org
>>>>>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>
>>>>>>>>>> What's in this doc is what's gonna get released:
>>>>>>>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>>>>>>>
>>>>>>>>>> J-D
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com
>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans
>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> It will be in 0.9.0.
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com
>>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>> Hi Chris,
>>>>>>>>>>>
>>>>>>>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ben
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On May 18, 2016, at 9:01 AM, Chris George
>>>>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> There is some code in review that needs some more refinement.
>>>>>>>>>>>> It will allow upsert/insert from a dataframe using the datasource
>>>>>>>>>>>> api. It will also allow the creation and deletion of tables from a
>>>>>>>>>>>> dataframe
>>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>>>>>>>>>>
>>>>>>>>>>>> Example usages will look something like:
>>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>>>>>>>>>>
>>>>>>>>>>>> -Chris George
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com
>>>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Can someone tell me what the state is of this Spark work?
>>>>>>>>>>>>
>>>>>>>>>>>> Also, does anyone have any sample code on how to update/insert
>>>>>>>>>>>> data in Kudu using DataFrames?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ben
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George
>>>>>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> SparkSQL cannot support this type of statement, but we may be
>>>>>>>>>>>>> able to implement similar functionality through the API.
>>>>>>>>>>>>> -Chris
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com
>>>>>>>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> It would be nice to adhere to the SQL:2003 standard for an
>>>>>>>>>>>>> “upsert” if it were to be implemented.
>>>>>>>>>>>>>
>>>>>>>>>>>>> MERGE INTO table_name USING table_reference ON (condition)
>>>>>>>>>>>>> WHEN MATCHED THEN
>>>>>>>>>>>>> UPDATE SET column1 = value1 [, column2 = value2 ...]
>>>>>>>>>>>>> WHEN NOT MATCHED THEN
>>>>>>>>>>>>> INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Apr 11, 2016, at 12:21 PM, Chris George
>>>>>>>>>>>>>> <christopher.geo...@rms.com <mailto:christopher.geo...@rms.com>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a wip kuduRDD that I made a few months ago. I pushed it
>>>>>>>>>>>>>> into gerrit if you want to take a look.
>>>>>>>>>>>>>> http://gerrit.cloudera.org:8080/#/c/2754/
>>>>>>>>>>>>>> It does pushdown predicates which the existing input formatter
>>>>>>>>>>>>>> based rdd does not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Within the next two weeks I’m planning to implement a datasource
>>>>>>>>>>>>>> for spark that will have pushdown predicates and
>>>>>>>>>>>>>> insertion/update functionality (I need to look more at the
>>>>>>>>>>>>>> cassandra and hbase datasources for the best way to do this).
>>>>>>>>>>>>>> I agree that server side upsert would be helpful.
>>>>>>>>>>>>>> Having a datasource would give us useful data frames and also
>>>>>>>>>>>>>> make spark sql usable for kudu.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My reasoning for having a spark datasource and not using Impala
>>>>>>>>>>>>>> is:
>>>>>>>>>>>>>> 1. We have had trouble getting impala to run fast with high
>>>>>>>>>>>>>>    concurrency when compared to spark.
>>>>>>>>>>>>>> 2. We interact with datasources which do not integrate with impala.
>>>>>>>>>>>>>> 3. We have custom sql query planners for extended sql functionality.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Chris George
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org
>>>>>>>>>>>>>> <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You guys make a convincing point, although on the upsert side
>>>>>>>>>>>>>> we'll need more support from the servers. Right now all you can
>>>>>>>>>>>>>> do is an INSERT then, if you get a dup key, do an UPDATE. I
>>>>>>>>>>>>>> guess we could at least add an API on the client side that would
>>>>>>>>>>>>>> manage it, but it wouldn't be atomic.
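
The insert-then-update fallback described above can be sketched against a tiny in-memory table. `TinyTable` and `ClientSideUpsert` are hypothetical stand-ins written for this illustration, not Kudu client classes, and the error strings are made up.

```scala
import scala.collection.mutable

// Tiny in-memory stand-in for a table keyed by primary key.
final class TinyTable {
  private val rows = mutable.Map.empty[String, String]

  // INSERT: fails if the key already exists.
  def insert(key: String, value: String): Either[String, Unit] =
    if (rows.contains(key)) Left("duplicate key")
    else { rows(key) = value; Right(()) }

  // UPDATE: fails if the key is absent.
  def update(key: String, value: String): Either[String, Unit] =
    if (rows.contains(key)) { rows(key) = value; Right(()) }
    else Left("key not found")

  def get(key: String): Option[String] = rows.get(key)
}

object ClientSideUpsert {
  // Try INSERT first and fall back to UPDATE on a duplicate key.
  // These are two separate operations, so another writer could change
  // or delete the row in between: the pair is not atomic.
  def upsert(table: TinyTable, key: String, value: String): Either[String, Unit] =
    table.insert(key, value) match {
      case Left("duplicate key") => table.update(key, value)
      case other                 => other
    }
}
```

A server-side upsert would collapse the two steps into one operation, which is what removes the race.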
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra
>>>>>>>>>>>>>> <m...@clearstorydata.com <mailto:m...@clearstorydata.com>>wrote:
>>>>>>>>>>>>>> It's pretty simple, actually. I need to support versioned
>>>>>>>>>>>>>> datasets in a Spark SQL environment. Instead of a hack on top
>>>>>>>>>>>>>> of a Parquet data store, I'm hoping (among other reasons) to be
>>>>>>>>>>>>>> able to use Kudu's write and timestamp-based read operations to
>>>>>>>>>>>>>> support not only appending data, but also updating existing
>>>>>>>>>>>>>> data, and even some schema migration. The most typical use case
>>>>>>>>>>>>>> is a dataset that is updated periodically (e.g., weekly or
>>>>>>>>>>>>>> monthly) in which the preliminary data in the previous
>>>>>>>>>>>>>> window (week or month) is updated with values that are expected
>>>>>>>>>>>>>> to remain unchanged from then on, and a new set of preliminary
>>>>>>>>>>>>>> values for the current window needs to be added/appended.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Using Kudu's Java API and developing additional functionality on
>>>>>>>>>>>>>> top of what Kudu has to offer isn't too much to ask, but the
>>>>>>>>>>>>>> ease of integration with Spark SQL will gate how quickly we
>>>>>>>>>>>>>> would move to using Kudu and how seriously we'd look at
>>>>>>>>>>>>>> alternatives before making that decision.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans
>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>>wrote:
>>>>>>>>>>>>>> Mark,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for taking some time to reply in this thread, glad it
>>>>>>>>>>>>>> caught the attention of other folks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:33 PM, Mark
>>>>>>>>>>>>>> Hamstra<m...@clearstorydata.com
>>>>>>>>>>>>>> <mailto:m...@clearstorydata.com>> wrote:
>>>>>>>>>>>>>> Do they care about being able to insert into Kudu with SparkSQL
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I care about insert into Kudu with Spark SQL. I'm currently
>>>>>>>>>>>>>> delaying a refactoring of some Spark SQL-oriented insert
>>>>>>>>>>>>>> functionality while trying to evaluate what to expect from Kudu.
>>>>>>>>>>>>>> Whether Kudu does a good job supporting inserts with Spark SQL
>>>>>>>>>>>>>> will be a key consideration as to whether we adopt Kudu.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'd like to know more about why SparkSQL inserts are necessary
>>>>>>>>>>>>>> for you. Is it just that you currently do it that way into some
>>>>>>>>>>>>>> database or parquet so with minimal refactoring you'd be able to
>>>>>>>>>>>>>> use Kudu? Would re-writing those SQL lines into Scala and
>>>>>>>>>>>>>> directly using the Java API's KuduSession be too much work?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Additionally, what do you expect to gain from using Kudu VS your
>>>>>>>>>>>>>> current solution? If it's not completely clear, I'd love to help
>>>>>>>>>>>>>> you think through it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans
>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>> Yup, starting to get a good idea.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What are your DS folks looking for in terms of functionality
>>>>>>>>>>>>>> related to Spark? A SparkSQL integration that's as fully
>>>>>>>>>>>>>> featured as Impala's? Do they care about being able to insert
>>>>>>>>>>>>>> into Kudu with SparkSQL or just being able to query real fast?
>>>>>>>>>>>>>> Anything more specific to Spark that I'm missing?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW the plan is to get to 1.0 in late Summer/early Fall. At
>>>>>>>>>>>>>> Cloudera all our resources are committed to making things happen
>>>>>>>>>>>>>> in time, and a more fully featured Spark integration isn't in
>>>>>>>>>>>>>> our plans during that period. I'm really hoping someone in the
>>>>>>>>>>>>>> community will help with Spark, the same way we got a big
>>>>>>>>>>>>>> contribution for the Flume sink.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim
>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>>wrote:
>>>>>>>>>>>>>> Yes, we took Kudu for a test run using 0.6 and 0.7 versions.
>>>>>>>>>>>>>> But, since it’s not “production-ready”, upper management doesn’t
>>>>>>>>>>>>>> want to fully deploy it yet. They just want to keep an eye on it
>>>>>>>>>>>>>> though. Kudu was so much simpler and easier to use in every
>>>>>>>>>>>>>> aspect compared to HBase. Impala was great for the report
>>>>>>>>>>>>>> writers and analysts to experiment with for the short time it
>>>>>>>>>>>>>> was up. But, once again, the only blocker was the lack of Spark
>>>>>>>>>>>>>> support for our Data Developers/Scientists. So, production-level
>>>>>>>>>>>>>> data population won’t happen until then.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I hope this helps you get an idea where I am coming from…
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans
>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim
>>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The main thing I hear is that Cassandra is being used as an
>>>>>>>>>>>>>>> updatable hot data store to ensure that duplicates are taken
>>>>>>>>>>>>>>> care of and idempotency is maintained. It was not clear what
>>>>>>>>>>>>>>> its main use was, or whether data was retrieved directly from
>>>>>>>>>>>>>>> Cassandra for analytics, reports, or searches. Some
>>>>>>>>>>>>>>> also just used it for a staging area to populate downstream
>>>>>>>>>>>>>>> tables in parquet format. The last thing I heard was that CQL
>>>>>>>>>>>>>>> was terrible, so that rules out much use of direct queries
>>>>>>>>>>>>>>> against it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm no C* expert, but I don't think CQL is meant for real
>>>>>>>>>>>>>>> analytics, just ease of use instead of plainly using the APIs.
>>>>>>>>>>>>>>> Even then, Kudu should beat it easily on big scans. Same for
>>>>>>>>>>>>>>> HBase. We've done benchmarks against the latter, not the former.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As for our company, we have been looking for an updatable data
>>>>>>>>>>>>>>> store for a long time that can be quickly queried directly
>>>>>>>>>>>>>>> either using Spark SQL or Impala or some other SQL engine and
>>>>>>>>>>>>>>> still handle TB or PB of data without performance degradation
>>>>>>>>>>>>>>> and many configuration headaches. For now, we are using HBase
>>>>>>>>>>>>>>> to take on this role with Phoenix as a fast way to directly
>>>>>>>>>>>>>>> query the data. I can see Kudu as the best way to fill this gap
>>>>>>>>>>>>>>> easily, especially being the closest thing to other relational
>>>>>>>>>>>>>>> databases out there in familiarity for the many SQL analytics
>>>>>>>>>>>>>>> people in our company. The other alternative would be to go
>>>>>>>>>>>>>>> with AWS Redshift for the same reasons, but it would come at a
>>>>>>>>>>>>>>> cost, of course. If we went with either solution, Kudu or
>>>>>>>>>>>>>>> Redshift, it would get rid of the need to extract from HBase to
>>>>>>>>>>>>>>> parquet tables or export to PostgreSQL to support more of the
>>>>>>>>>>>>>>> SQL language used by analysts or the reporting software we
>>>>>>>>>>>>>>> use.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok, the usual then *smile*. Looks like we're not too far off
>>>>>>>>>>>>>>> with Kudu. Have you folks tried Kudu with Impala yet with those
>>>>>>>>>>>>>>> use cases?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It does, thanks for the nice reply.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans
>>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like
>>>>>>>>>>>>>>>> to refer to "Impala + Kudu" as Kimpala, but yeah it's not as
>>>>>>>>>>>>>>>> sexy. My colleagues who were also there did say that the hype
>>>>>>>>>>>>>>>> around Spark isn't dying down.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There's definitely an overlap in the use cases that Cassandra,
>>>>>>>>>>>>>>>> HBase, and Kudu cater to. I wouldn't go as far as saying that
>>>>>>>>>>>>>>>> C* is just an interim solution for the use case you describe.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nothing significant happened in Kudu over the past month, it's
>>>>>>>>>>>>>>>> a storage engine so things move slowly *smile*. I'd love to
>>>>>>>>>>>>>>>> see more contributions on the Spark front. I know there's code
>>>>>>>>>>>>>>>> out there that could be integrated in kudu-spark, it just
>>>>>>>>>>>>>>>> needs to land in gerrit. I'm sure folks will happily review it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do you have relevant experiences you can share? I'd love to
>>>>>>>>>>>>>>>> learn more about the use cases for which you envision using
>>>>>>>>>>>>>>>> Kudu as a C* replacement.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim
>>>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My colleagues recently came back from Strata in San Jose. They
>>>>>>>>>>>>>>>> told me that everything was about Spark and there is a big
>>>>>>>>>>>>>>>> buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra,
>>>>>>>>>>>>>>>> Kafka). I still think that Cassandra is just an interim
>>>>>>>>>>>>>>>> solution as a low-latency, easily queried data store. I was
>>>>>>>>>>>>>>>> wondering if anything significant happened in regards to Kudu,
>>>>>>>>>>>>>>>> especially on the Spark front. Plus, can you come up with your
>>>>>>>>>>>>>>>> own proposed stack acronym to promote?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans
>>>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Ben,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> AFAIK no one in the dev community committed to any timeline.
>>>>>>>>>>>>>>>>> I know of one person on the Kudu Slack who's working on a
>>>>>>>>>>>>>>>>> better RDD, but that's about it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim
>>>>>>>>>>>>>>>>> <b...@amobee.com <mailto:b...@amobee.com>> wrote:
>>>>>>>>>>>>>>>>> Hi J-D,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Quick question… Is there an ETA for KUDU-1214? I want to
>>>>>>>>>>>>>>>>> target a version of Kudu to begin real testing of Spark
>>>>>>>>>>>>>>>>> against it for our devs. At least, I can tell them what
>>>>>>>>>>>>>>>>> timeframe to anticipate.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just curious,
>>>>>>>>>>>>>>>>> Benjamin Kim
>>>>>>>>>>>>>>>>> Data Solutions Architect
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [a•mo•bee] (n.) the company defining digital marketing.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mobile: +1 818 635 2900
>>>>>>>>>>>>>>>>> 3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 |
>>>>>>>>>>>>>>>>> www.amobee.com
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans
>>>>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The DStream stuff isn't there at all. I'm not sure if it's
>>>>>>>>>>>>>>>>>> needed either.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The kuduRDD is just leveraging the MR input format, ideally
>>>>>>>>>>>>>>>>>> we'd use scans directly.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The SparkSQL stuff is there but it doesn't do any sort of
>>>>>>>>>>>>>>>>>> pushdown. It's really basic.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The goal was to provide something for others to contribute
>>>>>>>>>>>>>>>>>> to. We have some basic unit tests that others can easily
>>>>>>>>>>>>>>>>>> extend. None of us on the team are Spark experts, but we'd
>>>>>>>>>>>>>>>>>> be really happy to assist anyone who wants to improve the
>>>>>>>>>>>>>>>>>> kudu-spark code.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim
>>>>>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It looks like it fulfills most of the basic requirements
>>>>>>>>>>>>>>>>>> (kudu RDD, kudu DStream) in KUDU-1214. Am I right? Besides
>>>>>>>>>>>>>>>>>> shoring up more Spark SQL functionality (Dataframes) and
>>>>>>>>>>>>>>>>>> doing the documentation, what more needs to be done?
>>>>>>>>>>>>>>>>>> Optimizations?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I believe that it’s a good place to start using Spark with
>>>>>>>>>>>>>>>>>> Kudu and compare it to HBase with Spark (not clean).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Ben
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans
>>>>>>>>>>>>>>>>>>> <jdcry...@apache.org <mailto:jdcry...@apache.org>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> AFAIK no one is working on it, but we did manage to get
>>>>>>>>>>>>>>>>>>> this in for 0.7.0:
>>>>>>>>>>>>>>>>>>> https://issues.cloudera.org/browse/KUDU-1321
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It's a really simple wrapper, and yes you can use SparkSQL
>>>>>>>>>>>>>>>>>>> on Kudu, but it will require a lot more work to make it
>>>>>>>>>>>>>>>>>>> fast/useful.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hope this helps,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim
>>>>>>>>>>>>>>>>>>> <bbuil...@gmail.com <mailto:bbuil...@gmail.com>> wrote:
>>>>>>>>>>>>>>>>>>> I see this KUDU-1214
>>>>>>>>>>>>>>>>>>> <https://issues.cloudera.org/browse/KUDU-1214> targeted for
>>>>>>>>>>>>>>>>>>> 0.8.0, but I see no progress on it. When this is complete,
>>>>>>>>>>>>>>>>>>> will this mean that Spark will be able to work with Kudu
>>>>>>>>>>>>>>>>>>> both programmatically and as a client via Spark SQL? Or is
>>>>>>>>>>>>>>>>>>> there more work that needs to be done on the Spark side for
>>>>>>>>>>>>>>>>>>> it to work?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Just curious.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Ben
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>