On Mon, Oct 10, 2016 at 4:11 PM, Dan Burkert <[email protected]> wrote:
> Hi Ben,
>
> SparkSQL relies on Hive for DDL statements, so having support for this requires adding support to Hive for manipulating Kudu tables. This is something that we would like to do in the long term, but there are no concrete plans (that I know of) to make it happen in the near term.

To be fair, there's https://issues.apache.org/jira/browse/HIVE-12971 with a link to https://github.com/BimalTandel/HiveKudu-Handler, which I think Bimal said he was going to update soon. But we're still far, I think, from any Kudu support in a released version of Hive.

> - Dan

On Thu, Oct 6, 2016 at 4:38 PM, Benjamin Kim <[email protected]> wrote:

Anyone know if the Spark package will ever allow for creating tables in Spark SQL? Such as:

CREATE EXTERNAL TABLE <table-name>
USING org.apache.kudu.spark.kudu
OPTIONS (Map("kudu.master" -> "<kudu-master>", "kudu.table" -> "<table-name>"));

In this way, plain SQL can be used for DDL and DML statements, whether in Spark SQL code or over JDBC to the Spark SQL Thriftserver.

By the way, we are trying to build a DMP in Kudu with a farm of RESTful endpoints to do cookie sync, ad serving, and segmentation data exchange. And the Spark compute cluster and the Kudu cluster will reside on the same racks in the same datacenter.

Thanks,
Ben

On Sep 20, 2016, at 3:02 PM, Jordan Birdsell <[email protected]> wrote:

http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark

On Tue, Sep 20, 2016 at 5:00 PM, Benjamin Kim <[email protected]> wrote:

I see that the API has changed a bit, so my old code doesn't work anymore. Can someone direct me to some code samples?

Thanks,
Ben

On Sep 20, 2016, at 1:44 PM, Todd Lipcon <[email protected]> wrote:

On Tue, Sep 20, 2016 at 1:18 PM, Benjamin Kim <[email protected]> wrote:

> Now that Kudu 1.0.0 is officially out and ready for production use, where do we find the spark connector jar for this release?

It's available in the official ASF maven repository: https://repository.apache.org/#nexus-search;quick~kudu-spark

<dependency>
  <groupId>org.apache.kudu</groupId>
  <artifactId>kudu-spark_2.10</artifactId>
  <version>1.0.0</version>
</dependency>

-Todd

On Jun 17, 2016, at 11:08 AM, Dan Burkert <[email protected]> wrote:

Hi Ben,

To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do not think we support that at this point. I haven't looked deeply into it, but we may hit issues specifying Kudu-specific options (partitioning, column encoding, etc.). Those are probably issues that can be worked through eventually, though. If you are interested in contributing to Kudu, this is an area that could obviously use improvement! Most or all of our Spark features have been completely community driven to date.

> I am assuming that more Spark support along with the semantic changes below will be incorporated into Kudu 0.9.1.

As a rule we do not release new features in patch releases, but the good news is that we are releasing regularly, and our next scheduled release is for the August timeframe (see JD's roadmap email <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E> about what we are aiming to include).
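Worth noting alongside Dan's answer: even without CREATE TABLE support, Spark SQL can query a Kudu table by registering the DataFrame as a temporary table. A minimal, untested sketch against the 1.0.0 connector (Spark 1.6-era API; the master address and table name are illustrative):

import org.apache.kudu.spark.kudu._

// Load the Kudu table through the data source API...
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu-master.example.com:7051",
               "kudu.table" -> "my_table"))
  .kudu

// ...and register it so plain SQL works against it.
df.registerTempTable("my_table")
sqlContext.sql("SELECT count(*) FROM my_table").show()

This only covers the read side; DDL still has to go through the Kudu APIs, as discussed above.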
Also, Cloudera does publish snapshot versions of the Spark connector here <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the jars are available if you don't mind using snapshots.

> Anyone know of a better way to make unique primary keys other than using UUIDs to make every row unique, if there is no unique column (or combination thereof) to use?

Not that I know of. In general it's pretty rare to have a dataset without a natural primary key (even if it's just all of the columns), but in those cases UUID is a good solution.

> This is what I am using. I know auto incrementing is coming down the line (don't know when), but is there a way to simulate this in Kudu using Spark, out of curiosity?

To my knowledge there is no plan to have auto increment in Kudu. Distributed, consistent, auto incrementing counters are a difficult problem, and I don't think there are any known solutions that would be fast enough for Kudu (happy to be proven wrong, though!).

- Dan

On Jun 14, 2016, at 6:08 PM, Dan Burkert <[email protected]> wrote:

I'm not sure exactly what the semantics will be, but at least one of them will be upsert. These modes come from Spark, and they were really designed for file-backed storage, not table storage. We may want to do append = upsert and overwrite = truncate + insert. I think that may match the normal Spark semantics more closely.

- Dan

On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <[email protected]> wrote:

Dan,

Thanks for the information. That would mean both "append" and "overwrite" modes would be combined or not needed in the future.

Cheers,
Ben

On Jun 14, 2016, at 5:57 PM, Dan Burkert <[email protected]> wrote:

Right now append uses an update Kudu operation, which requires the row to already be present in the table. Overwrite maps to insert. Kudu very recently got upsert support baked in, but it hasn't yet been integrated into the Spark connector. So pretty soon these sharp edges will get a lot better, since upsert is the way to go for most Spark workloads.

- Dan

On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <[email protected]> wrote:

I tried to use the "append" mode, and it worked. Over 3.8 million rows in 64s. I would assume that now I can use the "overwrite" mode on existing data. Now, I have to find answers to these questions. What would happen if I "append" to the data in the Kudu table and the data already exists? What would happen if I "overwrite" existing data when the DataFrame has data in it that does not exist in the Kudu table? I need to evaluate the best way to simulate the UPSERT behavior in HBase, because this is what our use case is.

Thanks,
Ben
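For the record, the upsert integration Dan mentions did land around the 1.0.0 connector: KuduContext grew explicit row operations, which covers the HBase-style UPSERT Ben is trying to simulate. A short sketch (table name illustrative):

// Upsert: inserts new keys and updates existing ones, so neither
// "key not found" nor "key already present" errors apply.
kuduContext.upsertRows(df, "my_table")

// insertRows / updateRows / deleteRows are the strict variants.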
On Jun 14, 2016, at 5:05 PM, Benjamin Kim <[email protected]> wrote:

Hi,

Now, I'm getting this error when trying to write to the table:

import scala.collection.JavaConverters._
val key_seq = Seq("my_id")
val key_list = List("my_id").asJava
kuduContext.createTable(tableName, df.schema, key_seq,
  new CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))

df.write
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
  .mode("overwrite")
  .kudu

java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu; sample errors: Not found: key not found (error 0)Not found: key not found (error 0)Not found: key not found (error 0)Not found: key not found (error 0)Not found: key not found (error 0)

Does the key field need to be first in the DataFrame?

Thanks,
Ben

On Jun 14, 2016, at 4:28 PM, Dan Burkert <[email protected]> wrote:

On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <[email protected]> wrote:

> Dan,
>
> Thanks! It got further. Now, how do I set the primary key to be a column (or columns) in the DataFrame and set the partitioning? Is it like this?
>
> kuduContext.createTable(tableName, df.schema, Seq("my_id"),
>   new CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>
> java.lang.IllegalArgumentException: Table partitioning must be specified using setRangePartitionColumns or addHashPartitions

Yep. The `Seq("my_id")` part of that call specifies the set of primary key columns, so in this case you have specified the single PK column "my_id". The `addHashPartitions` call adds hash partitioning to the table, in this case over the column "my_id" (which is good: it must be over one or more PK columns, so here "my_id" is the one and only valid combination). However, `addHashPartitions` also takes the number of buckets as the second param. You shouldn't get the IllegalArgumentException as long as you are specifying either `addHashPartitions` or `setRangePartitionColumns`.

- Dan

On Jun 14, 2016, at 4:07 PM, Dan Burkert <[email protected]> wrote:

Looks like we're missing an import statement in that example. Could you try:

import org.kududb.client._

and try again?

- Dan

On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <[email protected]> wrote:

I encountered an error trying to create a table from a DataFrame, based on the documentation:

<console>:49: error: not found: type CreateTableOptions
       kuduContext.createTable(tableName, df.schema, Seq("key"), new CreateTableOptions().setNumReplicas(1))

Is there something I'm missing?

Thanks,
Ben
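Pulling the fixes from this exchange together, a create-and-write sketch that should line up with the 0.9-era connector (untested; the names, replica count, and bucket count are illustrative):

import scala.collection.JavaConverters._
import org.kududb.client._  // CreateTableOptions lives here pre-rename

val keyCols = Seq("my_id")  // primary key columns, in order

// Hash-partition on the PK column; the second argument is the bucket count.
val options = new CreateTableOptions()
  .setNumReplicas(1)
  .addHashPartitions(List("my_id").asJava, 4)

kuduContext.createTable(tableName, df.schema, keyCols, options)

// Which of "append"/"overwrite" maps to insert vs. update was still in
// flux at this point in the thread, so treat the mode below as illustrative.
df.write
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
  .mode("append")
  .kudu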
On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <[email protected]> wrote:

It's only in Cloudera's maven repo: https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/

J-D

On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <[email protected]> wrote:

Hi J-D,

I installed Kudu 0.9.0 using CM, but I can't find the kudu-spark jar for spark-shell to use. Can you show me where to find it?

Thanks,
Ben

On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <[email protected]> wrote:

What's in this doc is what's gonna get released: https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark

J-D

On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <[email protected]> wrote:

Will this be documented with examples once 0.9.0 comes out?

Thanks,
Ben

On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <[email protected]> wrote:

It will be in 0.9.0.

J-D

On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <[email protected]> wrote:

Hi Chris,

Will all this effort be rolled into 0.9.0 and be ready for use?

Thanks,
Ben

On May 18, 2016, at 9:01 AM, Chris George <[email protected]> wrote:

There is some code in review that needs some more refinement. It will allow upsert/insert from a DataFrame using the datasource API. It will also allow the creation and deletion of tables from a DataFrame.
http://gerrit.cloudera.org:8080/#/c/2992/

Example usages will look something like:
http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc

-Chris George

On 5/18/16, 9:45 AM, "Benjamin Kim" <[email protected]> wrote:

Can someone tell me what the state is of this Spark work?

Also, does anyone have any sample code on how to update/insert data in Kudu using DataFrames?

Thanks,
Ben

On Apr 13, 2016, at 8:22 AM, Chris George <[email protected]> wrote:

SparkSQL cannot support these types of statements, but we may be able to implement similar functionality through the API.

-Chris

On 4/12/16, 5:19 PM, "Benjamin Kim" <[email protected]> wrote:

It would be nice to adhere to the SQL:2003 standard for an "upsert" if it were to be implemented:

MERGE INTO table_name USING table_reference ON (condition)
WHEN MATCHED THEN
  UPDATE SET column1 = value1 [, column2 = value2 ...]
WHEN NOT MATCHED THEN
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

Cheers,
Ben
On Apr 11, 2016, at 12:21 PM, Chris George <[email protected]> wrote:

I have a WIP kuduRDD that I made a few months ago. I pushed it into gerrit if you want to take a look: http://gerrit.cloudera.org:8080/#/c/2754/
It does push down predicates, which the existing input-format-based RDD does not.

Within the next two weeks I'm planning to implement a datasource for Spark that will have pushdown predicates and insertion/update functionality (I need to look more at the Cassandra and HBase datasources for the best way to do this). I agree that server-side upsert would be helpful. Having a datasource would give us useful data frames and also make Spark SQL usable for Kudu.

My reasoning for having a Spark datasource and not using Impala is:
1. We have had trouble getting Impala to run fast with high concurrency when compared to Spark.
2. We interact with datasources which do not integrate with Impala.
3. We have custom SQL query planners for extended SQL functionality.

-Chris George

On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <[email protected]> wrote:

You guys make a convincing point, although on the upsert side we'll need more support from the servers. Right now all you can do is an INSERT and then, if you get a dup key, do an UPDATE. I guess we could at least add an API on the client side that would manage it, but it wouldn't be atomic.

J-D

On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <[email protected]> wrote:

It's pretty simple, actually. I need to support versioned datasets in a Spark SQL environment. Instead of a hack on top of a Parquet data store, I'm hoping (among other reasons) to be able to use Kudu's write and timestamp-based read operations to support not only appending data, but also updating existing data, and even some schema migration. The most typical use case is a dataset that is updated periodically (e.g., weekly or monthly) in which the preliminary data in the previous window (week or month) is updated with values that are expected to remain unchanged from then on, and a new set of preliminary values for the current window needs to be added/appended.

Using Kudu's Java API and developing additional functionality on top of what Kudu has to offer isn't too much to ask, but the ease of integration with Spark SQL will gate how quickly we would move to using Kudu and how seriously we'd look at alternatives before making that decision.
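To make J-D's INSERT-then-UPDATE fallback concrete, a rough client-side sketch against the 0.8-era Java API (untested; the master address, table, and column names are illustrative, and as J-D notes, the two steps are not atomic):

import org.kududb.client._

val client = new KuduClient.KuduClientBuilder("kudu-master.example.com:7051").build()
val table = client.openTable("my_table")
// Default AUTO_FLUSH_SYNC mode, so apply() returns the per-row status.
val session = client.newSession()

def insertOrUpdate(id: String, value: Long): Unit = {
  val insert = table.newInsert()
  insert.getRow.addString("my_id", id)
  insert.getRow.addLong("value", value)
  if (session.apply(insert).hasRowError) {
    // Assume the failure was "row already present" and retry as an update;
    // a real implementation should inspect the row error before doing this.
    val update = table.newUpdate()
    update.getRow.addString("my_id", id)
    update.getRow.addLong("value", value)
    session.apply(update)
  }
}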
On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <[email protected]> wrote:

Mark,

Thanks for taking some time to reply in this thread, glad it caught the attention of other folks!

On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <[email protected]> wrote:

> > Do they care about being able to insert into Kudu with SparkSQL?
>
> I care about insert into Kudu with Spark SQL. I'm currently delaying a refactoring of some Spark SQL-oriented insert functionality while trying to evaluate what to expect from Kudu. Whether Kudu does a good job supporting inserts with Spark SQL will be a key consideration as to whether we adopt Kudu.

I'd like to know more about why SparkSQL inserts are necessary for you. Is it just that you currently do it that way into some database or Parquet, so with minimal refactoring you'd be able to use Kudu? Would re-writing those SQL lines into Scala and directly using the Java API's KuduSession be too much work?

Additionally, what do you expect to gain from using Kudu vs. your current solution? If it's not completely clear, I'd love to help you think through it.

On Sun, Apr 10, 2016 at 12:23 PM, Jean-Daniel Cryans <[email protected]> wrote:

Yup, starting to get a good idea.

What are your DS folks looking for in terms of functionality related to Spark? A SparkSQL integration that's as fully featured as Impala's? Do they care about being able to insert into Kudu with SparkSQL, or just about being able to query real fast? Anything more specific to Spark that I'm missing?

FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all our resources are committed to making things happen in time, and a more fully featured Spark integration isn't in our plans during that period. I'm really hoping someone in the community will help with Spark, the same way we got a big contribution for the Flume sink.

J-D

On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <[email protected]> wrote:

Yes, we took Kudu for a test run using the 0.6 and 0.7 versions. But, since it's not "production-ready", upper management doesn't want to fully deploy it yet. They just want to keep an eye on it, though. Kudu was so much simpler and easier to use in every aspect compared to HBase.
Impala was great for the report writers and analysts to experiment with for the short time it was up. But, once again, the only blocker was the lack of Spark support for our Data Developers/Scientists, so production-level data population won't happen until then.

I hope this helps you get an idea where I am coming from…

Cheers,
Ben

On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <[email protected]> wrote:

On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <[email protected]> wrote:

> J-D,
>
> The main thing I hear is that Cassandra is being used as an updatable hot data store to ensure that duplicates are taken care of and idempotency is maintained. Whether data was directly retrieved from Cassandra for analytics, reports, or searches, it was not clear what its main use was. Some also just used it as a staging area to populate downstream tables in Parquet format. The last thing I heard was that CQL was terrible, so that rules out much use of direct queries against it.

I'm no C* expert, but I don't think CQL is meant for real analytics, just ease of use instead of plainly using the APIs. Even then, Kudu should beat it easily on big scans. Same for HBase. We've done benchmarks against the latter, not the former.

> As for our company, we have been looking for an updatable data store for a long time that can be quickly queried directly, either using Spark SQL or Impala or some other SQL engine, and still handle TBs or PBs of data without performance degradation and many configuration headaches. For now, we are using HBase to take on this role, with Phoenix as a fast way to directly query the data. I can see Kudu as the best way to fill this gap easily, especially being the closest thing to other relational databases out there in familiarity for the many SQL analytics people in our company. The other alternative would be to go with AWS Redshift for the same reasons, but it would come at a cost, of course. If we went with either solution, Kudu or Redshift, it would get rid of the need to extract from HBase to Parquet tables or export to PostgreSQL to support more of the SQL language used by analysts or the reporting software we use.
Ok, the usual then *smile*. Looks like we're not too far off with Kudu. Have you folks tried Kudu with Impala yet with those use cases?

> I hope this helps.

It does, thanks for the nice reply.

On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <[email protected]> wrote:

Ha, first time I'm hearing about SMACK. Inside Cloudera we like to refer to "Impala + Kudu" as Kimpala, but yeah, it's not as sexy. My colleagues who were also there did say that the hype around Spark isn't dying down.

There's definitely an overlap in the use cases that Cassandra, HBase, and Kudu cater to. I wouldn't go as far as saying that C* is just an interim solution for the use case you describe.

Nothing significant happened in Kudu over the past month; it's a storage engine, so things move slowly *smile*. I'd love to see more contributions on the Spark front. I know there's code out there that could be integrated into kudu-spark, it just needs to land in gerrit. I'm sure folks will happily review it.

Do you have relevant experiences you can share? I'd love to learn more about the use cases for which you envision using Kudu as a C* replacement.

Thanks,

J-D

On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <[email protected]> wrote:

Hi J-D,

My colleagues recently came back from Strata in San Jose. They told me that everything was about Spark and that there is a big buzz about the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra is just an interim solution as a low-latency, easily queried data store. I was wondering if anything significant has happened in regards to Kudu, especially on the Spark front. Plus, can you come up with your own proposed stack acronym to promote?

Cheers,
Ben

On Mar 1, 2016, at 12:20 PM, Jean-Daniel Cryans <[email protected]> wrote:

Hi Ben,

AFAIK no one in the dev community committed to any timeline. I know of one person on the Kudu Slack who's working on a better RDD, but that's about it.
Regards,

J-D

On Tue, Mar 1, 2016 at 11:00 AM, Benjamin Kim <[email protected]> wrote:

Hi J-D,

Quick question… Is there an ETA for KUDU-1214? I want to target a version of Kudu to begin real testing of Spark against it for our devs. At least, I can tell them what timeframe to anticipate.

Just curious,

Benjamin Kim
Data Solutions Architect

[a•mo•bee] (n.) the company defining digital marketing.

Mobile: +1 818 635 2900
3250 Ocean Park Blvd, Suite 200 | Santa Monica, CA 90405 | www.amobee.com

On Feb 24, 2016, at 3:51 PM, Jean-Daniel Cryans <[email protected]> wrote:

The DStream stuff isn't there at all. I'm not sure if it's needed either.

The kuduRDD is just leveraging the MR input format; ideally we'd use scans directly.

The SparkSQL stuff is there, but it doesn't do any sort of pushdown. It's really basic.

The goal was to provide something for others to contribute to. We have some basic unit tests that others can easily extend. None of us on the team are Spark experts, but we'd be really happy to assist anyone who wants to improve the kudu-spark code.

J-D

On Wed, Feb 24, 2016 at 3:41 PM, Benjamin Kim <[email protected]> wrote:

J-D,

It looks like it fulfills most of the basic requirements (Kudu RDD, Kudu DStream) in KUDU-1214. Am I right? Besides shoring up more Spark SQL functionality (DataFrames) and doing the documentation, what more needs to be done? Optimizations?

I believe that it's a good place to start using Spark with Kudu and to compare it to HBase with Spark (not clean).

Thanks,
Ben

On Feb 24, 2016, at 3:10 PM, Jean-Daniel Cryans <[email protected]> wrote:

AFAIK no one is working on it, but we did manage to get this in for 0.7.0: https://issues.cloudera.org/browse/KUDU-1321

It's a really simple wrapper, and yes, you can use SparkSQL on Kudu, but it will require a lot more work to make it fast/useful.

Hope this helps,

J-D
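For a flavor of what that simple wrapper evolved into: by the 1.0.0 connector, a scan could be expressed directly as an RDD through the KuduContext. A rough sketch (untested; the master address, table, and column names are illustrative):

import org.apache.kudu.spark.kudu.KuduContext

val kuduContext = new KuduContext("kudu-master.example.com:7051")

// Scan two projected columns from the table into an RDD[Row].
val rdd = kuduContext.kuduRDD(sc, "my_table", Seq("my_id", "value"))
rdd.take(5).foreach(println)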
On Wed, Feb 24, 2016 at 3:08 PM, Benjamin Kim <[email protected]> wrote:

I see this KUDU-1214 <https://issues.cloudera.org/browse/KUDU-1214> targeted for 0.8.0, but I see no progress on it. When this is complete, will this mean that Spark will be able to work with Kudu both programmatically and as a client via Spark SQL? Or is there more work that needs to be done on the Spark side for it to work?

Just curious.

Cheers,
Ben

--
Todd Lipcon
Software Engineer, Cloudera
