[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-09-26 Thread Grant Henke (Code Review)
Grant Henke has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/11199 )

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..

Supporting Spark streaming DataFrame in KuduContext.

KUDU-2539: Supporting Spark streaming DataFrame in KuduContext.

This solution follows the way how other sinks ie. KafkaSink
is implemented, for details see
https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L87

Where on the DataFrame a queryExecution.toRdd.foreachPartition
is called to access the InternalRows which mapped to Rows by Catalyst
converters.

Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Reviewed-on: http://gerrit.cloudera.org:8080/11199
Tested-by: Kudu Jenkins
Reviewed-by: Attila Bukor 
Reviewed-by: Grant Henke 
---
M java/gradle/dependencies.gradle
M java/kudu-spark/build.gradle
M java/kudu-spark/pom.xml
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
A java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
5 files changed, 126 insertions(+), 6 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Attila Bukor: Looks good to me, but someone else must approve
  Grant Henke: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Attila Bukor 
Gerrit-Reviewer: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-09-26 Thread Grant Henke (Code Review)
Grant Henke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11199 )

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..


Patch Set 5: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Attila Bukor 
Gerrit-Reviewer: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Wed, 26 Sep 2018 15:45:18 +
Gerrit-HasComments: No


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-09-25 Thread Attila Piros (Code Review)
Attila Piros has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11199 )

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..


Patch Set 5:

> Patch Set 5:
>
> (1 comment)
In case of structured streaming foreachPartition on a streaming DataFrame is an 
unsupported operation. You can see this error running the new test without my 
KuduContext changes:


> Task :kudu-spark:test

org.apache.kudu.spark.kudu.StreamingTest > testKuduContextWithSparkStreaming 
FAILED
org.apache.spark.sql.streaming.StreamingQueryException: Queries with 
streaming sources must be executed with writeStream.start();;
LocalRelation [value#16]

=== Streaming Query ===
Identifier: [id = f00274e4-469e-43fd-8d9a-3e42f81125a3, runId = 
35f66191-e282-4112-b226-26cd2b0c6fae]
Current Committed Offsets: {}
Current Available Offsets: {MemoryStream[value#1]: 0}

Current State: ACTIVE
Thread State: RUNNABLE

Logical Plan:
Project [_1#7 AS key#10, _2#8 AS val#11]
+- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, 
true]))._1 AS _1#7, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false) AS 
_2#8]
   +- MapElements , int, [StructField(value,IntegerType,false)], 
obj#6: scala.Tuple2
  +- DeserializeToObject assertnotnull(cast(value#1 as int)), obj#5: int
 +- StreamingExecutionRelation MemoryStream[value#1], [value#1]

Caused by:
org.apache.spark.sql.AnalysisException: Queries with streaming sources 
must be executed with writeStream.start();;
LocalRelation [value#16]

1 test completed, 1 failed

> Task :kudu-spark:test FAILED
~~~

So this solution uses the same method as was used for KafkaSink: using 
foreachPartition on RDD-level. Basically it is a workaround.


--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Attila Bukor 
Gerrit-Reviewer: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Wed, 26 Sep 2018 00:42:12 +
Gerrit-HasComments: No


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-09-25 Thread Grant Henke (Code Review)
Grant Henke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11199 )

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11199/5/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
File 
java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala:

http://gerrit.cloudera.org:8080/#/c/11199/5/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala@305
PS5, Line 305: data.queryExecution.toRdd.foreachPartition(iterator => {
I don't fully understand why this is needed. Especially given that we convert 
back from and InternalRow to a Row below in writePartitionRows.

My Spark internal background isn't the strongest. Could you explain what this 
conversion fixes?



--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Attila Bukor 
Gerrit-Reviewer: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 25 Sep 2018 18:42:12 +
Gerrit-HasComments: Yes


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-09-12 Thread Attila Piros (Code Review)
Attila Piros has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11199 )

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..


Patch Set 5:

Gentle ping. Is there any new comments?


--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Attila Bukor 
Gerrit-Reviewer: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Wed, 12 Sep 2018 14:09:14 +
Gerrit-HasComments: No


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-08-29 Thread Attila Bukor (Code Review)
Attila Bukor has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11199 )

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..


Patch Set 5: Code-Review+1

looks good to me, I'll let Grant review it as well.


--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Attila Bukor 
Gerrit-Reviewer: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Wed, 29 Aug 2018 14:22:32 +
Gerrit-HasComments: No


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-08-15 Thread Attila Piros (Code Review)
Hello Kudu Jenkins, Grant Henke,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/11199

to look at the new patch set (#5).

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..

Supporting Spark streaming DataFrame in KuduContext.

KUDU-2539: Supporting Spark streaming DataFrame in KuduContext.

This solution follows the way how other sinks ie. KafkaSink
is implemented, for details see
https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L87

Where on the DataFrame a queryExecution.toRdd.foreachPartition
is called to access the InternalRows which mapped to Rows by Catalyst
converters.

Change-Id: Iead04539d3514920a5d6803c34715e5686124572
---
M java/gradle/dependencies.gradle
M java/kudu-spark/build.gradle
M java/kudu-spark/pom.xml
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
A java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingTest.scala
5 files changed, 126 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/11199/5
--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-08-14 Thread Attila Piros (Code Review)
Attila Piros has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11199 )

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..


Patch Set 4:

> Patch Set 4:
>
> > Could you add tests to demonstrate and prove the functionality this
>  > change is supporting? Also to ensure it doesn't break with future
>  > changes.
>

So I have tried: 
https://gist.github.com/attilapiros/3b5ef42c0f7aa08b0e2c834fbadfc574

But there is an unexpected problem StreamTest is a Scalatest but KuduTestSuite 
is a JUnitSuite and they cannot be mixed:

```
[ERROR] 
/home/systest/kudu-public/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/StreamingSuite.scala:27:
 error: illegal inheritance; superclass QueryTest
[ERROR]  is not a subclass of the superclass JUnitSuite
[ERROR]  of the mixin trait KuduTestSuite
[ERROR] class StreamingSuite extends StreamTest with KuduTestSuite {
[ERROR]  ^
```

Moreover runtime both would create a spark session :(

As StreamTest is quite complex the easier solution would be to remove the "with 
KuduTestSuite" and create the Kudu cluster and simple test table here.

Or using composition instead of inheritance via a new class extending 
KudutestSuite and call before and after methods accordingly.

In both cases here would be a scalatest. Is there any reason why JUnitSuite is 
used here? What is your suggestion?


--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 14 Aug 2018 15:22:20 +
Gerrit-HasComments: No


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-08-13 Thread Attila Piros (Code Review)
Attila Piros has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11199 )

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..


Patch Set 4:

> Could you add tests to demonstrate and prove the functionality this
 > change is supporting? Also to ensure it doesn't break with future
 > changes.

Unfortunately no idea how. If you check my kudu custom sink repo:

 > Could you add tests to demonstrate and prove the functionality this
 > change is supporting? Also to ensure it doesn't break with future
 > changes.

Unfortunately it is not easy: here there is custom sink using the KuduContext 
in streaming: https://github.com/attilapiros/kudu_custom_sink but in our case 
we would need the Spark streaming test in memory solution too.


--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Mon, 13 Aug 2018 17:01:05 +
Gerrit-HasComments: No


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-08-13 Thread Grant Henke (Code Review)
Grant Henke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11199 )

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..


Patch Set 1:

Could you add tests to demonstrate and prove the functionality this change is 
supporting? Also to ensure it doesn't break with future changes.


--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Grant Henke 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Mon, 13 Aug 2018 15:26:16 +
Gerrit-HasComments: No


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-08-13 Thread Attila Piros (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/11199

to look at the new patch set (#4).

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..

Supporting Spark streaming DataFrame in KuduContext.

KUDU-2539: Supporting Spark streaming DataFrame in KuduContext.

This solution follows the way how other sinks ie. KafkaSink
is implemented, for details see
https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L87

Where on the DataFrame a queryExecution.toRdd.foreachPartition
is called to access the InternalRows which mapped to Rows by Catalyst
converters.

Change-Id: Iead04539d3514920a5d6803c34715e5686124572
---
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
1 file changed, 9 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/11199/4
--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-08-13 Thread Attila Piros (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/11199

to look at the new patch set (#2).

Change subject: Supporting Spark streaming DataFrame in KuduContext.
..

Supporting Spark streaming DataFrame in KuduContext.

KUDU-2539: Supporting Spark streaming DataFrame in KuduContext.

This solution follows the way how other sinks ie. KafkaSink
is implemented, see
https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala#L87

Where on the DataFrame a queryExecution.toRdd.foreachPartition
is called to access the InternalRows.

Change-Id: Iead04539d3514920a5d6803c34715e5686124572
---
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
1 file changed, 13 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/11199/2
--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Piros 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] Supporting Spark streaming DataFrame in KuduContext.

2018-08-13 Thread Attila Piros (Code Review)
Attila Piros has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/11199


Change subject: Supporting Spark streaming DataFrame in KuduContext.
..

Supporting Spark streaming DataFrame in KuduContext.

KUDU-2539.

Change-Id: Iead04539d3514920a5d6803c34715e5686124572
---
M java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala
1 file changed, 13 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/99/11199/1
--
To view, visit http://gerrit.cloudera.org:8080/11199
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iead04539d3514920a5d6803c34715e5686124572
Gerrit-Change-Number: 11199
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Piros