[jira] [Resolved] (SPARK-20894) Error while checkpointing to HDFS

2017-08-07 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-20894.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Error while checkpointing to HDFS
> -
>
> Key: SPARK-20894
> URL: https://issues.apache.org/jira/browse/SPARK-20894
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
> Environment: Ubuntu, Spark 2.1.1, hadoop 2.7
>Reporter: kant kodali
>Assignee: Shixiong Zhu
> Fix For: 2.3.0
>
> Attachments: driver_info_log, executor1_log, executor2_log
>
>
> Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 
> hours", "24 hours"), df1.col("AppName")).count();
> StreamingQuery query = df2.writeStream().foreach(new 
> KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start();
> query.awaitTermination();
> This fails, for some reason, with the following error:
> ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalStateException: Error reading delta file 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = 
> (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: 
> /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist
> I cleared all the checkpoint data in /usr/local/hadoop/checkpoint/ and all 
> consumer offsets in Kafka from all brokers prior to running, and yet this 
> error still persists. 






[jira] [Commented] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-07-28 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105427#comment-16105427
 ] 

Mark Grover commented on SPARK-19720:
-

I wasn't planning on it. One could argue that this could be backported to 
branch-2.1, given that it's a rather simple change. However, 2.2 brought in some 
changes that were long overdue - dropping support for Java 7 and Hadoop 2.5 - and 
even if we got this change backported, you wouldn't be able to make use of the 
goodness down the road unless you upgraded to Hadoop 2.6, Java 8, etc. So, my 
recommendation here would be to brave the new world of Hadoop 2.6.

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it now seems better to redact information from 
> SparkSubmit's console output as well, because orchestration software like 
> Oozie usually exposes SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.






[jira] [Updated] (SPARK-19526) Spark should raise an exception when it tries to read a Hive view but it doesn't have read access on the corresponding table(s)

2017-07-25 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-19526:

Summary: Spark should raise an exception when it tries to read a Hive view 
but it doesn't have read access on the corresponding table(s)  (was: Spark 
should rise an exception when it tries to read a Hive view but it doesn't have 
read access on the corresponding table(s))

> Spark should raise an exception when it tries to read a Hive view but it 
> doesn't have read access on the corresponding table(s)
> ---
>
> Key: SPARK-19526
> URL: https://issues.apache.org/jira/browse/SPARK-19526
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.4, 2.0.3, 2.2.0, 2.3.0
>Reporter: Reza Safi
>
> Spark sees a Hive view as a set of HDFS files. So to read anything from a 
> Hive view, Spark needs access to all of the files that belong to the 
> table(s) that the view queries. In other words, a Spark user cannot be 
> granted fine-grained permissions at the level of Hive columns or records.
> Consider a Spark job that contains a SQL query that tries to 
> read a Hive view. Currently the Spark job will finish successfully even if the 
> user that runs the Spark job doesn't have proper read access permissions to 
> the tables that the Hive view has been built on top of. It will just 
> return an empty result set. This can be confusing for users, since the 
> job finishes without any exception or error. 
> Spark should raise an exception like AccessDenied when it tries to run a 
> Hive view query and its user doesn't have proper permissions to the tables 
> that the Hive view is created on top of. 






[jira] [Created] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL

2017-07-24 Thread Mark Grover (JIRA)
Mark Grover created SPARK-21525:
---

 Summary: ReceiverSupervisorImpl seems to ignore the error code 
when writing to the WAL
 Key: SPARK-21525
 URL: https://issues.apache.org/jira/browse/SPARK-21525
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.2.0
Reporter: Mark Grover


{{AddBlock}} returns an error code related to whether writing the block to the 
WAL was successful or not. In cases where a WAL may be unavailable temporarily, 
the write would fail but it seems like we are not using the return code (see 
[here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]).

For example, when using the Flume receiver, we should be sending a nack (negative 
acknowledgement) back to Flume if the block wasn't written to the WAL. I haven't gone through the 
full code path yet but at least from looking at the ReceiverSupervisorImpl, it 
doesn't seem like that return code is being used.
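A minimal, hypothetical sketch of the pattern being suggested (this is not the actual {{ReceiverSupervisorImpl}} code; it only assumes, as described above, that the block-reporting call returns a Boolean success flag):
{code}
import org.apache.spark.SparkException

object WalAckSketch {
  // Treat a failed WAL write / AddBlock report as an error instead of
  // silently discarding the returned status.
  def reportBlockOrFail(addBlock: () => Boolean): Unit = {
    val stored =
      try addBlock()   // e.g. the call that persists the block metadata to the WAL
      catch {
        case e: Exception =>
          throw new SparkException("Failed to report block to the WAL", e)
      }
    if (!stored) {
      // Surfacing the failure lets a receiver (e.g. the Flume receiver) nack the
      // batch rather than acknowledge data that was never durably stored.
      throw new SparkException("Block was not written to the WAL")
    }
  }
}
{code}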






[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-24 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062196#comment-16062196
 ] 

Mark Grover commented on SPARK-18016:
-

Thanks for working on this [~aeskilson]. [~cloud_fan], [~rxin], something seems 
off w.r.t. the versions this change is in. It seems like it was committed to the 2.1 
and 2.2 branches and then later taken out of the 2.2 branch because it was deemed 
too risky. That all makes sense, but I believe this change is still in the 2.1 
branch. In order to make sure we have no regression in 2.2, we need to revert 
this from the 2.1 branch as well. 

And, accordingly, we should adjust the Fix Version on this JIRA to be 2.3.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.1.2, 2.2.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> 

[jira] [Created] (SPARK-20756) yarn-shuffle jar has references to unshaded guava and contains scala classes

2017-05-15 Thread Mark Grover (JIRA)
Mark Grover created SPARK-20756:
---

 Summary: yarn-shuffle jar has references to unshaded guava and 
contains scala classes
 Key: SPARK-20756
 URL: https://issues.apache.org/jira/browse/SPARK-20756
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Mark Grover


There are two problems with the YARN shuffle jar currently:
1. It contains shaded Guava but still has references to unshaded Guava classes.
{code}
# Guava is correctly relocated
>jar -tf common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar | grep 
>guava | head
META-INF/maven/com.google.guava/
META-INF/maven/com.google.guava/guava/
META-INF/maven/com.google.guava/guava/pom.properties
META-INF/maven/com.google.guava/guava/pom.xml
org/spark_project/guava/
org/spark_project/guava/annotations/
org/spark_project/guava/annotations/Beta.class
org/spark_project/guava/annotations/GwtCompatible.class
org/spark_project/guava/annotations/GwtIncompatible.class
org/spark_project/guava/annotations/VisibleForTesting.class

# But, there are still references to unshaded guava
>javap -cp common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar -c 
>org/apache/spark/network/yarn/YarnShuffleService | grep google
  57: invokestatic  #139// Method 
com/google/common/collect/Lists.newArrayList:()Ljava/util/ArrayList;
{code}

2. There are references to Scala classes in the uber jar:
{code}
jar -tf 
/opt/src/spark/common/network-yarn/target/scala-2.11/spark-*yarn-shuffle.jar | 
grep "^scala"
scala/AnyVal.class
{code}

We should fix this.






[jira] [Resolved] (SPARK-20033) spark sql can not use hive permanent function

2017-05-11 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-20033.
-
Resolution: Not A Problem

Marking this JIRA as resolved, accordingly.

> spark sql can not use hive permanent function
> -
>
> Key: SPARK-20033
> URL: https://issues.apache.org/jira/browse/SPARK-20033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> {code}
> spark-sql> SELECT concat_all_ws('-', *) from det.result_set where 
> job_id='1028448' limit 10;
> Error in query: Undefined function: 'concat_all_ws'. This function is neither 
> a registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {code}






[jira] [Commented] (SPARK-20033) spark sql can not use hive permanent function

2017-05-11 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006870#comment-16006870
 ] 

Mark Grover commented on SPARK-20033:
-

Reading the PR, it seems like this is not an issue. The related issue (of 
allowing adding jars from HDFS in Spark), SPARK-12868, was fixed in Spark 2.2.

> spark sql can not use hive permanent function
> -
>
> Key: SPARK-20033
> URL: https://issues.apache.org/jira/browse/SPARK-20033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> {code}
> spark-sql> SELECT concat_all_ws('-', *) from det.result_set where 
> job_id='1028448' limit 10;
> Error in query: Undefined function: 'concat_all_ws'. This function is neither 
> a registered temporary function nor a permanent function registered in the 
> database 'default'.; line 1 pos 7
> {code}






[jira] [Updated] (SPARK-20514) Upgrade Jetty to 9.3.11.v20160721

2017-04-27 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20514:

Summary: Upgrade Jetty to 9.3.11.v20160721  (was: Upgrade Jetty to 
9.3.13.v20161014)

> Upgrade Jetty to 9.3.11.v20160721
> -
>
> Key: SPARK-20514
> URL: https://issues.apache.org/jira/browse/SPARK-20514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> Currently, we are using Jetty version 9.2.16.v20160414.
> However, Hadoop 3 uses 
> [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
>  (Jetty upgrade was brought in by HADOOP-10075).
> Currently, when you try to build Spark with Hadoop 3, due to this 
> incompatibility in Jetty versions used by Hadoop and Spark, compilation 
> fails with:
> {code}
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
> error: object gzip is not a member of package org.eclipse.jetty.servlets
> [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
> [ERROR]   ^
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
> error: not found: type GzipHandler
> [ERROR]   val gzipHandler = new GzipHandler
> [ERROR] ^
> [ERROR] two errors found
> {code}
> So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.
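For context, the sketch below shows roughly the kind of change the gzip handling in JettyUtils would need after the upgrade: in Jetty 9.3 the class moved out of the jetty-servlets gzip package into jetty-server. This is illustrative only, not the actual Spark patch.
{code}
// Jetty 9.2 location (what Spark imports today, and what breaks against Jetty 9.3):
//   import org.eclipse.jetty.servlets.gzip.GzipHandler
// Jetty 9.3+ location:
import org.eclipse.jetty.server.handler.gzip.GzipHandler
import org.eclipse.jetty.server.handler.ContextHandler

object GzipSketch {
  // Wrap a context handler with gzip compression, as JettyUtils does for UI handlers.
  def gzipped(handler: ContextHandler): GzipHandler = {
    val gzipHandler = new GzipHandler
    gzipHandler.setHandler(handler)
    gzipHandler
  }
}
{code}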






[jira] [Updated] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014

2017-04-27 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20514:

Description: 
Currently, we are using Jetty version 9.2.16.v20160414.

However, Hadoop 3 uses 
[9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
 (Jetty upgrade was brought in by HADOOP-10075).

Currently, when you try to build Spark with Hadoop 3, due to this 
incompatibility in Jetty versions used by Hadoop and Spark, compilation fails 
with:
{code}
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
error: object gzip is not a member of package org.eclipse.jetty.servlets
[ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
[ERROR]   ^
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
error: not found: type GzipHandler
[ERROR]   val gzipHandler = new GzipHandler
[ERROR] ^
[ERROR] two errors found
{code}

So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.

  was:
Currently, we are using Jetty version 9.2.16.v20160414.

However, Hadoop 3, uses 
[9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
 (Jetty upgrade was brought in by HADOOP-10075).

Currently, when you try to build Spark with Hadoop 3, due to this 
incompatibilities in jetty versions used by Hadoop and Spark, compilation fails 
with:
{code}
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
error: object gzip is not a member of package org.eclipse.jetty.servlets
[ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
[ERROR]   ^
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
error: not found: type GzipHandler
[ERROR]   val gzipHandler = new GzipHandler
[ERROR] ^
[ERROR] two errors found
{code}

So, it'd be good to upgrade Jetty due to this.


> Upgrade Jetty to 9.3.13.v20161014
> -
>
> Key: SPARK-20514
> URL: https://issues.apache.org/jira/browse/SPARK-20514
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> Currently, we are using Jetty version 9.2.16.v20160414.
> However, Hadoop 3 uses 
> [9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
>  (Jetty upgrade was brought in by HADOOP-10075).
> Currently, when you try to build Spark with Hadoop 3, due to this 
> incompatibility in Jetty versions used by Hadoop and Spark, compilation 
> fails with:
> {code}
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
> error: object gzip is not a member of package org.eclipse.jetty.servlets
> [ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
> [ERROR]   ^
> [ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
> error: not found: type GzipHandler
> [ERROR]   val gzipHandler = new GzipHandler
> [ERROR] ^
> [ERROR] two errors found
> {code}
> So, it'd be good to upgrade Jetty to get us closer to working with Hadoop 3.






[jira] [Created] (SPARK-20514) Upgrade Jetty to 9.3.13.v20161014

2017-04-27 Thread Mark Grover (JIRA)
Mark Grover created SPARK-20514:
---

 Summary: Upgrade Jetty to 9.3.13.v20161014
 Key: SPARK-20514
 URL: https://issues.apache.org/jira/browse/SPARK-20514
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Mark Grover


Currently, we are using Jetty version 9.2.16.v20160414.

However, Hadoop 3 uses 
[9.3.11.v20160721|https://github.com/apache/hadoop/blob/release-3.0.0-alpha2-RC0/hadoop-project/pom.xml#L38]
 (Jetty upgrade was brought in by HADOOP-10075).

Currently, when you try to build Spark with Hadoop 3, due to this 
incompatibility in Jetty versions used by Hadoop and Spark, compilation fails 
with:
{code}
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:31: 
error: object gzip is not a member of package org.eclipse.jetty.servlets
[ERROR] import org.eclipse.jetty.servlets.gzip.GzipHandler
[ERROR]   ^
[ERROR] source/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala:238: 
error: not found: type GzipHandler
[ERROR]   val gzipHandler = new GzipHandler
[ERROR] ^
[ERROR] two errors found
{code}

So, it'd be good to upgrade Jetty due to this.






[jira] [Issue Comment Deleted] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-26 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20435:

Comment: was deleted

(was: Thanks Marcelo!


)

> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 and SPARK-19720 were efforts to redact sensitive information 
> (e.g. Hadoop credential provider password, AWS access/secret keys) from event 
> logs + YARN logs + UI, and from the console output, respectively.
> While some unit tests were added along with these changes, they asserted 
> that, when a sensitive key was found, redaction took place for that key. They 
> didn't assert globally that, when running a full-fledged Spark app (whether 
> on YARN or locally), sensitive information was not present in any of the 
> logs or UI. Such a test would also prevent regressions from happening in the 
> future if someone unknowingly adds extra logging that publishes sensitive 
> information to disk or the UI.
> Consequently, it was found that in some Java configurations, sensitive 
> information was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the previous redaction logic only checked whether the key matched 
> the secret regex pattern; if so, it redacted that key's value. That worked for 
> most cases. However, in the above case, the key (sun.java.command) doesn't 
> tell much, so the value needs to be searched. So the check needs to be 
> expanded to match against values as well.






[jira] [Updated] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-26 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20435:


Thanks Marcelo!




> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 and SPARK-19720 were efforts to redact sensitive information 
> (e.g. Hadoop credential provider password, AWS access/secret keys) from event 
> logs + YARN logs + UI, and from the console output, respectively.
> While some unit tests were added along with these changes, they asserted 
> that, when a sensitive key was found, redaction took place for that key. They 
> didn't assert globally that, when running a full-fledged Spark app (whether 
> on YARN or locally), sensitive information was not present in any of the 
> logs or UI. Such a test would also prevent regressions from happening in the 
> future if someone unknowingly adds extra logging that publishes sensitive 
> information to disk or the UI.
> Consequently, it was found that in some Java configurations, sensitive 
> information was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the previous redaction logic only checked whether the key matched 
> the secret regex pattern; if so, it redacted that key's value. That worked for 
> most cases. However, in the above case, the key (sun.java.command) doesn't 
> tell much, so the value needs to be searched. So the check needs to be 
> expanded to match against values as well.






[jira] [Updated] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-26 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-20435:


Thanks Marcelo!




> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 and SPARK-19720 were efforts to redact sensitive information 
> (e.g. Hadoop credential provider password, AWS access/secret keys) from event 
> logs + YARN logs + UI, and from the console output, respectively.
> While some unit tests were added along with these changes, they asserted 
> that, when a sensitive key was found, redaction took place for that key. They 
> didn't assert globally that, when running a full-fledged Spark app (whether 
> on YARN or locally), sensitive information was not present in any of the 
> logs or UI. Such a test would also prevent regressions from happening in the 
> future if someone unknowingly adds extra logging that publishes sensitive 
> information to disk or the UI.
> Consequently, it was found that in some Java configurations, sensitive 
> information was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the previous redaction logic only checked whether the key matched 
> the secret regex pattern; if so, it redacted that key's value. That worked for 
> most cases. However, in the above case, the key (sun.java.command) doesn't 
> tell much, so the value needs to be searched. So the check needs to be 
> expanded to match against values as well.






[jira] [Commented] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-24 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981730#comment-15981730
 ] 

Mark Grover commented on SPARK-20435:
-

bq. I'm not saying redacting from logs is useless, but I'm saying that a user 
that is providing secrets in the command line is giving up any security, and 
redaction won't save him.
Thanks for the ps ax explanation. I appreciated your input and agree that 
redacting from logs is not useless.

The way it is, there are two ways to supply passwords:
1. The user copies over the entire conf (say from /etc/spark/conf to 
$USER/custom-conf). And, then updates the spark-defaults.conf with the 
appropriate properties containing the password. And, runs Spark jobs with this 
custom configuration. The benefit is that without any change in Spark today, 
they can run the jobs and the password won't be leaked anywhere. However, the 
disadvantage is it is hard to keep the custom configuration in sync given the 
lack of an overlay style config today in Spark. Moreover, the password is being 
written by the user to possibly unencrypted disk in the custom configuration.
2. Supply the password on the command line to spark-submit. The advantage is that 
there's no custom configuration to be maintained, there's no password being 
persisted to a file by the user. However, during the duration of the job, the 
password is visible through output of commands like 'ps ax' and with the 
current version of Spark, the password shows up in HDFS, in the event logs and 
anything derived from them. And, the latter may not be secure. This change is 
to make this case less bad by redacting passwords from HDFS event logs. 
Furthermore, as a benefit, we get to add some unit tests that make sure none of 
the redaction functionality regresses in the future.

I think both the above methods have their pros and cons and I think it's best 
for us to document both ways and let the users choose which method they prefer. 
This change makes #2 slightly less bad and I think it's worth doing. 
Your points make sense, but it still seems worth making #2 less bad. And, if 
you agree, I'd really appreciate your review of the PR. Thanks!

> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> SPARK-18535 and SPARK-19720 were efforts to redact sensitive information 
> (e.g. Hadoop credential provider password, AWS access/secret keys) from event 
> logs + YARN logs + UI, and from the console output, respectively.
> While some unit tests were added along with these changes, they asserted 
> that, when a sensitive key was found, redaction took place for that key. They 
> didn't assert globally that, when running a full-fledged Spark app (whether 
> on YARN or locally), sensitive information was not present in any of the 
> logs or UI. Such a test would also prevent regressions from happening in the 
> future if someone unknowingly adds extra logging that publishes sensitive 
> information to disk or the UI.
> Consequently, it was found that in some Java configurations, sensitive 
> information was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the previous redaction logic only checked whether the key matched 
> the secret regex pattern; if so, it redacted that key's value. That worked for 
> most cases. However, in the above case, the key (sun.java.command) doesn't 
> tell much, so the value needs to be searched. So the check needs to be 
> expanded to match against values as well.






[jira] [Commented] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-21 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15979741#comment-15979741
 ] 

Mark Grover commented on SPARK-20435:
-

{quote}
If someone is typing passwords in the process's command line, they have bigger 
problems than the password showing up in the logs... (a.k.a. "ps ax")
{quote}
Thanks for your comment, Marcelo. Providing passwords that way is supported by 
Spark, terminal sessions end, and ps ax only works for users with appropriate 
privileges while the process is running. Log files, on the other hand, land on 
disks that may not be encrypted and may be blindly shared over unencrypted 
channels (say, for debugging), so this is still a good thing to do, in my opinion.

> More thorough redaction of sensitive information from logs/UI, more unit tests
> --
>
> Key: SPARK-20435
> URL: https://issues.apache.org/jira/browse/SPARK-20435
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> SPARK-18535 and SPARK-19720 were efforts to redact sensitive information 
> (e.g. Hadoop credential provider password, AWS access/secret keys) from event 
> logs + YARN logs + UI, and from the console output, respectively.
> While some unit tests were added along with these changes, they asserted 
> that, when a sensitive key was found, redaction took place for that key. They 
> didn't assert globally that, when running a full-fledged Spark app (whether 
> on YARN or locally), sensitive information was not present in any of the 
> logs or UI. Such a test would also prevent regressions from happening in the 
> future if someone unknowingly adds extra logging that publishes sensitive 
> information to disk or the UI.
> Consequently, it was found that in some Java configurations, sensitive 
> information was still being leaked in the event logs under the 
> {{SparkListenerEnvironmentUpdate}} event, like so:
> {code}
> "sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
> spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
> {code}
> "secret_password" should have been redacted.
> Moreover, the previous redaction logic only checked whether the key matched 
> the secret regex pattern; if so, it redacted that key's value. That worked for 
> most cases. However, in the above case, the key (sun.java.command) doesn't 
> tell much, so the value needs to be searched. So the check needs to be 
> expanded to match against values as well.






[jira] [Created] (SPARK-20435) More thorough redaction of sensitive information from logs/UI, more unit tests

2017-04-21 Thread Mark Grover (JIRA)
Mark Grover created SPARK-20435:
---

 Summary: More thorough redaction of sensitive information from 
logs/UI, more unit tests
 Key: SPARK-20435
 URL: https://issues.apache.org/jira/browse/SPARK-20435
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Mark Grover


SPARK-18535 and SPARK-19720 were efforts to redact sensitive information (e.g. 
Hadoop credential provider password, AWS access/secret keys) from event logs + 
YARN logs + UI, and from the console output, respectively.

While some unit tests were added along with these changes, they asserted that, 
when a sensitive key was found, redaction took place for that key. They didn't 
assert globally that, when running a full-fledged Spark app (whether on YARN or 
locally), sensitive information was not present in any of the logs or UI. 
Such a test would also prevent regressions from happening in the future if 
someone unknowingly adds extra logging that publishes sensitive information 
to disk or the UI.

Consequently, it was found that in some Java configurations, sensitive 
information was still being leaked in the event logs under the 
{{SparkListenerEnvironmentUpdate}} event, like so:
{code}
"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ...
{code}

"secret_password" should have been redacted.

Moreover, the previous redaction logic only checked whether the key matched the 
secret regex pattern; if so, it redacted that key's value. That worked for most 
cases. However, in the above case, the key (sun.java.command) doesn't tell much, 
so the value needs to be searched. So the check needs to be expanded to match 
against values as well.
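
A minimal sketch of the expanded check being described, i.e. matching the secret pattern against both keys and values; this is an illustration, not Spark's actual redaction implementation:
{code}
object RedactionSketch {
  private val Redacted = "*********(redacted)"

  // Redact a property if either its key or its value matches the secret pattern,
  // so entries like sun.java.command (harmless key, sensitive value) are covered.
  def redact(secretPattern: scala.util.matching.Regex,
             kvs: Seq[(String, String)]): Seq[(String, String)] = {
    kvs.map {
      case (key, _) if secretPattern.findFirstIn(key).isDefined => (key, Redacted)
      case (key, value) if secretPattern.findFirstIn(value).isDefined => (key, Redacted)
      case kv => kv
    }
  }

  def main(args: Array[String]): Unit = {
    val props = Seq(
      "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD" -> "some-value",
      "sun.java.command" -> "SparkSubmit --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password",
      "spark.app.name" -> "example")
    redact("(?i)secret|password".r, props).foreach(println)
  }
}
{code}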






[jira] [Commented] (SPARK-20327) Add CLI support for YARN-3926

2017-04-13 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968149#comment-15968149
 ] 

Mark Grover commented on SPARK-20327:
-

Daniel, we don't assign JIRAs in Spark. Folks issue a PR and once the PR gets 
merged, the committer will assign the JIRA to the contributor.

> Add CLI support for YARN-3926
> -
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.






[jira] [Resolved] (SPARK-20066) Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java

2017-03-23 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-20066.
-
Resolution: Won't Fix

> Add explicit SecurityManager(SparkConf) constructor for backwards 
> compatibility with Java
> -
>
> Key: SPARK-20066
> URL: https://issues.apache.org/jira/browse/SPARK-20066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Mark Grover
>
> SPARK-19520 added an optional argument (ioEncryptionKey) to Security Manager 
> class. And, it has a default value, so life is great.
> However, that's not enough when invoking the class from Java. We didn't see 
> this before because the SecurityManager class is private to the spark package 
> and all the code that uses it is Scala.
> However, I have some code that was extending it, in Java, and that code 
> breaks because Java can't access that default value (more details 
> [here|http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const]).






[jira] [Commented] (SPARK-20066) Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java

2017-03-22 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937658#comment-15937658
 ] 

Mark Grover commented on SPARK-20066:
-

I have attached some simple test code here: 
https://github.com/markgrover/spark-20066

With the current state of Spark,
mvn clean package -Dspark.version=2.1.0 fails.

mvn clean package -Dspark.version=2.0.0 passes.

> Add explicit SecurityManager(SparkConf) constructor for backwards 
> compatibility with Java
> -
>
> Key: SPARK-20066
> URL: https://issues.apache.org/jira/browse/SPARK-20066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Mark Grover
>
> SPARK-19520 added an optional argument (ioEncryptionKey) to Security Manager 
> class. And, it has a default value, so life is great.
> However, that's not enough when invoking the class from Java. We didn't see 
> this before because the SecurityManager class is private to the spark package 
> and all the code that uses it is Scala.
> However, I have some code that was extending it, in Java, and that code 
> breaks because Java can't access that default value (more details 
> [here|http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const]).






[jira] [Created] (SPARK-20066) Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java

2017-03-22 Thread Mark Grover (JIRA)
Mark Grover created SPARK-20066:
---

 Summary: Add explicit SecurityManager(SparkConf) constructor for 
backwards compatibility with Java
 Key: SPARK-20066
 URL: https://issues.apache.org/jira/browse/SPARK-20066
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1, 2.2.0
Reporter: Mark Grover


SPARK-19520 added an optional argument (ioEncryptionKey) to Security Manager 
class. And, it has a default value, so life is great.

However, that's not enough when invoking the class from Java. We didn't see 
this before because the SecurityManager class is private to the spark package 
and all the code that uses it is Scala.

However, I have some code that was extending it, in Java, and that code breaks 
because Java can't access that default value (more details 
[here|http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const]).
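To illustrate the interop problem and the constructor this JIRA proposes, here is a simplified Scala sketch with stand-in types (not Spark's actual SecurityManager):
{code}
// Stand-ins for the real classes, to keep the sketch self-contained.
class MySparkConf

class MySecurityManager(
    conf: MySparkConf,
    ioEncryptionKey: Option[Array[Byte]] = None) {

  // From Scala, `new MySecurityManager(conf)` compiles: the default fills in the key.
  // From Java, Scala default arguments are not visible as an ordinary overload, so the
  // one-argument call does not compile. An explicit auxiliary constructor (what the
  // JIRA title asks for) restores the one-argument form for Java callers:
  def this(conf: MySparkConf) = this(conf, None)
}

object SecurityManagerSketch {
  def main(args: Array[String]): Unit = {
    val mgr = new MySecurityManager(new MySparkConf)  // the form Java callers need
    println(mgr)
  }
}
{code}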






[jira] [Commented] (SPARK-19734) OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

2017-03-01 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890825#comment-15890825
 ] 

Mark Grover commented on SPARK-19734:
-

I don't mean to step on any toes, but since there wasn't any activity here for 
the past few days, I decided to issue a PR 
(https://github.com/apache/spark/pull/17127/).

Corey, if you already have a PR, I would gladly have it supersede mine.

> OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst
> -
>
> Key: SPARK-19734
> URL: https://issues.apache.org/jira/browse/SPARK-19734
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.0
>Reporter: Corey
>Priority: Minor
>  Labels: documentation, easyfix
>
> The {{OneHotEncoder.__init__}} doc string in PySpark has an input keyword 
> listed as {{includeFirst}}, whereas the code actually uses {{dropLast}}.
> This is especially confusing because the {{__init__}} function accepts only 
> keywords, and following the documentation on the web 
> (https://spark.apache.org/docs/2.0.1/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
>  or of {{help}} in Python will result in the error:
> {quote}
> TypeError: __init__() got an unexpected keyword argument 'includeFirst'
> {quote}
> The error is immediately viewable in the source code:
> {code}
> @keyword_only
> def __init__(self, dropLast=True, inputCol=None, outputCol=None):
> """
> __init__(self, includeFirst=True, inputCol=None, outputCol=None)
> """
> {code}






[jira] [Created] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-02-23 Thread Mark Grover (JIRA)
Mark Grover created SPARK-19720:
---

 Summary: Redact sensitive information from SparkSubmit console 
output
 Key: SPARK-19720
 URL: https://issues.apache.org/jira/browse/SPARK-19720
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.2.0
Reporter: Mark Grover


SPARK-18535 took care of redacting sensitive information from Spark event logs 
and UI. However, it intentionally didn't bother redacting the same sensitive 
information from SparkSubmit's console output because it was on the client's 
machine, which already had the sensitive information on disk (in 
spark-defaults.conf) or on terminal (spark-submit command line).

However, it now seems better to redact information from SparkSubmit's 
console output as well, because orchestration software like Oozie usually exposes 
SparkSubmit's console output via a UI. To make matters worse, Oozie, in 
particular, always sets the {{--verbose}} flag on SparkSubmit invocation, 
making the sensitive information readily available in its UI (see 
[code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
 here).

This is a JIRA for tracking redaction of sensitive information from 
SparkSubmit's console output.






[jira] [Updated] (SPARK-18120) QueryExecutionListener method doesnt' get executed for DataFrameWriter methods

2016-12-09 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-18120:

Description: QueryExecutionListener is a class that has methods named 
onSuccess() and onFailure() that get called when a query is executed. Each of 
those methods takes a QueryExecution object as a parameter, which can be used 
for metrics analysis. The listener gets called for several of the Dataset methods, 
like take, head, first, collect, etc., but doesn't get called for any of the 
DataFrameWriter methods, like saveAsTable, save, etc.   (was: 
QueryExecutionListener is a class that has methods named onSuccess() and 
onFailure() that gets called when a query is executed. Each of those methods 
takes a QueryExecution object as a parameter which can be used for metrics 
analysis. It gets called for several of the DataSet methods like take, head, 
first, collect etc. but doesn't get called for any of hte DataFrameWriter 
methods like saveAsTable, save etc. )

> QueryExecutionListener method doesnt' get executed for DataFrameWriter methods
> --
>
> Key: SPARK-18120
> URL: https://issues.apache.org/jira/browse/SPARK-18120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Salil Surendran
>
> QueryExecutionListener is a class that has methods named onSuccess() and 
> onFailure() that get called when a query is executed. Each of those methods 
> takes a QueryExecution object as a parameter, which can be used for metrics 
> analysis. The listener gets called for several of the Dataset methods, like 
> take, head, first, collect, etc., but doesn't get called for any of the 
> DataFrameWriter methods, like saveAsTable, save, etc. 
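
For reference, a small Scala sketch of how such a listener is registered and exercised (the listener API shown is the public {{org.apache.spark.sql.util.QueryExecutionListener}}; whether the writer call at the end actually fires it is exactly what this issue reports):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

object ListenerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("listener-sketch").master("local[*]").getOrCreate()

    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        println(s"onSuccess: $funcName took ${durationNs / 1e6} ms")
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
        println(s"onFailure: $funcName failed with ${exception.getMessage}")
    })

    val df = spark.range(10).toDF("id")
    df.collect()                                   // Dataset actions like collect() invoke the listener
    df.write.mode("overwrite").saveAsTable("t1")   // per this report, DataFrameWriter methods did not
    spark.stop()
  }
}
{code}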






[jira] [Comment Edited] (SPARK-18535) Redact sensitive information from Spark logs and UI

2016-11-21 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685225#comment-15685225
 ] 

Mark Grover edited comment on SPARK-18535 at 11/22/16 12:36 AM:


I just issued a PR for this, that adds a new customizable property for 
determining what configuration properties are sensitive. Attached is an image 
from the UI with this change.
Here's the text in the YARN logs, with this change:
{{HADOOP_CREDSTORE_PASSWORD -> *(redacted)}}

Here's the text in the event logs, with this change:
{code}
...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)",...
{code}


was (Author: mgrover):
I just issued a PR for this, that adds a new customizable property for 
determining what configuration properties are sensitive. Attached is an image 
from the UI with this change.
Here's the text in the YARN logs, with this change:
{{HADOOP_CREDSTORE_PASSWORD -> *(redacted)}}

Here's the text in the event logs, with this change:
{{...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)",...}}

> Redact sensitive information from Spark logs and UI
> ---
>
> Key: SPARK-18535
> URL: https://issues.apache.org/jira/browse/SPARK-18535
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.1.0
>Reporter: Mark Grover
> Attachments: redacted.png
>
>
> A Spark user may have to provide sensitive information for a Spark 
> configuration property, or source an environment variable in the 
> executor or driver environment that contains sensitive information. A good 
> example of this would be when reading/writing data from/to S3 using Spark. 
> The S3 secret and S3 access key can be placed in a [hadoop credential 
> provider|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html].
>  However, one still needs to provide the password for the credential provider 
> to Spark, which is typically supplied as an environment variable to the 
> driver and executor environments. This environment variable shows up in logs, 
> and may also show up in the UI.
> 1. For logs, it shows up in a few places:
>   1A. Event logs under {{SparkListenerEnvironmentUpdate}} event.
>   1B. YARN logs, when printing the executor launch context.
> 2. For UI, it would show up in the _Environment_ tab, but it is redacted if 
> it contains the words "password" or "secret" in it. And, these magic words 
> are 
> [hardcoded|https://github.com/apache/spark/blob/a2d464770cd183daa7d727bf377bde9c21e29e6a/core/src/main/scala/org/apache/spark/ui/env/EnvironmentPage.scala#L30]
>  and hence not customizable.
> This JIRA is to track the work to make sure sensitive information is redacted 
> from all logs and UIs in Spark, while still being passed on to all relevant 
> places it needs to get passed on to.






[jira] [Commented] (SPARK-18535) Redact sensitive information from Spark logs and UI

2016-11-21 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15685225#comment-15685225
 ] 

Mark Grover commented on SPARK-18535:
-

I just issued a PR for this, that adds a new customizable property for 
determining what configuration properties are sensitive. Attached is an image 
from the UI with this change.
Here's the text in the YARN logs, with this change:
{{HADOOP_CREDSTORE_PASSWORD -> *(redacted)}}

Here's the text in the event logs, with this change:
{{...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)",...}}

> Redact sensitive information from Spark logs and UI
> ---
>
> Key: SPARK-18535
> URL: https://issues.apache.org/jira/browse/SPARK-18535
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.1.0
>Reporter: Mark Grover
> Attachments: redacted.png
>
>
> A Spark user may have to provide sensitive information for a Spark 
> configuration property, or source an environment variable in the 
> executor or driver environment that contains sensitive information. A good 
> example of this would be when reading/writing data from/to S3 using Spark. 
> The S3 secret and S3 access key can be placed in a [hadoop credential 
> provider|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html].
>  However, one still needs to provide the password for the credential provider 
> to Spark, which is typically supplied as an environment variable to the 
> driver and executor environments. This environment variable shows up in logs, 
> and may also show up in the UI.
> 1. For logs, it shows up in a few places:
>   1A. Event logs under {{SparkListenerEnvironmentUpdate}} event.
>   1B. YARN logs, when printing the executor launch context.
> 2. For UI, it would show up in the _Environment_ tab, but it is redacted if 
> it contains the words "password" or "secret" in it. And, these magic words 
> are 
> [hardcoded|https://github.com/apache/spark/blob/a2d464770cd183daa7d727bf377bde9c21e29e6a/core/src/main/scala/org/apache/spark/ui/env/EnvironmentPage.scala#L30]
>  and hence not customizable.
> This JIRA is to track the work to make sure sensitive information is redacted 
> from all logs and UIs in Spark, while still being passed on to all relevant 
> places it needs to get passed on to.






[jira] [Updated] (SPARK-18535) Redact sensitive information from Spark logs and UI

2016-11-21 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-18535:

Attachment: redacted.png

> Redact sensitive information from Spark logs and UI
> ---
>
> Key: SPARK-18535
> URL: https://issues.apache.org/jira/browse/SPARK-18535
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.1.0
>Reporter: Mark Grover
> Attachments: redacted.png
>
>
> A Spark user may have to provide sensitive information for a Spark 
> configuration property, or source an environment variable in the 
> executor or driver environment that contains sensitive information. A good 
> example of this would be when reading/writing data from/to S3 using Spark. 
> The S3 secret and S3 access key can be placed in a [hadoop credential 
> provider|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html].
>  However, one still needs to provide the password for the credential provider 
> to Spark, which is typically supplied as an environment variable to the 
> driver and executor environments. This environment variable shows up in logs, 
> and may also show up in the UI.
> 1. For logs, it shows up in a few places:
>   1A. Event logs under {{SparkListenerEnvironmentUpdate}} event.
>   1B. YARN logs, when printing the executor launch context.
> 2. For UI, it would show up in the _Environment_ tab, but it is redacted if 
> it contains the words "password" or "secret" in it. And, these magic words 
> are 
> [hardcoded|https://github.com/apache/spark/blob/a2d464770cd183daa7d727bf377bde9c21e29e6a/core/src/main/scala/org/apache/spark/ui/env/EnvironmentPage.scala#L30]
>  and hence not customizable.
> This JIRA is to track the work to make sure sensitive information is redacted 
> from all logs and UIs in Spark, while still being passed on to all relevant 
> places it needs to get passed on to.






[jira] [Created] (SPARK-18535) Redact sensitive information from Spark logs and UI

2016-11-21 Thread Mark Grover (JIRA)
Mark Grover created SPARK-18535:
---

 Summary: Redact sensitive information from Spark logs and UI
 Key: SPARK-18535
 URL: https://issues.apache.org/jira/browse/SPARK-18535
 Project: Spark
  Issue Type: Bug
  Components: Web UI, YARN
Affects Versions: 2.1.0
Reporter: Mark Grover


A Spark user may have to provide sensitive information for a Spark 
configuration property, or source an environment variable in the executor 
or driver environment that contains sensitive information. A good example of 
this would be when reading/writing data from/to S3 using Spark. The S3 secret 
and S3 access key can be placed in a [hadoop credential 
provider|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html].
 However, one still needs to provide the password for the credential provider 
to Spark, which is typically supplied as an environment variable to the driver 
and executor environments. This environment variable shows up in logs, and may 
also show up in the UI.

1. For logs, it shows up in a few places:
  1A. Event logs under {{SparkListenerEnvironmentUpdate}} event.
  1B. YARN logs, when printing the executor launch context.
2. For UI, it would show up in the _Environment_ tab, but it is redacted if it 
contains the words "password" or "secret" in it. And, these magic words are 
[hardcoded|https://github.com/apache/spark/blob/a2d464770cd183daa7d727bf377bde9c21e29e6a/core/src/main/scala/org/apache/spark/ui/env/EnvironmentPage.scala#L30]
 and hence not customizable.

This JIRA is to track the work to make sure sensitive information is redacted 
from all logs and UIs in Spark, while still being passed on to all relevant 
places it needs to get passed on to.






[jira] [Commented] (SPARK-17850) HadoopRDD should not swallow EOFException

2016-11-21 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684685#comment-15684685
 ] 

Mark Grover commented on SPARK-17850:
-

Hi [~zsxwing] and [~srowen], the JIRA fix version seems to suggest that it's in 
both the Spark 2.0 and Spark 2.1 branches, but I don't see it in 
branch-2.0. Should the fix version be 2.1.0 only?

> HadoopRDD should not swallow EOFException
> -
>
> Key: SPARK-17850
> URL: https://issues.apache.org/jira/browse/SPARK-17850
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: correctness
> Fix For: 2.0.2, 2.1.0
>
>
> The code in 
> https://github.com/apache/spark/blob/2bcd5d5ce3eaf0eb1600a12a2b55ddb40927533b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L256
> catches EOFException and marks the RecordReader as finished. However, in some cases, 
> RecordReader will throw EOFException to indicate the stream is corrupted. See 
> the following stack trace as an example:
> {code}
> Caused by: java.io.EOFException: Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:164)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:134)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:97)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Then HadoopRDD doesn't fail the job when files are corrupted (e.g., corrupted 
> gzip files).
> Note: NewHadoopRDD doesn't have this issue.
> This is reported by Bilal Aslam.
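To make the failure mode above concrete, here is a simplified sketch (not the actual HadoopRDD code) of why swallowing EOFException hides corruption: a reader loop that treats EOFException as end-of-input will return a silently truncated result for a corrupted gzip file instead of failing the task. The function and names below are illustrative only.

{code}
import java.io.EOFException

// Simplified sketch: `next` stands in for a RecordReader. Swallowing
// EOFException makes a corrupted stream indistinguishable from a clean end of
// input, so the caller sees a short, silently-truncated result.
def readAll(next: () => Option[String]): Seq[String] = {
  val out = scala.collection.mutable.ArrayBuffer[String]()
  var finished = false
  while (!finished) {
    try {
      next() match {
        case Some(record) => out += record
        case None         => finished = true
      }
    } catch {
      case _: EOFException => finished = true // corruption is treated as normal EOF here
    }
  }
  out.toSeq
}
{code}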






[jira] [Updated] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-18093:

Description: 
At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} in 
SQLConfSuite fails because left side of the assert doesn't have a trailing 
slash while the right does.

As [~srowen] mentions 
[here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the JVM 
adds a trailing slash if the directory exists and doesn't if it doesn't. I 
think it'd be good for the test to work regardless of the directory's existence.

  was:
At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
fails because left side of the assert doesn't have a trailing slash while the 
right does.

As [~srowen] mentions 
[here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the JVM 
adds a trailing slash if the directory exists and doesn't if it doesn't. I 
think it'd be good for the test to work regardless of the directory's existence.


> Fix default value test in SQLConfSuite to work regardless of warehouse dir's 
> existence
> --
>
> Key: SPARK-18093
> URL: https://issues.apache.org/jira/browse/SPARK-18093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mark Grover
>
> At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
> in SQLConfSuite fails because left side of the assert doesn't have a trailing 
> slash while the right does.
> As [~srowen] mentions 
> [here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the 
> JVM adds a trailing slash if the directory exists and doesn't if it doesn't. 
> I think it'd be good for the test to work regardless of the directory's 
> existence.






[jira] [Commented] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605356#comment-15605356
 ] 

Mark Grover commented on SPARK-18093:
-

Yeah, I thought so too - but it failed on two different environments for me - 
an internal Jenkins job and my mac. Perhaps it's related to some of the 
profiles/properties I am setting?

Anyways, filed a PR. Thanks!
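For reference, a minimal sketch of the kind of normalization such a fix can use (illustrative only, not the exact change in the PR; the paths are made up): strip any trailing separator from both sides before comparing, so the assertion passes whether or not the warehouse directory already exists.

{code}
// Illustrative only; the paths below are placeholders.
def stripTrailingSlash(p: String): String = p.stripSuffix("/")

val expected = "file:/some/path/spark-warehouse/" // JVM may append "/" when the dir exists
val actual   = "file:/some/path/spark-warehouse"
assert(stripTrailingSlash(expected) == stripTrailingSlash(actual))
{code}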

> Fix default value test in SQLConfSuite to work regardless of warehouse dir's 
> existence
> --
>
> Key: SPARK-18093
> URL: https://issues.apache.org/jira/browse/SPARK-18093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mark Grover
>Priority: Minor
>
> At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
> in SQLConfSuite fails because left side of the assert doesn't have a trailing 
> slash while the right does.
> As [~srowen] mentions 
> [here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the 
> JVM adds a trailing slash if the directory exists and doesn't if it doesn't. 
> I think it'd be good for the test to work regardless of the directory's 
> existence.






[jira] [Created] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Mark Grover (JIRA)
Mark Grover created SPARK-18093:
---

 Summary: Fix default value test in SQLConfSuite to work regardless 
of warehouse dir's existence
 Key: SPARK-18093
 URL: https://issues.apache.org/jira/browse/SPARK-18093
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2, 2.1.0
Reporter: Mark Grover


At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
fails because left side of the assert doesn't have a trailing slash while the 
right does.

As [~srowen] mentions 
[here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the JVM 
adds a trailing slash if the directory exists and doesn't if it doesn't. I 
think it'd be good for the test to work regardless of the directory's existence.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330675#comment-15330675
 ] 

Mark Grover commented on SPARK-12177:
-

bq. I can rename it to spark-streaming-kafka-0-10 to match the change made
for the 0.8 consumer
Thanks!

bq. Mark, have you (or anyone else) actually tried this PR out using TLS?
No, I haven't, sorry.

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-14 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330500#comment-15330500
 ] 

Mark Grover commented on SPARK-12177:
-

bq. It's worth mentioning that authentication is also supported via TLS. I am 
aware of a number of people who are using TLS for both authentication and 
encryption. So, the security benefit is available now for some people, at least.
Fair point, thanks.

Ok, so what remains to get this in?
1. The PR (https://github.com/apache/spark/pull/11863) has been reviewed by me, so it 
probably needs to be reviewed by a committer.
2. Sorry for sounding like a broken record, but I don't think kafka-beta as the 
name for the subproject makes much sense, especially now that the new consumer 
API in Kafka 0.10 is not beta. So, some committer buy-in would be valuable 
there too.

Anything else?

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-13 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328699#comment-15328699
 ] 

Mark Grover commented on SPARK-12177:
-

Hi Ismael and Cody,
My personal opinion was to hold off because a) The new consumer API was still 
marked as beta, and so I wasn't sure of the compatibility guarantees, which 
Kafka did seem to break a little (as discussed 
[here|http://mail-archives.apache.org/mod_mbox/kafka-dev/201605.mbox/%3CCAKm=r7v5jgg9qxgjioczdph9vej57m46ngy_626kiq-ovdx...@mail.gmail.com%3E])
 b) the real benefit is security - I am personally a little more biased towards 
authentication (Kerberos) than encryption, so I was just waiting for delegation 
tokens to land. 

Now that 0.10.0 is released, there's a good chance delegation tokens will 
land in Kafka 0.11.0, and the new consumer API is marked stable, so I am more open 
to this PR being merged; it's been around for too long anyway. Cody, what do 
you say? Any reason you'd want to wait? If not, we can make a case for this 
going in now.

As far as the logistics of whether this belongs in Apache Bahir or not - today, I 
don't have a strong opinion on where the kafka integration should reside. What I do 
feel strongly about, like Cody said, is that the old consumer API integration 
and the new consumer API integration should reside in the same place. Since the old 
integration is in Spark, that's where the new one makes sense. If a vote on Apache 
Spark results in the Kafka integration being taken out, having both the new and the old in 
Apache Bahir would make sense.

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.






[jira] [Updated] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-05-10 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-12177:

Target Version/s:   (was: 2.0.0)

Removing the target version of 2.0.0.

Holler if you disagree.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-05-06 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274910#comment-15274910
 ] 

Mark Grover commented on SPARK-12177:
-

I spent some time earlier today looking at the latest Kafka 0.10 RC. Thanks 
Cody, looks like you poked at it too.

As Cody found out, the new kafka consumer API is in flux. I found an 
incompatibility in KafkaConfig and filed a PR for it (KAFKA-3669), which is now 
fixed in Kafka 0.10. But there's more - KAFKA-3633, which discusses fixing 
a compatibility break between 0.9 and 0.10, is also of note and remains unresolved.

So, at this point, I can say that there's no point in committing the PR 
associated with this JIRA (https://github.com/apache/spark/pull/11863) until 
at least Kafka 0.10.0 is released and we have a good sense that Kafka 0.11.0 is 
not going to break compatibility with 0.10.0. Otherwise, we have to bear the 
burden of adding and maintaining the complexity to build against multiple versions 
of Kafka, something the Storm folks are already suffering from, now that their 
KafkaSpout uses the new Kafka consumer API from 0.9.

Given the timing of the Kafka 0.10.0 release and Spark 2 release, this JIRA 
likely wouldn't get resolved in 2.0.




> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-05-03 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268696#comment-15268696
 ] 

Mark Grover commented on SPARK-12177:
-

bq. 1. Rename existing Kafka connector to include 0.8 version in the Maven 
project name.
That sounds good to me.

bq. 2. Don't support 0.9 client for now (otherwise we will need to support 3 
clients very soon). We should revisit the 0.10 support in 2.1, and most likely 
support that.
This is a little more involved. I agree with Cody that the new Kafka consumer 
API implementation in Spark doesn't really have a benefit right now since we 
can't use the security features, which are gated on delegation token support in 
Kafka (KAFKA-1696). However, delegation tokens aren't even going to make it into 
Kafka 0.10, so I see little point in us not committing the new Kafka consumer 
API implementation to Spark _because of that_.

Also, I think
bq. 2. Kafka 0.9 changes the client API.
can be better expressed as 
bq. Kafka 0.9 introduces a new client API.

There are 2 axes - one is the kafka version (0.8/0.9) and the other is the consumer 
API version (old/new). 
Both Kafka 0.8 and 0.9 support the old API without any modifications (for the 
most part), and the existing kafka module in Spark will continue to work with 
Kafka 0.8 and 0.9 (and with Kafka 0.10, I'd imagine; I have been working with the 
Kafka community to report issues like KAFKA-3563, which break old API 
compatibility) because the existing Spark module is based on the old API, which is 
meant to be compatible across all those versions. As far as the new API goes, it 
may change in incompatible ways between 0.9 and 0.10, so we may need a new 
sub-module for a 0.10-based API implementation after all. 

The point I am trying to make is that there's nothing we'd gain by waiting for 
Kafka 0.10 to come out. It's not any better than Kafka 0.9 in terms of support 
for security features. I suppose the only thing you could save on is not having 
an additional subproject, if the new consumer API from Kafka 0.10 broke 
compatibility significantly compared to 0.9's new consumer API (smaller 
incompatibilities can be dealt with using reflection and such). I haven't played with 
the RC yet, so I don't know if even that is the case. So, really, we should either be 
gating this on security features (like KAFKA-1696) going into Kafka, which 
won't happen until at least Kafka 0.11, or put this in right now.

I can definitely look at the latest Kafka 0.10 and see how likely it is that we 
are going to need a new module (probably by end of this week), if that'd help 
in our decision.
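For anyone following along, here is a minimal sketch of the new (0.9/0.10) consumer API this discussion is about; the broker address, group id, and topic below are placeholders, and this is not code from the PR:

{code}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

// Placeholder settings: broker address, group id, and topic name are made up.
val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("group.id", "spark-example")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))
val records = consumer.poll(500) // ConsumerRecords[String, String]
for (r <- records.asScala) {
  println(s"${r.partition}/${r.offset}: ${r.value}")
}
consumer.close()
{code}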

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.






[jira] [Resolved] (SPARK-13252) Bump up Kafka to 0.9.0.0

2016-04-19 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-13252.
-
Resolution: Won't Fix

Marking this as won't fix, and taking the focus back on the discussion in 
SPARK-12177. The very likely outcome will be that the existing kafka 
integration (using the old consumer API) will be built against Kafka 0.8.x, and 
a completely new module for the new kafka consumer api (which is in beta) will 
be introduced that will be built against 0.9.x. Those who rely on the old kafka 
consumer API will not have to upgrade their Kafka version (but can if they want 
to) when they upgrade to Spark 2.0.

> Bump up Kafka to 0.9.0.0
> 
>
> Key: SPARK-13252
> URL: https://issues.apache.org/jira/browse/SPARK-13252
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>  Labels: kafka
>
> Apache Kafka release 0.9.0.0 came out some time ago and we should add support for 
> it. This JIRA is related to SPARK-12177, which is about adding support 
> for the new consumer API only available starting with v0.9.0.0.
> However, we should upgrade Kafka to 0.9.0.0 regardless of (and before) when 
> the support for the new consumer API gets added.
> We also use some non-public APIs from Kafka which have changed in the 0.9.0.0 
> release. So, this change should also take care of updating those usages.






[jira] [Created] (SPARK-14731) Revert SPARK-12130 to make 2.0 shuffle service compatible with 1.x

2016-04-19 Thread Mark Grover (JIRA)
Mark Grover created SPARK-14731:
---

 Summary: Revert SPARK-12130 to make 2.0 shuffle service compatible 
with 1.x
 Key: SPARK-14731
 URL: https://issues.apache.org/jira/browse/SPARK-14731
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.0.0
Reporter: Mark Grover


Discussion on the dev list on [this 
thread|http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html].

Conclusion seems to be that we should try to maintain compatibility between 
Spark 1.x and Spark 2.x's shuffle service so folks who may want to run Spark 1 
and Spark 2 on, say, the same YARN cluster can do that easily while running 
only one shuffle service.






[jira] [Commented] (SPARK-14711) Examples jar not a part of distribution

2016-04-18 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246333#comment-15246333
 ] 

Mark Grover commented on SPARK-14711:
-

Posting here, mostly for search indexing :-)

The error without this change is:
{code}
bin/run-example SparkPi
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.util.Utils$.classForName(Utils.scala:177)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:683)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

> Examples jar not a part of distribution
> ---
>
> Key: SPARK-14711
> URL: https://issues.apache.org/jira/browse/SPARK-14711
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>
> While mucking around with some examples, it seems like the spark-examples jar is 
> not being included in the distribution tarball. Also, it's not in the 
> spark-submit classpath, which means commands like 
> {{run-example}} fail to work, whether a "distribution" tarball is used or a 
> regular {{mvn package}} build.
> The root cause of this may be due to the fact that the spark-examples jar is 
> under {{$SPARK_HOME/examples/target}} while all its dependencies are at 
> {{$SPARK_HOME/examples/target/scala-2.11/jars}}. And, we only seem to be 
> including the jars directory in the classpath. See 
> [here|https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L354]
>  for details.






[jira] [Commented] (SPARK-14711) Examples jar not a part of distribution

2016-04-18 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246263#comment-15246263
 ] 

Mark Grover commented on SPARK-14711:
-

One solution is to move the spark-examples.jar into the jars directory as well. 
That will fix both the issue of getting the jar into the distribution and 
getting {{run-example}} to work. I think that was the original intent 
of SPARK-13576 anyways.
I have a PR ready for that, will upload shortly.

> Examples jar not a part of distribution
> ---
>
> Key: SPARK-14711
> URL: https://issues.apache.org/jira/browse/SPARK-14711
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>
> While mucking around with some examples, it seems like the spark-examples jar is 
> not being included in the distribution tarball. Also, it's not in the 
> spark-submit classpath, which means commands like 
> {{run-example}} fail to work, whether a "distribution" tarball is used or a 
> regular {{mvn package}} build.
> The root cause of this may be due to the fact that the spark-examples jar is 
> under {{$SPARK_HOME/examples/target}} while all its dependencies are at 
> {{$SPARK_HOME/examples/target/scala-2.11/jars}}. And, we only seem to be 
> including the jars directory in the classpath. See 
> [here|https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L354]
>  for details.






[jira] [Created] (SPARK-14711) Examples jar not a part of distribution

2016-04-18 Thread Mark Grover (JIRA)
Mark Grover created SPARK-14711:
---

 Summary: Examples jar not a part of distribution
 Key: SPARK-14711
 URL: https://issues.apache.org/jira/browse/SPARK-14711
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
Reporter: Mark Grover


While mucking around with some examples, it seems like the spark-examples jar is 
not being included in the distribution tarball. Also, it's not in 
the spark-submit classpath, which means commands like {{run-example}} fail 
to work, whether a "distribution" tarball is used or a regular {{mvn package}} 
build.

The root cause of this may be due to the fact that the spark-examples jar is 
under {{$SPARK_HOME/examples/target}} while all its dependencies are at 
{{$SPARK_HOME/examples/target/scala-2.11/jars}}. And, we only seem to be 
including the jars directory in the classpath. See 
[here|https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L354]
 for details.






[jira] [Created] (SPARK-14601) Minor doc/usage changes related to removal of Spark assembly

2016-04-13 Thread Mark Grover (JIRA)
Mark Grover created SPARK-14601:
---

 Summary: Minor doc/usage changes related to removal of Spark 
assembly
 Key: SPARK-14601
 URL: https://issues.apache.org/jira/browse/SPARK-14601
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.0.0
Reporter: Mark Grover


While poking around with 2.0, I noticed that a few places still referred to the 
spark assembly jar, so I updated them where it made sense. I also updated the usage 
section of spark-submit, since you can now use the 'run-example' argument to run an 
example from spark-submit, which was mostly undocumented.






[jira] [Commented] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn

2016-04-08 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232939#comment-15232939
 ] 

Mark Grover commented on SPARK-14477:
-

Thanks Marcelo for committing! Much appreciated!

Ah, I didn't know that was about the ASF mirrors going down, Sean. I have been 
having problems downloading those artifacts and that's what led me to this 
change. Out of curiosity, was there an email thread about the ASF mirrors going 
down?

> Allow custom mirrors for downloading artifacts in build/mvn
> ---
>
> Key: SPARK-14477
> URL: https://issues.apache.org/jira/browse/SPARK-14477
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Assignee: Mark Grover
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala 
> from. It makes sense to override these locations with mirrors in many cases, 
> so this change will add support for that.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-04-08 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232885#comment-15232885
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks Cody. I agree about the separate subproject and I will review the code 
in your PR. Thank you!

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.






[jira] [Created] (SPARK-14477) Allow custom mirrors for downloading artifacts in build/mvn

2016-04-07 Thread Mark Grover (JIRA)
Mark Grover created SPARK-14477:
---

 Summary: Allow custom mirrors for downloading artifacts in 
build/mvn
 Key: SPARK-14477
 URL: https://issues.apache.org/jira/browse/SPARK-14477
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
Reporter: Mark Grover
Priority: Minor


Currently, build/mvn hardcodes the URLs where it downloads mvn and zinc/scala 
from. It makes sense to override these locations with mirrors in many cases, so 
this change will add support for that.






[jira] [Comment Edited] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-22 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207146#comment-15207146
 ] 

Mark Grover edited comment on SPARK-13670 at 3/22/16 7:53 PM:
--

Apologies for the delay in responding. This works for me in the failure case 
on mac. I haven't done exhaustive testing - but my use case that was originally 
broken, which led to this JIRA, is fixed by the proposed fix. Thanks for working 
on this, Marcelo.


was (Author: mgrover):
Apologies for the delay in responding. This works for me in the failure case. I 
haven't done exhaustive testing - but my use-case that was originally broken 
which led to this JIRA, is fixed by the proposed fix. Thanks for working on 
this, Marcelo.

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because the launcher.Main is run within a subshell.






[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-22 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207146#comment-15207146
 ] 

Mark Grover commented on SPARK-13670:
-

Apologies for the delay in responding. This works for me in the failure case. I 
haven't done exhaustive testing - but my use-case that was originally broken 
which led to this JIRA, is fixed by the proposed fix. Thanks for working on 
this, Marcelo.

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because the launcher.Main is run within a subshell.






[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-19 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200188#comment-15200188
 ] 

Mark Grover commented on SPARK-13877:
-

Yeah, that totally makes sense. I agree that it's a big change but I also think 
we can't really keep the same package name if this code moves out of Apache 
Spark.

So should we mark this as Won't Fix then?

> Consider removing Kafka modules from Spark / Spark Streaming
> 
>
> Key: SPARK-13877
> URL: https://issues.apache.org/jira/browse/SPARK-13877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1
>Reporter: Hari Shreedharan
>
> Based on the discussion the PR for SPARK-13843 
> ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), 
> we should consider moving the Kafka modules out of Spark as well. 
> Providing newer functionality (like security) has become painful while 
> maintaining compatibility with older versions of Kafka. Moving this out 
> allows more flexibility, allowing users to mix and match Kafka and Spark 
> versions.






[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-18 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200079#comment-15200079
 ] 

Mark Grover commented on SPARK-13877:
-

I am guessing this needs to be done before the Spark 2.0 code freeze. Also, if we 
are moving this to be outside of Spark, it's not a part of the Apache Spark project 
any more, so in my opinion, we should be updating the maven coordinates and 
package names to be something like {{org.spark-packages.*}}. I am happy to 
volunteer to make those changes, unless someone has an objection.

But, I think it's a big enough change for our end users that we should have a 
dev@ vote thread on this. Also, we need to decide who can commit code 
to this external repo. All Spark committers seems like a safe choice to begin 
with, but that could be expanded later on. Thoughts?

> Consider removing Kafka modules from Spark / Spark Streaming
> 
>
> Key: SPARK-13877
> URL: https://issues.apache.org/jira/browse/SPARK-13877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1
>Reporter: Hari Shreedharan
>
> Based on the discussion the PR for SPARK-13843 
> ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), 
> we should consider moving the Kafka modules out of Spark as well. 
> Providing newer functionality (like security) has become painful while 
> maintaining compatibility with older versions of Kafka. Moving this out 
> allows more flexibility, allowing users to mix and match Kafka and Spark 
> versions.






[jira] [Commented] (SPARK-13877) Consider removing Kafka modules from Spark / Spark Streaming

2016-03-15 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195873#comment-15195873
 ] 

Mark Grover commented on SPARK-13877:
-

I am in support of taking the kafka integration out as well. However, in my 
mind, we should figure out the answers to the following questions before we do 
(some of these have already been aptly pointed out by Cody and Sean):
* Where will the code repo be located?
* Who would have access to commit code?
* How do we track issues there? Github Issues/PRs?
* Whose infrastructure would the test jobs run on?
* Where would the artifacts be released? Probably not on apache.org/dist. If 
not there, then where?

> Consider removing Kafka modules from Spark / Spark Streaming
> 
>
> Key: SPARK-13877
> URL: https://issues.apache.org/jira/browse/SPARK-13877
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1
>Reporter: Hari Shreedharan
>
> Based on the discussion the PR for SPARK-13843 
> ([here|https://github.com/apache/spark/pull/11672#issuecomment-196553283]), 
> we should consider moving the Kafka modules out of Spark as well. 
> Providing newer functionality (like security) has become painful while 
> maintaining compatibility with older versions of Kafka. Moving this out 
> allows more flexibility, allowing users to mix and match Kafka and Spark 
> versions.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-08 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185461#comment-15185461
 ] 

Mark Grover commented on SPARK-12177:
-

For a) I think it's a larger discussion that is relevant to more than just kafka - it'd 
be good for Spark to have a policy on how far back it wants to support various 
versions and how that changes for major vs. minor releases of Spark.

For b) there is this PR: https://github.com/apache/spark/pull/10953 and Cody is 
working on the LRU caching like he said; here's the relevant thread on the 
Spark dev list:
http://apache-spark-developers-list.1001551.n3.nabble.com/Upgrading-to-Kafka-0-9-x-td16466.html

If you'd like to review the PR, that'd be appreciated. Thanks!
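As a rough sketch of the LRU caching idea mentioned above (illustrative only, not the implementation in the PR), an access-ordered {{java.util.LinkedHashMap}} with an eviction hook is one way to do it; the class name, type parameters, and callback below are placeholders:

{code}
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Keep at most `capacity` cached entries, invoking onEvict (e.g. closing a
// consumer) on the least recently used one when the cache overflows.
class LruCache[K, V](capacity: Int, onEvict: V => Unit)
  extends JLinkedHashMap[K, V](capacity, 0.75f, /* accessOrder = */ true) {

  override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean = {
    val evict = size() > capacity
    if (evict) onEvict(eldest.getValue)
    evict
  }
}

// e.g. new LruCache[String, SomeConsumer](64, _.close()), where SomeConsumer is
// whatever per-topic-partition consumer object ends up being cached.
{code}

The eviction hook is where a cached consumer would get closed so connections don't leak.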

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is not 
> compatible with the old one. So, I added the new consumer API in separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I 
> didn't remove the old classes, for backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.






[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180663#comment-15180663
 ] 

Mark Grover commented on SPARK-13670:
-

I have a mac, so I can do some more testing.

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because the launcher.Main is run within a subshell.






[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180516#comment-15180516
 ] 

Mark Grover commented on SPARK-13670:
-

And, it's guaranteed that Main would never return that output?

I feel like we may be overloading too much here but again, I don't have any 
non-ugly ideas either. We may just have to choose our poison.

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because the launcher.Main is run within a subshell.






[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180463#comment-15180463
 ] 

Mark Grover commented on SPARK-13670:
-

Thanks for looking at this. What do you think of #2 - writing the output to a 
file?

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because the launcher.Main is run within a subshell.






[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-03 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179491#comment-15179491
 ] 

Mark Grover commented on SPARK-13670:
-

cc [~vanzin] as fyi.
I ran into this when I was running {{dev/mima}}. It should have failed because 
I didn't have the tools jar and launcher.main() did throw an exception but the 
subshell ate it all up and the mima testing continued, eventually giving me a 
ton of errors since the exclusions list wasn't correctly populated.

Anyways, there are a few ways I can think of fixing it, open to others as well:
1. We could just fix the symptom and have the dev/mima script explicitly check 
for the tools directory and tools jar in bash and break if they're not present. 
However, a) that's just fixing the symptom, not the root cause, and b) we already 
have those checks in the launcher code.
2. We could have the command write the output to a file instead of reading it 
directly from a subshell. That makes error handling easier.
3. We can have the subshell kill the parent process on error, essentially 
running the subshell like so:
{code}
("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" || kill 
$$)
{code}
The side effect is that any other subshells spawned by the parent (i.e. 
spark-class) will be terminated as well. Based on a quick look, I didn't see any 
other subshells though.

Thoughts? Preferences?

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top-level script is marked 
> {{set -e}}. This is because the launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-03 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-13670:

Description: 
There's a particular snippet in spark-class 
[here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to 
allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a 
while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}

The problem is that if the launcher Main fails, this code still returns success 
and continues, even though the top-level script is marked {{set -e}}. This is 
because launcher.Main is run within a subshell.

  was:
There's a particular snippet in spark-class 
[here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to 
allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a 
while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}

The problem is that the if the launcher Main fails, this code still still 
returns success and continues, even though the top level script is marked {{set 
-e}}.
This is because the launcher.Main is run within a subshell.


> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still returns 
> success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-03 Thread Mark Grover (JIRA)
Mark Grover created SPARK-13670:
---

 Summary: spark-class doesn't bubble up error from launcher command
 Key: SPARK-13670
 URL: https://issues.apache.org/jira/browse/SPARK-13670
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.0.0
Reporter: Mark Grover
Priority: Minor


There's a particular snippet in spark-class 
[here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to 
allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a 
while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}

The problem is that if the launcher Main fails, this code still returns success 
and continues, even though the top-level script is marked {{set -e}}.
This is because launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-02 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177315#comment-15177315
 ] 

Mark Grover commented on SPARK-12177:
-

One more thing as a potential con for Proposal 1:
There are places that have to use the kafka artifact; the 'examples' subproject is 
a good example of that. That subproject pulls the kafka artifact as a dependency and 
has an example of Kafka usage. However, it can't depend on the new 
implementation's artifact at the same time, because the two depend on different 
versions of Kafka. Therefore, unless I am missing something, the new 
implementation's example can't go there.

And that's fine, we can put it within the subproject itself instead of 
examples, but that won't necessarily work with tooling like run-example, etc.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-02 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176225#comment-15176225
 ] 

Mark Grover commented on SPARK-12177:
-

Let me clarify what I was saying:
There are 2 axes here - one is the new/old consumer API and the other is the 
support for Kafka v0.8 and v0.9. Both Kafka v0.8 and v0.9 provide the old API; 
only v0.9 provides the new API.

bq. The fact that the 0.9 consumer is still considered beta by the Kafka 
project and that things are going to change in 0.10 is an argument for keeping 
the existing implementation as it is, not an argument for throwing it away 
prematurely. 
I totally agree with you, Cody, that the old API implementation is bug-free, and 
I am definitely not proposing to throw that implementation away. My proposal is 
that both the old implementation and the new one will depend on the same 
version of Kafka - that being 0.9.

Based on what I now understand (and please correct me if I am wrong), I think 
what you are proposing is:
Proposal 1:
2 subprojects - one with the old implementation and one with the new. The 'old' 
subproject will be built against Kafka 0.8 and will have its own assembly, and 
the new subproject will use the new API, be built against Kafka 0.9, and have 
its own assembly.

And, what I am proposing is:
Proposal 2:
2 subprojects - one with the old implementation and one with the new. Both 
implementations will be built against Kafka 0.9, and they both end up in a single 
Kafka assembly artifact.

The pro of Proposal 1 is that folks who want to use the old implementation with 
Kafka 0.8 brokers can use it without upgrading their brokers. The con of Proposal 
1 is that it doesn't allow for re-use of any code between the old and new 
implementations. This can be a good thing if we don't want to share any code in 
the new implementation, but there is definitely a bunch of test code that, I 
think, would be good to share.

The pro of Proposal 2 is that test code, etc. can be shared, and there will be a 
single artifact that folks would need in order to run either the old direct stream 
implementation or the new one.
The con is, of course, that folks would have to upgrade their brokers to Kafka 
0.9 if they want to use Spark 2.0.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-02 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175999#comment-15175999
 ] 

Mark Grover commented on SPARK-12177:
-

I think the core of the question is a much broader Spark question - how many 
past versions to support?

To add some more color to the question at hand, Kafka has already decided that 
the [next version of Kafka will be 
0.10.0|https://github.com/apache/kafka/commit/b084c485e25bfe77154e805219b24714d59c396c]
 (instead of 0.9.1) and this next version will have yet another protocol 
change. So, where do we go from there? Supporting Kafka 0.8, 0.9 and 0.10.0 
in 2.x?

I still think Spark 2.0 is a good time to drop support for Kafka 0.8.x. Other 
projects are doing it, and in their minor releases at that (links to the Flume and 
Storm JIRAs are on the PR). Kafka is moving fast, with protocol changes in 
every new non-maintenance release, and it will become a huge hassle to keep up 
with all the past releases.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-01 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174856#comment-15174856
 ] 

Mark Grover commented on SPARK-12177:
-

Hi [~tdas] and [~rxin], can you help us with your opinion on these questions, 
so we can unblock this work:
1. Should we support both Kafka 0.8 and 0.9 or just 0.9? The pros and cons are 
listed [here|https://github.com/apache/spark/pull/11143#issuecomment-182154267] 
along with what other projects are doing.
2. Should we make a separate project for the implementation using the new Kafka 
consumer API, keeping the same class names (e.g. KafkaRDD), or create new 
classes in the same subproject, like Hadoop did (e.g. NewKafkaRDD)?

Thanks!

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-03-01 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174149#comment-15174149
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks Cody, I appreciate your thoughts. I have been keeping most of my 
commentary on the PRs but I will post some parts of it here for the sake of 
argument.

bq. No one (as far as I can tell) is actually doing integration testing of 
these existing PRs using the new kafka security features.
We need actual manual integration testing and benchmarking, ideally with 
production loads.
Agreed. The code in [my PR for the new security 
API|https://github.com/apache/spark/pull/10953/files] was integration tested by 
me against a distributed Kafka and ZK cluster, albeit manually. Adding automated 
integration tests is on my list of things to do; however, that 
PR is bit-rotting because it's blocked by [the 
PR|https://github.com/apache/spark/pull/11143] to upgrade Kafka to 0.9.0.

Your comment about caching consumers on executors is an excellent one. I 
haven't invested much time there because the way I was thinking of doing this 
was in several steps:
1. Upgrade Kafka to 0.9 (with or without 0.8 support, pending decision on 
https://github.com/apache/spark/pull/11143)
2. Add support for the new consumer API 
(https://github.com/apache/spark/pull/10953/files)
3. Add Kerberos/SASL support for authentication and SSL support for encryption 
over the wire. This work is blocked until delegation token support is added in 
Kafka (https://issues.apache.org/jira/browse/KAFKA-1696). I have been following 
that design discussion closely on the Kafka mailing list.

Thanks for sharing your preference. I understand where you are coming from, and 
I think that's reasonable. I had gotten feedback to the contrary on [this 
PR|https://github.com/apache/spark/pull/10953/files], so I changed my original 
implementation, which had separate subprojects, to all be in the same project. I 
don't mind changing it back, especially if we are going to keep 0.8 support.

As far as not hiding the fact that the consumer is new is concerned: I agree 
with you. KafkaUtils, for example, has exposed the TopicAndPartition and 
MessageAndMetadata classes, and I think we may have to expose their new API 
equivalents, TopicPartition and ConsumerRecord, in KafkaUtils.

In any case, I'd appreciate your help in moving this forward. I think the first 
step is to come to a resolution on https://github.com/apache/spark/pull/11143. 
Perhaps you, I, [~tdas] and anyone else who's interested could get on a call to 
sort this out? I will post the call details here so anyone would be able to 
join in. Other methods of communication work too; my goal is to move that 
conversation forward.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13252) Bump up Kafka to 0.9.0.0

2016-02-09 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-13252:

Summary: Bump up Kafka to 0.9.0.0  (was: [STREAMING] Bump up Kafka to 
0.9.0.0)

> Bump up Kafka to 0.9.0.0
> 
>
> Key: SPARK-13252
> URL: https://issues.apache.org/jira/browse/SPARK-13252
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>  Labels: kafka
>
> Apache Kafka release 0.9.0.0 came out some time ago and we should add support for 
> it. This JIRA is related to SPARK-12177, which is about adding support for the 
> new consumer API that is only available starting with v0.9.0.0.
> However, we should upgrade Kafka to 0.9.0.0 regardless of when (and before) 
> the support for the new consumer API gets added.
> We also use some non-public APIs from Kafka which have changed in the 0.9.0.0 
> release, so this change should also take care of updating those usages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13251) Bump up Kafka to 0.9.0.0

2016-02-09 Thread Mark Grover (JIRA)
Mark Grover created SPARK-13251:
---

 Summary: Bump up Kafka to 0.9.0.0
 Key: SPARK-13251
 URL: https://issues.apache.org/jira/browse/SPARK-13251
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 2.0.0
Reporter: Mark Grover


Kafka v0.9.0 has been released, and we should bump our Kafka support to that 
version.

This is related to SPARK-12177, which pertains to adding support for the new 
Kafka consumer API. However, this is a much, much smaller subset of that 
change, which simply bumps the version of Kafka to 0.9. We also use some 
non-public APIs of Kafka (AdminUtils, in particular) that have changed between 
the two releases, so this JIRA will take care of related fixes there as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13252) [STREAMING] Bump up Kafka to 0.9.0.0

2016-02-09 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-13252:

Summary: [STREAMING] Bump up Kafka to 0.9.0.0  (was: Bump up Kafka to 
0.9.0.0)

> [STREAMING] Bump up Kafka to 0.9.0.0
> 
>
> Key: SPARK-13252
> URL: https://issues.apache.org/jira/browse/SPARK-13252
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>  Labels: kafka
>
> Apache Kafka release 0.9.0.0 came out some time ago and we should add support for 
> it. This JIRA is related to SPARK-12177, which is about adding support for the 
> new consumer API that is only available starting with v0.9.0.0.
> However, we should upgrade Kafka to 0.9.0.0 regardless of when (and before) 
> the support for the new consumer API gets added.
> We also use some non-public APIs from Kafka which have changed in the 0.9.0.0 
> release, so this change should also take care of updating those usages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13252) Bump up Kafka to 0.9.0.0

2016-02-09 Thread Mark Grover (JIRA)
Mark Grover created SPARK-13252:
---

 Summary: Bump up Kafka to 0.9.0.0
 Key: SPARK-13252
 URL: https://issues.apache.org/jira/browse/SPARK-13252
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 2.0.0
Reporter: Mark Grover


Apache Kafka release 0.9.0.0 came out some time ago and we should add support for 
it. This JIRA is related to SPARK-12177, which is about adding support for the new 
consumer API that is only available starting with v0.9.0.0.

However, we should upgrade Kafka to 0.9.0.0 regardless of when (and before) the 
support for the new consumer API gets added.

We also use some non-public APIs from Kafka which have changed in the 0.9.0.0 
release, so this change should also take care of updating those usages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13252) Bump up Kafka to 0.9.0.0

2016-02-09 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15140465#comment-15140465
 ] 

Mark Grover commented on SPARK-13252:
-

Yeah, this is a very small subset of the patch for SPARK-12177. SPARK-12177 
deals with adding support for the new consumer API, for which we have to bump to 
Kafka 0.9.0.0 anyway. So, this JIRA takes care of that version bump, and the 
support for the new API can come later.

> Bump up Kafka to 0.9.0.0
> 
>
> Key: SPARK-13252
> URL: https://issues.apache.org/jira/browse/SPARK-13252
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>  Labels: kafka
>
> Apache Kafka release 0.9.0.0 came out some time ago and we should add support for 
> it. This JIRA is related to SPARK-12177, which is about adding support for the 
> new consumer API that is only available starting with v0.9.0.0.
> However, we should upgrade Kafka to 0.9.0.0 regardless of when (and before) 
> the support for the new consumer API gets added.
> We also use some non-public APIs from Kafka which have changed in the 0.9.0.0 
> release, so this change should also take care of updating those usages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-02-08 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15137771#comment-15137771
 ] 

Mark Grover commented on SPARK-12177:
-

Hi Rama,
This particular PR adds support for the new API. There is some small code for 
SSL support in it too, but I haven't invested much time in testing that, apart 
from the simple unit test that was written for it. Kerberos (SASL) will have to 
be done incrementally in another patch because it can't be done until Kafka 
supports delegation tokens (which are still not there yet: 
https://issues.apache.org/jira/browse/KAFKA-1696).

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-02-02 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128465#comment-15128465
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks Cody, will do!

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-27 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120113#comment-15120113
 ] 

Mark Grover commented on SPARK-12177:
-

I have issued a new PR https://github.com/apache/spark/pull/10953 for this 
which contains all of Nikita's changes as well. Please feel free to review and 
comment there.

The python implementation is not in that PR just yet; it's being worked on 
separately at 
https://github.com/markgrover/spark/tree/kafka09-integration-python (for now, 
anyway).

The new package is called 'newapi' instead of 'v09'.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue

2016-01-24 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114383#comment-15114383
 ] 

Mark Grover commented on SPARK-11796:
-

[~blbradley] I put instructions 
[here|https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningDockerintegrationtests]
 on how to make tests pass.

> Docker JDBC integration tests fail in Maven build due to dependency issue
> -
>
> Key: SPARK-11796
> URL: https://issues.apache.org/jira/browse/SPARK-11796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Mark Grover
> Fix For: 1.6.0
>
>
> Our new Docker integration tests for JDBC dialects are failing in the Maven 
> builds. For now, I've disabled this for Maven by adding the 
> {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins 
> builds, but we should fix this soon. The test failures seem to be related to 
> dependency or classpath issues:
> {code}
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError: 
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> {code}
> To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue

2016-01-24 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114590#comment-15114590
 ] 

Mark Grover commented on SPARK-11796:
-

Awesome, thanks for sharing.

> Docker JDBC integration tests fail in Maven build due to dependency issue
> -
>
> Key: SPARK-11796
> URL: https://issues.apache.org/jira/browse/SPARK-11796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Mark Grover
> Fix For: 1.6.0
>
>
> Our new Docker integration tests for JDBC dialects are failing in the Maven 
> builds. For now, I've disabled this for Maven by adding the 
> {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins 
> builds, but we should fix this soon. The test failures seem to be related to 
> dependency or classpath issues:
> {code}
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError: 
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> {code}
> To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-22 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112785#comment-15112785
 ] 

Mark Grover commented on SPARK-12177:
-

Hi Mario,
Thanks for checking. I was still hoping to do everything in one assembly; so 
far it's looking good.

Yeah, I'll take care of renaming the packages/python files to something other 
than v09.

The python part is coming along OK. I will keep you posted on the JIRA.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12960) Some examples are missing support for python2

2016-01-21 Thread Mark Grover (JIRA)
Mark Grover created SPARK-12960:
---

 Summary: Some examples are missing support for python2
 Key: SPARK-12960
 URL: https://issues.apache.org/jira/browse/SPARK-12960
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Mark Grover
Priority: Minor


Without importing print_function, lines later on like 
{code}
print("Usage: direct_kafka_wordcount.py  ", file=sys.stderr)
{code}
fail when using Python 2.*. Adding the import fixes that problem and doesn't break 
anything on Python 3 either.
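
A minimal sketch of the kind of fix being described (the rest of the example is 
omitted; only the import and the affected line are shown):
{code}
# With this __future__ import at the top of the example, the print(..., file=...)
# form below is valid on Python 2 as well as Python 3.
from __future__ import print_function
import sys

print("Usage: direct_kafka_wordcount.py  ", file=sys.stderr)
{code}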



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-20 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110082#comment-15110082
 ] 

Mark Grover commented on SPARK-12177:
-

Hi Mario,
I may have misunderstood some parts of your previous comment and if so, I 
apologize in advance.

bq. i think that is not required when client uses a v0.9 jar though consuming 
only the older high level/low level API and talking to a v0.8 kafka cluster.
Based on what I understand, that's not the case. If one uses the Kafka v9 jar, 
even when using the old consumer API, it can only work with a Kafka v9 broker. So, 
if we have to support both Kafka v08 and Kafka v09 brokers with Spark (which I 
believe we do), we have to have both the Kafka v08 and Kafka v09 jars in our 
assembly. As far as I understand, simply having the Kafka v09 jar will not 
help.

bq. 1 thought around not introducing the version in the package name or class 
name (I see that Flink does it in the class name) was to avoid forcing us to 
create v0.10/v0.11 packages (and customers to change code and recompile), even 
if those releases of kafka don’t have client-api’ or otherwise such changes 
that warrant us to make a new version
I totally agree with you on this note. I was actually thinking of renaming all 
the v09 packages to something different (like 'new'? But maybe there's a 
better term) because, as you very aptly pointed out, it would be very confusing 
as we support later Kafka versions.

bq. That’s why 1 earlier idea i mentioned in this JIRA was 'The public API 
signatures (of KafkaUtils in v0.9 subproject) are different and do not clash 
(with KafkaUtils in original kafka subproject) and hence can be added to the 
existing (original kafka subproject) KafkaUtils class.’ This also addresses the 
issues u mention above. Cody mentioned that we need to get others on the same 
page for this idea, so i guess we really need the committers to chime in here. 
Of course i forgot to answer’s Nikita’s followup question - 'do you mean that 
we would change the original KafkaUtils by adding new functions for new 
DirectIS/KafkaRDD but using them from separate module with kafka09 classes’ ? 
To be clear, these new public methods added to original kafka subproject’s 
‘KafkaUtils' ,will make use of 
DirectKafkaInputDStream,KafkaRDD,KafkaRDDPartition,OffsetRange classes that are 
in a new v09 package (internal of course). In short we don’t have a new 
subproject. (I skipped class KafkaCluster class from the list, because i am 
thinking it makes more sense to call this class something like 'KafkaClient' 
instead going forward)

At the core of it, I am not 100% sure if we can hide/abstract the fact away 
from our users that we have completely changed the consumer API from underneath 
us. I can think more about it but would appreciate more thoughts/insights along 
this direction, especially if you feel strongly about this.

Thanks again, Mario!

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-19 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107267#comment-15107267
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks Mario! 
bq. We should also have a python/pyspark/streaming/kafka-v09.py as well that 
matches to our external/kafka-v09
I agree, I will look into this.
bq. Why do you have the Broker.scala class? Unless i am missing something, it 
should be knocked off
Yeah, I noticed that too and I agree. This should be pretty simple to take out. 
I also 
[noticed|https://issues.apache.org/jira/browse/SPARK-12177?focusedCommentId=15089750=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15089750]
 that the v09 example was picking up some Kafka v08 jars, so I am working on 
fixing that too.
bq. I think the package should be 'org.apache.spark.streaming.kafka' only in 
external/kafka-v09 and not 'org.apache.spark.streaming.kafka.v09'. This is 
because we produce a jar with a diff name (user picks which one and even if 
he/she mismatches, it errors correctly since the KafkaUtils method signatures 
are different)
I totally understand what you mean. However, kafka has its [own assembly in 
Spark|https://github.com/apache/spark/tree/master/external/kafka-assembly] and, 
the way the code is structured right now, both the new API and the old API would go 
in the same assembly, so it's important to have a different package name. Also, 
for our end users transitioning from the old to the new API, I foresee them 
having 2 versions of their spark-kafka app: one that works with the old API and 
one with the new API. And I think it would be an easier transition if they 
could include both Kafka API versions in the Spark classpath and pick and 
choose which app to run, without mucking with Maven dependencies and 
re-compiling when they want to switch. Let me know if you disagree.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-19 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107669#comment-15107669
 ] 

Mark Grover commented on SPARK-12177:
-

Posting an update. I took out Broker.scala; the example picking up the wrong 
version of Kafka was already taken care of by Nikita. I am looking into the 
python stuff now.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12426) Docker JDBC integration tests are failing again

2016-01-12 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094445#comment-15094445
 ] 

Mark Grover commented on SPARK-12426:
-

Thanks Sean, if you could add this, that'd be great.

h2. Running docker integration tests
In order to run the [docker integration 
tests|https://github.com/apache/spark/tree/master/docker-integration-tests], 
you have to install the docker engine on your box. The instructions for 
installation can be found at https://docs.docker.com/engine/installation/. Once 
installed, the docker service needs to be started, if it is not already running. On 
Linux, this can be done with {{sudo service docker start}}.
These integration tests run as part of a regular Spark unit test run; 
therefore, the docker engine needs to be installed and running if you 
want all Spark tests to pass.
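
For example, on a Linux box the steps above boil down to something like the following 
sketch (the Maven command mirrors the reproduction command quoted under SPARK-11796; 
adjust for your own environment):
{code}
# Start the docker daemon if it is not already running, then sanity-check that
# it is reachable before running the tests.
sudo service docker start
docker info

# Build the docker-integration-tests module, which runs the Docker JDBC
# integration tests.
build/mvn -pl docker-integration-tests package
{code}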

> Docker JDBC integration tests are failing again
> ---
>
> Key: SPARK-12426
> URL: https://issues.apache.org/jira/browse/SPARK-12426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Mark Grover
>
> The Docker JDBC integration tests were fixed in SPARK-11796, but they seem to 
> be failing again on my machine (Ubuntu Precise). This was the same box that I 
> tested my previous commit on. Also, I am not confident this failure has much 
> to do with Spark, since a well-known commit where the tests were passing 
> fails now, in the same environment.
> [~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing 
> on his Ubuntu 15 box as well.
> Here's the error, fyi:
> {code}
> 15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> *** RUN ABORTED ***
>   com.spotify.docker.client.DockerException: 
> java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
>   ...
>   Cause: java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   ...
>   Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> 

[jira] [Resolved] (SPARK-12426) Docker JDBC integration tests are failing again

2016-01-12 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-12426.
-
Resolution: Fixed

Thanks Sean! Marking this as resolved since all the necessary information is 
now at 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningDockerintegrationtests

> Docker JDBC integration tests are failing again
> ---
>
> Key: SPARK-12426
> URL: https://issues.apache.org/jira/browse/SPARK-12426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Mark Grover
>
> The Docker JDBC integration tests were fixed in SPARK-11796, but they seem to 
> be failing again on my machine (Ubuntu Precise). This was the same box that I 
> tested my previous commit on. Also, I am not confident this failure has much 
> to do with Spark, since a well-known commit where the tests were passing 
> fails now, in the same environment.
> [~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing 
> on his Ubuntu 15 box as well.
> Here's the error, fyi:
> {code}
> 15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> *** RUN ABORTED ***
>   com.spotify.docker.client.DockerException: 
> java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
>   ...
>   Cause: java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   ...
>   Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37)
>   at 
> 

[jira] [Comment Edited] (SPARK-12426) Docker JDBC integration tests are failing again

2016-01-11 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083482#comment-15083482
 ] 

Mark Grover edited comment on SPARK-12426 at 1/11/16 8:56 PM:
--

Sean and Josh,
I got to the bottom of this. This is because docker does a poor job of bubbling up 
the error that the docker engine is not running on the machine running the unit tests. 
The instructions for installing the docker engine on various OSs are at 
https://docs.docker.com/engine/installation/
Once installed, the docker service needs to be started if it's not already 
running. On Linux, this is simply {{sudo service docker start}}, and then our 
docker integration tests pass.

Sorry that I didn't get a chance to look into it around 1.6 RC time; holidays 
got in the way.

I am thinking of adding this info on [this wiki 
page|https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ].
 Please let me know if you think there is a better place; that's the best I 
could find. I don't seem to have access to edit that page, so can one of you 
please give me access?

Also, I was trying to search the code for any puppet recipes we maintain for 
setting up build slaves. If our Jenkins infra were wiped out, how would we 
make sure docker-engine is installed and running? How do we keep track 
of build dependencies? Thanks in advance!


was (Author: mgrover):
Sean and Josh,
I got to the bottom of this. This is because docker sucks when bubbling up the 
error that docker engine is not running on the machine running the unit tests. 
The instructions for installing docker engine on various OSs are at 
https://docs.docker.com/engine/installation/
Once installed the docker service needs to be started, if it's not already 
running. On Linux, this is simply {{sudo service docker start}} and then our 
docker integration tests pass.

Sorry that I didn't get a chance to look into it around 1.6 rc time, holidays 
got in the way.

I am thinking of adding this info on [this wiki 
page|https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ].
 Please let me know if you think there is a better place, that's the best I 
could find. I don't seem to have access to edit that page, can one of you 
please give me access?

Also, I was trying to search in the code for any puppet recipes we maintain for 
the setting up build slaves. In order, if our Jenkins infra were wiped out, how 
do we make sure docker-engine is installed and running? How do we maintain keep 
track of build dependencies? Thanks in advance!

> Docker JDBC integration tests are failing again
> ---
>
> Key: SPARK-12426
> URL: https://issues.apache.org/jira/browse/SPARK-12426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Mark Grover
>
> The Docker JDBC integration tests were fixed in SPARK-11796, but they seem to 
> be failing again on my machine (Ubuntu Precise). This was the same box that I 
> tested my previous commit on. Also, I am not confident this failure has much 
> to do with Spark, since a well-known commit where the tests were passing 
> fails now, in the same environment.
> [~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing 
> on his Ubuntu 15 box as well.
> Here's the error, fyi:
> {code}
> 15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> *** RUN ABORTED ***
>   com.spotify.docker.client.DockerException: 
> java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at 

[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-11 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092455#comment-15092455
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks, Nikita. I will be issuing PRs against your kafka09-integration branch 
so it can become the single source of truth until this change gets merged into 
Spark. Also, I believe the Spark community prefers discussion on PRs once they are 
filed, so you'll hear more from me there :-)

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So, I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I didn't remove the 
> old classes, for backward compatibility: users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-09 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090717#comment-15090717
 ] 

Mark Grover commented on SPARK-12177:
-

#1 Sounds great, thanks!
#2 Yeah, that's the only way I can think of for now, but let me ponder a bit 
more. 

Thanks! Looking forward to it.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So, I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I didn't remove the 
> old classes, for backward compatibility: users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-08 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089750#comment-15089750
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks for working on this, Nikita. I'd like to help out. Here are a few pieces 
of feedback:
1. I tried rebasing what you have in your current branch onto upstream master 
(yours still seems to be based off of pre-1.6.0 code), but mostly because of 
some commits related to import ordering that landed on Spark trunk relatively 
recently, I found it easier to migrate/copy the code for kafka-v09 and make the 
minor changes to the examples and root poms instead of doing a 'git rebase'.
2. I also noticed that the v09DirectKafkaWordCount example is pulling at least 
the ConsumerConfig class from Kafka 0.8.2.1. This is because the examples pom 
contains both the Kafka 0.8.2.1 and 0.9.0 dependencies and somewhat arbitrarily 
puts 0.8.2.1 ahead. Since the ConsumerConfig class is available in both jars 
under the same namespace, we end up pulling the 0.8.2.1 one. We should fix that 
(see the sketch below for what the 0.9 consumer API looks like).
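
For context, here is roughly what the 0.9 "new consumer" API that the v09 classes 
wrap looks like when used standalone. This is only an illustrative sketch, not the 
Spark integration itself; the broker address, group id, and topic are placeholders. 
It uses {{org.apache.kafka.clients.consumer.ConsumerConfig}}, the same class name 
that, as noted in #2, can also resolve from the 0.8.2.1 jar when that dependency is 
listed first:

{code}
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder broker
props.put(ConsumerConfig.GROUP_ID_CONFIG, "kafka-v09-example")        // placeholder group id
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")

// The new consumer subscribes and polls for batches of records, instead of
// using the old high-level consumer's stream iterators.
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("my-topic"))             // placeholder topic
val records = consumer.poll(1000)
records.asScala.foreach(r => println(s"${r.topic()}/${r.partition()} ${r.offset()}: ${r.value()}"))
consumer.close()
{code}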


In general, I may have a few more changes/fixes that I'd like to contribute to 
your pull request. Would it be possible for us to collaborate? What's the best 
way to do so? Reopening the pull request and me adding to it? Or, just me 
issuing pull requests to [your 
branch|https://github.com/nikit-os/spark/tree/kafka-09-consumer-api]? Thanks!

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So, I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I didn't remove the 
> old classes, for backward compatibility: users will not need 
> to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12426) Docker JDBC integration tests are failing again

2016-01-08 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090179#comment-15090179
 ] 

Mark Grover commented on SPARK-12426:
-

[~sowen]/[~joshrosen] Just a reminder about this, I'd appreciate your response. 
Thanks!

> Docker JDBC integration tests are failing again
> ---
>
> Key: SPARK-12426
> URL: https://issues.apache.org/jira/browse/SPARK-12426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Mark Grover
>
> The Docker JDBC integration tests were fixed in SPARK-11796 but they seem to 
> be failing again on my machine (Ubuntu Precise). This was the same box that I 
> tested my previous commit on. Also, I am not confident this failure has much 
> to do with Spark, since a well known commit where the tests were passing, 
> fails now, in the same environment.
> [~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing 
> on his Ubuntu 15 box as well.
> Here's the error, fyi:
> {code}
> 15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> *** RUN ABORTED ***
>   com.spotify.docker.client.DockerException: 
> java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
>   ...
>   Cause: java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   ...
>   Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
>   at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:487)
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: 

[jira] [Commented] (SPARK-12426) Docker JDBC integration tests are failing again

2016-01-05 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083482#comment-15083482
 ] 

Mark Grover commented on SPARK-12426:
-

Sean and Josh,
I got to the bottom of this. The failure happens because docker does a poor job of 
bubbling up the error that the docker engine is not running on the machine running 
the unit tests. The instructions for installing the docker engine on various OSs are at 
https://docs.docker.com/engine/installation/
Once installed, the docker service needs to be started if it's not already 
running. On Linux, this is simply {{sudo service docker start}}, and then our 
docker integration tests pass.

Sorry that I didn't get a chance to look into it around 1.6 RC time; the holidays 
got in the way.

I am thinking of adding this info to [this wiki 
page|https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ].
 Please let me know if you think there is a better place; that's the best one I 
could find. I don't seem to have access to edit that page, so could one of you 
please give me access?

Also, I was searching the code for any Puppet recipes we maintain for 
setting up the build slaves. If our Jenkins infra were wiped out, how would we 
make sure docker-engine is installed and running? How do we keep track 
of build dependencies? Thanks in advance!

> Docker JDBC integration tests are failing again
> ---
>
> Key: SPARK-12426
> URL: https://issues.apache.org/jira/browse/SPARK-12426
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Mark Grover
>
> The Docker JDBC integration tests were fixed in SPARK-11796 but they seem to 
> be failing again on my machine (Ubuntu Precise). This was the same box that I 
> tested my previous commit on. Also, I am not confident this failure has much 
> to do with Spark, since a well known commit where the tests were passing, 
> fails now, in the same environment.
> [~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing 
> on his Ubuntu 15 box as well.
> Here's the error, fyi:
> {code}
> 15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> *** RUN ABORTED ***
>   com.spotify.docker.client.DockerException: 
> java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
>   ...
>   Cause: java.util.concurrent.ExecutionException: 
> com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
> java.io.IOException: No such file or directory
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
>   at 
> com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
>   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
>   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
>   at 
> org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
>   ...
>   Cause: 

[jira] [Created] (SPARK-12426) Docker JDBC integration tests are failing again

2015-12-18 Thread Mark Grover (JIRA)
Mark Grover created SPARK-12426:
---

 Summary: Docker JDBC integration tests are failing again
 Key: SPARK-12426
 URL: https://issues.apache.org/jira/browse/SPARK-12426
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 1.6.0
Reporter: Mark Grover


The Docker JDBC integration tests were fixed in SPARK-11796 but they seem to be 
failing again on my machine (Ubuntu Precise). This was the same box that I 
tested my previous commit on. Also, I am not confident this failure has much to 
do with Spark, since a well-known commit where the tests were previously passing 
now fails in the same environment.

[~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing 
on his Ubuntu 15 box as well.

Here's the error, fyi:
{code}
15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext
15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
*** RUN ABORTED ***
  com.spotify.docker.client.DockerException: 
java.util.concurrent.ExecutionException: 
com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
java.io.IOException: No such file or directory
  at 
com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141)
  at 
com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082)
  at 
com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
  at 
org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
  at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
  at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
  ...
  Cause: java.util.concurrent.ExecutionException: 
com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
java.io.IOException: No such file or directory
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
  at 
com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080)
  at 
com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76)
  at 
org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
  at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58)
  ...
  Cause: com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
java.io.IOException: No such file or directory
  at 
org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:481)
  at 
org.glassfish.jersey.apache.connector.ApacheConnector$1.run(ApacheConnector.java:491)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at 
jersey.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
  at 
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50)
  at 
jersey.repackaged.com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37)
  at 
org.glassfish.jersey.apache.connector.ApacheConnector.apply(ApacheConnector.java:487)
15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
  at org.glassfish.jersey.client.ClientRuntime$2.run(ClientRuntime.java:177)
  ...
  Cause: java.io.IOException: No such file or directory
  at jnr.unixsocket.UnixSocketChannel.doConnect(UnixSocketChannel.java:94)
  at jnr.unixsocket.UnixSocketChannel.connect(UnixSocketChannel.java:102)
  at 
com.spotify.docker.client.ApacheUnixSocket.connect(ApacheUnixSocket.java:73)
  at 

[jira] [Commented] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue

2015-12-10 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051937#comment-15051937
 ] 

Mark Grover commented on SPARK-11796:
-

Excellent, thank you!

> Docker JDBC integration tests fail in Maven build due to dependency issue
> -
>
> Key: SPARK-11796
> URL: https://issues.apache.org/jira/browse/SPARK-11796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Mark Grover
> Fix For: 1.6.0
>
>
> Our new Docker integration tests for JDBC dialects are failing in the Maven 
> builds. For now, I've disabled this for Maven by adding the 
> {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins 
> builds, but we should fix this soon. The test failures seem to be related to 
> dependency or classpath issues:
> {code}
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError: 
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> {code}
> To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11796) Docker JDBC integration tests fail in Maven build due to dependency issue

2015-12-10 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051931#comment-15051931
 ] 

Mark Grover commented on SPARK-11796:
-

Hey [~joshrosen], just checking if you have removed the 
{{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag from the Jenkins 
builds. I am happy to do that too but I don't have the privs.

> Docker JDBC integration tests fail in Maven build due to dependency issue
> -
>
> Key: SPARK-11796
> URL: https://issues.apache.org/jira/browse/SPARK-11796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Mark Grover
> Fix For: 1.6.0
>
>
> Our new Docker integration tests for JDBC dialects are failing in the Maven 
> builds. For now, I've disabled this for Maven by adding the 
> {{-Dtest.exclude.tags=org.apache.spark.tags.DockerTest}} flag to our Jenkins 
> builds, but we should fix this soon. The test failures seem to be related to 
> dependency or classpath issues:
> {code}
> *** RUN ABORTED ***
>   java.lang.NoSuchMethodError: 
> org.apache.http.impl.client.HttpClientBuilder.setConnectionManagerShared(Z)Lorg/apache/http/impl/client/HttpClientBuilder;
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnector.(ApacheConnector.java:240)
>   at 
> org.glassfish.jersey.apache.connector.ApacheConnectorProvider.getConnector(ApacheConnectorProvider.java:115)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.initRuntime(ClientConfig.java:418)
>   at 
> org.glassfish.jersey.client.ClientConfig$State.access$000(ClientConfig.java:88)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:120)
>   at 
> org.glassfish.jersey.client.ClientConfig$State$3.get(ClientConfig.java:117)
>   at 
> org.glassfish.jersey.internal.util.collection.Values$LazyValueImpl.get(Values.java:340)
>   at 
> org.glassfish.jersey.client.ClientConfig.getRuntime(ClientConfig.java:726)
>   at 
> org.glassfish.jersey.client.ClientRequest.getConfiguration(ClientRequest.java:285)
>   at 
> org.glassfish.jersey.client.JerseyInvocation.validateHttpMethodAndEntity(JerseyInvocation.java:126)
>   ...
> {code}
> To reproduce locally: {{build/mvn -pl docker-integration-tests package}}. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12257) Non partitioned insert into a partitioned Hive table doesn't fail

2015-12-09 Thread Mark Grover (JIRA)
Mark Grover created SPARK-12257:
---

 Summary: Non partitioned insert into a partitioned Hive table 
doesn't fail
 Key: SPARK-12257
 URL: https://issues.apache.org/jira/browse/SPARK-12257
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Mark Grover
Priority: Minor


I am using Spark 1.5.1 but I anticipate this to be a problem with master as 
well (will check later).

I have a dataframe and a partitioned Hive table that I want to insert the 
contents of the dataframe into.

Let's say mytable is a non-partitioned Hive table and mytable_partitioned is a 
partitioned Hive table. In Hive, if you try to insert from the non-partitioned 
mytable table into mytable_partitioned without specifying the partition, the 
query fails, as expected:
{code}
INSERT INTO mytable_partitioned SELECT * FROM mytable;
{code}
{quote}
Error: Error while compiling statement: FAILED: SemanticException 1:12 Need to 
specify partition columns because the destination table is partitioned. Error 
encountered near token 'mytable_partitioned' (state=42000,code=4)
{quote}

However, if I do the same in Spark SQL:
{code}
myDf.registerTempTable("my_df_temp_table")
sqlContext.sql("INSERT INTO mytable_partitioned SELECT * FROM my_df_temp_table")
{code}
This appears to succeed, but no data is actually inserted. It should instead fail 
with an error stating that data is being inserted into a partitioned table without 
specifying the partition.

Of course, if the name of the partition is explicitly specified, both Hive and 
Spark SQL do the right thing and function correctly.
In Hive:
{code}
INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * FROM mytable;
{code}
In Spark SQL:
{code}
myDf.registerTempTable("my_df_temp_table")
sqlContext.sql("INSERT INTO mytable_partitioned PARTITION (y='abc') SELECT * 
FROM my_df_temp_table")
{code}

And, here are the definitions of my tables, as reference:
{code}
CREATE TABLE mytable(x INT);
CREATE TABLE mytable_partitioned (x INT) PARTITIONED BY (y INT);
{code}

You will also need to put some dummy data into mytable so you can verify that the 
insertion into mytable_partitioned is actually not happening:
{code}
#!/bin/bash
rm -rf data.txt;
for i in {0..9}; do
echo $i >> data.txt
done
sudo -u hdfs hadoop fs -put data.txt /user/hive/warehouse/mytable
{code}
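
With that setup in place, one quick way to confirm the silent no-op is to run the 
unpartitioned insert and then count the rows. This is a minimal sketch against the 
1.5-era sqlContext API, reusing the myDf dataframe and the tables defined above:

{code}
// Attempt the insert without a partition spec.
myDf.registerTempTable("my_df_temp_table")
sqlContext.sql("INSERT INTO mytable_partitioned SELECT * FROM my_df_temp_table")

// Expected: an error about missing partition columns.
// Actual (this bug): the statement "succeeds" but the count below stays 0.
sqlContext.sql("SELECT COUNT(*) FROM mytable_partitioned").show()
{code}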



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9896) Parquet Schema Assertion

2015-11-17 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009920#comment-15009920
 ] 

Mark Grover commented on SPARK-9896:


Posting here in case someone runs into this as well. Based on my experience, 
this happens when the client configuration for accessing HDFS/S3 is incorrect 
(which was the case for me) or possibly if HDFS is inaccessible.

In my case, I was accessing a remote non-secure Hadoop cluster from a node 
which had a secure HDFS configuration. So, while the error message should be 
something more relevant and unrelated to Parquet, this inaccessibility shows up 
as a Parquet metadata error. I fixed it by doing a kdestroy on the 
gateway/client node and was then able to run it fine. 

If you are hitting this, try accessing S3 or HDFS from that node, using the 
client configuration (say, using 'hadoop fs -ls' on a path you know exists). If 
that doesn't succeed, that's your root cause.
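
The same kind of sanity check can be done from spark-shell with the Hadoop 
FileSystem API, so that it uses exactly the Hadoop configuration the Spark app 
sees (a minimal sketch; the path is a placeholder):

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Uses sc.hadoopConfiguration, i.e. the same client-side core-site.xml /
// hdfs-site.xml the Spark job would use, so a secure-vs-non-secure mismatch
// surfaces here as an explicit error rather than as a Parquet metadata error.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.listStatus(new Path("/user/hive/warehouse"))   // placeholder path
  .foreach(status => println(status.getPath))
{code}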

I will file a separate JIRA for improving the error message.

> Parquet Schema Assertion
> 
>
> Key: SPARK-9896
> URL: https://issues.apache.org/jira/browse/SPARK-9896
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Blocker
>
> Need to investigate more, but I'm seeing this all of a sudden.
> {code}
> java.lang.AssertionError: assertion failed: No predefined schema found, and 
> no Parquet data files or summary files found under 
> s3n:/.../databricks-performance-datasets/tpcds/sf1500-parquet/useDecimal=true/parquet/item
> {code}
> Possibly related to [SPARK-9407].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11794) Improve error message when HDFS/S3 access is misconfigured when using Parquet

2015-11-17 Thread Mark Grover (JIRA)
Mark Grover created SPARK-11794:
---

 Summary: Improve error message when HDFS/S3 access is 
misconfigured when using Parquet
 Key: SPARK-11794
 URL: https://issues.apache.org/jira/browse/SPARK-11794
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Mark Grover


I had a scenario where I was accessing HDFS from a node set up for secure HDFS 
access. However, my actual HDFS cluster was set up in non-secure mode. This 
configuration mismatch didn't allow me to access HDFS, and when running 
a Spark app on some Parquet data on the inaccessible HDFS, one gets a 
message like this:

{code}
java.lang.AssertionError: assertion failed: No predefined schema found, and no 
Parquet data files or summary files found under 
{code}

This is very misleading, and I am not the only one this has happened to. See 
SPARK-9565 for another similar situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11249) [Launcher] Launcher library fails if app resource is not added

2015-11-03 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987482#comment-14987482
 ] 

Mark Grover commented on SPARK-11249:
-

Thanks for filing this, Hari. I am a little torn about this and would 
appreciate your and other folks' (cc [~vanzin]) input. I poked around at 
SparkSubmit* code and [the relevant 
docs|https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management].

[The 
code|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L239]
 currently throws an exception if a primary resource is not defined, the primary 
resource being, say, a jar when submitting a Scala/Java Spark app using 
spark-submit. Also, it seems that the original intended purpose of 
{{--packages}} was related to fetching the dependencies from Maven, not 
necessarily the actual app jar. In other words, it's meant to simplify usage of 
{{--jars}} when using spark-submit.

Technically, users can download their application jar using {{--packages}}, 
similar to how they can supply their application jar using {{--jars}}, but even 
in the latter case they'd still have to specify the app jar separately.

So, I think we have 2 options:
1. Make specifying the primary resource optional. In such a case, this would 
apply only if {{--jars}} or {{--packages}} is being used in the same command 
line and when the primary resource is a jar.
2. Leave things the way they are, and perhaps clarify the documentation to say 
that the intended purpose of {{--packages}} is dependency management, not 
necessarily fetching the app jar. And, if you are using it to fetch the app jar, 
then you'd still need to specify an existing dummy jar as the primary resource 
(see the sketch after these options).
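
To make option 2 concrete, here is roughly what it looks like through the launcher 
library today. This is only a sketch; the master, package coordinates, dummy jar 
path, and main class are all placeholders:

{code}
import org.apache.spark.launcher.SparkLauncher

// Option 2 as things stand: spark.jars.packages (the conf behind --packages)
// fetches the application jar and its dependencies from Maven, but an existing
// dummy jar still has to be given as the primary resource.
val process = new SparkLauncher()
  .setMaster("yarn-client")                                       // placeholder master
  .setConf("spark.jars.packages", "com.example:my-app_2.10:1.0")  // placeholder coordinates
  .setAppResource("/tmp/dummy.jar")                               // existing dummy jar
  .setMainClass("com.example.MyApp")                              // placeholder main class
  .addAppArgs("arg1", "arg2")
  .launch()
process.waitFor()
{code}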

What do you think? Any other option that I missed? Thanks in advance for your 
help!

> [Launcher] Launcher library fails if app resource is not added
> --
>
> Key: SPARK-11249
> URL: https://issues.apache.org/jira/browse/SPARK-11249
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.1
>Reporter: Hari Shreedharan
>
> If the resource is downloaded via --packages, the app resource is not 
> required. But the launcher library gets confused, assumes the first arg 
> is the app resource, and then not all args get passed to the app itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10965) Optimize filesEqualRecursive

2015-10-16 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover resolved SPARK-10965.
-
Resolution: Won't Fix

Thanks Sean. Marking this as Won't Fix since I don't think this is super 
important.

> Optimize filesEqualRecursive
> 
>
> Key: SPARK-10965
> URL: https://issues.apache.org/jira/browse/SPARK-10965
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Mark Grover
>Priority: Minor
>
> When we try to download dependencies, if there is a file at the destination 
> already, we check whether the files are equal (recursively, if they are 
> directories). For files, we compare their bytes. Now, these dependencies can 
> be jars, which can be really large, and byte-by-byte comparisons can be super slow.
> I think it'd be better to do a checksum.
> Here's the code in question:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L500



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10965) Optimize filesEqualRecursive

2015-10-07 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947234#comment-14947234
 ] 

Mark Grover commented on SPARK-10965:
-

Thanks Sean.

I haven't really decided on the approach yet but will keep you posted.

> Optimize filesEqualRecursive
> 
>
> Key: SPARK-10965
> URL: https://issues.apache.org/jira/browse/SPARK-10965
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Mark Grover
>Priority: Minor
>
> When we try to download dependencies, if there is a file at the destination 
> already, we check whether the files are equal (recursively, if they are 
> directories). For files, we compare their bytes. Now, these dependencies can 
> be jars, which can be really large, and byte-by-byte comparisons can be super slow.
> I think it'd be better to do a checksum.
> Here's the code in question:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L500



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10965) Optimize filesEqualRecursive

2015-10-06 Thread Mark Grover (JIRA)
Mark Grover created SPARK-10965:
---

 Summary: Optimize filesEqualRecursive
 Key: SPARK-10965
 URL: https://issues.apache.org/jira/browse/SPARK-10965
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
Reporter: Mark Grover
Priority: Minor


When we try to download dependencies, if there is a file at the destination 
already, we check whether the files are equal (recursively, if they are 
directories). For files, we compare their bytes. Now, these dependencies can be 
jars, which can be really large, and byte-by-byte comparisons can be super slow.

I think it'd be better to do a checksum.
Here's the code in question:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L500
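
For illustration, a checksum-based comparison could look roughly like the sketch 
below. This is not a proposed patch to Utils.filesEqualRecursive, just the shape of 
the idea; the digest algorithm and helper names are arbitrary:

{code}
import java.io.{File, FileInputStream}
import java.security.{DigestInputStream, MessageDigest}

// Hash a file with SHA-256, streaming it so large jars never need to fit in memory.
def fileDigest(f: File): Array[Byte] = {
  val md = MessageDigest.getInstance("SHA-256")
  val in = new DigestInputStream(new FileInputStream(f), md)
  try {
    val buf = new Array[Byte](8192)
    while (in.read(buf) != -1) {}  // reading through the stream updates the digest
  } finally {
    in.close()
  }
  md.digest()
}

// Cheap length check first, then compare two small digests instead of every byte.
def sameChecksum(a: File, b: File): Boolean =
  a.length() == b.length() && MessageDigest.isEqual(fileDigest(a), fileDigest(b))
{code}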



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10965) Optimize filesEqualRecursive

2015-10-06 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946070#comment-14946070
 ] 

Mark Grover commented on SPARK-10965:
-

I would love to work on this. Can someone please assign this to me? Also, how 
can I assign JIRAs that I want to work on to myself? Do I have to bother someone 
every time? Thanks!

> Optimize filesEqualRecursive
> 
>
> Key: SPARK-10965
> URL: https://issues.apache.org/jira/browse/SPARK-10965
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Mark Grover
>Priority: Minor
>
> When we try to download dependencies, if there is a file at the destination 
> already, we check whether the files are equal (recursively, if they are 
> directories). For files, we compare their bytes. Now, these dependencies can 
> be jars, which can be really large, and byte-by-byte comparisons can be super slow.
> I think it'd be better to do a checksum.
> Here's the code in question:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L500



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.

2015-09-10 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739502#comment-14739502
 ] 

Mark Grover commented on SPARK-9790:


I was waiting on SPARK-8167 to get committed. That was just committed yesterday, so 
I will merge it in and re-test. It should be ready then. Thanks for your 
interest.

> [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
> --
>
> Key: SPARK-9790
> URL: https://issues.apache.org/jira/browse/SPARK-9790
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.4.1
>Reporter: Mark Grover
>Priority: Minor
> Attachments: error_showing_in_UI.png
>
>
> When an executor is killed by YARN because it exceeds the memory overhead, 
> the only thing Spark knows is that the executor is lost. The user has to go 
> search through the NM logs to figure out that it's been killed by YARN.
> It would be much nicer if the Spark driver could be notified why the executor 
> was killed. Ideally it could both log an explanatory message and update the 
> UI (and the eventLog) so that it was clear why the executor was lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9790) [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.

2015-08-10 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-9790:
---
Attachment: error_showing_in_UI.png

Attaching an image of what the error message in the UI would now look like.

 [YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
 --

 Key: SPARK-9790
 URL: https://issues.apache.org/jira/browse/SPARK-9790
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.4.1
Reporter: Mark Grover
 Attachments: error_showing_in_UI.png


 When an executor is killed by YARN because it exceeds the memory overhead, 
 the only thing Spark knows is that the executor is lost. The user has to go 
 search through the NM logs to figure out that it's been killed by YARN.
 It would be much nicer if the Spark driver could be notified why the executor 
 was killed. Ideally it could both log an explanatory message and update the 
 UI (and the eventLog) so that it was clear why the executor was lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


