[jira] [Commented] (SPARK-25328) Add a test case for having two columns as the grouping key in group aggregate pandas UDF

2018-09-03 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602630#comment-16602630
 ] 

Xiao Li commented on SPARK-25328:
-

cc [~icexelloss] [~bryanc] [~hyukjin.kwon]

> Add a test case for having two columns as the grouping key in group aggregate 
> pandas UDF
> 
>
> Key: SPARK-25328
> URL: https://issues.apache.org/jira/browse/SPARK-25328
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> https://github.com/apache/spark/pull/20295 added an alternative interface for 
> group aggregate pandas UDFs. It does not include any test case that uses more 
> than one column as the grouping key.






[jira] [Created] (SPARK-25328) Add a test case for having two columns as the grouping key in group aggregate pandas UDF

2018-09-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25328:
---

 Summary: Add a test case for having two columns as the grouping 
key in group aggregate pandas UDF
 Key: SPARK-25328
 URL: https://issues.apache.org/jira/browse/SPARK-25328
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Xiao Li


https://github.com/apache/spark/pull/20295 added an alternative interface for 
group aggregate pandas UDFs. It does not include any test case that uses more than 
one column as the grouping key.
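A minimal sketch of the kind of test this asks for, assuming the PySpark 
grouped-aggregate pandas UDF API from the linked PR (data and column names here 
are illustrative):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 1.0), (1, "a", 2.0), (1, "b", 3.0), (2, "b", 5.0)],
    ["id", "name", "v"])

# Grouped-aggregate pandas UDF: takes a pandas Series, returns one scalar.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    return v.mean()

# The point of the missing test: two columns in the grouping key.
df.groupby("id", "name").agg(mean_udf(df["v"]).alias("mean_v")).show()
{code}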






[jira] [Resolved] (SPARK-25310) ArraysOverlap may throw a CompileException

2018-09-03 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-25310.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.4.0

Issue resolved by pull request 22317
https://github.com/apache/spark/pull/22317

> ArraysOverlap may throw a CompileException
> --
>
> Key: SPARK-25310
> URL: https://issues.apache.org/jira/browse/SPARK-25310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> Invoking the {{ArraysOverlap}} function with a non-nullable array type throws 
> the following error in the code generation phase.
> {code:java}
> Code generation of arrays_overlap([1,2,3], [4,5,3]) failed:
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: Expression "isNull_0" is not an rvalue
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: Expression "isNull_0" is not an rvalue
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260)
> {code}
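For reference, a minimal PySpark snippet that exercises the same call as the error 
above (a sketch only; literal arrays are non-nullable, which is the case the 
report describes):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Mirrors arrays_overlap([1,2,3], [4,5,3]) from the error above.
spark.sql("SELECT arrays_overlap(array(1, 2, 3), array(4, 5, 3))").show()
{code}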






[jira] [Resolved] (SPARK-25307) ArraySort function may return an error in the code generation phase.

2018-09-03 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-25307.
---
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 2.4.0

Issue resolved by pull request 22314
https://github.com/apache/spark/pull/22314

> ArraySort function may return an error in the code generation phase.
> ---
>
> Key: SPARK-25307
> URL: https://issues.apache.org/jira/browse/SPARK-25307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Sorting an array of booleans (non-nullable) produces a compilation error in the 
> code generation phase. Below is the compilation error:
> {code:java}
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 51, Column 23: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 51, Column 23: No applicable constructor/method found for actual parameters 
> "boolean[]"; candidates are: "public static void 
> java.util.Arrays.sort(long[])", "public static void 
> java.util.Arrays.sort(long[], int, int)", "public static void 
> java.util.Arrays.sort(byte[], int, int)", "public static void 
> java.util.Arrays.sort(float[])", "public static void 
> java.util.Arrays.sort(float[], int, int)", "public static void 
> java.util.Arrays.sort(char[])", "public static void 
> java.util.Arrays.sort(char[], int, int)", "public static void 
> java.util.Arrays.sort(short[], int, int)", "public static void 
> java.util.Arrays.sort(short[])", "public static void 
> java.util.Arrays.sort(byte[])", "public static void 
> java.util.Arrays.sort(java.lang.Object[], int, int, java.util.Comparator)", 
> "public static void java.util.Arrays.sort(java.lang.Object[], 
> java.util.Comparator)", "public static void java.util.Arrays.sort(int[])", 
> "public static void java.util.Arrays.sort(java.lang.Object[], int, int)", 
> "public static void java.util.Arrays.sort(java.lang.Object[])", "public 
> static void java.util.Arrays.sort(double[])", "public static void 
> java.util.Arrays.sort(double[], int, int)", "public static void 
> java.util.Arrays.sort(int[], int, int)"
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
>  {code}
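One way to hit this path, assuming the {{array_sort}} SQL function is the entry 
point for the ArraySort expression (a sketch only; a literal array is 
non-nullable, and java.util.Arrays.sort has no boolean[] overload, which is what 
the candidate list above reflects):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sort a literal (non-nullable) array of booleans.
spark.sql("SELECT array_sort(array(true, false, true))").show()
{code}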






[jira] [Resolved] (SPARK-25308) ArrayContains function may return an error in the code generation phase.

2018-09-03 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-25308.
---
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 2.4.0

Issue resolved by pull request 22315
https://github.com/apache/spark/pull/22315

> ArrayContains function may return an error in the code generation phase.
> ---
>
> Key: SPARK-25308
> URL: https://issues.apache.org/jira/browse/SPARK-25308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>
> Invoking the ArrayContains function with a non-nullable array type throws the 
> following error in the code generation phase.
> {code}
> Code generation of array_contains([1,2,3], 1) failed:
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 40, Column 11: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 40, Column 11: Expression "isNull_0" is not an rvalue
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 40, Column 11: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 40, Column 11: Expression "isNull_0" is not an rvalue
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
> {code}
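A minimal PySpark reproduction of the failing call shown above (a sketch only; 
the literal array is non-nullable):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Mirrors array_contains([1,2,3], 1) from the error above.
spark.sql("SELECT array_contains(array(1, 2, 3), 1)").show()
{code}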






[jira] [Updated] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25319:
---
Target Version/s: 2.4.0  (was: 2.3.0)

> Spark MLlib, GraphX 2.4 QA umbrella
> ---
>
> Key: SPARK-25319
> URL: https://issues.apache.org/jira/browse/SPARK-25319
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Weichen Xu
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.4.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Updated] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25319:
---
Fix Version/s: (was: 2.3.0)
   2.4.0

> Spark MLlib, GraphX 2.4 QA umbrella
> ---
>
> Key: SPARK-25319
> URL: https://issues.apache.org/jira/browse/SPARK-25319
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Weichen Xu
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.4.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Updated] (SPARK-25325) ML, Graph 2.4 QA: Update user guide for new features & APIs

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25325:
---
Summary: ML, Graph 2.4 QA: Update user guide for new features & APIs  (was: 
ML, Graph 2.3 QA: Update user guide for new features & APIs)

> ML, Graph 2.4 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-25325
> URL: https://issues.apache.org/jira/browse/SPARK-25325
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Nick Pentreath
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Updated] (SPARK-25327) Update MLlib, GraphX websites for 2.4

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25327:
---
Affects Version/s: 2.4.0
 Target Version/s: 2.4.0
  Summary: Update MLlib, GraphX websites for 2.4  (was: CLONE - 
Update MLlib, GraphX websites for 2.3)

> Update MLlib, GraphX websites for 2.4
> -
>
> Key: SPARK-25327
> URL: https://issues.apache.org/jira/browse/SPARK-25327
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Nick Pentreath
>Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.






[jira] [Updated] (SPARK-25325) ML, Graph 2.3 QA: Update user guide for new features & APIs

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25325:
---
Affects Version/s: 2.4.0
 Target Version/s: 2.4.0
Fix Version/s: (was: 2.3.0)
  Summary: ML, Graph 2.3 QA: Update user guide for new features & 
APIs  (was: CLONE - ML, Graph 2.3 QA: Update user guide for new features & APIs)

> ML, Graph 2.3 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-25325
> URL: https://issues.apache.org/jira/browse/SPARK-25325
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Nick Pentreath
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Updated] (SPARK-25326) ML, Graph 2.4 QA: Programming guide update and migration guide

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25326:
---
Affects Version/s: (was: 2.3.0)
   2.4.0
 Target Version/s: 2.4.0  (was: 2.3.0)
Fix Version/s: (was: 2.3.0)
  Summary: ML, Graph 2.4 QA: Programming guide update and migration 
guide  (was: CLONE - ML, Graph 2.3 QA: Programming guide update and migration 
guide)

> ML, Graph 2.4 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-25326
> URL: https://issues.apache.org/jira/browse/SPARK-25326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Nick Pentreath
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")






[jira] [Updated] (SPARK-25324) ML 2.4 QA: API: Java compatibility, docs

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25324:
---
Affects Version/s: 2.4.0
 Target Version/s: 2.4.0
Fix Version/s: (was: 2.3.0)
  Summary: ML 2.4 QA: API: Java compatibility, docs  (was: CLONE - 
ML 2.3 QA: API: Java compatibility, docs)

> ML 2.4 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-25324
> URL: https://issues.apache.org/jira/browse/SPARK-25324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are no great tools for this.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so that we can make 
> this task easier in the future!






[jira] [Updated] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25321:
---
Affects Version/s: (was: 2.3.0)
   2.4.0
 Target Version/s: 2.4.0  (was: 2.3.0)
Fix Version/s: (was: 2.3.0)

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Updated] (SPARK-25323) ML 2.4 QA: API: Python API coverage

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25323:
---
Target Version/s: 2.4.0
 Summary: ML 2.4 QA: API: Python API coverage  (was: CLONE - ML 2.3 
QA: API: Python API coverage)

> ML 2.4 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Bryan Cutler
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Updated] (SPARK-25323) CLONE - ML 2.3 QA: API: Python API coverage

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25323:
---
Affects Version/s: (was: 2.3.0)
   2.4.0
 Target Version/s:   (was: 2.3.0)

> CLONE - ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Bryan Cutler
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Updated] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25320:
---
Affects Version/s: (was: 2.3.0)
   2.4.0
 Target Version/s: 2.4.0  (was: 2.3.0)

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Updated] (SPARK-25322) ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25322:
---
Affects Version/s: 2.4.0
Fix Version/s: (was: 2.3.0)
  Summary: ML, Graph 2.4 QA: API: Experimental, DeveloperApi, 
final, sealed audit  (was: CLONE - ML, Graph 2.3 QA: API: Experimental, 
DeveloperApi, final, sealed audit)

> ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-25322
> URL: https://issues.apache.org/jira/browse/SPARK-25322
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Nick Pentreath
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Updated] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25321:
---
Description: 
Audit new public Scala APIs added to MLlib & GraphX. Take note of:
 * Protected/public classes or methods. If access can be more private, then it 
should be.
 * Also look for non-sealed traits.
 * Documentation: Missing? Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue. 

For *user guide issues* link the new JIRAs to the relevant user guide QA issue

  was:
Audit new public Scala APIs added to MLlib & GraphX. Take note of:
 * Protected/public classes or methods. If access can be more private, then it 
should be.
 * Also look for non-sealed traits.
 * Documentation: Missing? Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue. 

For *user guide issues* link the new JIRAs to the relevant user guide QA issue 
(SPARK-23111 for {{2.3}})

Summary: ML, Graph 2.4 QA: API: New Scala APIs, docs  (was: CLONE - ML, 
Graph 2.3 QA: API: New Scala APIs, docs)

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue






[jira] [Updated] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25319:
---
Description: 
This JIRA lists tasks for the next Spark release's QA period for MLlib and 
GraphX. *SparkR is separate.

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.
h2. API
 * Check binary API compatibility for Scala/Java
 * Audit new public APIs (from the generated html doc)
 ** Scala
 ** Java compatibility
 ** Python coverage
 * Check Experimental, DeveloperApi tags

h2. Algorithms and performance
 * Performance tests

h2. Documentation and example code
 * For new algorithms, create JIRAs for updating the user guide sections & 
examples
 * Update Programming Guide
 * Update website

  was:
This JIRA lists tasks for the next Spark release's QA period for MLlib and 
GraphX. *SparkR is separate: SPARK-23114.*

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.
h2. API
 * Check binary API compatibility for Scala/Java
 * Audit new public APIs (from the generated html doc)
 ** Scala
 ** Java compatibility
 ** Python coverage
 * Check Experimental, DeveloperApi tags

h2. Algorithms and performance
 * Performance tests

h2. Documentation and example code
 * For new algorithms, create JIRAs for updating the user guide sections & 
examples
 * Update Programming Guide
 * Update website


> Spark MLlib, GraphX 2.4 QA umbrella
> ---
>
> Key: SPARK-25319
> URL: https://issues.apache.org/jira/browse/SPARK-25319
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Weichen Xu
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Created] (SPARK-25326) CLONE - ML, Graph 2.3 QA: Programming guide update and migration guide

2018-09-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-25326:
--

 Summary: CLONE - ML, Graph 2.3 QA: Programming guide update and 
migration guide
 Key: SPARK-25326
 URL: https://issues.apache.org/jira/browse/SPARK-25326
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Affects Versions: 2.3.0
Reporter: Weichen Xu
Assignee: Nick Pentreath
 Fix For: 2.3.0


Before the release, we need to update the MLlib and GraphX Programming Guides. 
Updates will include:
 * Add migration guide subsection.
 ** Use the results of the QA audit JIRAs.
 * Check phrasing, especially in main sections (for outdated items such as "In 
this release, ...")






[jira] [Created] (SPARK-25324) CLONE - ML 2.3 QA: API: Java compatibility, docs

2018-09-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-25324:
--

 Summary: CLONE - ML 2.3 QA: API: Java compatibility, docs
 Key: SPARK-25324
 URL: https://issues.apache.org/jira/browse/SPARK-25324
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Java API, ML, MLlib
Reporter: Weichen Xu
Assignee: Weichen Xu
 Fix For: 2.3.0


Check Java compatibility for this release:
* APIs in {{spark.ml}}
* New APIs in {{spark.mllib}} (There should be few, if any.)

Checking compatibility means:
* Checking for differences in how Scala and Java handle types. Some items to 
look out for are:
** Check for generic "Object" types where Java cannot understand complex Scala 
types.
*** *Note*: The Java docs do not always match the bytecode. If you find a 
problem, please verify it using {{javap}}.
** Check Scala objects (especially with nesting!) carefully.  These may not be 
understood in Java, or they may be accessible only via the weirdly named Java 
types (with "$" or "#") which are generated by the Scala compiler.
** Check for uses of Scala and Java enumerations, which can show up oddly in 
the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
using enumerations, and have instead favored plain strings.)
* Check for differences in generated Scala vs Java docs.  E.g., one past issue 
was that Javadocs did not respect Scala's package private modifier.

If you find issues, please comment here, or for larger items, create separate 
JIRAs and link here as "requires".
* Remember that we should not break APIs from previous releases.  If you find a 
problem, check if it was introduced in this Spark release (in which case we can 
fix it) or in a previous one (in which case we can create a java-friendly 
version of the API).
* If needed for complex issues, create small Java unit tests which execute each 
method.  (Algorithmic correctness can be checked in Scala.)

Recommendations for how to complete this task:
* There are no great tools for this.  In the past, this task has been done by:
** Generating API docs
** Building JAR and outputting the Java class signatures for MLlib
** Manually inspecting and searching the docs and class signatures for issues
* If you do have ideas for better tooling, please say so, so that we can make 
this task easier in the future!






[jira] [Created] (SPARK-25323) CLONE - ML 2.3 QA: API: Python API coverage

2018-09-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-25323:
--

 Summary: CLONE - ML 2.3 QA: API: Python API coverage
 Key: SPARK-25323
 URL: https://issues.apache.org/jira/browse/SPARK-25323
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, PySpark
Affects Versions: 2.3.0
Reporter: Weichen Xu
Assignee: Bryan Cutler


For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
generated HTML doc and compare the Scala & Python versions.
* *GOAL*: Audit and create JIRAs to fix in the next release.
* *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.

We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
*Please use a _separate_ JIRA (linked below as "requires") for this list of 
to-do items.*







[jira] [Updated] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-03 Thread Weichen Xu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-25320:
---
Summary: ML, Graph 2.4 QA: API: Binary incompatible changes  (was: CLONE - 
ML, Graph 2.3 QA: API: Binary incompatible changes)

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Bago Amirbekian
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Created] (SPARK-25320) CLONE - ML, Graph 2.3 QA: API: Binary incompatible changes

2018-09-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-25320:
--

 Summary: CLONE - ML, Graph 2.3 QA: API: Binary incompatible changes
 Key: SPARK-25320
 URL: https://issues.apache.org/jira/browse/SPARK-25320
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Affects Versions: 2.3.0
Reporter: Weichen Xu
Assignee: Bago Amirbekian


Generate a list of binary incompatible changes using MiMa and create new JIRAs 
for issues found. Filter out false positives as needed.

If you want to take this task, look at the analogous task from the previous 
release QA, and ping the Assignee for advice.






[jira] [Created] (SPARK-25322) CLONE - ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-09-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-25322:
--

 Summary: CLONE - ML, Graph 2.3 QA: API: Experimental, 
DeveloperApi, final, sealed audit
 Key: SPARK-25322
 URL: https://issues.apache.org/jira/browse/SPARK-25322
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Reporter: Weichen Xu
Assignee: Nick Pentreath
 Fix For: 2.3.0


We should make a pass through the items marked as Experimental or DeveloperApi 
and see if any are stable enough to be unmarked.

We should also check for items marked final or sealed to see if they are stable 
enough to be opened up as APIs.






[jira] [Created] (SPARK-25327) CLONE - Update MLlib, GraphX websites for 2.3

2018-09-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-25327:
--

 Summary: CLONE - Update MLlib, GraphX websites for 2.3
 Key: SPARK-25327
 URL: https://issues.apache.org/jira/browse/SPARK-25327
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Reporter: Weichen Xu
Assignee: Nick Pentreath


Update the sub-projects' websites to include new features in this release.






[jira] [Created] (SPARK-25321) CLONE - ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-09-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-25321:
--

 Summary: CLONE - ML, Graph 2.3 QA: API: New Scala APIs, docs
 Key: SPARK-25321
 URL: https://issues.apache.org/jira/browse/SPARK-25321
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Affects Versions: 2.3.0
Reporter: Weichen Xu
Assignee: Yanbo Liang
 Fix For: 2.3.0


Audit new public Scala APIs added to MLlib & GraphX. Take note of:
 * Protected/public classes or methods. If access can be more private, then it 
should be.
 * Also look for non-sealed traits.
 * Documentation: Missing? Bad links or formatting?

*Make sure to check the object doc!*

As you find issues, please create JIRAs and link them to this issue. 

For *user guide issues* link the new JIRAs to the relevant user guide QA issue 
(SPARK-23111 for {{2.3}})






[jira] [Created] (SPARK-25325) CLONE - ML, Graph 2.3 QA: Update user guide for new features & APIs

2018-09-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-25325:
--

 Summary: CLONE - ML, Graph 2.3 QA: Update user guide for new 
features & APIs
 Key: SPARK-25325
 URL: https://issues.apache.org/jira/browse/SPARK-25325
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, GraphX, ML, MLlib
Reporter: Weichen Xu
Assignee: Nick Pentreath
 Fix For: 2.3.0


Check the user guide vs. a list of new APIs (classes, methods, data members) to 
see what items require updates to the user guide.

For each feature missing user guide doc:
* Create a JIRA for that feature, and assign it to the author of the feature
* Link it to (a) the original JIRA which introduced that feature ("related to") 
and (b) to this JIRA ("requires").

For MLlib:
* This task does not include major reorganizations for the programming guide.
* We should now begin copying algorithm details from the spark.mllib guide to 
spark.ml as needed, rather than just linking back to the corresponding 
algorithms in the spark.mllib user guide.

If you would like to work on this task, please comment, and we can create & 
link JIRAs for parts of this work.






[jira] [Created] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella

2018-09-03 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-25319:
--

 Summary: Spark MLlib, GraphX 2.4 QA umbrella
 Key: SPARK-25319
 URL: https://issues.apache.org/jira/browse/SPARK-25319
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, GraphX, ML, MLlib
Reporter: Weichen Xu
Assignee: Joseph K. Bradley
 Fix For: 2.3.0


This JIRA lists tasks for the next Spark release's QA period for MLlib and 
GraphX. *SparkR is separate: SPARK-23114.*

The list below gives an overview of what is involved, and the corresponding 
JIRA issues are linked below that.
h2. API
 * Check binary API compatibility for Scala/Java
 * Audit new public APIs (from the generated html doc)
 ** Scala
 ** Java compatibility
 ** Python coverage
 * Check Experimental, DeveloperApi tags

h2. Algorithms and performance
 * Performance tests

h2. Documentation and example code
 * For new algorithms, create JIRAs for updating the user guide sections & 
examples
 * Update Programming Guide
 * Update website






[jira] [Assigned] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25318:


Assignee: (was: Apache Spark)

> Add exception handling when wrapping the input stream during the fetch or 
> stage retry in response to a corrupted block
> --
>
> Key: SPARK-25318
> URL: https://issues.apache.org/jira/browse/SPARK-25318
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Reza Safi
>Priority: Minor
>
> SPARK-4105 provided a solution to the block corruption issue by retrying the 
> fetch or the stage. In that solution there is a step that wraps the input 
> stream with compression and/or encryption. This step is prone to exceptions, 
> but the current code has no exception handling around it, which has caused 
> confusion for users. In fact, we have customers who reported an exception like 
> the following when SPARK-4105 is available to them:
> {noformat}
> 2018-08-28 22:35:54,361 ERROR [Driver] 
> org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: 
> java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 452 in stage 209.0 failed 4 times, most recent 
> failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): 
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   3976 at 
> org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395)
>   3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431)
>   3980 at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   3981 at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   3982 at 
> org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   3983 at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
>   3984 at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219)
>   3985 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48)
>   3986 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47)
>   3987 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328)
>   3988 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55)
>   3989 at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   3990 a
> {noformat}
> In this customer's version of Spark, line 328 of 
> ShuffleBlockFetcherIterator.scala is the line where the following occurs:
> {noformat}
> input = streamWrapper(blockId, in)
> {noformat}
> It would be nice to add exception handling around this line to avoid such 
> confusion.
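Not the proposed patch, just a schematic sketch (in Python, with illustrative 
names; the real code is Scala in ShuffleBlockFetcherIterator.scala) of the guard 
being asked for: catch failures from the wrapping step and surface them as a 
recognizable corruption error that the SPARK-4105 retry path can act on.

{code:python}
# Schematic only; stream_wrapper stands in for the compression/encryption
# wrapping step, and all names here are illustrative.
def wrap_fetched_block(block_id, raw_stream, stream_wrapper):
    try:
        return stream_wrapper(block_id, raw_stream)
    except IOError as exc:
        # Re-raise with context so the caller can treat the block as corrupt
        # and retry the fetch, rather than failing with an opaque error.
        raise RuntimeError(
            "Failed to wrap stream for block %s; treating block as corrupt"
            % block_id) from exc
{code}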






[jira] [Commented] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block

2018-09-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602571#comment-16602571
 ] 

Apache Spark commented on SPARK-25318:
--

User 'rezasafi' has created a pull request for this issue:
https://github.com/apache/spark/pull/22325

> Add exception handling when wrapping the input stream during the fetch or 
> stage retry in response to a corrupted block
> --
>
> Key: SPARK-25318
> URL: https://issues.apache.org/jira/browse/SPARK-25318
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Reza Safi
>Priority: Minor
>
> SPARK-4105 provided a solution to the block corruption issue by retrying the 
> fetch or the stage. In that solution there is a step that wraps the input 
> stream with compression and/or encryption. This step is prone to exceptions, 
> but the current code has no exception handling around it, which has caused 
> confusion for users. In fact, we have customers who reported an exception like 
> the following when SPARK-4105 is available to them:
> {noformat}
> 2018-08-28 22:35:54,361 ERROR [Driver] 
> org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: 
> java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 452 in stage 209.0 failed 4 times, most recent 
> failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): 
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   3976 at 
> org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395)
>   3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431)
>   3980 at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   3981 at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   3982 at 
> org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   3983 at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
>   3984 at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219)
>   3985 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48)
>   3986 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47)
>   3987 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328)
>   3988 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55)
>   3989 at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   3990 a
> {noformat}
> In this customer's version of Spark, line 328 of 
> ShuffleBlockFetcherIterator.scala is the line where the following occurs:
> {noformat}
> input = streamWrapper(blockId, in)
> {noformat}
> It would be nice to add exception handling around this line to avoid such 
> confusion.






[jira] [Assigned] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25318:


Assignee: Apache Spark

> Add exception handling when wrapping the input stream during the fetch or 
> stage retry in response to a corrupted block
> --
>
> Key: SPARK-25318
> URL: https://issues.apache.org/jira/browse/SPARK-25318
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Reza Safi
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-4105 provided a solution to the block corruption issue by retrying the 
> fetch or the stage. In that solution there is a step that wraps the input 
> stream with compression and/or encryption. This step is prone to exceptions, 
> but the current code has no exception handling around it, which has caused 
> confusion for users. In fact, we have customers who reported an exception like 
> the following when SPARK-4105 is available to them:
> {noformat}
> 2018-08-28 22:35:54,361 ERROR [Driver] 
> org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: 
> java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 452 in stage 209.0 failed 4 times, most recent 
> failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): 
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   3976 at 
> org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395)
>   3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431)
>   3980 at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   3981 at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   3982 at 
> org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   3983 at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
>   3984 at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219)
>   3985 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48)
>   3986 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47)
>   3987 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328)
>   3988 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55)
>   3989 at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   3990 a
> {noformat}
> In this customer's version of Spark, line 328 of 
> ShuffleBlockFetcherIterator.scala is the line where the following occurs:
> {noformat}
> input = streamWrapper(blockId, in)
> {noformat}
> It would be nice to add exception handling around this line to avoid 
> confusion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block

2018-09-03 Thread Reza Safi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Safi updated SPARK-25318:
--
Description: 
SPARK-4105 provided a solution to block corruption issue by retrying the fetch 
or the stage. In the solution there is a step that wraps the input stream with 
compression and/or encryption. This step is prone to exceptions, but in the 
current code there is no exception handling for this step and this has caused 
confusion for the user. In fact we have customers who reported an exception 
like the following when SPARK-4105 is available to them:
{noformat}
2018-08-28 22:35:54,361 ERROR [Driver] 
org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: 
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to 
   stage failure: Task 452 in stage 209.0 failed 4 times, most recent 
failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): 
java.io.IOException: FAILED_TO_UNCOMPRESS(5)
  3976 at 
org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
  3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
  3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395)
  3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431)
  3980 at 
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
  3981 at 
org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
  3982 at 
org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
  3983 at 
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
  3984 at 
org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219)
  3985 at 
org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48)
  3986 at 
org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47)
  3987 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328)
  3988 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55)
  3989 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  3990 a
{noformat}
In this customer's version of Spark, line 328 of 
ShuffleBlockFetcherIterator.scala is the line where the following occurs:
{noformat}
input = streamWrapper(blockId, in)
{noformat}
It would be nice to add exception handling around this line to avoid confusion.

  was:
SPARK-4105 provided a solution to block corruption issue by retrying the fetch 
or the stage. In the solution there is a step that wraps the input stream with 
compression and/or encryption. This step is prune to exceptions, but in the 
current code there is no exception handling for this step and this has caused 
confusion for the user.. In fact we have customers who reported an exception 
like the following when SPARK-4105 is available to them:
{noformat}
2018-08-28 22:35:54,361 ERROR [Driver] 
org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: 
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to 
   stage failure: Task 452 in stage 209.0 failed 4 times, most recent 
failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): 
java.io.IOException: FAILED_TO_UNCOMPRESS(5)
  3976 at 
org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
  3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
  3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395)
  3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431)
  3980 at 
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
  3981 at 
org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
  3982 at 
org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
  3983 at 
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
  3984 at 
org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219)
  3985 at 
org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48)
  3986 at 
org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47)
  3987 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328)
  3988 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55)
  3989 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  3990 a
{noformat}
In this customer's version of spark, line 328 of 
ShuffleBlockFetcherIterator.scala is 

[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602567#comment-16602567
 ] 

Kazuaki Ishizaki commented on SPARK-25317:
--

I confirmed this performance difference even after adding a warm-up. Let me 
investigate further.

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block

2018-09-03 Thread Reza Safi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602565#comment-16602565
 ] 

Reza Safi commented on SPARK-25318:
---

I will send a pr for this shortly

> Add exception handling when wrapping the input stream during the fetch or 
> stage retry in response to a corrupted block
> --
>
> Key: SPARK-25318
> URL: https://issues.apache.org/jira/browse/SPARK-25318
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Reza Safi
>Priority: Minor
>
> SPARK-4105 provided a solution to block corruption issue by retrying the 
> fetch or the stage. In the solution there is a step that wraps the input 
> stream with compression and/or encryption. This step is prone to exceptions, 
> but in the current code there is no exception handling for this step and this 
> has caused confusion for the user. In fact we have customers who reported an 
> exception like the following when SPARK-4105 is available to them:
> {noformat}
> 2018-08-28 22:35:54,361 ERROR [Driver] 
> org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: 
> java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 452 in stage 209.0 failed 4 times, most recent 
> failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): 
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   3976 at 
> org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395)
>   3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431)
>   3980 at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   3981 at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   3982 at 
> org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   3983 at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
>   3984 at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219)
>   3985 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48)
>   3986 at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47)
>   3987 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328)
>   3988 at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55)
>   3989 at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   3990 a
> {noformat}
> In this customer's version of Spark, line 328 of 
> ShuffleBlockFetcherIterator.scala is the line where the following occurs:
> {noformat}
> input = streamWrapper(blockId, in)
> {noformat}
> It would be nice to add exception handling around this line to avoid 
> confusion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block

2018-09-03 Thread Reza Safi (JIRA)
Reza Safi created SPARK-25318:
-

 Summary: Add exception handling when wrapping the input stream 
during the fetch or stage retry in response to a corrupted block
 Key: SPARK-25318
 URL: https://issues.apache.org/jira/browse/SPARK-25318
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1, 2.2.2, 2.1.3, 2.4.0
Reporter: Reza Safi


SPARK-4105 provided a solution to block corruption issue by retrying the fetch 
or the stage. In the solution there is a step that wraps the input stream with 
compression and/or encryption. This step is prone to exceptions, but in the 
current code there is no exception handling for this step and this has caused 
confusion for the user. In fact we have customers who reported an exception 
like the following when SPARK-4105 is available to them:
{noformat}
2018-08-28 22:35:54,361 ERROR [Driver] 
org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: 
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to 
   stage failure: Task 452 in stage 209.0 failed 4 times, most recent 
failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): 
java.io.IOException: FAILED_TO_UNCOMPRESS(5)
  3976 at 
org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
  3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
  3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395)
  3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431)
  3980 at 
org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
  3981 at 
org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
  3982 at 
org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
  3983 at 
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159)
  3984 at 
org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219)
  3985 at 
org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48)
  3986 at 
org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47)
  3987 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328)
  3988 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55)
  3989 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  3990 a
{noformat}
In this customer's version of Spark, line 328 of 
ShuffleBlockFetcherIterator.scala is the line where the following occurs:
{noformat}
input = streamWrapper(blockId, in)
{noformat}
It would be nice to add exception handling around this line to avoid confusion.
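For illustration only, a minimal sketch of the kind of guard the description asks for, assuming the wrapping happens where streamWrapper is invoked and that surfacing the failure as an IOException tied to the block is an acceptable way to hand it to the existing retry logic (the exception type and message below are assumptions, not the actual patch):
{code:java}
// Hypothetical sketch, not the actual Spark change: guard the
// compression/encryption wrapping step so a corrupt block surfaces as an
// IOException for this blockId instead of an unhandled codec error.
val input: java.io.InputStream =
  try {
    streamWrapper(blockId, in)   // may throw if the fetched block is corrupt
  } catch {
    case e: Exception =>
      in.close()                 // release the raw fetched stream
      throw new java.io.IOException(
        s"Failed to wrap input stream for block $blockId (possibly corrupt)", e)
  }
{code}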



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25237) FileScanRdd's inputMetrics is wrong when selecting a datasource table with limit

2018-09-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602542#comment-16602542
 ] 

Apache Spark commented on SPARK-25237:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22324

> FileScanRdd's inputMetrics is wrong when selecting a datasource table with 
> limit
> 
>
> Key: SPARK-25237
> URL: https://issues.apache.org/jira/browse/SPARK-25237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: du
>Priority: Major
>
> In FileScanRdd, we update inputMetrics's bytesRead using updateBytesRead 
> every 1000 rows or when the iterator is closed.
> But when the iterator is closed, we also invoke updateBytesReadWithFileSize to 
> increase inputMetrics's bytesRead by the file's length.
> This makes inputMetrics's bytesRead wrong when running a query with a limit, 
> such as select * from table limit 1.
> Because we no longer support Hadoop 2.5 and earlier, we can always get 
> bytesRead from Hadoop FileSystem statistics instead of using the file's length.
>  
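A hypothetical reproduction sketch (the path, format, and metric check are illustrative and not taken from the report):
{code:java}
// Hypothetical reproduction sketch; path and format are illustrative.
val df = spark.read.parquet("/tmp/some_datasource_table")
df.limit(1).collect()
// Inspect task-level input metrics (Spark UI, or a SparkListener reading
// taskEnd.taskMetrics.inputMetrics.bytesRead): with this bug, bytesRead is
// inflated by the whole file length when the limit closes the iterator early.
{code}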



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25237) FileScanRdd's inputMetrics is wrong when selecting a datasource table with limit

2018-09-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602543#comment-16602543
 ] 

Apache Spark commented on SPARK-25237:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22324

> FileScanRdd's inputMetrics is wrong when selecting a datasource table with 
> limit
> 
>
> Key: SPARK-25237
> URL: https://issues.apache.org/jira/browse/SPARK-25237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: du
>Priority: Major
>
> In FileScanRdd, we update inputMetrics's bytesRead using updateBytesRead 
> every 1000 rows or when the iterator is closed.
> But when the iterator is closed, we also invoke updateBytesReadWithFileSize to 
> increase inputMetrics's bytesRead by the file's length.
> This makes inputMetrics's bytesRead wrong when running a query with a limit, 
> such as select * from table limit 1.
> Because we no longer support Hadoop 2.5 and earlier, we can always get 
> bytesRead from Hadoop FileSystem statistics instead of using the file's length.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-09-03 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602528#comment-16602528
 ] 

Hyukjin Kwon commented on SPARK-25293:
--

[~omkar999], would you be able to test this and see if the issue still exists 
in a newer Spark version?

> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25293
> URL: https://issues.apache.org/jira/browse/SPARK-25293
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2
>Reporter: omkar puttagunta
>Priority: Major
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node on AWS EC2
> {quote}
> Simple Test; reading pipe delimited file and writing data to csv. Commands 
> below are executed in spark-shell with master-url set
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the cmds above in spark-shell with master url set.
> {quote}In {{worker1}} -> Each part file is created 
> in\{{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
>  In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under outputDirectory specified during write.
> {quote}
> *Same thing happens with coalesce(100) or without specifying 
> repartition/coalesce!!! Tried with Java also!*
> *_Question_*
> 1) Why doesn't the {{/opt/outputFile/}} output directory on {{worker1}} have 
> {{part-}} files just like on {{worker2}}? Why is a {{_temporary}} directory 
> created, with the {{part-xxx-xx}} files residing in the {{task-xxx}} directories?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602512#comment-16602512
 ] 

Jungtaek Lim commented on SPARK-25317:
--

Why not run the test with JMH, applying warmup and measurement iterations? Not 
sure it can be applied to a Scala test, but the Java test code should be simple 
if these Spark classes are interop-friendly.
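For reference, a minimal sketch of what such a benchmark could look like with the sbt-jmh plugin (the class name and warmup/measurement settings are illustrative assumptions):
{code:java}
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical JMH benchmark sketch; requires the sbt-jmh plugin and master,
// where Murmur3_x86_32.hashUTF8String exists.
@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
class HashUTF8StringBenchmark {
  private var str: UTF8String = _

  @Setup
  def setup(): Unit = {
    str = UTF8String.fromString("b" * 10001)
  }

  @Benchmark
  def hashUTF8String(): Int = Murmur3_x86_32.hashUTF8String(str, 0)
}
{code}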

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20395) Update Scala to 2.11.11 and zinc to 0.3.15

2018-09-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-20395:
-

Assignee: DB Tsai

> Update Scala to 2.11.11 and zinc to 0.3.15
> --
>
> Key: SPARK-20395
> URL: https://issues.apache.org/jira/browse/SPARK-20395
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build, Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Jeremy Smith
>Assignee: DB Tsai
>Priority: Minor
> Fix For: 2.4.0
>
>
> Update Scala to 2.11.11, which was released yesterday:
> https://github.com/scala/scala/releases/tag/v2.11.11
> Since it's a patch version upgrade and binary compatibility is guaranteed, 
> impact should be minimal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20395) Update Scala to 2.11.11 and zinc to 0.3.15

2018-09-03 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602510#comment-16602510
 ] 

Dongjoon Hyun edited comment on SPARK-20395 at 9/4/18 12:59 AM:


In SPARK-24418, Scala version becomes 2.11.12 (Spark 2.4.0) by [~dbtsai].
In SPARK-19810, Zinc version becomes 0.3.15 (Spark 2.3.0) by [~srowen].


was (Author: dongjoon):
In SPARK-24418, Scala version becomes 2.11.12 (Spark 2.4.0)
In SPARK-19810, Zinc version becomes 0.3.15 (Spark 2.3.0)

> Update Scala to 2.11.11 and zinc to 0.3.15
> --
>
> Key: SPARK-20395
> URL: https://issues.apache.org/jira/browse/SPARK-20395
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build, Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Jeremy Smith
>Priority: Minor
> Fix For: 2.4.0
>
>
> Update Scala to 2.11.11, which was released yesterday:
> https://github.com/scala/scala/releases/tag/v2.11.11
> Since it's a patch version upgrade and binary compatibility is guaranteed, 
> impact should be minimal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20395) Update Scala to 2.11.11 and zinc to 0.3.15

2018-09-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-20395.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

In SPARK-24418, Scala version becomes 2.11.12 (Spark 2.4.0)
In SPARK-19810, Zinc version becomes 0.3.15 (Spark 2.3.0)

> Update Scala to 2.11.11 and zinc to 0.3.15
> --
>
> Key: SPARK-20395
> URL: https://issues.apache.org/jira/browse/SPARK-20395
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build, Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Jeremy Smith
>Priority: Minor
> Fix For: 2.4.0
>
>
> Update Scala to 2.11.11, which was released yesterday:
> https://github.com/scala/scala/releases/tag/v2.11.11
> Since it's a patch version upgrade and binary compatibility is guaranteed, 
> impact should be minimal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20395) Update Scala to 2.11.11 and zinc to 0.3.15

2018-09-03 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-20395:
---

> Update Scala to 2.11.11 and zinc to 0.3.15
> --
>
> Key: SPARK-20395
> URL: https://issues.apache.org/jira/browse/SPARK-20395
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build, Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Jeremy Smith
>Priority: Minor
>
> Update Scala to 2.11.11, which was released yesterday:
> https://github.com/scala/scala/releases/tag/v2.11.11
> Since it's a patch version upgrade and binary compatibility is guaranteed, 
> impact should be minimal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602506#comment-16602506
 ] 

Kazuaki Ishizaki commented on SPARK-25317:
--

Let me run this on 2.3 and master.
One question: this benchmark does not have a warm-up loop. In other words, this 
benchmark may include time spent in the interpreter, too. Is this behavior 
intentional?
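For illustration, a warmed-up variant of the posted snippet could look roughly like this, reusing str and numIter from the snippet above (sketch only; the warm-up count is an arbitrary assumption):
{code:java}
// Sketch: run the same call untimed first so the JIT gets a chance to
// compile it, then time the measured iterations as before.
val warmupCalls = 3000  // arbitrary warm-up count
for (_ <- 0 until warmupCalls) {
  Murmur3_x86_32.hashUTF8String(str, 0)
}
val start = System.nanoTime
for (_ <- 0 until numIter) {
  Murmur3_x86_32.hashUTF8String(str, 0)
}
// microseconds per measured call
val duration = (System.nanoTime() - start) / 1000 / numIter
println(s"duration $duration us")
{code}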

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25317:
-
Description: 
There is a performance regression when calculating hash code for UTF8String:
{code:java}
  test("hashing") {
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String
val hasher = new Murmur3_x86_32(0)
val str = UTF8String.fromString("b" * 10001)
val numIter = 10
val start = System.nanoTime
for (i <- 0 until numIter) {
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
}
val duration = (System.nanoTime() - start) / 1000 / numIter
println(s"duration $duration us")
  }
{code}
To run this test in 2.3, we need to add
{code:java}
public static int hashUTF8String(UTF8String str, int seed) {
return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
str.numBytes(), seed);
  }
{code}
to `Murmur3_x86_32`

In my laptop, the result for master vs 2.3 is: 120 us vs 40 us

  was:
There is a performance regression when calculating hash code for UTF8String:

{code}
  test("hashing") {
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String
val hasher = new Murmur3_x86_32(0)
val str = UTF8String.fromString("b" * 10001)
val numIter = 10
val start = System.nanoTime
for (i <- 0 until numIter) {
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
}
val duration = (System.nanoTime() - start) / 1000 / numIter
println(s"duration $duration us")
  }
{code}

To run this test in 2.3, we need to add
{code}
public static int hashUTF8String(UTF8String str, int seed) {
return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
str.numBytes(), seed);
  }
{code}
to `Murmur3_x86_32`

In my laptop, the result for master vs 2.3 is: 120 us vs 40 us


> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import 

[jira] [Comment Edited] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602503#comment-16602503
 ] 

Wenchen Fan edited comment on SPARK-25317 at 9/4/18 12:24 AM:
--

cc [~kiszk]  [~rednaxelafx]


was (Author: cloud_fan):
cc [~kiszk] 

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602503#comment-16602503
 ] 

Wenchen Fan commented on SPARK-25317:
-

cc [~kiszk] 

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25317:
---

 Summary: MemoryBlock performance regression
 Key: SPARK-25317
 URL: https://issues.apache.org/jira/browse/SPARK-25317
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan


There is a performance regression when calculating hash code for UTF8String:

{code}
  test("hashing") {
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String
val hasher = new Murmur3_x86_32(0)
val str = UTF8String.fromString("b" * 10001)
val numIter = 10
val start = System.nanoTime
for (i <- 0 until numIter) {
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
}
val duration = (System.nanoTime() - start) / 1000 / numIter
println(s"duration $duration us")
  }
{code}

To run this test in 2.3, we need to add
{code}
public static int hashUTF8String(UTF8String str, int seed) {
return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
str.numBytes(), seed);
  }
{code}
to `Murmur3_x86_32`

In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25317:

Priority: Blocker  (was: Major)

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25316) Spark error - ERROR ContextCleaner: Error cleaning broadcast 22, Exception thrown in awaitResult:

2018-09-03 Thread Vidya (JIRA)
Vidya created SPARK-25316:
-

 Summary: Spark error - ERROR ContextCleaner: Error cleaning 
broadcast 22,  Exception thrown in awaitResult: 
 Key: SPARK-25316
 URL: https://issues.apache.org/jira/browse/SPARK-25316
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.2.2
Reporter: Vidya


While running a Spark load on EMR with c3 instances, we see the following error: 

ERROR ContextCleaner: Error cleaning broadcast 22
org.apache.spark.SparkException: Exception thrown in awaitResult: 

 

What's the cause of the error and how do we fix it?

 

Stage 30:=> (374 + 20) / 600]
[Stage 30:=> (419 + 20) / 600]
[Stage 30:==> (471 + 4) / 600]18/08/02 
21:06:09 ERROR TransportResponseHandler: Still have 1 requests outstanding when 
connection from /10.154.21.145:45990 is closed
18/08/02 21:06:09 ERROR ContextCleaner: Error cleaning broadcast 22
org.apache.spark.SparkException: Exception thrown in awaitResult: 
 at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
 at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
 at 
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:161)
 at 
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:306)
 at 
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
 at 
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60)
 at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238)
 at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194)
 at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185)
 at scala.Option.foreach(Option.scala:257)
 at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185)
 at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1279)
 at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
 at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)
Caused by: java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
 at sun.nio.ch.IOUtil.read(IOUtil.java:192)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
 at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
 at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
 at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc

2018-09-03 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602440#comment-16602440
 ] 

Dilip Biswal commented on SPARK-25279:
--

[~zzcclp] Hmm, I don't know what's happening in :paste mode. Let me cc the 
experts.
cc [~cloud_fan] [~viirya] 
Hi Wenchen and Simon, do you know the reason for the failure?

> Throw exception: zzcclp   java.io.NotSerializableException: 
> org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
> ---
>
> Key: SPARK-25279
> URL: https://issues.apache.org/jira/browse/SPARK-25279
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.1
>Reporter: Zhichao  Zhang
>Priority: Minor
>
> Hi dev: 
>   I am using Spark-Shell to run the example which is in section 
> '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'],
>  
> and there is an error: 
> {code:java}
> Caused by: java.io.NotSerializableException: 
> org.apache.spark.sql.TypedColumn 
> Serialization stack: 
>         - object not serializable (class: org.apache.spark.sql.TypedColumn, 
> value: 
> myaverage() AS `average_salary`) 
>         - field (class: $iw, name: averageSalary, type: class 
> org.apache.spark.sql.TypedColumn) 
>         - object (class $iw, $iw@4b2f8ae9) 
>         - field (class: MyAverage$, name: $outer, type: class $iw) 
>         - object (class MyAverage$, MyAverage$@2be41d90) 
>         - field (class: 
> org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, 
> name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) 
>         - object (class 
> org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, 
> MyAverage(Employee)) 
>         - field (class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, 
> name: aggregateFunction, type: class 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) 
>         - object (class 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, 
> partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), 
> Some(class Employee), Some(StructType(StructField(name,StringType,true), 
> StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, 
> Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, 
> Average, true])).count AS count#26L, newInstance(class Average), input[0, 
> double, false] AS value#24, DoubleType, false, 0, 0)) 
>         - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy) 
>         - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@5e92c46f) 
>         - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy) 
>         - object (class scala.collection.immutable.$colon$colon, 
> List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class 
> Employee)), Some(class Employee), 
> Some(StructType(StructField(name,StringType,true), 
> StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, 
> Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, 
> Average, true])).count AS count#26L, newInstance(class Average), input[0, 
> double, false] AS value#24, DoubleType, false, 0, 0))) 
>         - field (class: 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: 
> aggregateExpressions, type: interface scala.collection.Seq) 
>         - object (class 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, 
> ObjectHashAggregate(keys=[], 
> functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class 
> Employee)), Some(class Employee), 
> Some(StructType(StructField(name,StringType,true), 
> StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, 
> Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, 
> Average, true])).count AS count#26L, newInstance(class Average), input[0, 
> double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) 
> +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: 
> InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json],
>  
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct 
> ) 
>         - field (class: 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1,
>  
> name: $outer, type: class 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) 
>         - object (class 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1,
>  
> ) 
>  

[jira] [Resolved] (SPARK-25117) Add EXCEPT ALL and INTERSECT ALL support in R.

2018-09-03 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-25117.
--
Resolution: Fixed
  Assignee: Dilip Biswal

> Add EXCEPT ALL and INTERSECT ALL support in R.
> -
>
> Key: SPARK-25117
> URL: https://issues.apache.org/jira/browse/SPARK-25117
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25117) Add EXCEPT ALL and INTERSECT ALL support in R.

2018-09-03 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-25117:
-
Fix Version/s: 2.4.0

> Add EXCEPT ALL and INTERSECT ALL support in R.
> -
>
> Key: SPARK-25117
> URL: https://issues.apache.org/jira/browse/SPARK-25117
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22190) Add Spark executor task metrics to Dropwizard metrics

2018-09-03 Thread Luca Canali (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali closed SPARK-22190.
---

> Add Spark executor task metrics to Dropwizard metrics
> -
>
> Key: SPARK-22190
> URL: https://issues.apache.org/jira/browse/SPARK-22190
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: SparkTaskMetrics_Grafana_example.PNG
>
>
> I would like to propose exposing Spark executor task metrics through the 
> Dropwizard metrics system. I have developed a simple implementation and run a few 
> tests using the Graphite sink and Grafana visualization, and this appears to me to be a 
> good source of information for monitoring and troubleshooting the progress of 
> Spark jobs. I attach a screenshot of an example graph generated with Grafana.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21829) Enable config to permanently blacklist a list of nodes

2018-09-03 Thread Luca Canali (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali closed SPARK-21829.
---

> Enable config to permanently blacklist a list of nodes
> --
>
> Key: SPARK-21829
> URL: https://issues.apache.org/jira/browse/SPARK-21829
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> The idea for this proposal comes from a performance incident in a local 
> cluster where a job was found to be very slow because of a long tail of stragglers 
> due to 2 nodes in the cluster being slow to access a remote filesystem.
> The issue was limited to the 2 machines and was related to external 
> configurations: the 2 machines that performed badly when accessing the remote 
> file system were behaving normally for other jobs in the cluster (a shared 
> YARN cluster).
> With this new feature I propose to introduce a mechanism to allow users to 
> specify a list of nodes in the cluster where executors/tasks should not run 
> for a specific job.
> The proposed implementation that I tested (see PR) uses the Spark blacklist 
> mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list 
> of user-specified nodes is added to the blacklist at the start of the Spark 
> Context and never expires. 
> I have tested this on a YARN cluster on a case taken from the original 
> production problem and I confirm a performance improvement of about 5x for 
> the specific test case I have. I imagine that there can be other cases where 
> Spark users may want to blacklist a set of nodes. This can be used for 
> troubleshooting, including cases where certain nodes/executors are slow for a 
> given workload and this is caused by external agents, so the anomaly is not 
> picked up by the cluster manager.
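For illustration only, a minimal Scala sketch of how a job might opt in to the proposed parameter when building its session; the comma-separated host-name format is an assumption, since the proposal above does not spell out the expected value format:

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch only: spark.blacklist.enabled is an existing setting, while
// spark.blacklist.alwaysBlacklistedNodes is the parameter proposed above,
// and its value format (comma-separated host names) is assumed here.
val spark = SparkSession.builder()
  .appName("blacklist-example")
  .config("spark.blacklist.enabled", "true")
  .config("spark.blacklist.alwaysBlacklistedNodes", "node1.example.com,node2.example.com")
  .getOrCreate()
{code}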



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2018-09-03 Thread Luca Canali (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali closed SPARK-21519.
---

> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 2.3.0
>
>
> This proposes an option to the JDBC datasource, tentatively called 
> "sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements)
>  . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> {code}
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell --jars ojdb6.jar
> val df = spark.read
>.format("jdbc")
>.option("url", 
> "jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name")
>.option("driver", "oracle.jdbc.driver.OracleDriver")
>.option("dbtable", "(select 1, sysdate, systimestamp, 
> current_timestamp, localtimestamp from dual)")
>.option("user", "MYUSER")
>.option("password", "MYPASSWORD").option("fetchsize",1000)
>.option("sessionInitStatement", preambleSQL)
>.load()
> df.show(5,false)
> {code}
> *Comments:* This proposal has been developed and tested for connecting the 
> Spark JDBC data source to Oracle databases, however I believe it can be 
> useful for other target DBs too, as it is quite generic.   
> The code executed by the option "sessionInitStatement" is just the 
> user-provided string fed through the execute method of the JDBC connection, 
> so it can use the features of the target database language/syntax. When using 
> sessionInitStatement for querying Oracle, for example, the user-provided 
> command can be a SQL statement or a PL/SQL block grouping multiple commands 
> and logic.   
> Note that the proposed code allows SQL to be injected into the target database. This is 
> not a security concern as such, as it requires password authentication; 
> however, be aware of the possibilities this opens for injecting user-provided SQL (and 
> PL/SQL).  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25309) Sci-kit Learn like Auto Pipeline Parallelization in Spark

2018-09-03 Thread Ravi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi updated SPARK-25309:
-
Description: SPARK-19357 and SPARK-21911 have helped parallelize Pipelines 
in Spark. However, instead of setting the parallelism Parameter in the 
CrossValidator it would be good to have something like njobs=-1 (like Scikit 
Learn) where the Pipeline DAG could be automatically parallelized and scheduled 
based on the resources allocated to the Spark Session instead of having the 
user pick the integer value for this parameter.   (was: SPARK-19357 and 
SPARK-21911 have helped parallelize Pipelines in Spark. However, instead of 
setting the parallelism Parameter in the CrossValidator it would be good to 
have something like njobs=-1 (like Scikit Learn) where the Pipleline DAG could 
be automatically parallelized and scheduled based on the resources allocated to 
the Spark Session instead of having the user pick the integer value for this 
parameter. )

> Sci-kit Learn like Auto Pipeline Parallelization in Spark 
> --
>
> Key: SPARK-25309
> URL: https://issues.apache.org/jira/browse/SPARK-25309
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Ravi
>Priority: Critical
>
> SPARK-19357 and SPARK-21911 have helped parallelize Pipelines in Spark. 
> However, instead of setting the parallelism Parameter in the CrossValidator 
> it would be good to have something like njobs=-1 (like Scikit Learn) where 
> the Pipeline DAG could be automatically parallelized and scheduled based on 
> the resources allocated to the Spark Session instead of having the user pick 
> the integer value for this parameter. 
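For context, a minimal Scala sketch of the current API the description refers to, where the user hand-picks the parallelism value; the estimator and grid are placeholders, and the proposal above would replace the fixed integer with a value inferred from the session's resources:

{code:java}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

// Today the degree of parallelism is a fixed integer chosen by the user.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setParallelism(4)
{code}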



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc

2018-09-03 Thread Zhichao Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602308#comment-16602308
 ] 

Zhichao  Zhang edited comment on SPARK-25279 at 9/3/18 4:12 PM:


[~dkbiswal], I followed your steps and the code ran successfully, but if I pasted 
all of the code at once and then ran it, the error occurred:
{code:java}
scala> :paste
 ...copy all code here
 Ctrl + D{code}

What is the difference between these two modes?


was (Author: zzcclp):
[~dkbiswal], I followed your steps to run code successfully, but if I pasted 
all code and then ran the code, the error occured:
scala>:paste
...copy all code here
Crtl + D
what is the difference between these two mode?

> Throw exception: zzcclp   java.io.NotSerializableException: 
> org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
> ---
>
> Key: SPARK-25279
> URL: https://issues.apache.org/jira/browse/SPARK-25279
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.1
>Reporter: Zhichao  Zhang
>Priority: Minor
>
> Hi dev: 
>   I am using Spark-Shell to run the example which is in section 
> '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'],
>  
> and there is an error: 
> {code:java}
> Caused by: java.io.NotSerializableException: 
> org.apache.spark.sql.TypedColumn 
> Serialization stack: 
>         - object not serializable (class: org.apache.spark.sql.TypedColumn, 
> value: 
> myaverage() AS `average_salary`) 
>         - field (class: $iw, name: averageSalary, type: class 
> org.apache.spark.sql.TypedColumn) 
>         - object (class $iw, $iw@4b2f8ae9) 
>         - field (class: MyAverage$, name: $outer, type: class $iw) 
>         - object (class MyAverage$, MyAverage$@2be41d90) 
>         - field (class: 
> org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, 
> name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) 
>         - object (class 
> org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, 
> MyAverage(Employee)) 
>         - field (class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, 
> name: aggregateFunction, type: class 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) 
>         - object (class 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, 
> partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), 
> Some(class Employee), Some(StructType(StructField(name,StringType,true), 
> StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, 
> Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, 
> Average, true])).count AS count#26L, newInstance(class Average), input[0, 
> double, false] AS value#24, DoubleType, false, 0, 0)) 
>         - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy) 
>         - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@5e92c46f) 
>         - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy) 
>         - object (class scala.collection.immutable.$colon$colon, 
> List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class 
> Employee)), Some(class Employee), 
> Some(StructType(StructField(name,StringType,true), 
> StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, 
> Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, 
> Average, true])).count AS count#26L, newInstance(class Average), input[0, 
> double, false] AS value#24, DoubleType, false, 0, 0))) 
>         - field (class: 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: 
> aggregateExpressions, type: interface scala.collection.Seq) 
>         - object (class 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, 
> ObjectHashAggregate(keys=[], 
> functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class 
> Employee)), Some(class Employee), 
> Some(StructType(StructField(name,StringType,true), 
> StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, 
> Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, 
> Average, true])).count AS count#26L, newInstance(class Average), input[0, 
> double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) 
> +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: 
> InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json],
>  
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> 

[jira] [Commented] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc

2018-09-03 Thread Zhichao Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602308#comment-16602308
 ] 

Zhichao  Zhang commented on SPARK-25279:


[~dkbiswal], I followed your steps and the code ran successfully, but if I pasted 
all of the code at once and then ran it, the error occurred:
scala> :paste
...copy all code here
Ctrl + D
What is the difference between these two modes?

> Throw exception: zzcclp   java.io.NotSerializableException: 
> org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
> ---
>
> Key: SPARK-25279
> URL: https://issues.apache.org/jira/browse/SPARK-25279
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.1
>Reporter: Zhichao  Zhang
>Priority: Minor
>
> Hi dev: 
>   I am using Spark-Shell to run the example which is in section 
> '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'],
>  
> and there is an error: 
> {code:java}
> Caused by: java.io.NotSerializableException: 
> org.apache.spark.sql.TypedColumn 
> Serialization stack: 
>         - object not serializable (class: org.apache.spark.sql.TypedColumn, 
> value: 
> myaverage() AS `average_salary`) 
>         - field (class: $iw, name: averageSalary, type: class 
> org.apache.spark.sql.TypedColumn) 
>         - object (class $iw, $iw@4b2f8ae9) 
>         - field (class: MyAverage$, name: $outer, type: class $iw) 
>         - object (class MyAverage$, MyAverage$@2be41d90) 
>         - field (class: 
> org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, 
> name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) 
>         - object (class 
> org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, 
> MyAverage(Employee)) 
>         - field (class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, 
> name: aggregateFunction, type: class 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) 
>         - object (class 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, 
> partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), 
> Some(class Employee), Some(StructType(StructField(name,StringType,true), 
> StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, 
> Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, 
> Average, true])).count AS count#26L, newInstance(class Average), input[0, 
> double, false] AS value#24, DoubleType, false, 0, 0)) 
>         - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy) 
>         - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@5e92c46f) 
>         - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy) 
>         - object (class scala.collection.immutable.$colon$colon, 
> List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class 
> Employee)), Some(class Employee), 
> Some(StructType(StructField(name,StringType,true), 
> StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, 
> Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, 
> Average, true])).count AS count#26L, newInstance(class Average), input[0, 
> double, false] AS value#24, DoubleType, false, 0, 0))) 
>         - field (class: 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: 
> aggregateExpressions, type: interface scala.collection.Seq) 
>         - object (class 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, 
> ObjectHashAggregate(keys=[], 
> functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class 
> Employee)), Some(class Employee), 
> Some(StructType(StructField(name,StringType,true), 
> StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, 
> Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, 
> Average, true])).count AS count#26L, newInstance(class Average), input[0, 
> double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) 
> +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: 
> InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json],
>  
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct 
> ) 
>         - field (class: 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1,
>  
> name: $outer, type: class 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) 
>         - object (class 
> 

[jira] [Created] (SPARK-25315) setting "auto.offset.reset" to "earliest" has no effect in Structured Streaming with Spark 2.3.1 and Kafka 1.0

2018-09-03 Thread Zhenhao Li (JIRA)
Zhenhao Li created SPARK-25315:
--

 Summary: setting "auto.offset.reset" to "earliest" has no effect 
in Structured Streaming with Spark 2.3.1 and Kafka 1.0
 Key: SPARK-25315
 URL: https://issues.apache.org/jira/browse/SPARK-25315
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.1
 Environment: Standalone; running in IDEA
Reporter: Zhenhao Li


The following code won't read from the beginning of the topic


{code:java}
val kafkaOptions = Map[String, String](
 "kafka.bootstrap.servers" -> KAFKA_BOOTSTRAP_SERVERS,
 "subscribe" -> TOPIC,
 "group.id" -> GROUP_ID,
 "auto.offset.reset" -> "earliest"
)

val myStream = sparkSession
.readStream
.format("kafka")
.options(kafkaOptions)
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

  myStream
.writeStream
.format("console")
.start()
.awaitTermination()
{code}
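For comparison, the Structured Streaming Kafka source does not honour the consumer's auto.offset.reset; the documented knob for where a brand-new query (one without an existing checkpoint) starts reading is the source option startingOffsets. A minimal sketch, reusing the identifiers from the snippet above:

{code:java}
val myStream = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
  .option("subscribe", TOPIC)
  // startingOffsets takes the place of auto.offset.reset for this source
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
{code}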




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24767) Propagate MDC to spark-submit thread in InProcessAppHandle

2018-09-03 Thread Yifei Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifei Huang resolved SPARK-24767.
-
Resolution: Won't Fix

> Propagate MDC to spark-submit thread in InProcessAppHandle
> --
>
> Key: SPARK-24767
> URL: https://issues.apache.org/jira/browse/SPARK-24767
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.3.1
>Reporter: Yifei Huang
>Priority: Major
>
> Currently, for the in-process launcher, we don't propagate MDCs to the thread 
> that runs spark-submit. Ideally, we'd do this so it's easier to map the child 
> thread to the parent thread.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25298) spark-tools build failure for Scala 2.12

2018-09-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25298:
-

Assignee: Darcy Shen

> spark-tools build failure for Scala 2.12
> 
>
> Key: SPARK-25298
> URL: https://issues.apache.org/jira/browse/SPARK-25298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Darcy Shen
>Assignee: Darcy Shen
>Priority: Major
> Fix For: 2.4.0
>
>
> $ sbt--
> > ++ 2.12.6
> > compile
>  
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:22:
>  object runtime is not a member of package reflect
> [error] import scala.reflect.runtime.\{universe => unv}
> [error]  ^
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:23:
>  object runtime is not a member of package reflect
> [error] import scala.reflect.runtime.universe.runtimeMirror
> [error]  ^
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:41:
>  not found: value runtimeMirror
> [error]   private val mirror = runtimeMirror(classLoader)
> [error]    ^
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:43:
>  not found: value unv
> [error]   private def isPackagePrivate(sym: unv.Symbol) =



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25298) spark-tools build failure for Scala 2.12

2018-09-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25298.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22310
[https://github.com/apache/spark/pull/22310]

> spark-tools build failure for Scala 2.12
> 
>
> Key: SPARK-25298
> URL: https://issues.apache.org/jira/browse/SPARK-25298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Darcy Shen
>Assignee: Darcy Shen
>Priority: Major
> Fix For: 2.4.0
>
>
> $ sbt--
> > ++ 2.12.6
> > compile
>  
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:22:
>  object runtime is not a member of package reflect
> [error] import scala.reflect.runtime.\{universe => unv}
> [error]  ^
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:23:
>  object runtime is not a member of package reflect
> [error] import scala.reflect.runtime.universe.runtimeMirror
> [error]  ^
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:41:
>  not found: value runtimeMirror
> [error]   private val mirror = runtimeMirror(classLoader)
> [error]    ^
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:43:
>  not found: value unv
> [error]   private def isPackagePrivate(sym: unv.Symbol) =



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25298) spark-tools build failure for Scala 2.12

2018-09-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25298:
--
Priority: Minor  (was: Major)

> spark-tools build failure for Scala 2.12
> 
>
> Key: SPARK-25298
> URL: https://issues.apache.org/jira/browse/SPARK-25298
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Darcy Shen
>Assignee: Darcy Shen
>Priority: Minor
> Fix For: 2.4.0
>
>
> $ sbt--
> > ++ 2.12.6
> > compile
>  
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:22:
>  object runtime is not a member of package reflect
> [error] import scala.reflect.runtime.\{universe => unv}
> [error]  ^
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:23:
>  object runtime is not a member of package reflect
> [error] import scala.reflect.runtime.universe.runtimeMirror
> [error]  ^
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:41:
>  not found: value runtimeMirror
> [error]   private val mirror = runtimeMirror(classLoader)
> [error]    ^
> [error] 
> /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:43:
>  not found: value unv
> [error]   private def isPackagePrivate(sym: unv.Symbol) =



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19728) PythonUDF with multiple parents shouldn't be pushed down when used as a predicate

2018-09-03 Thread Sergey Bahchissaraitsev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602040#comment-16602040
 ] 

Sergey Bahchissaraitsev commented on SPARK-19728:
-

This is still happening in 2.3.1 when using the UDF inside the join "on" 
condition, e.g.:
{quote}df1.join(df2, pred(df1.col_a, df2.col_b)).show()
{quote}

I have opened SPARK-25314 for this.

>  PythonUDF with multiple parents shouldn't be pushed down when used as a 
> predicate
> --
>
> Key: SPARK-19728
> URL: https://issues.apache.org/jira/browse/SPARK-19728
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
> Fix For: 2.2.0
>
>
> Prior to Spark 2.0 it was possible to use Python UDF output as a predicate:
> {code}
> from pyspark.sql.functions import udf
> from pyspark.sql.types import BooleanType
> df1 = sc.parallelize([(1, ), (2, )]).toDF(["col_a"])
> df2 = sc.parallelize([(2, ), (3, )]).toDF(["col_b"])
> pred = udf(lambda x, y: x == y, BooleanType())
> df1.join(df2).where(pred("col_a", "col_b")).show()
> {code}
> In Spark 2.0 this is no longer possible:
> {code}
> spark.conf.set("spark.sql.crossJoin.enabled", True)
> df1.join(df2).where(pred("col_a", "col_b")).show()
> ## ...
> ## Py4JJavaError: An error occurred while calling o731.showString.
> : java.lang.RuntimeException: Invalid PythonUDF (col_a#132L, 
> col_b#135L), requires attributes from more than one child.
> ## ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25314) Invalid PythonUDF - requires attributes from more than one child - in "on" join condition

2018-09-03 Thread Sergey Bahchissaraitsev (JIRA)
Sergey Bahchissaraitsev created SPARK-25314:
---

 Summary: Invalid PythonUDF - requires attributes from more than 
one child - in "on" join condition
 Key: SPARK-25314
 URL: https://issues.apache.org/jira/browse/SPARK-25314
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.3.1
Reporter: Sergey Bahchissaraitsev


This is another variation of SPARK-19728, which was tagged as resolved, so I 
base the example on it:



from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

df1 = sc.parallelize([(1, ), (2, )]).toDF(["col_a"])
df2 = sc.parallelize([(2, ), (3, )]).toDF(["col_b"])
pred = udf(lambda x, y: x == y, BooleanType())

df1.join(df2, pred(df1.col_a, df2.col_b)).show()


This throws:
{quote}java.lang.RuntimeException: Invalid PythonUDF (col_a#132L, 
col_b#135L), requires attributes from more than one child.
at scala.sys.package$.error(package.scala:27)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181)
 at scala.collection.immutable.Stream.foreach(Stream.scala:594)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
 at 
org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at 

[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2018-09-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601921#comment-16601921
 ] 

Apache Spark commented on SPARK-25262:
--

User 'rvesse' has created a pull request for this issue:
https://github.com/apache/spark/pull/22323

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while 
> pod templates will provide more in-depth customisation for Spark on 
> Kubernetes, there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it
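To make the tmpfs variant concrete, a sketch of the kind of volume the proposed config flag would need to emit, written with the fabric8 builders that LocalDirsFeatureStep already uses; the volume name is a placeholder and the surrounding step and config wiring are omitted:

{code:java}
import io.fabric8.kubernetes.api.model.{Volume, VolumeBuilder}

// An emptyDir backed by tmpfs: setting medium to "Memory" per the
// Kubernetes documentation keeps the local dir off the node's disk.
val tmpfsLocalDir: Volume = new VolumeBuilder()
  .withName("spark-local-dir-1")
  .withNewEmptyDir()
    .withMedium("Memory")
  .endEmptyDir()
  .build()
{code}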



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes

2018-09-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601919#comment-16601919
 ] 

Apache Spark commented on SPARK-25262:
--

User 'rvesse' has created a pull request for this issue:
https://github.com/apache/spark/pull/22323

> Make Spark local dir volumes configurable with Spark on Kubernetes
> --
>
> Key: SPARK-25262
> URL: https://issues.apache.org/jira/browse/SPARK-25262
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Rob Vesse
>Priority: Major
>
> As discussed during review of the design document for SPARK-24434, while 
> pod templates will provide more in-depth customisation for Spark on 
> Kubernetes, there are some things that cannot be modified because Spark code 
> generates pod specs in very specific ways.
> The particular issue identified relates to the handling of {{spark.local.dirs}}, 
> which is done by {{LocalDirsFeatureStep.scala}}.  For each directory 
> specified, or a single default if no explicit specification, it creates a 
> Kubernetes {{emptyDir}} volume.  As noted in the Kubernetes documentation 
> this will be backed by the node storage 
> (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).  In some 
> compute environments this may be extremely undesirable.  For example with 
> diskless compute resources the node storage will likely be a non-performant 
> remote mounted disk, often with limited capacity.  For such environments it 
> would likely be better to set {{medium: Memory}} on the volume per the K8S 
> documentation to use a {{tmpfs}} volume instead.
> Another closely related issue is that users might want to use a different 
> volume type to back the local directories and there is no possibility to do 
> that.
> Pod templates will not really solve either of these issues because Spark is 
> always going to attempt to generate a new volume for each local directory and 
> always going to set these as {{emptyDir}}.
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}:
> * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} 
> volumes
> * Modify the logic to check if there is a volume already defined with the 
> name and if so skip generating a volume definition for it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations

2018-09-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601857#comment-16601857
 ] 

Apache Spark commented on SPARK-25312:
--

User 'npoberezkin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22322

> Add description for the conf spark.network.crypto.keyFactoryIterations
> --
>
> Key: SPARK-25312
> URL: https://issues.apache.org/jira/browse/SPARK-25312
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Core
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> https://github.com/apache/spark/pull/22195 fixed the typo of an undocumented 
> conf `spark.network.crypto.keyFactoryIterations`. We should document it like 
> what we did for the other confs spark.network.crypto.xyz in 
> https://spark.apache.org/docs/latest/configuration.html
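For reference, the setting is an ordinary Spark conf; a minimal sketch of enabling RPC encryption and overriding the iteration count when building a SparkConf (the value shown is an arbitrary example, not a recommended default):

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.crypto.enabled", "true")
  // the conf this ticket asks to document; 10000 is only an example value
  .set("spark.network.crypto.keyFactoryIterations", "10000")
{code}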



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations

2018-09-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601856#comment-16601856
 ] 

Apache Spark commented on SPARK-25312:
--

User 'npoberezkin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22322

> Add description for the conf spark.network.crypto.keyFactoryIterations
> --
>
> Key: SPARK-25312
> URL: https://issues.apache.org/jira/browse/SPARK-25312
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Core
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> https://github.com/apache/spark/pull/22195 fixed the typo of an undocumented 
> conf `spark.network.crypto.keyFactoryIterations`. We should document it like 
> what we did for the other confs spark.network.crypto.xyz in 
> https://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25312:


Assignee: (was: Apache Spark)

> Add description for the conf spark.network.crypto.keyFactoryIterations
> --
>
> Key: SPARK-25312
> URL: https://issues.apache.org/jira/browse/SPARK-25312
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Core
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> https://github.com/apache/spark/pull/22195 fixed the typo of an undocumented 
> conf `spark.network.crypto.keyFactoryIterations`. We should document it like 
> what we did for the other confs spark.network.crypto.xyz in 
> https://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25312:


Assignee: Apache Spark

> Add description for the conf spark.network.crypto.keyFactoryIterations
> --
>
> Key: SPARK-25312
> URL: https://issues.apache.org/jira/browse/SPARK-25312
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Core
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
>
> https://github.com/apache/spark/pull/22195 fixed the typo of an undocumented 
> conf `spark.network.crypto.keyFactoryIterations`. We should document it like 
> what we did for the other confs spark.network.crypto.xyz in 
> https://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25313:


Assignee: (was: Apache Spark)

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in schema will be id instead of ID, thus the last 
> query shows nothing from tbl2.
> By enabling the debug message we can see that the output naming is changed 
> from ID to id, and then the outputColumns in 
> InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from 
> `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by 
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25313:


Assignee: Apache Spark

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in schema will be id instead of ID, thus the last 
> query shows nothing from tbl2.
> By enabling the debug message we can see that the output naming is changed 
> from ID to id, and then the outputColumns in 
> InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from 
> `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by 
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25313:


Assignee: Apache Spark

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the schema will be id instead of ID, so the last
> query shows nothing from tbl2.
> By enabling debug logging we can see that the output name is changed from
> ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand
> is changed by RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from
> `Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25313:


Assignee: (was: Apache Spark)

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the schema will be id instead of ID, so the last
> query shows nothing from tbl2.
> By enabling debug logging we can see that the output name is changed from
> ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand
> is changed by RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from
> `Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25313:


Assignee: (was: Apache Spark)

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the schema will be id instead of ID, so the last
> query shows nothing from tbl2.
> By enabling debug logging we can see that the output name is changed from
> ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand
> is changed by RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from
> `Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25313:


Assignee: Apache Spark

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the schema will be id instead of ID, so the last
> query shows nothing from tbl2.
> By enabling debug logging we can see that the output name is changed from
> ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand
> is changed by RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from
> `Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25313:


Assignee: (was: Apache Spark)

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the schema will be id instead of ID, so the last
> query shows nothing from tbl2.
> By enabling debug logging we can see that the output name is changed from
> ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand
> is changed by RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from
> `Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations

2018-09-03 Thread Nikita Poberezkin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601834#comment-16601834
 ] 

Nikita Poberezkin commented on SPARK-25312:
---

I will add the description.

> Add description for the conf spark.network.crypto.keyFactoryIterations
> --
>
> Key: SPARK-25312
> URL: https://issues.apache.org/jira/browse/SPARK-25312
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Core
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> https://github.com/apache/spark/pull/22195 fixed a typo in the undocumented 
> conf `spark.network.crypto.keyFactoryIterations`. We should document it the 
> same way as the other spark.network.crypto.* confs in 
> https://spark.apache.org/docs/latest/configuration.html
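
For reference, a minimal sketch of setting this conf programmatically; the
session setup and the iteration count below are only illustrative values, not
recommendations:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Enable AES-based RPC encryption and tune the key-derivation cost.
// 1024 is only an example value; choose it according to your security needs.
val conf = new SparkConf()
  .set("spark.network.crypto.enabled", "true")
  .set("spark.network.crypto.keyFactoryIterations", "1024")

val spark = SparkSession.builder()
  .appName("crypto-conf-demo")
  .config(conf)
  .getOrCreate()
{code}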



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601833#comment-16601833
 ] 

Apache Spark commented on SPARK-25313:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22320

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the schema will be id instead of ID, so the last
> query shows nothing from tbl2.
> By enabling debug logging we can see that the output name is changed from
> ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand
> is changed by RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from
> `Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25313:


Assignee: Apache Spark

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the schema will be id instead of ID, so the last
> query shows nothing from tbl2.
> By enabling debug logging we can see that the output name is changed from
> ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand
> is changed by RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from
> `Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25313:
--

 Summary: Fix regression in FileFormatWriter output schema
 Key: SPARK-25313
 URL: https://issues.apache.org/jira/browse/SPARK-25313
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang


Let's look at the following example:

val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
$location")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()

The output column name in the schema will be id instead of ID, so the last query 
shows nothing from tbl2.
By enabling debug logging we can see that the output name is changed from ID to 
id, and then the outputColumns in InsertIntoHadoopFsRelationCommand is changed 
by RemoveRedundantAliases.

To guarantee correctness, we should change the output columns from 
`Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by the 
optimizer.
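
For illustration, a minimal hand-rolled sketch of that direction, using stand-in
types rather than Catalyst's real classes (this is not the actual Spark patch):
capture the user-visible column names as plain strings at analysis time, so an
optimizer rule that rewrites Attribute references cannot rename the written
schema.

{code:scala}
// Stand-in for Catalyst's Attribute; only the name matters for this sketch.
case class Attribute(name: String)

// Hypothetical write command that remembers the target column names as strings.
case class WriteCommand(query: Seq[Attribute], outputColumnNames: Seq[String]) {
  // The written schema uses the stable names, not whatever the (possibly
  // optimizer-rewritten) query attributes happen to be called now.
  def outputSchema: Seq[String] = outputColumnNames
}

// Analysis captured the table's column name "ID"...
val cmd = WriteCommand(query = Seq(Attribute("ID")), outputColumnNames = Seq("ID"))
// ...and even if an optimizer rule later rewrites the attribute to "id",
// the schema that gets written is unchanged.
val optimized = cmd.copy(query = Seq(Attribute("id")))
println(optimized.outputSchema)   // List(ID)
{code}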



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-03 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-25313:
---
Description: 
In the following example:

val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
$location")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()

The output column name in the schema will be id instead of ID, so the last query 
shows nothing from tbl2.
By enabling debug logging we can see that the output name is changed from ID to 
id, and then the outputColumns in InsertIntoHadoopFsRelationCommand is changed 
by RemoveRedundantAliases.

To guarantee correctness, we should change the output columns from 
`Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by the 
optimizer.

  was:
et's see the follow example:

val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
$location")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()

The output column name in schema will be id instead of ID, thus the last query 
shows nothing from tbl2.
By enabling the debug message we can see that the output naming is changed from 
ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand is 
changed in RemoveRedundantAliases.

To guarantee correctness, we should change the output columns from 
`Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by 
optimizer.


> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location 
> $location")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the schema will be id instead of ID, so the last
> query shows nothing from tbl2.
> By enabling debug logging we can see that the output name is changed from
> ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand
> is changed by RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from
> `Seq[Attribute]` to `Seq[String]` so that their names cannot be replaced by
> the optimizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted

2018-09-03 Thread Nikunj Bansal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599566#comment-16599566
 ] 

Nikunj Bansal edited comment on SPARK-25303 at 9/3/18 6:41 AM:
---

I have a potential fix for this and SPARK-25302 available.


was (Author: nikunj):
I have a potential fix for this and SPARK-25202 available.

> A DStream that is checkpointed should allow its parent(s) to be removed and 
> not persisted
> -
>
> Key: SPARK-25303
> URL: https://issues.apache.org/jira/browse/SPARK-25303
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1
>Reporter: Nikunj Bansal
>Priority: Major
>  Labels: Streaming, streaming
>
> A checkpointed DStream is supposed to cut the lineage to its parent(s) such 
> that any persisted RDDs for the parent(s) are removed. However, combined with 
> the issue in SPARK-25302, this results in the input stream RDDs being 
> persisted much longer than they are actually needed.
> See also related bug SPARK-25302.
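
For context, a minimal sketch of the kind of streaming job where this matters,
using standard Spark Streaming APIs; the comments describe the expected
behaviour discussed in this ticket, not a fix:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setMaster("local[2]").setAppName("checkpoint-lineage-demo"),
  Seconds(1))
ssc.checkpoint("/tmp/checkpoints")

val input  = ssc.socketTextStream("localhost", 9999)   // parent input stream
val counts = input.map(word => (word, 1)).reduceByKeyAndWindow(_ + _, Seconds(30))

// Checkpointing `counts` should cut its lineage back to `input`, so the input
// stream's persisted RDDs can be released promptly; per this ticket (together
// with SPARK-25302) they currently stay persisted longer than needed.
counts.checkpoint(Seconds(10))
counts.print()

ssc.start()
ssc.awaitTermination()
{code}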



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25311) `SPARK_LOCAL_HOSTNAME` does not support IPv6 in host checking

2018-09-03 Thread Xiaochen Ouyang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochen Ouyang updated SPARK-25311:

Description: 
An IPv4 address passes the following check:
{code:java}
  def checkHost(host: String, message: String = "") {
    assert(host.indexOf(':') == -1, message)
  }
{code}
But an IPv6 address, which contains ':', fails it.

  was:
IPV4 can pass the flowing check
{code:java}
  def checkHost(host: String, message: String = "") {
assert(host.indexOf(':') == -1, message)
  }
{code}
But IPV6 check failed.



> `SPARK_LOCAL_HOSTNAME` does not support IPv6 in host checking
> ---
>
> Key: SPARK-25311
> URL: https://issues.apache.org/jira/browse/SPARK-25311
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2
>Reporter: Xiaochen Ouyang
>Priority: Major
>
> An IPv4 address passes the following check:
> {code:java}
>   def checkHost(host: String, message: String = "") {
>     assert(host.indexOf(':') == -1, message)
>   }
> {code}
> But an IPv6 address, which contains ':', fails it.
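
One possible direction, sketched below, is to accept bracketed IPv6 literals
while still rejecting host:port strings; this is only an illustration, not the
change made in Spark:

{code:scala}
// Sketch only: accept a bracketed IPv6 literal such as "[::1]" while still
// rejecting strings that look like host:port.
def checkHost(host: String, message: String = ""): Unit = {
  val isBracketedIpv6 = host.startsWith("[") && host.endsWith("]")
  assert(isBracketedIpv6 || host.indexOf(':') == -1, message)
}

checkHost("example.com")          // passes
checkHost("[::1]")                // passes with this sketch
// checkHost("example.com:7077")  // would fail the assertion, as intended
{code}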



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org