[jira] [Commented] (SPARK-25328) Add a test case for having two columns as the grouping key in group aggregate pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602630#comment-16602630 ]

Xiao Li commented on SPARK-25328:
---------------------------------

cc [~icexelloss] [~bryanc] [~hyukjin.kwon]

> Add a test case for having two columns as the grouping key in group aggregate pandas UDF
> ----------------------------------------------------------------------------------------
>
>             Key: SPARK-25328
>             URL: https://issues.apache.org/jira/browse/SPARK-25328
>         Project: Spark
>      Issue Type: Sub-task
>      Components: PySpark
>    Affects Versions: 2.4.0
>        Reporter: Xiao Li
>        Priority: Major
>
> https://github.com/apache/spark/pull/20295 added an alternative interface for
> group aggregate pandas UDFs. It does not have any test case that has more
> than one column as the grouping key.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25328) Add a test case for having two columns as the grouping key in group aggregate pandas UDF
Xiao Li created SPARK-25328:
----------------------------

             Summary: Add a test case for having two columns as the grouping key in group aggregate pandas UDF
                 Key: SPARK-25328
                 URL: https://issues.apache.org/jira/browse/SPARK-25328
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 2.4.0
            Reporter: Xiao Li

https://github.com/apache/spark/pull/20295 added an alternative interface for
group aggregate pandas UDFs. It does not have any test case that has more
than one column as the grouping key.
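The gap this issue describes — no test where the grouping key spans two columns — can be sketched at the pandas level. This is a hedged illustration, not the Spark test itself: it assumes only pandas (no running Spark session), and the column names (`id`, `grp`, `v`) are hypothetical, not taken from the JIRA.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "id":  [1, 1, 2, 2],
    "grp": [0, 0, 0, 1],
    "v":   [1.0, 2.0, 3.0, 4.0],
})

# Mean of v per (id, grp) pair -- the reduction a group aggregate
# pandas UDF performs when the grouping key has two columns: each
# distinct (id, grp) pair forms one group and yields one scalar.
result = df.groupby(["id", "grp"], as_index=False)["v"].mean()
print(result)  # three groups: (1,0) -> 1.5, (2,0) -> 3.0, (2,1) -> 4.0
```

In Spark the analogous call would be `df.groupBy("id", "grp").agg(mean_udf(df.v))` with a GROUPED_AGG `pandas_udf`; the requested test case would assert that result against a built-in aggregate over the same two keys.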
[jira] [Resolved] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-25310.
-----------------------------------
       Resolution: Fixed
         Assignee: Kazuaki Ishizaki
    Fix Version/s: 2.4.0

Issue resolved by pull request 22317
https://github.com/apache/spark/pull/22317

> ArraysOverlap may throw a CompileException
> ------------------------------------------
>
>             Key: SPARK-25310
>             URL: https://issues.apache.org/jira/browse/SPARK-25310
>         Project: Spark
>      Issue Type: Bug
>      Components: SQL
>    Affects Versions: 2.4.0
>        Reporter: Kazuaki Ishizaki
>        Assignee: Kazuaki Ishizaki
>        Priority: Major
>         Fix For: 2.4.0
>
> Invoking the {{ArraysOverlap}} function with a non-nullable array type throws the
> following error in the code generation phase.
> {code:java}
> Code generation of arrays_overlap([1,2,3], [4,5,3]) failed:
> java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
> File 'generated.java', Line 56, Column 11: failed to compile:
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56,
> Column 11: Expression "isNull_0" is not an rvalue
>     at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>     at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>     at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>     at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>     at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>     at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>     at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>     at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
>     at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>     at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>     at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>     at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
>     at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
>     at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48)
>     at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32)
>     at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260)
> {code}
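The failure above is in Spark's generated Java (the undeclared `isNull_0` variable), not in the function's semantics. As a plain-Python sketch of what `arrays_overlap` computes at the value level — Spark's three-valued null handling is deliberately omitted here — using the literal call from the report:

```python
def arrays_overlap(a, b):
    # Value-level semantics sketch: true iff the two arrays share at
    # least one element. (Null-element handling omitted.)
    return bool(set(a) & set(b))

print(arrays_overlap([1, 2, 3], [4, 5, 3]))  # True -- 3 appears in both
```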
[jira] [Resolved] (SPARK-25307) ArraySort function may return an error in the code generation phase
[ https://issues.apache.org/jira/browse/SPARK-25307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-25307.
-----------------------------------
       Resolution: Fixed
         Assignee: Dilip Biswal
    Fix Version/s: 2.4.0

Issue resolved by pull request 22314
https://github.com/apache/spark/pull/22314

> ArraySort function may return an error in the code generation phase
> -------------------------------------------------------------------
>
>             Key: SPARK-25307
>             URL: https://issues.apache.org/jira/browse/SPARK-25307
>         Project: Spark
>      Issue Type: Bug
>      Components: SQL
>    Affects Versions: 2.3.1
>        Reporter: Dilip Biswal
>        Assignee: Dilip Biswal
>        Priority: Major
>         Fix For: 2.4.0
>
> Sorting an array of booleans (not nullable) fails with a compilation error in the
> code generation phase. Below is the compilation error:
> {code:java}
> java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
> File 'generated.java', Line 51, Column 23: failed to compile:
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 51,
> Column 23: No applicable constructor/method found for actual parameters
> "boolean[]"; candidates are:
> "public static void java.util.Arrays.sort(long[])",
> "public static void java.util.Arrays.sort(long[], int, int)",
> "public static void java.util.Arrays.sort(byte[], int, int)",
> "public static void java.util.Arrays.sort(float[])",
> "public static void java.util.Arrays.sort(float[], int, int)",
> "public static void java.util.Arrays.sort(char[])",
> "public static void java.util.Arrays.sort(char[], int, int)",
> "public static void java.util.Arrays.sort(short[], int, int)",
> "public static void java.util.Arrays.sort(short[])",
> "public static void java.util.Arrays.sort(byte[])",
> "public static void java.util.Arrays.sort(java.lang.Object[], int, int, java.util.Comparator)",
> "public static void java.util.Arrays.sort(java.lang.Object[], java.util.Comparator)",
> "public static void java.util.Arrays.sort(int[])",
> "public static void java.util.Arrays.sort(java.lang.Object[], int, int)",
> "public static void java.util.Arrays.sort(java.lang.Object[])",
> "public static void java.util.Arrays.sort(double[])",
> "public static void java.util.Arrays.sort(double[], int, int)",
> "public static void java.util.Arrays.sort(int[], int, int)"
>     at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>     at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>     at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>     at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>     at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>     at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>     at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>     at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
>     at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>     at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>     at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>     at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
> {code}
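The root cause visible in the candidate list above is that `java.util.Arrays.sort` simply has no `boolean[]` overload, so the generated call cannot compile; the operation itself is well-defined. A plain-Python sketch of the intended ascending order (this is a semantics illustration only, not Spark's codegen path):

```python
# Ascending sort of booleans orders False before True; the Spark bug is
# purely that the generated Java dispatched to a nonexistent overload.
print(sorted([True, False, True, False]))  # [False, False, True, True]
```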
[jira] [Resolved] (SPARK-25308) ArrayContains function may return an error in the code generation phase
[ https://issues.apache.org/jira/browse/SPARK-25308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-25308.
-----------------------------------
       Resolution: Fixed
         Assignee: Dilip Biswal
    Fix Version/s: 2.4.0

Issue resolved by pull request 22315
https://github.com/apache/spark/pull/22315

> ArrayContains function may return an error in the code generation phase
> -----------------------------------------------------------------------
>
>             Key: SPARK-25308
>             URL: https://issues.apache.org/jira/browse/SPARK-25308
>         Project: Spark
>      Issue Type: Bug
>      Components: SQL
>    Affects Versions: 2.3.1
>        Reporter: Dilip Biswal
>        Assignee: Dilip Biswal
>        Priority: Major
>         Fix For: 2.4.0
>
> Invoking the ArrayContains function with a non-nullable array type throws the
> following error in the code generation phase.
> {code}
> Code generation of array_contains([1,2,3], 1) failed:
> java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
> File 'generated.java', Line 40, Column 11: failed to compile:
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 40,
> Column 11: Expression "isNull_0" is not an rvalue
>     at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>     at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>     at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>     at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>     at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>     at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>     at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>     at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
>     at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>     at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>     at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>     at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
> {code}
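Like SPARK-25310 above, this is a codegen failure (the same `isNull_0` rvalue error), not a semantic one. A minimal value-level sketch of `array_contains`, using the literal call from the report; Spark's null handling is again omitted:

```python
def array_contains(arr, value):
    # Membership test at the value level. (Spark's real function also
    # defines results for null values/elements, omitted here.)
    return value in arr

print(array_contains([1, 2, 3], 1))  # True
```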
[jira] [Updated] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-25319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25319:
-------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)

> Spark MLlib, GraphX 2.4 QA umbrella
> -----------------------------------
>
>             Key: SPARK-25319
>             URL: https://issues.apache.org/jira/browse/SPARK-25319
>         Project: Spark
>      Issue Type: Umbrella
>      Components: Documentation, GraphX, ML, MLlib
>        Reporter: Weichen Xu
>        Assignee: Joseph K. Bradley
>        Priority: Critical
>         Fix For: 2.4.0
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and
> GraphX. *SparkR is separate.*
> The list below gives an overview of what is involved, and the corresponding
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & examples
> * Update Programming Guide
> * Update website
[jira] [Updated] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-25319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25319:
-------------------------------
    Fix Version/s: 2.4.0  (was: 2.3.0)

> Spark MLlib, GraphX 2.4 QA umbrella
> -----------------------------------
>
>             Key: SPARK-25319
>             URL: https://issues.apache.org/jira/browse/SPARK-25319
>         Project: Spark
>      Issue Type: Umbrella
>      Components: Documentation, GraphX, ML, MLlib
>        Reporter: Weichen Xu
>        Assignee: Joseph K. Bradley
>        Priority: Critical
>         Fix For: 2.4.0
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and
> GraphX. *SparkR is separate.*
> The list below gives an overview of what is involved, and the corresponding
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & examples
> * Update Programming Guide
> * Update website
[jira] [Updated] (SPARK-25325) ML, Graph 2.4 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25325:
-------------------------------
    Summary: ML, Graph 2.4 QA: Update user guide for new features & APIs  (was: ML, Graph 2.3 QA: Update user guide for new features & APIs)

> ML, Graph 2.4 QA: Update user guide for new features & APIs
> -----------------------------------------------------------
>
>             Key: SPARK-25325
>             URL: https://issues.apache.org/jira/browse/SPARK-25325
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, GraphX, ML, MLlib
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Nick Pentreath
>        Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members)
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related to")
>   and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to
>   spark.ml as needed, rather than just linking back to the corresponding
>   algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create &
> link JIRAs for parts of this work.
[jira] [Updated] (SPARK-25327) Update MLlib, GraphX websites for 2.4
[ https://issues.apache.org/jira/browse/SPARK-25327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25327:
-------------------------------
    Affects Version/s: 2.4.0
     Target Version/s: 2.4.0
              Summary: Update MLlib, GraphX websites for 2.4  (was: CLONE - Update MLlib, GraphX websites for 2.3)

> Update MLlib, GraphX websites for 2.4
> -------------------------------------
>
>             Key: SPARK-25327
>             URL: https://issues.apache.org/jira/browse/SPARK-25327
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, GraphX, ML, MLlib
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Nick Pentreath
>        Priority: Critical
>
> Update the sub-projects' websites to include new features in this release.
[jira] [Updated] (SPARK-25325) ML, Graph 2.3 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25325:
-------------------------------
    Affects Version/s: 2.4.0
     Target Version/s: 2.4.0
        Fix Version/s: (was: 2.3.0)
              Summary: ML, Graph 2.3 QA: Update user guide for new features & APIs  (was: CLONE - ML, Graph 2.3 QA: Update user guide for new features & APIs)

> ML, Graph 2.3 QA: Update user guide for new features & APIs
> -----------------------------------------------------------
>
>             Key: SPARK-25325
>             URL: https://issues.apache.org/jira/browse/SPARK-25325
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, GraphX, ML, MLlib
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Nick Pentreath
>        Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members)
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related to")
>   and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to
>   spark.ml as needed, rather than just linking back to the corresponding
>   algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create &
> link JIRAs for parts of this work.
[jira] [Updated] (SPARK-25326) ML, Graph 2.4 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-25326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25326:
-------------------------------
    Affects Version/s: 2.4.0  (was: 2.3.0)
     Target Version/s: 2.4.0  (was: 2.3.0)
        Fix Version/s: (was: 2.3.0)
              Summary: ML, Graph 2.4 QA: Programming guide update and migration guide  (was: CLONE - ML, Graph 2.3 QA: Programming guide update and migration guide)

> ML, Graph 2.4 QA: Programming guide update and migration guide
> --------------------------------------------------------------
>
>             Key: SPARK-25326
>             URL: https://issues.apache.org/jira/browse/SPARK-25326
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, GraphX, ML, MLlib
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Nick Pentreath
>        Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming
> Guides. Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs.
> * Check phrasing, especially in main sections (for outdated items such as
>   "In this release, ...")
[jira] [Updated] (SPARK-25324) ML 2.4 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-25324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25324:
-------------------------------
    Affects Version/s: 2.4.0
     Target Version/s: 2.4.0
        Fix Version/s: (was: 2.3.0)
              Summary: ML 2.4 QA: API: Java compatibility, docs  (was: CLONE - ML 2.3 QA: API: Java compatibility, docs)

> ML 2.4 QA: API: Java compatibility, docs
> ----------------------------------------
>
>             Key: SPARK-25324
>             URL: https://issues.apache.org/jira/browse/SPARK-25324
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, Java API, ML, MLlib
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Weichen Xu
>        Priority: Blocker
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to
>   look out for are:
> ** Check for generic "Object" types where Java cannot understand complex
>    Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a
>     problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully. These may not
>    be understood in Java, or they may be accessible only via the weirdly named
>    Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in
>    the other language's doc. (In {{spark.ml}}, we have largely tried to avoid
>    using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs. E.g., one past
>   issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases. If you find
>   a problem, check if it was introduced in this Spark release (in which case we
>   can fix it) or in a previous one (in which case we can create a java-friendly
>   version of the API).
> * If needed for complex issues, create small Java unit tests which execute
>   each method. (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools. In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so we can make this
>   task easier in the future!
[jira] [Updated] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25321:
-------------------------------
    Affects Version/s: 2.4.0  (was: 2.3.0)
     Target Version/s: 2.4.0  (was: 2.3.0)
        Fix Version/s: (was: 2.3.0)

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> -------------------------------------------
>
>             Key: SPARK-25321
>             URL: https://issues.apache.org/jira/browse/SPARK-25321
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, GraphX, ML, MLlib
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Yanbo Liang
>        Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
> * Protected/public classes or methods. If access can be more private, then
>   it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue
[jira] [Updated] (SPARK-25323) ML 2.4 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25323:
-------------------------------
    Target Version/s: 2.4.0
             Summary: ML 2.4 QA: API: Python API coverage  (was: CLONE - ML 2.3 QA: API: Python API coverage)

> ML 2.4 QA: API: Python API coverage
> -----------------------------------
>
>             Key: SPARK-25323
>             URL: https://issues.apache.org/jira/browse/SPARK-25323
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, ML, PySpark
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Bryan Cutler
>        Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub? We want the Python doc to
>   be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either
>   necessary (intentional) or accidental. These must be recorded and added in
>   the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi
>    component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for
>   functionality missing from Python, to be added in the next release cycle.
>   *Please use a _separate_ JIRA (linked below as "requires") for this list of
>   to-do items.*
[jira] [Updated] (SPARK-25323) CLONE - ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25323:
-------------------------------
    Affects Version/s: 2.4.0  (was: 2.3.0)
     Target Version/s: (was: 2.3.0)

> CLONE - ML 2.3 QA: API: Python API coverage
> -------------------------------------------
>
>             Key: SPARK-25323
>             URL: https://issues.apache.org/jira/browse/SPARK-25323
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, ML, PySpark
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Bryan Cutler
>        Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub? We want the Python doc to
>   be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either
>   necessary (intentional) or accidental. These must be recorded and added in
>   the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi
>    component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for
>   functionality missing from Python, to be added in the next release cycle.
>   *Please use a _separate_ JIRA (linked below as "requires") for this list of
>   to-do items.*
[jira] [Updated] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25320:
-------------------------------
    Affects Version/s: 2.4.0  (was: 2.3.0)
     Target Version/s: 2.4.0  (was: 2.3.0)

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --------------------------------------------------
>
>             Key: SPARK-25320
>             URL: https://issues.apache.org/jira/browse/SPARK-25320
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, GraphX, ML, MLlib
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Bago Amirbekian
>        Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous
> release QA, and ping the Assignee for advice.
[jira] [Updated] (SPARK-25322) ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit
[ https://issues.apache.org/jira/browse/SPARK-25322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25322:
-------------------------------
    Affects Version/s: 2.4.0
        Fix Version/s: (was: 2.3.0)
              Summary: ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit  (was: CLONE - ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit)

> ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit
> ----------------------------------------------------------------------
>
>             Key: SPARK-25322
>             URL: https://issues.apache.org/jira/browse/SPARK-25322
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, GraphX, ML, MLlib
>    Affects Versions: 2.4.0
>        Reporter: Weichen Xu
>        Assignee: Nick Pentreath
>        Priority: Blocker
>
> We should make a pass through the items marked as Experimental or
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are
> stable enough to be opened up as APIs.
[jira] [Updated] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-25321:
-------------------------------
    Description:
Audit new public Scala APIs added to MLlib & GraphX. Take note of:
* Protected/public classes or methods. If access can be more private, then it should be.
* Also look for non-sealed traits.
* Documentation: Missing? Bad links or formatting?
*Make sure to check the object doc!*
As you find issues, please create JIRAs and link them to this issue.
For *user guide issues* link the new JIRAs to the relevant user guide QA issue

  was:
Audit new public Scala APIs added to MLlib & GraphX. Take note of:
* Protected/public classes or methods. If access can be more private, then it should be.
* Also look for non-sealed traits.
* Documentation: Missing? Bad links or formatting?
*Make sure to check the object doc!*
As you find issues, please create JIRAs and link them to this issue.
For *user guide issues* link the new JIRAs to the relevant user guide QA issue (SPARK-23111 for {{2.3}})

    Summary: ML, Graph 2.4 QA: API: New Scala APIs, docs  (was: CLONE - ML, Graph 2.3 QA: API: New Scala APIs, docs)

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> -------------------------------------------
>
>             Key: SPARK-25321
>             URL: https://issues.apache.org/jira/browse/SPARK-25321
>         Project: Spark
>      Issue Type: Sub-task
>      Components: Documentation, GraphX, ML, MLlib
>    Affects Versions: 2.3.0
>        Reporter: Weichen Xu
>        Assignee: Yanbo Liang
>        Priority: Blocker
>         Fix For: 2.3.0
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
> * Protected/public classes or methods. If access can be more private, then
>   it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue
[jira] [Updated] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-25319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-25319: --- Description: This JIRA lists tasks for the next Spark release's QA period for MLlib and GraphX. *SparkR is separate. The list below gives an overview of what is involved, and the corresponding JIRA issues are linked below that. h2. API * Check binary API compatibility for Scala/Java * Audit new public APIs (from the generated html doc) ** Scala ** Java compatibility ** Python coverage * Check Experimental, DeveloperApi tags h2. Algorithms and performance * Performance tests h2. Documentation and example code * For new algorithms, create JIRAs for updating the user guide sections & examples * Update Programming Guide * Update website was: This JIRA lists tasks for the next Spark release's QA period for MLlib and GraphX. *SparkR is separate: SPARK-23114.* The list below gives an overview of what is involved, and the corresponding JIRA issues are linked below that. h2. API * Check binary API compatibility for Scala/Java * Audit new public APIs (from the generated html doc) ** Scala ** Java compatibility ** Python coverage * Check Experimental, DeveloperApi tags h2. Algorithms and performance * Performance tests h2. Documentation and example code * For new algorithms, create JIRAs for updating the user guide sections & examples * Update Programming Guide * Update website > Spark MLlib, GraphX 2.4 QA umbrella > --- > > Key: SPARK-25319 > URL: https://issues.apache.org/jira/browse/SPARK-25319 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Reporter: Weichen Xu >Assignee: Joseph K. Bradley >Priority: Critical > Fix For: 2.3.0 > > > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate. > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. 
API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website
[jira] [Created] (SPARK-25326) CLONE - ML, Graph 2.3 QA: Programming guide update and migration guide
Weichen Xu created SPARK-25326: -- Summary: CLONE - ML, Graph 2.3 QA: Programming guide update and migration guide Key: SPARK-25326 URL: https://issues.apache.org/jira/browse/SPARK-25326 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Affects Versions: 2.3.0 Reporter: Weichen Xu Assignee: Nick Pentreath Fix For: 2.3.0 Before the release, we need to update the MLlib and GraphX Programming Guides. Updates will include: * Add migration guide subsection. ** Use the results of the QA audit JIRAs. * Check phrasing, especially in main sections (for outdated items such as "In this release, ...")
[jira] [Created] (SPARK-25324) CLONE - ML 2.3 QA: API: Java compatibility, docs
Weichen Xu created SPARK-25324: -- Summary: CLONE - ML 2.3 QA: API: Java compatibility, docs Key: SPARK-25324 URL: https://issues.apache.org/jira/browse/SPARK-25324 Project: Spark Issue Type: Sub-task Components: Documentation, Java API, ML, MLlib Reporter: Weichen Xu Assignee: Weichen Xu Fix For: 2.3.0 Check Java compatibility for this release: * APIs in {{spark.ml}} * New APIs in {{spark.mllib}} (There should be few, if any.) Checking compatibility means: * Checking for differences in how Scala and Java handle types. Some items to look out for are: ** Check for generic "Object" types where Java cannot understand complex Scala types. *** *Note*: The Java docs do not always match the bytecode. If you find a problem, please verify it using {{javap}}. ** Check Scala objects (especially with nesting!) carefully. These may not be understood in Java, or they may be accessible only via the weirdly named Java types (with "$" or "#") which are generated by the Scala compiler. ** Check for uses of Scala and Java enumerations, which can show up oddly in the other language's doc. (In {{spark.ml}}, we have largely tried to avoid using enumerations, and have instead favored plain strings.) * Check for differences in generated Scala vs Java docs. E.g., one past issue was that Javadocs did not respect Scala's package private modifier. If you find issues, please comment here, or for larger items, create separate JIRAs and link here as "requires". * Remember that we should not break APIs from previous releases. If you find a problem, check if it was introduced in this Spark release (in which case we can fix it) or in a previous one (in which case we can create a java-friendly version of the API). * If needed for complex issues, create small Java unit tests which execute each method. (Algorithmic correctness can be checked in Scala.) Recommendations for how to complete this task: * There are not great tools. 
In the past, this task has been done by: ** Generating API docs ** Building JAR and outputting the Java class signatures for MLlib ** Manually inspecting and searching the docs and class signatures for issues * If you do have ideas for better tooling, please say so, so that we can make this task easier in the future!
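One of the gotchas in the checklist above (Scala objects surfacing in Java under weirdly `$`-mangled names) can be seen directly from the bytecode-level class name, which is what `javap` and Java callers actually observe. A minimal sketch, with `Registry` as a purely hypothetical example object:

```scala
// Hypothetical example object: the Scala compiler emits a class whose
// JVM name ends in "$" for every `object`, and Java code reaches the
// singleton through a static MODULE$ field on that class. This is one
// reason the generated docs and the real bytecode can disagree, and why
// the checklist recommends verifying suspicious cases with javap.
object Registry {
  def lookup(key: String): Int = key.length
}

// The runtime class name carries the compiler's "$" mangling:
println(Registry.getClass.getName) // ends with "$"
```

Running `javap -p` on the emitted `Registry$` class file would show the same mangled name and the `MODULE$` field, which is the manual inspection step the checklist describes.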
[jira] [Created] (SPARK-25323) CLONE - ML 2.3 QA: API: Python API coverage
Weichen Xu created SPARK-25323: -- Summary: CLONE - ML 2.3 QA: API: Python API coverage Key: SPARK-25323 URL: https://issues.apache.org/jira/browse/SPARK-25323 Project: Spark Issue Type: Sub-task Components: Documentation, ML, PySpark Affects Versions: 2.3.0 Reporter: Weichen Xu Assignee: Bryan Cutler For new public APIs added to MLlib ({{spark.ml}} only), we need to check the generated HTML doc and compare the Scala & Python versions. * *GOAL*: Audit and create JIRAs to fix in the next release. * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. *Please use a _separate_ JIRA (linked below as "requires") for this list of to-do items.*
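The parity check described above is done by eye against the generated HTML docs, but its core question ("which names exist on the Scala side and not the Python side?") can be illustrated mechanically. A minimal sketch with hypothetical stand-in traits (not the real spark.ml classes):

```scala
// Hypothetical stand-ins: ScalaApi mirrors the Scala-side estimator
// surface, PyApi lists what the Python wrapper currently exposes.
// The real audit compares generated Scala vs Python HTML docs instead.
trait ScalaApi {
  def fit(): Unit
  def transform(): Unit
  def setSeed(seed: Long): Unit
}
trait PyApi {
  def fit(): Unit
  def transform(): Unit
}

// Method names present on the Scala side but missing from Python; each
// of these would become a to-do item in the separate "requires" JIRA.
val missing: Set[String] =
  classOf[ScalaApi].getDeclaredMethods.map(_.getName).toSet --
    classOf[PyApi].getDeclaredMethods.map(_.getName).toSet

println(missing) // the coverage gap to file follow-up JIRAs for
```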
[jira] [Updated] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-25320: --- Summary: ML, Graph 2.4 QA: API: Binary incompatible changes (was: CLONE - ML, Graph 2.3 QA: API: Binary incompatible changes) > ML, Graph 2.4 QA: API: Binary incompatible changes > -- > > Key: SPARK-25320 > URL: https://issues.apache.org/jira/browse/SPARK-25320 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Bago Amirbekian >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice.
[jira] [Created] (SPARK-25320) CLONE - ML, Graph 2.3 QA: API: Binary incompatible changes
Weichen Xu created SPARK-25320: -- Summary: CLONE - ML, Graph 2.3 QA: API: Binary incompatible changes Key: SPARK-25320 URL: https://issues.apache.org/jira/browse/SPARK-25320 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Affects Versions: 2.3.0 Reporter: Weichen Xu Assignee: Bago Amirbekian Generate a list of binary incompatible changes using MiMa and create new JIRAs for issues found. Filter out false positives as needed. If you want to take this task, look at the analogous task from the previous release QA, and ping the Assignee for advice.
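For reference, the MiMa pass described above is typically driven by the sbt-mima-plugin (Spark wraps the same check in its own dev tooling). A generic, hypothetical sbt configuration fragment — the artifact coordinates here are placeholders, not real Spark settings:

```scala
// build.sbt sketch (hypothetical coordinates, sbt-mima-plugin assumed):
// compare the current build against a previously released artifact and
// report binary-incompatible changes. Reported issues are then triaged
// into real regressions (file JIRAs) vs. false positives, which can be
// suppressed via mimaBinaryIssueFilters.
mimaPreviousArtifacts := Set("org.example" %% "example-lib" % "1.0.0")

// Then run:  sbt mimaReportBinaryIssues
```

This is a configuration sketch rather than runnable code; the exact plugin setup depends on the build in question.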
[jira] [Created] (SPARK-25322) CLONE - ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit
Weichen Xu created SPARK-25322: -- Summary: CLONE - ML, Graph 2.3 QA: API: Experimental, DeveloperApi, final, sealed audit Key: SPARK-25322 URL: https://issues.apache.org/jira/browse/SPARK-25322 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Reporter: Weichen Xu Assignee: Nick Pentreath Fix For: 2.3.0 We should make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. We should also check for items marked final or sealed to see if they are stable enough to be opened up as APIs.
[jira] [Created] (SPARK-25327) CLONE - Update MLlib, GraphX websites for 2.3
Weichen Xu created SPARK-25327: -- Summary: CLONE - Update MLlib, GraphX websites for 2.3 Key: SPARK-25327 URL: https://issues.apache.org/jira/browse/SPARK-25327 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Reporter: Weichen Xu Assignee: Nick Pentreath Update the sub-projects' websites to include new features in this release.
[jira] [Created] (SPARK-25321) CLONE - ML, Graph 2.3 QA: API: New Scala APIs, docs
Weichen Xu created SPARK-25321: -- Summary: CLONE - ML, Graph 2.3 QA: API: New Scala APIs, docs Key: SPARK-25321 URL: https://issues.apache.org/jira/browse/SPARK-25321 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Affects Versions: 2.3.0 Reporter: Weichen Xu Assignee: Yanbo Liang Fix For: 2.3.0 Audit new public Scala APIs added to MLlib & GraphX. Take note of: * Protected/public classes or methods. If access can be more private, then it should be. * Also look for non-sealed traits. * Documentation: Missing? Bad links or formatting? *Make sure to check the object doc!* As you find issues, please create JIRAs and link them to this issue. For *user guide issues* link the new JIRAs to the relevant user guide QA issue (SPARK-23111 for {{2.3}})
[jira] [Created] (SPARK-25325) CLONE - ML, Graph 2.3 QA: Update user guide for new features & APIs
Weichen Xu created SPARK-25325: -- Summary: CLONE - ML, Graph 2.3 QA: Update user guide for new features & APIs Key: SPARK-25325 URL: https://issues.apache.org/jira/browse/SPARK-25325 Project: Spark Issue Type: Sub-task Components: Documentation, GraphX, ML, MLlib Reporter: Weichen Xu Assignee: Nick Pentreath Fix For: 2.3.0 Check the user guide vs. a list of new APIs (classes, methods, data members) to see what items require updates to the user guide. For each feature missing user guide doc: * Create a JIRA for that feature, and assign it to the author of the feature * Link it to (a) the original JIRA which introduced that feature ("related to") and (b) to this JIRA ("requires"). For MLlib: * This task does not include major reorganizations for the programming guide. * We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide. If you would like to work on this task, please comment, and we can create & link JIRAs for parts of this work.
[jira] [Created] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella
Weichen Xu created SPARK-25319: -- Summary: Spark MLlib, GraphX 2.4 QA umbrella Key: SPARK-25319 URL: https://issues.apache.org/jira/browse/SPARK-25319 Project: Spark Issue Type: Umbrella Components: Documentation, GraphX, ML, MLlib Reporter: Weichen Xu Assignee: Joseph K. Bradley Fix For: 2.3.0 This JIRA lists tasks for the next Spark release's QA period for MLlib and GraphX. *SparkR is separate: SPARK-23114.* The list below gives an overview of what is involved, and the corresponding JIRA issues are linked below that. h2. API * Check binary API compatibility for Scala/Java * Audit new public APIs (from the generated html doc) ** Scala ** Java compatibility ** Python coverage * Check Experimental, DeveloperApi tags h2. Algorithms and performance * Performance tests h2. Documentation and example code * For new algorithms, create JIRAs for updating the user guide sections & examples * Update Programming Guide * Update website
[jira] [Assigned] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
[ https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25318: Assignee: (was: Apache Spark) > Add exception handling when wrapping the input stream during the the fetch or > stage retry in response to a corrupted block > -- > > Key: SPARK-25318 > URL: https://issues.apache.org/jira/browse/SPARK-25318 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Reza Safi >Priority: Minor > > SPARK-4105 provided a solution to block corruption issue by retrying the > fetch or the stage. In the solution there is a step that wraps the input > stream with compression and/or encryption. This step is prone to exceptions, > but in the current code there is no exception handling for this step and this > has caused confusion for the user. In fact we have customers who reported an > exception like the following when SPARK-4105 is available to them: > {noformat} > 2018-08-28 22:35:54,361 ERROR [Driver] > org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > tostage failure: Task 452 in stage 209.0 failed 4 times, most recent > failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > 3976 at > org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > 3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > 3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395) > 3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431) > 3980 at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > 3981 at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > 3982 at > org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > 3983 at > 
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159) > 3984 at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219) > 3985 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48) > 3986 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47) > 3987 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328) > 3988 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55) > 3989 at > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > 3990 a > {noformat} > In this customer's version of spark, line 328 of > ShuffleBlockFetcherIterator.scala is the line that the following occurs: > {noformat} > input = streamWrapper(blockId, in) > {noformat} > It would be nice to add exception handling around this line to avoid > confusions.
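The guarded wrapping proposed in the issue above might look roughly like the sketch below. Everything here is a hypothetical stand-in for the real ShuffleBlockFetcherIterator internals: `streamWrapper` simulates the compression/encryption wrapper blowing up on a corrupted block, and `markCorruptAndRetry` stands in for the existing re-fetch path.

```scala
import java.io.{ByteArrayInputStream, IOException, InputStream}

// Hypothetical stand-in: the real streamWrapper applies compression
// and/or encryption; here it fails immediately, simulating the reported
// wrap-time crash on a corrupted block.
def streamWrapper(blockId: String, in: InputStream): InputStream =
  throw new IOException("FAILED_TO_UNCOMPRESS(5)")

// Hypothetical stand-in for the existing corruption/retry path.
def markCorruptAndRetry(blockId: String): Unit =
  println(s"block $blockId treated as corrupt; scheduling a re-fetch")

val blockId = "shuffle_0_452_3"
val in: InputStream = new ByteArrayInputStream(Array[Byte](1, 2, 3))

// The proposed improvement: instead of letting a wrap-time IOException
// surface as a confusing job failure, catch it and route the block into
// the corruption/retry handling.
val input: Option[InputStream] =
  try Some(streamWrapper(blockId, in))
  catch {
    case _: IOException =>
      markCorruptAndRetry(blockId)
      None
  }

println(input.isDefined) // false here: the corrupt block went to retry
```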
[jira] [Commented] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
[ https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602571#comment-16602571 ] Apache Spark commented on SPARK-25318: -- User 'rezasafi' has created a pull request for this issue: https://github.com/apache/spark/pull/22325 > Add exception handling when wrapping the input stream during the the fetch or > stage retry in response to a corrupted block > -- > > Key: SPARK-25318 > URL: https://issues.apache.org/jira/browse/SPARK-25318 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Reza Safi >Priority: Minor > > SPARK-4105 provided a solution to block corruption issue by retrying the > fetch or the stage. In the solution there is a step that wraps the input > stream with compression and/or encryption. This step is prone to exceptions, > but in the current code there is no exception handling for this step and this > has caused confusion for the user. 
In fact we have customers who reported an > exception like the following when SPARK-4105 is available to them: > {noformat} > 2018-08-28 22:35:54,361 ERROR [Driver] > org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > tostage failure: Task 452 in stage 209.0 failed 4 times, most recent > failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > 3976 at > org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > 3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > 3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395) > 3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431) > 3980 at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > 3981 at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > 3982 at > org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > 3983 at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159) > 3984 at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219) > 3985 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48) > 3986 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47) > 3987 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328) > 3988 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55) > 3989 at > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > 3990 a > {noformat} > In this customer's version of spark, line 328 of > ShuffleBlockFetcherIterator.scala is the line that the following occurs: > {noformat} > input = streamWrapper(blockId, in) > {noformat} > It would be nice to add exception 
handling around this line to avoid > confusions.
[jira] [Assigned] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
[ https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25318: Assignee: Apache Spark > Add exception handling when wrapping the input stream during the the fetch or > stage retry in response to a corrupted block > -- > > Key: SPARK-25318 > URL: https://issues.apache.org/jira/browse/SPARK-25318 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Reza Safi >Assignee: Apache Spark >Priority: Minor > > SPARK-4105 provided a solution to block corruption issue by retrying the > fetch or the stage. In the solution there is a step that wraps the input > stream with compression and/or encryption. This step is prone to exceptions, > but in the current code there is no exception handling for this step and this > has caused confusion for the user. In fact we have customers who reported an > exception like the following when SPARK-4105 is available to them: > {noformat} > 2018-08-28 22:35:54,361 ERROR [Driver] > org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > tostage failure: Task 452 in stage 209.0 failed 4 times, most recent > failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > 3976 at > org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > 3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > 3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395) > 3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431) > 3980 at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > 3981 at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > 3982 at > org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > 3983 at > 
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159) > 3984 at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219) > 3985 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48) > 3986 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47) > 3987 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328) > 3988 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55) > 3989 at > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > 3990 a > {noformat} > In this customer's version of spark, line 328 of > ShuffleBlockFetcherIterator.scala is the line that the following occurs: > {noformat} > input = streamWrapper(blockId, in) > {noformat} > It would be nice to add exception handling around this line to avoid > confusions.
[jira] [Updated] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
[ https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reza Safi updated SPARK-25318: -- Description: SPARK-4105 provided a solution to block corruption issue by retrying the fetch or the stage. In the solution there is a step that wraps the input stream with compression and/or encryption. This step is prone to exceptions, but in the current code there is no exception handling for this step and this has caused confusion for the user. In fact we have customers who reported an exception like the following when SPARK-4105 is available to them: {noformat} 2018-08-28 22:35:54,361 ERROR [Driver] org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 452 in stage 209.0 failed 4 times, most recent failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): java.io.IOException: FAILED_TO_UNCOMPRESS(5) 3976 at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) 3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) 3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395) 3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431) 3980 at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) 3981 at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) 3982 at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) 3983 at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159) 3984 at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219) 3985 at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48) 3986 at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47) 3987 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328) 3988 at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55) 3989 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 3990 a {noformat} In this customer's version of spark, line 328 of ShuffleBlockFetcherIterator.scala is the line that the following occurs: {noformat} input = streamWrapper(blockId, in) {noformat} It would be nice to add exception handling around this line to avoid confusions. was: SPARK-4105 provided a solution to block corruption issue by retrying the fetch or the stage. In the solution there is a step that wraps the input stream with compression and/or encryption. This step is prune to exceptions, but in the current code there is no exception handling for this step and this has caused confusion for the user.. In fact we have customers who reported an exception like the following when SPARK-4105 is available to them: {noformat} 2018-08-28 22:35:54,361 ERROR [Driver] org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 452 in stage 209.0 failed 4 times, most recent failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): java.io.IOException: FAILED_TO_UNCOMPRESS(5) 3976 at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) 3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) 3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395) 3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431) 3980 at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) 3981 at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) 3982 at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) 3983 at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159) 3984 at 
org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219) 3985 at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48) 3986 at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47) 3987 at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328) 3988 at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55) 3989 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 3990 a {noformat} In this customer's version of spark, line 328 of ShuffleBlockFetcherIterator.scala is
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602567#comment-16602567 ] Kazuaki Ishizaki commented on SPARK-25317: -- I confirmed this performance difference even after adding warmup. Let me investigate further. > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > There is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > 
Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us
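The comment above notes the regression persists "even after adding warmup"; a generic warmup-then-measure harness for this kind of microbenchmark can be sketched as below. The workload here is a deliberate stand-in hash, not Spark's internal `Murmur3_x86_32` (which lives in the spark-unsafe module), so the numbers are illustrative only.

```scala
// Stand-in workload: a simple polynomial string hash over a long string,
// standing in for Murmur3_x86_32.hashUTF8String in the report above.
def workload(s: String): Int = s.foldLeft(0)((h, c) => h * 31 + c)

// Run `body` `iters` times and return the average nanoseconds per call.
def benchNanos(iters: Int)(body: => Unit): Long = {
  val start = System.nanoTime()
  var i = 0
  while (i < iters) { body; i += 1 }
  (System.nanoTime() - start) / iters
}

val str = "b" * 10001

// Warmup phase: let the JIT compile the hot path before timing, so the
// measurement is not dominated by interpreted/compiling iterations.
benchNanos(1000) { workload(str) }

// Measured phase.
val perCallNanos = benchNanos(1000) { workload(str) }
println(s"~${perCallNanos / 1000} us per call (stand-in workload)")
```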
[jira] [Commented] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
[ https://issues.apache.org/jira/browse/SPARK-25318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602565#comment-16602565 ] Reza Safi commented on SPARK-25318: --- I will send a PR for this shortly. > Add exception handling when wrapping the input stream during the fetch or > stage retry in response to a corrupted block > -- > > Key: SPARK-25318 > URL: https://issues.apache.org/jira/browse/SPARK-25318 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Reza Safi >Priority: Minor > > SPARK-4105 provided a solution to the block corruption issue by retrying the > fetch or the stage. In the solution there is a step that wraps the input > stream with compression and/or encryption. This step is prone to exceptions, > but in the current code there is no exception handling for this step, and this > has caused confusion for users. In fact we have customers who reported an > exception like the following when SPARK-4105 is available to them: > {noformat} > 2018-08-28 22:35:54,361 ERROR [Driver] > org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: > java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due > to stage failure: Task 452 in stage 209.0 failed 4 times, most recent > failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > 3976 at > org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > 3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > 3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395) > 3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431) > 3980 at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > 3981 at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > 3982 at > org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > 3983 at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159) > 3984 at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219) > 3985 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48) > 3986 at > org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47) > 3987 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328) > 3988 at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55) > 3989 at > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > 3990 a > {noformat} > In this customer's version of Spark, line 328 of > ShuffleBlockFetcherIterator.scala is the line where the following occurs: > {noformat} > input = streamWrapper(blockId, in) > {noformat} > It would be nice to add exception handling around this line to avoid > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25318) Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
Reza Safi created SPARK-25318: - Summary: Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block Key: SPARK-25318 URL: https://issues.apache.org/jira/browse/SPARK-25318 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1, 2.2.2, 2.1.3, 2.4.0 Reporter: Reza Safi SPARK-4105 provided a solution to the block corruption issue by retrying the fetch or the stage. In the solution there is a step that wraps the input stream with compression and/or encryption. This step is prone to exceptions, but in the current code there is no exception handling for this step, and this has caused confusion for users. In fact we have customers who reported an exception like the following when SPARK-4105 is available to them: {noformat} 2018-08-28 22:35:54,361 ERROR [Driver] org.apache.spark.deploy.yarn.ApplicationMaster:95 User class threw exception: java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 452 in stage 209.0 failed 4 times, most recent failure: Lost task 452.3 in stage y.0 (TID z, x, executor xx): java.io.IOException: FAILED_TO_UNCOMPRESS(5) 3976 at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) 3977 at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) 3978 at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:395) 3979 at org.xerial.snappy.Snappy.uncompress(Snappy.java:431) 3980 at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) 3981 at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) 3982 at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) 3983 at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:159) 3984 at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1219) 3985 at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:48) 3986 at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$2.apply(BlockStoreShuffleReader.scala:47) 3987 at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:328) 3988 at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:55) 3989 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 3990 a {noformat} In this customer's version of Spark, line 328 of ShuffleBlockFetcherIterator.scala is the line where the following occurs: {noformat} input = streamWrapper(blockId, in) {noformat} It would be nice to add exception handling around this line to avoid confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
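The proposal above is to catch failures at the `streamWrapper(blockId, in)` call and report which block was involved instead of letting a bare decompression error propagate. A hedged sketch of that pattern, using the stdlib `GZIPInputStream` as a stand-in for Spark's compression/encryption wrapper; the names `streamWrapper` and `openBlock` here are illustrative, not the actual `ShuffleBlockFetcherIterator` API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class WrapWithContext {
    // Stand-in for Spark's stream wrapping step (compression and/or encryption).
    // GZIPInputStream validates the header eagerly, like SnappyInputStream does.
    static InputStream streamWrapper(String blockId, InputStream in) throws IOException {
        return new GZIPInputStream(in);
    }

    static InputStream openBlock(String blockId, byte[] fetched) throws IOException {
        InputStream in = new ByteArrayInputStream(fetched);
        try {
            return streamWrapper(blockId, in);
        } catch (IOException e) {
            // Surface which block was corrupted instead of a bare
            // FAILED_TO_UNCOMPRESS, keeping the original cause chained.
            throw new IOException("Failed to wrap input stream for block " + blockId
                    + "; the block may be corrupted", e);
        }
    }

    public static void main(String[] args) {
        byte[] corrupted = {0x00, 0x01, 0x02, 0x03};  // not a valid GZIP header
        try {
            openBlock("shuffle_0_452_3", corrupted);
        } catch (IOException e) {
            System.out.println(e.getMessage());  // names the offending block
        }
    }
}
```

With the block id attached, the caller can decide whether to retry the fetch or fail the stage, and the user sees which block was corrupted rather than a confusing codec-level stack trace.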
[jira] [Commented] (SPARK-25237) FileScanRdd's inputMetrics is wrong when selecting the datasource table with a limit
[ https://issues.apache.org/jira/browse/SPARK-25237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602542#comment-16602542 ] Apache Spark commented on SPARK-25237: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/22324 > FileScanRdd's inputMetrics is wrong when selecting the datasource table with > a limit > > > Key: SPARK-25237 > URL: https://issues.apache.org/jira/browse/SPARK-25237 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: du >Priority: Major > > In FileScanRdd, we update inputMetrics's bytesRead using updateBytesRead > every 1000 rows or when closing the iterator, > but when closing the iterator we also invoke updateBytesReadWithFileSize to > increase inputMetrics's bytesRead by the file's length. > This results in a wrong bytesRead when running a query with a limit, > such as select * from table limit 1. > Because we no longer support Hadoop 2.5 and earlier, we can always get > bytesRead from Hadoop FileSystem statistics rather than the file's length. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
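The double counting described in the issue (statistics-based updates during iteration, plus a whole-file-length add on close) can be sketched minimally. All names below are illustrative stand-ins, not the actual FileScanRDD internals:

```java
public class BytesReadSketch {
    // Stand-in for the Hadoop FileSystem statistics counter: the bytes the
    // scan actually read, which may be far less than the file length under a limit.
    static long bytesReadFromStatistics;

    long bytesRead;  // the reported input metric

    void updateBytesRead() {
        bytesRead = bytesReadFromStatistics;  // correct: refresh from statistics
    }

    // Buggy close path: adds the whole file length on top of what the
    // statistics already reported, overcounting when a limit stopped the scan.
    void closeBuggy(long fileLength) {
        bytesRead += fileLength;
    }

    // Fixed close path: refresh from statistics instead of adding file length.
    void closeFixed() {
        updateBytesRead();
    }

    public static void main(String[] args) {
        long fileLength = 1_000_000;
        bytesReadFromStatistics = 4_096;  // a `limit 1` query read only one page

        BytesReadSketch buggy = new BytesReadSketch();
        buggy.updateBytesRead();
        buggy.closeBuggy(fileLength);
        System.out.println("buggy bytesRead = " + buggy.bytesRead);  // 1004096: overcounted

        BytesReadSketch fixed = new BytesReadSketch();
        fixed.updateBytesRead();
        fixed.closeFixed();
        System.out.println("fixed bytesRead = " + fixed.bytesRead);  // 4096: bytes actually read
    }
}
```

Since dropping Hadoop 2.5 support guarantees the statistics counter is available, the close path can rely on it unconditionally, which is what the linked pull request pursues.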
[jira] [Commented] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir
[ https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602528#comment-16602528 ] Hyukjin Kwon commented on SPARK-25293: -- [~omkar999], would you be able to test this and see if the issue still exists in a newer Spark version? > Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx > instead of directly saving in outputDir > -- > > Key: SPARK-25293 > URL: https://issues.apache.org/jira/browse/SPARK-25293 > Project: Spark > Issue Type: Bug > Components: EC2, Java API, Spark Shell, Spark Submit >Affects Versions: 2.0.2 >Reporter: omkar puttagunta >Priority: Major > > [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating] > {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master > node on AWS EC2 > {quote} > Simple test: reading a pipe-delimited file and writing the data to csv. The commands > below are executed in spark-shell with the master URL set. > {{val df = > spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/") > val emailDf=df.filter("_c3='EML'") > emailDf.repartition(100).write.csv("/opt/outputFile/")}} > After executing the commands above in spark-shell with the master URL set: > {quote}In {{worker1}} -> each part file is created > in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}} > In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated > directly under the output directory specified during write. > {quote} > *The same thing happens with coalesce(100) or without specifying > repartition/coalesce at all! Tried with Java as well!* > *_Question_* > 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have > {{part-}} files just like {{worker2}}? Why is a {{_temporary}} directory > created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602512#comment-16602512 ] Jungtaek Lim commented on SPARK-25317: -- Why not running test with JMH, applying warmup and iteration? Not sure it can be applied to scala test, but the Java test code should be simple if these Spark classes are aware of interop. > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > 
Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20395) Update Scala to 2.11.11 and zinc to 0.3.15
[ https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-20395: - Assignee: DB Tsai > Update Scala to 2.11.11 and zinc to 0.3.15 > -- > > Key: SPARK-20395 > URL: https://issues.apache.org/jira/browse/SPARK-20395 > Project: Spark > Issue Type: Dependency upgrade > Components: Build, Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Jeremy Smith >Assignee: DB Tsai >Priority: Minor > Fix For: 2.4.0 > > > Update Scala to 2.11.11, which was released yesterday: > https://github.com/scala/scala/releases/tag/v2.11.11 > Since it's a patch version upgrade and binary compatibility is guaranteed, > impact should be minimal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20395) Update Scala to 2.11.11 and zinc to 0.3.15
[ https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602510#comment-16602510 ] Dongjoon Hyun edited comment on SPARK-20395 at 9/4/18 12:59 AM: In SPARK-24418, Scala version becomes 2.11.12 (Spark 2.4.0) by [~dbtsai]. In SPARK-19810, Zinc version becomes 0.3.15 (Spark 2.3.0) by [~srowen]. was (Author: dongjoon): In SPARK-24418, Scala version becomes 2.11.12 (Spark 2.4.0) In SPARK-19810, Zinc version becomes 0.3.15 (Spark 2.3.0) > Update Scala to 2.11.11 and zinc to 0.3.15 > -- > > Key: SPARK-20395 > URL: https://issues.apache.org/jira/browse/SPARK-20395 > Project: Spark > Issue Type: Dependency upgrade > Components: Build, Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Jeremy Smith >Priority: Minor > Fix For: 2.4.0 > > > Update Scala to 2.11.11, which was released yesterday: > https://github.com/scala/scala/releases/tag/v2.11.11 > Since it's a patch version upgrade and binary compatibility is guaranteed, > impact should be minimal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20395) Update Scala to 2.11.11 and zinc to 0.3.15
[ https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-20395. --- Resolution: Fixed Fix Version/s: 2.4.0 In SPARK-24418, Scala version becomes 2.11.12 (Spark 2.4.0) In SPARK-19810, Zinc version becomes 0.3.15 (Spark 2.3.0) > Update Scala to 2.11.11 and zinc to 0.3.15 > -- > > Key: SPARK-20395 > URL: https://issues.apache.org/jira/browse/SPARK-20395 > Project: Spark > Issue Type: Dependency upgrade > Components: Build, Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Jeremy Smith >Priority: Minor > Fix For: 2.4.0 > > > Update Scala to 2.11.11, which was released yesterday: > https://github.com/scala/scala/releases/tag/v2.11.11 > Since it's a patch version upgrade and binary compatibility is guaranteed, > impact should be minimal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20395) Update Scala to 2.11.11 and zinc to 0.3.15
[ https://issues.apache.org/jira/browse/SPARK-20395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-20395: --- > Update Scala to 2.11.11 and zinc to 0.3.15 > -- > > Key: SPARK-20395 > URL: https://issues.apache.org/jira/browse/SPARK-20395 > Project: Spark > Issue Type: Dependency upgrade > Components: Build, Spark Core >Affects Versions: 2.0.2, 2.1.0 >Reporter: Jeremy Smith >Priority: Minor > > Update Scala to 2.11.11, which was released yesterday: > https://github.com/scala/scala/releases/tag/v2.11.11 > Since it's a patch version upgrade and binary compatibility is guaranteed, > impact should be minimal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602506#comment-16602506 ] Kazuaki Ishizaki commented on SPARK-25317: -- Let me run this on 2.3 and master. One question. This benchmark does not have an warm up loop. In other words, this benchmark may include execution time on an interpreter, too. Is this behavior intentional? > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > 
Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25317: - Description: eThere is a performance regression when calculating hash code for UTF8String: {code:java} test("hashing") { import org.apache.spark.unsafe.hash.Murmur3_x86_32 import org.apache.spark.unsafe.types.UTF8String val hasher = new Murmur3_x86_32(0) val str = UTF8String.fromString("b" * 10001) val numIter = 10 val start = System.nanoTime for (i <- 0 until numIter) { Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) } val duration = (System.nanoTime() - start) / 1000 / numIter println(s"duration $duration us") } {code} To run this test in 2.3, we need to add {code:java} public static int hashUTF8String(UTF8String str, int seed) { return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), str.numBytes(), seed); } {code} to 
`Murmur3_x86_32` In my laptop, the result for master vs 2.3 is: 120 us vs 40 us was: There is a performance regression when calculating hash code for UTF8String: {code} test("hashing") { import org.apache.spark.unsafe.hash.Murmur3_x86_32 import org.apache.spark.unsafe.types.UTF8String val hasher = new Murmur3_x86_32(0) val str = UTF8String.fromString("b" * 10001) val numIter = 10 val start = System.nanoTime for (i <- 0 until numIter) { Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) } val duration = (System.nanoTime() - start) / 1000 / numIter println(s"duration $duration us") } {code} To run this test in 2.3, we need to add {code} public static int hashUTF8String(UTF8String str, int seed) { return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), str.numBytes(), seed); } {code} to `Murmur3_x86_32` In my laptop, the result for master vs 2.3 is: 120 us vs 40 us > MemoryBlock performance 
regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import
[jira] [Comment Edited] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602503#comment-16602503 ] Wenchen Fan edited comment on SPARK-25317 at 9/4/18 12:24 AM: -- cc [~kiszk] [~rednaxelafx] was (Author: cloud_fan): cc [~kiszk] > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > There is a performance regression when calculating hash code for UTF8String: > {code} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > 
Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602503#comment-16602503 ] Wenchen Fan commented on SPARK-25317: - cc [~kiszk] > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > There is a performance regression when calculating hash code for UTF8String: > {code} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > 
Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25317) MemoryBlock performance regression
Wenchen Fan created SPARK-25317: --- Summary: MemoryBlock performance regression Key: SPARK-25317 URL: https://issues.apache.org/jira/browse/SPARK-25317 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan

There is a performance regression when calculating the hash code for UTF8String:

{code}
test("hashing") {
  import org.apache.spark.unsafe.hash.Murmur3_x86_32
  import org.apache.spark.unsafe.types.UTF8String

  val hasher = new Murmur3_x86_32(0)
  val str = UTF8String.fromString("b" * 10001)
  val numIter = 10
  val start = System.nanoTime
  for (i <- 0 until numIter) {
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
    Murmur3_x86_32.hashUTF8String(str, 0)
  }
  val duration = (System.nanoTime() - start) / 1000 / numIter
  println(s"duration $duration us")
}
{code}

To run this test in 2.3, we need to add

{code}
public static int hashUTF8String(UTF8String str, int seed) {
  return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), str.numBytes(), seed);
}
{code}

to `Murmur3_x86_32`. On my laptop, the result for master vs 2.3 is 120 us vs 40 us. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
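The pattern in the report above — a manually unrolled batch of hash calls timed with `System.nanoTime` — can be reproduced without Spark. Below is a minimal standalone sketch of the same benchmarking pattern; the `fnv1a` function is a stand-in hash for illustration, not Spark's actual `Murmur3_x86_32`:

```java
import java.nio.charset.StandardCharsets;

public class HashBench {
    // FNV-1a over bytes: a simple stand-in for the hash function under test.
    static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Times numIter batches of callsPerBatch hash calls (mirroring the report's
    // unrolled loop) and returns the average microseconds per batch.
    static long benchmark(byte[] input, int numIter, int callsPerBatch) {
        long start = System.nanoTime();
        int sink = 0; // accumulate results so the JIT cannot eliminate the calls
        for (int i = 0; i < numIter; i++) {
            for (int c = 0; c < callsPerBatch; c++) {
                sink += fnv1a(input);
            }
        }
        long micros = (System.nanoTime() - start) / 1000 / numIter;
        if (sink == 0x7fffffff) System.out.print(""); // keep sink observable
        return micros;
    }

    public static void main(String[] args) {
        // Same shape of input as the report: a 10001-character string of 'b'.
        byte[] str = "b".repeat(10001).getBytes(StandardCharsets.UTF_8);
        System.out.println("duration " + benchmark(str, 10, 30) + " us");
    }
}
```

A hand-rolled loop like this is only a rough signal; for conclusive numbers a harness such as JMH would normally be used to control for warmup and dead-code elimination.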
[jira] [Updated] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25317: Priority: Blocker (was: Major) > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > There is a performance regression when calculating hash code for UTF8String: > {code} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 
0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25316) Spark error - ERROR ContextCleaner: Error cleaning broadcast 22, Exception thrown in awaitResult:
Vidya created SPARK-25316: - Summary: Spark error - ERROR ContextCleaner: Error cleaning broadcast 22, Exception thrown in awaitResult: Key: SPARK-25316 URL: https://issues.apache.org/jira/browse/SPARK-25316 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.2.2 Reporter: Vidya

While running a Spark load on EMR with c3 instances, we see the following error:

ERROR ContextCleaner: Error cleaning broadcast 22
org.apache.spark.SparkException: Exception thrown in awaitResult:

What's the cause of the error and how do we fix it?

{code}
[Stage 30:=> (374 + 20) / 600]
[Stage 30:=> (419 + 20) / 600]
[Stage 30:==> (471 + 4) / 600]
18/08/02 21:06:09 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /10.154.21.145:45990 is closed
18/08/02 21:06:09 ERROR ContextCleaner: Error cleaning broadcast 22
org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
  at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:161)
  at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:306)
  at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
  at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60)
  at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238)
  at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194)
  at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185)
  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1279)
  at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
  at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)
Caused by: java.io.IOException: Connection reset by peer
  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
  at sun.nio.ch.IOUtil.read(IOUtil.java:192)
  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
  at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
  at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
  at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
  at java.lang.Thread.run(Thread.java:748)
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602440#comment-16602440 ] Dilip Biswal commented on SPARK-25279: -- [~zzcclp] Hmmn.. i don't know whats happening in the :paste mode. Let me cc the experts. cc [~cloud_fan] [~viirya] Hi Wenchen and Simon, Do you know the reason for the failure ? > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > 
partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json 
[name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > ) >
[jira] [Resolved] (SPARK-25117) Add EXCEPT ALL and INTERSECT ALL support in R.
[ https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-25117. -- Resolution: Fixed Assignee: Dilip Biswal > Add EXCEPT ALL and INTERSECT ALL support in R. > - > > Key: SPARK-25117 > URL: https://issues.apache.org/jira/browse/SPARK-25117 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25117) Add EXCEPT ALL and INTERSECT ALL support in R.
[ https://issues.apache.org/jira/browse/SPARK-25117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-25117: - Fix Version/s: 2.4.0 > Add EXCEPT ALL and INTERSECT ALL support in R. > - > > Key: SPARK-25117 > URL: https://issues.apache.org/jira/browse/SPARK-25117 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-22190) Add Spark executor task metrics to Dropwizard metrics
[ https://issues.apache.org/jira/browse/SPARK-22190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali closed SPARK-22190. --- > Add Spark executor task metrics to Dropwizard metrics > - > > Key: SPARK-22190 > URL: https://issues.apache.org/jira/browse/SPARK-22190 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 2.3.0 > > Attachments: SparkTaskMetrics_Grafana_example.PNG > > > I would like to propose to expose Spark executor task metrics using the > Dropwizard metrics. I have developed a simple implementation and run a few > tests using Graphite sink and Grafana visualization and this appears to me a > good source of information for monitoring and troubleshooting the progress of > Spark jobs. I attach a screenshot of an example graph generated with Grafana. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21829) Enable config to permanently blacklist a list of nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali closed SPARK-21829. --- > Enable config to permanently blacklist a list of nodes > -- > > Key: SPARK-21829 > URL: https://issues.apache.org/jira/browse/SPARK-21829 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Luca Canali >Priority: Minor > > The idea for this proposal comes from a performance incident in a local > cluster where a job was found very slow because of a log tail of stragglers > due to 2 nodes in the cluster being slow to access a remote filesystem. > The issue was limited to the 2 machines and was related to external > configurations: the 2 machines that performed badly when accessing the remote > file system were behaving normally for other jobs in the cluster (a shared > YARN cluster). > With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. > The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it is never expired. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. 
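The "added to the blacklist at the start of the Spark Context and never expired" semantics described above can be sketched as a map from node to expiry time, where statically configured nodes get an unreachable expiry. This is a hypothetical standalone illustration of the idea, not Spark's actual `BlacklistTracker`:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: permanent (user-configured) entries use Long.MAX_VALUE as expiry,
// while dynamically blacklisted nodes carry a finite timeout.
public class NodeBlacklist {
    private final Map<String, Long> expiryByNode = new ConcurrentHashMap<>();

    // For nodes listed in a config such as the proposed
    // spark.blacklist.alwaysBlacklistedNodes: never expires.
    void blacklistForever(String node) {
        expiryByNode.put(node, Long.MAX_VALUE);
    }

    // For nodes blacklisted by runtime failure tracking: keep the later expiry.
    void blacklistUntil(String node, long expiryMillis) {
        expiryByNode.merge(node, expiryMillis, Math::max);
    }

    boolean isBlacklisted(String node, long nowMillis) {
        Long expiry = expiryByNode.get(node);
        if (expiry == null) return false;
        if (expiry <= nowMillis) {          // timed-out entries fall off the list
            expiryByNode.remove(node);
            return false;
        }
        return true;
    }
}
```

The scheduler would consult `isBlacklisted` before placing an executor or task on a node; permanent entries behave identically to timed ones except that their expiry is never reached.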
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session
[ https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali closed SPARK-21519. --- > Add an option to the JDBC data source to initialize the environment of the > remote database session > -- > > Key: SPARK-21519 > URL: https://issues.apache.org/jira/browse/SPARK-21519 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 2.3.0 > > > This proposes an option to the JDBC datasource, tentatively called > "sessionInitStatement" to implement the functionality of session > initialization present for example in the Sqoop connector for Oracle (see > https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements) > . After each database session is opened to the remote DB, and before > starting to read data, this option executes a custom SQL statement (or a > PL/SQL block in the case of Oracle). 
> Example of usage, relevant to Oracle JDBC: > {code} > val preambleSQL=""" > begin > execute immediate 'alter session set tracefile_identifier=sparkora'; > execute immediate 'alter session set "_serial_direct_read"=true'; > execute immediate 'alter session set time_zone=''+02:00'''; > end; > """ > bin/spark-shell --jars ojdb6.jar > val df = spark.read >.format("jdbc") >.option("url", > "jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name") >.option("driver", "oracle.jdbc.driver.OracleDriver") >.option("dbtable", "(select 1, sysdate, systimestamp, > current_timestamp, localtimestamp from dual)") >.option("user", "MYUSER") >.option("password", "MYPASSWORD").option("fetchsize",1000) >.option("sessionInitStatement", preambleSQL) >.load() > df.show(5,false) > {code} > *Comments:* This proposal has been developed and tested for connecting the > Spark JDBC data source to Oracle databases, however I believe it can be > useful for other target DBs too, as it is quite generic. > The code executed by the option "sessionInitStatement" is just the > user-provided string fed through the execute method of the JDBC connection, > so it can use the features of the target database language/syntax. When using > sessionInitStatement for querying Oracle, for example, the user-provided > command can be a SQL statement or a PL/SQL block grouping multiple commands > and logic. > Note the proposed code allows to inject SQL into the target database. This is > not a security concern as such, as it requires password authentication, > however beware of the possibilities of injecting user-provided SQL (and > PL/SQL) that this opens. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
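The contract described above — run the user-supplied statement once per session, after the connection is opened and before any rows are read — can be sketched independently of Spark. The helper below is hypothetical (it is not Spark's `JDBCRDD`); the `execute` parameter stands in for `Statement.execute` on a freshly opened JDBC connection:

```java
import java.util.function.Function;

// Sketch of the sessionInitStatement contract: init SQL runs first, exactly once,
// then the actual query is issued on the same session.
public class SessionInitReader {
    static <R> R readWithInit(String sessionInitStatement,
                              String query,
                              Function<String, R> execute) {
        if (sessionInitStatement != null && !sessionInitStatement.isEmpty()) {
            execute.apply(sessionInitStatement); // e.g. an ALTER SESSION or PL/SQL block
        }
        return execute.apply(query);             // only then read the data
    }
}
```

Ordering is the essential design point: because the init statement and the query share one session, session-scoped settings (time zone, trace identifiers, optimizer hints) are visible to the subsequent read, which is exactly why the statement must run before fetching begins on every connection the data source opens.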
[jira] [Updated] (SPARK-25309) Sci-kit Learn like Auto Pipeline Parallelization in Spark
[ https://issues.apache.org/jira/browse/SPARK-25309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi updated SPARK-25309: - Description: SPARK-19357 and SPARK-21911 have helped parallelize Pipelines in Spark. However, instead of setting the parallelism Parameter in the CrossValidator it would be good to have something like njobs=-1 (like Scikit Learn) where the Pipeline DAG could be automatically parallelized and scheduled based on the resources allocated to the Spark Session instead of having the user pick the integer value for this parameter. (was: SPARK-19357 and SPARK-21911 have helped parallelize Pipelines in Spark. However, instead of setting the parallelism Parameter in the CrossValidator it would be good to have something like njobs=-1 (like Scikit Learn) where the Pipleline DAG could be automatically parallelized and scheduled based on the resources allocated to the Spark Session instead of having the user pick the integer value for this parameter. ) > Sci-kit Learn like Auto Pipeline Parallelization in Spark > -- > > Key: SPARK-25309 > URL: https://issues.apache.org/jira/browse/SPARK-25309 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Ravi >Priority: Critical > > SPARK-19357 and SPARK-21911 have helped parallelize Pipelines in Spark. > However, instead of setting the parallelism Parameter in the CrossValidator > it would be good to have something like njobs=-1 (like Scikit Learn) where > the Pipeline DAG could be automatically parallelized and scheduled based on > the resources allocated to the Spark Session instead of having the user pick > the integer value for this parameter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602308#comment-16602308 ] Zhichao Zhang edited comment on SPARK-25279 at 9/3/18 4:12 PM: [~dkbiswal], I followed your steps to run code successfully, but if I pasted all code and then ran the code, the error occured: {code:java} scala>:paste ...copy all code here Crtl + D{code} what is the difference between these two mode? was (Author: zzcclp): [~dkbiswal], I followed your steps to run code successfully, but if I pasted all code and then ran the code, the error occured: scala>:paste ...copy all code here Crtl + D what is the difference between these two mode? > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > 
MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > 
StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: >
[jira] [Commented] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602308#comment-16602308 ] Zhichao Zhang commented on SPARK-25279: [~dkbiswal], I followed your steps to run code successfully, but if I pasted all code and then ran the code, the error occured: scala>:paste ...copy all code here Crtl + D what is the difference between these two mode? > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, 
DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class >
[jira] [Created] (SPARK-25315) setting "auto.offset.reset" to "earliest" has no effect in Structured Streaming with Spark 2.3.1 and Kafka 1.0
Zhenhao Li created SPARK-25315: -- Summary: setting "auto.offset.reset" to "earliest" has no effect in Structured Streaming with Spark 2.3.1 and Kafka 1.0 Key: SPARK-25315 URL: https://issues.apache.org/jira/browse/SPARK-25315 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.1 Environment: Standalone; running in IDEA Reporter: Zhenhao Li

The following code won't read from the beginning of the topic:

{code:java}
val kafkaOptions = Map[String, String](
  "kafka.bootstrap.servers" -> KAFKA_BOOTSTRAP_SERVERS,
  "subscribe" -> TOPIC,
  "group.id" -> GROUP_ID,
  "auto.offset.reset" -> "earliest"
)

val myStream = sparkSession
  .readStream
  .format("kafka")
  .options(kafkaOptions)
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

myStream
  .writeStream
  .format("console")
  .start()
  .awaitTermination()
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
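Context for the behavior in this report (based on the documented Kafka source options, not on a resolution of the ticket): the Structured Streaming Kafka source manages offsets itself, so plain consumer properties like `auto.offset.reset` and `group.id` are not honored; the supported way to start from the beginning of a topic is the source's own `startingOffsets` option. A sketch of the adjusted reader, shown with the Java API and reusing the report's `KAFKA_BOOTSTRAP_SERVERS` and `TOPIC` placeholders:

```java
// Sketch only: assumes a live SparkSession `spark` and the constants from the report.
Dataset<Row> stream = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
    .option("subscribe", TOPIC)
    .option("startingOffsets", "earliest") // replaces auto.offset.reset for this source
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
```

Note that `startingOffsets` only applies when a query starts without a checkpoint; a resumed query continues from its checkpointed offsets.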
[jira] [Resolved] (SPARK-24767) Propagate MDC to spark-submit thread in InProcessAppHandle
[ https://issues.apache.org/jira/browse/SPARK-24767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yifei Huang resolved SPARK-24767. - Resolution: Won't Fix > Propagate MDC to spark-submit thread in InProcessAppHandle > -- > > Key: SPARK-24767 > URL: https://issues.apache.org/jira/browse/SPARK-24767 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.3.1 >Reporter: Yifei Huang >Priority: Major > > Currently, for the in-process launcher, we don't propagate MDCs to the thread > that runs spark-submit. Ideally, we'd do this so it's easier to map the child > thread to the parent thread. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25298) spark-tools build failure for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-25298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25298: - Assignee: Darcy Shen > spark-tools build failure for Scala 2.12 > > > Key: SPARK-25298 > URL: https://issues.apache.org/jira/browse/SPARK-25298 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Darcy Shen >Assignee: Darcy Shen >Priority: Major > Fix For: 2.4.0 > > > $ sbt-- > > ++ 2.12.6 > > compile > > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:22: > object runtime is not a member of package reflect > [error] import scala.reflect.runtime.\{universe => unv} > [error] ^ > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:23: > object runtime is not a member of package reflect > [error] import scala.reflect.runtime.universe.runtimeMirror > [error] ^ > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:41: > not found: value runtimeMirror > [error] private val mirror = runtimeMirror(classLoader) > [error] ^ > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:43: > not found: value unv > [error] private def isPackagePrivate(sym: unv.Symbol) = -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25298) spark-tools build failure for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-25298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25298. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22310 [https://github.com/apache/spark/pull/22310] > spark-tools build failure for Scala 2.12 > > > Key: SPARK-25298 > URL: https://issues.apache.org/jira/browse/SPARK-25298 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Darcy Shen >Assignee: Darcy Shen >Priority: Major > Fix For: 2.4.0 > > > $ sbt-- > > ++ 2.12.6 > > compile > > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:22: > object runtime is not a member of package reflect > [error] import scala.reflect.runtime.\{universe => unv} > [error] ^ > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:23: > object runtime is not a member of package reflect > [error] import scala.reflect.runtime.universe.runtimeMirror > [error] ^ > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:41: > not found: value runtimeMirror > [error] private val mirror = runtimeMirror(classLoader) > [error] ^ > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:43: > not found: value unv > [error] private def isPackagePrivate(sym: unv.Symbol) = -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25298) spark-tools build failure for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-25298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25298: -- Priority: Minor (was: Major) > spark-tools build failure for Scala 2.12 > > > Key: SPARK-25298 > URL: https://issues.apache.org/jira/browse/SPARK-25298 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Darcy Shen >Assignee: Darcy Shen >Priority: Minor > Fix For: 2.4.0 > > > $ sbt-- > > ++ 2.12.6 > > compile > > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:22: > object runtime is not a member of package reflect > [error] import scala.reflect.runtime.\{universe => unv} > [error] ^ > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:23: > object runtime is not a member of package reflect > [error] import scala.reflect.runtime.universe.runtimeMirror > [error] ^ > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:41: > not found: value runtimeMirror > [error] private val mirror = runtimeMirror(classLoader) > [error] ^ > [error] > /Users/rendong/wdi/spark/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:43: > not found: value unv > [error] private def isPackagePrivate(sym: unv.Symbol) = -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19728) PythonUDF with multiple parents shouldn't be pushed down when used as a predicate
[ https://issues.apache.org/jira/browse/SPARK-19728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602040#comment-16602040 ] Sergey Bahchissaraitsev commented on SPARK-19728: - This is still happening in 2.3.1 when using the UDF inside the join "on" condition, e.g. {quote}df1.join(df2, pred(df1.col_a, df2.col_b)).show() {quote} I have opened SPARK-25314 for this. > PythonUDF with multiple parents shouldn't be pushed down when used as a > predicate > -- > > Key: SPARK-19728 > URL: https://issues.apache.org/jira/browse/SPARK-19728 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > Fix For: 2.2.0 > > > Prior to Spark 2.0 it was possible to use Python UDF output as a predicate: > {code} > from pyspark.sql.functions import udf > from pyspark.sql.types import BooleanType > df1 = sc.parallelize([(1, ), (2, )]).toDF(["col_a"]) > df2 = sc.parallelize([(2, ), (3, )]).toDF(["col_b"]) > pred = udf(lambda x, y: x == y, BooleanType()) > df1.join(df2).where(pred("col_a", "col_b")).show() > {code} > In Spark 2.0 this is no longer possible: > {code} > spark.conf.set("spark.sql.crossJoin.enabled", True) > df1.join(df2).where(pred("col_a", "col_b")).show() > ## ... > ## Py4JJavaError: An error occurred while calling o731.showString. > : java.lang.RuntimeException: Invalid PythonUDF (col_a#132L, > col_b#135L), requires attributes from more than one child. > ## ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25314) Invalid PythonUDF - requires attributes from more than one child - in "on" join condition
Sergey Bahchissaraitsev created SPARK-25314: --- Summary: Invalid PythonUDF - requires attributes from more than one child - in "on" join condition Key: SPARK-25314 URL: https://issues.apache.org/jira/browse/SPARK-25314 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.3.1 Reporter: Sergey Bahchissaraitsev This is another variation of the SPARK-19728 which was tagged as resolved. So I base the example on it: from pyspark.sql.functions import udf from pyspark.sql.types import BooleanType df1 = sc.parallelize([(1, ), (2, )]).toDF(["col_a"]) df2 = sc.parallelize([(2, ), (3, )]).toDF(["col_b"]) pred = udf(lambda x, y: x == y, BooleanType()) df1.join(df2, pred(df1.col_a, df2.col_b)).show() This throws: {quote}java.lang.RuntimeException: Invalid PythonUDF (col_a#132L, col_b#135L), requires attributes from more than one child. at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181) at scala.collection.immutable.Stream.foreach(Stream.scala:594) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118) at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at
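The failure above is specific to evaluating the Python UDF inside the join's "on" condition; the `.where()` form is the case SPARK-19728's fix covers. A hedged sketch of that workaround (cross join first, then filter), with the predicate's logic shown as a standalone Python function:

```python
# The predicate itself is plain Python; pyspark would wrap it in a UDF.
def eq(x, y):
    return x == y

# Workaround sketch (requires a live SparkSession; shown for shape only):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import BooleanType
# pred = udf(eq, BooleanType())
# df1.crossJoin(df2).where(pred(df1.col_a, df2.col_b)).show()
```

The trade-off is that a cross join materializes every row pair before the filter runs, so this is only practical for small inputs.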
[jira] [Commented] (SPARK-25262) Make Spark local dir volumes configurable with Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601921#comment-16601921 ] Apache Spark commented on SPARK-25262: -- User 'rvesse' has created a pull request for this issue: https://github.com/apache/spark/pull/22323 > Make Spark local dir volumes configurable with Spark on Kubernetes > -- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Priority: Major > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling on {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. > Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. > Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. 
> Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
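The `tmpfs` behavior described above corresponds to a Kubernetes `emptyDir` volume with `medium: Memory`. A sketch of the kind of volume definition the proposed setting would generate (the volume name is hypothetical, not taken from Spark's code):

```yaml
volumes:
  - name: spark-local-dir-1   # hypothetical name; Spark generates its own
    emptyDir:
      medium: Memory          # tmpfs-backed, per the Kubernetes volumes docs
      sizeLimit: 1Gi          # optional cap; usage counts against memory limits
```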
[jira] [Commented] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations
[ https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601857#comment-16601857 ] Apache Spark commented on SPARK-25312: -- User 'npoberezkin' has created a pull request for this issue: https://github.com/apache/spark/pull/22322 > Add description for the conf spark.network.crypto.keyFactoryIterations > -- > > Key: SPARK-25312 > URL: https://issues.apache.org/jira/browse/SPARK-25312 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Core >Affects Versions: 2.3.2 >Reporter: Xiao Li >Priority: Major > Labels: starter > > https://github.com/apache/spark/pull/22195 fixed the typo of an undocumented > conf `spark.network.crypto.keyFactoryIterations`. We should document it like > what we did for the other confs spark.network.crypto.xyz in > https://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
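The conf being documented sits alongside the other `spark.network.crypto.*` settings; a sketch of how it might appear in `spark-defaults.conf` (the value 1024 is an assumption about the default, not something stated in the ticket):

```
spark.authenticate                         true
spark.network.crypto.enabled               true
spark.network.crypto.keyFactoryIterations  1024
```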
[jira] [Assigned] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations
[ https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25312: Assignee: (was: Apache Spark) > Add description for the conf spark.network.crypto.keyFactoryIterations > -- > > Key: SPARK-25312 > URL: https://issues.apache.org/jira/browse/SPARK-25312 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Core >Affects Versions: 2.3.2 >Reporter: Xiao Li >Priority: Major > Labels: starter > > https://github.com/apache/spark/pull/22195 fixed the typo of an undocumented > conf `spark.network.crypto.keyFactoryIterations`. We should document it like > what we did for the other confs spark.network.crypto.xyz in > https://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations
[ https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25312: Assignee: Apache Spark > Add description for the conf spark.network.crypto.keyFactoryIterations > -- > > Key: SPARK-25312 > URL: https://issues.apache.org/jira/browse/SPARK-25312 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Core >Affects Versions: 2.3.2 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > Labels: starter > > https://github.com/apache/spark/pull/22195 fixed the typo of an undocumented > conf `spark.network.crypto.keyFactoryIterations`. We should document it like > what we did for the other confs spark.network.crypto.xyz in > https://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25313: Assignee: (was: Apache Spark) > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > In the following example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in the schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by the > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25313: Assignee: Apache Spark > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > In the following example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in the schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by the > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
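The fix direction in the description can be illustrated without Spark: when output columns are tracked as attribute objects, an optimizer rule may substitute a semantically-equal attribute whose name differs only in case, silently changing the user-visible column name; tracking plain strings pins the name down. A toy sketch of that mismatch (not Spark code):

```python
# Toy model of the regression: the requested output name comes from the
# target table's schema, but the optimizer (RemoveRedundantAliases in the
# report) keeps an attribute whose name differs only in case.
requested_name = "ID"    # from CREATE TABLE tbl2(ID long)
optimized_name = "id"    # name of the attribute the optimizer kept

# The names match only case-insensitively, so the written Parquet schema
# ends up with "id" while readers of tbl2 expect "ID".
assert optimized_name != requested_name
assert optimized_name.lower() == requested_name.lower()
```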
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25313: Assignee: (was: Apache Spark) > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > In the follow example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25313: Assignee: Apache Spark > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > In the follow example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25313: Assignee: (was: Apache Spark) > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > In the follow example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25313: Assignee: Apache Spark > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > In the follow example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25313: Assignee: Apache Spark > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > In the follow example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25313: Assignee: (was: Apache Spark) > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > In the following example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in the schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by the > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25312) Add description for the conf spark.network.crypto.keyFactoryIterations
[ https://issues.apache.org/jira/browse/SPARK-25312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601834#comment-16601834 ] Nikita Poberezkin commented on SPARK-25312: --- I will add a description > Add description for the conf spark.network.crypto.keyFactoryIterations > -- > > Key: SPARK-25312 > URL: https://issues.apache.org/jira/browse/SPARK-25312 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Core >Affects Versions: 2.3.2 >Reporter: Xiao Li >Priority: Major > Labels: starter > > https://github.com/apache/spark/pull/22195 fixed the typo of an undocumented > conf `spark.network.crypto.keyFactoryIterations`. We should document it like > what we did for the other confs spark.network.crypto.xyz in > https://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601833#comment-16601833 ] Apache Spark commented on SPARK-25313: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22320 > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > In the following example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in the schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by the > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25313) Fix regression in FileFormatWriter output schema
Gengliang Wang created SPARK-25313: -- Summary: Fix regression in FileFormatWriter output schema Key: SPARK-25313 URL: https://issues.apache.org/jira/browse/SPARK-25313 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Gengliang Wang Let's see the following example: val location = "/tmp/t" val df = spark.range(10).toDF("id") df.write.format("parquet").saveAsTable("tbl") spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location") spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") println(spark.read.parquet(location).schema) spark.table("tbl2").show() The output column name in the schema will be id instead of ID, thus the last query shows nothing from tbl2. By enabling the debug message we can see that the output naming is changed from ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. To guarantee correctness, we should change the output columns from `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by the optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-25313: --- Description: In the following example: val location = "/tmp/t" val df = spark.range(10).toDF("id") df.write.format("parquet").saveAsTable("tbl") spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location") spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") println(spark.read.parquet(location).schema) spark.table("tbl2").show() The output column name in the schema will be id instead of ID, thus the last query shows nothing from tbl2. By enabling the debug message we can see that the output naming is changed from ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. To guarantee correctness, we should change the output columns from `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by the optimizer. was: et's see the follow example: val location = "/tmp/t" val df = spark.range(10).toDF("id") df.write.format("parquet").saveAsTable("tbl") spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location $location") spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") println(spark.read.parquet(location).schema) spark.table("tbl2").show() The output column name in schema will be id instead of ID, thus the last query shows nothing from tbl2. By enabling the debug message we can see that the output naming is changed from ID to id, and then the outputColumns in InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. To guarantee correctness, we should change the output columns from `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by optimizer. 
> Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > In the following example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in the schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by the > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
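The SPARK-25313 description proposes carrying the write command's output columns as plain name strings instead of optimizer-visible attributes. The following Python sketch is a toy model of that idea only — `Attribute` and `remove_redundant_aliases` are hypothetical stand-ins, not Spark's Catalyst classes:

```python
# Toy model (hypothetical names, not Spark's API): an optimizer rule that
# rewrites attribute objects also rewrites any schema that still points at
# those objects; capturing the names as plain strings before optimization
# preserves the user-declared case.

class Attribute:
    """Stand-in for a Catalyst attribute: just a column name here."""
    def __init__(self, name):
        self.name = name

def remove_redundant_aliases(attrs):
    # Stand-in for RemoveRedundantAliases: resolution is case-insensitive,
    # so the rule may swap an alias ("ID") for the underlying column ("id").
    return [Attribute(a.name.lower()) for a in attrs]

declared = [Attribute("ID")]  # models CREATE TABLE tbl2(ID long)

# Seq[Attribute]-style: the command keeps attribute objects, so the
# written schema follows whatever the optimizer produced.
schema_from_attrs = [a.name for a in remove_redundant_aliases(declared)]

# Seq[String]-style: the names were captured before optimization ran.
schema_from_strings = [a.name for a in declared]

print(schema_from_attrs)    # ['id'] - declared case lost
print(schema_from_strings)  # ['ID'] - declared case preserved
```

The toy rule makes the failure mode concrete: any consumer holding object references observes the optimizer's rewrite, while a snapshot of the names does not.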
[jira] [Comment Edited] (SPARK-25303) A DStream that is checkpointed should allow its parent(s) to be removed and not persisted
[ https://issues.apache.org/jira/browse/SPARK-25303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599566#comment-16599566 ] Nikunj Bansal edited comment on SPARK-25303 at 9/3/18 6:41 AM: --- I have a potential fix for this and SPARK-25302 available. was (Author: nikunj): I have a potential fix for this and SPARK-25202 available. > A DStream that is checkpointed should allow its parent(s) to be removed and > not persisted > - > > Key: SPARK-25303 > URL: https://issues.apache.org/jira/browse/SPARK-25303 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Nikunj Bansal >Priority: Major > Labels: Streaming, streaming > > A checkpointed DStream is supposed to cut the lineage to its parent(s) such > that any persisted RDDs for the parent(s) are removed. However, combined with > the issue in SPARK-25302, they result in the Input Stream RDDs being > persisted a lot longer than they are actually required. > See also related bug SPARK-25302. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25311) `SPARK_LOCAL_HOSTNAME` unsupport IPV6 when do host checking
[ https://issues.apache.org/jira/browse/SPARK-25311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochen Ouyang updated SPARK-25311: Description: IPV4 can pass the following check {code:java} def checkHost(host: String, message: String = "") { assert(host.indexOf(':') == -1, message) } {code} But IPV6 check failed. was: IPV4 can pass the flowing check {code:java} def checkHost(host: String, message: String = "") { assert(host.indexOf(':') == -1, message) } {code} But IPV6 check failed. > `SPARK_LOCAL_HOSTNAME` unsupport IPV6 when do host checking > --- > > Key: SPARK-25311 > URL: https://issues.apache.org/jira/browse/SPARK-25311 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2 >Reporter: Xiaochen Ouyang >Priority: Major > > IPV4 can pass the following check > {code:java} > def checkHost(host: String, message: String = "") { > assert(host.indexOf(':') == -1, message) > } > {code} > But IPV6 check failed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
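The `checkHost` assertion quoted in SPARK-25311 rejects any host containing ':', which every bare IPv6 literal does. One possible IPv6-aware shape of the check, sketched in Python for illustration (this is an assumption about a fix, not Spark's actual change, which would live in its Scala `Utils`):

```python
import ipaddress

def check_host(host: str) -> None:
    """IPv6-aware variant of the naive `host.indexOf(':') == -1` assertion.

    A bare IPv6 literal such as "::1" legitimately contains colons, so only
    reject colon-containing strings that are NOT valid IPv6 addresses
    (e.g. a "host:port" pair, which the original check was meant to catch).
    """
    if ":" in host:
        # Raises ValueError for anything that is not a pure IPv6 literal.
        ipaddress.IPv6Address(host)

check_host("127.0.0.1")  # OK: no colon, passes as before
check_host("::1")        # OK: valid IPv6 literal, no longer rejected
try:
    check_host("myhost:8080")  # host:port must still be rejected
except ValueError:
    pass
```

The key design point is that the rejection criterion moves from "contains a colon" to "contains a colon but does not parse as IPv6", so IPv4 hosts and host:port pairs keep their old behavior.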