[jira] [Commented] (SPARK-20408) Get glob path in parallel to reduce resolve relation time
[ https://issues.apache.org/jira/browse/SPARK-20408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520976#comment-16520976 ] Apache Spark commented on SPARK-20408: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/21618 > Get glob path in parallel to reduce resolve relation time > - > > Key: SPARK-20408 > URL: https://issues.apache.org/jira/browse/SPARK-20408 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.0 > Reporter: Li Yuanjian > Priority: Major > > Reading from a datasource path with wildcards like the one below causes a long wait (each star may expand to 100~1000 files or directories). The problem is worse in a cross-region environment where the driver and HDFS are in different regions. > bq. spark.read.text("/log/product/201704/\*/\*/\*/\*") > The optimization strategy is the same as bulkListLeafFiles in InMemoryFileIndex: resolve the wildcard path components in parallel. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
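The strategy the ticket proposes can be sketched outside of Spark with the standard library: expand the first wildcard component, then list the remaining suffix under each matched directory concurrently. This is an illustration of the idea only, not Spark's InMemoryFileIndex/bulkListLeafFiles code.

```python
# Sketch of the proposed optimization: after expanding the first wildcard
# component, glob the remaining suffix under each matched directory on its
# own thread. On a high-latency filesystem (cross-region HDFS) the parallel
# listing is where the time is saved. NOT Spark's actual implementation.
from concurrent.futures import ThreadPoolExecutor
import glob
import os

def parallel_glob(pattern, max_workers=8):
    """Resolve a glob pattern, listing sibling directories in parallel."""
    parts = pattern.split(os.sep)
    for i, part in enumerate(parts):
        if any(c in part for c in "*?["):
            break
    else:
        # no wildcard at all: plain lookup
        return glob.glob(pattern)
    prefix = os.sep.join(parts[:i + 1]) or os.sep
    suffix = os.sep.join(parts[i + 1:])
    matches = glob.glob(prefix)
    if not suffix:
        return matches
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_dir = pool.map(lambda d: glob.glob(os.path.join(d, suffix)), matches)
    return [p for paths in per_dir for p in paths]
```

With four stars each expanding to hundreds of directories, the sequential version pays one round trip per directory level per directory; the parallel version overlaps them.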
[jira] [Commented] (SPARK-23710) Upgrade Hive to 2.3.2
[ https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520967#comment-16520967 ] Hyukjin Kwon commented on SPARK-23710: -- Yup, I agree that should be done first and that this ticket should target the next major version. I was trying to check what's affected here and whether it's possible to make a safer fix so that we can probably target 3.0.0. > Upgrade Hive to 2.3.2 > - > > Key: SPARK-23710 > URL: https://issues.apache.org/jira/browse/SPARK-23710 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.0 > Reporter: Yuming Wang > Priority: Critical > > h1. Main changes > * Maven dependency: > hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change > {{hive.classifier}} to {{core}} > calcite.version from {{1.2.0-incubating}} to {{1.10.0}} > datanucleus-core.version from {{3.2.10}} to {{4.1.17}} > remove {{orc.classifier}}, which means ORC uses {{hive.storage.api}}; see > ORC-174 > add new dependencies {{avatica}} and {{hive.storage.api}} > * ORC compatibility changes: > OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, > OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala > * hive-thriftserver java file updates: > update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2 > update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to > hive 2.3.2 > * Test suites to update: > ||TestSuite||Reason|| > |StatisticsSuite|HIVE-16098| > |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]| > |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, > SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to > hive-hcatalog-core-2.3.2.jar| > |SparkExecuteStatementOperationSuite|Interface changed from > org.apache.hive.service.cli.Type.NULL_TYPE to > org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE| > 
|ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo > change to com.esotericsoftware.kryo.Kryo| > |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") > to Seq("1.100\t1", "2.100\t2")| > |HiveOrcFilterSuite|Result format changed| > |HiveDDLSuite|Remove $ (This change needs to be reconsidered)| > |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: > org.datanucleus.identity.DatastoreIdImpl cannot be cast to > org.datanucleus.identity.OID| > * Other changes: > Close hive schema verification: > [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251] > and > [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58] > Update > [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192] > Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't > connect to Hive 1.x metastore, We should use > {{HiveMetaStoreClient.getDelegationToken}} instead of > {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}} > All changes can be found at > [PR-20659|https://github.com/apache/spark/pull/20659]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24634) Add a new metric regarding number of rows later than watermark
[ https://issues.apache.org/jira/browse/SPARK-24634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24634: Assignee: Apache Spark > Add a new metric regarding number of rows later than watermark > -- > > Key: SPARK-24634 > URL: https://issues.apache.org/jira/browse/SPARK-24634 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > Spark filters out late rows which are later than watermark while applying > operations which leverage window. While Spark exposes information regarding > watermark to StreamingQueryListener, there's no information regarding rows > being filtered out due to watermark. The information should help end users to > adjust watermark while operating their query. > We could expose metric regarding number of rows later than watermark and > being filtered out. It would be ideal to support side-output to consume late > rows, but it doesn't look like easy so addressing this first. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24634) Add a new metric regarding number of rows later than watermark
[ https://issues.apache.org/jira/browse/SPARK-24634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520950#comment-16520950 ] Apache Spark commented on SPARK-24634: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/21617 > Add a new metric regarding number of rows later than watermark > -- > > Key: SPARK-24634 > URL: https://issues.apache.org/jira/browse/SPARK-24634 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark filters out late rows which are later than watermark while applying > operations which leverage window. While Spark exposes information regarding > watermark to StreamingQueryListener, there's no information regarding rows > being filtered out due to watermark. The information should help end users to > adjust watermark while operating their query. > We could expose metric regarding number of rows later than watermark and > being filtered out. It would be ideal to support side-output to consume late > rows, but it doesn't look like easy so addressing this first. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24634) Add a new metric regarding number of rows later than watermark
[ https://issues.apache.org/jira/browse/SPARK-24634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24634: Assignee: (was: Apache Spark) > Add a new metric regarding number of rows later than watermark > -- > > Key: SPARK-24634 > URL: https://issues.apache.org/jira/browse/SPARK-24634 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark filters out late rows which are later than watermark while applying > operations which leverage window. While Spark exposes information regarding > watermark to StreamingQueryListener, there's no information regarding rows > being filtered out due to watermark. The information should help end users to > adjust watermark while operating their query. > We could expose metric regarding number of rows later than watermark and > being filtered out. It would be ideal to support side-output to consume late > rows, but it doesn't look like easy so addressing this first. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24634) Add a new metric regarding number of rows later than watermark
[ https://issues.apache.org/jira/browse/SPARK-24634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520948#comment-16520948 ] Jungtaek Lim commented on SPARK-24634: -- Working on this. Will submit a patch soon. > Add a new metric regarding number of rows later than watermark > -- > > Key: SPARK-24634 > URL: https://issues.apache.org/jira/browse/SPARK-24634 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Spark filters out late rows which are later than watermark while applying > operations which leverage window. While Spark exposes information regarding > watermark to StreamingQueryListener, there's no information regarding rows > being filtered out due to watermark. The information should help end users to > adjust watermark while operating their query. > We could expose metric regarding number of rows later than watermark and > being filtered out. It would be ideal to support side-output to consume late > rows, but it doesn't look like easy so addressing this first. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24634) Add a new metric regarding number of rows later than watermark
Jungtaek Lim created SPARK-24634: Summary: Add a new metric regarding number of rows later than watermark Key: SPARK-24634 URL: https://issues.apache.org/jira/browse/SPARK-24634 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jungtaek Lim Spark filters out late rows which are later than watermark while applying operations which leverage window. While Spark exposes information regarding watermark to StreamingQueryListener, there's no information regarding rows being filtered out due to watermark. The information should help end users to adjust watermark while operating their query. We could expose metric regarding number of rows later than watermark and being filtered out. It would be ideal to support side-output to consume late rows, but it doesn't look like easy so addressing this first. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
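The metric being proposed can be illustrated with a minimal stand-in for the watermark filter. This is not Spark's implementation, and the field name is hypothetical; it only shows the shape of the information to be exposed: the rows kept, plus a count of rows dropped for being older than the watermark.

```python
# Stand-in (NOT Spark's implementation) for the proposed behavior: drop
# rows whose event time is older than the watermark, and count how many
# were dropped so the number can be reported next to the watermark itself.
def filter_late_rows(rows, watermark):
    """Return (kept_rows, num_late_rows_dropped)."""
    kept = [row for row in rows if row["event_time"] >= watermark]
    return kept, len(rows) - len(kept)
```

An end user who sees a persistently large dropped count would know their watermark delay is too tight for the lateness characteristics of their data, which is exactly the adjustment signal the ticket describes.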
[jira] [Created] (SPARK-24633) arrays_zip function's code generator splits input processing incorrectly
Bruce Robbins created SPARK-24633: - Summary: arrays_zip function's code generator splits input processing incorrectly Key: SPARK-24633 URL: https://issues.apache.org/jira/browse/SPARK-24633 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Environment: Mac OS High Sierra Reporter: Bruce Robbins This works: {noformat} scala> val df = spark.read.parquet("many_arrays_per_row") df: org.apache.spark.sql.DataFrame = [k0: array, k1: array ... 98 more fields] scala> df.selectExpr("arrays_zip(k0, k1, k2)").show(truncate=false) ++ |arrays_zip(k0, k1, k2) | ++ |[[6583, 1312, 7460], [668, 1626, 4129]] | |[[5415, 5251, 1514], [1631, 2224, 2553]]| ++ {noformat} If I add one more array to the parameter list, I get this: {noformat} scala> df.selectExpr("arrays_zip(k0, k1, k2, k3)").show(truncate=false) 18/06/22 18:06:41 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 92, Column 35: Unknown variable or type "scan_row_0" org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 92, Column 35: Unknown variable or type "scan_row_0" at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6521) at org.codehaus.janino.UnitCompiler.access$13100(UnitCompiler.java:212) at org.codehaus.janino.UnitCompiler$18.visitPackage(UnitCompiler.java:6133) .. much exception trace... 
18/06/22 18:06:41 WARN WholeStageCodegenExec: Whole-stage codegen disabled for plan (id=1): *(1) LocalLimit 21 +- *(1) Project [cast(arrays_zip(k0#375, k1#376, k2#387, k3#398) as string) AS arrays_zip(k0, k1, k2, k3)#619] +- *(1) FileScan parquet [k0#375,k1#376,k2#387,k3#398] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/many_arrays_per_row], PartitionFilters: [], PushedFilters: [], ReadSchema: struct,k1:array,k2:array,k3:array> ++ |arrays_zip(k0, k1, k2, k3) | ++ |[[6583, 1312, 7460, 3712], [668, 1626, 4129, 2815]] | |[[5415, 5251, 1514, 1580], [1631, 2224, 2553, 7555]]| ++ {noformat} I still got the answer! I add a 5th parameter: {noformat} scala> df.selectExpr("arrays_zip(k0, k1, k2, k3, k4)").show(truncate=false) 18/06/22 18:07:53 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 97, Column 35: Unknown variable or type "scan_row_0" org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 97, Column 35: Unknown variable or type "scan_row_0" at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6521) at org.codehaus.janino.UnitCompiler.access$13100(UnitCompiler.java:212) at org.codehaus.janino.UnitCompiler$18.visitPackage(UnitCompiler.java:6133) at org.codehaus.janino.UnitCompiler$18.visitPackage(UnitCompiler.java:6130) at org.codehaus.janino.Java$Package.accept(Java.java:4077) .. much exception trace... 
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 21: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 21: Unknown variable or type "i" at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scal\ a:1361) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1423) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1420) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) ... 31 more scala> {noformat} This time, no result. It looks like the generated code is expecting the input row to be in a parameter (either i or scan_row_x), but that parameter is not passed to the input handler function (see lines 069 and 073) {noformat} /* 069 */ private int getValuesAndCardinalities_0_1(ArrayData[] arrVals_0, int biggestCardinality_0) { /* 070 */ /* 071 */ /* 072 */ if (biggestCardinality_0 != -1) { /* 073 */ boolean isNull_6 = i.isNullAt(4); /* 074 */ ArrayData value_6 = isNull_6 ? /* 075 */ null : (i.getArray(4)); /* 076 */ if (!isNull_6) { /* 077
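For reference, the semantics arrays_zip should produce (independent of the codegen splitting bug) fit in a few lines of plain Python: an element-wise merge of N arrays into an array of structs (tuples here), padded with nulls up to the longest input. The "biggestCardinality" variable in the generated code corresponds to this maximum length.

```python
# Reference semantics of arrays_zip, written without Spark: merge N arrays
# element-wise into tuples, padding shorter arrays with None up to the
# length of the longest one (the "biggest cardinality").
def arrays_zip(*arrays):
    n = max((len(a) for a in arrays), default=0)
    return [tuple(a[i] if i < len(a) else None for a in arrays)
            for i in range(n)]
```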
[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520879#comment-16520879 ] Sivakumar commented on SPARK-24631: --- It's resolved. I was actually trying to query a view; recreating the view fixed it. > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit > Affects Versions: 2.2.1 > Reporter: Sivakumar > Priority: Major > > Getting the below error when executing a simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as it may truncate.{color} > I am not doing any transformations, I am just trying to query a table, but still I am getting the error. > I am getting this error only on the production cluster and only for a single table; other tables are running fine. > + more data, > val df=spark.sql("select * from table_name") > I am just trying to query the table. With other tables it runs fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from bigint to column_name#2525: smallint as it may truncate.{color} > That specific column has a Bigint datatype, but there were other tables with Bigint columns that ran fine. >
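The analyzer's refusal is about value range, not the query itself: a bigint (64-bit) value may not fit in a smallint (16-bit signed), so an unchecked downcast silently truncates. A plain-Python illustration of that truncation (not Spark's cast logic):

```python
# What an unchecked bigint -> smallint downcast would do: keep only the low
# 16 bits, reinterpreted as a signed integer. This is the silent data loss
# the AnalysisException guards against. Illustration only, not Spark code.
def to_smallint(value):
    """Truncate an integer to the 16-bit signed (smallint) range."""
    low = value & 0xFFFF
    return low - 0x10000 if low >= 0x8000 else low
```

In-range values survive the truncation unchanged; out-of-range values come back different. An explicit CAST in the query is the user opting in to that risk, which is why casting the column resolves the error.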
[jira] [Commented] (SPARK-23704) PySpark access of individual trees in random forest is slow
[ https://issues.apache.org/jira/browse/SPARK-23704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520832#comment-16520832 ] Seth Hendrickson commented on SPARK-23704: -- Instead of {code:java} model.trees[0].transform(test_feat).select('rowNum','probability'){code} Can you try {code:java} trees = model.trees trees[0].transform(test_feat).select('rowNum','probability'){code} And time only the second line? The first line actually calls into the JVM and creates new trees in Python. > PySpark access of individual trees in random forest is slow > --- > > Key: SPARK-23704 > URL: https://issues.apache.org/jira/browse/SPARK-23704 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.1 > Environment: PySpark 2.2.1 / Windows 10 >Reporter: Julian King >Priority: Minor > > Making predictions from a randomForestClassifier PySpark is much faster than > making predictions from an individual tree contained within the .trees > attribute. > In fact, the model.transform call without an action is more than 10x slower > for an individual tree vs the model.transform call for the random forest > model. > See > [https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark] > for example with timing. > Ideally: > * Getting a prediction from a single tree should be comparable to or faster > than getting predictions from the whole tree > * Getting all the predictions from all the individual trees should be > comparable in speed to getting the predictions from the random forest > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24532) HiveExternalCatalogVersionSuite should be resilient to missing versions
[ https://issues.apache.org/jira/browse/SPARK-24532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24532. Resolution: Won't Fix I've added documentation on the release docs about this test, so RMs know it has to be checked. I played with code for this (which I'll leave at https://github.com/vanzin/spark/tree/SPARK-24532), but it seems like too much code, and a little too brittle for my taste, to avoid a little work when a new release goes out. > HiveExternalCatalogVersionSuite should be resilient to missing versions > --- > > Key: SPARK-24532 > URL: https://issues.apache.org/jira/browse/SPARK-24532 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.1, 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > See SPARK-24531. > As part of our release process we clean up older releases from the mirror > network. That causes this test to start failing. > The test should be more resilient to this; either ignore releases that are > not found, or fallback to the ASF archive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520785#comment-16520785 ] vaquar khan commented on SPARK-24631: - You just need to cast the column explicitly; the issue will be resolved. > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit > Affects Versions: 2.2.1 > Reporter: Sivakumar > Priority: Major > > Getting the below error when executing a simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as it may truncate.{color} > I am not doing any transformations, I am just trying to query a table, but still I am getting the error. > I am getting this error only on the production cluster and only for a single table; other tables are running fine. > + more data, > val df=spark.sql("select * from table_name") > I am just trying to query the table. With other tables it runs fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from bigint to column_name#2525: smallint as it may truncate.{color} > That specific column has a Bigint datatype, but there were other tables with Bigint columns that ran fine. >
[jira] [Updated] (SPARK-22897) Expose stageAttemptId in TaskContext
[ https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-22897: --- Fix Version/s: 2.1.3 > Expose stageAttemptId in TaskContext > - > > Key: SPARK-22897 > URL: https://issues.apache.org/jira/browse/SPARK-22897 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > Fix For: 2.1.3, 2.2.2, 2.3.0 > > > Currently, there's no easy way for Executor to detect a new stage is launched > as stageAttemptId is missing. > I'd like to propose exposing stageAttemptId in TaskContext, and will send a > pr if community thinks it's a good thing. > cc [~cloud_fan] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24589) OutputCommitCoordinator may allow duplicate commits
[ https://issues.apache.org/jira/browse/SPARK-24589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-24589: --- Fix Version/s: 2.1.3 > OutputCommitCoordinator may allow duplicate commits > --- > > Key: SPARK-24589 > URL: https://issues.apache.org/jira/browse/SPARK-24589 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.3.1 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 2.1.3, 2.2.2, 2.3.2, 2.4.0 > > > This is a sibling bug to SPARK-24552. While investigating the source of that > bug, it was found that currently the output committer allows duplicate > commits when there are stage retries, and the task with the task attempt > number (one in each stage that currently has running tasks) try to commit > their output. > This can lead to duplicate data in the output. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried
[ https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520752#comment-16520752 ] Apache Spark commented on SPARK-24552: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/21616 > Task attempt numbers are reused when stages are retried > --- > > Key: SPARK-24552 > URL: https://issues.apache.org/jira/browse/SPARK-24552 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1 >Reporter: Ryan Blue >Priority: Blocker > > When stages are retried due to shuffle failures, task attempt numbers are > reused. This causes a correctness bug in the v2 data sources write path. > Data sources (both the original and v2) pass the task attempt to writers so > that writers can use the attempt number to track and clean up data from > failed or speculative attempts. In the v2 docs for DataWriterFactory, the > attempt number's javadoc states that "Implementations can use this attempt > number to distinguish writers of different task attempts." > When two attempts of a stage use the same (partition, attempt) pair, two > tasks can create the same data and attempt to commit. The commit coordinator > prevents both from committing and will abort the attempt that finishes last. > When using the (partition, attempt) pair to track data, the aborted task may > delete data associated with the (partition, attempt) pair. If that happens, > the data for the task that committed is also deleted as well, which is a > correctness bug. > For a concrete example, I have a data source that creates files in place > named with {{part---.}}. Because these > files are written in place, both tasks create the same file and the one that > is aborted deletes the file, leading to data corruption when the file is > added to the table. 
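The collision described above can be simulated in a few lines. The file name format here is hypothetical (the report's exact pattern is elided in the digest); what matters is that it is keyed only by (partition, attempt), so two tasks from different stage attempts produce the same path.

```python
# Minimal simulation of the correctness bug: two tasks from different stage
# attempts share (partition=0, attempt=0) and write the same in-place file.
# The retried task loses the commit race, is aborted, and its cleanup
# deletes the committed task's output. File name format is hypothetical.
def task_file(partition, attempt):
    # keyed only by (partition, attempt): the source of the collision
    return f"part-{partition:05d}-attempt-{attempt}.data"

def run_scenario():
    committed_files = set()
    # stage attempt 0: task (partition=0, attempt=0) writes and commits
    committed_files.add(task_file(0, 0))
    # stage attempt 1: the retried task reuses attempt number 0, is denied
    # the commit, and "cleans up" its file -- the very same path
    committed_files.discard(task_file(0, 0))
    return committed_files
```

Including the stage attempt (or a globally unique task id) in the key would keep the two paths distinct, which is the direction the linked fix takes.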
[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried
[ https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520742#comment-16520742 ] Apache Spark commented on SPARK-24552: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/21615 > Task attempt numbers are reused when stages are retried > --- > > Key: SPARK-24552 > URL: https://issues.apache.org/jira/browse/SPARK-24552 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1 >Reporter: Ryan Blue >Priority: Blocker > > When stages are retried due to shuffle failures, task attempt numbers are > reused. This causes a correctness bug in the v2 data sources write path. > Data sources (both the original and v2) pass the task attempt to writers so > that writers can use the attempt number to track and clean up data from > failed or speculative attempts. In the v2 docs for DataWriterFactory, the > attempt number's javadoc states that "Implementations can use this attempt > number to distinguish writers of different task attempts." > When two attempts of a stage use the same (partition, attempt) pair, two > tasks can create the same data and attempt to commit. The commit coordinator > prevents both from committing and will abort the attempt that finishes last. > When using the (partition, attempt) pair to track data, the aborted task may > delete data associated with the (partition, attempt) pair. If that happens, > the data for the task that committed is also deleted as well, which is a > correctness bug. > For a concrete example, I have a data source that creates files in place > named with {{part---.}}. Because these > files are written in place, both tasks create the same file and the one that > is aborted deletes the file, leading to data corruption when the file is > added to the table. 
[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520715#comment-16520715 ] nirav patel commented on SPARK-14922: - Hi Any updates on this? Is there any workaround meanwhile? Is it possible to append empty dataframe in those partitions? > Alter Table Drop Partition Using Predicate-based Partition Spec > --- > > Key: SPARK-14922 > URL: https://issues.apache.org/jira/browse/SPARK-14922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.2, 2.2.1 >Reporter: Xiao Li >Priority: Major > > Below is allowed in Hive, but not allowed in Spark. > {noformat} > alter table ptestfilter drop partition (c='US', d<'2') > {noformat} > This example is copied from drop_partitions_filter.q -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence
[ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-24632: - Assignee: Joseph K. Bradley > Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers > for persistence > -- > > Key: SPARK-24632 > URL: https://issues.apache.org/jira/browse/SPARK-24632 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > > This is a follow-up for [SPARK-17025], which allowed users to implement > Python PipelineStages in 3rd-party libraries, include them in Pipelines, and > use Pipeline persistence. This task is to make it easier for 3rd-party > libraries to have PipelineStages written in Java and then to use pyspark.ml > abstractions to create wrappers around those Java classes. This is currently > possible, except that users hit bugs around persistence. > Some fixes we'll need include: > * an overridable method for converting between Python and Java classpaths. > See > https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284 > * > https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378 > One unusual thing for this task will be to write unit tests which test a > custom PipelineStage written outside of the pyspark package. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence
[ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24632: -- Description: This is a follow-up for [SPARK-17025], which allowed users to implement Python PipelineStages in 3rd-party libraries, include them in Pipelines, and use Pipeline persistence. This task is to make it easier for 3rd-party libraries to have PipelineStages written in Java and then to use pyspark.ml abstractions to create wrappers around those Java classes. This is currently possible, except that users hit bugs around persistence. Some fixes we'll need include: * an overridable method for converting between Python and Java classpaths. See https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284 * https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378 One unusual thing for this task will be to write unit tests which test a custom PipelineStage written outside of the pyspark package. was: This is a follow-up for [SPARK-17025], which allowed users to implement Python PipelineStages in 3rd-party libraries, include them in Pipelines, and use Pipeline persistence. This task is to make it easier for 3rd-party libraries to have PipelineStages written in Java and then to use pyspark.ml abstractions to create wrappers around those Java classes. This is currently possible, except that users hit bugs around persistence. One fix we'll need is an overridable method for converting between Python and Java classpaths. See https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284 One unusual thing for this task will be to write unit tests which test a custom PipelineStage written outside of the pyspark package. 
> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers > for persistence > -- > > Key: SPARK-24632 > URL: https://issues.apache.org/jira/browse/SPARK-24632 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > This is a follow-up for [SPARK-17025], which allowed users to implement > Python PipelineStages in 3rd-party libraries, include them in Pipelines, and > use Pipeline persistence. This task is to make it easier for 3rd-party > libraries to have PipelineStages written in Java and then to use pyspark.ml > abstractions to create wrappers around those Java classes. This is currently > possible, except that users hit bugs around persistence. > Some fixes we'll need include: > * an overridable method for converting between Python and Java classpaths. > See > https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284 > * > https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378 > One unusual thing for this task will be to write unit tests which test a > custom PipelineStage written outside of the pyspark package. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
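The first bullet above — an overridable method for converting between Python and Java class paths — can be sketched as follows. The class, method, and package names are hypothetical illustrations, not the actual pyspark.ml internals:

```python
class JavaMLPersistable:
    """Sketch of a wrapper base class whose Python -> Java name mapping a
    3rd-party library can override (hypothetical, not pyspark's API)."""

    @classmethod
    def _java_class_name(cls):
        # Default rule, shown schematically: map the Python module path of
        # Spark's own wrappers into the org.apache.spark JVM package.
        return cls.__module__.replace("pyspark", "org.apache.spark") + "." + cls.__name__


class ThirdPartyModel(JavaMLPersistable):
    """A 3rd-party Java-backed stage points the mapping at its own JVM package."""

    @classmethod
    def _java_class_name(cls):
        return "com.example.ml." + cls.__name__
```

Persistence code could then look up the JVM class through this hook instead of hard-coding the `org.apache.spark` prefix, which is what currently breaks 3rd-party Java wrappers.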
[jira] [Resolved] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-21926. --- Resolution: Fixed Fix Version/s: 2.3.0 Marking fix version as 2.3.0 since the issues were fixed in 2.3, even though tests were not completed in time for 2.3. > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian >Priority: Major > Fix For: 2.3.0 > > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > More details here SPARK-22346. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > -b) Allow user to set the cardinality of OneHotEncoder.- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24465: -- Description: Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with Structured Streaming. This task is to add unit tests for streaming (as in [SPARK-22644]) for LSHModels after [SPARK-12878] has been fixed. was: Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with Structured Streaming (and I believe are the final Transformers which are not compatible). These do not work because Spark SQL does not support nested types containing UDTs, but LSH . This task is to add unit tests for streaming (as in [SPARK-22644]) for LSHModels after [SPARK-12878] has been fixed. > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, > MinHashLSHModel) are not compatible with Structured Streaming. > This task is to add unit tests for streaming (as in [SPARK-22644]) for > LSHModels after [SPARK-12878] has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520677#comment-16520677 ] Joseph K. Bradley edited comment on SPARK-24465 at 6/22/18 6:39 PM: Oh actually I think I made this by mistake or forgot to update it? I fixed this in [SPARK-22883] was (Author: josephkb): Oh actually I think I made this by mistake? I fixed this in [SPARK-22883] > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, > MinHashLSHModel) are not compatible with Structured Streaming (and I believe > are the final Transformers which are not compatible). These do not work > because Spark SQL does not support nested types containing UDTs, but LSH . > This task is to add unit tests for streaming (as in [SPARK-22644]) for > LSHModels after [SPARK-12878] has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520677#comment-16520677 ] Joseph K. Bradley commented on SPARK-24465: --- Oh actually I think I made this by mistake? I fixed this in [SPARK-22883] > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, > MinHashLSHModel) are not compatible with Structured Streaming (and I believe > are the final Transformers which are not compatible). These do not work > because Spark SQL does not support nested types containing UDTs, but LSH . > This task is to add unit tests for streaming (as in [SPARK-22644]) for > LSHModels after [SPARK-12878] has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-24465. --- Resolution: Fixed Assignee: Joseph K. Bradley Fix Version/s: 2.3.1 2.4.0 > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, > MinHashLSHModel) are not compatible with Structured Streaming (and I believe > are the final Transformers which are not compatible). These do not work > because Spark SQL does not support nested types containing UDTs, but LSH . > This task is to add unit tests for streaming (as in [SPARK-22644]) for > LSHModels after [SPARK-12878] has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24465: -- Description: Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with Structured Streaming (and I believe are the final Transformers which are not compatible). These do not work because Spark SQL does not support nested types containing UDTs, but LSH . This task is to add unit tests for streaming (as in [SPARK-22644]) for LSHModels after [SPARK-12878] has been fixed. was: Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with Structured Streaming (and I believe are the final Transformers which are not compatible). These do not work because Spark SQL does not support nested types containing UDTs; see [SPARK-12878]. This task is to add unit tests for streaming (as in [SPARK-22644]) for LSHModels after [SPARK-12878] has been fixed. > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, > MinHashLSHModel) are not compatible with Structured Streaming (and I believe > are the final Transformers which are not compatible). These do not work > because Spark SQL does not support nested types containing UDTs, but LSH . > This task is to add unit tests for streaming (as in [SPARK-22644]) for > LSHModels after [SPARK-12878] has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520671#comment-16520671 ] Joseph K. Bradley commented on SPARK-24465: --- You're right; I did not read [SPARK-12878] carefully enough. I'll update this JIRA's description to be more specific. > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, > MinHashLSHModel) are not compatible with Structured Streaming (and I believe > are the final Transformers which are not compatible). These do not work > because Spark SQL does not support nested types containing UDTs; see > [SPARK-12878]. > This task is to add unit tests for streaming (as in [SPARK-22644]) for > LSHModels after [SPARK-12878] has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12878) Dataframe fails with nested User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12878: -- Description: Spark 1.6.0 crashes when using nested User Defined Types in a Dataframe. In version 1.5.2 the code below worked just fine: {code} import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions.GenericMutableRow import org.apache.spark.sql.catalyst.util.GenericArrayData import org.apache.spark.sql.types._ @SQLUserDefinedType(udt = classOf[AUDT]) case class A(list:Seq[B]) class AUDT extends UserDefinedType[A] { override def sqlType: DataType = StructType(Seq(StructField("list", ArrayType(BUDT, containsNull = false), nullable = true))) override def userClass: Class[A] = classOf[A] override def serialize(obj: Any): Any = obj match { case A(list) => val row = new GenericMutableRow(1) row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray)) row } override def deserialize(datum: Any): A = { datum match { case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq) } } } object AUDT extends AUDT @SQLUserDefinedType(udt = classOf[BUDT]) case class B(text:Int) class BUDT extends UserDefinedType[B] { override def sqlType: DataType = StructType(Seq(StructField("num", IntegerType, nullable = false))) override def userClass: Class[B] = classOf[B] override def serialize(obj: Any): Any = obj match { case B(text) => val row = new GenericMutableRow(1) row.setInt(0, text) row } override def deserialize(datum: Any): B = { datum match { case row: InternalRow => new B(row.getInt(0)) } } } object BUDT extends BUDT object Test { def main(args:Array[String]) = { val col = Seq(new A(Seq(new B(1), new B(2))), new A(Seq(new B(3), new B(4)))) val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("TestSpark")) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val df = sc.parallelize(1 to 2 zip col).toDF("id","b") 
df.select("b").show() df.collect().foreach(println) } } {code} In the new version (1.6.0) I needed to include the following import: `import org.apache.spark.sql.catalyst.expressions.GenericMutableRow` However, Spark crashes in runtime: {code} 16/01/18 14:36:22 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:51) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:248) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51) at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$10.next(Iterator.scala:312) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java
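The crash above involves one UDT nested inside another. The round trip the Scala report attempts can be sketched in plain Python to make the recursive structure explicit — illustrative only, not the Spark UDT API:

```python
# B serializes to a one-field row; A serializes to a row holding an array of
# serialized Bs. Deserialization must recurse through both layers -- the
# nesting that fails for UDTs in the report above.

def serialize_b(text):                    # B(text: Int) -> row
    return {"num": text}

def deserialize_b(row):
    return row["num"]

def serialize_a(bs):                      # A(list: Seq[B]) -> row with nested array
    return {"list": [serialize_b(b) for b in bs]}

def deserialize_a(row):
    return [deserialize_b(r) for r in row["list"]]

round_tripped = deserialize_a(serialize_a([1, 2]))
```

The outer `ClassCastException` in the trace corresponds to the engine handing the inner serializer something other than the row shape it expects once a UDT appears inside another UDT's array field.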
[jira] [Commented] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries
[ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520666#comment-16520666 ] Joseph K. Bradley commented on SPARK-19498: --- Sure, comments are welcome! Or links to JIRAs, whichever are easier. > Discussion: Making MLlib APIs extensible for 3rd party libraries > > > Key: SPARK-19498 > URL: https://issues.apache.org/jira/browse/SPARK-19498 > Project: Spark > Issue Type: Brainstorming > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Critical > > Per the recent discussion on the dev list, this JIRA is for discussing how we > can make MLlib DataFrame-based APIs more extensible, especially for the > purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs > (for custom Transformers, Estimators, etc.). > * For people who have written such libraries, what issues have you run into? > * What APIs are not public or extensible enough? Do they require changes > before being made more public? > * Are APIs for non-Scala languages such as Java and Python friendly or > extensive enough? > The easy answer is to make everything public, but that would be terrible of > course in the long-term. Let's discuss what is needed and how we can present > stable, sufficient, and easy-to-use APIs for 3rd-party developers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520657#comment-16520657 ] Jayesh lalwani commented on SPARK-22666: I'll try to take this on > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Priority: Major > > The current API for the new image format is implemented as a standalone > feature, in order to make it reside within the mllib package. As discussed in > SPARK-21866, users should be able to load images through the more common > spark source reader interface. > This ticket is concerned with adding image reading support in the spark > source API, through either of the following interfaces: > - {{spark.read.format("image")...}} > - {{spark.read.image}} > The output is a dataframe that contains images (and the file names for > example), following the semantics discussed already in SPARK-21866. > A few technical notes: > * since the functionality is implemented in {{mllib}}, calling this function > may fail at runtime if users have not imported the {{spark-mllib}} dependency > * How to deal with very flat directories? It is common to have millions of > files in a single "directory" (like in S3), which seems to have caused some > issues to some users. If this issue is too complex to handle in this ticket, > it can be dealt with separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
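The flat-directory concern in the ticket can be illustrated with a small stdlib sketch that builds (file name, raw bytes) rows from a single directory — purely illustrative, not the proposed {{spark.read.format("image")}} API:

```python
import os
import tempfile

def image_rows(directory, extensions=(".png", ".jpg")):
    """Yield (name, raw_bytes) rows for image-like files in one flat
    directory -- the case the ticket flags as potentially holding
    millions of entries (e.g. on S3)."""
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_file() and entry.name.endswith(extensions):
                with open(entry.path, "rb") as f:
                    yield entry.name, f.read()

# Demo against a temporary directory holding fake files.
tmp = tempfile.mkdtemp()
for name in ("a.png", "b.jpg", "notes.txt"):
    with open(os.path.join(tmp, name), "wb") as f:
        f.write(b"\x89fake")

rows = list(image_rows(tmp))  # notes.txt is filtered out
```

A single eager `scandir` like this is exactly what becomes expensive at millions of entries, which is why the ticket leaves the very-flat-directory case open for separate treatment.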
[jira] [Issue Comment Deleted] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17025: -- Comment: was deleted (was: Thank you for your e-mail. I am on business travel until Monday 2nd July so expect delays in my response. Many thanks, Pete Dr Peter Knight Sr Staff Analytics Engineer| UK Data Science GE Aviation AutoExtReply ) > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Assignee: Ajay Saini >Priority: Minor > Fix For: 2.3.0 > > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-17025. --- Resolution: Fixed Fix Version/s: 2.3.0 Fixed by linked JIRAs > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > Fix For: 2.3.0 > > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. 
[like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-17025: - Assignee: Ajay Saini > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Assignee: Ajay Saini >Priority: Minor > Fix For: 2.3.0 > > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. 
[like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence
Joseph K. Bradley created SPARK-24632: - Summary: Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence Key: SPARK-24632 URL: https://issues.apache.org/jira/browse/SPARK-24632 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.4.0 Reporter: Joseph K. Bradley This is a follow-up for [SPARK-17025], which allowed users to implement Python PipelineStages in 3rd-party libraries, include them in Pipelines, and use Pipeline persistence. This task is to make it easier for 3rd-party libraries to have PipelineStages written in Java and then to use pyspark.ml abstractions to create wrappers around those Java classes. This is currently possible, except that users hit bugs around persistence. One fix we'll need is an overridable method for converting between Python and Java classpaths. See https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284 One unusual thing for this task will be to write unit tests which test a custom PipelineStage written outside of the pyspark package. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520648#comment-16520648 ] Peter Knight commented on SPARK-17025: -- Thank you for your e-mail. I am on business travel until Monday 2nd July so expect delays in my response. Many thanks, Pete Dr Peter Knight Sr Staff Analytics Engineer| UK Data Science GE Aviation AutoExtReply > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
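The failure mode in the traceback above can be reproduced in miniature without Spark: persistence through the Java writer duck-types every pipeline stage as a wrapper around a JVM object. A rough sketch of that pattern (class bodies hypothetical, not the actual pyspark code):

```python
class JavaWrappedStage:
    """A stage backed by a JVM object, as pyspark.ml's built-in stages are."""
    def _to_java(self):
        return "<py4j handle>"  # placeholder for the real Java object

class PeoplePairFeaturizer:
    """A pure-Python custom transformer: it has no _to_java, so the
    Java persistence path fails exactly as in the traceback."""

def pipeline_to_java(stages):
    # Every stage is assumed to wrap a Java object; a pure-Python stage
    # raises AttributeError here.
    return [stage._to_java() for stage in stages]

try:
    pipeline_to_java([JavaWrappedStage(), PeoplePairFeaturizer()])
except AttributeError as e:
    print(e)  # -> 'PeoplePairFeaturizer' object has no attribute '_to_java'
```

The fix direction discussed in the thread is a pure-Python persistence path for custom stages, so that a stage does not need a Java counterpart at all.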
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520646#comment-16520646 ] Joseph K. Bradley commented on SPARK-17025: --- We've tested it with Python-only implementations, and it works. You end up hitting problems if you use the (sort of private) Python abstractions for Java wrappers like JavaMLWritable. I'll create a follow-up issue for that and close this one as fixed. > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)
[ https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520635#comment-16520635 ] Joseph K. Bradley commented on SPARK-4591: -- There are still a few contained tasks which are incomplete. I'd like to leave this open for now. > Algorithm/model parity for spark.ml (Scala) > --- > > Key: SPARK-4591 > URL: https://issues.apache.org/jira/browse/SPARK-4591 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > This is an umbrella JIRA for porting spark.mllib implementations to use the > DataFrame-based API defined under spark.ml. We want to achieve critical > feature parity for the next release. > h3. Instructions for 3 subtask types > *Review tasks*: detailed review of a subpackage to identify feature gaps > between spark.mllib and spark.ml. > * Should be listed as a subtask of this umbrella. > * Review subtasks cover major algorithm groups. To pick up a review subtask, > please: > ** Comment that you are working on it. > ** Compare the public APIs of spark.ml vs. spark.mllib. > ** Comment on all missing items within spark.ml: algorithms, models, methods, > features, etc. > ** Check for existing JIRAs covering those items. If there is no existing > JIRA, create one, and link it to your comment. > *Critical tasks*: higher priority missing features which are required for > this umbrella JIRA. > * Should be linked as "requires" links. > *Other tasks*: lower priority missing features which can be completed after > the critical tasks. > * Should be linked as "contains" links. > h4. Excluded items > This does *not* include: > * Python: We can compare Scala vs. Python in spark.ml itself. > * Moving linalg to spark.ml: [SPARK-13944] > * Streaming ML: Requires stabilizing some internal APIs of structured > streaming first > h3. 
TODO list > *Critical issues* > * [SPARK-14501]: Frequent Pattern Mining > * [SPARK-14709]: linear SVM > * [SPARK-15784]: Power Iteration Clustering (PIC) > *Lower priority issues* > * Missing methods within algorithms (see Issue Links below) > * evaluation submodule > * stat submodule (should probably be covered in DataFrames) > * Developer-facing submodules: > ** optimization (including [SPARK-17136]) > ** random, rdd > ** util > *To be prioritized* > * single-instance prediction: [SPARK-10413] > * pmml [SPARK-11171] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11107) spark.ml should support more input column types: umbrella
[ https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520634#comment-16520634 ] Joseph K. Bradley commented on SPARK-11107: --- There are still lots of Transformers and Estimators which should support more types, but I'm OK closing this since I don't have time to work on it. > spark.ml should support more input column types: umbrella > - > > Key: SPARK-11107 > URL: https://issues.apache.org/jira/browse/SPARK-11107 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley >Priority: Major > > This is an umbrella for expanding the set of data types which spark.ml > Pipeline stages can take. This should not involve breaking APIs, but merely > involve slight changes such as supporting all Numeric types instead of just > Double. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11107) spark.ml should support more input column types: umbrella
[ https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11107. --- Resolution: Done > spark.ml should support more input column types: umbrella > - > > Key: SPARK-11107 > URL: https://issues.apache.org/jira/browse/SPARK-11107 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley >Priority: Major > > This is an umbrella for expanding the set of data types which spark.ml > Pipeline stages can take. This should not involve breaking APIs, but merely > involve slight changes such as supporting all Numeric types instead of just > Double. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
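The kind of loosening the umbrella describes can be sketched abstractly. A minimal illustration (type names as Spark SQL spells them; the validation logic is simplified, not the real schema-check code):

```python
# Spark SQL numeric type names; a stage could accept any of these and
# cast to double internally, with no API break.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint",
                 "float", "double", "decimal"}

def check_input_column(dtype, strict=False):
    # strict=True models the old behavior (only 'double' accepted);
    # strict=False models the umbrella's goal (any numeric type).
    if strict:
        return dtype == "double"
    return dtype in NUMERIC_TYPES

print(check_input_column("int"))        # True: accepted after the change
print(check_input_column("int", True))  # False: rejected before it
```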
[jira] [Resolved] (SPARK-24372) Create script for preparing RCs
[ https://issues.apache.org/jira/browse/SPARK-24372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-24372. --- Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.4.0 > Create script for preparing RCs > --- > > Key: SPARK-24372 > URL: https://issues.apache.org/jira/browse/SPARK-24372 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0 > > > Currently, when preparing RCs, the RM has to invoke many scripts manually, > make sure that is being done in the correct environment, and set all the > correct environment variables, which differ from one script to the other. > It will be much easier for RMs if all that was automated as much as possible. > I'm working on something like this as part of releasing 2.3.1, and plan to > send my scripts for review after the release is done (i.e. after I make sure > the scripts are working properly). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24518) Using Hadoop credential provider API to store password
[ https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24518. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21548 [https://github.com/apache/spark/pull/21548] > Using Hadoop credential provider API to store password > -- > > Key: SPARK-24518 > URL: https://issues.apache.org/jira/browse/SPARK-24518 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.4.0 > > > Spark currently configures passwords in plaintext, for example in the > configuration file or as launch arguments. Some of these, such as the SSL > password, are set by the cluster admin and should not be visible to users, > yet today they are world-readable by all users. Hadoop's credential provider > API supports storing passwords securely, and Spark could read them in a > secure way, so this proposes adding support for retrieving passwords via the > credential provider API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
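Assuming the standard Hadoop credential provider workflow (the commands and the `hadoop.security.credential.provider.path` property are Hadoop's, not introduced by this patch; the alias and keystore path below are examples), usage would look roughly like:

```shell
# Store the SSL key password in a JCEKS keystore instead of plaintext config.
hadoop credential create spark.ssl.keyPassword \
  -provider jceks://hdfs/user/spark/credentials.jceks

# Point Spark's Hadoop configuration at the provider so the password can be
# resolved securely at runtime (spark.hadoop.* forwards to Hadoop config).
spark-submit \
  --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/spark/credentials.jceks \
  ...
```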
[jira] [Assigned] (SPARK-24518) Using Hadoop credential provider API to store password
[ https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24518: -- Assignee: Saisai Shao > Using Hadoop credential provider API to store password > -- > > Key: SPARK-24518 > URL: https://issues.apache.org/jira/browse/SPARK-24518 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.4.0 > > > Spark currently configures passwords in plaintext, for example in the > configuration file or as launch arguments. Some of these, such as the SSL > password, are set by the cluster admin and should not be visible to users, > yet today they are world-readable by all users. Hadoop's credential provider > API supports storing passwords securely, and Spark could read them in a > secure way, so this proposes adding support for retrieving passwords via the > credential provider API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down
[ https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520587#comment-16520587 ] Dilip Biswal commented on SPARK-24130: -- [~Shurap1] We are currently waiting for feedback from the community on how to proceed. I think we need a V2 implementation of JDBC datasource before we can proceed on the pushdown. > Data Source V2: Join Push Down > -- > > Key: SPARK-24130 > URL: https://issues.apache.org/jira/browse/SPARK-24130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jia Li >Priority: Major > Attachments: Data Source V2 Join Push Down.pdf > > > Spark applications often directly query external data sources such as > relational databases, or files. Spark provides Data Sources APIs for > accessing structured data through Spark SQL. Data Sources APIs in both V1 and > V2 support optimizations such as Filter push down and Column pruning which > are subset of the functionality that can be pushed down to some data sources. > We’re proposing to extend Data Sources APIs with join push down (JPD). Join > push down significantly improves query performance by reducing the amount of > data transfer and exploiting the capabilities of the data sources such as > index access. > Join push down design document is available > [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-24130) Data Source V2: Join Push Down
[ https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-24130: - Comment: was deleted (was: [~Shurap1] We are currently waiting for feedback from the community on how to proceed. I think we need a V2 implementation of JDBC datasource before we can proceed on the pushdown.) > Data Source V2: Join Push Down > -- > > Key: SPARK-24130 > URL: https://issues.apache.org/jira/browse/SPARK-24130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jia Li >Priority: Major > Attachments: Data Source V2 Join Push Down.pdf > > > Spark applications often directly query external data sources such as > relational databases, or files. Spark provides Data Sources APIs for > accessing structured data through Spark SQL. Data Sources APIs in both V1 and > V2 support optimizations such as Filter push down and Column pruning which > are subset of the functionality that can be pushed down to some data sources. > We’re proposing to extend Data Sources APIs with join push down (JPD). Join > push down significantly improves query performance by reducing the amount of > data transfer and exploiting the capabilities of the data sources such as > index access. > Join push down design document is available > [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down
[ https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520581#comment-16520581 ] Dilip Biswal commented on SPARK-24130: -- [~Shurap1] We are currently waiting for feedback from the community on how to proceed. I think we need a V2 implementation of JDBC datasource before we can proceed on the pushdown. > Data Source V2: Join Push Down > -- > > Key: SPARK-24130 > URL: https://issues.apache.org/jira/browse/SPARK-24130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jia Li >Priority: Major > Attachments: Data Source V2 Join Push Down.pdf > > > Spark applications often directly query external data sources such as > relational databases, or files. Spark provides Data Sources APIs for > accessing structured data through Spark SQL. Data Sources APIs in both V1 and > V2 support optimizations such as Filter push down and Column pruning which > are subset of the functionality that can be pushed down to some data sources. > We’re proposing to extend Data Sources APIs with join push down (JPD). Join > push down significantly improves query performance by reducing the amount of > data transfer and exploiting the capabilities of the data sources such as > index access. > Join push down design document is available > [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24611) Clean up OutputCommitCoordinator
[ https://issues.apache.org/jira/browse/SPARK-24611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520578#comment-16520578 ] Marcelo Vanzin commented on SPARK-24611: One more: adjust the test so that it ensures that state is kept if multiple {{stageStart}} calls are made. > Clean up OutputCommitCoordinator > > > Key: SPARK-24611 > URL: https://issues.apache.org/jira/browse/SPARK-24611 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > > This is a follow up to SPARK-24589, to address some issues brought up during > review of the change: > - the DAGScheduler registers all stages with the coordinator, when at first > view only result stages need to. That would save memory in the driver. > - the coordinator can track task IDs instead of the internal "TaskIdentifier" > type it uses; that would also save some memory, and also be more accurate. > - {{TaskCommitDenied}} currently has a "job ID" when it's really a stage ID, > and it contains the task attempt number, when it should probably have the > task ID instead (like above). > The latter is an API breakage (in a class tagged as developer API, but > still), and also affects data written to event logs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
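The property that test should pin down can be modeled in isolation. A simplified sketch of the coordinator (illustrative Python, not the Scala implementation), showing that a repeated stageStart must preserve, not reset, committer state:

```python
class CoordinatorModel:
    def __init__(self):
        self._authorized = {}  # stage id -> {partition: winning task id}

    def stage_start(self, stage_id):
        # setdefault keeps any existing entry: a second stageStart for the
        # same stage (e.g. on retry) must not wipe authorized committers.
        self._authorized.setdefault(stage_id, {})

    def can_commit(self, stage_id, partition, task_id):
        winners = self._authorized[stage_id]
        if partition not in winners:
            winners[partition] = task_id  # first asker wins the commit
        return winners[partition] == task_id

c = CoordinatorModel()
c.stage_start(1)
print(c.can_commit(1, 0, "taskA"))  # True: taskA is authorized
c.stage_start(1)                    # repeated stageStart
print(c.can_commit(1, 0, "taskB"))  # False: earlier state survived
```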
[jira] [Comment Edited] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520576#comment-16520576 ] vaquar khan edited comment on SPARK-24631 at 6/22/18 4:43 PM: -- Database type (MySQL, HBase, etc.), column descriptions (int, bigdecimal, etc.): any such details will help reproduce the issue. What type of connection are you using (JDBC)? Which database version and Spark version? was (Author: vaquar.k...@gmail.com): Database type (MYSQL,Hbase etc ), column descriptions (int,bigdecimal etc ) all possible detail will help to reproduce issue > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > + more data, > val df=spark.sql("select* from table_name") > I am just trying this query a table. But with other tables it is running fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: > org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from > bigint to column_name#2525: smallint as it may truncate.{color} > that specific column is having Bigint datatype, But there were other table's > that ran fine with Bigint columns. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520576#comment-16520576 ] vaquar khan commented on SPARK-24631: - Database type (MySQL, HBase, etc.), column descriptions (int, bigdecimal, etc.): any such details will help reproduce the issue. > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > + more data, > val df=spark.sql("select* from table_name") > I am just trying this query a table. But with other tables it is running fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: > org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from > bigint to column_name#2525: smallint as it may truncate.{color} > that specific column is having Bigint datatype, But there were other table's > that ran fine with Bigint columns. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24130) Data Source V2: Join Push Down
[ https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Li updated SPARK-24130: --- Attachment: Data Source V2 Join Push Down.pdf > Data Source V2: Join Push Down > -- > > Key: SPARK-24130 > URL: https://issues.apache.org/jira/browse/SPARK-24130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jia Li >Priority: Major > Attachments: Data Source V2 Join Push Down.pdf > > > Spark applications often directly query external data sources such as > relational databases, or files. Spark provides Data Sources APIs for > accessing structured data through Spark SQL. Data Sources APIs in both V1 and > V2 support optimizations such as Filter push down and Column pruning which > are subset of the functionality that can be pushed down to some data sources. > We’re proposing to extend Data Sources APIs with join push down (JPD). Join > push down significantly improves query performance by reducing the amount of > data transfer and exploiting the capabilities of the data sources such as > index access. > Join push down design document is available > [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520551#comment-16520551 ] Sivakumar commented on SPARK-24631: --- Updated with some additional data. I have tried the same in spark2-shell as well, But it is throwing the same error for that particular table. > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > + more data, > val df=spark.sql("select* from table_name") > I am just trying this query a table. But with other tables it is running fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: > org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from > bigint to column_name#2525: smallint as it may truncate.{color} > that specific column is having Bigint datatype, But there were other table's > that ran fine with Bigint columns. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
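The analyzer's complaint in the report above comes from a safe-up-cast check: an implicit cast is allowed only when it cannot lose information. A rough Python sketch of the rule for integral types (the real logic lives in Catalyst's Cast; this is an illustration, not that code):

```python
# Byte widths of Spark SQL's integral types.
WIDTH = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}

def can_up_cast(from_type, to_type):
    # An "up" cast must target an equal or wider type; bigint -> smallint
    # could truncate, hence the AnalysisException in the report.
    return WIDTH[from_type] <= WIDTH[to_type]

print(can_up_cast("smallint", "bigint"))  # True
print(can_up_cast("bigint", "smallint"))  # False
```

This also suggests why only one table is affected: the expected schema at that site must declare the column as smallint while the stored data is bigint.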
[jira] [Updated] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sivakumar updated SPARK-24631: -- Description: Getting the below error when executing the simple select query, Sample: Table Description: name: String, id: BigInt val df=spark.sql("select name,id from testtable") ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as it may truncate.{color} I am not doing any transformation's, I am just trying to query a table ,But still I am getting the error. I am getting this error only on production cluster and only for a single table, other tables are running fine. + more data, val df=spark.sql("select* from table_name") I am just trying this query a table. But with other tables it is running fine. {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from bigint to column_name#2525: smallint as it may truncate.{color} that specific column is having Bigint datatype, But there were other table's that ran fine with Bigint columns. was: Getting the below error when executing the simple select query, Sample: Table Description: name: String, id: BigInt val df=spark.sql("select name,id from testtable") ERROR: {color:#FF}Cannot up cast column "id" from bigint to smallint as it may truncate.{color} I am not doing any transformation's, I am just trying to query a table ,But still I am getting the error. I am getting this error only on production cluster and only for a single table, other tables are running fine. 
> Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#ff}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > + more data, > val df=spark.sql("select* from table_name") > I am just trying this query a table. But with other tables it is running fine. > {color:#d04437}18/06/22 01:36:29 ERROR Driver1: [] [main] Exception occurred: > org.apache.spark.sql.AnalysisException: Cannot up cast `column_name` from > bigint to column_name#2525: smallint as it may truncate.{color} > that specific column is having Bigint datatype, But there were other table's > that ran fine with Bigint columns. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520533#comment-16520533 ] vaquar khan edited comment on SPARK-24631 at 6/22/18 3:57 PM: -- Can you add complete error logs and, if possible, a small code example with test data (data you can find in your prod table where it's failing)? was (Author: vaquar.k...@gmail.com): Can you add complete error logs and if possible smalll code example with test data ( data you can find in your prod table where it's failing > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#FF}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
[ https://issues.apache.org/jira/browse/SPARK-24631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520533#comment-16520533 ] vaquar khan commented on SPARK-24631: - Can you add complete error logs and, if possible, a small code example with test data (data you can find in your prod table where it's failing)? > Cannot up cast column from bigint to smallint as it may truncate > > > Key: SPARK-24631 > URL: https://issues.apache.org/jira/browse/SPARK-24631 > Project: Spark > Issue Type: New JIRA Project > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Sivakumar >Priority: Major > > Getting the below error when executing the simple select query, > Sample: > Table Description: > name: String, id: BigInt > val df=spark.sql("select name,id from testtable") > ERROR: {color:#FF}Cannot up cast column "id" from bigint to smallint as > it may truncate.{color} > I am not doing any transformation's, I am just trying to query a table ,But > still I am getting the error. > I am getting this error only on production cluster and only for a single > table, other tables are running fine. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23710) Upgrade Hive to 2.3.2
[ https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520522#comment-16520522 ] Marcelo Vanzin commented on SPARK-23710: There are a few places in Spark that are affected by a Hive upgrade: - Hive serde support - Hive UD(*)F support - The thrift server The first two are for supporting Hive's API in Spark so people can keep using their serdes and udfs. The risk here is that we're crossing a Hive major version boundary, and things in the API may have been broken, and that would transitively affect Spark's API. In the real world that's already sort of a risk, though, because people might be running Hive 2 and thus have Hive 2 serdes in their tables, and Spark trying to read or write data to that table with an old version of the same serde could cause issues. I think switching to the Hive mainline is a good medium or long term goal, but that probably would require a major Spark version to be more palatable - and perhaps should be coupled with deprecation of some features so that we can isolate ourselves from Hive more. It's a bit risky in a minor version. In the short term my preference would be to either fix the fork, or go with Saisai's patch in HIVE-16391, which requires collaboration from the Hive side... > Upgrade Hive to 2.3.2 > - > > Key: SPARK-23710 > URL: https://issues.apache.org/jira/browse/SPARK-23710 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Critical > > h1. 
Mainly changes > * Maven dependency: > hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change > {{hive.classifier}} to {{core}} > calcite.version from {{1.2.0-incubating}} to {{1.10.0}} > datanucleus-core.version from {{3.2.10}} to {{4.1.17}} > remove {{orc.classifier}}, it means orc use the {{hive.storage.api}}, see: > ORC-174 > add new dependency {{avatica}} and {{hive.storage.api}} > * ORC compatibility changes: > OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, > OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala > * hive-thriftserver java file update: > update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2 > update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to > hive 2.3.2 > * TestSuite should update: > ||TestSuite||Reason|| > |StatisticsSuite|HIVE-16098| > |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]| > |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, > SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to > hive-hcatalog-core-2.3.2.jar| > |SparkExecuteStatementOperationSuite|Interface changed from > org.apache.hive.service.cli.Type.NULL_TYPE to > org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE| > |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo > change to com.esotericsoftware.kryo.Kryo| > |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") > to Seq("1.100\t1", "2.100\t2")| > |HiveOrcFilterSuite|Result format changed| > |HiveDDLSuite|Remove $ (This change needs to be reconsidered)| > |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: > org.datanucleus.identity.DatastoreIdImpl cannot be cast to > org.datanucleus.identity.OID| > * Other changes: > Close hive schema verification: > [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251] > and 
> [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58] > Update > [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192] > Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't > connect to Hive 1.x metastore, We should use > {{HiveMetaStoreClient.getDelegationToken}} instead of > {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}} > All changes can be found at > [PR-20659|https://github.com/apache/spark/pull/20659]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
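The Maven changes listed in this ticket amount to a small block of property bumps in the build. A sketch of what the updated properties could look like, using only the property names and versions quoted above (the exact structure and location in Spark's `pom.xml` may differ; see PR-20659 for the real diff):

```xml
<!-- sketch of the version bumps described in SPARK-23710 / PR-20659 -->
<properties>
  <hive.version>2.3.2</hive.version>                          <!-- was 1.2.1.spark2 -->
  <hive.classifier>core</hive.classifier>
  <calcite.version>1.10.0</calcite.version>                   <!-- was 1.2.0-incubating -->
  <datanucleus-core.version>4.1.17</datanucleus-core.version> <!-- was 3.2.10 -->
</properties>
```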
[jira] [Created] (SPARK-24631) Cannot up cast column from bigint to smallint as it may truncate
Sivakumar created SPARK-24631: - Summary: Cannot up cast column from bigint to smallint as it may truncate Key: SPARK-24631 URL: https://issues.apache.org/jira/browse/SPARK-24631 Project: Spark Issue Type: New JIRA Project Components: Spark Core, Spark Submit Affects Versions: 2.2.1 Reporter: Sivakumar Getting the below error when executing a simple select query, Sample: Table Description: name: String, id: BigInt val df=spark.sql("select name,id from testtable") ERROR: {color:#FF}Cannot up cast column "id" from bigint to smallint as it may truncate.{color} I am not doing any transformations, I am just trying to query a table, but still I am getting the error. I am getting this error only on the production cluster and only for a single table; other tables are running fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
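The "cannot up cast" check exists because a 64-bit bigint value may not fit in a 16-bit smallint, so an unchecked cast would silently corrupt data. A minimal Python illustration (using `ctypes`, no Spark involved) of the truncation Spark's analyzer refuses to risk:

```python
import ctypes

def to_smallint(value):
    """Reinterpret a Python int as a 16-bit signed integer,
    the way an unchecked bigint -> smallint cast would."""
    return ctypes.c_int16(value).value

print(to_smallint(123))    # fits in 16 bits, survives: 123
print(to_smallint(70000))  # keeps only the low 16 bits: 4464
```

In Spark itself the usual remedy is to align the expected schema (e.g. the Dataset's case class or target table column type) with the source column's bigint type rather than forcing a narrowing cast.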
[jira] [Comment Edited] (SPARK-24130) Data Source V2: Join Push Down
[ https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520461#comment-16520461 ] vaquar khan edited comment on SPARK-24130 at 6/22/18 3:02 PM: -- Could you please attach the doc in Jira instead of a Google doc? was (Author: vaquar.k...@gmail.com): Could you please update doc in Jira insted of google doc > Data Source V2: Join Push Down > -- > > Key: SPARK-24130 > URL: https://issues.apache.org/jira/browse/SPARK-24130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jia Li >Priority: Major > > Spark applications often directly query external data sources such as > relational databases, or files. Spark provides Data Sources APIs for > accessing structured data through Spark SQL. Data Sources APIs in both V1 and > V2 support optimizations such as Filter push down and Column pruning, which > are a subset of the functionality that can be pushed down to some data sources. > We’re proposing to extend Data Sources APIs with join push down (JPD). Join > push down significantly improves query performance by reducing the amount of > data transfer and exploiting the capabilities of the data sources such as > index access. > Join push down design document is available > [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down
[ https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520461#comment-16520461 ] vaquar khan commented on SPARK-24130: - Could you please update the doc in Jira instead of a Google doc? > Data Source V2: Join Push Down > -- > > Key: SPARK-24130 > URL: https://issues.apache.org/jira/browse/SPARK-24130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jia Li >Priority: Major > > Spark applications often directly query external data sources such as > relational databases, or files. Spark provides Data Sources APIs for > accessing structured data through Spark SQL. Data Sources APIs in both V1 and > V2 support optimizations such as Filter push down and Column pruning, which > are a subset of the functionality that can be pushed down to some data sources. > We’re proposing to extend Data Sources APIs with join push down (JPD). Join > push down significantly improves query performance by reducing the amount of > data transfer and exploiting the capabilities of the data sources such as > index access. > Join push down design document is available > [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24519) MapStatus has 2000 hardcoded
[ https://issues.apache.org/jira/browse/SPARK-24519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-24519. --- Resolution: Fixed Assignee: Hieu Tri Huynh Fix Version/s: 2.4.0 > MapStatus has 2000 hardcoded > > > Key: SPARK-24519 > URL: https://issues.apache.org/jira/browse/SPARK-24519 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Hieu Tri Huynh >Assignee: Hieu Tri Huynh >Priority: Minor > Fix For: 2.4.0 > > > MapStatus uses a hardcoded value of 2000 partitions to determine if it should > use the highly compressed map status. We should make it configurable to allow > users to more easily tune their jobs with respect to this without having > to modify their code to change the number of partitions. Note we can > leave this as an internal/undocumented config for now until we have more > advice for the users on how to set this config. > Some of my reasoning: > The config gives you a way to easily change something without the user having > to change code, redeploy the jar, and then run again. You can simply change the > config and rerun. It also allows for easier experimentation. Changing the # > of partitions has other side effects, whether good or bad is situation > dependent. It can make things worse: you could be increasing the # of output files when > you don't want to, and it affects the # of tasks needed and thus the executors to run > in parallel, etc. > There have been various talks about this number at Spark summits where people > have told customers to increase it to 2001 partitions. Note if you just do > a search for "spark 2000 partitions" you will find various things all talking > about this number. This shows that people are modifying their code to take > this into account, so it seems to me having this configurable would be better. > Once we have more advice for users we could expose this and document > information on it. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
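The hardcoded check described above is essentially a partition-count threshold deciding between two map-status encodings. A toy Python sketch (not Spark's actual Scala code; the config key shown is hypothetical) of moving the constant behind a config lookup, as the ticket proposes:

```python
DEFAULT_THRESHOLD = 2000  # the value currently hardcoded in MapStatus

def choose_map_status(num_partitions, conf=None):
    """Pick the map-status encoding the way MapStatus does, but with the
    2000 cutoff read from a (hypothetical) config key instead of hardcoded."""
    conf = conf or {}
    threshold = conf.get("spark.shuffle.mapstatus.threshold", DEFAULT_THRESHOLD)
    # more partitions than the threshold -> use the highly compressed encoding
    if num_partitions > threshold:
        return "HighlyCompressedMapStatus"
    return "CompressedMapStatus"

print(choose_map_status(2000))  # at the threshold: CompressedMapStatus
print(choose_map_status(2001))  # just past it: HighlyCompressedMapStatus
```

This is also why the "use 2001 partitions" folklore mentioned in the ticket works: it is the smallest count that crosses the hardcoded cutoff. With a config, the same effect needs no code or partitioning change.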
[jira] [Commented] (SPARK-24130) Data Source V2: Join Push Down
[ https://issues.apache.org/jira/browse/SPARK-24130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520386#comment-16520386 ] Parshuram V Patki commented on SPARK-24130: --- [~jliwork] do you think this improvement will make it to spark 2.4.0? > Data Source V2: Join Push Down > -- > > Key: SPARK-24130 > URL: https://issues.apache.org/jira/browse/SPARK-24130 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jia Li >Priority: Major > > Spark applications often directly query external data sources such as > relational databases, or files. Spark provides Data Sources APIs for > accessing structured data through Spark SQL. Data Sources APIs in both V1 and > V2 support optimizations such as Filter push down and Column pruning which > are subset of the functionality that can be pushed down to some data sources. > We’re proposing to extend Data Sources APIs with join push down (JPD). Join > push down significantly improves query performance by reducing the amount of > data transfer and exploiting the capabilities of the data sources such as > index access. > Join push down design document is available > [here|https://docs.google.com/document/d/1k-kRadTcUbxVfUQwqBbIXs_yPZMxh18-e-cz77O_TaE/edit?usp=sharing]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22897) Expose stageAttemptId in TaskContext
[ https://issues.apache.org/jira/browse/SPARK-22897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-22897: -- Fix Version/s: 2.2.2 > Expose stageAttemptId in TaskContext > - > > Key: SPARK-22897 > URL: https://issues.apache.org/jira/browse/SPARK-22897 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.2, 2.2.1 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > Fix For: 2.2.2, 2.3.0 > > > Currently, there's no easy way for Executor to detect a new stage is launched > as stageAttemptId is missing. > I'd like to propose exposing stageAttemptId in TaskContext, and will send a > pr if community thinks it's a good thing. > cc [~cloud_fan] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackey Lee updated SPARK-24630: --- Attachment: SQLStreaming SPIP.pdf > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP.pdf > > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming, can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > StructStreamig. > To support for SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jackey Lee updated SPARK-24630: --- Summary: SPIP: Support SQLStreaming in Spark (was: Support SQLStreaming in Spark) > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > > At present, KafkaSQL, Flink SQL(which is actually based on Calcite), > SQLStream, StormSQL all provide a stream type SQL interface, with which users > with little knowledge about streaming, can easily develop a flow system > processing model. In Spark, we can also support SQL API based on > StructStreamig. > To support for SQL Streaming, there are two key points: > 1, Analysis should be able to parse streaming type SQL. > 2, Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24630) Support SQLStreaming in Spark
Jackey Lee created SPARK-24630: -- Summary: Support SQLStreaming in Spark Key: SPARK-24630 URL: https://issues.apache.org/jira/browse/SPARK-24630 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1, 2.2.0 Reporter: Jackey Lee At present, KafkaSQL, Flink SQL (which is actually based on Calcite), SQLStream, and StormSQL all provide a stream-type SQL interface, with which users with little knowledge about streaming can easily develop a stream processing model. In Spark, we can also support a SQL API based on Structured Streaming. To support SQL Streaming, there are two key points: 1. Analysis should be able to parse streaming-type SQL. 2. The Analyzer should be able to map metadata information to the corresponding Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24629) thrift server memory leak when beeline connection quits
[ https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24629: Assignee: Apache Spark > thrift server memory leak when beeline connection quits > --- > > Key: SPARK-24629 > URL: https://issues.apache.org/jira/browse/SPARK-24629 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: StephenZou >Assignee: Apache Spark >Priority: Minor > Attachments: .png, .png > > > When Beeline connection closes, spark thrift server (STS) will send a session > close event, then the relevant listener in class HiveThriftServer2 will clean > its internal session states. > But the internal statement state is not updated, (see graph attached). So it > remains in state Compiled and doesn't be swept out. > Another graph is that after fixing it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24629) thrift server memory leak when beeline connection quits
[ https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24629: Assignee: (was: Apache Spark) > thrift server memory leak when beeline connection quits > --- > > Key: SPARK-24629 > URL: https://issues.apache.org/jira/browse/SPARK-24629 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: StephenZou >Priority: Minor > Attachments: .png, .png > > > When Beeline connection closes, spark thrift server (STS) will send a session > close event, then the relevant listener in class HiveThriftServer2 will clean > its internal session states. > But the internal statement state is not updated, (see graph attached). So it > remains in state Compiled and doesn't be swept out. > Another graph is that after fixing it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24629) thrift server memory leak when beeline connection quits
[ https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520219#comment-16520219 ] Apache Spark commented on SPARK-24629: -- User 'ChenjunZou' has created a pull request for this issue: https://github.com/apache/spark/pull/21613 > thrift server memory leak when beeline connection quits > --- > > Key: SPARK-24629 > URL: https://issues.apache.org/jira/browse/SPARK-24629 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: StephenZou >Priority: Minor > Attachments: .png, .png > > > When Beeline connection closes, spark thrift server (STS) will send a session > close event, then the relevant listener in class HiveThriftServer2 will clean > its internal session states. > But the internal statement state is not updated, (see graph attached). So it > remains in state Compiled and doesn't be swept out. > Another graph is that after fixing it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24629) thrift server memory leak when beeline connection quits
[ https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StephenZou updated SPARK-24629: --- Description: When Beeline connection closes, spark thrift server (STS) will send a session close event, then the relevant listener in class HiveThriftServer2 will clean its internal session states. But the internal statement state is not updated, (see graph attached). So it remains in state Compiled and doesn't be swept out. Another graph is that after fixing it. was: When Beeline connection closes, spark thrift server (STS) will send a session close event, then the relevant listener in class HiveThriftServer2 will clean its internal session states. But the internal statement state is not updated, (see graph attached). So it remains in state Compiled and doesn't be swept out. > thrift server memory leak when beeline connection quits > --- > > Key: SPARK-24629 > URL: https://issues.apache.org/jira/browse/SPARK-24629 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: StephenZou >Priority: Minor > Attachments: .png, .png > > > When Beeline connection closes, spark thrift server (STS) will send a session > close event, then the relevant listener in class HiveThriftServer2 will clean > its internal session states. > But the internal statement state is not updated, (see graph attached). So it > remains in state Compiled and doesn't be swept out. > Another graph is that after fixing it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24629) thrift server memory leak when beeline connection quits
[ https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StephenZou updated SPARK-24629: --- Attachment: .png > thrift server memory leak when beeline connection quits > --- > > Key: SPARK-24629 > URL: https://issues.apache.org/jira/browse/SPARK-24629 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: StephenZou >Priority: Minor > Attachments: .png, .png > > > When Beeline connection closes, spark thrift server (STS) will send a session > close event, then the relevant listener in class HiveThriftServer2 will clean > its internal session states. > But the internal statement state is not updated, (see graph attached). So it > remains in state Compiled and doesn't be swept out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24629) thrift server memory leak when beeline connection quits
[ https://issues.apache.org/jira/browse/SPARK-24629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StephenZou updated SPARK-24629: --- Attachment: .png > thrift server memory leak when beeline connection quits > --- > > Key: SPARK-24629 > URL: https://issues.apache.org/jira/browse/SPARK-24629 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: StephenZou >Priority: Minor > Attachments: .png > > > When Beeline connection closes, spark thrift server (STS) will send a session > close event, then the relevant listener in class HiveThriftServer2 will clean > its internal session states. > But the internal statement state is not updated, (see graph attached). So it > remains in state Compiled and doesn't be swept out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24629) thrift server memory leak when beeline connection quits
StephenZou created SPARK-24629: -- Summary: thrift server memory leak when beeline connection quits Key: SPARK-24629 URL: https://issues.apache.org/jira/browse/SPARK-24629 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.2.1 Reporter: StephenZou When a Beeline connection closes, the Spark thrift server (STS) sends a session close event, and the relevant listener in class HiveThriftServer2 cleans its internal session state. But the internal statement state is not updated (see attached graph), so it remains in state Compiled and is never swept out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
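The leak pattern described is a bookkeeping bug: the session-close handler removes the session entry but leaves the statements registered under it forever. A toy Python model of that pattern (class and method names are hypothetical, not the actual HiveThriftServer2 listener API):

```python
class ListenerState:
    """Toy model of a listener tracking sessions and their statements."""

    def __init__(self):
        self.sessions = {}    # session_id -> session info
        self.statements = {}  # statement_id -> (session_id, state)

    def open_session(self, session_id):
        self.sessions[session_id] = {}

    def start_statement(self, stmt_id, session_id):
        self.statements[stmt_id] = (session_id, "Compiled")

    def close_session_leaky(self, session_id):
        # the bug the ticket describes: only the session entry is cleaned,
        # so Compiled statements accumulate across connections
        self.sessions.pop(session_id, None)

    def close_session_fixed(self, session_id):
        # the fix: also sweep statements that belonged to the session
        self.sessions.pop(session_id, None)
        self.statements = {s: v for s, v in self.statements.items()
                           if v[0] != session_id}
```

With the leaky variant, every Beeline connect/query/quit cycle leaves one more entry in `statements`; the fixed variant returns the map to empty after each cycle.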
[jira] [Commented] (SPARK-24458) Invalid PythonUDF check_1(), requires attributes from more than one child
[ https://issues.apache.org/jira/browse/SPARK-24458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520165#comment-16520165 ] Ruben Berenguel commented on SPARK-24458: - [~hyukjin.kwon] I just built 2.3.0 from the tagged branch and it's still passing, but the branch points to 2.3.2. I tried building 2.2 from the branch 2.2, but the build is failing for me locally. What is the best way to roll back locally to older versions of Spark? > Invalid PythonUDF check_1(), requires attributes from more than one child > - > > Key: SPARK-24458 > URL: https://issues.apache.org/jira/browse/SPARK-24458 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Spark 2.3.0 (local mode) > Mac OSX >Reporter: Abdeali Kothari >Priority: Major > > I was trying out a very large query execution plan I have and I got the error: > > {code:java} > py4j.protocol.Py4JJavaError: An error occurred while calling > o359.simpleString. > : java.lang.RuntimeException: Invalid PythonUDF check_1(), requires > attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:182) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:181) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:181) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:87) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:87) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:187) > at > org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:187) > at sun.reflect.NativeMethodAccessorIm
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520161#comment-16520161 ] Takeshi Yamamuro commented on SPARK-24498: -- Yea, that might be true now. But I think we need to think more about compiler options and other conditions to get any benefit. Any comments and suggestions are welcome. Thanks for your comment! > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse
[ https://issues.apache.org/jira/browse/SPARK-20295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520148#comment-16520148 ] Yuming Wang commented on SPARK-20295: - [~KevinZwx] Can you try [https://github.com/Intel-bigdata/spark-adaptive]? > when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse > -- > > Key: SPARK-20295 > URL: https://issues.apache.org/jira/browse/SPARK-20295 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 2.1.0 >Reporter: Ruhui Wang >Priority: Major > > When running tpcds-q95 with spark.sql.adaptive.enabled = true, the physical > plan is initially: > Sort > : +- Exchange(coordinator id: 1) > : +- Project*** > ::-Sort ** > :: +- Exchange(coordinator id: 2) > :: :- Project *** > :+- Sort > :: +- Exchange(coordinator id: 3) > When spark.sql.exchange.reuse is enabled, the physical plan becomes: > Sort > : +- Exchange(coordinator id: 1) > : +- Project*** > ::-Sort ** > :: +- Exchange(coordinator id: 2) > :: :- Project *** > :+- Sort > :: +- ReusedExchange Exchange(coordinator id: 2) > If spark.sql.adaptive.enabled = true, the call path is: > ShuffleExchange#doExecute --> postShuffleRDD --> > doEstimationIfNecessary. In this function, the > assert(exchanges.length == numExchanges) fails, as the left side has only > one element while the right side equals 2. > Is this a bug in the interaction of spark.sql.adaptive.enabled and exchange > reuse? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24628) The example given to create a dense matrix using Python has a mistake
[ https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520141#comment-16520141 ] Apache Spark commented on SPARK-24628: -- User 'huangweizhe123' has created a pull request for this issue: https://github.com/apache/spark/pull/21612 > The example given to create a dense matrix using Python has a mistake > - > > Key: SPARK-24628 > URL: https://issues.apache.org/jira/browse/SPARK-24628 > Project: Spark > Issue Type: Documentation > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Priority: Minor > Original Estimate: 10m > Remaining Estimate: 10m
[jira] [Assigned] (SPARK-24628) The example given to create a dense matrix using Python has a mistake
[ https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24628: Assignee: (was: Apache Spark) > The example given to create a dense matrix using Python has a mistake > - > > Key: SPARK-24628 > URL: https://issues.apache.org/jira/browse/SPARK-24628 > Project: Spark > Issue Type: Documentation > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Priority: Minor > Original Estimate: 10m > Remaining Estimate: 10m
[jira] [Assigned] (SPARK-24628) The example given to create a dense matrix using Python has a mistake
[ https://issues.apache.org/jira/browse/SPARK-24628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24628: Assignee: Apache Spark > The example given to create a dense matrix using Python has a mistake > - > > Key: SPARK-24628 > URL: https://issues.apache.org/jira/browse/SPARK-24628 > Project: Spark > Issue Type: Documentation > Components: MLlib >Affects Versions: 2.3.1 >Reporter: Weizhe Huang >Assignee: Apache Spark >Priority: Minor > Original Estimate: 10m > Remaining Estimate: 10m
[jira] [Created] (SPARK-24628) The example given to create a dense matrix using Python has a mistake
Weizhe Huang created SPARK-24628: Summary: The example given to create a dense matrix using Python has a mistake Key: SPARK-24628 URL: https://issues.apache.org/jira/browse/SPARK-24628 Project: Spark Issue Type: Documentation Components: MLlib Affects Versions: 2.3.1 Reporter: Weizhe Huang
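SPARK-24628 concerns the Python docs example for building a dense matrix. The ticket does not spell out the mistake here; for context, `pyspark.mllib.linalg.Matrices.dense` reads its flat value list in column-major order, which is the usual pitfall in such examples. A minimal plain-Python sketch of that layout (pyspark itself is not imported; `dense_to_rows` is a hypothetical helper for illustration only):

```python
# Column-major layout used by pyspark.mllib.linalg.Matrices.dense, sketched
# in plain Python. Matrices.dense(2, 3, [1, 2, 3, 4, 5, 6]) reads the flat
# value list down each column, giving:
#   [[1, 3, 5],
#    [2, 4, 6]]

def dense_to_rows(num_rows, num_cols, values):
    """Rebuild the rows of a matrix stored as a column-major flat list."""
    assert len(values) == num_rows * num_cols
    return [[values[c * num_rows + r] for c in range(num_cols)]
            for r in range(num_rows)]

print(dense_to_rows(2, 3, [1, 2, 3, 4, 5, 6]))  # [[1, 3, 5], [2, 4, 6]]
```

Reading the same six values as a 3x2 matrix instead yields [[1, 4], [2, 5], [3, 6]], which is why a docs example that assumes row-major order displays the wrong matrix.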
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520125#comment-16520125 ] Marco Gaido commented on SPARK-24498: - Thanks for your great analysis [~maropu]! Very interesting. It seems there is no advantage in introducing a new compiler. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696
[jira] [Resolved] (SPARK-23603) When the length of the json is in a range, get_json_object will result in missing tail data
[ https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23603. -- Resolution: Duplicate 2.7.x has a regression, so we had to revert it back. See also https://github.com/apache/spark/pull/9759 > When the length of the json is in a range, get_json_object will result in > missing tail data > -- > > Key: SPARK-23603 > URL: https://issues.apache.org/jira/browse/SPARK-23603 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.2.0, 2.3.0 >Reporter: dzcxzl >Priority: Major > > Jackson (>= 2.7.7) fixes the possibility of missing tail data when the length > of the value is in a certain range > [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7] > [https://github.com/FasterXML/jackson-core/issues/307] > spark-shell: > {code:java} > val value = "x" * 3000 > val json = s"""{"big": "$value"}""" > spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect > res0: Array[org.apache.spark.sql.Row] = Array([2991]) > {code} > expected result: 3000 > actual result: 2991 > There are two solutions. > One is to > *Bump jackson from 2.6.7&2.6.7.1 to 2.7.7* > The other one is to > *Replace writeRaw(char[] text, int offset, int len) with writeRaw(String > text)*
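To make the expected behavior concrete outside Spark, a plain-Python round trip (standard library only, no Spark or Jackson involved) shows that nothing about the document itself forces truncation; Spark's get_json_object only came up short (2991) because of the Jackson writeRaw bug cited in the ticket:

```python
import json

# Build the same document as the spark-shell repro and extract "big".
# A standards-compliant parser returns the full 3000-character value.
value = "x" * 3000
doc = json.dumps({"big": value})
extracted = json.loads(doc)["big"]
print(len(extracted))  # 3000
```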
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520090#comment-16520090 ] Takeshi Yamamuro commented on SPARK-24498: -- If there are javac options related to performance, bytecode size, compile time, ..., please let me know. I'm currently looking into them. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696
[jira] [Resolved] (SPARK-23934) High-order function: map_from_entries(array<row<K, V>>) → map<K, V>
[ https://issues.apache.org/jira/browse/SPARK-23934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23934. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21282 https://github.com/apache/spark/pull/21282 > High-order function: map_from_entries(array<row<K, V>>) → map<K, V> > -- > > Key: SPARK-23934 > URL: https://issues.apache.org/jira/browse/SPARK-23934 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marek Novotny >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created from the given array of entries. > {noformat} > SELECT map_from_entries(ARRAY[(1, 'x'), (2, 'y')]); -- {1 -> 'x', 2 -> 'y'} > {noformat}
[jira] [Assigned] (SPARK-23934) High-order function: map_from_entries(array<row<K, V>>) → map<K, V>
[ https://issues.apache.org/jira/browse/SPARK-23934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23934: - Assignee: Marek Novotny > High-order function: map_from_entries(array<row<K, V>>) → map<K, V> > -- > > Key: SPARK-23934 > URL: https://issues.apache.org/jira/browse/SPARK-23934 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marek Novotny >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created from the given array of entries. > {noformat} > SELECT map_from_entries(ARRAY[(1, 'x'), (2, 'y')]); -- {1 -> 'x', 2 -> 'y'} > {noformat}
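The semantics quoted in the SPARK-23934 description can be sketched in plain Python. This is not Spark's implementation, just the shape of the transformation: an array of (key, value) entries becomes a map.

```python
# Sketch of map_from_entries semantics: each (key, value) entry in the
# input array becomes one entry in the resulting map.
def map_from_entries(entries):
    return dict(entries)

print(map_from_entries([(1, "x"), (2, "y")]))  # {1: 'x', 2: 'y'}
```

This mirrors the SQL example above: `map_from_entries(ARRAY[(1, 'x'), (2, 'y')])` yields the map {1 -> 'x', 2 -> 'y'}.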
[jira] [Created] (SPARK-24627) [Spark 2.3.0] After the HDFS token expires, kinit cannot submit a job using beeline
ABHISHEK KUMAR GUPTA created SPARK-24627: Summary: [Spark2.3.0] After HDFS Token expire kinit not able to submit job using beeline Key: SPARK-24627 URL: https://issues.apache.org/jira/browse/SPARK-24627 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: OS: SUSE11 Spark Version: 2.3.0 Hadoop: 2.8.3 Reporter: ABHISHEK KUMAR GUPTA Steps: beeline session was active. 1.Launch spark-beeline 2. create table alt_s1 (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) row format delimited fields terminated by ','; 3. load data local inpath '/opt/typeddata60.txt' into table alt_s1; 4. show tables;( Table listed successfully ) 5. select * from alt_s1; Throws HDFS_DELEGATION_TOKEN Exception 0: jdbc:hive2://10.18.18.214:23040/default> select * from alt_s1; Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 22.0 failed 4 times, most recent failure: Lost task 1.3 in stage 22.0 (TID 106, blr123110, executor 1): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 7 for spark) can't be found in cache at org.apache.hadoop.ipc.Client.call(Client.java:1475) at org.apache.hadoop.ipc.Client.call(Client.java:1412) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255) at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226) at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213) at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201) at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306) at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272) at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264) at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526) at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304) at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:109) at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67) at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257) at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) **Note: Even after kinit spark/hadoop token is not getting renewed.** Now Launch spark sql session ( Select * from alt_s1 ) is successful. 1. 
Launch spark-sql 2.spark-sql> select * from alt_s1; 2018-06-22 14:24:04 INFO HiveMetaStore:746 - 0: get_table : db=test_one tbl=alt_s1 2018-06-22 14:24:04 INFO audit:371 - ugi=spark/had...@hadoop.com ip=unknown-ip-addr cmd=get_table : db=test_one tbl=alt_s1 2018-06-22 14:24:04 INFO SQLStdHiveAccessController:95 - Created SQLStdHiveAccessController for session context : HiveAuthzSessionContext [sessionString=2cf6aac4-91c6-4c2d-871b-4d7620d91f43, clientType=HIVECLI] 2018-06-22 14:24:04 INFO metastore:291 - Mestastore configuration hive.metastore.filter.hook changed from org.apache.hadoop.hive.metastore.DefaultMetaStoreFilterHookImpl to org.apache.hadoop.hive.ql.security.authorization.plugin.AuthorizationMetaStoreFilterHook 2018-06-22 14:24:04 INFO HiveMetaStore:
[jira] [Comment Edited] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520056#comment-16520056 ] Takeshi Yamamuro edited comment on SPARK-24498 at 6/22/18 6:59 AM: --- Based on my rough patch (https://github.com/apache/spark/compare/master...maropu:JdkCompiler), the results of my investigation (sf=1 performance values, compile time, max method codegen size, and max class codegen size in TPCDS) are here: [https://docs.google.com/spreadsheets/d/1Mgdd9dfFaACXOUHqKfaeKrj09hB3X1j9sKTJlJ6UM6w/edit?usp=sharing] The performance values are not very different from each other, and the JDK compile time is much larger than the Janino compile time in q28-q72. The max method codegen size with the JDK compiler seems a little smaller than with Janino, though the class codegen size with the JDK compiler is larger than with Janino. was (Author: maropu): The results of my investigation (sf=1 performance values, compile time, max method codegen size, and max class codegen size in TPCDS) are here: [https://docs.google.com/spreadsheets/d/1Mgdd9dfFaACXOUHqKfaeKrj09hB3X1j9sKTJlJ6UM6w/edit?usp=sharing] Their performance values are not so different between each other and Jdk compile time is much larger than Janino time in q28-q72. It seems the max method codegen size in Jdk is a little smaller than that in Janino though, the class size in Jdk is larger than that in Janino. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler.
> See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696
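The generate-compile-execute cycle that the SPARK-24498 thread benchmarks can be sketched in miniature. Spark emits Java source and compiles it with Janino, and the ticket proposes the JDK compiler (javax.tools) as an alternative backend; the sketch below uses Python's built-in compile/exec purely to illustrate the same round trip, not either compiler:

```python
# Runtime code generation round trip, for illustration only: generate
# source as a string, compile it, load it into a namespace, and call it.
# This mirrors the lifecycle of Spark's codegen (generate Java source,
# compile with Janino or javax.tools, load the class, invoke it).
source = """
def add_one(x):
    return x + 1
"""

namespace = {}
code = compile(source, "<generated>", "exec")  # compile generated source
exec(code, namespace)                          # "load" it into a namespace

print(namespace["add_one"](41))  # 42
```

The benchmarking question in the thread is exactly about the two middle steps: how long the compile takes and how large the compiled artifact is for each backend.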