[jira] [Commented] (SPARK-47287) Aggregate in not causes
[ https://issues.apache.org/jira/browse/SPARK-47287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831625#comment-17831625 ]

Juefei Yan commented on SPARK-47287:
------------------------------------
I tried the code on the 3.4 branch and cannot reproduce this problem.

> Aggregate in not causes
> -----------------------
>
> Key: SPARK-47287
> URL: https://issues.apache.org/jira/browse/SPARK-47287
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.1
> Reporter: Ted Chester Jenks
> Priority: Major
>
> The below snippet is confirmed working with Spark 3.2.1 and broken with Spark 3.4.1. I believe this is a bug.
> {code:java}
> Dataset ds = dummyDataset
>     .withColumn("flag",
>         functions.not(functions.coalesce(functions.col("bool1"), functions.lit(false)).equalTo(true)))
>     .groupBy("code")
>     .agg(functions.max(functions.col("flag")).alias("flag"));
> ds.show(); {code}
> It fails with:
> {code:java}
> Caused by: java.lang.AssertionError: assertion failed
>     at scala.Predef$.assert(Predef.scala:208)
>     at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185)
>     at scala.Option.map(Option.scala:230)
>     at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184)
>     at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
>     at org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
>     at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293)
>     at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98)
>     at org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
>     at org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
>     at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700){code}

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
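As context for the report above, the failing column is the boolean expression NOT(COALESCE(bool1, false) = true), aggregated with MAX per group. A plain-Python sketch (not Spark, and only approximating Spark's null semantics; all names here are illustrative) shows the values the flag expression is expected to produce:

```python
# Plain-Python model of: flag = NOT (COALESCE(bool1, false) = true),
# then MAX(flag) per "code" group. This is an illustration of the logic,
# not a reproduction of the Spark bug itself.
from itertools import groupby

def flag(bool1):
    # COALESCE(bool1, false): substitute False when bool1 is NULL (None here)
    coalesced = bool1 if bool1 is not None else False
    return not (coalesced is True)

rows = [("a", True), ("a", None), ("b", False)]  # (code, bool1) pairs
grouped = {
    code: max(flag(b) for _, b in grp)
    for code, grp in groupby(sorted(rows, key=lambda r: r[0]), key=lambda r: r[0])
}
```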
[jira] (SPARK-43251) Assign a name to the error class _LEGACY_ERROR_TEMP_2015
[ https://issues.apache.org/jira/browse/SPARK-43251 ]

Liang Yan deleted comment on SPARK-43251:
------------------------------------------
was (Author: JIRAUSER299156): I will work on this issue.

> Assign a name to the error class _LEGACY_ERROR_TEMP_2015
> ---------------------------------------------------------
>
> Key: SPARK-43251
> URL: https://issues.apache.org/jira/browse/SPARK-43251
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Max Gekk
> Priority: Minor
> Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2015* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json).
> Add a test which triggers the error from user code if such a test does not already exist. Check exception fields by using {*}checkError(){*}. This function checks the valuable error fields only and avoids depending on the error text message; that way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is not clear. Propose a solution to users for how to avoid and fix such errors.
> Please look at the PRs below as examples:
> * [https://github.com/apache/spark/pull/38685]
> * [https://github.com/apache/spark/pull/38656]
> * [https://github.com/apache/spark/pull/38490]
[jira] [Commented] (SPARK-43251) Assign a name to the error class _LEGACY_ERROR_TEMP_2015
[ https://issues.apache.org/jira/browse/SPARK-43251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720107#comment-17720107 ]

Liang Yan commented on SPARK-43251:
------------------------------------
I will work on this issue.
[jira] [Commented] (SPARK-42844) Assign a name to the error class _LEGACY_ERROR_TEMP_2008
[ https://issues.apache.org/jira/browse/SPARK-42844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17709235#comment-17709235 ]

Liang Yan commented on SPARK-42844:
------------------------------------
[~maxgekk], I have added a PR.

> Assign a name to the error class _LEGACY_ERROR_TEMP_2008
> ---------------------------------------------------------
>
> Key: SPARK-42844
> URL: https://issues.apache.org/jira/browse/SPARK-42844
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Max Gekk
> Priority: Minor
> Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2008* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json).
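For readers unfamiliar with these starter tickets: renaming an error class means replacing a `_LEGACY_ERROR_TEMP_*` key in error-classes.json with a descriptive name. The fragment below is purely hypothetical (the name and message are invented, not the actual content of _LEGACY_ERROR_TEMP_2008 or _2015); it only sketches the general shape of an entry in that file:

```json
{
  "SOME_DESCRIPTIVE_NAME" : {
    "message" : [
      "A short but complete description, with placeholders like <objectName>."
    ]
  }
}
```

The `checkError()` tests mentioned above then assert on the error class name and these message parameters rather than on the rendered message text.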
[jira] [Resolved] (SPARK-42711) build/sbt usage error messages and shellcheck warn/error
[ https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang Yan resolved SPARK-42711.
--------------------------------
Resolution: Not A Problem

The original code is just a copy of upstream, and the changes do not fix an actual problem.

> build/sbt usage error messages and shellcheck warn/error
> ---------------------------------------------------------
>
> Key: SPARK-42711
> URL: https://issues.apache.org/jira/browse/SPARK-42711
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 3.3.2
> Reporter: Liang Yan
> Priority: Minor
>
> The build/sbt tool's usage information has some missing content:
>
> {code:java}
> (base) spark% ./build/sbt -help
> Usage: [options]
>   -h | -help         print this message
>   -v | -verbose      this runner is chattier
> {code}
> And also some shellcheck warnings/errors.
[jira] [Closed] (SPARK-42711) build/sbt usage error messages and shellcheck warn/error
[ https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang Yan closed SPARK-42711.
------------------------------
[jira] [Updated] (SPARK-42711) build/sbt usage error messages and shellcheck warn/error
[ https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang Yan updated SPARK-42711:
-------------------------------
Description:
The build/sbt tool's usage information has some missing content:

{code:java}
(base) spark% ./build/sbt -help
Usage: [options]
  -h | -help         print this message
  -v | -verbose      this runner is chattier
{code}

And also some shellcheck warn/error.

was:
The build/sbt tool's usage information about java-home is wrong:

# java version (default: java from PATH, currently $(java -version 2>&1 | grep version))
-java-home alternate JAVA_HOME
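The garbled `$(java -version ...)` text in the old usage output above is a typical symptom of the kind of quoting issue shellcheck flags. As an illustration only (this is a generic example of shellcheck's most common warning, SC2086, not code taken from the actual build/sbt script):

```shell
# Hypothetical example of an SC2086-style fix: an unquoted $java_home
# would be word-split on the space; the quoted form is what shellcheck
# recommends.
java_home="/opt/my jdk"   # a path containing a space

print_home() {
  printf '%s\n' "$java_home"   # quoted: printed as one word
}

print_home
```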
[jira] [Updated] (SPARK-42711) build/sbt usage error messages and shellcheck warn/error
[ https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang Yan updated SPARK-42711:
-------------------------------
Summary: build/sbt usage error messages and shellcheck warn/error (was: build/sbt usage error messages about java-home)

> build/sbt usage error messages and shellcheck warn/error
> ---------------------------------------------------------
>
> Key: SPARK-42711
> URL: https://issues.apache.org/jira/browse/SPARK-42711
> Project: Spark
> Issue Type: Bug
> Components: Build
> Affects Versions: 3.3.2
> Reporter: Liang Yan
> Priority: Minor
>
> The build/sbt tool's usage information about java-home is wrong:
> # java version (default: java from PATH, currently $(java -version 2>&1 | grep version))
> -java-home alternate JAVA_HOME
[jira] [Commented] (SPARK-42711) build/sbt usage error messages about java-home
[ https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697745#comment-17697745 ]

Liang Yan commented on SPARK-42711:
------------------------------------
I am preparing my PR for this issue. Please assign it to me.
[jira] [Created] (SPARK-42711) build/sbt usage error messages about java-home
Liang Yan created SPARK-42711:
-------------------------------

Summary: build/sbt usage error messages about java-home
Key: SPARK-42711
URL: https://issues.apache.org/jira/browse/SPARK-42711
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 3.3.2
Reporter: Liang Yan

The build/sbt tool's usage information about java-home is wrong:

# java version (default: java from PATH, currently $(java -version 2>&1 | grep version))
-java-home alternate JAVA_HOME
[jira] [Created] (SPARK-42395) The code logic of the configmap max size validation lacks extra content
Wei Yan created SPARK-42395:
-----------------------------

Summary: The code logic of the configmap max size validation lacks extra content
Key: SPARK-42395
URL: https://issues.apache.org/jira/browse/SPARK-42395
Project: Spark
Issue Type: Bug
Components: Kubernetes
Affects Versions: 3.5.0
Reporter: Wei Yan
Fix For: 3.3.1

In each configmap, Spark adds extra content in a fixed format. This extra content of the configmap is as follows:

spark.kubernetes.namespace: default
spark.properties: |
  #Java properties built from Kubernetes config map with name: spark-exec-b47b438630eec12d-conf-map
  #Wed Feb 08 20:10:19 CST 2023
  spark.kubernetes.namespace=default

But the max size validation code logic does not take this part into account.
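The point of the report above can be sketched as follows. This is a hypothetical illustration, not Spark's actual validation code; the helper names and the fixed-template string are invented, and 1048576 is the byte cap discussed in SPARK-42344 below:

```python
# Sketch: when validating a ConfigMap payload against a size limit, the
# fixed extra content Spark wraps around the user properties must be
# counted too, not just the serialized properties themselves.
EXTRA_TEMPLATE = (
    "spark.kubernetes.namespace: default\n"
    "spark.properties: |\n"
)

def total_payload_bytes(properties_text: str) -> int:
    # count both the fixed wrapper content and the serialized properties
    return len(EXTRA_TEMPLATE.encode("utf-8")) + len(properties_text.encode("utf-8"))

def fits_in_configmap(properties_text: str, max_size: int = 1048576) -> bool:
    return total_payload_bytes(properties_text) <= max_size
```

Validating only `len(properties_text)` against `max_size`, as the report says the current logic effectively does, would accept payloads that the API server then rejects.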
[jira] [Created] (SPARK-42344) The default size of the CONFIG_MAP_MAXSIZE should not be greater than 1048576
Wei Yan created SPARK-42344: --- Summary: The default size of the CONFIG_MAP_MAXSIZE should not be greater than 1048576 Key: SPARK-42344 URL: https://issues.apache.org/jira/browse/SPARK-42344 Project: Spark Issue Type: Bug Components: Kubernetes, Spark Submit Affects Versions: 3.3.1 Environment: Kubernetes: 1.22.0 ETCD: 3.5.0 Spark: 3.3.2 Reporter: Wei Yan Fix For: 3.5.0 Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://172.18.123.24:6443/api/v1/namespaces/default/configmaps. Message: ConfigMap "spark-exec-ed9f2c861aa40b48-conf-map" is invalid: []: Too long: must have at most 1048576 bytes. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have at most 1048576 bytes, reason=FieldValueTooLong, additionalProperties={})], group=null, kind=ConfigMap, name=spark-exec-ed9f2c861aa40b48-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap "spark-exec-ed9f2c861aa40b48-conf-map" is invalid: []: Too long: must have at most 1048576 bytes, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}). 
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:305)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
at io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.setUpExecutorConfigMap(KubernetesClusterSchedulerBackend.scala:88)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:112)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:222)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:595)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2714)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
at org.apache.spark.examples.JavaSparkPi.main(JavaSparkPi.java:37)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[jira] [Commented] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large
[ https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957623#comment-16957623 ]

carlos yan commented on SPARK-24666:
-------------------------------------
I also hit this issue, and my Spark version is 2.1.0. I used 10 million records for training and the vocabulary size is about 1 million words. When numIterations > 10, the generated vectors contain *infinity* and *NaN*.

> Word2Vec generate infinity vectors when numIterations are large
> ----------------------------------------------------------------
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.3.1
> Environment: 2.0.X, 2.1.X, 2.2.X, 2.3.X
> Reporter: ZhongYu
> Priority: Critical
>
> We found that Word2Vec generates large-absolute-value vectors when numIterations is large, and if numIterations is large enough (>20), the vector values may be *infinity* (or *-infinity*), resulting in useless vectors.
> In normal situations, vector values are mainly around -1.0~1.0 when numIterations = 1.
> The bug shows up on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X.
> There is already an issue reporting this bug: https://issues.apache.org/jira/browse/SPARK-5261 , but the bug fix seems to be missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
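A simple way to detect the corrupted vectors described above is to scan the trained embeddings for non-finite components. This is a plain-Python sketch (the function names are illustrative); with Spark one would apply the same check to the word-vector map obtained from the trained model:

```python
# Detect Word2Vec output corruption: any vector containing inf/-inf/NaN
# is useless for downstream similarity computations.
import math

def has_nonfinite(vec):
    # True if any component is inf, -inf, or NaN
    return any(not math.isfinite(x) for x in vec)

def corrupted_words(word_vectors):
    # word_vectors: dict mapping word -> list of float components
    return [w for w, v in word_vectors.items() if has_nonfinite(v)]
```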
[jira] [Updated] (SPARK-27894) PySpark streaming transform RDD join not works when checkpoint enabled
[ https://issues.apache.org/jira/browse/SPARK-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeffrey(Xilang) Yan updated SPARK-27894:
-----------------------------------------
Description:
In PySpark Streaming, if checkpointing is enabled and there is a transform-join operation, an error is thrown.

{code:java}
sc=SparkContext(appName='')
sc.setLogLevel("WARN")
ssc=StreamingContext(sc,10)
ssc.checkpoint("hdfs:///test")
kafka_bootstrap_servers=""
topics = ['', '']
doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
kvds=KafkaUtils.createDirectStream(ssc, topics, kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})
line=kvds.map(lambda x:(1,2))
line.transform(lambda rdd:rdd.join(doc_info)).pprint(10)
ssc.start()
ssc.awaitTermination()
{code}

Error details:

{code:java}
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
{code}

Similar code works fine in Scala. And if we remove either

{code:java}
ssc.checkpoint("hdfs:///test")
{code}

or

{code:java}
line.transform(lambda rdd:rdd.join(doc_info))
{code}

there is no error. It seems that when checkpointing is enabled, PySpark serializes the transform lambda, and then the RDD used by the lambda; since an RDD cannot be serialized, the error is raised.

was:
In PySpark Streaming, if checkpointing is enabled and there is a transform-join operation, an error is thrown. (Same reproduction code as above, with the error details as inline text rather than a code block.)

> PySpark streaming transform RDD join not works when checkpoint enabled
> -----------------------------------------------------------------------
>
> Key: SPARK-27894
> URL: https://issues.apache.org/jira/browse/SPARK-27894
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: Jeffrey(Xilang) Yan
> Priority: Major
[jira] [Updated] (SPARK-27894) PySpark streaming transform RDD join not works when checkpoint enabled
[ https://issues.apache.org/jira/browse/SPARK-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeffrey(Xilang) Yan updated SPARK-27894:
-----------------------------------------
Description: (reformatted the reproduction code as a {code:java} block; the reproduction steps and error details are otherwise the same as in the original description quoted below)
[jira] [Created] (SPARK-27894) PySpark streaming transform RDD join not works when checkpoint enabled
Jeffrey(Xilang) Yan created SPARK-27894:
-----------------------------------------

Summary: PySpark streaming transform RDD join not works when checkpoint enabled
Key: SPARK-27894
URL: https://issues.apache.org/jira/browse/SPARK-27894
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.0
Reporter: Jeffrey(Xilang) Yan

In PySpark Streaming, if checkpointing is enabled and there is a transform-join operation, an error is thrown.

{code:java}
sc=SparkContext(appName='')
sc.setLogLevel("WARN")
ssc=StreamingContext(sc,10)
ssc.checkpoint("hdfs:///test")
kafka_bootstrap_servers=""
topics = ['', '']
doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
kvds=KafkaUtils.createDirectStream(ssc, topics, kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})
line=kvds.map(lambda x:(1,2))
line.transform(lambda rdd:rdd.join(doc_info)).pprint(10)
ssc.start()
ssc.awaitTermination()
{code}

Error details:

{noformat}
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
{noformat}

Similar code works fine in Scala. And if we remove either

{noformat}
ssc.checkpoint("hdfs:///test")
{noformat}

or

{noformat}
line.transform(lambda rdd:rdd.join(doc_info))
{noformat}

there is no error. It seems that when checkpointing is enabled, PySpark serializes the transform lambda, and then the RDD used by the lambda; since an RDD cannot be serialized, the error is raised.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
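A workaround commonly suggested for this class of SPARK-5063 error is to avoid referencing a second RDD inside the transform closure: since `doc_info` is small, collect it to the driver and join against a plain dict (in PySpark, roughly `doc_info_map = dict(doc_info.collect())` before the transform). The sketch below illustrates the resulting map-side join with plain Python data, since no cluster is involved here; the function and variable names are illustrative:

```python
# Stand-in for the per-batch rdd.join(doc_info): join each (key, value)
# record against a driver-side dict instead of a second RDD, so the
# closure captures only a serializable plain dict.
doc_info_map = dict([(1, 2), (4, 5), (7, 8), (10, 11)])

def join_with_map(records):
    # inner-join semantics: keep only keys present in the map
    return [(k, (v, doc_info_map[k])) for k, v in records if k in doc_info_map]

batch = [(1, "x"), (3, "y")]
joined = join_with_map(batch)
```

This trades the shuffle join for a broadcast-style lookup, which is only appropriate when the joined-in side fits in driver/executor memory.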
[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833632#comment-16833632 ] Jeffrey(Xilang) Yan commented on SPARK-5594: There is a bug in versions before 2.2.3/2.3.0. If you hit "Failed to get broadcast" and the method call stack goes through MapOutputTracker, try upgrading your Spark. The bug is that the driver removes the broadcast but still sends the broadcast id to the executor (method MapOutputTrackerMaster.getSerializedMapOutputStatuses). It has been fixed by https://issues.apache.org/jira/browse/SPARK-23243 > SparkException: Failed to get broadcast (TorrentBroadcast) > -- > > Key: SPARK-5594 > URL: https://issues.apache.org/jira/browse/SPARK-5594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: John Sandiford >Priority: Critical > > I am uncertain whether this is a bug, however I am getting the error below > when running on a cluster (works locally), and have no idea what is causing > it, or where to look for more information. > Any help is appreciated. Others appear to experience the same issue, but I > have not found any solutions online. > Please note that this only happens with certain code and is repeatable, all > my other spark jobs work fine. 
> {noformat} > ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: > Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: > org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of > broadcast_6 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 > of broadcast_6 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) > at > 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) > ... 11 more > {noformat} > Driver stacktrace: > {noformat} > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at >
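For intuition, TorrentBroadcast stores a broadcast variable as numbered pieces ("broadcast_6_piece0", ...) in a block store and reassembles them on read; the exception above is a read discovering that a piece was already removed — exactly the pre-2.2.3/2.3.0 race where the driver destroys the broadcast but still hands its id to executors. A toy Python model (all names illustrative, not Spark's API):

```python
# Toy model of TorrentBroadcast piece storage and reassembly.
block_store = {}   # blockId -> bytes, standing in for the BlockManager
num_pieces = {}    # broadcastId -> piece count

def write_broadcast(bid, data, piece_size=4):
    """Split the broadcast value into fixed-size pieces and store them."""
    chunks = [data[i:i + piece_size] for i in range(0, len(data), piece_size)]
    for i, chunk in enumerate(chunks):
        block_store[f"broadcast_{bid}_piece{i}"] = chunk
    num_pieces[bid] = len(chunks)

def read_broadcast(bid):
    """Reassemble the value; fail like readBroadcastBlock if a piece is gone."""
    out = b""
    for i in range(num_pieces[bid]):
        key = f"broadcast_{bid}_piece{i}"
        if key not in block_store:
            raise IOError(f"Failed to get {key} of broadcast_{bid}")
        out += block_store[key]
    return out

write_broadcast(6, b"mapoutputstatuses")
assert read_broadcast(6) == b"mapoutputstatuses"

# The race: the driver destroys the broadcast while an executor still
# holds its id and tries to read it.
del block_store["broadcast_6_piece0"]
try:
    read_broadcast(6)
    raised = False
except IOError:
    raised = True
assert raised
```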
[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490307#comment-16490307 ] Wei Yan commented on SPARK-24374: - Thanks [~mengxr] for the initiative and the doc. cc [~leftnoteasy] [~zhz], as this may need some support from the YARN side when running in yarn-cluster mode. > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
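For intuition, the difference between MapReduce-style independent tasks and the proposed barrier stage can be sketched with plain Python threads standing in for tasks (illustrative only; the API Spark eventually shipped for this is `BarrierTaskContext.barrier()`):

```python
import threading

# Sketch of "barrier scheduling": all tasks in the stage launch together
# and synchronize at a barrier before proceeding, unlike independently
# scheduled MapReduce-style tasks.
NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)
log = []
log_lock = threading.Lock()

def barrier_task(task_id):
    with log_lock:
        log.append(("start", task_id))
    barrier.wait()  # analogous to BarrierTaskContext.barrier()
    with log_lock:
        log.append(("after_barrier", task_id))

threads = [threading.Thread(target=barrier_task, args=(i,))
           for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No task passes the barrier until every task has started.
assert all(event == "start" for event, _ in log[:NUM_TASKS])
```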
[jira] [Created] (SPARK-21576) Spark caching difference between 2.0.2 and 2.1.1
Yan created SPARK-21576: --- Summary: Spark caching difference between 2.0.2 and 2.1.1 Key: SPARK-21576 URL: https://issues.apache.org/jira/browse/SPARK-21576 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.1 Reporter: Yan Priority: Minor

Hi, I asked a question on Stack Overflow and was recommended to open a JIRA. I'm a bit confused by Spark's caching behavior. I want to compute a dependent dataset (b), cache it, and unpersist the source dataset (a) - here is my code:

{code:scala}
val spark = SparkSession.builder().appName("test").master("local[4]").getOrCreate()
import spark.implicits._
val a = spark.createDataset(Seq(("a", 1), ("b", 2), ("c", 3)))
a.createTempView("a")
a.cache
println(s"Is a cached: ${spark.catalog.isCached("a")}")
val b = a.filter(x => x._2 < 3)
b.createTempView("b")
// calling action
b.cache.first
println(s"Is b cached: ${spark.catalog.isCached("b")}")
spark.catalog.uncacheTable("a")
println(s"Is b cached after a was unpersisted: ${spark.catalog.isCached("b")}")
{code}

When using Spark 2.0.2 it works as expected:

{code:none}
Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: true
{code}

But on 2.1.1:

{code:none}
Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: false
{code}

In reality I have a dataset a (a complex database query) and heavy processing to get dataset b from a, but after b is computed and cached, dataset a is not needed anymore (I want to free the memory). How can I achieve this in 2.1.1? Thank you. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
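A toy model of what 2.1.1 appears to do — this is an illustrative guess at the behavior, not Spark's implementation: uncaching a table cascades to cached plans that still reference it. The workaround it suggests is to detach b from a's plan (e.g. write b out and read it back, or use `Dataset.checkpoint()`) before uncaching a:

```python
# Toy catalog: cached entries remember which cached parents their plan
# references; uncaching a parent cascades to its dependents (2.1.1-style).
class Catalog:
    def __init__(self):
        self.cached = {}  # name -> set of parent names

    def cache(self, name, parents=()):
        self.cached[name] = set(parents)

    def is_cached(self, name):
        return name in self.cached

    def uncache(self, name):
        self.cached.pop(name, None)
        for child, parents in list(self.cached.items()):
            if name in parents:
                self.uncache(child)  # cascade to dependents

cat = Catalog()
cat.cache("a")
cat.cache("b", parents=["a"])  # b is defined as a.filter(...)
cat.uncache("a")
assert not cat.is_cached("b")  # what the reporter observed on 2.1.1

# Workaround sketch: materialize b so its plan no longer refers to a,
# then cache the detached copy before uncaching a.
cat2 = Catalog()
cat2.cache("a")
cat2.cache("b_detached")       # no parent link back to a
cat2.uncache("a")
assert cat2.is_cached("b_detached")
```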
[jira] [Comment Edited] (SPARK-3165) DecisionTree does not use sparsity in data
[ https://issues.apache.org/jira/browse/SPARK-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935646#comment-15935646 ] Facai Yan edited comment on SPARK-3165 at 3/22/17 1:57 AM: --- Do you mean that TreePoint.binnedFeatures is Array[Int], which doesn't use the sparsity in the data? If so, these modifications are needed: 1. modify TreePoint.binnedFeatures to Vector. 2. modify the LearningNode.predictImpl method if needed. 3. modify the methods for bin-wise computation, such as binSeqOp, to accelerate the computation. Please correct me if I misunderstand. I'd like to work on this if no one else has started it. was (Author: facai): Do you mean that: TreePoint.binnedFeatures is Array[int], which doesn't sparsity in data? So those modifications is need: 1. modify TreePoint.binnedFeatures to Vector. 2. modify LearningNode.predictImpl method if need. 3. modify the methods about Bin-wise computation, such as binSeqOp, to accelerate computation. Please correct me if misunderstand. I'd like to work on it. > DecisionTree does not use sparsity in data > -- > > Key: SPARK-3165 > URL: https://issues.apache.org/jira/browse/SPARK-3165 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: computation > DecisionTree should take advantage of sparse feature vectors. Aggregation > over training data could handle the empty/zero-valued data elements more > efficiently. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3165) DecisionTree does not use sparsity in data
[ https://issues.apache.org/jira/browse/SPARK-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935646#comment-15935646 ] Facai Yan commented on SPARK-3165: -- Do you mean that TreePoint.binnedFeatures is Array[Int], which doesn't use the sparsity in the data? If so, these modifications are needed: 1. modify TreePoint.binnedFeatures to Vector. 2. modify the LearningNode.predictImpl method if needed. 3. modify the methods for bin-wise computation, such as binSeqOp, to accelerate the computation. Please correct me if I misunderstand. I'd like to work on it. > DecisionTree does not use sparsity in data > -- > > Key: SPARK-3165 > URL: https://issues.apache.org/jira/browse/SPARK-3165 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: computation > DecisionTree should take advantage of sparse feature vectors. Aggregation > over training data could handle the empty/zero-valued data elements more > efficiently. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
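The proposal in the comment above — store binned features sparsely so bin-wise aggregation can skip zero bins — in a minimal Python sketch (the class name and shape are illustrative; Spark's TreePoint is Scala):

```python
# Illustrative sparse container for binned features: only nonzero bins are
# stored, so aggregation (the binSeqOp analogue) can visit just those.
class SparseBinnedFeatures:
    def __init__(self, size, nonzero):
        self.size = size
        self.nonzero = dict(nonzero)  # featureIndex -> bin value

    def __getitem__(self, i):
        return self.nonzero.get(i, 0)  # absent entries are the zero bin

dense = [0, 0, 3, 0, 1, 0, 0, 0]
sparse = SparseBinnedFeatures(len(dense), {2: 3, 4: 1})

# Same logical contents as the dense Array[Int]...
assert [sparse[i] for i in range(sparse.size)] == dense
# ...but aggregation only has to touch the nonzero entries.
assert sorted(sparse.nonzero.items()) == [(2, 3), (4, 1)]
```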
[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs
[ https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904479#comment-15904479 ] Ji Yan commented on SPARK-19320: I'm proposing to add a configuration parameter (spark.mesos.gpus) to guarantee a hard number of GPUs. To avoid conflict, it will override spark.mesos.gpus.max whenever spark.mesos.gpus is greater than 0. > Allow guaranteed amount of GPU to be used when launching jobs > - > > Key: SPARK-19320 > URL: https://issues.apache.org/jira/browse/SPARK-19320 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen > > Currently the only configuration for using GPUs with Mesos is setting the > maximum amount of GPUs a job will take from an offer, but doesn't guarantee > exactly how much. > We should have a configuration that sets this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
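The proposed resolution between the two settings can be stated in a few lines (a sketch of the intended semantics; the helper function and its return shape are illustrative, not Spark code):

```python
# Sketch: a hard guarantee (spark.mesos.gpus) overrides the soft cap
# (spark.mesos.gpus.max) whenever it is greater than 0.
def effective_gpu_request(conf):
    hard = int(conf.get("spark.mesos.gpus", 0))
    if hard > 0:
        return ("guaranteed", hard)
    return ("best_effort_max", int(conf.get("spark.mesos.gpus.max", 0)))

# The hard limit wins when set...
assert effective_gpu_request(
    {"spark.mesos.gpus": 2, "spark.mesos.gpus.max": 8}) == ("guaranteed", 2)
# ...otherwise the existing max-from-offer behavior applies.
assert effective_gpu_request(
    {"spark.mesos.gpus.max": 8}) == ("best_effort_max", 8)
```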
[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs
[ https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888998#comment-15888998 ] Ji Yan commented on SPARK-19320: [~tnachen] in this case, should we rename spark.mesos.gpus.max to something else, or should we keep spark.mesos.gpus.max and add a new configuration for a hard limit on GPUs? > Allow guaranteed amount of GPU to be used when launching jobs > - > > Key: SPARK-19320 > URL: https://issues.apache.org/jira/browse/SPARK-19320 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen > > Currently the only configuration for using GPUs with Mesos is setting the > maximum amount of GPUs a job will take from an offer, but doesn't guarantee > exactly how much. > We should have a configuration that sets this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884487#comment-15884487 ] Ji Yan commented on SPARK-19740: the problem is that when running Spark on Mesos, there is no way to run the Spark executor as a non-root user > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
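A sketch of the proposed fix: thread arbitrary `docker run` parameters (per MESOS-1816) through to the executor launch command so `--user` can be supplied. The helper below is hypothetical, not Spark's code:

```python
# Hypothetical command builder: arbitrary (key, value) docker parameters are
# forwarded as "--key value" flags before the image name, so a deployment can
# pass ("user", "1000:1000") to stop the executor running as root.
def build_docker_run(image, extra_params):
    cmd = ["docker", "run"]
    for key, value in extra_params:
        cmd += [f"--{key}", value]
    cmd.append(image)
    return cmd

cmd = build_docker_run("spark-executor:2.1.0", [("user", "1000:1000")])
assert "--user" in cmd and "1000:1000" in cmd
assert cmd[-1] == "spark-executor:2.1.0"
```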
[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884426#comment-15884426 ] Ji Yan commented on SPARK-19740: proposed change: https://github.com/yanji84/spark/commit/4f8368ea727e5689e96794884b8d1baf3eccb5d5 > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19740) Spark executor always runs as root when running on mesos
[ https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Yan updated SPARK-19740: --- Description: When running Spark on Mesos with docker containerizer, the spark executors are always launched with 'docker run' command without specifying --user option, which always results in spark executors running as root. Mesos has a way to support arbitrary parameters. Spark could use that to expose setting user background on mesos with arbitrary parameters support: https://issues.apache.org/jira/browse/MESOS-1816 was:When running Spark on Mesos with docker containerizer, the spark executors are always launched with 'docker run' command without specifying --user option, which always results in spark executors running as root. Mesos has a way to support arbitrary parameters. Spark could use that to expose setting user > Spark executor always runs as root when running on mesos > > > Key: SPARK-19740 > URL: https://issues.apache.org/jira/browse/SPARK-19740 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Ji Yan > > When running Spark on Mesos with docker containerizer, the spark executors > are always launched with 'docker run' command without specifying --user > option, which always results in spark executors running as root. Mesos has a > way to support arbitrary parameters. Spark could use that to expose setting > user > background on mesos with arbitrary parameters support: > https://issues.apache.org/jira/browse/MESOS-1816 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19740) Spark executor always runs as root when running on mesos
Ji Yan created SPARK-19740: -- Summary: Spark executor always runs as root when running on mesos Key: SPARK-19740 URL: https://issues.apache.org/jira/browse/SPARK-19740 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.1.0 Reporter: Ji Yan When running Spark on Mesos with docker containerizer, the spark executors are always launched with 'docker run' command without specifying --user option, which always results in spark executors running as root. Mesos has a way to support arbitrary parameters. Spark could use that to expose setting user -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15777) Catalog federation
[ https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593229#comment-15593229 ] Yan commented on SPARK-15777: - One approach could be first tagging a subtree as specific to a data source, and then only applying the custom rules from that data source to the subtree so tagged. There could be other feasible approaches, and it is considered one of the details left open for future discussions. Thanks. > Catalog federation > -- > > Key: SPARK-15777 > URL: https://issues.apache.org/jira/browse/SPARK-15777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: SparkFederationDesign.pdf > > > This is a ticket to track progress to support federating multiple external > catalogs. This would require establishing an API (similar to the current > ExternalCatalog API) for getting information about external catalogs, and > ability to convert a table into a data source table. > As part of this, we would also need to be able to support more than a > two-level table identifier (database.table). At the very least we would need > a three level identifier for tables (catalog.database.table). A possibly > direction is to support arbitrary level hierarchical namespaces similar to > file systems. > Once we have this implemented, we can convert the current Hive catalog > implementation into an external catalog that is "mounted" into an internal > catalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15777) Catalog federation
[ https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590055#comment-15590055 ] Yan commented on SPARK-15777: - There is a paragraph in the design doc about the ordering of rule application as copied below: - 5) Resolution/execution ordering of analyzer/optimizer/planner rules is as follows: a) For analyzer and planner, external rules precede the built-in rules; b) For optimizer, the external rules follow the built-in rules. This is following the existing ordering, but is subject to future revisits; - To prevent a custom rule from being too aggressive and modifying the plan branches that are generic or specific to other data sources, guards can be added against such intrusive rules. In the future even explicit ordering specification might be added for more advanced plugged-in rules. > Catalog federation > -- > > Key: SPARK-15777 > URL: https://issues.apache.org/jira/browse/SPARK-15777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: SparkFederationDesign.pdf > > > This is a ticket to track progress to support federating multiple external > catalogs. This would require establishing an API (similar to the current > ExternalCatalog API) for getting information about external catalogs, and > ability to convert a table into a data source table. > As part of this, we would also need to be able to support more than a > two-level table identifier (database.table). At the very least we would need > a three level identifier for tables (catalog.database.table). A possibly > direction is to support arbitrary level hierarchical namespaces similar to > file systems. > Once we have this implemented, we can convert the current Hive catalog > implementation into an external catalog that is "mounted" into an internal > catalog. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
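The rule-evaluation ordering described in item 5) above can be sketched in a few lines of Python (the pseudo-rules and the `apply_rules` helper are illustrative, not Catalyst's API):

```python
# Sketch of the ordering: external analyzer/planner rules run before the
# built-in rules; external optimizer rules run after them.
def apply_rules(plan, builtin, external, phase):
    if phase in ("analyzer", "planner"):
        rules = external + builtin
    else:  # optimizer
        rules = builtin + external
    for rule in rules:
        plan = rule(plan)
    return plan

trace = []
mk = lambda name: (lambda plan: trace.append(name) or plan)  # identity rule
builtin, external = [mk("builtin")], [mk("external")]

apply_rules({}, builtin, external, "analyzer")
assert trace == ["external", "builtin"]

trace.clear()
apply_rules({}, builtin, external, "optimizer")
assert trace == ["builtin", "external"]
```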
[jira] [Commented] (SPARK-15777) Catalog federation
[ https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543329#comment-15543329 ] Yan commented on SPARK-15777: - 1) Currently the rules are applied on a per-session basis. Right, ideally they should be applied on a per-query basis. We can modify the design/implementation in that direction. Regarding evaluation ordering, item 5) of the "Scopes, Limitations and Open Questions" is on this topic. In short, there is an ordering between the built-in rules and custom rules, but not among the custom rules. The plugin mechanism is for cooperative behavior, so the plugged rules are expected to be applied only against the plan parts specific to their data sources, probably after some plan rewriting. Once the overall ideas are accepted by the community, we will flesh out the design doc and post the implementation in a WIP fashion. 2) As mentioned in the doc, this is not a complete design. Hopefully it can lay down some basic concepts and principles so future work can be built on top of it. For instance, a persistent catalog itself could be another major feature, but it is left out of the scope of this design for now without affecting the primary functionalities. 3) The 3-level table identifier currently serves namespace purposes. Yes, join queries against two tables of the same db and table names but with different catalog names work well. Arbitrary levels of namespaces are not supported yet. Thanks for your comments. > Catalog federation > -- > > Key: SPARK-15777 > URL: https://issues.apache.org/jira/browse/SPARK-15777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: SparkFederationDesign.pdf > > > This is a ticket to track progress to support federating multiple external > catalogs. 
This would require establishing an API (similar to the current > ExternalCatalog API) for getting information about external catalogs, and > ability to convert a table into a data source table. > As part of this, we would also need to be able to support more than a > two-level table identifier (database.table). At the very least we would need > a three level identifier for tables (catalog.database.table). A possibly > direction is to support arbitrary level hierarchical namespaces similar to > file systems. > Once we have this implemented, we can convert the current Hive catalog > implementation into an external catalog that is "mounted" into an internal > catalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15777) Catalog federation
[ https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15541643#comment-15541643 ] Yan commented on SPARK-15777: - A design document is just attached above. We realize that this task requires significant amount of efforts, and would like to follow community advices on how to proceed with discussions. If Google Doc is the preferred channel, we have one version ready over there. Thanks. > Catalog federation > -- > > Key: SPARK-15777 > URL: https://issues.apache.org/jira/browse/SPARK-15777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: SparkFederationDesign.pdf > > > This is a ticket to track progress to support federating multiple external > catalogs. This would require establishing an API (similar to the current > ExternalCatalog API) for getting information about external catalogs, and > ability to convert a table into a data source table. > As part of this, we would also need to be able to support more than a > two-level table identifier (database.table). At the very least we would need > a three level identifier for tables (catalog.database.table). A possibly > direction is to support arbitrary level hierarchical namespaces similar to > file systems. > Once we have this implemented, we can convert the current Hive catalog > implementation into an external catalog that is "mounted" into an internal > catalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15777) Catalog federation
[ https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-15777: Attachment: SparkFederationDesign.pdf > Catalog federation > -- > > Key: SPARK-15777 > URL: https://issues.apache.org/jira/browse/SPARK-15777 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Attachments: SparkFederationDesign.pdf > > > This is a ticket to track progress to support federating multiple external > catalogs. This would require establishing an API (similar to the current > ExternalCatalog API) for getting information about external catalogs, and > ability to convert a table into a data source table. > As part of this, we would also need to be able to support more than a > two-level table identifier (database.table). At the very least we would need > a three level identifier for tables (catalog.database.table). A possibly > direction is to support arbitrary level hierarchical namespaces similar to > file systems. > Once we have this implemented, we can convert the current Hive catalog > implementation into an external catalog that is "mounted" into an internal > catalog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519497#comment-15519497 ] Yan commented on SPARK-17556: - For 2), I think BitTorrent won't help in the case of all-to-all transfers, unlike one-to-all transfers (such as the driver-to-cluster broadcast) or few-to-all transfers. Thanks. > Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin > Attachments: executor broadcast.pdf, executor-side-broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518598#comment-15518598 ] Yan commented on SPARK-17556: - A few comments of mine are as follows: 1) The "one-executor collection" approach is different from the driver-side collection and broadcasting, in that it avoids uploading data from the driver back to the cluster. The primary concern of the "one-executor collection" approach, as pointed out, is that the sole executor could become a bottleneck, similar in large degree to the latency issue with the "driver-side collection" approach; 2) The "all-executor collection" approach is more balanced and scalable, but it might suffer from network storming since all slaves need to connect to all others. 3) The real issue is the repeated, and thus wasted, work of collecting pieces of the broadcast data by multiple collectors/broadcasters, weighed against the extended latency when the collection/broadcasting is performed once and for all. This is actually not very different from the multiple- vs. single-reducer scenario in a map/reduce execution. Final output from a single reducer is ready to use, while output from multiple reducers requires final assembly by the end users, particularly if the final result is to be organized, e.g., totally ordered. But using multiple reducers is more scalable, balanced, and likely faster. 4) It's probably good to have a configurable # of executors acting as collectors/broadcasters, each of which collects and broadcasts a portion of the broadcast table for the final join executions. 
> Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin > Attachments: executor broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
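Suggestion 4) above — a configurable number of executors each collecting and broadcasting one slice of the build side, with every executor joining against the union of the slices — can be sketched as follows (all names illustrative, not Spark code):

```python
# Each of num_collectors executors takes every num_collectors-th row;
# joiners then merge all slices into one build-side hash table.
def slice_for_collector(rows, num_collectors, collector_id):
    return [r for i, r in enumerate(rows) if i % num_collectors == collector_id]

def assemble_broadcast(rows, num_collectors):
    slices = [slice_for_collector(rows, num_collectors, c)
              for c in range(num_collectors)]
    build_side = {}
    for s in slices:          # each joining executor merges every slice
        for key, value in s:
            build_side[key] = value
    return build_side

rows = [(k, k * 10) for k in range(7)]
table = assemble_broadcast(rows, num_collectors=3)
assert table == dict(rows)    # the union reconstructs the full table
```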
[jira] [Created] (SPARK-17375) Star Join Optimization
Yan created SPARK-17375: --- Summary: Star Join Optimization Key: SPARK-17375 URL: https://issues.apache.org/jira/browse/SPARK-17375 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yan The star schema is the simplest style of data mart schema and is the approach often seen in BI/Decision Support systems. A star join is a popular SQL query pattern that joins one (or a few) fact tables with a few dimension tables in a star schema. Star join query optimizations aim to improve the performance and resource usage of star joins. The existing Spark SQL optimization broadcasts the usually small (after filtering and projection) dimension tables to avoid costly shuffling of the fact table and the "reduce" operations based on the join keys. This improvement proposal tries to further improve broadcast star joins in two areas: 1) avoid materialization of intermediate rows that would eventually not make it into the final result row set after being further joined with other, more restrictive dimensions; 2) avoid the performance variations among different join orders. The latter could also have been largely achieved by cost analysis and heuristics to select a reasonably optimal join order, but here we try to achieve a similar improvement without relying on such info. A preliminary test against a small TPCDS 1GB data set indicates between 5%-40% improvement (with codegen disabled in both tests) vs. the multiple broadcast joins on one query (Q27) that inner joins 4 dimension tables with one fact table. The large variation (5%-40%) is due to the different join orderings of the 4 broadcast joins. Tests using larger data sets and other TPCDS queries are yet to be performed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
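The idea in 1) can be sketched as probing all dimension hash tables before any intermediate row is materialized, so a fact row that fails any dimension filter is dropped immediately and join order stops mattering (an illustrative Python model, not Spark's implementation):

```python
# Star join sketch: fact rows are tuples; dims maps a fact column index to a
# broadcast hash table (key -> dimension payload). A joined row is emitted
# only if the fact row matches in EVERY dimension table.
def star_join(fact_rows, dims):
    out = []
    for row in fact_rows:
        payloads = []
        for idx, table in dims.items():
            hit = table.get(row[idx])
            if hit is None:
                break            # row can never reach the final result
            payloads.append(hit)
        else:
            out.append(row + tuple(payloads))
    return out

fact = [(1, "x"), (2, "y"), (3, "z")]
dims = {0: {1: "d1", 3: "d3"}}   # one dimension keyed on fact column 0
assert star_join(fact, dims) == [(1, "x", "d1"), (3, "z", "d3")]
```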
[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS
[ https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267811#comment-15267811 ] Yan commented on SPARK-14521: - Yes, we are serializing the fields that should not be serialized. Please check and review https://github.com/apache/spark/pull/12598. > StackOverflowError in Kryo when executing TPC-DS > > > Key: SPARK-14521 > URL: https://issues.apache.org/jira/browse/SPARK-14521 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Rajesh Balamohan >Priority: Critical > > Build details: Spark build from master branch (Apr-10) > DataSet:TPC-DS at 200 GB scale in Parq format stored in hive. > Client: $SPARK_HOME/bin/beeline > Query: TPC-DS Query27 > spark.sql.sources.fileScan=true (this is the default value anyways) > Exception: > {noformat} > Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at 
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14507) Decide if we should still support CREATE EXTERNAL TABLE AS SELECT
[ https://issues.apache.org/jira/browse/SPARK-14507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240616#comment-15240616 ] Yan commented on SPARK-14507: - In terms of Hive support vs Spark SQL support, the "external table" concept in Spark SQL seems to go beyond that in Hive, and not just for CTAS. For Hive, an "external table" is only for the "schema-on-read" scenario on data stored on, say, HDFS. It has its own somewhat unique DDL semantics and security features, different from those of normal SQL databases. A Spark SQL external table, as far as I understand, could be a mapping to a data source table. I'm not sure whether this mapping would need the same special considerations regarding DDL semantics and security models as Hive external tables do. > Decide if we should still support CREATE EXTERNAL TABLE AS SELECT > - > > Key: SPARK-14507 > URL: https://issues.apache.org/jira/browse/SPARK-14507 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > > Looks like we support CREATE EXTERNAL TABLE AS SELECT by accident. Should we > still support it? It seems Hive does not support it, while based on the Impala doc, > Impala seems to support it. Right now, the rule CreateTables in > HiveMetastoreCatalog.scala does not respect the EXTERNAL keyword when > {{hive.convertCTAS}} is true and the CTAS query does not provide any storage > format. In this case, the table becomes a MANAGED_TABLE stored in > the default metastore location (not the user-specified location).
[jira] [Commented] (SPARK-11368) Spark shouldn't scan all partitions when using Python UDF and filter over partitioned column is given
[ https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233231#comment-15233231 ] Yan commented on SPARK-11368: - The issue seems to be gone with the latest master code (for 2.0): sqlCtx.sql('select count(*) from df where id >= 990 and multiply2(value) > 20').explain(True): WholeStageCodegen : +- TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count(1)#14L]) : +- INPUT +- Exchange SinglePartition, None +- WholeStageCodegen : +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#17L]) : +- Project :+- Project [value#0L,id#1] : +- Filter (cast(pythonUDF0#18 as double) > 20.0) : +- INPUT +- !BatchPythonEvaluation [multiply2(value#0L)], [value#0L,id#1,pythonUDF0#18] +- WholeStageCodegen : +- BatchedScan HadoopFiles[value#0L,id#1] Format: ParquetFormat, PushedFilters: [], ReadSchema: struct while 1.6 still had the issue as reported. > Spark shouldn't scan all partitions when using Python UDF and filter over > partitioned column is given > - > > Key: SPARK-11368 > URL: https://issues.apache.org/jira/browse/SPARK-11368 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Maciej Bryński >Priority: Critical > > Hi, > I think this is huge performance bug. > I created parquet file partitioned by column. > Then I make query with filter over partition column and filter with UDF. > Result is that all partition are scanned. 
> Sample data: > {code} > rdd = sc.parallelize(range(0,1000)).map(lambda x: > Row(id=x%1000,value=x)).repartition(1) > df = sqlCtx.createDataFrame(rdd) > df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test') > df = sqlCtx.read.parquet('/mnt/mfs/udf_test') > df.registerTempTable('df') > {code} > Then queries: > Without udf - Spark reads only 10 partitions: > {code} > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and value > 10').count() > print(time.time() - start) > 0.9993703365325928 > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#22L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#25L]) >Project > Filter (value#5L > 10) > Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L] > {code} > With udf Spark reads all the partitions: > {code} > sqlCtx.registerFunction('multiply2', lambda x: x *2 ) > start = time.time() > sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > > 20').count() > print(time.time() - start) > 13.0826096534729 > == Physical Plan == > TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], > output=[count#34L]) > TungstenExchange SinglePartition > TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[currentCount#37L]) >TungstenProject > Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0)) > !BatchPythonEvaluation PythonUDF#multiply2(value#5L), > [value#5L,id#6,pythonUDF#33] > Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
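The difference between the two plans above comes down to whether the conjunctive WHERE clause is split so that the partition-column predicate can be pushed into partition pruning while the UDF predicate stays above the scan. A minimal pure-Python sketch of that splitting idea (hypothetical helper names and predicate representation, not Spark's actual optimizer code):

```python
# Illustrative sketch: split a conjunctive WHERE clause into predicates
# that reference only partition columns (eligible for partition pruning)
# and the rest (e.g. predicates involving a Python UDF), which must be
# evaluated after the scan. Representation here is ours, not Spark's.

def split_conjuncts(predicates, partition_cols):
    """predicates: list of dicts with the columns referenced and a UDF flag."""
    pushable, residual = [], []
    for pred in predicates:
        if not pred["uses_udf"] and set(pred["columns"]) <= set(partition_cols):
            pushable.append(pred)   # safe to push into partition pruning
        else:
            residual.append(pred)   # must run over the scanned rows
    return pushable, residual

# The query from the report: id >= 990 AND multiply2(value) > 20,
# where "id" is the partition column.
preds = [
    {"name": "id >= 990", "columns": ["id"], "uses_udf": False},
    {"name": "multiply2(value) > 20", "columns": ["value"], "uses_udf": True},
]
pushable, residual = split_conjuncts(preds, partition_cols=["id"])
```

With this split, only `id >= 990` participates in partition selection, so just the 10 matching partitions are read; the UDF predicate is applied afterwards, matching the 2.0 plan shown in the comment.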
[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin
[ https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231655#comment-15231655 ] Yan commented on SPARK-14389: - Actually, the current master branch does not have the issue, while 1.6.0 does. There appear to have been improvements to the BNL join since 1.6, SPARK-13213 in particular. > OOM during BroadcastNestedLoopJoin > -- > > Key: SPARK-14389 > URL: https://issues.apache.org/jira/browse/SPARK-14389 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: OS: Amazon Linux AMI 2015.09 > EMR: 4.3.0 > Hadoop: Amazon 2.7.1 > Spark 1.6.0 > Ganglia 3.7.2 > Master: m3.xlarge > Core: m3.xlarge > m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD >Reporter: Steve Johnston > Attachments: jps_command_results.txt, lineitem.tbl, plans.txt, > sample_script.py, stdout.txt > > > When executing attached sample_script.py in client mode with a single > executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", > during the self join of a small table, TPC-H lineitem generated for a 1M > dataset. Also see execution log stdout.txt attached.
[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin
[ https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231013#comment-15231013 ] Yan commented on SPARK-14389: - [~Steve Johnston] I guess the memory eater is probably not from the broadcast but from the persist() call on the cartesian joined dataframe. I tried on a single node using the significant configurations but did not get OOM. Can you run "jps -lvm |grep Submit" which will show your JVM flags used at runtime? > OOM during BroadcastNestedLoopJoin > -- > > Key: SPARK-14389 > URL: https://issues.apache.org/jira/browse/SPARK-14389 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: OS: Amazon Linux AMI 2015.09 > EMR: 4.3.0 > Hadoop: Amazon 2.7.1 > Spark 1.6.0 > Ganglia 3.7.2 > Master: m3.xlarge > Core: m3.xlarge > m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD >Reporter: Steve Johnston > Attachments: lineitem.tbl, plans.txt, sample_script.py, stdout.txt > > > When executing attached sample_script.py in client mode with a single > executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", > during the self join of a small table, TPC-H lineitem generated for a 1M > dataset. Also see execution log stdout.txt attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128829#comment-15128829 ] Yan commented on SPARK-12988: - My thinking is that projections should parse the column names, while the schema-based ops should keep the names as is. One thing I'm not sure about is "Column". Given its current capabilities, it seems to be meant for projections, so its name should be backticked if it contains a '.'. But please correct me if I'm wrong here. > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > Neither of these works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks).
[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots
[ https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127883#comment-15127883 ] Yan commented on SPARK-12988: - [~marmbrus] For the same reason that "`a.c` is an invalid column name. toDF(...) should not accept that", can we require that df.drop not take backticks either, since df.drop can only drop top-level columns? Programmatically it makes little difference, but it seems more consistent semantically. > Can't drop columns that contain dots > > > Key: SPARK-12988 > URL: https://issues.apache.org/jira/browse/SPARK-12988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Armbrust > > Neither of these works: > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("a.c").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > {code} > val df = Seq((1, 1)).toDF("a_b", "a.c") > df.drop("`a.c`").collect() > df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int] > {code} > Given that you can't use drop to drop subfields, it seems to me that we > should treat the column name literally (i.e. as though it is wrapped in back > ticks).
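The quoting convention being debated in these comments can be made concrete with a small sketch: a top-level column name containing a dot must be wrapped in backticks to be treated literally rather than as a field access like `a.c`. This is a hypothetical pure-Python helper illustrating the rule, not a Spark API:

```python
# Illustrative sketch of the backtick-quoting rule under discussion
# (hypothetical helper, not part of Spark): quote a top-level column
# name so that an embedded '.' is taken literally instead of being
# parsed as struct-field access.

def quote_if_needed(name):
    already_quoted = name.startswith("`") and name.endswith("`")
    if "." in name and not already_quoted:
        return "`" + name + "`"
    return name
```

Under this rule, `a_b` passes through unchanged while `a.c` becomes `` `a.c` ``; an already-quoted name is left alone.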
[jira] [Commented] (SPARK-12686) Support group-by push down into data sources
[ https://issues.apache.org/jira/browse/SPARK-12686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088343#comment-15088343 ] Yan commented on SPARK-12686: - SPARK-12449 seems to be a superset of this Jira. > Support group-by push down into data sources > > > Key: SPARK-12686 > URL: https://issues.apache.org/jira/browse/SPARK-12686 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro > > As for logical plan nodes like 'Aggregate -> Project -> (Filter) -> Scan', we > can push down partial aggregation processing into data sources that could > aggregate their own data efficiently because Orc/Parquet could fetch the > MIN/MAX value by using statistics data and some databases have efficient > aggregation implementations.
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088334#comment-15088334 ] Yan commented on SPARK-12449: - Stephan, thanks for your explanations and questions. My answers are as follows: 1) This is actually one point of having a "physical plan pruning interface" as part of the DataSource interface. From just a logical plan, it would probably be hard to take advantage of the data distribution info that Spark SQL is actually capable of using. Another advantage of a pluggable physical plan pruner is the flexibility of making use of data sources' various capabilities, including partial aggregation, some types of predicate/expression evaluation, ..., etc. We felt the pain of the lack of such a "physical plan pruner" in developing the Astro project (http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase), which forced us to use a separate SQLContext to incorporate many advanced planning optimizations for HBase. In fact, the current datasource API already supports predicate pruning in the *physical* plan, in a limited way, through the "unhandledFilters" method. 2) I don't think current Spark SQL planning has that capability, and the "plan later" mechanism is for a different purpose. 3) Right, this is just a bit more detail on 2). The idea is the same: physical plan pruning. The point seems to be: the question of logical plan pruning vs. physical plan pruning is really a question of which types of data source capabilities are valued here, physical or logical. My take is physical, given Spark's powerful capabilities in dataset/dataframe/SQL. In fact, the "isMultiplePartitionExecution" field of the proposed "CatalystSource" interface, if true, signifies a willingness to leave some *physical* operations such as shuffling to the Spark engine. Logical plan pruning might make more sense for a SQL federation engine, but Spark is much more capable than a federated engine, I guess.
Admittedly, the stability and complexity of such an interface will be a big issue, as pointed out by Reynold. I'll just keep my eyes open for any progress/ideas/topics in this field. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details.
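The "unhandledFilters"-style contract mentioned in the comment above can be sketched in a few lines: the data source is offered the pushed filters and reports back which ones it cannot evaluate, so the engine re-applies only those. This is a pure-Python mockup for illustration (the real method lives on Spark's BaseRelation and works over Filter objects, not strings):

```python
# Illustrative sketch of an unhandledFilters-style contract
# (mock names, not Spark's actual classes): the relation advertises
# which pushed filters it fully handles; the engine keeps a residual
# Filter node only for the ones the source cannot evaluate.

class MockRelation:
    def __init__(self, supported):
        self.supported = set(supported)

    def unhandled_filters(self, filters):
        # Return the filters the source cannot evaluate itself.
        return [f for f in filters if f not in self.supported]

rel = MockRelation(supported={"id >= 990"})
residual = rel.unhandled_filters(["id >= 990", "udf(value) > 20"])
```

This is the limited form of physical-plan pruning the comment refers to: the split happens per filter, rather than over an arbitrary plan fragment as the proposed CatalystSource would allow.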
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087983#comment-15087983 ] Yan commented on SPARK-12449: - Stephan, By "partial op" I mean, for instance, partial map-side aggregation. There is also a Jira (SPARK-12686) that seems to echo this scenario as well. Spark-10978 deals with predicate pushdown that used to get double evaluated, which is not related to the discussion of logical plan vs physical plan push down. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070088#comment-15070088 ] Yan commented on SPARK-12449: - To push down map-side (partial) ops, Physical Plan would need to be checked I guess. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070135#comment-15070135 ] Yan commented on SPARK-12449: - Conceivably, if only logical plan is used, the Spark SQL execution would probably redo the already-pushed-down partial op, which would amount to a no-op. It might work but just not be super efficient. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
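The "amounts to a no-op" observation above holds for aggregates whose partial op is idempotent. A small pure-Python sketch (our own helper names, not Spark internals), using MAX: applying the partial op again over already partially aggregated groups produces the same values, so redoing a pushed-down partial MAX wastes a pass but does not change the result.

```python
# Illustrative sketch: redoing a pushed-down *partial* aggregate can be
# a harmless no-op. For an idempotent function like MAX, the partial op
# applied to already partially aggregated (key, value) rows yields the
# same rows; only the final merge still does real work.

def partial_max(rows):
    groups = {}
    for key, value in rows:
        groups[key] = max(value, groups.get(key, value))
    return sorted(groups.items())

raw = [("a", 1), ("a", 5), ("b", 3), ("b", 2)]
once = partial_max(raw)     # what the data source might push down
twice = partial_max(once)   # Spark redoing the partial op: no change
```

Note this idempotence does not hold for every aggregate: a partial COUNT, for example, must be merged with a SUM rather than re-counted, which is one reason a physical-plan-aware interface would help.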
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068447#comment-15068447 ] Yan commented on SPARK-12449: - A few thoughts on the capabilities of this "CatalystSource" interface: 1) provide data source partition info given a filtering predicate. Holistic/partitioned execution could also be (partially) controlled by this output. It would make partition pruning pluggable; 2) have an interface to transform a portion of a physical plan into a "pushed down" plan, plus a "left-over" plan for execution inside Spark. Spark planning may need to carve the portion out of the original plan so that the portion only involves the same data source, leaving execution over data from across different data sources to Spark. This puts the decision of what portion of the plan can be pushed down in the hands of the data sources. In particular, pushdown of either a whole SQL query or just map-side executions could be supported. 3) When carving out the portion of the plan, Spark planning can start from the SCAN and move "downstream", and may (optionally?) want to stop the branch at any intermediate DFs that are to be cached or persisted, so as to honor Spark's execution at no extra cost. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates.
This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7825) Poor performance in Cross Product due to no combine operations for small files.
[ https://issues.apache.org/jira/browse/SPARK-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tang Yan updated SPARK-7825: Affects Version/s: (was: 1.3.1) (was: 1.2.2) (was: 1.2.1) (was: 1.3.0) (was: 1.2.0) > Poor performance in Cross Product due to no combine operations for small > files. > --- > > Key: SPARK-7825 > URL: https://issues.apache.org/jira/browse/SPARK-7825 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Tang Yan > > Dealing with Cross Product, if one table has many small files, spark sql > has to handle so many tasks which will lead to poor performance, while Hive > has a CombineHiveInputFormat which can combine small files to decrease the > task number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7825) Poor performance in Cross Product due to no combine operations for small files.
Tang Yan created SPARK-7825: --- Summary: Poor performance in Cross Product due to no combine operations for small files. Key: SPARK-7825 URL: https://issues.apache.org/jira/browse/SPARK-7825 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1, 1.3.0, 1.2.2, 1.2.1, 1.2.0 Reporter: Tang Yan When dealing with a Cross Product, if one table has many small files, Spark SQL has to handle so many tasks that performance suffers, while Hive has a CombineHiveInputFormat which can combine small files to decrease the task count.
[jira] [Updated] (SPARK-7270) StringType dynamic partition cast to DecimalType in Spark Sql Hive
[ https://issues.apache.org/jira/browse/SPARK-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feixiang Yan updated SPARK-7270: Description: Create a hive table with two partitons,the first type is bigint and the second type is string.When insert overwrite the table with one static partiton and one dynamic partiton, the second StringType dynamic partition will be cast to DecimalType. {noformat} desc test; OK a string None b bigint None c string None # Partition Information # col_name data_type comment b bigint None c string None· {noformat} when run following hive sql in HiveContext {noformat}sqlContext.sql(insert overwrite table test partition (b=1,c) select 'a','c' from ptest){noformat} get the result of partition is {noformat}test[1,__HIVE_DEFAULT_PARTITION__]{noformat} spark log {noformat}15/04/30 10:38:09 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead 15/04/30 10:38:09 INFO ParseDriver: Parsing command: insert overwrite table test partition (b=1,c) select 'a','c' from ptest 15/04/30 10:38:09 INFO ParseDriver: Parse Completed 15/04/30 10:38:09 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead 15/04/30 10:38:10 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/04/30 10:38:10 INFO ObjectStore: ObjectStore, initialize called 15/04/30 10:38:10 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/04/30 10:38:10 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 15/04/30 10:38:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/04/30 10:38:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/04/30 10:38:11 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. 
Use hive.hmshandler.retry.* instead 15/04/30 10:38:11 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/04/30 10:38:11 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/04/30 10:38:11 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/04/30 10:38:12 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/04/30 10:38:12 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/04/30 10:38:12 INFO Query: Reading in results for query org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is closing 15/04/30 10:38:12 INFO ObjectStore: Initialized ObjectStore 15/04/30 10:38:12 INFO HiveMetaStore: Added admin role in metastore 15/04/30 10:38:12 INFO HiveMetaStore: Added public role in metastore 15/04/30 10:38:12 INFO HiveMetaStore: No user is added in admin role, since config is empty 15/04/30 10:38:12 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 
15/04/30 10:38:13 INFO HiveMetaStore: 0: get_table : db=default tbl=test 15/04/30 10:38:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=test 15/04/30 10:38:13 INFO HiveMetaStore: 0: get_partitions : db=default tbl=test 15/04/30 10:38:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions : db=default tbl=test 15/04/30 10:38:13 INFO HiveMetaStore: 0: get_table : db=default tbl=ptest 15/04/30 10:38:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=ptest 15/04/30 10:38:13 INFO HiveMetaStore: 0: get_partitions : db=default tbl=ptest 15/04/30 10:38:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions : db=default tbl=ptest 15/04/30 10:38:13 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 15/04/30 10:38:13 INFO MemoryStore: ensureFreeSpace(451930) called with curMem=0, maxMem=2291041566 15/04/30 10:38:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 441.3 KB, free 2.1 GB) 15/04/30 10:38:13 INFO MemoryStore: ensureFreeSpace(71321)
[jira] [Created] (SPARK-7270) StringType dynamic partition cast to DecimalType in Spark Sql Hive
Feixiang Yan created SPARK-7270: --- Summary: StringType dynamic partition cast to DecimalType in Spark Sql Hive Key: SPARK-7270 URL: https://issues.apache.org/jira/browse/SPARK-7270 Project: Spark Issue Type: Bug Components: SQL Reporter: Feixiang Yan Create a Hive table with two partitions, the first of type bigint and the second of type string. When inserting overwrite into the table with one static partition and one dynamic partition, the second (StringType) dynamic partition will be cast to DecimalType. {noformat} desc test; OK a string None b bigint None c string None # Partition Information # col_name data_type comment b bigint None c string None {noformat} When running the following Hive SQL in HiveContext {noformat}sqlContext.sql("insert overwrite table test partition (b=1,c) select 'a','c' from ptest"){noformat} the resulting partition is {noformat}test[1,__HIVE_DEFAULT_PARTITION__]{noformat} Spark log: {noformat}15/04/30 10:38:09 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead 15/04/30 10:38:09 INFO ParseDriver: Parsing command: insert overwrite table test partition (b=1,c) select 'a','c' from ptest 15/04/30 10:38:09 INFO ParseDriver: Parse Completed 15/04/30 10:38:09 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. 
Use hive.hmshandler.retry.* instead 15/04/30 10:38:10 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/04/30 10:38:10 INFO ObjectStore: ObjectStore, initialize called 15/04/30 10:38:10 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/04/30 10:38:10 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 15/04/30 10:38:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/04/30 10:38:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/04/30 10:38:11 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead 15/04/30 10:38:11 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/04/30 10:38:11 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/04/30 10:38:11 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/04/30 10:38:12 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/04/30 10:38:12 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 
15/04/30 10:38:12 INFO Query: Reading in results for query org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is closing 15/04/30 10:38:12 INFO ObjectStore: Initialized ObjectStore 15/04/30 10:38:12 INFO HiveMetaStore: Added admin role in metastore 15/04/30 10:38:12 INFO HiveMetaStore: Added public role in metastore 15/04/30 10:38:12 INFO HiveMetaStore: No user is added in admin role, since config is empty 15/04/30 10:38:12 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 15/04/30 10:38:13 INFO HiveMetaStore: 0: get_table : db=default tbl=test 15/04/30 10:38:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=test 15/04/30 10:38:13 INFO HiveMetaStore: 0: get_partitions : db=default tbl=test 15/04/30 10:38:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions : db=default tbl=test 15/04/30 10:38:13 INFO HiveMetaStore: 0: get_table : db=default tbl=ptest 15/04/30 10:38:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=ptest 15/04/30 10:38:13 INFO HiveMetaStore: 0: get_partitions : db=default tbl=ptest 15/04/30 10:38:13 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions : db=default tbl=ptest 15/04/30 10:38:13 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps 15/04/30 10:38:13 INFO MemoryStore: ensureFreeSpace(451930) called with curMem=0, maxMem=2291041566 15/04/30
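The mechanism behind the bad partition value can be illustrated outside Spark: casting a non-numeric string to a decimal type yields null, and Hive substitutes __HIVE_DEFAULT_PARTITION__ for null dynamic partition values. A minimal sketch in plain Java (not Spark code; the helper names are invented for illustration):

```java
import java.math.BigDecimal;

public class PartitionCastSketch {
    // Hive's placeholder for null/empty dynamic partition values.
    static final String DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__";

    // Mimics a cast of a string partition value to DecimalType:
    // a non-numeric string cannot be parsed, so the cast produces null.
    static BigDecimal castToDecimal(String value) {
        try {
            return new BigDecimal(value);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // A null dynamic partition value is replaced by the default partition name.
    static String partitionName(BigDecimal value) {
        return value == null ? DEFAULT_PARTITION : value.toPlainString();
    }

    public static void main(String[] args) {
        // 'c' is the StringType dynamic partition value from the report:
        // the wrongly inferred DecimalType turns it into the default partition.
        System.out.println(partitionName(castToDecimal("c")));
        // A numeric string would survive the (still wrong) cast.
        System.out.println(partitionName(castToDecimal("42")));
    }
}
```

This matches the observed result test[1,__HIVE_DEFAULT_PARTITION__]: the dynamic partition value 'c' is lost once the partition column's type is resolved as DecimalType instead of StringType.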
[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378073#comment-14378073 ] Yan commented on SPARK-3306: If by global singleton object you meant it to be in the Executor class, it'll have to be supported by the Executor. Besides, one application may need to use multiple external resources. If you meant it to be supplied by the application, my understanding is, correct me if I am wrong, that an application can currently only submit tasks to an executor along with some static resources like jar files that can be shared between different tasks. The need here is to have a hook so an app can specify the connection behavior, while executors use the hook, if any, to initialize/cache/fetch-from-cache/terminate/show the external resources. In summary, there will be a pool. The question is whether an application, which is very task-oriented except for static external resource usage like jars, has the capability to manage the lifecycles of these cross-task external resources. We have an initial implementation in https://github.com/Huawei-Spark/spark/tree/SPARK-3306. Please feel free to take a look and share your advice. Note that this is not a complete implementation, but for experimental purposes it works at least for JDBC connections. Addition of external resource dependency in executors - Key: SPARK-3306 URL: https://issues.apache.org/jira/browse/SPARK-3306 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yan Currently, Spark executors only support static and read-only external resources of side files and jar files. With emerging disparate data sources, there is a need to support more versatile external resources, such as connections to data sources, to facilitate efficient data access to the sources. 
For one, the JDBCRDD, with some modifications, could benefit from this feature by reusing established JDBC connections from the same Spark context before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
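The executor-side pooling discussed in this thread can be sketched as a per-process singleton cache keyed by resource identifier, with the app-supplied hook used to initialize a resource on first access. All names here are hypothetical illustrations, not Spark APIs:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical per-executor resource pool: one JVM-wide cache shared by all
// tasks running in the same executor process, so a DB connection opened by
// one task can be reused by the next instead of reconnecting per task.
public class ExternalResourcePool {
    private static final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Fetch-from-cache, or initialize via the app-supplied hook on first use.
    public static Object getOrCreate(String key, Supplier<Object> init) {
        return cache.computeIfAbsent(key, k -> init.get());
    }

    // The "show/list" capability mentioned in the thread.
    public static Set<String> list() {
        return cache.keySet();
    }

    // Close/cleanup hook: remove and return the resource so the caller can close it.
    public static Object release(String key) {
        return cache.remove(key);
    }

    public static void main(String[] args) {
        Object first = getOrCreate("jdbc:demo", Object::new);
        Object second = getOrCreate("jdbc:demo", Object::new);
        // The second "task" gets the same cached handle back.
        System.out.println(first == second);
    }
}
```

This is only a sketch of the lifecycle (initialize/cache/fetch-from-cache/terminate/show) described above; the real proposal also has to address serialization of the hook from driver to executor and cleanup on executor shutdown.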
[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377414#comment-14377414 ] Yan commented on SPARK-3306: The external resource will primarily serve the purpose of reusing such a resource across different tasks on the same executor, such as a DB connection, to minimize the latency of reconnecting per task. It will differ from the existing static resources like jar files or other files in that the handles or identifiers have to be kept in memory, and the executor process has to provide the access mechanism to its tasks. The current static resources pose no problem because they use disk locations to identify themselves and the tasks have no difficulty accessing them from disk. All of this is dynamic in nature and much more complex than jars/files, so the executors, I feel, will need to be modified/enhanced. I have not found much time for this as promised due to other Spark SQL work. Hopefully I can give more concrete details for discussion soon. Addition of external resource dependency in executors - Key: SPARK-3306 URL: https://issues.apache.org/jira/browse/SPARK-3306 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yan Currently, Spark executors only support static and read-only external resources of side files and jar files. With emerging disparate data sources, there is a need to support more versatile external resources, such as connections to data sources, to facilitate efficient data access to the sources. For one, the JDBCRDD, with some modifications, could benefit from this feature by reusing established JDBC connections from the same Spark context before.
[jira] [Created] (SPARK-5614) Predicate pushdown through Generate
Lu Yan created SPARK-5614: - Summary: Predicate pushdown through Generate Key: SPARK-5614 URL: https://issues.apache.org/jira/browse/SPARK-5614 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Lu Yan Currently in Catalyst's rules, predicates cannot be pushed through Generate nodes. Furthermore, partition pruning in HiveTableScan cannot be applied to queries involving Generate. This makes such queries very inefficient. For example, the physical plan for the query {quote} select len, bk from s_server lateral view explode(len_arr) len_table as len where len > 5 and day = '20150102'; {quote} where 'day' is a partition column in the metastore, looks like this in the current version of Spark SQL: {quote} Project [len, bk] Filter ((len > 5) && (event_day = 20150102)) Generate explode(len_arr), true, false HiveTableScan [bk, len_arr, day], (MetastoreRelation pblog.audio, audio_central_ts_server, None), None {quote} But theoretically the plan should be like this: {quote} Project [len, bk] Filter (len > 5) Generate explode(len_arr), true, false HiveTableScan [bk, len_arr, day], (MetastoreRelation pblog.audio, audio_central_ts_server, None), Some(event_day = 20150102) {quote} where partition pruning predicates can be pushed down to HiveTableScan nodes. I've developed a solution for this issue. If you guys do not have a plan for this already, I could merge the solution back to master. There is also a problem with column pruning for Generate; I will file another issue about that.
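The rule described in this issue amounts to splitting a Filter's conjuncts into those that reference only the Generate's child output (pushable below the Generate, and usable for partition pruning) and those that reference columns produced by the generator (which must stay above it). A simplified, Spark-free illustration; the predicate/column-set representation is invented for the example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified predicate pushdown through Generate: a conjunct can be pushed
// below the Generate only if it references none of the generated columns.
public class PushThroughGenerate {
    record Predicate(String text, Set<String> referencedColumns) {}

    // Partition conjuncts into pushable (key=true) and retained (key=false).
    static Map<Boolean, List<Predicate>> split(List<Predicate> conjuncts,
                                               Set<String> generatedColumns) {
        Map<Boolean, List<Predicate>> out = new HashMap<>();
        out.put(true, new ArrayList<>());
        out.put(false, new ArrayList<>());
        for (Predicate p : conjuncts) {
            boolean pushable =
                Collections.disjoint(p.referencedColumns(), generatedColumns);
            out.get(pushable).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        // From the example plan: 'len' is produced by explode(len_arr),
        // 'day' is a partition column of the underlying table.
        List<Predicate> conjuncts = List.of(
            new Predicate("len > 5", Set.of("len")),
            new Predicate("day = '20150102'", Set.of("day")));
        Map<Boolean, List<Predicate>> parts = split(conjuncts, Set.of("len"));
        System.out.println("pushed below Generate: " + parts.get(true));
        System.out.println("kept above Generate:   " + parts.get(false));
    }
}
```

In the real Catalyst rule the pushed conjuncts would then reach the HiveTableScan, where the partition predicate on 'day' becomes the Some(...) pruning argument shown in the desired plan.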
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: SparkSQLOnHBase_v2.0.docx HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.0.docx
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: (was: SparkSQLOnHBase_v2.docx) HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: SparkSQLOnHBase_v2.docx Version 2 HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.docx
[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263914#comment-14263914 ] Yan commented on SPARK-3306: Yes, close/cleanup will be supported, plus a show/list capability. Expect a design doc in a few weeks. Addition of external resource dependency in executors - Key: SPARK-3306 URL: https://issues.apache.org/jira/browse/SPARK-3306 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yan Currently, Spark executors only support static and read-only external resources of side files and jar files. With emerging disparate data sources, there is a need to support more versatile external resources, such as connections to data sources, to facilitate efficient data accesses to the sources. For one, the JDBCRDD, with some modifications, could benefit from this feature by reusing established JDBC connections from the same Spark context before.
[jira] [Commented] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167006#comment-14167006 ] Yan commented on SPARK-3880: The new context is intended to be very lightweight. We noticed that SparkSQL is a very active project and there have been talks/Jiras about SQLContext and data sources. As mentioned in the design, we are aware of the PR and of the need for a universal mechanism to access different types of data stores; we will keep a close watch on the latest movements and will definitely fit our efforts to those latest features and interfaces when they are ready and reasonably stable. In the meanwhile, the design is intended to be heavy on HBase-specific data models, data access mechanisms and query optimizations, and to keep the integration part lightweight so it can be easily adjusted to future changes. The point is that we need to find some compromise between a rapidly changing project and the need to have a more or less stable context to base a new feature on. Chasing a constantly moving target is never easy, I guess. HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx
[jira] [Created] (SPARK-3880) HBase as data source to SparkSQL
Yan created SPARK-3880: -- Summary: HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Reporter: Yan Fix For: 1.3.0
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Component/s: SQL Fix Version/s: (was: 1.3.0) HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan
[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan updated SPARK-3880: --- Attachment: HBaseOnSpark.docx Design Document HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Attachments: HBaseOnSpark.docx
[jira] [Created] (SPARK-3306) Addition of external resource dependency in executors
Yan created SPARK-3306: -- Summary: Addition of external resource dependency in executors Key: SPARK-3306 URL: https://issues.apache.org/jira/browse/SPARK-3306 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Yan Currently, Spark executors only support static and read-only external resources of side files and jar files. With emerging disparate data sources, there is a need to support more versatile external resources, such as connections to data sources, to facilitate efficient data access to the sources. For one, the JDBCRDD, with some modifications, could benefit from this feature by reusing established JDBC connections from the same Spark context before.