[jira] [Commented] (SPARK-47287) Aggregate in not causes

2024-03-27 Thread Juefei Yan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831625#comment-17831625
 ] 

Juefei Yan commented on SPARK-47287:


I tried the code on the 3.4 branch and cannot reproduce this problem.

> Aggregate in not causes 
> 
>
> Key: SPARK-47287
> URL: https://issues.apache.org/jira/browse/SPARK-47287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Ted Chester Jenks
>Priority: Major
>
>  
> The snippet below is confirmed working with Spark 3.2.1 and broken on Spark 
> 3.4.1. I believe this is a bug.
> {code:java}
>Dataset ds = dummyDataset
> .withColumn("flag", 
> functions.not(functions.coalesce(functions.col("bool1"), 
> functions.lit(false)).equalTo(true)))
> .groupBy("code")
> .agg(functions.max(functions.col("flag")).alias("flag"));
> ds.show(); {code}
> It fails with:
> {code:java}
> Caused by: java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:208)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.$anonfun$generateExpression$7(V2ExpressionBuilder.scala:185)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:184)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateAggregateFunc(V2ExpressionBuilder.scala:293)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.generateExpression(V2ExpressionBuilder.scala:98)
>   at 
> org.apache.spark.sql.catalyst.util.V2ExpressionBuilder.build(V2ExpressionBuilder.scala:33)
>   at 
> org.apache.spark.sql.execution.datasources.PushableExpression$.unapply(DataSourceStrategy.scala:803)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.translate$1(DataSourceStrategy.scala:700){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-43251) Assign a name to the error class _LEGACY_ERROR_TEMP_2015

2023-05-09 Thread Liang Yan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-43251 ]


Liang Yan deleted comment on SPARK-43251:
---

was (Author: JIRAUSER299156):
I will work on this issue.

> Assign a name to the error class _LEGACY_ERROR_TEMP_2015
> 
>
> Key: SPARK-43251
> URL: https://issues.apache.org/jira/browse/SPARK-43251
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2015* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the examples in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> already exist. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the meaningful error fields and avoids depending on the 
> error message text, so tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear, and propose a way for users to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
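
For illustration only, a minimal sketch of the kind of test this ticket asks for, assuming a test suite with a SparkSession named spark and Spark's checkError() helper (from SparkFunSuite) available; the new error class name, the triggering query, and the parameters are hypothetical placeholders, not the real fields behind _LEGACY_ERROR_TEMP_2015:

{code:scala}
// Hypothetical sketch: trigger the error from user code and verify its fields
// with checkError() instead of asserting on the full message text.
test("renamed legacy error class is raised with the expected fields") {
  val e = intercept[org.apache.spark.SparkException] {
    spark.sql("SELECT some_expression_that_triggers_the_error()").collect()
  }
  checkError(
    exception = e,
    errorClass = "NEW_ERROR_CLASS_NAME",           // placeholder name
    parameters = Map("someField" -> "someValue"))  // placeholder parameters
}
{code}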



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43251) Assign a name to the error class _LEGACY_ERROR_TEMP_2015

2023-05-06 Thread Liang Yan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720107#comment-17720107
 ] 

Liang Yan commented on SPARK-43251:
---

I will work on this issue.

> Assign a name to the error class _LEGACY_ERROR_TEMP_2015
> 
>
> Key: SPARK-43251
> URL: https://issues.apache.org/jira/browse/SPARK-43251
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2015* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the examples in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> already exist. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the meaningful error fields and avoids depending on the 
> error message text, so tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear, and propose a way for users to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42844) Assign a name to the error class _LEGACY_ERROR_TEMP_2008

2023-04-06 Thread Liang Yan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17709235#comment-17709235
 ] 

Liang Yan commented on SPARK-42844:
---

[~maxgekk], I have added a PR.

> Assign a name to the error class _LEGACY_ERROR_TEMP_2008
> 
>
> Key: SPARK-42844
> URL: https://issues.apache.org/jira/browse/SPARK-42844
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2008* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the examples in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> already exist. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the meaningful error fields and avoids depending on the 
> error message text, so tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear, and propose a way for users to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42711) build/sbt usage error messages and shellcheck warn/error

2023-03-13 Thread Liang Yan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Yan resolved SPARK-42711.
---
Resolution: Not A Problem

The original code is just a copy of upstream, and the changes do not fix an 
actual problem.

> build/sbt usage error messages and shellcheck warn/error
> 
>
> Key: SPARK-42711
> URL: https://issues.apache.org/jira/browse/SPARK-42711
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.2
>Reporter: Liang Yan
>Priority: Minor
>
> The build/sbt tool's usage information has some missing content:
>  
> {code:java}
> (base) spark% ./build/sbt -help
> Usage:  [options]
>   -h | -help print this message
>   -v | -verbose  this runner is chattier
> {code}
> There are also some shellcheck warnings and errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-42711) build/sbt usage error messages and shellcheck warn/error

2023-03-13 Thread Liang Yan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Yan closed SPARK-42711.
-

> build/sbt usage error messages and shellcheck warn/error
> 
>
> Key: SPARK-42711
> URL: https://issues.apache.org/jira/browse/SPARK-42711
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.2
>Reporter: Liang Yan
>Priority: Minor
>
> The build/sbt tool's usage information has some missing content:
>  
> {code:java}
> (base) spark% ./build/sbt -help
> Usage:  [options]
>   -h | -help print this message
>   -v | -verbose  this runner is chattier
> {code}
> There are also some shellcheck warnings and errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42711) build/sbt usage error messages and shellcheck warn/error

2023-03-13 Thread Liang Yan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Yan updated SPARK-42711:
--
Description: 
The build/sbt tool's usage information has some missing content:
 
{code:java}
(base) spark% ./build/sbt -help
Usage:  [options]

  -h | -help print this message
  -v | -verbose  this runner is chattier
{code}

There are also some shellcheck warnings and errors.

  was:
The build/sbt tool's usage information about java-home is wrong:

  # java version (default: java from PATH, currently $(java -version 2>&1 | 
grep version))

  -java-home          alternate JAVA_HOME


> build/sbt usage error messages and shellcheck warn/error
> 
>
> Key: SPARK-42711
> URL: https://issues.apache.org/jira/browse/SPARK-42711
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.2
>Reporter: Liang Yan
>Priority: Minor
>
> The build/sbt tool's usage information has some missing content:
>  
> {code:java}
> (base) spark% ./build/sbt -help
> Usage:  [options]
>   -h | -help print this message
>   -v | -verbose  this runner is chattier
> {code}
> There are also some shellcheck warnings and errors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42711) build/sbt usage error messages and shellcheck warn/error

2023-03-13 Thread Liang Yan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Yan updated SPARK-42711:
--
Summary: build/sbt usage error messages and shellcheck warn/error  (was: 
build/sbt usage error messages about java-home)

> build/sbt usage error messages and shellcheck warn/error
> 
>
> Key: SPARK-42711
> URL: https://issues.apache.org/jira/browse/SPARK-42711
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.2
>Reporter: Liang Yan
>Priority: Minor
>
> The build/sbt tool's usage information about java-home is wrong:
>   # java version (default: java from PATH, currently $(java -version 2>&1 | 
> grep version))
>   -java-home          alternate JAVA_HOME



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42711) build/sbt usage error messages about java-home

2023-03-07 Thread Liang Yan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697745#comment-17697745
 ] 

Liang Yan commented on SPARK-42711:
---

I am preparing my PR for this issue. Please assign it to me.

> build/sbt usage error messages about java-home
> --
>
> Key: SPARK-42711
> URL: https://issues.apache.org/jira/browse/SPARK-42711
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.2
>Reporter: Liang Yan
>Priority: Minor
>
> The build/sbt tool's usage information about java-home is wrong:
>   # java version (default: java from PATH, currently $(java -version 2>&1 | 
> grep version))
>   -java-home          alternate JAVA_HOME



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42711) build/sbt usage error messages about java-home

2023-03-07 Thread Liang Yan (Jira)
Liang Yan created SPARK-42711:
-

 Summary: build/sbt usage error messages about java-home
 Key: SPARK-42711
 URL: https://issues.apache.org/jira/browse/SPARK-42711
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.3.2
Reporter: Liang Yan


The build/sbt tool's usage information about java-home is wrong:

  # java version (default: java from PATH, currently $(java -version 2>&1 | 
grep version))

  -java-home          alternate JAVA_HOME



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42395) The code logic of the configmap max size validation lacks extra content

2023-02-10 Thread Wei Yan (Jira)
Wei Yan created SPARK-42395:
---

 Summary: The code logic of the configmap max size validation lacks 
extra content
 Key: SPARK-42395
 URL: https://issues.apache.org/jira/browse/SPARK-42395
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.5.0
Reporter: Wei Yan
 Fix For: 3.3.1


In each configmap, Spark adds extra content in a fixed format. This extra 
content of the configmap is as below:
  spark.kubernetes.namespace: default
  spark.properties: |
    #Java properties built from Kubernetes config map with name: 
spark-exec-b47b438630eec12d-conf-map
    #Wed Feb 08 20:10:19 CST 2023
    spark.kubernetes.namespace=default

But the max-size validation logic does not take this part into account.
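
As a rough, self-contained sketch of the point being made here (this is not Spark's actual validation code; the generated header format is copied from the example above and the 1048576-byte limit is just the Kubernetes/etcd limit mentioned in the related reports):

{code:scala}
// Hypothetical sketch: estimate the final ConfigMap payload size including the
// extra generated content (namespace entry plus the spark.properties header),
// not only the user-supplied properties, before comparing against the limit.
object ConfigMapSizeCheck {
  val MaxSizeBytes: Long = 1048576L  // assumed server-side limit

  def estimateSize(userProps: Map[String, String], namespace: String): Long = {
    val header =
      s"""spark.kubernetes.namespace: $namespace
         |spark.properties: |
         |  #Java properties built from Kubernetes config map
         |  #<timestamp>
         |  spark.kubernetes.namespace=$namespace
         |""".stripMargin
    val body = userProps.map { case (k, v) => s"  $k=$v" }.mkString("\n")
    (header + body).getBytes("UTF-8").length.toLong
  }

  def validate(userProps: Map[String, String], namespace: String): Unit = {
    val size = estimateSize(userProps, namespace)
    require(size <= MaxSizeBytes,
      s"Estimated ConfigMap size $size exceeds the limit of $MaxSizeBytes bytes")
  }
}
{code}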



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42344) The default size of the CONFIG_MAP_MAXSIZE should not be greater than 1048576

2023-02-03 Thread Wei Yan (Jira)
Wei Yan created SPARK-42344:
---

 Summary: The default size of the CONFIG_MAP_MAXSIZE should not be 
greater than 1048576
 Key: SPARK-42344
 URL: https://issues.apache.org/jira/browse/SPARK-42344
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Spark Submit
Affects Versions: 3.3.1
 Environment: Kubernetes: 1.22.0

ETCD: 3.5.0

Spark: 3.3.2
Reporter: Wei Yan
 Fix For: 3.5.0


Exception in thread "main" 
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https://172.18.123.24:6443/api/v1/namespaces/default/configmaps. Message: 
ConfigMap "spark-exec-ed9f2c861aa40b48-conf-map" is invalid: []: Too long: must 
have at most 1048576 bytes. Received status: Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have 
at most 1048576 bytes, reason=FieldValueTooLong, additionalProperties={})], 
group=null, kind=ConfigMap, name=spark-exec-ed9f2c861aa40b48-conf-map, 
retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
message=ConfigMap "spark-exec-ed9f2c861aa40b48-conf-map" is invalid: []: Too 
long: must have at most 1048576 bytes, metadata=ListMeta(_continue=null, 
remainingItemCount=null, resourceVersion=null, selfLink=null, 
additionalProperties={}), reason=Invalid, status=Failure, 
additionalProperties={}).
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:305)
        at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644)
        at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83)
        at 
io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61)
        at 
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.setUpExecutorConfigMap(KubernetesClusterSchedulerBackend.scala:88)
        at 
org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:112)
        at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:222)
        at org.apache.spark.SparkContext.(SparkContext.scala:595)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2714)
        at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953)
        at scala.Option.getOrElse(Option.scala:189)
        at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
        at org.apache.spark.examples.JavaSparkPi.main(JavaSparkPi.java:37)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
        at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
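
If the client-side check allows more than the server accepts, one mitigation sketch (assuming the spark.kubernetes.configMap.maxSize setting exists in the Spark version in use, which should be verified against its documentation) is to cap it explicitly at the 1048576-byte limit reported above:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: align the client-side ConfigMap size limit with the
// Kubernetes/etcd limit of 1048576 bytes.
val spark = SparkSession.builder()
  .appName("configmap-size-cap")
  .config("spark.kubernetes.configMap.maxSize", "1048576")
  .getOrCreate()
{code}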



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

2019-10-23 Thread carlos yan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957623#comment-16957623
 ] 

carlos yan commented on SPARK-24666:


I also hit this issue, and my Spark version is 2.1.0. I used about 10 million 
records for training, with a vocabulary of about 1 million words. When 
numIterations > 10, the generated vectors contain *infinity* and *NaN*.

> Word2Vec generate infinity vectors when numIterations are large
> ---
>
> Key: SPARK-24666
> URL: https://issues.apache.org/jira/browse/SPARK-24666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.1
> Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X
>Reporter: ZhongYu
>Priority: Critical
>
> We found that Word2Vec generates vectors with large absolute values when 
> numIterations is large, and if numIterations is large enough (>20), the 
> vector values may be *infinity* (or *-infinity*), resulting in useless 
> vectors.
> In normal situations, vector values are mainly around -1.0~1.0 when 
> numIterations = 1.
> The bug shows up on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X.
> There is already an issue reporting this bug: 
> https://issues.apache.org/jira/browse/SPARK-5261 , but the bug fix 
> seems to be missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
>  
>  
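
A minimal sketch of the kind of setup being described (toy corpus only; sc is an existing SparkContext, and whether the blow-up reproduces depends on the real corpus size and iteration count):

{code:scala}
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

// Hypothetical sketch: train Word2Vec with a large numIterations and check the
// resulting vectors for NaN/Infinity values, as described in this report.
val corpus: RDD[Seq[String]] =
  sc.parallelize(Seq.fill(1000)(Seq("spark", "word2vec", "vectors", "test")))
val model = new Word2Vec()
  .setVectorSize(50)
  .setNumIterations(30)  // large iteration count, as in the report
  .fit(corpus)
val hasBadValues =
  model.getVectors.values.exists(_.exists(v => v.isNaN || v.isInfinite))
println(s"Vectors contain NaN/Infinity: $hasBadValues")
{code}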



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27894) PySpark streaming transform RDD join not works when checkpoint enabled

2019-05-31 Thread Jeffrey(Xilang) Yan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey(Xilang) Yan updated SPARK-27894:

Description: 
In PySpark Streaming, if checkpointing is enabled and there is a transform-join 
operation, an error is thrown.
{code:java}
sc=SparkContext(appName='')
sc.setLogLevel("WARN")
ssc=StreamingContext(sc,10)
ssc.checkpoint("hdfs:///test")

kafka_bootstrap_servers=""
topics = ['', '']

doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
kvds=KafkaUtils.createDirectStream(ssc, topics, 
kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})

line=kvds.map(lambda x:(1,2))

line.transform(lambda rdd:rdd.join(doc_info)).pprint(10)

ssc.start()
ssc.awaitTermination()
{code}
 

Error details:
{code:java}
PicklingError: Could not serialize object: Exception: It appears that you are 
attempting to broadcast an RDD or reference an RDD from an action or 
transformation. RDD transformations and actions can only be invoked by the 
driver, not inside of other transformations; for example, rdd1.map(lambda x: 
rdd2.values.count() * x) is invalid because the values transformation and count 
action cannot be performed inside of the rdd1.map transformation. For more 
information, see SPARK-5063.
{code}
The similar code works great in Scala. And if we remove any of
{code:java}
ssc.checkpoint("hdfs:///test") 
{code}
or
{code:java}
line.transform(lambda rdd:rdd.join(doc_info))
{code}
There is no error either.

 

It seems that when checkpointing is enabled, PySpark will serialize the transform 
lambda, and then the RDD used by the lambda, but an RDD cannot be serialized, so 
the error is raised.

  was:
In PySpark Steaming, if checkpoint enabled and there is a transform-join 
operation, the error thrown.
{code:java}
sc=SparkContext(appName='')
sc.setLogLevel("WARN")
ssc=StreamingContext(sc,10)
ssc.checkpoint("hdfs:///test")

kafka_bootstrap_servers=""
topics = ['', '']

doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
kvds=KafkaUtils.createDirectStream(ssc, topics, 
kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})

line=kvds.map(lambda x:(1,2))

line.transform(lambda rdd:rdd.join(doc_info)).pprint(10)

ssc.start()
ssc.awaitTermination()
{code}
 

Error details:

{{PicklingError: Could not serialize object: Exception: It appears that you are 
attempting to broadcast an RDD or reference an RDD from an action or 
transformation. RDD transformations and actions can only be invoked by the 
driver, not inside of other transformations; for example, rdd1.map(lambda x: 
rdd2.values.count() * x) is invalid because the values transformation and count 
action cannot be performed inside of the rdd1.map transformation. For more 
information, see SPARK-5063. }}

The similar code works great in Scala. And if we remove any of
{code:java}
ssc.checkpoint("hdfs:///test") 
{code}
or
{code:java}
line.transform(lambda rdd:rdd.join(doc_info))
{code}
There is no error either.

 

It seems that when checkpoint is enabled, pyspark will serialize transform 
lambda, and then the RDD used by lambda, while RDD cannot be serialize so the 
error prompted.


> PySpark streaming transform RDD join not works when checkpoint enabled
> --
>
> Key: SPARK-27894
> URL: https://issues.apache.org/jira/browse/SPARK-27894
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jeffrey(Xilang) Yan
>Priority: Major
>
> In PySpark Streaming, if checkpointing is enabled and there is a transform-join 
> operation, an error is thrown.
> {code:java}
> sc=SparkContext(appName='')
> sc.setLogLevel("WARN")
> ssc=StreamingContext(sc,10)
> ssc.checkpoint("hdfs:///test")
> kafka_bootstrap_servers=""
> topics = ['', '']
> doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
> kvds=KafkaUtils.createDirectStream(ssc, topics, 
> kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})
> line=kvds.map(lambda x:(1,2))
> line.transform(lambda rdd:rdd.join(doc_info)).pprint(10)
> ssc.start()
> ssc.awaitTermination()
> {code}
>  
> Error details:
> {code:java}
> PicklingError: Could not serialize object: Exception: It appears that you are 
> attempting to broadcast an RDD or reference an RDD from an action or 
> transformation. RDD transformations and actions can only be invoked by the 
> driver, not inside of other transformations; for example, rdd1.map(lambda x: 
> rdd2.values.count() * x) is invalid because the values transformation and 
> count action cannot be performed inside of the rdd1.map transformation. For 
> more information, see SPARK-5063.
> {code}
> The similar code works great in Scala. And if we remove any of
> {code:java}
> ssc.checkpoint("hdfs:///test") 
> {code}
> 

[jira] [Updated] (SPARK-27894) PySpark streaming transform RDD join not works when checkpoint enabled

2019-05-31 Thread Jeffrey(Xilang) Yan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey(Xilang) Yan updated SPARK-27894:

Description: 
In PySpark Streaming, if checkpointing is enabled and there is a transform-join 
operation, an error is thrown.
{code:java}
sc=SparkContext(appName='')
sc.setLogLevel("WARN")
ssc=StreamingContext(sc,10)
ssc.checkpoint("hdfs:///test")

kafka_bootstrap_servers=""
topics = ['', '']

doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
kvds=KafkaUtils.createDirectStream(ssc, topics, 
kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})

line=kvds.map(lambda x:(1,2))

line.transform(lambda rdd:rdd.join(doc_info)).pprint(10)

ssc.start()
ssc.awaitTermination()
{code}
 

Error details:

{{PicklingError: Could not serialize object: Exception: It appears that you are 
attempting to broadcast an RDD or reference an RDD from an action or 
transformation. RDD transformations and actions can only be invoked by the 
driver, not inside of other transformations; for example, rdd1.map(lambda x: 
rdd2.values.count() * x) is invalid because the values transformation and count 
action cannot be performed inside of the rdd1.map transformation. For more 
information, see SPARK-5063. }}

The similar code works great in Scala. And if we remove any of
{code:java}
ssc.checkpoint("hdfs:///test") 
{code}
or
{code:java}
line.transform(lambda rdd:rdd.join(doc_info))
{code}
There is no error either.

 

It seems that when checkpointing is enabled, PySpark will serialize the transform 
lambda, and then the RDD used by the lambda, but an RDD cannot be serialized, so 
the error is raised.

  was:
In PySpark Steaming, if checkpoint enabled and there is a transform-join 
operation, the error thrown.

{{sc=SparkContext(appName='') }}

{{sc.setLogLevel("WARN") }}

{{ssc=StreamingContext(sc,10) }}

{{ssc.checkpoint("hdfs:///test") }}

{{kafka_bootstrap_servers="" }}

{{topics = ['', ''] }}

{{doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))}}

{{ kvds=KafkaUtils.createDirectStream(ssc, topics, 
kafkaParams=\{"metadata.broker.list": kafka_bootstrap_servers}) }}

{{line=kvds.map(lambda x:(1,2)) }}

{{line.transform(lambda rdd:rdd.join(doc_info)).pprint(10) }}

{{ssc.start() }}

{{ssc.awaitTermination() }}

Error details:

{{PicklingError: Could not serialize object: Exception: It appears that you are 
attempting to broadcast an RDD or reference an RDD from an action or 
transformation. RDD transformations and actions can only be invoked by the 
driver, not inside of other transformations; for example, rdd1.map(lambda x: 
rdd2.values.count() * x) is invalid because the values transformation and count 
action cannot be performed inside of the rdd1.map transformation. For more 
information, see SPARK-5063. }}

The similar code works great in Scala. And if we remove any of

{{ssc.checkpoint("hdfs:///test") }}

or

{{line.transform(lambda rdd:rdd.join(doc_info)) }}

There is no error either.

 

It seems that when checkpoint is enabled, pyspark will serialize transform 
lambda, and then the RDD used by lambda, while RDD cannot be serialize so the 
error prompted.


> PySpark streaming transform RDD join not works when checkpoint enabled
> --
>
> Key: SPARK-27894
> URL: https://issues.apache.org/jira/browse/SPARK-27894
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jeffrey(Xilang) Yan
>Priority: Major
>
> In PySpark Streaming, if checkpointing is enabled and there is a transform-join 
> operation, an error is thrown.
> {code:java}
> sc=SparkContext(appName='')
> sc.setLogLevel("WARN")
> ssc=StreamingContext(sc,10)
> ssc.checkpoint("hdfs:///test")
> kafka_bootstrap_servers=""
> topics = ['', '']
> doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))
> kvds=KafkaUtils.createDirectStream(ssc, topics, 
> kafkaParams={"metadata.broker.list": kafka_bootstrap_servers})
> line=kvds.map(lambda x:(1,2))
> line.transform(lambda rdd:rdd.join(doc_info)).pprint(10)
> ssc.start()
> ssc.awaitTermination()
> {code}
>  
> Error details:
> {{PicklingError: Could not serialize object: Exception: It appears that you 
> are attempting to broadcast an RDD or reference an RDD from an action or 
> transformation. RDD transformations and actions can only be invoked by the 
> driver, not inside of other transformations; for example, rdd1.map(lambda x: 
> rdd2.values.count() * x) is invalid because the values transformation and 
> count action cannot be performed inside of the rdd1.map transformation. For 
> more information, see SPARK-5063. }}
> The similar code works great in Scala. And if we remove any of
> {code:java}
> ssc.checkpoint("hdfs:///test") 
> {code}
> or
> 

[jira] [Created] (SPARK-27894) PySpark streaming transform RDD join not works when checkpoint enabled

2019-05-31 Thread Jeffrey(Xilang) Yan (JIRA)
Jeffrey(Xilang) Yan created SPARK-27894:
---

 Summary: PySpark streaming transform RDD join not works when 
checkpoint enabled
 Key: SPARK-27894
 URL: https://issues.apache.org/jira/browse/SPARK-27894
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Jeffrey(Xilang) Yan


In PySpark Streaming, if checkpointing is enabled and there is a transform-join 
operation, an error is thrown.

{{sc=SparkContext(appName='') }}

{{sc.setLogLevel("WARN") }}

{{ssc=StreamingContext(sc,10) }}

{{ssc.checkpoint("hdfs:///test") }}

{{kafka_bootstrap_servers="" }}

{{topics = ['', ''] }}

{{doc_info = sc.parallelize(((1, 2), (4, 5), (7, 8), (10, 11)))}}

{{ kvds=KafkaUtils.createDirectStream(ssc, topics, 
kafkaParams=\{"metadata.broker.list": kafka_bootstrap_servers}) }}

{{line=kvds.map(lambda x:(1,2)) }}

{{line.transform(lambda rdd:rdd.join(doc_info)).pprint(10) }}

{{ssc.start() }}

{{ssc.awaitTermination() }}

Error details:

{{PicklingError: Could not serialize object: Exception: It appears that you are 
attempting to broadcast an RDD or reference an RDD from an action or 
transformation. RDD transformations and actions can only be invoked by the 
driver, not inside of other transformations; for example, rdd1.map(lambda x: 
rdd2.values.count() * x) is invalid because the values transformation and count 
action cannot be performed inside of the rdd1.map transformation. For more 
information, see SPARK-5063. }}

The similar code works great in Scala. And if we remove any of

{{ssc.checkpoint("hdfs:///test") }}

or

{{line.transform(lambda rdd:rdd.join(doc_info)) }}

There is no error either.

 

It seems that when checkpointing is enabled, PySpark will serialize the transform 
lambda, and then the RDD used by the lambda, but an RDD cannot be serialized, so 
the error is raised.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2019-05-06 Thread Jeffrey(Xilang) Yan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833632#comment-16833632
 ] 

Jeffrey(Xilang) Yan commented on SPARK-5594:


There is a bug before 2.2.3/2.3.0.

If you hit "Failed to get broadcast" and the call stack goes through 
MapOutputTracker, then try upgrading your Spark. The bug is caused by the driver 
removing the broadcast but still sending the broadcast id to executors, in the 
method MapOutputTrackerMaster.getSerializedMapOutputStatuses. It has been fixed by 
https://issues.apache.org/jira/browse/SPARK-23243

 

 

> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug, however I am getting the error below 
> when running on a cluster (works locally), and have no idea what is causing 
> it, or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable, all 
> my other spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
> ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> 

[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-05-25 Thread Wei Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490307#comment-16490307
 ] 

Wei Yan commented on SPARK-24374:
-

Thanks [~mengxr] for the initiative and the doc. cc [~leftnoteasy] [~zhz], as 
this may need some support from the YARN side if running in yarn-cluster mode.

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
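
For context, a minimal sketch of how a barrier stage is expressed with the API that was eventually added for this SPIP (assuming an existing SparkContext sc and a Spark version that includes barrier execution mode):

{code:scala}
import org.apache.spark.BarrierTaskContext

// Sketch: every task in the stage starts together and can synchronize at a
// barrier before, for example, launching an MPI-style all-reduce step.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()       // wait until all tasks in the stage reach this point
  iter.map(_ * 2)     // placeholder for the distributed training work
}.collect()
{code}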



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21576) Spark caching difference between 2.0.2 and 2.1.1

2017-07-30 Thread Yan (JIRA)
Yan created SPARK-21576:
---

 Summary: Spark caching difference between 2.0.2 and 2.1.1
 Key: SPARK-21576
 URL: https://issues.apache.org/jira/browse/SPARK-21576
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Yan
Priority: Minor


Hi,

I asked a question on Stack Overflow and was recommended to open a JIRA.

I'm a bit confused by Spark's caching behavior. I want to compute a dependent 
dataset (b), cache it, and unpersist the source dataset (a) - here is my code:


{code:java}
val spark = 
SparkSession.builder().appName("test").master("local[4]").getOrCreate()
import spark.implicits._
val a = spark.createDataset(Seq(("a", 1), ("b", 2), ("c", 3)))
a.createTempView("a")
a.cache
println(s"Is a cached: ${spark.catalog.isCached("a")}")
val b = a.filter(x => x._2 < 3)
b.createTempView("b")
// calling action
b.cache.first
println(s"Is b cached: ${spark.catalog.isCached("b")}")

spark.catalog.uncacheTable("a")
println(s"Is b cached after a was unpersisted: ${spark.catalog.isCached("b")}")
{code}

When using spark 2.0.2 it works as expected:


{code:java}
Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: true
{code}

But on 2.1.1:

{code:java}
Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: false
{code}

In reality I have dataset a (a complex database query) and heavy processing to 
get dataset b from a, but after b is computed and cached, dataset a is not 
needed anymore (I want to free memory).

How can I achieve this in 2.1.1?
Thank you.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3165) DecisionTree does not use sparsity in data

2017-03-21 Thread Facai Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935646#comment-15935646
 ] 

Facai Yan edited comment on SPARK-3165 at 3/22/17 1:57 AM:
---

Do you mean that:
TreePoint.binnedFeatures is Array[Int], which doesn't use sparsity in the data?

So these modifications are needed:
1. Modify TreePoint.binnedFeatures to Vector.
2. Modify the LearningNode.predictImpl method if needed.
3. Modify the methods for bin-wise computation, such as binSeqOp, to 
accelerate the computation.

Please correct me if I misunderstand.

I'd like to work on this if no one else has started it.


was (Author: facai):
Do you mean that:
TreePoint.binnedFeatures is Array[int], which doesn't sparsity in data?

So those modifications is need:
1. modify TreePoint.binnedFeatures to Vector.
2. modify LearningNode.predictImpl method if need.
3. modify the methods about Bin-wise computation, such as binSeqOp, to 
accelerate computation.

Please correct me if misunderstand.

I'd like to work on it.

> DecisionTree does not use sparsity in data
> --
>
> Key: SPARK-3165
> URL: https://issues.apache.org/jira/browse/SPARK-3165
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Improvement: computation
> DecisionTree should take advantage of sparse feature vectors.  Aggregation 
> over training data could handle the empty/zero-valued data elements more 
> efficiently.
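
As a rough illustration only of the sparse representation discussed above (not the actual TreePoint code), binned features could be stored in a sparse vector so that zero bins, the common case for sparse input, take no space:

{code:scala}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical sketch: convert a dense Array[Int] of binned features into a
// sparse vector that only stores the non-zero bins.
val numFeatures = 10
val denseBinned: Array[Int] = Array(0, 0, 3, 0, 0, 0, 7, 0, 0, 1)
val nonZero = denseBinned.zipWithIndex.collect {
  case (bin, i) if bin != 0 => (i, bin.toDouble)
}
val sparseBinned: Vector = Vectors.sparse(numFeatures, nonZero)
println(sparseBinned)  // (10,[2,6,9],[3.0,7.0,1.0])
{code}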



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3165) DecisionTree does not use sparsity in data

2017-03-21 Thread Facai Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935646#comment-15935646
 ] 

Facai Yan commented on SPARK-3165:
--

Do you mean that:
TreePoint.binnedFeatures is Array[Int], which doesn't use sparsity in the data?

So these modifications are needed:
1. Modify TreePoint.binnedFeatures to Vector.
2. Modify the LearningNode.predictImpl method if needed.
3. Modify the methods for bin-wise computation, such as binSeqOp, to 
accelerate the computation.

Please correct me if I misunderstand.

I'd like to work on it.

> DecisionTree does not use sparsity in data
> --
>
> Key: SPARK-3165
> URL: https://issues.apache.org/jira/browse/SPARK-3165
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Improvement: computation
> DecisionTree should take advantage of sparse feature vectors.  Aggregation 
> over training data could handle the empty/zero-valued data elements more 
> efficiently.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-03-09 Thread Ji Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904479#comment-15904479
 ] 

Ji Yan commented on SPARK-19320:


I'm proposing to add a configuration parameter (spark.mesos.gpus) to guarantee a 
hard limit on the number of GPUs. To avoid conflicts, it will override 
spark.mesos.gpus.max whenever spark.mesos.gpus is greater than 0.
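
A sketch of how the proposal could look from the user side (spark.mesos.gpus is the proposed, not-yet-existing setting; spark.mesos.gpus.max is the existing one):

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical sketch: the proposed spark.mesos.gpus (hard guarantee) would
// override spark.mesos.gpus.max (upper bound taken from an offer) when > 0.
val conf = new SparkConf()
  .set("spark.mesos.gpus.max", "4")  // existing: take at most 4 GPUs from an offer
  .set("spark.mesos.gpus", "2")      // proposed: require exactly 2 GPUs
{code}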

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently the only configuration for using GPUs with Mesos is setting the 
> maximum number of GPUs a job will take from an offer, but that doesn't 
> guarantee exactly how many.
> We should have a configuration that sets this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-02-28 Thread Ji Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888998#comment-15888998
 ] 

Ji Yan commented on SPARK-19320:


[~tnachen] in this case, should we rename spark.mesos.gpus.max to something 
else, or should we keep spark.mesos.gpus.max and add a new configuration 
for a hard limit on GPUs?

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently the only configuration for using GPUs with Mesos is setting the 
> maximum number of GPUs a job will take from an offer, but that doesn't 
> guarantee exactly how many.
> We should have a configuration that sets this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-25 Thread Ji Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884487#comment-15884487
 ] 

Ji Yan commented on SPARK-19740:


The problem is that when running Spark on Mesos, there is no way to run the 
Spark executor as a non-root user.

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>
> When running Spark on Mesos with docker containerizer, the spark executors 
> are always launched with 'docker run' command without specifying --user 
> option, which always results in spark executors running as root. Mesos has a 
> way to support arbitrary parameters. Spark could use that to expose setting 
> user
> background on mesos with arbitrary parameters support: 
> https://issues.apache.org/jira/browse/MESOS-1816



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-25 Thread Ji Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884426#comment-15884426
 ] 

Ji Yan commented on SPARK-19740:


proposed change: 
https://github.com/yanji84/spark/commit/4f8368ea727e5689e96794884b8d1baf3eccb5d5

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>
> When running Spark on Mesos with docker containerizer, the spark executors 
> are always launched with 'docker run' command without specifying --user 
> option, which always results in spark executors running as root. Mesos has a 
> way to support arbitrary parameters. Spark could use that to expose setting 
> user
> background on mesos with arbitrary parameters support: 
> https://issues.apache.org/jira/browse/MESOS-1816



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-25 Thread Ji Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Yan updated SPARK-19740:
---
Description: 
When running Spark on Mesos with docker containerizer, the spark executors are 
always launched with 'docker run' command without specifying --user option, 
which always results in spark executors running as root. Mesos has a way to 
support arbitrary parameters. Spark could use that to expose setting user

background on mesos with arbitrary parameters support: 
https://issues.apache.org/jira/browse/MESOS-1816


  was:When running Spark on Mesos with docker containerizer, the spark 
executors are always launched with 'docker run' command without specifying 
--user option, which always results in spark executors running as root. Mesos 
has a way to support arbitrary parameters. Spark could use that to expose 
setting user


> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>
> When running Spark on Mesos with docker containerizer, the spark executors 
> are always launched with 'docker run' command without specifying --user 
> option, which always results in spark executors running as root. Mesos has a 
> way to support arbitrary parameters. Spark could use that to expose setting 
> user
> background on mesos with arbitrary parameters support: 
> https://issues.apache.org/jira/browse/MESOS-1816



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-25 Thread Ji Yan (JIRA)
Ji Yan created SPARK-19740:
--

 Summary: Spark executor always runs as root when running on mesos
 Key: SPARK-19740
 URL: https://issues.apache.org/jira/browse/SPARK-19740
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Ji Yan


When running Spark on Mesos with docker containerizer, the spark executors are 
always launched with 'docker run' command without specifying --user option, 
which always results in spark executors running as root. Mesos has a way to 
support arbitrary parameters. Spark could use that to expose setting user



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15777) Catalog federation

2016-10-20 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593229#comment-15593229
 ] 

Yan commented on SPARK-15777:
-

One approach could be first tagging a subtree as specific to a data source, and 
then only applying the custom rules from that data source to the subtree so 
tagged. There could be other feasible approaches, and it is considered one of 
the details left open for future discussions. Thanks.

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three-level identifier for tables (catalog.database.table). A possible 
> direction is to support arbitrary-level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15777) Catalog federation

2016-10-19 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590055#comment-15590055
 ] 

Yan commented on SPARK-15777:
-

There is a paragraph in the design doc about the ordering of rule application 
as copied below:

-
5) Resolution/execution ordering of analyzer/optimizer/planner rules is as 
follows:
a) For analyzer and planner, external rules precede the built-in rules;
b) For optimizer, the external rules follow the built-in rules.

This is following the existing ordering, but is subject to future revisits;
-

To prevent a custom rule from being too aggressive and modifying the plan 
branches that are generic or specific to other data sources,  guards can be 
added against such intrusive rules. In the future even explicit ordering 
specification might be added for more advanced plugged-in rules.  

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three-level identifier for tables (catalog.database.table). A possible 
> direction is to support arbitrary-level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15777) Catalog federation

2016-10-03 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15543329#comment-15543329
 ] 

Yan commented on SPARK-15777:
-

1) Currently the rules are applied on a per-session basis. Right, ideally they 
should be applied on a per-query basis, and we can modify the design and 
implementation in that direction. Regarding evaluation ordering, item 5) of the 
"Scopes, Limitations and Open Questions" section covers this topic. In short, 
there is an ordering between the built-in rules and the custom rules, but not 
among the custom rules themselves. The plugin mechanism assumes cooperative 
behavior, so the plugged-in rules are expected to be applied only against the 
parts of the plan specific to their data sources, probably after some plan 
rewriting. Once the overall ideas are accepted by the community, we will flesh 
out the design doc and post the implementation in a WIP fashion.
2) As mentioned in the doc, this is not a complete design. Hopefully it can lay 
down some basic concepts and principles so future work can be built on top of 
it. For instance, a persistent catalog could itself be another major feature, 
but it is left out of the scope of this design for now without affecting the 
primary functionality.
3) The three-level table identifier currently serves the namespace purpose. Yes, 
join queries against two tables with the same database and table names but 
different catalog names work well. Arbitrary levels of namespaces are not 
supported yet.

Thanks for your comments.

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three level identifier for tables (catalog.database.table). A possible 
> direction is to support arbitrary level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15777) Catalog federation

2016-10-03 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15541643#comment-15541643
 ] 

Yan commented on SPARK-15777:
-

A design document has just been attached above. We realize that this task 
requires a significant amount of effort, and we would like to follow the 
community's advice on how to proceed with discussions. If Google Docs is the 
preferred channel, we have a version ready over there. Thanks.

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three level identifier for tables (catalog.database.table). A possible 
> direction is to support arbitrary level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15777) Catalog federation

2016-10-03 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-15777:

Attachment: SparkFederationDesign.pdf

> Catalog federation
> --
>
> Key: SPARK-15777
> URL: https://issues.apache.org/jira/browse/SPARK-15777
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
> Attachments: SparkFederationDesign.pdf
>
>
> This is a ticket to track progress to support federating multiple external 
> catalogs. This would require establishing an API (similar to the current 
> ExternalCatalog API) for getting information about external catalogs, and 
> ability to convert a table into a data source table.
> As part of this, we would also need to be able to support more than a 
> two-level table identifier (database.table). At the very least we would need 
> a three level identifier for tables (catalog.database.table). A possible 
> direction is to support arbitrary level hierarchical namespaces similar to 
> file systems.
> Once we have this implemented, we can convert the current Hive catalog 
> implementation into an external catalog that is "mounted" into an internal 
> catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-24 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519497#comment-15519497
 ] 

Yan commented on SPARK-17556:
-

For 2), I think BitTorrent won't help in the case of all-to-all transfers, 
unlike one-to-all transfers such as the driver-to-cluster broadcast, or 
few-to-all transfers. Thanks.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-24 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518598#comment-15518598
 ] 

Yan commented on SPARK-17556:
-

A few comments of mine are as follows:

1) The "one-executor collection" approach differs from driver-side collection 
and broadcasting in that it avoids uploading data from the driver back to the 
cluster. The primary concern with the "one-executor collection" approach, as 
pointed out, is that the sole executor could become a bottleneck, similar in 
large degree to the latency issue of the "driver-side collection" approach (a 
minimal sketch of that baseline follows below);
2) The "all-executor collection" approach is more balanced and scalable, but it 
might suffer from network storms since every worker needs to connect to all the 
others;
3) The real issue is the repeated, and thus wasted, work of collecting pieces of 
the broadcast data by multiple collectors/broadcasters, versus the extended 
latency if the collection/broadcasting is performed once and for all. This is 
actually not very different from the multiple- vs. single-reducer scenario in a 
map/reduce execution: the final output from a single reducer is ready to use, 
while the outputs from multiple reducers require final assembly by the end 
users, particularly if the final result must be organized, e.g., totally 
ordered. But using multiple reducers is more scalable, balanced, and likely 
faster;
4) It's probably good to have a configurable number of executors acting as 
collectors/broadcasters, each of which collects and broadcasts just a portion 
of the broadcast table for the final join execution.
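
For context, a minimal sketch of the driver-side baseline that the approaches 
above try to improve on (the table names are placeholders and the local 
SparkSession is only for illustration): the driver collects the small side and 
re-uploads it to the executors as a broadcast variable.

{code:scala}
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("driver-side-broadcast").getOrCreate()

// The small side funnels through the driver...
val dimRows: Array[Row] = spark.table("dim_table").collect()
// ...and is then shipped back out to every executor.
val bc = spark.sparkContext.broadcast(dimRows)

spark.table("fact_table").rdd.mapPartitions { rows =>
  val dim = bc.value  // each executor reads the same broadcast copy
  rows                // a real map-side hash join would probe `dim` here
}.count()
{code}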

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17375) Star Join Optimization

2016-09-02 Thread Yan (JIRA)
Yan created SPARK-17375:
---

 Summary: Star Join Optimization
 Key: SPARK-17375
 URL: https://issues.apache.org/jira/browse/SPARK-17375
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yan


The star schema is the simplest style of data mart schema and is the approach 
often seen in BI/Decision Support systems. A star join is a popular SQL query 
pattern that joins one (or a few) fact tables with a few dimension tables in a 
star schema. Star join query optimizations aim to improve the performance and 
resource usage of such joins.

Currently, the existing Spark SQL optimization broadcasts the usually small 
(after filtering and projection) dimension tables to avoid costly shuffling of 
the fact table and the "reduce" operations based on the join keys.

This improvement proposal tries to further improve broadcast star joins in two 
areas:
1) avoid materialization of intermediate rows that would otherwise not make it 
into the final result set once further joined with other, more restrictive 
dimensions;
2) avoid the performance variation among different join orders. This could also 
largely be achieved by cost analysis and heuristics that select a reasonably 
optimal join order, but here we try to achieve a similar improvement without 
relying on such information.

A preliminary test against a small TPC-DS 1GB data set indicates a 5%-40% 
improvement (with codegen disabled in both tests) over the multiple broadcast 
joins on one query (Q27) that inner-joins 4 dimension tables with one fact 
table. The large variation (5%-40%) is due to the different join orderings of 
the 4 broadcast joins. Tests using larger data sets and other TPC-DS queries are 
yet to be performed.
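
For concreteness, the shape of the star joins in question (TPC-DS-like; the 
table and column names are illustrative, and a SparkSession named spark with 
these tables registered is assumed): one fact table inner-joined with several 
small, filtered dimension tables, each of which Spark broadcasts.

{code:scala}
val result = spark.sql("""
  SELECT d.d_year, i.i_category, SUM(ss.ss_net_paid) AS revenue
  FROM store_sales ss
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk   -- small dimension, broadcast
    JOIN item     i ON ss.ss_item_sk      = i.i_item_sk   -- small dimension, broadcast
    JOIN store    s ON ss.ss_store_sk     = s.s_store_sk  -- small dimension, broadcast
  WHERE d.d_year = 2000 AND s.s_state = 'CA'
  GROUP BY d.d_year, i.i_category
""")
result.explain()  // shows the chain of broadcast joins whose ordering this proposal targets
{code}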



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS

2016-05-02 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267811#comment-15267811
 ] 

Yan commented on SPARK-14521:
-

Yes, we are serializing the fields that should not be serialized. Please check 
and review https://github.com/apache/spark/pull/12598.
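
For background only, a generic sketch (not the actual change in that PR): a 
field that should not travel with the object can be marked @transient, which 
both Java serialization and Kryo's FieldSerializer skip by default, and then be 
recomputed lazily after deserialization.

{code:scala}
// `scratch` is excluded from serialization and rebuilt on first use after
// deserialization; only `id` is written out.
class ExpensiveHolder(val id: Int) extends Serializable {
  @transient lazy val scratch: Array[Byte] = new Array[Byte](1024 * 1024)
}
{code}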

> StackOverflowError in Kryo when executing TPC-DS
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Critical
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet:TPC-DS at 200 GB scale in Parq format stored in hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14507) Decide if we should still support CREATE EXTERNAL TABLE AS SELECT

2016-04-13 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240616#comment-15240616
 ] 

Yan commented on SPARK-14507:
-

In terms of Hive support vs. Spark SQL support, the "external table" concept in 
Spark SQL seems to go beyond that in Hive, not just for CTAS. For Hive, an 
"external table" is only for the "schema-on-read" scenario on data living on, 
say, HDFS. It has its own somewhat unique DDL semantics and security features, 
different from a normal SQL database's. A Spark SQL external table, as far as I 
understand, could be a mapping to a data source table. I'm not sure whether this 
mapping would need special considerations regarding DDL semantics and security 
models the way Hive external tables do.

> Decide if we should still support CREATE EXTERNAL TABLE AS SELECT
> -
>
> Key: SPARK-14507
> URL: https://issues.apache.org/jira/browse/SPARK-14507
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Look like we support CREATE EXTERNAL TABLE AS SELECT by accident. Should we 
> still support it? Seems Hive does not support it. Based on the doc Impala, 
> seems Impala supports it. Right now, seems the rule of CreateTables in 
> HiveMetastoreCatalog.scala does not respect EXTERNAL keyword when 
> {{hive.convertCTAS}} is true and the CTAS query does not provide any storage 
> format. For this case, the table will become a MANAGED_TABLE and stored in 
> the default metastore location (not the user specified location). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11368) Spark shouldn't scan all partitions when using Python UDF and filter over partitioned column is given

2016-04-08 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233231#comment-15233231
 ] 

Yan commented on SPARK-11368:
-

The issue seems to be gone with the latest master code (for 2.0):

sqlCtx.sql('select count(*) from df where id >= 990 and multiply2(value) > 
20').explain(True):

WholeStageCodegen
:  +- TungstenAggregate(key=[], 
functions=[(count(1),mode=Final,isDistinct=false)], output=[count(1)#14L])
: +- INPUT
+- Exchange SinglePartition, None
   +- WholeStageCodegen
  :  +- TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#17L])
  : +- Project
  :+- Project [value#0L,id#1]
  :   +- Filter (cast(pythonUDF0#18 as double) > 20.0)
  :  +- INPUT
  +- !BatchPythonEvaluation [multiply2(value#0L)], 
[value#0L,id#1,pythonUDF0#18]
 +- WholeStageCodegen
:  +- BatchedScan HadoopFiles[value#0L,id#1] Format: ParquetFormat, 
PushedFilters: [], ReadSchema: struct

while 1.6 still had the issue as reported.

> Spark shouldn't scan all partitions when using Python UDF and filter over 
> partitioned column is given
> -
>
> Key: SPARK-11368
> URL: https://issues.apache.org/jira/browse/SPARK-11368
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I think this is a huge performance bug.
> I created a parquet file partitioned by a column.
> Then I ran a query with a filter over the partition column and a filter with a UDF.
> The result is that all partitions are scanned.
> Sample data:
> {code}
> rdd = sc.parallelize(range(0,1000)).map(lambda x: 
> Row(id=x%1000,value=x)).repartition(1)
> df = sqlCtx.createDataFrame(rdd)
> df.write.mode("overwrite").partitionBy('id').parquet('/mnt/mfs/udf_test')
> df = sqlCtx.read.parquet('/mnt/mfs/udf_test')
> df.registerTempTable('df')
> {code}
> Then queries:
> Without udf - Spark reads only 10 partitions:
> {code}
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and value > 10').count()
> print(time.time() - start)
> 0.9993703365325928
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#22L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#25L])
>Project
> Filter (value#5L > 10)
>  Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L]
> {code}
> With udf Spark reads all the partitions:
> {code}
> sqlCtx.registerFunction('multiply2', lambda x: x *2 )
> start = time.time()
> sqlCtx.sql('select * from df where id >= 990 and multiply2(value) > 
> 20').count()
> print(time.time() - start)
> 13.0826096534729
> == Physical Plan ==
> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[count#34L])
>  TungstenExchange SinglePartition
>   TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[currentCount#37L])
>TungstenProject
> Filter ((id#6 > 990) && (cast(pythonUDF#33 as double) > 20.0))
>  !BatchPythonEvaluation PythonUDF#multiply2(value#5L), 
> [value#5L,id#6,pythonUDF#33]
>   Scan ParquetRelation[file:/mnt/mfs/udf_test][value#5L,id#6]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-07 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231655#comment-15231655
 ] 

Yan commented on SPARK-14389:
-

Actually, the current master branch does not have the issue, while 1.6.0 does. 
There appear to have been improvements to the BroadcastNestedLoop (BNL) join 
since 1.6, SPARK-13213 in particular.

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: jps_command_results.txt, lineitem.tbl, plans.txt, 
> sample_script.py, stdout.txt
>
>
> When executing attached sample_script.py in client mode with a single 
> executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", 
> during the self join of a small table, TPC-H lineitem generated for a 1M 
> dataset. Also see execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-04-07 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231013#comment-15231013
 ] 

Yan commented on SPARK-14389:
-

[~Steve Johnston]

I guess the memory eater is probably not the broadcast but the persist() call on 
the cartesian-joined dataframe. I tried on a single node using the significant 
configurations but did not get an OOM. Can you run "jps -lvm | grep Submit", 
which will show the JVM flags used at runtime?

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0
> Hadoop: Amazon 2.7.1
> Spark 1.6.0
> Ganglia 3.7.2
> Master: m3.xlarge
> Core: m3.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
>Reporter: Steve Johnston
> Attachments: lineitem.tbl, plans.txt, sample_script.py, stdout.txt
>
>
> When executing attached sample_script.py in client mode with a single 
> executor an exception occurs, "java.lang.OutOfMemoryError: Java heap space", 
> during the self join of a small table, TPC-H lineitem generated for a 1M 
> dataset. Also see execution log stdout.txt attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-02-02 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128829#comment-15128829
 ] 

Yan commented on SPARK-12988:
-

My thinking is that projections should parse the column names, while the 
schema-based ops should keep the names as-is. One thing I'm not sure about is 
"Column": given its current capabilities, it seems to be meant for projections, 
so its name should be backticked if it contains a '.'. But please correct me if 
I'm wrong here.
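
A small illustration of the distinction (the local SparkSession is just for the 
example; what drop should accept for dotted names is exactly the open question 
of this ticket, so only select is shown):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dotted-columns").getOrCreate()
import spark.implicits._

val df = Seq((1, 2)).toDF("a_b", "a.c")

// Projection-style access parses the name, so a dotted top-level column needs backticks.
df.select("`a.c`").show()

// Without backticks this would be parsed as field `c` of a (non-existent) struct column `a`:
// df.select("a.c")   // fails analysis here
{code}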

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12988) Can't drop columns that contain dots

2016-02-02 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127883#comment-15127883
 ] 

Yan commented on SPARK-12988:
-

[~marmbrus] For the same reason that "`a.c` is an invalid column name; toDF(...) 
should not accept that", can we require that df.drop not take backticks either, 
since df.drop can only drop top-level columns? Programmatically it makes little 
difference, but it seems more consistent semantically.

> Can't drop columns that contain dots
> 
>
> Key: SPARK-12988
> URL: https://issues.apache.org/jira/browse/SPARK-12988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> Neither of these works:
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a.c").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("`a.c`").collect()
> df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
> {code}
> Given that you can't use drop to drop subfields, it seems to me that we 
> should treat the column name literally (i.e. as though it is wrapped in back 
> ticks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12686) Support group-by push down into data sources

2016-01-07 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088343#comment-15088343
 ] 

Yan commented on SPARK-12686:
-

SPARK-12449 seems to be a superset of this Jira.

> Support group-by push down into data sources
> 
>
> Key: SPARK-12686
> URL: https://issues.apache.org/jira/browse/SPARK-12686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> As for logical plan nodes like 'Aggregate -> Project -> (Filter) -> Scan', we 
> can push down partial aggregation processing into data sources that could 
> aggregate their own data efficiently because Orc/Parquet could fetch the 
> MIN/MAX value by using statistics data and some databases have efficient 
> aggregation implementations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2016-01-07 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088334#comment-15088334
 ] 

Yan commented on SPARK-12449:
-

Stephan, thanks for your explanations and questions. My answers are as follows:

1) This is actually one point of having a "physical plan pruning interface" as 
part of the DataSource interface. From just a logical plan, it would probably be 
hard to take advantage of the data distribution info that Spark SQL is actually 
capable of using. Another advantage of a pluggable physical plan pruner is the 
flexibility of making use of data sources' various capabilities, including 
partial aggregation, some types of predicate/expression evaluation, ..., etc.

We felt the pain of lacking such a "physical plan pruner" while developing the 
Astro project 
(http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase), which 
forced us to use a separate SQLContext to incorporate many advanced planning 
optimizations for HBase.

In fact, the current data source API already supports predicate pruning in the 
*physical* plan, in a limited way, through the "unhandledFilters" method (a 
small sketch follows at the end of this comment).

2) I don't think the current Spark SQL planner has that capability, and the 
"plan later" mechanism serves a different purpose.

3) Right, this just adds a bit more detail to 2). The idea is the same: physical 
plan pruning.

The point seems to be: the question of logical vs. physical plan pruning is 
really a question of which capabilities of a data source are to be valued here, 
physical or logical. My take is physical, given Spark's powerful 
dataset/dataframe/SQL capabilities. In fact, the "isMultiplePartitionExecution" 
field of the proposed "CatalystSource" interface, if true, signifies a 
willingness to leave some *physical* operations, such as shuffling, to the Spark 
engine. It might make more sense for a SQL federation engine to do logical plan 
pruning, but Spark is much more capable than a federation engine, I guess.

Admittedly, the stability and complexity of such an interface will be a big 
issue, as pointed out by Reynold. I'll just keep my eyes open for any 
progress/ideas/topics in this field.
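
A rough sketch of what I mean by the existing, limited form of physical-side 
pruning (the relation type and column names are placeholders; only the 
unhandledFilters contract is the point here):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Placeholder relation: pretends it can only evaluate EqualTo("key", _) at the source.
class KeyValueRelation(@transient val sqlContext: SQLContext) extends BaseRelation
    with PrunedFilteredScan {

  override def schema: StructType =
    StructType(StructField("key", StringType) :: StructField("value", IntegerType) :: Nil)

  // Report back the filters the source cannot evaluate; Spark re-applies only these.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo("key", _) => true   // handled by the source
      case _                 => false  // everything else is left to Spark
    }

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // A real implementation would push EqualTo("key", _) into the source scan here.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}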

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2016-01-07 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087983#comment-15087983
 ] 

Yan commented on SPARK-12449:
-

Stephan,

By "partial op" I mean, for instance, partial map-side aggregation. There is 
also a Jira (SPARK-12686) that seems to echo this scenario.
SPARK-10978 deals with pushed-down predicates that used to get evaluated twice, 
which is not related to the discussion of logical vs. physical plan pushdown.

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2015-12-23 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070088#comment-15070088
 ] 

Yan commented on SPARK-12449:
-

To push down map-side (partial) operations, the physical plan would need to be 
examined, I guess.

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2015-12-23 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070135#comment-15070135
 ] 

Yan commented on SPARK-12449:
-

Conceivably, if only the logical plan is used, the Spark SQL execution would 
probably redo the already-pushed-down partial op, which would then amount to a 
no-op. It might work, just not very efficiently.

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2015-12-22 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068447#comment-15068447
 ] 

Yan commented on SPARK-12449:
-

A few thoughts on the capabilities of this "CatalystSource" interface (a 
hypothetical sketch follows below):

1) provide data source partition info given a filtering predicate. 
Holistic/partitioned execution could also be (partially) controlled by this 
output, and it would make partition pruning pluggable;
2) have an interface to transform a portion of a physical plan into a 
"pushed-down" plan plus a "left-over" plan executed inside Spark. Spark planning 
may need to carve out that portion from the original plan so that it covers only 
a single data source, and leave the execution over data from across different 
data sources to Spark. This leaves the decision of what portion of the plan can 
be pushed down in the hands of the data sources. In particular, pushdown of 
either a whole SQL query or just the map-side executions could be supported;
3) when carving out the portion of the plan, Spark planning can start from the 
SCAN and move "downstream", and may (optionally?) want to stop, on any branch, 
at an intermediate DataFrame that is to be cached or persisted, so as to honor 
Spark's execution at no extra cost.
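
Purely to make the discussion concrete, a hypothetical shape for such an 
interface; the trait and method names below are illustrative only and not an 
actual Spark API:

{code:scala}
import org.apache.spark.Partition
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.SparkPlan

// Opaque handle for whatever fragment the data source agreed to execute itself.
trait PushedPlan

trait PhysicalPushdownSource {
  // 1) pluggable partition pruning: which source partitions survive a filter predicate
  def prunePartitions(predicate: Expression): Seq[Partition]

  // 2)/3) split a carved-out physical plan fragment into a pushed-down part and a
  //       left-over part that Spark continues to execute
  def pushDown(fragment: SparkPlan): (PushedPlan, SparkPlan)
}
{code}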

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> to push down filters and projects pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7825) Poor performance in Cross Product due to no combine operations for small files.

2015-09-09 Thread Tang Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tang Yan updated SPARK-7825:

Affects Version/s: (was: 1.3.1)
   (was: 1.2.2)
   (was: 1.2.1)
   (was: 1.3.0)
   (was: 1.2.0)

> Poor performance in Cross Product due to no combine operations for small 
> files.
> ---
>
> Key: SPARK-7825
> URL: https://issues.apache.org/jira/browse/SPARK-7825
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Tang Yan
>
> When dealing with a Cross Product, if one table has many small files, Spark SQL 
> has to handle so many tasks that performance suffers, while Hive 
> has a CombineHiveInputFormat which can combine small files to decrease the 
> task number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7825) Poor performance in Cross Product due to no combine operations for small files.

2015-05-22 Thread Tang Yan (JIRA)
Tang Yan created SPARK-7825:
---

 Summary: Poor performance in Cross Product due to no combine 
operations for small files.
 Key: SPARK-7825
 URL: https://issues.apache.org/jira/browse/SPARK-7825
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1, 1.3.0, 1.2.2, 1.2.1, 1.2.0
Reporter: Tang Yan


When dealing with a Cross Product, if one table has many small files, Spark SQL 
has to handle so many tasks that performance suffers, while Hive has a 
CombineHiveInputFormat which can combine small files to decrease the task 
number.
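
For illustration, a user-level mitigation in the spirit of 
CombineHiveInputFormat (the table names are placeholders, a SparkSession named 
spark is assumed, and crossJoin is the API in more recent Spark versions): 
collapse the many small-file partitions on one side before the cartesian 
product, since the cartesian task count is roughly the product of the two sides' 
partition counts.

{code:scala}
// Merge the many small input splits into a handful of partitions before crossing.
val manySmallFiles = spark.table("small_files_table").coalesce(8)
val other          = spark.table("other_table")

val crossed = other.crossJoin(manySmallFiles)  // far fewer cartesian tasks than before
{code}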




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7270) StringType dynamic partition cast to DecimalType in Spark Sql Hive

2015-04-29 Thread Feixiang Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feixiang Yan updated SPARK-7270:

Description: 
Create a Hive table with two partitions, where the first partition column type 
is bigint and the second is string. When insert-overwriting the table with one 
static partition and one dynamic partition, the second (StringType) dynamic 
partition will be cast to DecimalType.
{noformat}
desc test; 
OK
a   string  None
b   bigint  None
c   string  None
 
# Partition Information  
# col_name  data_type   comment 
 
b   bigint  None
c   string  None·
{noformat}

When running the following Hive SQL in HiveContext:
{noformat}sqlContext.sql("insert overwrite table test partition (b=1,c) select 
'a','c' from ptest"){noformat}

the resulting partition is
{noformat}test[1,__HIVE_DEFAULT_PARTITION__]{noformat}

spark log
{noformat}15/04/30 10:38:09 WARN HiveConf: DEPRECATED: 
hive.metastore.ds.retry.* no longer has any effect.  Use 
hive.hmshandler.retry.* instead
15/04/30 10:38:09 INFO ParseDriver: Parsing command: insert overwrite table 
test partition (b=1,c) select 'a','c' from ptest
15/04/30 10:38:09 INFO ParseDriver: Parse Completed
15/04/30 10:38:09 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
longer has any effect.  Use hive.hmshandler.retry.* instead
15/04/30 10:38:10 INFO HiveMetaStore: 0: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/04/30 10:38:10 INFO ObjectStore: ObjectStore, initialize called
15/04/30 10:38:10 INFO Persistence: Property datanucleus.cache.level2 unknown - 
will be ignored
15/04/30 10:38:10 INFO Persistence: Property 
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/04/30 10:38:10 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/04/30 10:38:10 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/04/30 10:38:11 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
longer has any effect.  Use hive.hmshandler.retry.* instead
15/04/30 10:38:11 INFO ObjectStore: Setting MetaStore object pin classes with 
hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/04/30 10:38:11 INFO Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
15/04/30 10:38:11 INFO Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
15/04/30 10:38:12 INFO Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
15/04/30 10:38:12 INFO Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
15/04/30 10:38:12 INFO Query: Reading in results for query 
org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is 
closing
15/04/30 10:38:12 INFO ObjectStore: Initialized ObjectStore
15/04/30 10:38:12 INFO HiveMetaStore: Added admin role in metastore
15/04/30 10:38:12 INFO HiveMetaStore: Added public role in metastore
15/04/30 10:38:12 INFO HiveMetaStore: No user is added in admin role, since 
config is empty
15/04/30 10:38:12 INFO SessionState: No Tez session required at this point. 
hive.execution.engine=mr.
15/04/30 10:38:13 INFO HiveMetaStore: 0: get_table : db=default tbl=test
15/04/30 10:38:13 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=test 
15/04/30 10:38:13 INFO HiveMetaStore: 0: get_partitions : db=default tbl=test
15/04/30 10:38:13 INFO audit: ugi=root  ip=unknown-ip-addr  
cmd=get_partitions : db=default tbl=test
15/04/30 10:38:13 INFO HiveMetaStore: 0: get_table : db=default tbl=ptest
15/04/30 10:38:13 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=ptest
15/04/30 10:38:13 INFO HiveMetaStore: 0: get_partitions : db=default tbl=ptest
15/04/30 10:38:13 INFO audit: ugi=root  ip=unknown-ip-addr  
cmd=get_partitions : db=default tbl=ptest   
15/04/30 10:38:13 INFO deprecation: mapred.map.tasks is deprecated. Instead, 
use mapreduce.job.maps
15/04/30 10:38:13 INFO MemoryStore: ensureFreeSpace(451930) called with 
curMem=0, maxMem=2291041566
15/04/30 10:38:13 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 441.3 KB, free 2.1 GB)
15/04/30 10:38:13 INFO MemoryStore: ensureFreeSpace(71321) 

[jira] [Created] (SPARK-7270) StringType dynamic partition cast to DecimalType in Spark Sql Hive

2015-04-29 Thread Feixiang Yan (JIRA)
Feixiang Yan created SPARK-7270:
---

 Summary: StringType dynamic partition cast to DecimalType in Spark 
Sql Hive 
 Key: SPARK-7270
 URL: https://issues.apache.org/jira/browse/SPARK-7270
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Feixiang Yan


Create a Hive table with two partitions, where the first partition column type 
is bigint and the second is string. When insert-overwriting the table with one 
static partition and one dynamic partition, the second (StringType) dynamic 
partition will be cast to DecimalType.
{noformat}
desc test; 
OK
a   string  None
b   bigint  None
c   string  None
 
# Partition Information  
# col_name  data_type   comment 
 
b   bigint  None
c   string  None·
{noformat}

When running the following Hive SQL in HiveContext:
{noformat}sqlContext.sql("insert overwrite table test partition (b=1,c) select 
'a','c' from ptest"){noformat}

the resulting partition is
{noformat}test[1,__HIVE_DEFAULT_PARTITION__]{noformat}

spark log
{noformat}15/04/30 10:38:09 WARN HiveConf: DEPRECATED: 
hive.metastore.ds.retry.* no longer has any effect.  Use 
hive.hmshandler.retry.* instead
15/04/30 10:38:09 INFO ParseDriver: Parsing command: insert overwrite table 
test partition (b=1,c) select 'a','c' from ptest
15/04/30 10:38:09 INFO ParseDriver: Parse Completed
15/04/30 10:38:09 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
longer has any effect.  Use hive.hmshandler.retry.* instead
15/04/30 10:38:10 INFO HiveMetaStore: 0: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/04/30 10:38:10 INFO ObjectStore: ObjectStore, initialize called
15/04/30 10:38:10 INFO Persistence: Property datanucleus.cache.level2 unknown - 
will be ignored
15/04/30 10:38:10 INFO Persistence: Property 
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/04/30 10:38:10 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/04/30 10:38:10 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/04/30 10:38:11 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
longer has any effect.  Use hive.hmshandler.retry.* instead
15/04/30 10:38:11 INFO ObjectStore: Setting MetaStore object pin classes with 
hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/04/30 10:38:11 INFO Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
15/04/30 10:38:11 INFO Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
15/04/30 10:38:12 INFO Datastore: The class 
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
embedded-only so does not have its own datastore table.
15/04/30 10:38:12 INFO Datastore: The class 
org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so 
does not have its own datastore table.
15/04/30 10:38:12 INFO Query: Reading in results for query 
org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is 
closing
15/04/30 10:38:12 INFO ObjectStore: Initialized ObjectStore
15/04/30 10:38:12 INFO HiveMetaStore: Added admin role in metastore
15/04/30 10:38:12 INFO HiveMetaStore: Added public role in metastore
15/04/30 10:38:12 INFO HiveMetaStore: No user is added in admin role, since 
config is empty
15/04/30 10:38:12 INFO SessionState: No Tez session required at this point. 
hive.execution.engine=mr.
15/04/30 10:38:13 INFO HiveMetaStore: 0: get_table : db=default tbl=test
15/04/30 10:38:13 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=test 
15/04/30 10:38:13 INFO HiveMetaStore: 0: get_partitions : db=default tbl=test
15/04/30 10:38:13 INFO audit: ugi=root  ip=unknown-ip-addr  
cmd=get_partitions : db=default tbl=test
15/04/30 10:38:13 INFO HiveMetaStore: 0: get_table : db=default tbl=ptest
15/04/30 10:38:13 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=ptest
15/04/30 10:38:13 INFO HiveMetaStore: 0: get_partitions : db=default tbl=ptest
15/04/30 10:38:13 INFO audit: ugi=root  ip=unknown-ip-addr  
cmd=get_partitions : db=default tbl=ptest   
15/04/30 10:38:13 INFO deprecation: mapred.map.tasks is deprecated. Instead, 
use mapreduce.job.maps
15/04/30 10:38:13 INFO MemoryStore: ensureFreeSpace(451930) called with 
curMem=0, maxMem=2291041566
15/04/30 

[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors

2015-03-24 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14378073#comment-14378073
 ] 

Yan commented on SPARK-3306:


If by "global singleton object" you mean it would live in the Executor class, it 
would have to be supported by the Executor. Besides, one application may need to 
use multiple external resources. If you mean it would be supplied by the 
application, my understanding is, correct me if I am wrong, that an application 
can currently only submit tasks to an executor along with some static resources, 
like jar files, that can be shared between different tasks.

The need here is to have a hook so an app can specify the connection behavior, 
while the executors use the hook, if any, to 
initialize/cache/fetch-from-cache/terminate/show the external resources.

In summary, there will be a pool. The question is whether an application, which 
is very task-oriented except for static external resource usage like jars, can 
have the capability to manage the lifecycle of cross-task external resources.

We have an initial implementation at 
https://github.com/Huawei-Spark/spark/tree/SPARK-3306. Please feel free to take 
a look and share your advice. Note that this is not a complete implementation, 
but for experimental purposes it works at least for JDBC connections.
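
For illustration, a common workaround sketch today (not the executor-managed 
mechanism proposed here): a JVM-wide singleton on each executor caches 
connections across tasks; the JDBC URL below is a placeholder.

{code:scala}
import java.sql.{Connection, DriverManager}
import scala.collection.mutable

// Each executor JVM holds one instance; tasks on that executor reuse cached connections.
object ExecutorConnectionPool {
  private val pool = mutable.Map.empty[String, Connection]

  def get(url: String): Connection = synchronized {
    pool.getOrElseUpdate(url, DriverManager.getConnection(url))
  }
}

// Usage inside a job: every task on the same executor reuses the cached connection.
// rdd.foreachPartition { rows =>
//   val conn = ExecutorConnectionPool.get("jdbc:postgresql://host/db")  // placeholder URL
//   rows.foreach { r => /* use conn */ }
// }
{code}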

 Addition of external resource dependency in executors
 -

 Key: SPARK-3306
 URL: https://issues.apache.org/jira/browse/SPARK-3306
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Yan

 Currently, Spark executors only support static and read-only external 
 resources of side files and jar files. With emerging disparate data sources, 
 there is a need to support more versatile external resources, such as 
 connections to data sources, to facilitate efficient data accesses to the 
 sources. For one, the JDBCRDD, with some modifications,  could benefit from 
 this feature by reusing established JDBC connections from the same Spark 
 context before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors

2015-03-24 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377414#comment-14377414
 ] 

Yan commented on SPARK-3306:


The external resource will primarily serve the purpose of reusing such a 
resource, for example a DB connection, across different tasks on the same 
executor, to minimize the per-task reconnection latency. It differs from the 
existing static resources, like jar files or other files, in that the handles or 
identifiers have to be kept in memory and the executor process has to provide an 
access mechanism to its tasks. The current static resources pose no problem 
because they are identified by disk locations and the tasks have no difficulty 
accessing them from disk.

All of this is dynamic in nature and much more complex than jars/files, so the 
executors, I feel, would need to be modified/enhanced.

I have not found as much time for this as promised, due to other Spark SQL work. 
Hopefully I can give more concrete details for discussion soon.

 Addition of external resource dependency in executors
 -

 Key: SPARK-3306
 URL: https://issues.apache.org/jira/browse/SPARK-3306
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Yan

 Currently, Spark executors only support static and read-only external 
 resources of side files and jar files. With emerging disparate data sources, 
 there is a need to support more versatile external resources, such as 
 connections to data sources, to facilitate efficient data accesses to the 
 sources. For one, the JDBCRDD, with some modifications,  could benefit from 
 this feature by reusing established JDBC connections from the same Spark 
 context before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5614) Predicate pushdown through Generate

2015-02-04 Thread Lu Yan (JIRA)
Lu Yan created SPARK-5614:
-

 Summary: Predicate pushdown through Generate
 Key: SPARK-5614
 URL: https://issues.apache.org/jira/browse/SPARK-5614
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Lu Yan


Currently, Catalyst's rules cannot push predicates through Generate nodes. 
Furthermore, partition pruning in HiveTableScan cannot be applied to queries 
that involve Generate. This makes such queries very inefficient.
For example, the physical plan for the query
{quote}
select len, bk
from s_server lateral view explode(len_arr) len_table as len 
where len > 5 and day = '20150102';
{quote}
where 'day' is a partition column in the metastore, is like this in the 
current version of Spark SQL:
{quote}
Project [len, bk]

Filter ((len > 5) && (day = 20150102))

Generate explode(len_arr), true, false

HiveTableScan [bk, len_arr, day], (MetastoreRelation pblog.audio, audio_central_ts_server, None), None
{quote}
But theoretically the plan should be like this
{quote}
Project [len, bk]

Filter (len > 5)

Generate explode(len_arr), true, false

HiveTableScan [bk, len_arr, day], (MetastoreRelation pblog.audio, audio_central_ts_server, None), Some(day = 20150102)
{quote} 
where partition pruning predicates can be pushed down to the HiveTableScan node.
I've developed a solution for this issue. If you do not have a plan for this 
already, I could merge the solution back to master.
There is also a problem with column pruning for Generate; I will file another 
issue about that.
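
For comparison, a minimal sketch of the manual rewrite that achieves the 
pruning today, assuming a HiveContext named hiveContext and the table/column 
names from the example above: pushing the partition predicate into a subquery 
below the lateral view lets HiveTableScan prune partitions, while the 
predicate on the generated column stays above the explode.

{code:scala}
// Hedged workaround sketch, not the proposed optimizer rule: push the
// partition predicate below the lateral view by hand so partition pruning can
// apply at scan time. hiveContext and the table/column names are assumptions
// taken from the example above.
val rewritten = hiveContext.sql("""
  SELECT len, bk
  FROM (
    SELECT bk, len_arr
    FROM s_server
    WHERE day = '20150102'   -- partition predicate: prunable at scan time
  ) pruned
  LATERAL VIEW explode(len_arr) len_table AS len
  WHERE len > 5              -- predicate on the generated column stays above
""")
rewritten.collect().foreach(println)
{code}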



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2015-01-26 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
Attachment: SparkSQLOnHBase_v2.0.docx

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan
Assignee: Yan
 Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.0.docx






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2015-01-26 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
Attachment: (was: SparkSQLOnHBase_v2.docx)

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan
Assignee: Yan
 Attachments: HBaseOnSpark.docx






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2015-01-16 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
Attachment: SparkSQLOnHBase_v2.docx

Version 2

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan
Assignee: Yan
 Attachments: HBaseOnSpark.docx, SparkSQLOnHBase_v2.docx






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors

2015-01-04 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263914#comment-14263914
 ] 

Yan commented on SPARK-3306:


Yes, close/cleanup will be supported, plus a show/list capability. Expect a 
design doc in a few weeks. 
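
For discussion purposes only, a rough sketch of what such a registry could 
look like; every name below is hypothetical and none of it exists in Spark 
today.

{code:scala}
import scala.collection.mutable

// Hypothetical sketch only; none of these types exist in Spark. It shows the
// shape of an executor-side registry with the lookup, show/list and
// close/cleanup capabilities mentioned above.
trait ExternalResource {
  def close(): Unit
}

class ExternalResourceRegistry {
  private val resources = mutable.Map[String, ExternalResource]()

  // Reuse an existing resource registered under `key`, or create it lazily.
  def getOrCreate(key: String)(create: => ExternalResource): ExternalResource =
    synchronized { resources.getOrElseUpdate(key, create) }

  // The "show/list" capability: which resources the executor currently holds.
  def list(): Seq[String] = synchronized { resources.keys.toSeq }

  // The "close/cleanup" capability: release everything, e.g. on executor shutdown.
  def closeAll(): Unit = synchronized {
    resources.values.foreach(_.close())
    resources.clear()
  }
}
{code}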

 Addition of external resource dependency in executors
 -

 Key: SPARK-3306
 URL: https://issues.apache.org/jira/browse/SPARK-3306
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Yan

 Currently, Spark executors only support static, read-only external resources 
 such as side files and jar files. With emerging disparate data sources, there 
 is a need to support more versatile external resources, such as connections 
 to data sources, to facilitate efficient data access to those sources. For 
 one, the JDBCRDD, with some modifications, could benefit from this feature by 
 reusing JDBC connections previously established in the same Spark context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3880) HBase as data source to SparkSQL

2014-10-10 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167006#comment-14167006
 ] 

Yan commented on SPARK-3880:


The new context is intended to be very lightweight. We noticed that SparkSQL 
is a very active project and there have been talks/Jiras about SQLContext and 
data sources. As mentioned in the design, we are aware of the PR and of the 
need for a universal mechanism to access different types of data stores, and 
we will keep a close watch on the latest movements and fit our efforts to 
those features and interfaces when they are ready and reasonably stable. In 
the meantime, the design is intended to be heavy on the HBase-specific data 
model, data access mechanisms, and query optimizations, and to keep the 
integration part lightweight so it can be easily adjusted to future changes.

The point is that we need to find some compromise between a rapidly changing 
project and the need for a more or less stable context on which to base a new 
feature. Chasing a constantly moving target is never easy, I guess.

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan
Assignee: Yan
 Attachments: HBaseOnSpark.docx






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3880) HBase as data source to SparkSQL

2014-10-09 Thread Yan (JIRA)
Yan created SPARK-3880:
--

 Summary: HBase as data source to SparkSQL
 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
Reporter: Yan
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2014-10-09 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
  Component/s: SQL
Fix Version/s: (was: 1.3.0)

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3880) HBase as data source to SparkSQL

2014-10-09 Thread Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan updated SPARK-3880:
---
Attachment: HBaseOnSpark.docx

Design Document

 HBase as data source to SparkSQL
 

 Key: SPARK-3880
 URL: https://issues.apache.org/jira/browse/SPARK-3880
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Yan
 Attachments: HBaseOnSpark.docx






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3306) Addition of external resource dependency in executors

2014-08-29 Thread Yan (JIRA)
Yan created SPARK-3306:
--

 Summary: Addition of external resource dependency in executors
 Key: SPARK-3306
 URL: https://issues.apache.org/jira/browse/SPARK-3306
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Yan


Currently, Spark executors only support static, read-only external resources 
such as side files and jar files. With emerging disparate data sources, there 
is a need to support more versatile external resources, such as connections to 
data sources, to facilitate efficient data access to those sources. For one, 
the JDBCRDD, with some modifications, could benefit from this feature by 
reusing JDBC connections previously established in the same Spark context.
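
As an illustration of the status quo referred to above, a minimal JdbcRDD 
usage sketch (sc is an existing SparkContext; the JDBC URL and query are 
placeholders): the getConnection closure is evaluated on the executors once 
per partition, so each partition re-establishes its own connection, which is 
exactly the cost a reusable executor-side resource would avoid.

{code:scala}
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Sketch of today's behaviour with placeholder URL/query: the getConnection
// closure is invoked on the executors for each partition, so every partition
// re-establishes its own JDBC connection.
val people = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:h2:mem:testdb"),
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",  // two '?' bounds required
  1, 1000, 4,                                               // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))
people.collect().foreach(println)
{code}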



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org