[jira] [Created] (SPARK-21920) DataFrame Fail To Find The Column Name
abhijit nag created SPARK-21920: --- Summary: DataFrame Fail To Find The Column Name Key: SPARK-21920 URL: https://issues.apache.org/jira/browse/SPARK-21920 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 1.6.0 Reporter: abhijit nag Priority: Critical I am getting an issue like "sql.AnalysisException: cannot resolve column_name". I wrote a simple query as below: [DataFrame df = df1.join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT"))).select(df2.col("MERCH_ID"), df1.col("MERCHANT"));] Exception found: resolved attribute(s) MERCH_ID#738 missing from MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project [MERCH_ID#738,MERCHANT#737]; The problem is solved by the following code: DataFrame df = df1.alias("df1").join(df2.alias("df2"), functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT"))).select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT")); This kind of issue appears rarely, but I want to know the root cause of this problem. Is it a bug in Spark 1.6 or something else? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
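For illustration, the alias-based workaround described in the report can be written as the following minimal Scala sketch. The DataFrames df1/df2 and the column names are taken from the report; the snippet itself is illustrative, not the reporter's exact code.
{code:scala}
import org.apache.spark.sql.functions.col

// df1 and df2 are assumed to share lineage (df2 derived from df1), which is what
// makes unqualified column references ambiguous after the self-join.
val joined = df1.alias("a")
  .join(df2.alias("b"), col("a.MERCHANT") === col("b.MERCHANT"))
  // Qualifying columns by alias tells the analyzer which side of the join each
  // attribute resolves against, avoiding the "resolved attribute(s) ... missing" error.
  .select(col("b.MERCH_ID"), col("b.MERCHANT"))
{code}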
[jira] [Updated] (SPARK-21918) HiveClient shouldn't share Hive object between different thread
[ https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Liu, updated SPARK-21918: Description: I'm testing the Spark thrift server and found that all the DDL statements are run by user hive even if hive.server2.enable.doAs=true. The root cause is that the Hive object is shared between different threads in HiveClientImpl: {code:java} private def client: Hive = { if (clientLoader.cachedHive != null) { clientLoader.cachedHive.asInstanceOf[Hive] } else { val c = Hive.get(conf) clientLoader.cachedHive = c c } } {code} But in impersonation mode, we should only share the Hive object inside a thread so that the metastore client in Hive can be associated with the right user. We can fix this by passing the Hive object of the parent thread to the child thread when running SQL. I already have an initial patch for review and I'm glad to work on it if anyone could assign it to me. was: I'm testing the Spark thrift server and found that all the DDL statements are run by user hive even if hive.server2.enable.doAs=true. The root cause is that the Hive object is shared between different threads in HiveClientImpl: {code:java} private def client: Hive = { if (clientLoader.cachedHive != null) { clientLoader.cachedHive.asInstanceOf[Hive] } else { val c = Hive.get(conf) clientLoader.cachedHive = c c } } {code} But in impersonation mode, we should only share the Hive object inside a thread. We can fix this by passing the Hive object of the current thread to the new thread when running SQL. I already have an initial patch for review and I'm glad to work on it if anyone could assign it to me. > HiveClient shouldn't share Hive object between different thread > --- > > Key: SPARK-21918 > URL: https://issues.apache.org/jira/browse/SPARK-21918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hu Liu, > > I'm testing the Spark thrift server and found that all the DDL statements are > run by user hive even if hive.server2.enable.doAs=true. > The root cause is that the Hive object is shared between different threads in > HiveClientImpl: > {code:java} > private def client: Hive = { > if (clientLoader.cachedHive != null) { > clientLoader.cachedHive.asInstanceOf[Hive] > } else { > val c = Hive.get(conf) > clientLoader.cachedHive = c > c > } > } > {code} > But in impersonation mode, we should only share the Hive object inside a > thread so that the metastore client in Hive can be associated with the right > user. > We can fix this by passing the Hive object of the parent thread to the child > thread when running SQL. > I already have an initial patch for review and I'm glad to work on it if > anyone could assign it to me. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm
Ashish Chopra created SPARK-21919: - Summary: inconsistent behavior of AFTsurvivalRegression algorithm Key: SPARK-21919 URL: https://issues.apache.org/jira/browse/SPARK-21919 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.2.0 Environment: Spark Version: 2.2.0 Cluster setup: Standalone single node Python version: 3.5.2 Reporter: Ashish Chopra I took the example directly from the Spark ML documentation. {code} training = spark.createDataFrame([ (1.218, 1.0, Vectors.dense(1.560, -0.605)), (2.949, 0.0, Vectors.dense(0.346, 2.158)), (3.627, 0.0, Vectors.dense(1.380, 0.231)), (0.273, 1.0, Vectors.dense(0.520, 1.151)), (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"]) quantileProbabilities = [0.3, 0.6] aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities, quantilesCol="quantiles") #aft = AFTSurvivalRegression() model = aft.fit(training) # Print the coefficients, intercept and scale parameter for AFT survival regression print("Coefficients: " + str(model.coefficients)) print("Intercept: " + str(model.intercept)) print("Scale: " + str(model.scale)) model.transform(training).show(truncate=False) {code} The result is: Coefficients: [-0.496304411053,0.198452172529] Intercept: 2.6380898963056327 Scale: 1.5472363533632303 ||label||censor||features ||prediction || quantiles || |1.218|1.0 |[1.56,-0.605] |5.718985621018951 | [1.160322990805951,4.99546058340675]| |2.949|0.0 |[0.346,2.158] |18.07678210850554 |[3.66759199449632,15.789837303662042]| |3.627|0.0 |[1.38,0.231] |7.381908879359964 |[1.4977129086101573,6.4480027195054905]| |0.273|1.0 |[0.52,1.151] |13.577717814884505|[2.754778414791513,11.859962351993202]| |4.199|0.0 |[0.795,-0.226]|9.013087597344805 |[1.828662187733188,7.8728164067854856]| But if we change all the label values to label + 20, as in: {code} training = spark.createDataFrame([ (21.218, 1.0, Vectors.dense(1.560, -0.605)), (22.949, 0.0, Vectors.dense(0.346, 2.158)), (23.627, 0.0, Vectors.dense(1.380, 0.231)), (20.273, 1.0, Vectors.dense(0.520, 1.151)), (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"]) quantileProbabilities = [0.3, 0.6] aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities, quantilesCol="quantiles") #aft = AFTSurvivalRegression() model = aft.fit(training) # Print the coefficients, intercept and scale parameter for AFT survival regression print("Coefficients: " + str(model.coefficients)) print("Intercept: " + str(model.intercept)) print("Scale: " + str(model.scale)) model.transform(training).show(truncate=False) {code} the result changes to: Coefficients: [23.9932020748,3.18105314757] Intercept: 7.35052273751137 Scale: 7698609960.724161 ||label ||censor||features ||prediction ||quantiles|| |21.218|1.0 |[1.56,-0.605] |4.0912442688237169E18|[0.0,0.0]| |22.949|0.0 |[0.346,2.158] |6.011158613411288E9 |[0.0,0.0]| |23.627|0.0 |[1.38,0.231] |7.7835948690311181E17|[0.0,0.0]| |20.273|1.0 |[0.52,1.151] |1.5880852723124176E10|[0.0,0.0]| |24.199|0.0 |[0.795,-0.226]|1.4590190884193677E11|[0.0,0.0]| Can someone please explain this exponential blow-up in the prediction? As per my understanding, the prediction in AFT is a prediction of the time when the failure event will occur; I am not able to understand why it changes exponentially with the value of the label. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
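A note on the blow-up (my own illustration, not part of the report): AFT survival regression is a log-linear model, so the prediction column is exp(intercept + coefficients . features). Once the fitted intercept and coefficients become large, the prediction grows exponentially. The first row of each run above can be reproduced with plain Scala arithmetic:
{code:scala}
// Log-linear AFT prediction: exp(intercept + coefficients . features).
// First run (labels 1.218 ... 4.199), first row with features [1.56, -0.605]:
val pred1 = math.exp(2.6380898963056327 + (-0.496304411053 * 1.56) + (0.198452172529 * -0.605))
// pred1 ~= 5.719, matching the reported 5.718985621018951

// Second run (labels shifted by +20), same row with the new fit:
val pred2 = math.exp(7.35052273751137 + (23.9932020748 * 1.56) + (3.18105314757 * -0.605))
// pred2 ~= 4.09e18 -- the much larger fitted coefficients are exponentiated,
// which is why the predictions (and the degenerate [0.0, 0.0] quantiles) explode.
{code}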
[jira] [Created] (SPARK-21918) HiveClient shouldn't share Hive object between different thread
Hu Liu, created SPARK-21918: --- Summary: HiveClient shouldn't share Hive object between different thread Key: SPARK-21918 URL: https://issues.apache.org/jira/browse/SPARK-21918 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Hu Liu, I'm testing the Spark thrift server and found that all the DDL statements are run by user hive even if hive.server2.enable.doAs=true. The root cause is that the Hive object is shared between different threads in HiveClientImpl: {code:java} private def client: Hive = { if (clientLoader.cachedHive != null) { clientLoader.cachedHive.asInstanceOf[Hive] } else { val c = Hive.get(conf) clientLoader.cachedHive = c c } } {code} But in impersonation mode, we should only share the Hive object inside a thread. We can fix this by passing the Hive object of the current thread to the new thread when running SQL. I already have an initial patch for review and I'm glad to work on it if anyone could assign it to me. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
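The per-thread sharing described above could look roughly like the sketch below. This is only an outline of the caching pattern, not the actual patch; `conf` is the HiveConf already held by HiveClientImpl, and the field name is made up.
{code:scala}
import org.apache.hadoop.hive.ql.metadata.Hive

// Illustrative per-thread cache: each thread creates and reuses its own Hive
// instance, so the metastore client it wraps stays bound to the user who owns
// that thread when doAs/impersonation is enabled.
private val threadLocalHive = new ThreadLocal[Hive] {
  override def initialValue(): Hive = Hive.get(conf)
}

private def client: Hive = threadLocalHive.get()
{code}
Child threads spawned while running a statement would still need the parent's handle handed over explicitly (or an InheritableThreadLocal), which is the part the proposed patch addresses.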
[jira] [Commented] (SPARK-21917) Remote http(s) resources is not supported in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153141#comment-16153141 ] Saisai Shao commented on SPARK-21917: - I'm inclined to choose option 1; the only overhead is resource re-uploading, the fix is restricted to SparkSubmit, and all other code would work transparently. What's your opinion [~tgraves] [~vanzin]? > Remote http(s) resources is not supported in YARN mode > -- > > Key: SPARK-21917 > URL: https://issues.apache.org/jira/browse/SPARK-21917 > Project: Spark > Issue Type: Bug > Components: Spark Submit, YARN >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > In the current Spark, when submitting an application on YARN with remote > resources {{./bin/spark-shell --jars > http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar > --master yarn-client -v}}, Spark will fail with: > {noformat} > java.io.IOException: No FileSystem for scheme: http > at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) > at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354) > at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478) > at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600) > at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599) > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74) > at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599) > at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598) > at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848) > at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173) > {noformat} > This is because {{YARN#client}} assumes resources must be on a Hadoop-compatible > FS, and also in the NM > (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245) > it will only use a Hadoop-compatible FS to download resources. So this makes > Spark on YARN fail to support remote http(s) resources. > To solve this problem, there might be several options: > * Download remote http(s) resources to local storage and add the downloaded > resources to the dist cache. The downside of this option is that remote resources > will be uploaded again unnecessarily. > * Filter remote http(s) resources and add them with spark.jars or > spark.files, to leverage Spark's internal fileserver to distribute remote > http(s) resources. The problem with this solution is that some resources which > need to be available before the application starts may not work. > * Leverage Hadoop's support for an http(s) file system > (https://issues.apache.org/jira/browse/HADOOP-14383). This only works in > Hadoop 2.9+, and I think even if we implement a similar one in Spark it would > not work. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21917) Remote http(s) resources is not supported in YARN mode
Saisai Shao created SPARK-21917: --- Summary: Remote http(s) resources is not supported in YARN mode Key: SPARK-21917 URL: https://issues.apache.org/jira/browse/SPARK-21917 Project: Spark Issue Type: Bug Components: Spark Submit, YARN Affects Versions: 2.2.0 Reporter: Saisai Shao Priority: Minor In the current Spark, when submitting an application on YARN with remote resources {{./bin/spark-shell --jars http://central.maven.org/maven2/com/github/swagger-akka-http/swagger-akka-http_2.11/0.10.1/swagger-akka-http_2.11-0.10.1.jar --master yarn-client -v}}, Spark will fail with: {noformat} java.io.IOException: No FileSystem for scheme: http at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:354) at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:600) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$6.apply(Client.scala:599) at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:848) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:173) {noformat} This is because {{YARN#client}} assumes resources must be on a Hadoop-compatible FS, and also in the NM (https://github.com/apache/hadoop/blob/99e558b13ba4d5832aea97374e1d07b4e78e5e39/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java#L245) it will only use a Hadoop-compatible FS to download resources. So this makes Spark on YARN fail to support remote http(s) resources. To solve this problem, there might be several options: * Download remote http(s) resources to local storage and add the downloaded resources to the dist cache. The downside of this option is that remote resources will be uploaded again unnecessarily. * Filter remote http(s) resources and add them with spark.jars or spark.files, to leverage Spark's internal fileserver to distribute remote http(s) resources. The problem with this solution is that some resources which need to be available before the application starts may not work. * Leverage Hadoop's support for an http(s) file system (https://issues.apache.org/jira/browse/HADOOP-14383). This only works in Hadoop 2.9+, and I think even if we implement a similar one in Spark it would not work. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
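A rough sketch of what option 1 could look like (names are invented for illustration; a real change would live in SparkSubmit and reuse its existing download utilities):
{code:scala}
import java.io.File
import java.net.URL
import java.nio.file.{Files, StandardCopyOption}

// Fetch an http(s) resource into a local directory so that the YARN client can
// upload it to a Hadoop-compatible FS like any other local jar; the cost is that
// the resource is transferred twice (downloaded here, then uploaded to the dist cache).
def downloadToLocal(uri: String, targetDir: File): String = {
  val target = new File(targetDir, uri.split("/").last)
  val in = new URL(uri).openStream()
  try {
    Files.copy(in, target.toPath, StandardCopyOption.REPLACE_EXISTING)
  } finally {
    in.close()
  }
  target.toURI.toString
}
{code}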
[jira] [Assigned] (SPARK-21916) Set isolationOn=true when create client to remote hive metastore
[ https://issues.apache.org/jira/browse/SPARK-21916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21916: Assignee: (was: Apache Spark) > Set isolationOn=true when create client to remote hive metastore > > > Key: SPARK-21916 > URL: https://issues.apache.org/jira/browse/SPARK-21916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > In current code, we set {{isolationOn=!isCliSessionState()}} when create hive > client for metadata. However conf of {{CliSessionState}} points to local > dummy > metastore(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L416). > Using {{CliSessionState}}, we fail to get metadata from remote hive > metastore. We can always set {{isolationOn=true}} when create hive client for > metadata -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21916) Set isolationOn=true when create client to remote hive metastore
[ https://issues.apache.org/jira/browse/SPARK-21916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153126#comment-16153126 ] Apache Spark commented on SPARK-21916: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/19127 > Set isolationOn=true when create client to remote hive metastore > > > Key: SPARK-21916 > URL: https://issues.apache.org/jira/browse/SPARK-21916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing > > In current code, we set {{isolationOn=!isCliSessionState()}} when create hive > client for metadata. However conf of {{CliSessionState}} points to local > dummy > metastore(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L416). > Using {{CliSessionState}}, we fail to get metadata from remote hive > metastore. We can always set {{isolationOn=true}} when create hive client for > metadata -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21916) Set isolationOn=true when create client to remote hive metastore
[ https://issues.apache.org/jira/browse/SPARK-21916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21916: Assignee: Apache Spark > Set isolationOn=true when create client to remote hive metastore > > > Key: SPARK-21916 > URL: https://issues.apache.org/jira/browse/SPARK-21916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Assignee: Apache Spark > > In current code, we set {{isolationOn=!isCliSessionState()}} when create hive > client for metadata. However conf of {{CliSessionState}} points to local > dummy > metastore(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L416). > Using {{CliSessionState}}, we fail to get metadata from remote hive > metastore. We can always set {{isolationOn=true}} when create hive client for > metadata -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21916) Set isolationOn=true when create client to remote hive metastore
jin xing created SPARK-21916: Summary: Set isolationOn=true when create client to remote hive metastore Key: SPARK-21916 URL: https://issues.apache.org/jira/browse/SPARK-21916 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: jin xing In the current code, we set {{isolationOn=!isCliSessionState()}} when creating the Hive client for metadata. However, the conf of {{CliSessionState}} points to a local dummy metastore (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L416). Using {{CliSessionState}}, we fail to get metadata from the remote Hive metastore. We can always set {{isolationOn=true}} when creating the Hive client for metadata. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
[ https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153084#comment-16153084 ] Apache Spark commented on SPARK-21915: -- User 'marktab' has created a pull request for this issue: https://github.com/apache/spark/pull/19126 > Model 1 and Model 2 ParamMaps Missing > - > > Key: SPARK-21915 > URL: https://issues.apache.org/jira/browse/SPARK-21915 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, > 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Mark Tabladillo >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The original Scala code says > println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) > The parent is lr > There is no method for accessing parent as is done in Scala. > > This code has been tested in Python, and returns values consistent with Scala > Proposing to call the lr variable instead of model1 or model2 > > This patch was tested with Spark 2.1.0 comparing the Scala and PySpark > results. Pyspark returns nothing at present for those two print lines. > The output for model2 in PySpark should be > {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the > convergence tolerance for iterative algorithms (>= 0).'): 1e-06, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, > 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 > penalty.'): 0.0, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', > doc='prediction column name.'): 'prediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', > doc='features column name.'): 'features', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', > doc='label column name.'): 'label', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='probabilityCol', doc='Column name for predicted class conditional > probabilities. Note: Not all models output well-calibrated probability > estimates! These probabilities should be treated as confidences, not precise > probabilities.'): 'myProbability', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column > name.'): 'rawPrediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', > doc='The name of family which is a description of the label distribution to > be used in the model. Supported options: auto, binomial, multinomial'): > 'auto', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', > doc='whether to fit an intercept term.'): True, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', > doc='Threshold in binary classification prediction, in range [0, 1]. If > threshold and thresholds are both set, they must match.e.g. 
if threshold is > p, then thresholds must be equal to [1-p, p].'): 0.55, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', > doc='max number of iterations (>= 0).'): 30, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', > doc='regularization parameter (>= 0).'): 0.1, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='standardization', doc='whether to standardize the training features > before fitting the model.'): True} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
[ https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21915: Assignee: Apache Spark > Model 1 and Model 2 ParamMaps Missing > - > > Key: SPARK-21915 > URL: https://issues.apache.org/jira/browse/SPARK-21915 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, > 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Mark Tabladillo >Assignee: Apache Spark >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The original Scala code says > println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) > The parent is lr > There is no method for accessing parent as is done in Scala. > > This code has been tested in Python, and returns values consistent with Scala > Proposing to call the lr variable instead of model1 or model2 > > This patch was tested with Spark 2.1.0 comparing the Scala and PySpark > results. Pyspark returns nothing at present for those two print lines. > The output for model2 in PySpark should be > {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the > convergence tolerance for iterative algorithms (>= 0).'): 1e-06, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, > 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 > penalty.'): 0.0, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', > doc='prediction column name.'): 'prediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', > doc='features column name.'): 'features', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', > doc='label column name.'): 'label', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='probabilityCol', doc='Column name for predicted class conditional > probabilities. Note: Not all models output well-calibrated probability > estimates! These probabilities should be treated as confidences, not precise > probabilities.'): 'myProbability', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column > name.'): 'rawPrediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', > doc='The name of family which is a description of the label distribution to > be used in the model. Supported options: auto, binomial, multinomial'): > 'auto', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', > doc='whether to fit an intercept term.'): True, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', > doc='Threshold in binary classification prediction, in range [0, 1]. If > threshold and thresholds are both set, they must match.e.g. 
if threshold is > p, then thresholds must be equal to [1-p, p].'): 0.55, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', > doc='max number of iterations (>= 0).'): 30, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', > doc='regularization parameter (>= 0).'): 0.1, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='standardization', doc='whether to standardize the training features > before fitting the model.'): True} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
[ https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Tabladillo updated SPARK-21915: Description: Error in PySpark example code [https://github.com/apache/spark/blob/master/examples/src/main/python/ml/estimator_transformer_param_example.py] The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) The parent is lr There is no method for accessing parent as is done in Scala. This code has been tested in Python, and returns values consistent with Scala Proposing to call the lr variable instead of model1 or model2 This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines. The output for model2 in PySpark should be {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06, Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability', Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55, Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30, Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1, Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True} was: The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) The parent is lr There is no method for accessing parent as is done in Scala. 
This code has been tested in Python, and returns values consistent with Scala Proposing to call the lr variable instead of model1 or model2 This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines. The output for model2 in PySpark should be {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06, Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability', Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
[jira] [Assigned] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
[ https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21915: Assignee: (was: Apache Spark) > Model 1 and Model 2 ParamMaps Missing > - > > Key: SPARK-21915 > URL: https://issues.apache.org/jira/browse/SPARK-21915 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, > 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Mark Tabladillo >Priority: Minor > Labels: easyfix > Original Estimate: 1h > Remaining Estimate: 1h > > The original Scala code says > println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) > The parent is lr > There is no method for accessing parent as is done in Scala. > > This code has been tested in Python, and returns values consistent with Scala > Proposing to call the lr variable instead of model1 or model2 > > This patch was tested with Spark 2.1.0 comparing the Scala and PySpark > results. Pyspark returns nothing at present for those two print lines. > The output for model2 in PySpark should be > {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the > convergence tolerance for iterative algorithms (>= 0).'): 1e-06, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, > 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 > penalty.'): 0.0, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', > doc='prediction column name.'): 'prediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', > doc='features column name.'): 'features', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', > doc='label column name.'): 'label', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='probabilityCol', doc='Column name for predicted class conditional > probabilities. Note: Not all models output well-calibrated probability > estimates! These probabilities should be treated as confidences, not precise > probabilities.'): 'myProbability', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column > name.'): 'rawPrediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', > doc='The name of family which is a description of the label distribution to > be used in the model. Supported options: auto, binomial, multinomial'): > 'auto', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', > doc='whether to fit an intercept term.'): True, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', > doc='Threshold in binary classification prediction, in range [0, 1]. If > threshold and thresholds are both set, they must match.e.g. 
if threshold is > p, then thresholds must be equal to [1-p, p].'): 0.55, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', > doc='max number of iterations (>= 0).'): 30, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', > doc='regularization parameter (>= 0).'): 0.1, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='standardization', doc='whether to standardize the training features > before fitting the model.'): True} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
Mark Tabladillo created SPARK-21915: --- Summary: Model 1 and Model 2 ParamMaps Missing Key: SPARK-21915 URL: https://issues.apache.org/jira/browse/SPARK-21915 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0 Reporter: Mark Tabladillo Priority: Minor The original Scala code says println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) The parent is lr There is no method for accessing parent as is done in Scala. This code has been tested in Python, and returns values consistent with Scala Proposing to call the lr variable instead of model1 or model2 This patch was tested with Spark 2.1.0 comparing the Scala and PySpark results. Pyspark returns nothing at present for those two print lines. The output for model2 in PySpark should be {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06, Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features', Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label', Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability', Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction', Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto', Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55, Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30, Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1, Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
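For comparison, the Scala side of the same example relies on `parent` of a fitted Model returning the Estimator that produced it. The sketch below is paraphrased from the documented estimator/transformer/param example rather than copied verbatim, and `training` stands for that example's input DataFrame:
{code:scala}
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)
// `model1.parent` is the LogisticRegression estimator `lr`, so extractParamMap
// prints the parameters the fit actually used -- the behaviour the PySpark
// example is meant to mirror by printing lr's params instead.
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)
{code}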
[jira] [Commented] (SPARK-19126) Join Documentation Improvements
[ https://issues.apache.org/jira/browse/SPARK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153078#comment-16153078 ] Apache Spark commented on SPARK-19126: -- User 'marktab' has created a pull request for this issue: https://github.com/apache/spark/pull/19126 > Join Documentation Improvements > --- > > Key: SPARK-19126 > URL: https://issues.apache.org/jira/browse/SPARK-19126 > Project: Spark > Issue Type: Improvement >Reporter: Bill Chambers >Assignee: Bill Chambers >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > - Some join types are missing (no mention of anti join) > - Joins are labelled inconsistently both within each language and between > languages. > - Update according to new join spec for `crossJoin` > Pull request coming... -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21914) Running examples as tests in SQL builtin function documentation
[ https://issues.apache.org/jira/browse/SPARK-21914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153011#comment-16153011 ] Hyukjin Kwon commented on SPARK-21914: -- [~rxin], would you mind if I ask whether you like this idea (running examples in SQL doc as tests) ? > Running examples as tests in SQL builtin function documentation > --- > > Key: SPARK-21914 > URL: https://issues.apache.org/jira/browse/SPARK-21914 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon > > It looks we have added many examples in {{ExpressionDescription}} for builtin > functions. > Actually, if I have seen correctly, we have fixed many examples so far in > some minor PRs and sometimes require to add the examples as tests sql and > golden files. > As we have formatted examples in {{ExpressionDescription.examples}} - > https://github.com/apache/spark/blob/ba327ee54c32b11107793604895bd38559804858/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionDescription.java#L44-L50, > and we have `SQLQueryTestSuite`, I think we could run the examples as tests > like Python's doctests. > Rough way I am thinking: > 1. Loads the example in {{ExpressionDescription}}. > 2. identify queries by {{>}}. > 3. identify the rest of them as the results. > 4. run the examples by reusing {{SQLQueryTestSuite}} if possible. > 5. compare the output by reusing {{SQLQueryTestSuite}} if possible. > Advantages of doing this I could think for now: > - Reduce the number of PRs to fix the examples > - De-duplicate the test cases that should be added into sql and golden files. > - Correct documentation with correct examples. > - Reduce reviewing costs for documentation fix PRs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21914) Running examples as tests in SQL builtin function documentation
Hyukjin Kwon created SPARK-21914: Summary: Running examples as tests in SQL builtin function documentation Key: SPARK-21914 URL: https://issues.apache.org/jira/browse/SPARK-21914 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.3.0 Reporter: Hyukjin Kwon It looks like we have added many examples in {{ExpressionDescription}} for builtin functions. Actually, if I have seen correctly, we have fixed many examples so far in some minor PRs, and sometimes we need to add the examples as tests to sql and golden files. As we have formatted examples in {{ExpressionDescription.examples}} - https://github.com/apache/spark/blob/ba327ee54c32b11107793604895bd38559804858/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionDescription.java#L44-L50, and we have `SQLQueryTestSuite`, I think we could run the examples as tests like Python's doctests. A rough way I am thinking of: 1. Load the examples in {{ExpressionDescription}}. 2. Identify queries by {{>}}. 3. Identify the rest of them as the results. 4. Run the examples by reusing {{SQLQueryTestSuite}} if possible. 5. Compare the output by reusing {{SQLQueryTestSuite}} if possible. Advantages of doing this that I can think of for now: - Reduce the number of PRs to fix the examples. - De-duplicate the test cases that should be added into sql and golden files. - Correct documentation with correct examples. - Reduce reviewing costs for documentation fix PRs. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
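A rough sketch of steps 2 and 3 above (splitting an {{ExpressionDescription.examples}} string into query/result pairs); purely illustrative, and the real integration would reuse {{SQLQueryTestSuite}}'s existing comparison machinery:
{code:scala}
// Queries in the documented example format start with "> "; the following
// non-query lines are treated as the expected output of that query.
def splitExamples(examples: String): Seq[(String, String)] = {
  val result = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
  var currentQuery: Option[String] = None
  val currentOutput = new StringBuilder

  def flush(): Unit = {
    currentQuery.foreach(q => result += q -> currentOutput.toString.trim)
    currentQuery = None
    currentOutput.clear()
  }

  examples.split("\n").map(_.trim).filter(_.nonEmpty).foreach {
    case line if line.startsWith(">") =>
      flush()
      currentQuery = Some(line.stripPrefix(">").trim)
    case line =>
      currentOutput.append(line).append('\n')
  }
  flush()
  result
}
{code}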
[jira] [Commented] (SPARK-21913) `withDatabase` should drop database with CASCADE
[ https://issues.apache.org/jira/browse/SPARK-21913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152977#comment-16152977 ] Apache Spark commented on SPARK-21913: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19125 > `withDatabase` should drop database with CASCADE > > > Key: SPARK-21913 > URL: https://issues.apache.org/jira/browse/SPARK-21913 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, it fails if the database is not empty. It would be great if we > drop cleanly with CASCADE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21913) `withDatabase` should drop database with CASCADE
[ https://issues.apache.org/jira/browse/SPARK-21913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21913: Assignee: (was: Apache Spark) > `withDatabase` should drop database with CASCADE > > > Key: SPARK-21913 > URL: https://issues.apache.org/jira/browse/SPARK-21913 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, it fails if the database is not empty. It would be great if we > drop cleanly with CASCADE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21913) `withDatabase` should drop database with CASCADE
[ https://issues.apache.org/jira/browse/SPARK-21913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21913: Assignee: Apache Spark > `withDatabase` should drop database with CASCADE > > > Key: SPARK-21913 > URL: https://issues.apache.org/jira/browse/SPARK-21913 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Currently, it fails if the database is not empty. It would be great if we > drop cleanly with CASCADE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21913) `withDatabase` should drop database with CASCADE
Dongjoon Hyun created SPARK-21913: - Summary: `withDatabase` should drop database with CASCADE Key: SPARK-21913 URL: https://issues.apache.org/jira/browse/SPARK-21913 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.2.0 Reporter: Dongjoon Hyun Priority: Minor Currently, it fails if the database is not empty. It would be great if we dropped it cleanly with CASCADE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
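A minimal sketch of what the test helper could do (method shape assumed for illustration, not the actual SQLTestUtils code; `spark` is the test's SparkSession):
{code:scala}
// Dropping with CASCADE makes cleanup succeed even when a test has left
// tables behind in the database.
def withDatabase(dbNames: String*)(f: => Unit): Unit = {
  try f finally {
    dbNames.foreach(name => spark.sql(s"DROP DATABASE IF EXISTS $name CASCADE"))
  }
}
{code}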
[jira] [Comment Edited] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152973#comment-16152973 ] Felix Cheung edited comment on SPARK-21727 at 9/4/17 11:08 PM: --- precisely. as far as I can tell, everything should "just work" if we return "array" from `getSerdeType()` for this case when length > 1. was (Author: felixcheung): precisely. as far as I can tell, everything should "just work" if we return `array` from `getSerdeType()` for this case when length > 1. > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152973#comment-16152973 ] Felix Cheung commented on SPARK-21727: -- precisely. as far as I can tell, everything should "just work" if we return `array` from `getSerdeType()` for this case when length > 1. > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21905) ClassCastException when call sqlContext.sql on temp table
[ https://issues.apache.org/jira/browse/SPARK-21905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152955#comment-16152955 ] Marco Gaido commented on SPARK-21905: - This is likely to be caused by a bug in the Magellan package. It expects to receive an InternalRow to deserialize but in this case it doesn't happen. So it should be fixed there. > ClassCastException when call sqlContext.sql on temp table > - > > Key: SPARK-21905 > URL: https://issues.apache.org/jira/browse/SPARK-21905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: bluejoe > > {code:java} > val schema = StructType(List( > StructField("name", DataTypes.StringType, true), > StructField("location", new PointUDT, true))) > val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), > 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); > val dataFrame = sqlContext.createDataFrame(rowRdd, schema) > dataFrame.createOrReplaceTempView("person"); > sqlContext.sql("SELECT * FROM person").foreach(println(_)); > {code} > the last statement throws exception: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) > ... 18 more > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-21418. --- Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.3.0 2.2.1 > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Assignee: Sean Owen >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.O
[jira] [Assigned] (SPARK-21912) Creating ORC datasource table should check invalid column names
[ https://issues.apache.org/jira/browse/SPARK-21912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21912: Assignee: Apache Spark > Creating ORC datasource table should check invalid column names > --- > > Key: SPARK-21912 > URL: https://issues.apache.org/jira/browse/SPARK-21912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > Currently, users meet job abortions while creating ORC data source tables > with invalid column names. We had better prevent this by raising > AnalysisException like Paquet data source tables. > {code} > scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") > 17/09/04 13:28:21 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: Error: : expected at the position 8 of > 'struct' but ' ' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360) > ... > 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete > file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0 > 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. > 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > org.apache.spark.SparkException: Task failed while writing rows. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21912) Creating ORC datasource table should check invalid column names
[ https://issues.apache.org/jira/browse/SPARK-21912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21912: Assignee: (was: Apache Spark) > Creating ORC datasource table should check invalid column names > --- > > Key: SPARK-21912 > URL: https://issues.apache.org/jira/browse/SPARK-21912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > Currently, users meet job abortions while creating ORC data source tables > with invalid column names. We had better prevent this by raising > AnalysisException like Paquet data source tables. > {code} > scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") > 17/09/04 13:28:21 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: Error: : expected at the position 8 of > 'struct' but ' ' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360) > ... > 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete > file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0 > 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. > 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > org.apache.spark.SparkException: Task failed while writing rows. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21912) Creating ORC datasource table should check invalid column names
[ https://issues.apache.org/jira/browse/SPARK-21912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152931#comment-16152931 ] Apache Spark commented on SPARK-21912: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19124 > Creating ORC datasource table should check invalid column names > --- > > Key: SPARK-21912 > URL: https://issues.apache.org/jira/browse/SPARK-21912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun > > Currently, users meet job abortions while creating ORC data source tables > with invalid column names. We had better prevent this by raising > AnalysisException like Paquet data source tables. > {code} > scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") > 17/09/04 13:28:21 ERROR Utils: Aborting task > java.lang.IllegalArgumentException: Error: : expected at the position 8 of > 'struct' but ' ' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360) > ... > 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete > file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0 > 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. > 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > org.apache.spark.SparkException: Task failed while writing rows. > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21912) Creating ORC datasource table should check invalid column names
Dongjoon Hyun created SPARK-21912: - Summary: Creating ORC datasource table should check invalid column names Key: SPARK-21912 URL: https://issues.apache.org/jira/browse/SPARK-21912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Dongjoon Hyun Currently, users meet job abortions while creating ORC data source tables with invalid column names. We should prevent this by raising an AnalysisException, as Parquet data source tables already do. {code} scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") 17/09/04 13:28:21 ERROR Utils: Aborting task java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct' but ' ' is found. at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360) ... 17/09/04 13:28:21 WARN FileOutputCommitter: Could not delete file:/Users/dongjoon/spark-release/spark-master/spark-warehouse/orc1/_temporary/0/_temporary/attempt_20170904132821_0001_m_00_0 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkException: Task failed while writing rows. {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
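The Parquet path referenced in the description rejects such names up front. The sketch below shows the kind of pre-write check being requested; the helper name, its placement, and the exact character set are assumptions (the actual change is in the linked pull request), and a plain exception is thrown so the snippet compiles outside Spark, where the real code would raise AnalysisException.
{code}
import org.apache.spark.sql.types.StructType

// Hypothetical validation run before any write task is launched: reject field names the
// Hive/ORC type parser cannot round-trip, instead of failing later inside the executor.
def checkOrcFieldNames(schema: StructType): Unit = {
  val forbidden = " ,;{}()\n\t="
  schema.fieldNames.foreach { name =>
    if (name.exists(c => forbidden.contains(c))) {
      throw new IllegalArgumentException(
        s"Column name '$name' contains invalid character(s); please use an alias to rename it.")
    }
  }
}
{code}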
[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152915#comment-16152915 ] Matei Zaharia commented on SPARK-21866: --- Just to chime in on this, I've also seen feedback that the deep learning libraries for Spark are too fragmented: there are too many of them, and people don't know where to start. This standard representation would at least give them a clear way to interoperate. It would let people write separate libraries for image processing, data augmentation and then training for example. > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. 
Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4
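As background on the OpenCV convention the excerpt refers to, OpenCV packs the element depth and the channel count into a single integer type code. The small sketch below follows OpenCV's CV_MAKETYPE rule; it is context for the "mode" field, not part of the proposed Spark API.
{code}
// OpenCV type code: depth + ((channels - 1) << 3)
val CV_8U = 0  // depth code for unsigned 8-bit elements

def cvMakeType(depth: Int, channels: Int): Int = depth + ((channels - 1) << 3)

val CV_8UC3 = cvMakeType(CV_8U, 3)  // 16: "3 channel unsigned bytes", as described above
val CV_8UC4 = cvMakeType(CV_8U, 4)  // 24: the BGRA layout mentioned at the end of the excerpt
{code}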
[jira] [Commented] (SPARK-21882) OutputMetrics doesn't count written bytes correctly in the saveAsHadoopDataset function
[ https://issues.apache.org/jira/browse/SPARK-21882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152914#comment-16152914 ] Apache Spark commented on SPARK-21882: -- User 'awarrior' has created a pull request for this issue: https://github.com/apache/spark/pull/19115 > OutputMetrics doesn't count written bytes correctly in the > saveAsHadoopDataset function > --- > > Key: SPARK-21882 > URL: https://issues.apache.org/jira/browse/SPARK-21882 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.2.0 >Reporter: linxiaojun >Priority: Minor > Attachments: SPARK-21882.patch > > > The first job called from saveAsHadoopDataset, running in each executor, does > not calculate the writtenBytes of OutputMetrics correctly (writtenBytes is > 0). The reason is that we did not initialize the callback function called to > find bytes written in the right way. As usual, statisticsTable which records > statistics in a FileSystem must be initialized at the beginning (this will be > triggered when open SparkHadoopWriter). The solution for this issue is to > adjust the order of callback function initialization. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
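The ordering point in the description is easy to miss, so here is a minimal sketch with invented names (this is not the SparkHadoopWriter API): the bytes-written callback reads Hadoop FileSystem statistics, and those statistics only exist once the writer has opened the FileSystem, so the callback has to be created after the open, not before.
{code}
// Invented names, for illustration only.
def writePartition(
    openWriter: () => Unit,                      // opens the Hadoop writer, registering FS statistics
    newBytesWrittenCallback: () => (() => Long), // builds a reader over the registered statistics
    writeRecords: () => Unit): Long = {
  openWriter()                                   // 1. open first, so the statistics exist
  val bytesWritten = newBytesWrittenCallback()   // 2. only now can the callback see real counters
  writeRecords()
  bytesWritten()                                 // bytes actually written; stays 0 if built too early
}
{code}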
[jira] [Assigned] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21418: Assignee: Apache Spark > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Assignee: Apache Spark >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(Ob
[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152821#comment-16152821 ] Apache Spark commented on SPARK-21418: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/19123 > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdina
[jira] [Assigned] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21418: Assignee: (was: Apache Spark) > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:117
[jira] [Updated] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21418: -- Summary: NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true (was: NoSuchElementException: None.get on DataFrame.rdd) > NoSuchElementException: None.get in DataSourceScanExec with > sun.io.serialization.extendedDebugInfo=true > --- > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStrea
[jira] [Updated] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21418: -- Priority: Minor (was: Major) I think we could easily make this code a little more defensive so that this doesn't result in an error. It's just trying to check if a config exists in SparkConf and there's no particular need for this to fail. > NoSuchElementException: None.get on DataFrame.rdd > - > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos >Priority: Minor > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream
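A minimal sketch of the "more defensive" lookup described above, assuming the redaction helper falls back from the active session's conf to the process-wide SparkEnv conf instead of calling Option.get; the helper name and exact fallback order are illustrative, not the merged patch.
{code}
import org.apache.spark.SparkEnv
import org.apache.spark.sql.SparkSession

// Illustrative only: never unwrap the thread-local active session with .get; fall back to the
// process-wide conf and finally to a default, so a serialization-time toString cannot throw.
def redactionPattern(key: String, default: String): String =
  SparkSession.getActiveSession              // None on threads with no active session
    .map(_.sparkContext.getConf)
    .orElse(Option(SparkEnv.get).map(_.conf))
    .map(_.get(key, default))
    .getOrElse(default)
{code}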
[jira] [Assigned] (SPARK-21911) Parallel Model Evaluation for ML Tuning: Python
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21911: Assignee: Apache Spark > Parallel Model Evaluation for ML Tuning: Python > --- > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Apache Spark > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21911) Parallel Model Evaluation for ML Tuning: PySpark
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-21911: --- Summary: Parallel Model Evaluation for ML Tuning: PySpark (was: Parallel Model Evaluation for ML Tuning: Python) > Parallel Model Evaluation for ML Tuning: PySpark > > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
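For reference, a hedged sketch of the equivalent knob on the Scala side of ML tuning in 2.3.0, which this ticket mirrors for pyspark; the estimator, grid values, and the commented-out training data are placeholders.
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(4)  // evaluate up to 4 candidate models concurrently

// val model = cv.fit(trainingData)  // trainingData is a placeholder DataFrame
{code}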
[jira] [Comment Edited] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152745#comment-16152745 ] Alexandre Dupriez edited comment on SPARK-17041 at 9/4/17 3:54 PM: --- I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like the following {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the exception can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. was (Author: hangleton): I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152745#comment-16152745 ] Alexandre Dupriez edited comment on SPARK-17041 at 9/4/17 3:53 PM: --- I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. was (Author: hangleton): I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295.;}} (in fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity). > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152745#comment-16152745 ] Alexandre Dupriez commented on SPARK-17041: --- I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295.;}} (in fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity). > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
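Independent of the error-message wording discussed above, analysis-time name resolution follows the spark.sql.caseSensitive setting, so enabling it is a common workaround for headers that differ only by case. A sketch, assuming a running SparkSession named spark and reusing dfSchema and dataFile from the report:
{code}
// Workaround sketch (not the message fix discussed above): with case-sensitive analysis,
// "Output" and "output" are treated as distinct columns and the read no longer fails.
spark.conf.set("spark.sql.caseSensitive", "true")

val df = spark.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(dfSchema)  // contains both "Output" and "output", as in the report
  .csv(dataFile)
{code}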
[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152742#comment-16152742 ] Daniel Darabos commented on SPARK-21418: Sorry for the delay. I can confirm that removing {{-Dsun.io.serialization.extendedDebugInfo=true}} is the fix. We only use this flag when running unit tests, but it's very useful for debugging serialization issues. It happens often in Spark that you accidentally include something in a closure that cannot be serialized. It's hard to figure out without this flag what caused that. > NoSuchElementException: None.get on DataFrame.rdd > - > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task > serialization failed: java.util.NoSuchElementException: None.get > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54) > at > org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60) > at > org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480) > at > org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477) > at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOu
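For anyone who wants the same debugging aid in their own suite, a hedged build.sbt sketch; the flag only reaches the tests if the JVM is forked, and the exact settings depend on the build.
{code}
// build.sbt fragment: pass the serialization-debug flag to forked test JVMs.
fork in Test := true
javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=true"
{code}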
[jira] [Comment Edited] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd
[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088952#comment-16088952 ] Daniel Darabos edited comment on SPARK-21418 at 9/4/17 3:49 PM: I'm on holiday without a computer through the coming week, but I'll try to dig deeper after that. I do recall that we enable a JVM flag for printing extra details on serialization errors. Now I wonder if that flag collects string forms even when no error happens. I guess I should not be surprised: if it did not, there would be no reason to ever disable this feature. That already suggests an easy workaround :). Thanks! was (Author: darabos): I'm on holiday without a computer through the coming week, but I'll try to dig deeper after that. I do recall that we enable a JVM flag for printing extra details on serialization errors. Now I wonder if that flag collects string forms even when no error happens. I guess I should not be surprised: if it did not, there would be no reason to ever disable this feature. That already suggests an easy workaround :). Thanks! On Jul 15, 2017 6:44 PM, "Kazuaki Ishizaki (JIRA)" wrote: [ https://issues.apache.org/jira/browse/SPARK-21418?page= com.atlassian.jira.plugin.system.issuetabpanels:comment- tabpanel&focusedCommentId=16088659#comment-16088659 ] Kazuaki Ishizaki commented on SPARK-21418: -- I am curious why {java.io.ObjectOutputStream.writeOrdinaryObject} calls `toString` method. Do you specify some option to run this program for JVM? following lines in a unit test for our Spark application: {{collect}} fails: serialization failed: java.util.NoSuchElementException: None.get $apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec. scala:70) DataSourceScanExec.scala:54) DataSourceScanExec.scala:52) 1.apply(TraversableLike.scala:234) 1.apply(TraversableLike.scala:234) ResizableArray.scala:59) DataSourceScanExec.scala:52) DataSourceScanExec.scala:75) QueryPlan.scala:349) apache$spark$sql$execution$DataSourceScanExec$$super$verboseString( DataSourceScanExec.scala:75) class.verboseString(DataSourceScanExec.scala:60) DataSourceScanExec.scala:75) generateTreeString(TreeNode.scala:556) generateTreeString(WholeStageCodegenExec.scala:451) generateTreeString(TreeNode.scala:576) TreeNode.scala:480) TreeNode.scala:477) TreeNode.scala:474) ObjectOutputStream.java:1421) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) writeObject(List.scala:468) NativeMethodAccessorImpl.java:62) DelegatingMethodAccessorImpl.java:43) ObjectStreamClass.java:1028) ObjectOutputStream.java:1496) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) ObjectOutputStream.java:1548) ObjectOutputStream.java:1509) ObjectOutputStream.java:1432) writeObject(JavaSerializer.scala:43) serialize(JavaSerializer.scala:100) DAGScheduler.scala:1003) scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930) DAGScheduler.scala:874) doOnReceive(DAGScheduler.scala:1677) onReceive(DAGScheduler.scala:1669) onReceive(DAGScheduler.scala:1658) 
91fa80fe8a2480d64c430bd10f97b3d44c007bcc#diff-2a91a9a59953aa82fa132aaf45bd731bR69 from https://issues.apache.org/jira/browse/SPARK-20070. It tries to redact sensitive information from {{explain}} output. (We are not trying to explain anything here, so I doubt it is meant to be running in this case.) When it needs to access some configuration, it tries to take it from the "current" Spark session, which it just reads from a thread-local variable. We appear to be on a thread where this variable is not set. This seems like a surprising constraint on multi-threaded Spark applications. -- This message was sent by Atlassian JIRA (v6.4.14#64029) > NoSuchElementException: None.get on DataFrame.rdd > - > > Key: SPARK-21418 > URL: https://issues.apache.org/jira/browse/SPARK-21418 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Daniel Darabos > > I don't have a minimal reproducible example yet, sorry. I have the following > lines in a unit test for our Spark application: > {code} > val df = mySparkSession.read.format("jdbc") > .options(Map("url" -> url, "dbtable" -> "test_table")) > .load() > df.show > println(df.rdd.collect) > {code} > The output shows the DataFrame contents from {{df.show}}. But the {{collect}} > fails with {{java.util.NoSuchElementException: None.get}} (the quoted stack trace is truncated in this message).
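The comment above traces the {{None.get}} to the redaction helper reading the "current" SparkSession from a thread-local variable. A minimal sketch of the implied workaround, not taken from the ticket: if the action is triggered from a secondary application thread, register the session on that thread first so the thread-local lookup succeeds. The builder settings and the toy DataFrame are placeholders.

{code:java}
// Sketch only: when DataFrame.rdd.collect runs on a thread other than the one
// that created the SparkSession, the thread-local "active session" can be
// unset there. Registering the session on that thread is one possible workaround.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val worker = new Thread(new Runnable {
  override def run(): Unit = {
    // Make the session visible to the thread-local lookup used on the
    // DataSourceScanExec redaction code path.
    SparkSession.setActiveSession(spark)
    val df = spark.range(10).toDF("id")
    println(df.rdd.collect().mkString(", "))
  }
})
worker.start()
worker.join()
{code}

The other workaround hinted at in the comment is simply disabling the serialization-debugging JVM flag, so the string forms are never computed on this code path.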
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152766#comment-16152766 ] Neil McQuarrie commented on SPARK-21727: Happy to take on the change this side... (unless [~yanboliang] you had intended to?) > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152764#comment-16152764 ] Neil McQuarrie commented on SPARK-21727: Well, if class is "numeric" (or "integer", "character", etc.), then technically it is always a vector? (There are no distinct scalars in R?) We could look at length > 1? > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21911) Parallel Model Evaluation for ML Tuning: Python
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21911: Assignee: (was: Apache Spark) > Parallel Model Evaluation for ML Tuning: Python > --- > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21911) Parallel Model Evaluation for ML Tuning: Python
[ https://issues.apache.org/jira/browse/SPARK-21911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152762#comment-16152762 ] Apache Spark commented on SPARK-21911: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/19122 > Parallel Model Evaluation for ML Tuning: Python > --- > > Key: SPARK-21911 > URL: https://issues.apache.org/jira/browse/SPARK-21911 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19357) Parallel Model Evaluation for ML Tuning: Scala
[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-19357: --- Summary: Parallel Model Evaluation for ML Tuning: Scala (was: Parallel Model Evaluation for ML Tuning) > Parallel Model Evaluation for ML Tuning: Scala > -- > > Key: SPARK-19357 > URL: https://issues.apache.org/jira/browse/SPARK-19357 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Bryan Cutler > > This is a first step of the parent task of Optimizations for ML Pipeline > Tuning to perform model evaluation in parallel. A simple approach is to > naively evaluate with a possible parameter to control the level of > parallelism. There are some concerns with this: > * excessive caching of datasets > * what to set as the default value for level of parallelism. 1 will evaluate > all models in serial, as is done currently. Higher values could lead to > excessive caching. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
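To make the proposal in the description above concrete, here is a minimal sketch of how the parallelism knob could look on the Scala side. The {{setParallelism}} setter name is an assumption chosen for illustration, not something confirmed by this ticket; a value of 1 would keep today's serial evaluation.

{code:java}
// Illustrative sketch only: assumes the proposed knob is exposed as a
// setParallelism(...) param on the tuning estimators.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(2)  // evaluate up to 2 candidate models concurrently

// val cvModel = cv.fit(trainingDf)  // trainingDf is a placeholder DataFrame
{code}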
[jira] [Created] (SPARK-21911) Parallel Model Evaluation for ML Tuning: Python
Weichen Xu created SPARK-21911: -- Summary: Parallel Model Evaluation for ML Tuning: Python Key: SPARK-21911 URL: https://issues.apache.org/jira/browse/SPARK-21911 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Weichen Xu Add parallelism support for ML tuning in pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17041) Columns in schema are no longer case sensitive when reading csv file
[ https://issues.apache.org/jira/browse/SPARK-17041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152745#comment-16152745 ] Alexandre Dupriez edited comment on SPARK-17041 at 9/4/17 3:53 PM: --- I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. was (Author: hangleton): I would advocate for a message which highlights the problem is case-related, since it may not be obvious from a message like {{Reference 'Output' is ambiguous, could be: Output#1263, Output#1295}} In fact it seems the column's header name provided in the message can be taken from either of the colliding columns - and thus contain capital letters, which can be misleading w.r.t. case sensitivity. > Columns in schema are no longer case sensitive when reading csv file > > > Key: SPARK-17041 > URL: https://issues.apache.org/jira/browse/SPARK-17041 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Barry Becker > > It used to be (in spark 1.6.2) that I could read a csv file that had columns > with names that differed only by case. For example, one column may be > "output" and another called "Output". Now (with spark 2.0.0) if I try to read > such a file, I get an error like this: > {code} > org.apache.spark.sql.AnalysisException: Reference 'Output' is ambiguous, > could be: Output#1263, Output#1295.; > {code} > The schema (dfSchema below) that I pass to the csv read looks like this: > {code} > StructType( StructField(Output,StringType,true), ... > StructField(output,StringType,true), ...) > {code} > The code that does the read is this > {code} > sqlContext.read > .format("csv") > .option("header", "false") // Use first line of all files as header > .option("inferSchema", "false") // Automatically infer data types > .schema(dfSchema) > .csv(dataFile) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
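The comment above asks for a clearer error message; the report itself is about columns that differ only by case. A commonly suggested workaround, sketched below under the assumption that the ambiguity comes from case-insensitive analysis, is to turn case sensitivity back on before reading. The file path and the two-column schema are placeholders.

{code:java}
// Sketch of a possible workaround (not verified against this exact report):
// make the analyzer case sensitive so "Output" and "output" stay distinct.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.conf.set("spark.sql.caseSensitive", "true")

val dfSchema = StructType(Seq(
  StructField("Output", StringType, nullable = true),
  StructField("output", StringType, nullable = true)))

val df = spark.read
  .format("csv")
  .option("header", "false")
  .schema(dfSchema)
  .csv("/path/to/dataFile.csv")  // placeholder path
{code}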
[jira] [Commented] (SPARK-21908) Checkpoint broadcast variable in spark streaming job
[ https://issues.apache.org/jira/browse/SPARK-21908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152702#comment-16152702 ] Venkat Gurukrishna commented on SPARK-21908: [~srowen] I tried sending mail from id to u...@spark.apache.org but I got the following error: : Must be sent from an @apache.org address or a subscriber address or an address in LDAP. Can you let me know how to send an email and to what email id I should send? > Checkpoint broadcast variable in spark streaming job > > > Key: SPARK-21908 > URL: https://issues.apache.org/jira/browse/SPARK-21908 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > In our Spark 1.6 CDH 5.8.3, job application, we are using the broadcast > variables and when we checkpoint them and restart the spark job getting error. > Even tried with making the broadcast variable as transient. But we are > getting different exception. > I have checked this JIRA link: > https://issues.apache.org/jira/browse/SPARK-5206 > which had mentioned to use singleton reference to broadcast variable and also > to use the transient. > Whether this needs to be done in the driver side or at the executor side? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21910) Connection pooling in Spark Job using HBASE Context
[ https://issues.apache.org/jira/browse/SPARK-21910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152699#comment-16152699 ] Venkat Gurukrishna commented on SPARK-21910: [~srowen] I tried sending mail from id to u...@spark.apache.org but I got the following error: : Must be sent from an @apache.org address or a subscriber address or an address in LDAP. Can you let me know how to send an email and to what email id I should send? > Connection pooling in Spark Job using HBASE Context > --- > > Key: SPARK-21910 > URL: https://issues.apache.org/jira/browse/SPARK-21910 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > Is there a way to implement the HBASE connection pool using the HBASE > Context? In our spark job we are making the HBASE call for each batch and we > see new connection object is getting created for each batch interval of one > second. We want to implement the connection pooling for HBASE context. Not > able to do the same. Is there way to achieve the same the connection pool to > HBASE using HBASE Context. We are using Spark 1.6.0, CDH 5.8.3, HBASE 1.2.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21909) Checkpoint HBASE Context in Spark Streaming Job
[ https://issues.apache.org/jira/browse/SPARK-21909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152700#comment-16152700 ] Venkat Gurukrishna commented on SPARK-21909: [~srowen] I tried sending mail from id to u...@spark.apache.org but I got the following error: : Must be sent from an @apache.org address or a subscriber address or an address in LDAP. Can you let me know how to send an email and to what email id I should send? > Checkpoint HBASE Context in Spark Streaming Job > --- > > Key: SPARK-21909 > URL: https://issues.apache.org/jira/browse/SPARK-21909 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > In our Spark 1.6 CDH 5.8.3, job application, when using the HBaseContext with > checkpoint and restart, it is giving exception. How to handle the > checkpointing for HBaseContext? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21910) Connection pooling in Spark Job using HBASE Context
[ https://issues.apache.org/jira/browse/SPARK-21910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21910. --- Resolution: Invalid Please stop opening JIRAs with questions. Use the mailing list > Connection pooling in Spark Job using HBASE Context > --- > > Key: SPARK-21910 > URL: https://issues.apache.org/jira/browse/SPARK-21910 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > Is there a way to implement the HBASE connection pool using the HBASE > Context? In our spark job we are making the HBASE call for each batch and we > see new connection object is getting created for each batch interval of one > second. We want to implement the connection pooling for HBASE context. Not > able to do the same. Is there way to achieve the same the connection pool to > HBASE using HBASE Context. We are using Spark 1.6.0, CDH 5.8.3, HBASE 1.2.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21909) Checkpoint HBASE Context in Spark Streaming Job
[ https://issues.apache.org/jira/browse/SPARK-21909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21909. --- Resolution: Invalid Please stop opening JIRAs with questions. Use the mailing list > Checkpoint HBASE Context in Spark Streaming Job > --- > > Key: SPARK-21909 > URL: https://issues.apache.org/jira/browse/SPARK-21909 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > In our Spark 1.6 CDH 5.8.3, job application, when using the HBaseContext with > checkpoint and restart, it is giving exception. How to handle the > checkpointing for HBaseContext? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21910) Connection pooling in Spark Job using HBASE Context
Venkat Gurukrishna created SPARK-21910: -- Summary: Connection pooling in Spark Job using HBASE Context Key: SPARK-21910 URL: https://issues.apache.org/jira/browse/SPARK-21910 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.6.0 Reporter: Venkat Gurukrishna Is there a way to implement the HBASE connection pool using the HBASE Context? In our spark job we are making the HBASE call for each batch and we see new connection object is getting created for each batch interval of one second. We want to implement the connection pooling for HBASE context. Not able to do the same. Is there way to achieve the same the connection pool to HBASE using HBASE Context. We are using Spark 1.6.0, CDH 5.8.3, HBASE 1.2.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
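A common way to get connection reuse without an explicit pool is one shared connection per executor JVM. The sketch below uses the plain HBase 1.2 client API rather than HBaseContext, so it is an assumed alternative, not a confirmed HBaseContext feature; table and key handling are placeholders.

{code:java}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object HBaseConnectionHolder {
  // Created at most once per executor JVM and reused across batches.
  lazy val connection: Connection =
    ConnectionFactory.createConnection(HBaseConfiguration.create())
}

// Usage inside the streaming job, e.g.:
// dstream.foreachRDD { rdd =>
//   rdd.foreachPartition { keys =>
//     val table = HBaseConnectionHolder.connection.getTable(TableName.valueOf("my_table"))
//     keys.foreach { key =>
//       val result = table.get(new Get(Bytes.toBytes(key)))
//       // ... use result ...
//     }
//     table.close() // close the lightweight Table, keep the shared Connection open
//   }
// }
{code}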
[jira] [Resolved] (SPARK-21908) Checkpoint broadcast variable in spark streaming job
[ https://issues.apache.org/jira/browse/SPARK-21908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21908. --- Resolution: Invalid It's not clear what your error is or what the result was of using a singleton, but, questions should go to the mailing list in any event. > Checkpoint broadcast variable in spark streaming job > > > Key: SPARK-21908 > URL: https://issues.apache.org/jira/browse/SPARK-21908 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Venkat Gurukrishna > > In our Spark 1.6 CDH 5.8.3, job application, we are using the broadcast > variables and when we checkpoint them and restart the spark job getting error. > Even tried with making the broadcast variable as transient. But we are > getting different exception. > I have checked this JIRA link: > https://issues.apache.org/jira/browse/SPARK-5206 > which had mentioned to use singleton reference to broadcast variable and also > to use the transient. > Whether this needs to be done in the driver side or at the executor side? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21909) Checkpoint HBASE Context in Spark Streaming Job
Venkat Gurukrishna created SPARK-21909: -- Summary: Checkpoint HBASE Context in Spark Streaming Job Key: SPARK-21909 URL: https://issues.apache.org/jira/browse/SPARK-21909 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.6.0 Reporter: Venkat Gurukrishna In our Spark 1.6 CDH 5.8.3, job application, when using the HBaseContext with checkpoint and restart, it is giving exception. How to handle the checkpointing for HBaseContext? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21908) Checkpoint broadcast variable in spark streaming job
Venkat Gurukrishna created SPARK-21908: -- Summary: Checkpoint broadcast variable in spark streaming job Key: SPARK-21908 URL: https://issues.apache.org/jira/browse/SPARK-21908 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.6.0 Reporter: Venkat Gurukrishna In our Spark 1.6 CDH 5.8.3, job application, we are using the broadcast variables and when we checkpoint them and restart the spark job getting error. Even tried with making the broadcast variable as transient. But we are getting different exception. I have checked this JIRA link: https://issues.apache.org/jira/browse/SPARK-5206 which had mentioned to use singleton reference to broadcast variable and also to use the transient. Whether this needs to be done in the driver side or at the executor side? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
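For reference, the lazily instantiated singleton referred to in SPARK-5206 usually looks like the sketch below (adapted from the Spark Streaming programming guide pattern, with placeholder contents). It is driver-side code: the singleton is consulted inside {{foreachRDD}} or {{transform}}, so after a restart from checkpoint the broadcast is rebuilt on first use instead of being read back from the checkpoint.

{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object WordBlacklist {
  @volatile private var instance: Broadcast[Seq[String]] = null

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Seq("a", "b", "c"))  // placeholder contents
        }
      }
    }
    instance
  }
}

// Driver-side usage, executed per batch:
// dstream.foreachRDD { rdd =>
//   val blacklist = WordBlacklist.getInstance(rdd.sparkContext)
//   rdd.filter(word => !blacklist.value.contains(word)).count()
// }
{code}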
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152639#comment-16152639 ] Yanbo Liang commented on SPARK-21727: - [~felixcheung] What do you mean for this comment? {quote} But with that said, I think we could and should make a minor change to support that implicitly https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R#L39 {quote} How we can get the SerDe type of atomic vector? Just like I mentioned above, {code} > class(rep(0, 20)) [1] "numeric" > class(as.list(rep(0, 20))) [1] "list" {code} _class_ function can't return type _vector_, how we can determine the type of object is _vector_ or _numeric_ ? Thanks. > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20))}} > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... > > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21905) ClassCastException when call sqlContext.sql on temp table
[ https://issues.apache.org/jira/browse/SPARK-21905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bluejoe updated SPARK-21905: Description: {code:java} val schema = StructType(List( StructField("name", DataTypes.StringType, true), StructField("location", new PointUDT, true))) val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); val dataFrame = sqlContext.createDataFrame(rowRdd, schema) dataFrame.createOrReplaceTempView("person"); sqlContext.sql("SELECT * FROM person").foreach(println(_)); {code} the last statement throws exception: {code:java} Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 18 more {code} was: {code:java} val schema = StructType(List( StructField("name", DataTypes.StringType, true), StructField("location", new PointUDT, true))) val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); val dataFrame = sqlContext.createDataFrame(rowRdd, schema) dataFrame.createOrReplaceTempView("person"); sqlContext.sql("SELECT * FROM person").foreach(println(_)); {code} the last statement throws exception: {code:java} Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 18 more {code} > ClassCastException when call sqlContext.sql on temp table > - > > Key: SPARK-21905 > URL: https://issues.apache.org/jira/browse/SPARK-21905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: bluejoe > > {code:java} > val schema = StructType(List( > StructField("name", DataTypes.StringType, true), > StructField("location", new PointUDT, true))) > val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), > 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); > val dataFrame = sqlContext.createDataFrame(rowRdd, schema) > dataFrame.createOrReplaceTempView("person"); > sqlContext.sql("SELECT * FROM person").foreach(println(_)); > {code} > the last statement throws exception: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) > ... 
18 more > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
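The {{GenericRow cannot be cast to InternalRow}} message usually means a UDT's {{serialize}} produced an external {{Row}} where Catalyst expects its internal representation. Below is a sketch of a {{PointUDT}} along those lines; the {{Point}} class is assumed from the report, and since {{UserDefinedType}} is not a public API in Spark 2.x, a real implementation would have to live under an {{org.apache.spark}} package.

{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._

case class Point(x: Double, y: Double)

class PointUDT extends UserDefinedType[Point] {
  override def sqlType: DataType = StructType(Seq(
    StructField("x", DoubleType, nullable = false),
    StructField("y", DoubleType, nullable = false)))

  // Emit the internal representation matching sqlType, not Row.fromSeq(...).
  override def serialize(p: Point): InternalRow = InternalRow(p.x, p.y)

  override def deserialize(datum: Any): Point = datum match {
    case row: InternalRow => Point(row.getDouble(0), row.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
{code}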
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152409#comment-16152409 ] jincheng commented on SPARK-18085: -- *{color:red}Here is a picture of how it looks{color}* !screenshot-1.png! {color:red}*and I also tried in spark 2.0.it looks like this *{color} !screenshot-2.png! the code is located at : org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:108) and it calls pagetable.pageData(PagedTable.scala:56) and throws an exception {code:java} def pageData(page: Int): PageData[T] = { val totalPages = (dataSize + pageSize - 1) / pageSize if (page <= 0 || page > totalPages) { throw new IndexOutOfBoundsException( s"Page $page is out of range. Please select a page number between 1 and $totalPages.") } val from = (page - 1) * pageSize val to = dataSize.min(page * pageSize) PageData(totalPages, sliceData(from, to)) } {code} it looks page=1 but totalPages = 0. so datasize + pagesize = 1. as {code:java} private[ui] abstract class PagedDataSource[T](val pageSize: Int) { if (pageSize <= 0) { throw new IllegalArgumentException("Page size must be positive") } {code} we did not meet this exception. so datasize = 0. this matches the case that no completed tasks, but instead all failed tasks should displayed just like spark 2.0. > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
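To make the arithmetic in the comment above explicit: with no completed tasks the data source reports {{dataSize = 0}}, so {{totalPages}} rounds down to 0 and the default request for page 1 is out of range. A worked example (the page size value is illustrative):

{code:java}
val pageSize = 100            // must be > 0, otherwise the constructor throws
val dataSize = 0              // no completed tasks in the stage
val totalPages = (dataSize + pageSize - 1) / pageSize   // (0 + 99) / 100 == 0
val page = 1                  // the default page requested by the UI
assert(page > totalPages)     // 1 > 0, so pageData throws IndexOutOfBoundsException
{code}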
[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152408#comment-16152408 ] Juliusz Sompolski commented on SPARK-21907: --- Note that UnsafeExternalSorter.spill appears twice on the stack trace, so it's nested spilling: the first triggered spilling triggers another spilling through UnsafeInMemorySorter.reset. Possibly it's messing up something by nested-spilling itself twice? Or messing something with {code:java} if (trigger != this) { if (readingIterator != null) { return readingIterator.spill(); } return 0L; // this should throw exception } {code} in spill() > NullPointerException in UnsafeExternalSorter.spill() > > > Key: SPARK-21907 > URL: https://issues.apache.org/jira/browse/SPARK-21907 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski > > I see NPE during sorting with the following stacktrace: > {code} > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125
[jira] [Created] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
Juliusz Sompolski created SPARK-21907: - Summary: NullPointerException in UnsafeExternalSorter.spill() Key: SPARK-21907 URL: https://issues.apache.org/jira/browse/SPARK-21907 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Juliusz Sompolski I see NPE during sorting with the following stacktrace: {code} java.lang.NullPointerException at org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43) at org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685) at org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259) at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mai
[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark
[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21866: -- Shepherd: Joseph K. Bradley > SPIP: Image support in Spark > > > Key: SPARK-21866 > URL: https://issues.apache.org/jira/browse/SPARK-21866 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Timothy Hunter > Labels: SPIP > Attachments: SPIP - Image support for Apache Spark V1.1.pdf > > > h2. Background and motivation > As Apache Spark is being used more and more in the industry, some new use > cases are emerging for different data formats beyond the traditional SQL > types or the numerical types (vectors and matrices). Deep Learning > applications commonly deal with image processing. A number of projects add > some Deep Learning capabilities to Spark (see list below), but they struggle > to communicate with each other or with MLlib pipelines because there is no > standard way to represent an image in Spark DataFrames. We propose to > federate efforts for representing images in Spark by defining a > representation that caters to the most common needs of users and library > developers. > This SPIP proposes a specification to represent images in Spark DataFrames > and Datasets (based on existing industrial standards), and an interface for > loading sources of images. It is not meant to be a full-fledged image > processing library, but rather the core description that other libraries and > users can rely on. Several packages already offer various processing > facilities for transforming images or doing more complex operations, and each > has various design tradeoffs that make them better as standalone solutions. > This project is a joint collaboration between Microsoft and Databricks, which > have been testing this design in two open source packages: MMLSpark and Deep > Learning Pipelines. > The proposed image format is an in-memory, decompressed representation that > targets low-level applications. It is significantly more liberal in memory > usage than compressed image representations such as JPEG, PNG, etc., but it > allows easy communication with popular image processing libraries and has no > decoding overhead. > h2. Targets users and personas: > Data scientists, data engineers, library developers. > The following libraries define primitives for loading and representing > images, and will gain from a common interchange format (in alphabetical > order): > * BigDL > * DeepLearning4J > * Deep Learning Pipelines > * MMLSpark > * TensorFlow (Spark connector) > * TensorFlowOnSpark > * TensorFrames > * Thunder > h2. Goals: > * Simple representation of images in Spark DataFrames, based on pre-existing > industrial standards (OpenCV) > * This format should eventually allow the development of high-performance > integration points with image processing libraries such as libOpenCV, Google > TensorFlow, CNTK, and other C libraries. > * The reader should be able to read popular formats of images from > distributed sources. > h2. Non-Goals: > Images are a versatile medium and encompass a very wide range of formats and > representations. 
This SPIP explicitly aims at the most common use case in the > industry currently: multi-channel matrices of binary, int32, int64, float or > double data that can fit comfortably in the heap of the JVM: > * the total size of an image should be restricted to less than 2GB (roughly) > * the meaning of color channels is application-specific and is not mandated > by the standard (in line with the OpenCV standard) > * specialized formats used in meteorology, the medical field, etc. are not > supported > * this format is specialized to images and does not attempt to solve the more > general problem of representing n-dimensional tensors in Spark > h2. Proposed API changes > We propose to add a new package in the package structure, under the MLlib > project: > {{org.apache.spark.image}} > h3. Data format > We propose to add the following structure: > imageSchema = StructType([ > * StructField("mode", StringType(), False), > ** The exact representation of the data. > ** The values are described in the following OpenCV convention. Basically, > the type has both "depth" and "number of channels" info: in particular, type > "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 > (value 32 in the table) with the channel order specified by convention. > ** The exact channel ordering and meaning of each channel is dictated by > convention. By default, the order is RGB (3 channels) and BGRA (4 channels). > If the image failed to load, the value is the empty string "". > * StructField("origin", StringType(), True), > ** Some information about the origin of the image. The content of this
[jira] [Updated] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15689: -- Shepherd: Reynold Xin Affects Version/s: 2.3.0 > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Reynold Xin > Labels: SPIP, releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
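To make goals 1 and 2 above more concrete, below is a purely hypothetical sketch of what a small-surface read contract could look like. None of these trait or method names are the actual v2 API; they are placeholders invented for illustration only.

{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Hypothetical: a reader that depends only on low-level types, not on
// DataFrame/SQLContext, keeping the surface small and stable.
trait HypotheticalBatchReader {
  def schema: StructType
  def pushFilters(filters: Seq[Filter]): Seq[Filter]   // returns the filters it cannot handle
  def planPartitions(): Seq[HypotheticalPartitionReader]
}

trait HypotheticalPartitionReader {
  def next(): Boolean
  def get(): InternalRow        // or a column batch on the vectorized path
  def close(): Unit
}
{code}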
[jira] [Updated] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18085: -- Shepherd: Marcelo Vanzin > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jincheng updated SPARK-18085: - Attachment: screenshot-2.png > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21906: Assignee: Apache Spark > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao >Assignee: Apache Spark > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set {code:java} env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am > container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21906: Assignee: (was: Apache Spark) > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set {code:java} env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am > container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jincheng updated SPARK-18085: - Attachment: screenshot-1.png > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152295#comment-16152295 ] Apache Spark commented on SPARK-21906: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/19121 > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set {code:java} env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am > container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21900) Numerical Error in simple Skewness Computation
[ https://issues.apache.org/jira/browse/SPARK-21900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152284#comment-16152284 ] Sean Owen commented on SPARK-21900: --- I don't feel strongly about it, but this is a reasonable issue to report. Especially since it didn't seem like it acted this way in 2.2. I don't have a suggested change but would be open to a patch for this if someone finds a method to compute the higher-order moments more accurately without sacrificing (much) speed. > Numerical Error in simple Skewness Computation > -- > > Key: SPARK-21900 > URL: https://issues.apache.org/jira/browse/SPARK-21900 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jakob Bach >Priority: Minor > > The skewness() aggregate SQL function in the Scala implementation > (org.apache.spark.sql.functions.skewness) seems to be buggy. The following code > {code:java} > import org.apache.spark.sql.functions > import org.apache.spark.sql.SparkSession > object SkewTest { > def main(args: Array[String]): Unit = { > val spark = SparkSession. > builder(). > appName("Skewness example"). > master("local[1]"). > getOrCreate() > > spark.createDataFrame(Seq(4,1,2,3).map(Tuple1(_))).agg(functions.skewness("_1")).show() > } > } > {code} > should output 0 (as it does for Seq(1,2,3,4)), but outputs > {code:none} > ++ > |skewness(_1)| > ++ > |5.958081967793454...| > ++ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
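To make the suggestion above concrete, a two-pass computation of the central moments avoids most of the cancellation error that a one-pass raw-moment formula suffers from. The sketch below is plain local Scala, not Spark's aggregate implementation; the helper name and the use of an in-memory Seq[Double] are assumptions for illustration only.
{code:java}
// Hedged sketch: two-pass (population) skewness over an in-memory collection.
// Pass 1 computes the mean; pass 2 computes the central moments directly,
// which is far less prone to catastrophic cancellation than combining raw moments.
def skewnessTwoPass(xs: Seq[Double]): Double = {
  require(xs.nonEmpty, "need at least one value")
  val n = xs.length.toDouble
  val mean = xs.sum / n
  val m2 = xs.map(x => (x - mean) * (x - mean)).sum / n // 2nd central moment
  val m3 = xs.map(x => math.pow(x - mean, 3)).sum / n   // 3rd central moment
  m3 / math.pow(m2, 1.5)
}

// skewnessTwoPass(Seq(4.0, 1.0, 2.0, 3.0)) evaluates to exactly 0.0 for this
// symmetric sample, the result the reporter expected.
{code}
Whether a formulation like this can be folded into Spark's distributed, single-pass aggregation without losing speed is exactly the open question in the comment above.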
[jira] [Resolved] (SPARK-21892) status code is 200 OK when killing an application fails via the Spark master REST API
[ https://issues.apache.org/jira/browse/SPARK-21892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21892. --- Resolution: Not A Problem > status code is 200 OK when killing an application fails via the Spark master REST API > --- > > Key: SPARK-21892 > URL: https://issues.apache.org/jira/browse/SPARK-21892 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Zhuang Xueyin >Priority: Minor > > Sent a POST request to the Spark master REST API, e.g.: > http://:6066/v1/submissions/kill/driver-xxx > Request body: > { > "action" : "KillSubmissionRequest", > "clientSparkVersion" : "2.1.0" > } > Response body: > { > "action" : "KillSubmissionResponse", > "message" : "Driver driver-xxx has already finished or does not exist", > "serverSparkVersion" : "2.1.0", > "submissionId" : "driver-xxx", > "success" : false > } > Response headers: > *Status Code: 200 OK* > Content-Length: 203 > Content-Type: application/json; charset=UTF-8 > Date: Fri, 01 Sep 2017 05:56:04 GMT > Server: Jetty(9.2.z-SNAPSHOT) > Result: > the status code is 200 OK even when killing an application fails via the Spark master REST API. > While the response body indicates that the operation was not successful, returning a success status > code does not follow REST API conventions, so I suggest improving it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
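For anyone hitting this before the behaviour changes, the practical consequence is that a client cannot trust the HTTP status code alone. The sketch below shows the kind of client-side check currently required; the host, port and driver id are placeholders, and the substring check stands in for real JSON parsing.
{code:java}
// Hedged sketch: submit the kill request and decide success from the JSON body,
// since the endpoint returns HTTP 200 even when the kill did not succeed.
import java.net.{HttpURLConnection, URL}
import scala.io.Source

val url = new URL("http://master-host:6066/v1/submissions/kill/driver-xxx") // placeholders
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
conn.setRequestProperty("Content-Type", "application/json")
conn.getOutputStream.write(
  """{"action" : "KillSubmissionRequest", "clientSparkVersion" : "2.1.0"}""".getBytes("UTF-8"))

val status = conn.getResponseCode // 200 whether or not the driver was killed
val body = Source.fromInputStream(conn.getInputStream).mkString
val killed = body.contains("\"success\" : true") // the field that actually matters
println(s"HTTP $status, killed = $killed")
{code}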
[jira] [Updated] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-21906: - Description: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set {code:java} env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} was: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set bq. env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set {code:java} env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() {code} in the am > container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-21906: - Description: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set bq. env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} was: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set ` env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() ` in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set bq. env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() > in the am container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-21906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-21906: - Description: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set ` env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() ` in the am container context {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} was: 1、The Yarn application‘s ugi is determined by the ugi launching it 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have already set ` env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() ` in the am container context {code|java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the user itself logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi use itself transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft def run: Unit = func() }) } {code} > No need to runAsSparkUser to switch UserGroupInformation in YARN mode > - > > Key: SPARK-21906 > URL: https://issues.apache.org/jira/browse/SPARK-21906 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.2.0 >Reporter: Kent Yao > > 1、The Yarn application‘s ugi is determined by the ugi launching it > 2、 runAsSparkUser is used to switch a ugi as same as itself, because we have > already set ` env("SPARK_USER") = > UserGroupInformation.getCurrentUser().getShortUserName() > ` in the am container context > {code:java} > def runAsSparkUser(func: () => Unit) { > val user = Utils.getCurrentUserName() // get the user itself > logDebug("running as user: " + user) > val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi > use itself > transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // > transfer its own credentials > ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itseft > def run: Unit = func() > }) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21906) No need to runAsSparkUser to switch UserGroupInformation in YARN mode
Kent Yao created SPARK-21906: Summary: No need to runAsSparkUser to switch UserGroupInformation in YARN mode Key: SPARK-21906 URL: https://issues.apache.org/jira/browse/SPARK-21906 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 2.2.0 Reporter: Kent Yao 1. The YARN application's ugi is determined by the ugi that launched it. 2. runAsSparkUser only switches to a ugi identical to the current one, because we have already set ` env("SPARK_USER") = UserGroupInformation.getCurrentUser().getShortUserName() ` in the AM container context: {code:java} def runAsSparkUser(func: () => Unit) { val user = Utils.getCurrentUserName() // get the current user logDebug("running as user: " + user) val ugi = UserGroupInformation.createRemoteUser(user) // create a new ugi for the same user transferCredentials(UserGroupInformation.getCurrentUser(), ugi) // transfer its own credentials ugi.doAs(new PrivilegedExceptionAction[Unit] { // doAs as itself def run: Unit = func() }) } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
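A minimal sketch of what the proposed simplification could look like: when the current UGI already corresponds to SPARK_USER (the YARN case described in point 1 above), the doAs round-trip can be skipped and the function invoked directly. The guard and the fallback branch are assumptions for illustration, not the actual patch.
{code:java}
// Hedged sketch, not the real change: skip the createRemoteUser/doAs dance when the
// process is already running as the intended Spark user.
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

def runAsSparkUser(func: () => Unit): Unit = {
  val current = UserGroupInformation.getCurrentUser
  val sparkUser = sys.env.getOrElse("SPARK_USER", current.getShortUserName)
  if (current.getShortUserName == sparkUser) {
    func() // already the right user; no UGI switch needed
  } else {
    // a real change would also need to transfer credentials here, as the current code does
    val ugi = UserGroupInformation.createRemoteUser(sparkUser)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = func()
    })
  }
}
{code}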
[jira] [Comment Edited] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144943#comment-16144943 ] Adrien Lavoillotte edited comment on SPARK-21850 at 9/4/17 8:41 AM: I am not saying it should _interpret the \n_, quite the opposite. I'm saying it comes from a column, so the \ should be _auto-escaped_ and not crash. As it stands, *LIKE + column will crash if the column value contains a backslash* not followed by \, _ or % precisely because it tries to interpret it. Also, please note that this behaviour is buggy only in Spark 2.2.0, but works in -every- other database/SQL engine that we tested, including hive and earlier versions of SparkSQL. was (Author: instanceof me): I am not saying it should _interpret the \n_, quite the opposite. I'm saying it comes from a column, so the \ should be _auto-escaped_ and not crash. As it stands, *LIKE + column will crash if the column value contains a backslash* not followed by \, _ or % precisely because it tries to interpret it. Also, please note that this behaviour is buggy only in Spark 2.2.0, but -works in every other database/SQL engine that we tested-, including hive and earlier versions of SparkSQL. > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). 
If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
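For reference, the escaping the reporter expected Spark to apply automatically can be done by hand before the value is used as a pattern. This is a plain-Scala sketch under the assumption that \, _ and % are the only LIKE metacharacters; the helper name is made up for illustration.
{code:java}
// Hedged sketch: escape LIKE metacharacters in a value that will be used as a pattern,
// so a literal backslash in the data is not read as the start of an escape sequence.
def escapeForLike(value: String): String =
  value.flatMap {
    case '\\' => "\\\\"  // literal backslash  -> escaped backslash
    case '%'  => "\\%"   // literal percent    -> escaped percent
    case '_'  => "\\_"   // literal underscore -> escaped underscore
    case c    => c.toString
  }

// The problematic value from row 3 (a real backslash followed by 'n') comes back with
// its backslash doubled, so a LIKE pattern built from it is no longer rejected.
val pattern = escapeForLike("""there is a \n in this line""") + "%"
{code}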
[jira] [Created] (SPARK-21905) ClassCastException when calling sqlContext.sql on a temp table
bluejoe created SPARK-21905: --- Summary: ClassCastException when calling sqlContext.sql on a temp table Key: SPARK-21905 URL: https://issues.apache.org/jira/browse/SPARK-21905 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: bluejoe {code:java} val schema = StructType(List( StructField("name", DataTypes.StringType, true), StructField("location", new PointUDT, true))) val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); val dataFrame = sqlContext.createDataFrame(rowRdd, schema) dataFrame.createOrReplaceTempView("person"); sqlContext.sql("SELECT * FROM person").foreach(println(_)); {code} the last statement throws an exception: {code:java} Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to org.apache.spark.sql.catalyst.InternalRow at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) ... 18 more {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
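One way to narrow this down is to run the same pipeline with only built-in types; this is a diagnostic sketch, assuming a SparkSession named spark, with the reporter's PointUDT and Point deliberately left out. If the variant below succeeds, the GenericRow-to-InternalRow cast failure is most likely tied to how rows carrying the custom UDT are converted, rather than to the temp view or the SQL path itself.
{code:java}
// Hedged diagnostic sketch: identical flow, built-in types only.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("name", StringType, nullable = true),
  StructField("x", DoubleType, nullable = true)))

val rowRdd = spark.sparkContext
  .parallelize(Seq("bluejoe", "alex"), 4)
  .map(name => Row(name, 100.0))

spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("person")
spark.sql("SELECT * FROM person").collect().foreach(println) // expected to succeed
{code}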
[jira] [Closed] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Lavoillotte closed SPARK-21850. -- Resolution: Not A Bug > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152241#comment-16152241 ] Adrien Lavoillotte commented on SPARK-21850: The behaviour seems indeed logical if it takes the actual value without escaping it, and I actually replicated it in some other DBs (our earlier tests were wrong, each DB having its own rules for escaping \), although they just don't match instead of failing, which is arguably preferable. I'll close the issue, thank you for your help! > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by 
SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21850) SparkSQL cannot perform LIKE someColumn if someColumn's value contains a backslash \
[ https://issues.apache.org/jira/browse/SPARK-21850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144943#comment-16144943 ] Adrien Lavoillotte edited comment on SPARK-21850 at 9/4/17 8:07 AM: I am not saying it should _interpret the \n_, quite the opposite. I'm saying it comes from a column, so the \ should be _auto-escaped_ and not crash. As it stands, *LIKE + column will crash if the column value contains a backslash* not followed by \, _ or % precisely because it tries to interpret it. Also, please note that this behaviour is buggy only in Spark 2.2.0, but -works in every other database/SQL engine that we tested-, including hive and earlier versions of SparkSQL. was (Author: instanceof me): I am not saying it should _interpret the \n_, quite the opposite. I'm saying it comes from a column, so the \ should be _auto-escaped_ and not crash. As it stands, *LIKE + column will crash if the column value contains a backslash* not followed by \, _ or % precisely because it tries to interpret it. Also, please note that this behaviour is buggy only in Spark 2.2.0, but works in every other database/SQL engine that we tested, including hive and earlier versions of SparkSQL. > SparkSQL cannot perform LIKE someColumn if someColumn's value contains a > backslash \ > > > Key: SPARK-21850 > URL: https://issues.apache.org/jira/browse/SPARK-21850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Adrien Lavoillotte > > I have a test table looking like this: > {code:none} > spark.sql("select * from `test`.`types_basic`").show() > {code} > ||id||c_tinyint|| [...] || c_string|| > | 0| -128| [...] | string| > | 1|0| [...] |string 'with' "qu...| > | 2| 127| [...] | unicod€ strĭng| > | 3| 42| [...] |there is a \n in ...| > | 4| null| [...] |null| > Note the line with ID 3, which has a literal \n in c_string (e.g. "some \\n > string", not a line break). I would like to join another table using a LIKE > condition (to join on prefix). 
If I do this: > {code:none} > spark.sql("select * from `test`.`types_basic` a where a.`c_string` LIKE > CONCAT(a.`c_string`, '%')").show() > {code} > I get the following error in spark 2.2 (but not in any earlier version): > {noformat} > 17/08/28 12:47:38 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 9.0 > (TID 12, cdh5.local, executor 2): org.apache.spark.sql.AnalysisException: the > pattern 'there is a \n in this line%' is invalid, the escape character is not > allowed to precede 'n'; > at > org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42) > at > org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51) > at > org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > It seems to me that if LIKE requires special escaping there, then it should > be provided by SparkSQL on the value of the column. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org