[jira] [Created] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state
Josh Rosen created SPARK-4194: - Summary: Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state Key: SPARK-4194 URL: https://issues.apache.org/jira/browse/SPARK-4194 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Critical The SparkContext and SparkEnv constructors instantiate a bunch of objects that may need to be cleaned up after they're no longer needed. If an exception is thrown during SparkContext or SparkEnv construction (e.g. due to a bad configuration setting), then objects created earlier in the constructor may not be properly cleaned up. This is unlikely to cause problems for batch jobs submitted through {{spark-submit}}, since failure to construct SparkContext will probably cause the JVM to exit, but it is a potentially serious issue in interactive environments where a user might attempt to create SparkContext with some configuration, fail due to an error, and re-attempt the creation with new settings. In this case, resources from the previous creation attempt might not have been cleaned up and could lead to confusing errors (especially if the old, leaked resources share global state with the new SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state
[ https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193732#comment-14193732 ] Josh Rosen commented on SPARK-4194: --- I've marked this a blocker of SPARK-4180, an issue that tries to add exceptions when users try to create multiple active SparkContexts in the same JVM. PySpark already guards against this, but earlier versions of its error-checking code ran into issues where users would fail their initial attempt to create a SparkContext and then be unable to create new ones because we didn't clear the {{activeSparkContext}} variable after the constructor threw an exception. To fix this, we need to wrap the constructor code in a {{try}} block. Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state - Key: SPARK-4194 URL: https://issues.apache.org/jira/browse/SPARK-4194 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Critical The SparkContext and SparkEnv constructors instantiate a bunch of objects that may need to be cleaned up after they're no longer needed. If an exception is thrown during SparkContext or SparkEnv construction (e.g. due to a bad configuration setting), then objects created earlier in the constructor may not be properly cleaned up. This is unlikely to cause problems for batch jobs submitted through {{spark-submit}}, since failure to construct SparkContext will probably cause the JVM to exit, but it is a potentially serious issue in interactive environments where a user might attempt to create SparkContext with some configuration, fail due to an error, and re-attempt the creation with new settings. In this case, resources from the previous creation attempt might not have been cleaned up and could lead to confusing errors (especially if the old, leaked resources share global state with the new SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
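For illustration, a minimal sketch of the constructor-cleanup idea discussed above; the class and the createEnv/createUi parameters are hypothetical stand-ins, not the actual SparkContext code:
{code}
// Hypothetical sketch: createEnv/createUi stand in for the many components the real constructor builds.
class Context(createEnv: () => java.io.Closeable, createUi: () => java.io.Closeable) {
  private var env: java.io.Closeable = null
  private var ui: java.io.Closeable = null
  try {
    env = createEnv()   // an early step may succeed...
    ui = createUi()     // ...while a later step throws
  } catch {
    case e: Throwable =>
      // Release anything created so far and rethrow, so a retry starts from a clean slate.
      Option(ui).foreach(_.close())
      Option(env).foreach(_.close())
      throw e
  }
}
{code}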
[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4180: -- Labels: (was: starter) SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Blocker Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
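As an aside, a rough sketch of the companion-object guard proposed above, mirroring PySpark's {{context.py}} approach; all names here are illustrative, not the actual SparkContext implementation:
{code}
// Illustrative sketch only: a companion-object field guarded by a lock, checked in the
// constructor and cleared in stop(), as the description above proposes.
class Context {
  Context.lock.synchronized {
    if (Context.active.isDefined) {
      throw new IllegalStateException("Another Context is already active in this JVM; call stop() on it first")
    }
    Context.active = Some(this)
  }

  def stop(): Unit = Context.lock.synchronized {
    Context.active = None
  }
}

object Context {
  private val lock = new Object
  private var active: Option[Context] = None
}
{code}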
[jira] [Resolved] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3466. -- Resolution: Fixed Fix Version/s: 1.2.0 Limit size of results that a driver collects for each action Key: SPARK-3466 URL: https://issues.apache.org/jira/browse/SPARK-3466 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Matei Zaharia Assignee: Davies Liu Priority: Critical Fix For: 1.2.0 Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
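For illustration, a hedged sketch of how the proposed setting would be used once available, assuming it keeps the name {{spark.driver.maxResultSize}} suggested above:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: the setting is exposed as "spark.driver.maxResultSize" as proposed above.
// Jobs whose collected results exceed the limit would then be aborted instead of OOMing the driver.
val conf = new SparkConf()
  .setAppName("result-size-limit-example")
  .set("spark.driver.maxResultSize", "100m")
val sc = new SparkContext(conf)
{code}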
[jira] [Commented] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193747#comment-14193747 ] Patrick Wendell commented on SPARK-4180: Yeah [~adav] just ran into an issue where a lot of our tests weren't properly cleaning SparkContext's and it was a huge pain. We could also log the callsite where a SparkContext is created, so if you try to create another one it actually tells you where the previous one was created and throws a helpful exception. SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Blocker Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4195) Retry fetching block results when the fetch failure is caused by a connection timeout
[ https://issues.apache.org/jira/browse/SPARK-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193827#comment-14193827 ] Apache Spark commented on SPARK-4195: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/3061 Retry fetching block results when the fetch failure is caused by a connection timeout -- Key: SPARK-4195 URL: https://issues.apache.org/jira/browse/SPARK-4195 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Lianhui Wang When there are many executors in an application (for example, 1000), connection timeouts occur frequently. The exception is: WARN nio.SendingConnection: Error finishing connection java.net.ConnectException: Connection timed out at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.spark.network.nio.SendingConnection.finishConnect(Connection.scala:342) at org.apache.spark.network.nio.ConnectionManager$$anon$11.run(ConnectionManager.scala:273) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) This makes the driver treat these executors as lost, even though they are in fact still alive, so a retry mechanism should be added to reduce the probability of this problem occurring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLlib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincenzo Selvaggio updated SPARK-1406: -- Attachment: kmeans.xml SPARK-1406.pdf PMML model evaluation support via MLlib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Attachments: SPARK-1406.pdf, kmeans.xml It would be useful if Spark provided support for the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with statistical modeling tools like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLlib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193829#comment-14193829 ] Vincenzo Selvaggio commented on SPARK-1406: --- Hi, based on what Sean suggested, I had a go at this requirement, in particular the export of models to PMML, as I find it useful to decouple the producer (Spark) and the consumer (an app) of mining models. I have attached details on the approach taken; if you think it is valid, I could proceed with the implementation of the other exporters (so far only k-means is supported). I have also attached the PMML exported for k-means using the compiled spark-shell. PMML model evaluation support via MLlib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Attachments: SPARK-1406.pdf, kmeans.xml It would be useful if Spark provided support for the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with statistical modeling tools like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLlib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193830#comment-14193830 ] Apache Spark commented on SPARK-1406: - User 'selvinsource' has created a pull request for this issue: https://github.com/apache/spark/pull/3062 PMML model evaluation support via MLlib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Attachments: SPARK-1406.pdf, kmeans.xml It would be useful if Spark provided support for the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with statistical modeling tools like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3914) InMemoryRelation should inherit statistics of its child to enable broadcast join
[ https://issues.apache.org/jira/browse/SPARK-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-3914. --- Resolution: Fixed InMemoryRelation should inherit statistics of its child to enable broadcast join Key: SPARK-3914 URL: https://issues.apache.org/jira/browse/SPARK-3914 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Assignee: Cheng Lian When a table/query is cached, {{InMemoryRelation}} stores the physical plan rather than the logical plan of the original table/query, thus losing the statistics information and disabling broadcast join optimization. Sample {{spark-shell}} session to reproduce this issue: {code} val sparkContext = sc import org.apache.spark.sql._ import sparkContext._ val sqlContext = new SQLContext(sparkContext) import sqlContext._ case class Sale(year: Int) makeRDD((1 to 100).map(Sale(_))).registerTempTable("sales") sql("select distinct year from sales limit 10").registerTempTable("tinyTable") cacheTable("tinyTable") sql("select * from sales join tinyTable on sales.year = tinyTable.year").queryExecution.executedPlan ... res3: org.apache.spark.sql.execution.SparkPlan = Project [year#4,year#5] ShuffledHashJoin [year#4], [year#5], BuildRight Exchange (HashPartitioning [year#4], 200) PhysicalRDD [year#4], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37 Exchange (HashPartitioning [year#5], 200) InMemoryColumnarTableScan [year#5], [], (InMemoryRelation [year#5], false, 1000, StorageLevel(true, true, false, true, 1), (Limit 10)) {code} A workaround for this is to add a {{LIMIT}} operator above the {{InMemoryColumnarTableScan}} operator: {code} sql("select * from sales join (select * from tinyTable limit 10) tiny on sales.year = tiny.year").queryExecution.executedPlan ... res8: org.apache.spark.sql.execution.SparkPlan = Project [year#12,year#13] BroadcastHashJoin [year#12], [year#13], BuildRight PhysicalRDD [year#12], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37 Limit 10 InMemoryColumnarTableScan [year#13], [], (InMemoryRelation [year#13], false, 1000, StorageLevel(true, true, false, true, 1), (Limit 10)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4196) Streaming + checkpointing yields NotSerializableException for Hadoop Configuration from saveAsNewAPIHadoopFiles ?
Sean Owen created SPARK-4196: Summary: Streaming + checkpointing yields NotSerializableException for Hadoop Configuration from saveAsNewAPIHadoopFiles ? Key: SPARK-4196 URL: https://issues.apache.org/jira/browse/SPARK-4196 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Sean Owen I am reasonably sure there is some issue here in Streaming and that I'm not missing something basic, but not 100%. I went ahead and posted it as a JIRA to track, since it's come up a few times before without resolution, and right now I can't get checkpointing to work at all. When Spark Streaming checkpointing is enabled, I see a NotSerializableException thrown for a Hadoop Configuration object, and it seems like it is not one from my user code. Before I post my particular instance, see http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408135046777-12202.p...@n3.nabble.com%3E for another occurrence. I was also on a customer site last week debugging an identical issue with checkpointing in a Scala-based program, and they also could not enable checkpointing without hitting exactly this error. The essence of my code is: {code} final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf); JavaStreamingContextFactory streamingContextFactory = new JavaStreamingContextFactory() { @Override public JavaStreamingContext create() { return new JavaStreamingContext(sparkContext, new Duration(batchDurationMS)); } }; streamingContext = JavaStreamingContext.getOrCreate( checkpointDirString, sparkContext.hadoopConfiguration(), streamingContextFactory, false); streamingContext.checkpoint(checkpointDirString); {code} It yields: {code} 2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66 org.apache.hadoop.conf.Configuration - field (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, name: conf$2, type: class org.apache.hadoop.conf.Configuration) - object (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, function2) - field (class org.apache.spark.streaming.dstream.ForEachDStream, name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc, type: interface scala.Function2) - object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@cb8016a) ... {code} This looks like it's due to PairDStreamFunctions.saveAsNewAPIHadoopFiles, as this saveFunc seems to be org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9: {code} def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf: Configuration = new Configuration ) { val saveFunc = (rdd: RDD[(K, V)], time: Time) => { val file = rddToFileName(prefix, suffix, time) rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass, outputFormatClass, conf) } self.foreachRDD(saveFunc) } {code} Is that not a problem? But then I don't know how it would ever work in Spark. Then again, I don't see why this would be an issue only when checkpointing is enabled. Long shot, but I wonder if it is related to closure issues like https://issues.apache.org/jira/browse/SPARK-1866 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
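One possible user-side workaround, sketched here but untested, with pairStream/prefix/suffix as placeholder names: drive the save from {{foreachRDD}} yourself and construct the Hadoop Configuration inside the closure, so that no Configuration instance is captured by the checkpointed DStream graph.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.SparkContext._            // pair-RDD implicits
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Sketch only: pairStream, prefix and suffix are placeholders for the user's own values.
def saveWithoutCapturingConf(pairStream: DStream[(Text, Text)],
                             prefix: String, suffix: String): Unit = {
  pairStream.foreachRDD { (rdd, time: Time) =>
    // The Configuration is built here at runtime instead of being serialized
    // as part of the checkpointed DStream closure.
    val hadoopConf = new Configuration()
    val file = prefix + "-" + time.milliseconds + suffix
    rdd.saveAsNewAPIHadoopFile(file, classOf[Text], classOf[Text],
      classOf[TextOutputFormat[Text, Text]], hadoopConf)
  }
}
{code}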
[jira] [Resolved] (SPARK-4166) Display the executor ID in the Web UI when ExecutorLostFailure happens
[ https://issues.apache.org/jira/browse/SPARK-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4166. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3033 [https://github.com/apache/spark/pull/3033] Display the executor ID in the Web UI when ExecutorLostFailure happens -- Key: SPARK-4166 URL: https://issues.apache.org/jira/browse/SPARK-4166 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor Fix For: 1.2.0 Now when ExecutorLostFailure happens, it only displays ExecutorLostFailure (executor lost) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3436) Streaming SVM
[ https://issues.apache.org/jira/browse/SPARK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3436: - Target Version/s: (was: 1.2.0) Streaming SVM -- Key: SPARK-3436 URL: https://issues.apache.org/jira/browse/SPARK-3436 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liquan Pei Assignee: Liquan Pei Implement online learning with kernels according to http://users.cecs.anu.edu.au/~williams/papers/P172.pdf The algorithms proposed in the above paper are implemented in R (http://users.cecs.anu.edu.au/~williams/papers/P172.pdf) and MADlib (http://doc.madlib.net/latest/group__grp__kernmach.html) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3436) Streaming SVM
[ https://issues.apache.org/jira/browse/SPARK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3436: - Affects Version/s: (was: 1.2.0) Streaming SVM -- Key: SPARK-3436 URL: https://issues.apache.org/jira/browse/SPARK-3436 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liquan Pei Assignee: Liquan Pei Implement online learning with kernels according to http://users.cecs.anu.edu.au/~williams/papers/P172.pdf The algorithms proposed in the above paper are implemented in R (http://users.cecs.anu.edu.au/~williams/papers/P172.pdf) and MADlib (http://doc.madlib.net/latest/group__grp__kernmach.html) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3147) Implement A/B testing
[ https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3147: - Target Version/s: 1.3.0 (was: 1.2.0) Implement A/B testing - Key: SPARK-3147 URL: https://issues.apache.org/jira/browse/SPARK-3147 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Xiangrui Meng A/B testing is widely used to compare online models. We can implement A/B testing in MLlib and integrate it with Spark Streaming. For example, we have a PairDStream[String, Double], whose keys are model ids and values are observations (click or not, or revenue associated with the event). With A/B testing, we can tell whether one model is significantly better than another at a certain time. There are some caveats. For example, we should avoid multiple testing and support A/A testing as a sanity check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3974) Block matrix abstractions and partitioners
[ https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3974: - Target Version/s: 1.3.0 (was: 1.2.0) Block matrix abstractions and partitioners -- Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Assignee: Burak Yavuz We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3188: - Target Version/s: 1.3.0 (was: 1.2.0) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates) -- Key: SPARK-3188 URL: https://issues.apache.org/jira/browse/SPARK-3188 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Priority: Minor Labels: features Original Estimate: 0h Remaining Estimate: 0h Linear least squares estimates assume the errors have a normal distribution and can behave badly when the errors are heavy-tailed. In practice we encounter various types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. The Tukey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194058#comment-14194058 ] Daniel Erenrich commented on SPARK-3080: Bumping RAM and changing parallelism did fix the issue for me. ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
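For reference, the parallelism of MLlib's ALS can be raised explicitly through the blocks argument of {{ALS.train}}; a small sketch with illustrative values (not a tuning recommendation):
{code}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// 'ratings' is assumed to exist already; rank, iterations, lambda and blocks are illustrative values.
def trainWithMoreBlocks(ratings: RDD[Rating]): MatrixFactorizationModel = {
  val rank = 10
  val iterations = 10
  val lambda = 0.01
  val blocks = 400   // more user/product blocks spreads memory pressure across smaller partitions
  ALS.train(ratings, rank, iterations, lambda, blocks)
}
{code}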
[jira] [Resolved] (SPARK-3247) Improved support for external data sources
[ https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3247. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2475 [https://github.com/apache/spark/pull/2475] Improved support for external data sources -- Key: SPARK-3247 URL: https://issues.apache.org/jira/browse/SPARK-3247 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4182) Caching tables containing boolean, binary, array, struct and/or map columns doesn't work
[ https://issues.apache.org/jira/browse/SPARK-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4182. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3059 [https://github.com/apache/spark/pull/3059] Caching tables containing boolean, binary, array, struct and/or map columns doesn't work Key: SPARK-4182 URL: https://issues.apache.org/jira/browse/SPARK-4182 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 If a table contains a column whose type is binary, array, struct, map, and for some reason, boolean, in-memory columnar caching doesn't work because a {{NoopColumnStats}} is used to collect column statistics. {{NoopColumnStats}} returns an empty statistics row, and thus breaks {{InMemoryRelation}} statistics calculation. Execute the following snippet to reproduce this bug in {{spark-shell}}: {code} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.catalyst.types._ val sparkContext = sc import sparkContext._ val sqlContext = new SQLContext(sparkContext) import sqlContext._ case class BoolField(flag: Boolean) val schemaRDD = parallelize(true :: false :: Nil).map(BoolField(_)).toSchemaRDD schemaRDD.cache().count() schemaRDD.count() {code} Exception thrown: {code} java.lang.ArrayIndexOutOfBoundsException: 4 at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142) at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.columnar.InMemoryRelation.computeSizeInBytes(InMemoryColumnarTableScan.scala:66) at org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation.statisticsToBePropagated(InMemoryColumnarTableScan.scala:73) at org.apache.spark.sql.columnar.InMemoryRelation.withOutput(InMemoryColumnarTableScan.scala:147) at org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122) at org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4185) JSON schema inference failed when dealing with type conflicts in arrays
[ https://issues.apache.org/jira/browse/SPARK-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4185. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3056 [https://github.com/apache/spark/pull/3056] JSON schema inference failed when dealing with type conflicts in arrays --- Key: SPARK-4185 URL: https://issues.apache.org/jira/browse/SPARK-4185 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.2.0 {code} val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext) val diverging = sparkContext.parallelize(List("""{"branches": ["foo"]}""", """{"branches": [{"foo": 42}]}""")) sqlContext.jsonRDD(diverging) // throws a MatchError {code} The case is from http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3572) Support register UserType in SQL
[ https://issues.apache.org/jira/browse/SPARK-3572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194113#comment-14194113 ] Apache Spark commented on SPARK-3572: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3063 Support register UserType in SQL Key: SPARK-3572 URL: https://issues.apache.org/jira/browse/SPARK-3572 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng Assignee: Joseph K. Bradley If a user knows how to map a class to a struct type in Spark SQL, he should be able to register this mapping through sqlContext and hence SQL can figure out the schema automatically. {code} trait RowSerializer[T] { def dataType: StructType def serialize(obj: T): Row def deserialize(row: Row): T } sqlContext.registerUserType[T](clazz: classOf[T], serializer: classOf[RowSerializer[T]]) {code} In sqlContext, we can maintain a class-to-serializer map and use it for conversion. The serializer class can be embedded into the metadata, so when `select` is called, we know we want to deserialize the result. {code} sqlContext.registerUserType(classOf[Vector], classOf[VectorRowSerializer]) val points: RDD[LabeledPoint] = ... val features: RDD[Vector] = points.select('features).map { case Row(v: Vector) = v } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2189) Method for removing temp tables created by registerAsTable
[ https://issues.apache.org/jira/browse/SPARK-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2189. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3039 [https://github.com/apache/spark/pull/3039] Method for removing temp tables created by registerAsTable -- Key: SPARK-2189 URL: https://issues.apache.org/jira/browse/SPARK-2189 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4183) Enable Netty-based BlockTransferService by default
[ https://issues.apache.org/jira/browse/SPARK-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4183. Resolution: Fixed Resolved a second time via: https://github.com/apache/spark/pull/3053 Enable Netty-based BlockTransferService by default -- Key: SPARK-4183 URL: https://issues.apache.org/jira/browse/SPARK-4183 Project: Spark Issue Type: Bug Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.2.0 Spark's NettyBlockTransferService relies on Netty to achieve high performance and zero-copy IO, which is a big simplification over the manual connection management that's done in today's NioBlockTransferService. Additionally, the core functionality of the NettyBlockTransferService has been extracted out of spark core into the network package, with the intention of reusing this code for SPARK-3796 ([PR #3001|https://github.com/apache/spark/pull/3001/files#diff-54]). We should turn NettyBlockTransferService on by default in order to improve debuggability and stability of the network layer (which has historically been more challenging with the current BlockTransferService). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4197) Gradient Boosting API cleanups
Joseph K. Bradley created SPARK-4197: Summary: Gradient Boosting API cleanups Key: SPARK-4197 URL: https://issues.apache.org/jira/browse/SPARK-4197 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Issues: * GradientBoosting.train methods take a very large number of parameters. * No examples for GradientBoosting Plan: * Use Builder constructs taking simple types to make it easier to specify BoostingStrategy parameters * Create examples in Scala and Java to make sure it is easy to construct BoostingStrategy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
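To make the builder idea concrete, a hypothetical sketch follows; none of these names are the actual MLlib API, they only illustrate a builder taking simple types.
{code}
// Hypothetical builder sketch: all names and defaults are illustrative only.
case class SimpleBoostingParams(numIterations: Int, learningRate: Double, maxDepth: Int)

class SimpleBoostingParamsBuilder {
  private var numIterations = 100
  private var learningRate = 0.1
  private var maxDepth = 3

  def setNumIterations(n: Int): this.type = { numIterations = n; this }
  def setLearningRate(r: Double): this.type = { learningRate = r; this }
  def setMaxDepth(d: Int): this.type = { maxDepth = d; this }
  def build(): SimpleBoostingParams = SimpleBoostingParams(numIterations, learningRate, maxDepth)
}

// Callers set only the simple-typed parameters they care about:
val params = new SimpleBoostingParamsBuilder().setNumIterations(50).setLearningRate(0.05).build()
{code}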
[jira] [Created] (SPARK-4198) Refactor BlockManager's doPut and doGetLocal into smaller pieces
Aaron Davidson created SPARK-4198: - Summary: Refactor BlockManager's doPut and doGetLocal into smaller pieces Key: SPARK-4198 URL: https://issues.apache.org/jira/browse/SPARK-4198 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Currently the BlockManager is pretty hard to grok due to a lot of complicated, interconnected details. Two particularly smelly pieces of code are doPut and doGetLocal, which are both around 200 lines long, with heavily nested statements and lots of orthogonal logic. We should break this up into smaller pieces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4198) Refactor BlockManager's doPut and doGetLocal into smaller pieces
[ https://issues.apache.org/jira/browse/SPARK-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194212#comment-14194212 ] Apache Spark commented on SPARK-4198: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3065 Refactor BlockManager's doPut and doGetLocal into smaller pieces Key: SPARK-4198 URL: https://issues.apache.org/jira/browse/SPARK-4198 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Currently the BlockManager is pretty hard to grok due to a lot of complicated, interconnected details. Two particularly smelly pieces of code are doPut and doGetLocal, which are both around 200 lines long, with heavily nested statements and lots of orthogonal logic. We should break this up into smaller pieces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4131) Support Writing data into the filesystem from queries
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaoJing wang updated SPARK-4131: - Description: Spark SQL does not support writing data into the filesystem from queries, e.g.: {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; {code} was: Writing data into the filesystem from queries,SparkSql is not support . eg: precodeinsert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; /code/pre Support Writing data into the filesystem from queries --- Key: SPARK-4131 URL: https://issues.apache.org/jira/browse/SPARK-4131 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: XiaoJing wang Priority: Critical Fix For: 1.3.0 Original Estimate: 0.05h Remaining Estimate: 0.05h Spark SQL does not support writing data into the filesystem from queries, e.g.: {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4199: - Summary: Drop table if exists raises table not found exception in HiveContext (was: Drop table if exists raises table not found exception ) Drop table if exists raises table not found exception in HiveContext -- Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4199) Drop table if exists raises table not found exception
Jianshi Huang created SPARK-4199: Summary: Drop table if exists raises table not found exception Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4200) akka.loglevel
SuYan created SPARK-4200: Summary: akka.loglevel Key: SPARK-4200 URL: https://issues.apache.org/jira/browse/SPARK-4200 Project: Spark Issue Type: Question Components: Spark Core Reporter: SuYan Hi, I want more debug information from Akka. I found an argument akka.loglevel in core/org/apache/spark/deploy/Client.scala, but I can't find any place that uses that akka.loglevel. By the way, how do I enable debug logging for Akka? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4192) Support UDT in Python
[ https://issues.apache.org/jira/browse/SPARK-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194237#comment-14194237 ] Apache Spark commented on SPARK-4192: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3068 Support UDT in Python - Key: SPARK-4192 URL: https://issues.apache.org/jira/browse/SPARK-4192 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor This is a sub-task of SPARK-3572 for UDT support in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4200) akka.loglevel
[ https://issues.apache.org/jira/browse/SPARK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4200. Resolution: Invalid Hi there, for issues like this, do you mind e-mailing the Spark user list rather than opening a JIRA? See here for the mailing list: http://spark.apache.org/community.html For this specific issue, you aren't finding it in a grep of the code because it is handled programmatically and passed to Akka directly: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L210 Happy to discuss in more detail on the user list... akka.loglevel - Key: SPARK-4200 URL: https://issues.apache.org/jira/browse/SPARK-4200 Project: Spark Issue Type: Question Components: Spark Core Reporter: SuYan Hi, I want more debug information from Akka. I found an argument akka.loglevel in core/org/apache/spark/deploy/Client.scala, but I can't find any place that uses that akka.loglevel. By the way, how do I enable debug logging for Akka? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
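A hedged example of the passthrough Patrick describes, assuming settings whose keys start with "akka." are forwarded to Akka's configuration as the linked SparkConf code suggests:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: SparkConf entries prefixed with "akka." are handed to Akka's config,
// so this should raise Akka's log level for the driver's actor system.
val conf = new SparkConf()
  .setAppName("akka-debug-logging-example")
  .set("akka.loglevel", "DEBUG")
val sc = new SparkContext(conf)
{code}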
[jira] [Resolved] (SPARK-4109) Task.stageId is not being deserialized correctly
[ https://issues.apache.org/jira/browse/SPARK-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4109. Resolution: Fixed Task.stageId is not being deserialized correctly --- Key: SPARK-4109 URL: https://issues.apache.org/jira/browse/SPARK-4109 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.2 Reporter: Lu Lu Fix For: 1.0.3 The two subclasses of Task, ShuffleMapTask and ResultTask, do not correctly deserialize stageId. Therefore, accessing TaskContext.stageId always returns zero to the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reopened SPARK-4165: Assignee: Prashant Sharma This is not a duplicate of SPARK-3200. It happens that the patch provided for SPARK-3200 fixes the issue here. Actor with Companion throws ambiguous reference error in REPL - Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Affects Versions: 1.0.1, 1.1.0, 1.2.0 Reporter: Shiti Saxena Assignee: Prashant Sharma Tried the following in the master branch REPL. {noformat} Spark context available as sc. scala import akka.actor.{Actor,Props} import akka.actor.{Actor, Props} scala :pas // Entering paste mode (ctrl-D to finish) class EchoActor extends Actor{ override def receive = { case message = sender ! message } } object EchoActor { def props: Props = Props(new EchoActor()) } // Exiting paste mode, now interpreting. defined class EchoActor defined module EchoActor scala EchoActor.props console:15: error: reference to EchoActor is ambiguous; it is imported twice in the same scope by import $VAL1.EchoActor and import INSTANCE.EchoActor EchoActor.props {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-4165: --- Component/s: Spark Shell Actor with Companion throws ambiguous reference error in REPL - Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0, 1.2.0 Reporter: Shiti Saxena Assignee: Prashant Sharma Tried the following in the master branch REPL. {noformat} Spark context available as sc. scala import akka.actor.{Actor,Props} import akka.actor.{Actor, Props} scala :pas // Entering paste mode (ctrl-D to finish) class EchoActor extends Actor{ override def receive = { case message = sender ! message } } object EchoActor { def props: Props = Props(new EchoActor()) } // Exiting paste mode, now interpreting. defined class EchoActor defined module EchoActor scala EchoActor.props console:15: error: reference to EchoActor is ambiguous; it is imported twice in the same scope by import $VAL1.EchoActor and import INSTANCE.EchoActor EchoActor.props {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194290#comment-14194290 ] Prashant Sharma commented on SPARK-4165: I think this should happen when we create any class with a companion. Can you confirm [~shiti] ? Actor with Companion throws ambiguous reference error in REPL - Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0, 1.2.0 Reporter: Shiti Saxena Assignee: Prashant Sharma Tried the following in the master branch REPL. {noformat} Spark context available as sc. scala import akka.actor.{Actor,Props} import akka.actor.{Actor, Props} scala :pas // Entering paste mode (ctrl-D to finish) class EchoActor extends Actor{ override def receive = { case message = sender ! message } } object EchoActor { def props: Props = Props(new EchoActor()) } // Exiting paste mode, now interpreting. defined class EchoActor defined module EchoActor scala EchoActor.props console:15: error: reference to EchoActor is ambiguous; it is imported twice in the same scope by import $VAL1.EchoActor and import INSTANCE.EchoActor EchoActor.props {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4177) update build doc for already supporting hive 13 in jdbc/cli
[ https://issues.apache.org/jira/browse/SPARK-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4177. Resolution: Fixed Assignee: wangfei update build doc for already supporting hive 13 in jdbc/cli --- Key: SPARK-4177 URL: https://issues.apache.org/jira/browse/SPARK-4177 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei Assignee: wangfei Fix For: 1.2.0 fix build doc since already support hive 13 in jdbc/cli -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-4165: --- Affects Version/s: (was: 1.2.0)

Actor with Companion throws ambiguous reference error in REPL
- Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3572) Internal API for User-Defined Types
[ https://issues.apache.org/jira/browse/SPARK-3572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3572: --- Summary: Internal API for User-Defined Types (was: Support register UserType in SQL)

Internal API for User-Defined Types
--- Key: SPARK-3572 URL: https://issues.apache.org/jira/browse/SPARK-3572 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng Assignee: Joseph K. Bradley

If a user knows how to map a class to a struct type in Spark SQL, he should be able to register this mapping through sqlContext and hence SQL can figure out the schema automatically.
{code}
trait RowSerializer[T] {
  def dataType: StructType
  def serialize(obj: T): Row
  def deserialize(row: Row): T
}

sqlContext.registerUserType[T](clazz: classOf[T], serializer: classOf[RowSerializer[T]])
{code}
In sqlContext, we can maintain a class-to-serializer map and use it for conversion. The serializer class can be embedded into the metadata, so when `select` is called, we know we want to deserialize the result.
{code}
sqlContext.registerUserType(classOf[Vector], classOf[VectorRowSerializer])

val points: RDD[LabeledPoint] = ...
val features: RDD[Vector] = points.select('features).map { case Row(v: Vector) => v }
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
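As a rough, editor-added illustration of the proposal above (the RowSerializer trait is only a sketch in this ticket, not a shipped API, and VectorRowSerializer is a name used purely for illustration), a serializer for MLlib's Vector might store the vector as a single array-of-doubles column:
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.{ArrayType, DoubleType, Row, StructField, StructType}

// Hypothetical implementation of the proposed (not yet existing) trait.
class VectorRowSerializer extends RowSerializer[Vector] {
  // One column holding the vector's values as an array of doubles.
  def dataType: StructType =
    StructType(StructField("values", ArrayType(DoubleType, false), false) :: Nil)

  def serialize(obj: Vector): Row = Row(obj.toArray.toSeq)

  def deserialize(row: Row): Vector =
    Vectors.dense(row(0).asInstanceOf[Seq[Double]].toArray)
}
{code}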
[jira] [Updated] (SPARK-4165) Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiti Saxena updated SPARK-4165: Summary: Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized (was: Using Companion Objects throws ambiguous reference error in REPL when an instance of Class isinitialized)

Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
- Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Using Companion Objects throws ambiguous reference error in REPL when an instance of Class isinitialized
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiti Saxena updated SPARK-4165: Summary: Using Companion Objects throws ambiguous reference error in REPL when an instance of Class isinitialized (was: Actor with Companion throws ambiguous reference error in REPL)

Using Companion Objects throws ambiguous reference error in REPL when an instance of Class isinitialized
Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiti Saxena updated SPARK-4165: Description: Tried the following in the master branch REPL.
{noformat}
Spark context available as sc.

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor{
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

// Exiting paste mode, now interpreting.

defined class EchoActor
defined module EchoActor

scala> EchoActor.props
<console>:15: error: reference to EchoActor is ambiguous;
it is imported twice in the same scope by
import $VAL1.EchoActor
and import INSTANCE.EchoActor
       EchoActor.props
{noformat}
The following code works:
{noformat}
case class TestA

object TestA{
  def print = { println("hello") }
}
{noformat}

was: Tried the following in the master branch REPL.
{noformat}
Spark context available as sc.

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor{
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

// Exiting paste mode, now interpreting.

defined class EchoActor
defined module EchoActor

scala> EchoActor.props
<console>:15: error: reference to EchoActor is ambiguous;
it is imported twice in the same scope by
import $VAL1.EchoActor
and import INSTANCE.EchoActor
       EchoActor.props
{noformat}

Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
- Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiti Saxena updated SPARK-4165: Description: Tried the following in the master branch REPL.
{noformat}
Spark context available as sc.

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor{
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

// Exiting paste mode, now interpreting.

defined class EchoActor
defined module EchoActor

scala> EchoActor.props
<console>:15: error: reference to EchoActor is ambiguous;
it is imported twice in the same scope by
import $VAL1.EchoActor
and import INSTANCE.EchoActor
       EchoActor.props
{noformat}
Using a companion object works when the apply method is not overloaded.
{noformat}
scala> :pas
// Entering paste mode (ctrl-D to finish)

case class TestA(w:String)

object TestA{
  def print = { println("hello") }
}

// Exiting paste mode, now interpreting.

defined class TestA
defined module TestA

scala> TestA.print
hello
{noformat}
When the apply method is overloaded, it throws an ambiguous reference error:
{noformat}
scala> :pas
// Entering paste mode (ctrl-D to finish)

case class T1(s:String,len:Int)

object T1{
  def apply(s:String):T1 = T1(s,s.size)
}

// Exiting paste mode, now interpreting.

defined class T1
defined module T1

scala> T1("abcd")
<console>:15: error: reference to T1 is ambiguous;
it is imported twice in the same scope by
import $VAL17.T1
and import INSTANCE.T1
       T1("abcd")
       ^

scala> T1("abcd",4)
<console>:15: error: reference to T1 is ambiguous;
it is imported twice in the same scope by
import $VAL18.T1
and import INSTANCE.T1
       T1("abcd",4)
{noformat}

was: Tried the following in the master branch REPL.
{noformat}
Spark context available as sc.

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor{
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

// Exiting paste mode, now interpreting.

defined class EchoActor
defined module EchoActor

scala> EchoActor.props
<console>:15: error: reference to EchoActor is ambiguous;
it is imported twice in the same scope by
import $VAL1.EchoActor
and import INSTANCE.EchoActor
       EchoActor.props
{noformat}
The following code works:
{noformat}
case class TestA

object TestA{
  def print = { println("hello") }
}
{noformat}

Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
- Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194311#comment-14194311 ] Apache Spark commented on SPARK-3573: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3070

Dataset
--- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical

This JIRA is for discussion of the ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema.

Sample code: Suppose we have training events stored on HDFS and user/ad features in Hive; we want to assemble features for training and then apply a decision tree. The proposed pipeline with dataset looks like the following (needs more refinement):
{code}
sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
val training = sqlContext.sql("""
  SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label,
         user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
         ad.targetGender AS targetGender
  FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""").cache()

val indexer = new Indexer()
val interactor = new Interactor()
val fvAssembler = new FeatureVectorAssembler()
val treeClassifier = new DecisionTreeClassifier()

val paramMap = new ParamMap()
  .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
  .put(indexer.sortByFrequency, true)
  .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
  .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
  .put(fvAssembler.dense, true)
  .put(treeClassifier.maxDepth, 4)

// By default, the classifier recognizes the "features" and "label" columns.
val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
val model = pipeline.fit(training, paramMap)

sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
val test = sqlContext.sql("""
  SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
         user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
         ad.targetGender AS targetGender
  FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""")

val prediction = model.transform(test).select('eventId, 'prediction)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194315#comment-14194315 ] Sean Owen commented on SPARK-1406: -- I put some comments on the PR. Thanks for starting on this; I think PMML interoperability is indeed helpful. One big issue here is that MLlib does not at the moment have any notion of a schema. PMML does, and this is vital to actually using the model elsewhere: you have to document what the variables are so they can be matched up with the same variables in another tool. So it's not possible now to do anything but produce a model with field_1, field_2, ... This calls into question whether PMML can be meaningfully exported from MLlib at this point; maybe it will have to wait until other PRs go in that start to add a schema. I also thought it would be a little better to separate the representation of a model from the utility methods that write the model to things like files; the latter can at least be separated out of the type hierarchy. I'm also wondering how much value it adds to design for non-PMML export at this stage. (Finally, I have some code lying around here that will translate the MLlib logistic regression model to PMML. I can put that in the pot at a suitable time.)

PMML model evaluation support via MLib
-- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Attachments: SPARK-1406.pdf, kmeans.xml

It would be useful if Spark provided support for evaluating PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with a statistical modeling tool like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
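On the point about separating the model representation from export utilities, here is a rough, editor-added sketch of what that separation could look like; PMMLExportable and PMMLModelExport are hypothetical names, not anything that exists in MLlib or in the PR under discussion.
{code}
// Hypothetical sketch only: keep "what the model looks like in PMML" separate
// from "how it gets written out".
trait PMMLExportable {
  /** Render this model as a PMML document (returned as XML text here for simplicity). */
  def toPMML(): String
}

object PMMLModelExport {
  import java.io.{File, PrintWriter}

  // File export lives outside the model type hierarchy.
  def saveToFile(model: PMMLExportable, path: String): Unit = {
    val writer = new PrintWriter(new File(path))
    try writer.write(model.toPMML()) finally writer.close()
  }
}
{code}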
[jira] [Commented] (SPARK-1704) Support EXPLAIN in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194322#comment-14194322 ] Apache Spark commented on SPARK-1704: - User 'concretevitamin' has created a pull request for this issue: https://github.com/apache/spark/pull/1003

Support EXPLAIN in Spark SQL
Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Assignee: Zongheng Yang Labels: sql Fix For: 1.0.1, 1.1.0 Original Estimate: 612h Remaining Estimate: 612h

{noformat}
14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src
14/05/03 22:08:40 INFO ParseDriver: Parse Completed
14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION :
java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*])
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248)
    at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39)
    at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
    at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407)
    at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
    at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
    at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124)
    at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
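For context, once EXPLAIN works it can be issued like any other statement. A minimal sketch, assuming a HiveContext instance named hiveContext and going through the same hql entry point that appears in the stack trace above:
{code}
// Prints the query plan rather than executing the query.
hiveContext.hql("EXPLAIN SELECT * FROM src").collect().foreach(println)
{code}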
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194329#comment-14194329 ] Eduardo Jimenez commented on SPARK-2691: I was looking at this, and I came up with a patch that simply passes ContainerInfo in the appropriate places, in both coarse-grained and fine-grained mode. The only thing I can see as a bit of a kludge is how to pass the Spark configuration. So far, I've added:

spark.mesos.container.type
spark.mesos.container.docker.image

I would prefer to simply have Spark pass these through without much validation; otherwise Spark and Mesos have to be kept in sync with respect to what is supported. I could add the network setting (HOST or BRIDGE), but other settings could be a kludge to provide. I would prefer not to pass Docker parameters for the CLI, as Mesos might not use those in the future anyway. Port mappings and volumes could be useful, but how should they be provided?

spark.mesos.container.volumes.A.container_path
spark.mesos.container.volumes.A.host_path
spark.mesos.container.volumes.A.mode
spark.mesos.container.volumes.B.container_path
spark.mesos.container.volumes.B.host_path
spark.mesos.container.volumes.B.mode

or

spark.mesos.container.volumes = A:container_path:host_path:mode, B:...

Any preference? I'll try to work on this tomorrow, put some tests together, and take it for a spin.

Allow Spark on Mesos to be launched with Docker
--- Key: SPARK-2691 URL: https://issues.apache.org/jira/browse/SPARK-2691 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesos

Currently, to launch Spark with Mesos one must upload a tarball and specify the executor URI to be passed in, which is then downloaded on each slave, or even on each execution depending on whether coarse-grained mode is used. We want Spark to support launching executors via a Docker image, building on the recent Docker and Mesos integration work. With that integration, Spark can simply specify a Docker image and whatever options are needed, and it should continue to work as-is.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
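As an illustration of the second format floated in the comment above (a sketch only: spark.mesos.container.volumes and the name:container_path:host_path:mode layout are just the proposal in that comment, not an existing Spark setting), parsing such a value could look like this:
{code}
// Parse "name:containerPath:hostPath:mode" entries from a comma-separated config value.
case class Volume(name: String, containerPath: String, hostPath: String, mode: String)

def parseVolumes(conf: String): Seq[Volume] =
  conf.split(",").map(_.trim).filter(_.nonEmpty).map { entry =>
    entry.split(":") match {
      case Array(name, containerPath, hostPath, mode) => Volume(name, containerPath, hostPath, mode)
      case _ => sys.error(s"Malformed volume spec: $entry")
    }
  }

// Example: parseVolumes("data:/srv/data:/mnt/data:RW, logs:/var/log:/mnt/logs:RO")
{code}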
[jira] [Created] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)
dongxu created SPARK-4201: - Summary: Can't use concat() on partition column in where condition (Hive compatibility problem) Key: SPARK-4201 URL: https://issues.apache.org/jira/browse/SPARK-4201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.0.0 Environment: Hive 0.12+hadoop 2.4/hadoop 2.2 +spark 1.1 Reporter: dongxu Priority: Minor

The team uses Hive for queries, and we are trying to move to Spark SQL. When I run a query like the following:

select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929';

it fails in Spark SQL, but it works well in Hive. I have to rewrite the SQL as:

select count(1) from gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29;

Here is the error log:
{noformat}
14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929']
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
   HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
 Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
  HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
    ... 16 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
 HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
    at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
    at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
    ... 20 more
Caused by: org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
{noformat}
[jira] [Updated] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)
[ https://issues.apache.org/jira/browse/SPARK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dongxu updated SPARK-4201: -- Description: The team uses Hive for queries, and we are trying to move to Spark SQL. When I run a query like the following:

select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929';

it fails in Spark SQL, but it works well in Hive. I have to rewrite the SQL as:

select count(1) from gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29;

Here is the error log:
{noformat}
14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929']
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
   HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
 Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
  HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
    ... 16 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
 HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
    at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
    at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
    ... 20 more
Caused by: org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
{noformat}
[jira] [Created] (SPARK-4202) DSL support for Scala UDF
Cheng Lian created SPARK-4202: - Summary: DSL support for Scala UDF Key: SPARK-4202 URL: https://issues.apache.org/jira/browse/SPARK-4202 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian

Using a Scala UDF with the current DSL API is quite verbose, e.g.:
{code}
case class KeyValue(key: Int, value: String)
val schemaRDD = sc.parallelize(1 to 10).map(i => KeyValue(i, i.toString)).toSchemaRDD

def foo = (a: Int, b: String) => a.toString + b

schemaRDD.select(                     // SELECT
  Star(None),                         //   *,
  ScalaUdf(                           //
    foo,                              //   foo(
    StringType,                       //
    'key.attr :: 'value.attr :: Nil)  //     key, value
).collect()                           // ) FROM ...
{code}
It would be good to add a DSL syntax to simplify UDF invocation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
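For comparison, a DSL along the lines the ticket asks for might let the same query read roughly like the snippet below. This is purely an illustrative, editor-added sketch: the udf wrapper and the column-style invocation are hypothetical names, and nothing here describes the API that any actual pull request adds.
{code}
// Hypothetical DSL: wrap the Scala function once, then apply it like a column expression.
def foo = (a: Int, b: String) => a.toString + b

val fooUdf = udf(foo)                               // hypothetical helper
schemaRDD.select(Star(None), fooUdf('key, 'value))  // SELECT *, foo(key, value) FROM ...
  .collect()
{code}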
[jira] [Commented] (SPARK-4202) DSL support for Scala UDF
[ https://issues.apache.org/jira/browse/SPARK-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194341#comment-14194341 ] Apache Spark commented on SPARK-4202: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3067

DSL support for Scala UDF
- Key: SPARK-4202 URL: https://issues.apache.org/jira/browse/SPARK-4202 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org