[jira] [Created] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state
Josh Rosen created SPARK-4194: - Summary: Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state Key: SPARK-4194 URL: https://issues.apache.org/jira/browse/SPARK-4194 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Critical The SparkContext and SparkEnv constructors instantiate a bunch of objects that may need to be cleaned up after they're no longer needed. If an exception is thrown during SparkContext or SparkEnv construction (e.g. due to a bad configuration setting), then objects created earlier in the constructor may not be properly cleaned up. This is unlikely to cause problems for batch jobs submitted through {{spark-submit}}, since failure to construct SparkContext will probably cause the JVM to exit, but it is a potentially serious issue in interactive environments where a user might attempt to create SparkContext with some configuration, fail due to an error, and re-attempt the creation with new settings. In this case, resources from the previous creation attempt might not have been cleaned up and could lead to confusing errors (especially if the old, leaked resources share global state with the new SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4194) Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state
[ https://issues.apache.org/jira/browse/SPARK-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193732#comment-14193732 ] Josh Rosen commented on SPARK-4194: --- I've marked this a blocker of SPARK-4180, an issue that tries to add exceptions when users try to create multiple active SparkContexts in the same JVM. PySpark already guards against this, but earlier versions of its error-checking code ran into issues where users would fail their initial attempt to create a SparkContext and then be unable to create new ones because we didn't clear the {{activeSparkContext}} variable after the constructor threw an exception. To fix this, we need to wrap the constructor code in a {{try}} block. Exceptions thrown during SparkContext or SparkEnv construction might lead to resource leaks or corrupted global state - Key: SPARK-4194 URL: https://issues.apache.org/jira/browse/SPARK-4194 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Critical The SparkContext and SparkEnv constructors instantiate a bunch of objects that may need to be cleaned up after they're no longer needed. If an exception is thrown during SparkContext or SparkEnv construction (e.g. due to a bad configuration setting), then objects created earlier in the constructor may not be properly cleaned up. This is unlikely to cause problems for batch jobs submitted through {{spark-submit}}, since failure to construct SparkContext will probably cause the JVM to exit, but it is a potentially serious issue in interactive environments where a user might attempt to create SparkContext with some configuration, fail due to an error, and re-attempt the creation with new settings. In this case, resources from the previous creation attempt might not have been cleaned up and could lead to confusing errors (especially if the old, leaked resources share global state with the new SparkContext). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
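For illustration, a minimal sketch of the constructor-cleanup idea discussed above; the class and the createEnv/createUi parameters are hypothetical stand-ins, not the actual SparkContext code:
{code}
// Hypothetical sketch: createEnv/createUi stand in for the many components the real constructor builds.
class Context(createEnv: () => java.io.Closeable, createUi: () => java.io.Closeable) {
  private var env: java.io.Closeable = null
  private var ui: java.io.Closeable = null
  try {
    env = createEnv()   // an early step may succeed...
    ui = createUi()     // ...while a later step throws
  } catch {
    case e: Throwable =>
      // Release anything created so far and rethrow, so a retry starts from a clean slate.
      Option(ui).foreach(_.close())
      Option(env).foreach(_.close())
      throw e
  }
}
{code}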
[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4180: -- Labels: (was: starter) SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Blocker Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
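As an aside, a rough sketch of the companion-object guard proposed above, mirroring PySpark's {{context.py}} approach; all names here are illustrative, not the actual SparkContext implementation:
{code}
// Illustrative sketch only: a companion-object field guarded by a lock, checked in the
// constructor and cleared in stop(), as the description above proposes.
class Context {
  Context.lock.synchronized {
    if (Context.active.isDefined) {
      throw new IllegalStateException("Another Context is already active in this JVM; call stop() on it first")
    }
    Context.active = Some(this)
  }

  def stop(): Unit = Context.lock.synchronized {
    Context.active = None
  }
}

object Context {
  private val lock = new Object
  private var active: Option[Context] = None
}
{code}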
[jira] [Resolved] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3466. -- Resolution: Fixed Fix Version/s: 1.2.0 Limit size of results that a driver collects for each action Key: SPARK-3466 URL: https://issues.apache.org/jira/browse/SPARK-3466 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Matei Zaharia Assignee: Davies Liu Priority: Critical Fix For: 1.2.0 Right now, operations like {{collect()}} and {{take()}} can crash the driver with an OOM if they bring back too much data. We should add a {{spark.driver.maxResultSize}} setting (or something like that) that will make the driver abort a job if its result is too big. We can set it to some fraction of the driver's memory by default, or to something like 100 MB. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
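For illustration, a hedged sketch of how the proposed setting would be used once available, assuming it keeps the name {{spark.driver.maxResultSize}} suggested above:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: the setting is exposed as "spark.driver.maxResultSize" as proposed above.
// Jobs whose collected results exceed the limit would then be aborted instead of OOMing the driver.
val conf = new SparkConf()
  .setAppName("result-size-limit-example")
  .set("spark.driver.maxResultSize", "100m")
val sc = new SparkContext(conf)
{code}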
[jira] [Commented] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193747#comment-14193747 ] Patrick Wendell commented on SPARK-4180: Yeah [~adav] just ran into an issue where a lot of our tests weren't properly cleaning SparkContext's and it was a huge pain. We could also log the callsite where a SparkContext is created, so if you try to create another one it actually tells you where the previous one was created and throws a helpful exception. SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Priority: Blocker Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4195) Retry fetching block results when the fetch failure is caused by a connection timeout
[ https://issues.apache.org/jira/browse/SPARK-4195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193827#comment-14193827 ] Apache Spark commented on SPARK-4195: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/3061 Retry fetching block results when the fetch failure is caused by a connection timeout -- Key: SPARK-4195 URL: https://issues.apache.org/jira/browse/SPARK-4195 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Lianhui Wang When there are many executors in an application (for example, 1000), connection timeouts occur frequently. The exception is: WARN nio.SendingConnection: Error finishing connection java.net.ConnectException: Connection timed out at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.spark.network.nio.SendingConnection.finishConnect(Connection.scala:342) at org.apache.spark.network.nio.ConnectionManager$$anon$11.run(ConnectionManager.scala:273) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) This makes the driver treat these executors as lost, even though they are in fact still alive, so a retry mechanism should be added to reduce the probability of this problem occurring. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLlib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vincenzo Selvaggio updated SPARK-1406: -- Attachment: kmeans.xml SPARK-1406.pdf PMML model evaluation support via MLlib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Attachments: SPARK-1406.pdf, kmeans.xml It would be useful if Spark provided support for the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with statistical modeling tools like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLlib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193829#comment-14193829 ] Vincenzo Selvaggio commented on SPARK-1406: --- Hi, based on what Sean suggested, I had a go at this requirement, in particular the export of models to PMML, as I find it useful to decouple the producer (Spark) and the consumer (an app) of mining models. I have attached details on the approach taken; if you think it is valid, I could proceed with the implementation of the other exporters (so far only k-means is supported). I have also attached the PMML exported for k-means using the compiled spark-shell. PMML model evaluation support via MLlib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Attachments: SPARK-1406.pdf, kmeans.xml It would be useful if Spark provided support for the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with statistical modeling tools like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLlib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193830#comment-14193830 ] Apache Spark commented on SPARK-1406: - User 'selvinsource' has created a pull request for this issue: https://github.com/apache/spark/pull/3062 PMML model evaluation support via MLlib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Attachments: SPARK-1406.pdf, kmeans.xml It would be useful if Spark provided support for the evaluation of PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with statistical modeling tools like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3914) InMemoryRelation should inherit statistics of its child to enable broadcast join
[ https://issues.apache.org/jira/browse/SPARK-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-3914. --- Resolution: Fixed InMemoryRelation should inherit statistics of its child to enable broadcast join Key: SPARK-3914 URL: https://issues.apache.org/jira/browse/SPARK-3914 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Assignee: Cheng Lian When a table/query is cached, {{InMemoryRelation}} stores the physical plan rather than the logical plan of the original table/query, thus losing the statistics information and disabling broadcast join optimization. Sample {{spark-shell}} session to reproduce this issue: {code} val sparkContext = sc import org.apache.spark.sql._ import sparkContext._ val sqlContext = new SQLContext(sparkContext) import sqlContext._ case class Sale(year: Int) makeRDD((1 to 100).map(Sale(_))).registerTempTable("sales") sql("select distinct year from sales limit 10").registerTempTable("tinyTable") cacheTable("tinyTable") sql("select * from sales join tinyTable on sales.year = tinyTable.year").queryExecution.executedPlan ... res3: org.apache.spark.sql.execution.SparkPlan = Project [year#4,year#5] ShuffledHashJoin [year#4], [year#5], BuildRight Exchange (HashPartitioning [year#4], 200) PhysicalRDD [year#4], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37 Exchange (HashPartitioning [year#5], 200) InMemoryColumnarTableScan [year#5], [], (InMemoryRelation [year#5], false, 1000, StorageLevel(true, true, false, true, 1), (Limit 10)) {code} A workaround for this is to add a {{LIMIT}} operator above the {{InMemoryColumnarTableScan}} operator: {code} sql("select * from sales join (select * from tinyTable limit 10) tiny on sales.year = tiny.year").queryExecution.executedPlan ... res8: org.apache.spark.sql.execution.SparkPlan = Project [year#12,year#13] BroadcastHashJoin [year#12], [year#13], BuildRight PhysicalRDD [year#12], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:37 Limit 10 InMemoryColumnarTableScan [year#13], [], (InMemoryRelation [year#13], false, 1000, StorageLevel(true, true, false, true, 1), (Limit 10)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4196) Streaming + checkpointing yields NotSerializableException for Hadoop Configuration from saveAsNewAPIHadoopFiles ?
Sean Owen created SPARK-4196: Summary: Streaming + checkpointing yields NotSerializableException for Hadoop Configuration from saveAsNewAPIHadoopFiles ? Key: SPARK-4196 URL: https://issues.apache.org/jira/browse/SPARK-4196 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Sean Owen I am reasonably sure there is some issue here in Streaming and that I'm not missing something basic, but not 100%. I went ahead and posted it as a JIRA to track, since it's come up a few times before without resolution, and right now I can't get checkpointing to work at all. When Spark Streaming checkpointing is enabled, I see a NotSerializableException thrown for a Hadoop Configuration object, and it seems like it is not one from my user code. Before I post my particular instance, see http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408135046777-12202.p...@n3.nabble.com%3E for another occurrence. I was also on a customer site last week debugging an identical issue with checkpointing in a Scala-based program, and they also could not enable checkpointing without hitting exactly this error. The essence of my code is: {code} final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf); JavaStreamingContextFactory streamingContextFactory = new JavaStreamingContextFactory() { @Override public JavaStreamingContext create() { return new JavaStreamingContext(sparkContext, new Duration(batchDurationMS)); } }; streamingContext = JavaStreamingContext.getOrCreate( checkpointDirString, sparkContext.hadoopConfiguration(), streamingContextFactory, false); streamingContext.checkpoint(checkpointDirString); {code} It yields: {code} 2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66 org.apache.hadoop.conf.Configuration - field (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, name: conf$2, type: class org.apache.hadoop.conf.Configuration) - object (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, function2) - field (class org.apache.spark.streaming.dstream.ForEachDStream, name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc, type: interface scala.Function2) - object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@cb8016a) ... {code} This looks like it's due to PairDStreamFunctions.saveAsNewAPIHadoopFiles, as this saveFunc seems to be org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9: {code} def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf: Configuration = new Configuration ) { val saveFunc = (rdd: RDD[(K, V)], time: Time) => { val file = rddToFileName(prefix, suffix, time) rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass, outputFormatClass, conf) } self.foreachRDD(saveFunc) } {code} Is that not a problem? But then I don't know how it would ever work in Spark. Then again, I don't see why this would be an issue only when checkpointing is enabled. Long shot, but I wonder if it is related to closure issues like https://issues.apache.org/jira/browse/SPARK-1866 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
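One possible user-side workaround, sketched here but untested, with pairStream/prefix/suffix as placeholder names: drive the save from {{foreachRDD}} yourself and construct the Hadoop Configuration inside the closure, so that no Configuration instance is captured by the checkpointed DStream graph.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.SparkContext._            // pair-RDD implicits
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Sketch only: pairStream, prefix and suffix are placeholders for the user's own values.
def saveWithoutCapturingConf(pairStream: DStream[(Text, Text)],
                             prefix: String, suffix: String): Unit = {
  pairStream.foreachRDD { (rdd, time: Time) =>
    // The Configuration is built here at runtime instead of being serialized
    // as part of the checkpointed DStream closure.
    val hadoopConf = new Configuration()
    val file = prefix + "-" + time.milliseconds + suffix
    rdd.saveAsNewAPIHadoopFile(file, classOf[Text], classOf[Text],
      classOf[TextOutputFormat[Text, Text]], hadoopConf)
  }
}
{code}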
[jira] [Resolved] (SPARK-4166) Display the executor ID in the Web UI when ExecutorLostFailure happens
[ https://issues.apache.org/jira/browse/SPARK-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4166. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3033 [https://github.com/apache/spark/pull/3033] Display the executor ID in the Web UI when ExecutorLostFailure happens -- Key: SPARK-4166 URL: https://issues.apache.org/jira/browse/SPARK-4166 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 1.1.0 Reporter: Shixiong Zhu Priority: Minor Fix For: 1.2.0 Now when ExecutorLostFailure happens, it only displays ExecutorLostFailure (executor lost) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3436) Streaming SVM
[ https://issues.apache.org/jira/browse/SPARK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3436: - Target Version/s: (was: 1.2.0) Streaming SVM -- Key: SPARK-3436 URL: https://issues.apache.org/jira/browse/SPARK-3436 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liquan Pei Assignee: Liquan Pei Implement online learning with kernels according to http://users.cecs.anu.edu.au/~williams/papers/P172.pdf The algorithms proposed in the above paper are implemented in R (http://users.cecs.anu.edu.au/~williams/papers/P172.pdf) and MADlib (http://doc.madlib.net/latest/group__grp__kernmach.html) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3436) Streaming SVM
[ https://issues.apache.org/jira/browse/SPARK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3436: - Affects Version/s: (was: 1.2.0) Streaming SVM -- Key: SPARK-3436 URL: https://issues.apache.org/jira/browse/SPARK-3436 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liquan Pei Assignee: Liquan Pei Implement online learning with kernels according to http://users.cecs.anu.edu.au/~williams/papers/P172.pdf The algorithms proposed in the above paper are implemented in R (http://users.cecs.anu.edu.au/~williams/papers/P172.pdf) and MADlib (http://doc.madlib.net/latest/group__grp__kernmach.html) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3147) Implement A/B testing
[ https://issues.apache.org/jira/browse/SPARK-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3147: - Target Version/s: 1.3.0 (was: 1.2.0) Implement A/B testing - Key: SPARK-3147 URL: https://issues.apache.org/jira/browse/SPARK-3147 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Xiangrui Meng A/B testing is widely used to compare online models. We can implement A/B testing in MLlib and integrate it with Spark Streaming. For example, we have a PairDStream[String, Double], whose keys are model ids and values are observations (click or not, or revenue associated with the event). With A/B testing, we can tell whether one model is significantly better than another at a certain time. There are some caveats. For example, we should avoid multiple testing and support A/A testing as a sanity check. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3974) Block matrix abstractions and partitioners
[ https://issues.apache.org/jira/browse/SPARK-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3974: - Target Version/s: 1.3.0 (was: 1.2.0) Block matrix abstractions and partitioners -- Key: SPARK-3974 URL: https://issues.apache.org/jira/browse/SPARK-3974 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Assignee: Burak Yavuz We need abstractions for block matrices with fixed block sizes, with each block being dense. Partitioners along both rows and columns are required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3188: - Target Version/s: 1.3.0 (was: 1.2.0) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates) -- Key: SPARK-3188 URL: https://issues.apache.org/jira/browse/SPARK-3188 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Priority: Minor Labels: features Original Estimate: 0h Remaining Estimate: 0h Linear least squares estimates assume the errors have a normal distribution and can behave badly when the errors are heavy-tailed. In practice we encounter various types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. The Tukey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194058#comment-14194058 ] Daniel Erenrich commented on SPARK-3080: Bumping RAM and changing parallelism did fix the issue for me. ArrayIndexOutOfBoundsException in ALS for Large datasets Key: SPARK-3080 URL: https://issues.apache.org/jira/browse/SPARK-3080 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Burak Yavuz The stack trace is below: {quote} java.lang.ArrayIndexOutOfBoundsException: 2716 org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) {quote} This happened after the dataset was sub-sampled. Dataset properties: ~12B ratings Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
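For reference, the parallelism of MLlib's ALS can be raised explicitly through the blocks argument of {{ALS.train}}; a small sketch with illustrative values (not a tuning recommendation):
{code}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// 'ratings' is assumed to exist already; rank, iterations, lambda and blocks are illustrative values.
def trainWithMoreBlocks(ratings: RDD[Rating]): MatrixFactorizationModel = {
  val rank = 10
  val iterations = 10
  val lambda = 0.01
  val blocks = 400   // more user/product blocks spreads memory pressure across smaller partitions
  ALS.train(ratings, rank, iterations, lambda, blocks)
}
{code}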
[jira] [Resolved] (SPARK-3247) Improved support for external data sources
[ https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3247. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2475 [https://github.com/apache/spark/pull/2475] Improved support for external data sources -- Key: SPARK-3247 URL: https://issues.apache.org/jira/browse/SPARK-3247 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4182) Caching tables containing boolean, binary, array, struct and/or map columns doesn't work
[ https://issues.apache.org/jira/browse/SPARK-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4182. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3059 [https://github.com/apache/spark/pull/3059] Caching tables containing boolean, binary, array, struct and/or map columns doesn't work Key: SPARK-4182 URL: https://issues.apache.org/jira/browse/SPARK-4182 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 If a table contains a column whose type is binary, array, struct, map, and for some reason, boolean, in-memory columnar caching doesn't work because a {{NoopColumnStats}} is used to collect column statistics. {{NoopColumnStats}} returns an empty statistics row, and thus breaks {{InMemoryRelation}} statistics calculation. Execute the following snippet to reproduce this bug in {{spark-shell}}: {code} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.catalyst.types._ val sparkContext = sc import sparkContext._ val sqlContext = new SQLContext(sparkContext) import sqlContext._ case class BoolField(flag: Boolean) val schemaRDD = parallelize(true :: false :: Nil).map(BoolField(_)).toSchemaRDD schemaRDD.cache().count() schemaRDD.count() {code} Exception thrown: {code} java.lang.ArrayIndexOutOfBoundsException: 4 at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142) at org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.columnar.InMemoryRelation.computeSizeInBytes(InMemoryColumnarTableScan.scala:66) at org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:87) at org.apache.spark.sql.columnar.InMemoryRelation.statisticsToBePropagated(InMemoryColumnarTableScan.scala:73) at org.apache.spark.sql.columnar.InMemoryRelation.withOutput(InMemoryColumnarTableScan.scala:147) at org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122) at org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4185) JSON schema inference failed when dealing with type conflicts in arrays
[ https://issues.apache.org/jira/browse/SPARK-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4185. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3056 [https://github.com/apache/spark/pull/3056] JSON schema inference failed when dealing with type conflicts in arrays --- Key: SPARK-4185 URL: https://issues.apache.org/jira/browse/SPARK-4185 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.2.0 {code} val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext) val diverging = sparkContext.parallelize(List("""{"branches": ["foo"]}""", """{"branches": [{"foo": 42}]}""")) sqlContext.jsonRDD(diverging) // throws a MatchError {code} The case is from http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3572) Support register UserType in SQL
[ https://issues.apache.org/jira/browse/SPARK-3572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194113#comment-14194113 ] Apache Spark commented on SPARK-3572: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3063 Support register UserType in SQL Key: SPARK-3572 URL: https://issues.apache.org/jira/browse/SPARK-3572 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng Assignee: Joseph K. Bradley If a user knows how to map a class to a struct type in Spark SQL, he should be able to register this mapping through sqlContext and hence SQL can figure out the schema automatically. {code} trait RowSerializer[T] { def dataType: StructType def serialize(obj: T): Row def deserialize(row: Row): T } sqlContext.registerUserType[T](clazz: classOf[T], serializer: classOf[RowSerializer[T]]) {code} In sqlContext, we can maintain a class-to-serializer map and use it for conversion. The serializer class can be embedded into the metadata, so when `select` is called, we know we want to deserialize the result. {code} sqlContext.registerUserType(classOf[Vector], classOf[VectorRowSerializer]) val points: RDD[LabeledPoint] = ... val features: RDD[Vector] = points.select('features).map { case Row(v: Vector) = v } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2189) Method for removing temp tables created by registerAsTable
[ https://issues.apache.org/jira/browse/SPARK-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2189. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3039 [https://github.com/apache/spark/pull/3039] Method for removing temp tables created by registerAsTable -- Key: SPARK-2189 URL: https://issues.apache.org/jira/browse/SPARK-2189 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4183) Enable Netty-based BlockTransferService by default
[ https://issues.apache.org/jira/browse/SPARK-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4183. Resolution: Fixed Resolved a second time via: https://github.com/apache/spark/pull/3053 Enable Netty-based BlockTransferService by default -- Key: SPARK-4183 URL: https://issues.apache.org/jira/browse/SPARK-4183 Project: Spark Issue Type: Bug Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.2.0 Spark's NettyBlockTransferService relies on Netty to achieve high performance and zero-copy IO, which is a big simplification over the manual connection management that's done in today's NioBlockTransferService. Additionally, the core functionality of the NettyBlockTransferService has been extracted out of spark core into the network package, with the intention of reusing this code for SPARK-3796 ([PR #3001|https://github.com/apache/spark/pull/3001/files#diff-54]). We should turn NettyBlockTransferService on by default in order to improve debuggability and stability of the network layer (which has historically been more challenging with the current BlockTransferService). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4197) Gradient Boosting API cleanups
Joseph K. Bradley created SPARK-4197: Summary: Gradient Boosting API cleanups Key: SPARK-4197 URL: https://issues.apache.org/jira/browse/SPARK-4197 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Issues: * GradientBoosting.train methods take a very large number of parameters. * No examples for GradientBoosting Plan: * Use Builder constructs taking simple types to make it easier to specify BoostingStrategy parameters * Create examples in Scala and Java to make sure it is easy to construct BoostingStrategy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
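To make the builder idea concrete, a hypothetical sketch follows; none of these names are the actual MLlib API, they only illustrate a builder taking simple types.
{code}
// Hypothetical builder sketch: all names and defaults are illustrative only.
case class SimpleBoostingParams(numIterations: Int, learningRate: Double, maxDepth: Int)

class SimpleBoostingParamsBuilder {
  private var numIterations = 100
  private var learningRate = 0.1
  private var maxDepth = 3

  def setNumIterations(n: Int): this.type = { numIterations = n; this }
  def setLearningRate(r: Double): this.type = { learningRate = r; this }
  def setMaxDepth(d: Int): this.type = { maxDepth = d; this }
  def build(): SimpleBoostingParams = SimpleBoostingParams(numIterations, learningRate, maxDepth)
}

// Callers set only the simple-typed parameters they care about:
val params = new SimpleBoostingParamsBuilder().setNumIterations(50).setLearningRate(0.05).build()
{code}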
[jira] [Created] (SPARK-4198) Refactor BlockManager's doPut and doGetLocal into smaller pieces
Aaron Davidson created SPARK-4198: - Summary: Refactor BlockManager's doPut and doGetLocal into smaller pieces Key: SPARK-4198 URL: https://issues.apache.org/jira/browse/SPARK-4198 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Currently the BlockManager is pretty hard to grok due to a lot of complicated, interconnected details. Two particularly smelly pieces of code are doPut and doGetLocal, which are both around 200 lines long, with heavily nested statements and lots of orthogonal logic. We should break this up into smaller pieces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4198) Refactor BlockManager's doPut and doGetLocal into smaller pieces
[ https://issues.apache.org/jira/browse/SPARK-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194212#comment-14194212 ] Apache Spark commented on SPARK-4198: - User 'aarondav' has created a pull request for this issue: https://github.com/apache/spark/pull/3065 Refactor BlockManager's doPut and doGetLocal into smaller pieces Key: SPARK-4198 URL: https://issues.apache.org/jira/browse/SPARK-4198 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Currently the BlockManager is pretty hard to grok due to a lot of complicated, interconnected details. Two particularly smelly pieces of code are doPut and doGetLocal, which are both around 200 lines long, with heavily nested statements and lots of orthogonal logic. We should break this up into smaller pieces. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4131) Support Writing data into the filesystem from queries
[ https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaoJing wang updated SPARK-4131: - Description: Spark SQL does not support writing data into the filesystem from queries, e.g.: {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; {code} was: Writing data into the filesystem from queries,SparkSql is not support . eg: precodeinsert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; /code/pre Support Writing data into the filesystem from queries --- Key: SPARK-4131 URL: https://issues.apache.org/jira/browse/SPARK-4131 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: XiaoJing wang Priority: Critical Fix For: 1.3.0 Original Estimate: 0.05h Remaining Estimate: 0.05h Spark SQL does not support writing data into the filesystem from queries, e.g.: {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * from page_views; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4199) Drop table if exists raises table not found exception in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang updated SPARK-4199: - Summary: Drop table if exists raises table not found exception in HiveContext (was: Drop table if exists raises table not found exception ) Drop table if exists raises table not found exception in HiveContext -- Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4199) Drop table if exists raises table not found exception
Jianshi Huang created SPARK-4199: Summary: Drop table if exists raises table not found exception Key: SPARK-4199 URL: https://issues.apache.org/jira/browse/SPARK-4199 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Jianshi Huang Try this: sql(DROP TABLE IF EXISTS some_table) The exception looks like this: 14/11/02 19:55:29 INFO ParseDriver: Parsing command: DROP TABLE IF EXISTS some_table 14/11/02 19:55:29 INFO ParseDriver: Parse Completed 14/11/02 19:55:29 INFO Driver: /PERFLOG method=parse start=1414986929678 end=1414986929678 duration=0 14/11/02 19:55:29 INFO Driver: PERFLOG method=semanticAnalyze 14/11/02 19:55:29 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 14/11/02 19:55:29 INFO ObjectStore: ObjectStore, initialize called 14/11/02 19:55:29 ERROR Driver: FAILED: SemanticException [Error 10001]: Table not found some_table org.apache.hadoop.hive.ql.parse.SemanticException: Table not found some_table at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3294) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.getTable(DDLSemanticAnalyzer.java:3281) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeDropTable(DDLSemanticAnalyzer.java:824) at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:249) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:441) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:294) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44) at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:353) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:353) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:104) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4200) akka.loglevel
SuYan created SPARK-4200: Summary: akka.loglevel Key: SPARK-4200 URL: https://issues.apache.org/jira/browse/SPARK-4200 Project: Spark Issue Type: Question Components: Spark Core Reporter: SuYan Hi, I want more debug information from Akka. I found an argument akka.loglevel in core/org/apache/spark/deploy/Client.scala, but I can't find any place that uses that akka.loglevel. By the way, how do I enable debug logging for Akka? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4192) Support UDT in Python
[ https://issues.apache.org/jira/browse/SPARK-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194237#comment-14194237 ] Apache Spark commented on SPARK-4192: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3068 Support UDT in Python - Key: SPARK-4192 URL: https://issues.apache.org/jira/browse/SPARK-4192 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor This is a sub-task of SPARK-3572 for UDT support in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4200) akka.loglevel
[ https://issues.apache.org/jira/browse/SPARK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4200. Resolution: Invalid Hi there, for issues like this, do you mind e-mailing the Spark user list rather than opening a JIRA? See here for the mailing list: http://spark.apache.org/community.html For this specific issue, you aren't finding it in a grep of the code because it is handled programmatically and passed to Akka directly: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkConf.scala#L210 Happy to discuss in more detail on the user list... akka.loglevel - Key: SPARK-4200 URL: https://issues.apache.org/jira/browse/SPARK-4200 Project: Spark Issue Type: Question Components: Spark Core Reporter: SuYan Hi, I want more debug information from Akka. I found an argument akka.loglevel in core/org/apache/spark/deploy/Client.scala, but I can't find any place that uses that akka.loglevel. By the way, how do I enable debug logging for Akka? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
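A hedged example of the passthrough Patrick describes, assuming settings whose keys start with "akka." are forwarded to Akka's configuration as the linked SparkConf code suggests:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: SparkConf entries prefixed with "akka." are handed to Akka's config,
// so this should raise Akka's log level for the driver's actor system.
val conf = new SparkConf()
  .setAppName("akka-debug-logging-example")
  .set("akka.loglevel", "DEBUG")
val sc = new SparkContext(conf)
{code}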
[jira] [Resolved] (SPARK-4109) Task.stageId is not being deserialized correctly
[ https://issues.apache.org/jira/browse/SPARK-4109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4109. Resolution: Fixed Task.stageId is not being deserialized correctly --- Key: SPARK-4109 URL: https://issues.apache.org/jira/browse/SPARK-4109 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.2 Reporter: Lu Lu Fix For: 1.0.3 The two subclasses of Task, ShuffleMapTask and ResultTask, do not correctly deserialize stageId. Therefore, accessing TaskContext.stageId always returns zero to the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reopened SPARK-4165: Assignee: Prashant Sharma This is not a duplicate of SPARK-3200. It happens that the patch provided for SPARK-3200 fixes the issue here. Actor with Companion throws ambiguous reference error in REPL - Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Affects Versions: 1.0.1, 1.1.0, 1.2.0 Reporter: Shiti Saxena Assignee: Prashant Sharma Tried the following in the master branch REPL. {noformat} Spark context available as sc. scala import akka.actor.{Actor,Props} import akka.actor.{Actor, Props} scala :pas // Entering paste mode (ctrl-D to finish) class EchoActor extends Actor{ override def receive = { case message = sender ! message } } object EchoActor { def props: Props = Props(new EchoActor()) } // Exiting paste mode, now interpreting. defined class EchoActor defined module EchoActor scala EchoActor.props console:15: error: reference to EchoActor is ambiguous; it is imported twice in the same scope by import $VAL1.EchoActor and import INSTANCE.EchoActor EchoActor.props {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-4165: --- Component/s: Spark Shell Actor with Companion throws ambiguous reference error in REPL - Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0, 1.2.0 Reporter: Shiti Saxena Assignee: Prashant Sharma Tried the following in the master branch REPL. {noformat} Spark context available as sc. scala import akka.actor.{Actor,Props} import akka.actor.{Actor, Props} scala :pas // Entering paste mode (ctrl-D to finish) class EchoActor extends Actor{ override def receive = { case message = sender ! message } } object EchoActor { def props: Props = Props(new EchoActor()) } // Exiting paste mode, now interpreting. defined class EchoActor defined module EchoActor scala EchoActor.props console:15: error: reference to EchoActor is ambiguous; it is imported twice in the same scope by import $VAL1.EchoActor and import INSTANCE.EchoActor EchoActor.props {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194290#comment-14194290 ] Prashant Sharma commented on SPARK-4165: I think this should happen when we create any class with a companion. Can you confirm [~shiti] ? Actor with Companion throws ambiguous reference error in REPL - Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0, 1.2.0 Reporter: Shiti Saxena Assignee: Prashant Sharma Tried the following in the master branch REPL. {noformat} Spark context available as sc. scala import akka.actor.{Actor,Props} import akka.actor.{Actor, Props} scala :pas // Entering paste mode (ctrl-D to finish) class EchoActor extends Actor{ override def receive = { case message = sender ! message } } object EchoActor { def props: Props = Props(new EchoActor()) } // Exiting paste mode, now interpreting. defined class EchoActor defined module EchoActor scala EchoActor.props console:15: error: reference to EchoActor is ambiguous; it is imported twice in the same scope by import $VAL1.EchoActor and import INSTANCE.EchoActor EchoActor.props {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4177) update build doc for already supporting hive 13 in jdbc/cli
[ https://issues.apache.org/jira/browse/SPARK-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4177. Resolution: Fixed Assignee: wangfei update build doc for already supporting hive 13 in jdbc/cli --- Key: SPARK-4177 URL: https://issues.apache.org/jira/browse/SPARK-4177 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei Assignee: wangfei Fix For: 1.2.0 fix build doc since already support hive 13 in jdbc/cli -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Actor with Companion throws ambiguous reference error in REPL
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-4165: --- Affects Version/s: (was: 1.2.0)

Actor with Companion throws ambiguous reference error in REPL
- Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3572) Internal API for User-Defined Types
[ https://issues.apache.org/jira/browse/SPARK-3572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3572: --- Summary: Internal API for User-Defined Types (was: Support register UserType in SQL)

Internal API for User-Defined Types
--- Key: SPARK-3572 URL: https://issues.apache.org/jira/browse/SPARK-3572 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng Assignee: Joseph K. Bradley

If a user knows how to map a class to a struct type in Spark SQL, he should be able to register this mapping through sqlContext and hence SQL can figure out the schema automatically.
{code}
trait RowSerializer[T] {
  def dataType: StructType
  def serialize(obj: T): Row
  def deserialize(row: Row): T
}

sqlContext.registerUserType[T](clazz: classOf[T], serializer: classOf[RowSerializer[T]])
{code}
In sqlContext, we can maintain a class-to-serializer map and use it for conversion. The serializer class can be embedded into the metadata, so when `select` is called, we know we want to deserialize the result.
{code}
sqlContext.registerUserType(classOf[Vector], classOf[VectorRowSerializer])

val points: RDD[LabeledPoint] = ...
val features: RDD[Vector] = points.select('features).map { case Row(v: Vector) => v }
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
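As a rough, editor-added illustration of the proposal above (the RowSerializer trait is only a sketch in this ticket, not a shipped API, and VectorRowSerializer is a name used purely for illustration), a serializer for MLlib's Vector might store the vector as a single array-of-doubles column:
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.{ArrayType, DoubleType, Row, StructField, StructType}

// Hypothetical implementation of the proposed (not yet existing) trait.
class VectorRowSerializer extends RowSerializer[Vector] {
  // One column holding the vector's values as an array of doubles.
  def dataType: StructType =
    StructType(StructField("values", ArrayType(DoubleType, false), false) :: Nil)

  def serialize(obj: Vector): Row = Row(obj.toArray.toSeq)

  def deserialize(row: Row): Vector =
    Vectors.dense(row(0).asInstanceOf[Seq[Double]].toArray)
}
{code}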
[jira] [Updated] (SPARK-4165) Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiti Saxena updated SPARK-4165: Summary: Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized (was: Using Companion Objects throws ambiguous reference error in REPL when an instance of Class isinitialized)

Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
- Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Using Companion Objects throws ambiguous reference error in REPL when an instance of Class isinitialized
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiti Saxena updated SPARK-4165: Summary: Using Companion Objects throws ambiguous reference error in REPL when an instance of Class isinitialized (was: Actor with Companion throws ambiguous reference error in REPL)

Using Companion Objects throws ambiguous reference error in REPL when an instance of Class isinitialized
Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiti Saxena updated SPARK-4165: Description: Tried the following in the master branch REPL.
{noformat}
Spark context available as sc.

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor{
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

// Exiting paste mode, now interpreting.

defined class EchoActor
defined module EchoActor

scala> EchoActor.props
<console>:15: error: reference to EchoActor is ambiguous;
it is imported twice in the same scope by
import $VAL1.EchoActor
and import INSTANCE.EchoActor
       EchoActor.props
{noformat}
The following code works:
{noformat}
case class TestA

object TestA{
  def print = { println("hello") }
}
{noformat}

was: Tried the following in the master branch REPL.
{noformat}
Spark context available as sc.

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor{
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

// Exiting paste mode, now interpreting.

defined class EchoActor
defined module EchoActor

scala> EchoActor.props
<console>:15: error: reference to EchoActor is ambiguous;
it is imported twice in the same scope by
import $VAL1.EchoActor
and import INSTANCE.EchoActor
       EchoActor.props
{noformat}

Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
- Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4165) Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
[ https://issues.apache.org/jira/browse/SPARK-4165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiti Saxena updated SPARK-4165: Description: Tried the following in the master branch REPL.
{noformat}
Spark context available as sc.

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor{
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

// Exiting paste mode, now interpreting.

defined class EchoActor
defined module EchoActor

scala> EchoActor.props
<console>:15: error: reference to EchoActor is ambiguous;
it is imported twice in the same scope by
import $VAL1.EchoActor
and import INSTANCE.EchoActor
       EchoActor.props
{noformat}
Using a companion object works when the apply method is not overloaded.
{noformat}
scala> :pas
// Entering paste mode (ctrl-D to finish)

case class TestA(w:String)

object TestA{
  def print = { println("hello") }
}

// Exiting paste mode, now interpreting.

defined class TestA
defined module TestA

scala> TestA.print
hello
{noformat}
When the apply method is overloaded, it throws an ambiguous reference error:
{noformat}
scala> :pas
// Entering paste mode (ctrl-D to finish)

case class T1(s:String,len:Int)

object T1{
  def apply(s:String):T1 = T1(s,s.size)
}

// Exiting paste mode, now interpreting.

defined class T1
defined module T1

scala> T1("abcd")
<console>:15: error: reference to T1 is ambiguous;
it is imported twice in the same scope by
import $VAL17.T1
and import INSTANCE.T1
       T1("abcd")
       ^

scala> T1("abcd",4)
<console>:15: error: reference to T1 is ambiguous;
it is imported twice in the same scope by
import $VAL18.T1
and import INSTANCE.T1
       T1("abcd",4)
{noformat}

was: Tried the following in the master branch REPL.
{noformat}
Spark context available as sc.

scala> import akka.actor.{Actor,Props}
import akka.actor.{Actor, Props}

scala> :pas
// Entering paste mode (ctrl-D to finish)

class EchoActor extends Actor{
  override def receive = {
    case message => sender ! message
  }
}

object EchoActor {
  def props: Props = Props(new EchoActor())
}

// Exiting paste mode, now interpreting.

defined class EchoActor
defined module EchoActor

scala> EchoActor.props
<console>:15: error: reference to EchoActor is ambiguous;
it is imported twice in the same scope by
import $VAL1.EchoActor
and import INSTANCE.EchoActor
       EchoActor.props
{noformat}
The following code works:
{noformat}
case class TestA

object TestA{
  def print = { println("hello") }
}
{noformat}

Using Companion Objects throws ambiguous reference error in REPL when an instance of Class is initialized
- Key: SPARK-4165 URL: https://issues.apache.org/jira/browse/SPARK-4165 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.1, 1.1.0 Reporter: Shiti Saxena Assignee: Prashant Sharma
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194311#comment-14194311 ] Apache Spark commented on SPARK-3573: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/3070

Dataset
--- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical

This JIRA is for discussion of the ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema.

Sample code: Suppose we have training events stored on HDFS and user/ad features in Hive; we want to assemble features for training and then apply a decision tree. The proposed pipeline with dataset looks like the following (needs more refinement):
{code}
sqlContext.jsonFile("/path/to/training/events", 0.01).registerTempTable("event")
val training = sqlContext.sql("""
  SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId, event.action AS label,
         user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
         ad.targetGender AS targetGender
  FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""").cache()

val indexer = new Indexer()
val interactor = new Interactor()
val fvAssembler = new FeatureVectorAssembler()
val treeClassifier = new DecisionTreeClassifier()

val paramMap = new ParamMap()
  .put(indexer.features, Map("userCountryIndex" -> "userCountry"))
  .put(indexer.sortByFrequency, true)
  .put(interactor.features, Map("genderMatch" -> Array("userGender", "targetGender")))
  .put(fvAssembler.features, Map("features" -> Array("genderMatch", "userCountryIndex", "userFeatures")))
  .put(fvAssembler.dense, true)
  .put(treeClassifier.maxDepth, 4)

// By default, the classifier recognizes the "features" and "label" columns.
val pipeline = Pipeline.create(indexer, interactor, fvAssembler, treeClassifier)
val model = pipeline.fit(training, paramMap)

sqlContext.jsonFile("/path/to/events", 0.01).registerTempTable("event")
val test = sqlContext.sql("""
  SELECT event.id AS eventId, event.userId AS userId, event.adId AS adId,
         user.gender AS userGender, user.country AS userCountry, user.features AS userFeatures,
         ad.targetGender AS targetGender
  FROM event JOIN user ON event.userId = user.id JOIN ad ON event.adId = ad.id;""")

val prediction = model.transform(test).select('eventId, 'prediction)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194315#comment-14194315 ] Sean Owen commented on SPARK-1406: -- I put some comments on the PR. Thanks for starting on this; I think PMML interoperability is indeed helpful. One big issue here is that MLlib does not at the moment have any notion of a schema. PMML does, and this is vital to actually using the model elsewhere: you have to document what the variables are so they can be matched up with the same variables in another tool. So it's not possible now to do anything but produce a model with field_1, field_2, ... This calls into question whether PMML can be meaningfully exported from MLlib at this point; maybe it will have to wait until other PRs go in that start to add a schema. I also thought it would be a little better to separate the representation of a model from the utility methods that write the model to things like files; the latter can at least be separated out of the type hierarchy. I'm also wondering how much value it adds to design for non-PMML export at this stage. (Finally, I have some code lying around here that will translate the MLlib logistic regression model to PMML. I can put that in the pot at a suitable time.)

PMML model evaluation support via MLib
-- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont Attachments: SPARK-1406.pdf, kmeans.xml

It would be useful if Spark provided support for evaluating PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with a statistical modeling tool like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of an analytical model. Other projects like JPMML-Evaluator do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
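On the point about separating the model representation from export utilities, here is a rough, editor-added sketch of what that separation could look like; PMMLExportable and PMMLModelExport are hypothetical names, not anything that exists in MLlib or in the PR under discussion.
{code}
// Hypothetical sketch only: keep "what the model looks like in PMML" separate
// from "how it gets written out".
trait PMMLExportable {
  /** Render this model as a PMML document (returned as XML text here for simplicity). */
  def toPMML(): String
}

object PMMLModelExport {
  import java.io.{File, PrintWriter}

  // File export lives outside the model type hierarchy.
  def saveToFile(model: PMMLExportable, path: String): Unit = {
    val writer = new PrintWriter(new File(path))
    try writer.write(model.toPMML()) finally writer.close()
  }
}
{code}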
[jira] [Commented] (SPARK-1704) Support EXPLAIN in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194322#comment-14194322 ] Apache Spark commented on SPARK-1704: - User 'concretevitamin' has created a pull request for this issue: https://github.com/apache/spark/pull/1003

Support EXPLAIN in Spark SQL
Key: SPARK-1704 URL: https://issues.apache.org/jira/browse/SPARK-1704 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: linux Reporter: Yangjp Assignee: Zongheng Yang Labels: sql Fix For: 1.0.1, 1.1.0 Original Estimate: 612h Remaining Estimate: 612h

{noformat}
14/05/03 22:08:40 INFO ParseDriver: Parsing command: explain select * from src
14/05/03 22:08:40 INFO ParseDriver: Parse Completed
14/05/03 22:08:40 WARN LoggingFilter: EXCEPTION :
java.lang.AssertionError: assertion failed: No plan for ExplainCommand (Project [*])
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:263)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:263)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:264)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:264)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:260)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:248)
    at org.apache.spark.sql.hive.api.java.JavaHiveContext.hql(JavaHiveContext.scala:39)
    at org.apache.spark.examples.TimeServerHandler.messageReceived(TimeServerHandler.java:72)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$TailFilter.messageReceived(DefaultIoFilterChain.java:690)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
    at org.apache.mina.filter.codec.ProtocolCodecFilter$ProtocolDecoderOutputImpl.flush(ProtocolCodecFilter.java:407)
    at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:236)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
    at org.apache.mina.filter.logging.LoggingFilter.messageReceived(LoggingFilter.java:208)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1200(DefaultIoFilterChain.java:47)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:765)
    at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:417)
    at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:410)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:710)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:664)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:653)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
    at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1124)
    at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
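For context, once EXPLAIN works it can be issued like any other statement. A minimal sketch, assuming a HiveContext instance named hiveContext and going through the same hql entry point that appears in the stack trace above:
{code}
// Prints the query plan rather than executing the query.
hiveContext.hql("EXPLAIN SELECT * FROM src").collect().foreach(println)
{code}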
[jira] [Commented] (SPARK-2691) Allow Spark on Mesos to be launched with Docker
[ https://issues.apache.org/jira/browse/SPARK-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194329#comment-14194329 ] Eduardo Jimenez commented on SPARK-2691: I was looking at this, and I came up with a patch that simply passes ContainerInfo in the appropriate places, in both coarse-grained and fine-grained mode. The only thing I can see as a bit of a kludge is how to pass the Spark configuration. So far, I've added:

spark.mesos.container.type
spark.mesos.container.docker.image

I would prefer to simply have Spark pass these through without much validation; otherwise Spark and Mesos have to be kept in sync with respect to what is supported. I could add the network setting (HOST or BRIDGE), but other settings could be a kludge to provide. I would prefer not to pass Docker parameters for the CLI, as Mesos might not use those in the future anyway. Port mappings and volumes could be useful, but how should they be provided?

spark.mesos.container.volumes.A.container_path
spark.mesos.container.volumes.A.host_path
spark.mesos.container.volumes.A.mode
spark.mesos.container.volumes.B.container_path
spark.mesos.container.volumes.B.host_path
spark.mesos.container.volumes.B.mode

or

spark.mesos.container.volumes = A:container_path:host_path:mode, B:...

Any preference? I'll try to work on this tomorrow, put some tests together, and take it for a spin.

Allow Spark on Mesos to be launched with Docker
--- Key: SPARK-2691 URL: https://issues.apache.org/jira/browse/SPARK-2691 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Assignee: Timothy Chen Labels: mesos

Currently, to launch Spark with Mesos one must upload a tarball and specify the executor URI to be passed in, which is then downloaded on each slave, or even on each execution depending on whether coarse-grained mode is used. We want Spark to support launching executors via a Docker image, building on the recent Docker and Mesos integration work. With that integration, Spark can simply specify a Docker image and whatever options are needed, and it should continue to work as-is.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
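As an illustration of the second format floated in the comment above (a sketch only: spark.mesos.container.volumes and the name:container_path:host_path:mode layout are just the proposal in that comment, not an existing Spark setting), parsing such a value could look like this:
{code}
// Parse "name:containerPath:hostPath:mode" entries from a comma-separated config value.
case class Volume(name: String, containerPath: String, hostPath: String, mode: String)

def parseVolumes(conf: String): Seq[Volume] =
  conf.split(",").map(_.trim).filter(_.nonEmpty).map { entry =>
    entry.split(":") match {
      case Array(name, containerPath, hostPath, mode) => Volume(name, containerPath, hostPath, mode)
      case _ => sys.error(s"Malformed volume spec: $entry")
    }
  }

// Example: parseVolumes("data:/srv/data:/mnt/data:RW, logs:/var/log:/mnt/logs:RO")
{code}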
[jira] [Created] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)
dongxu created SPARK-4201: - Summary: Can't use concat() on partition column in where condition (Hive compatibility problem) Key: SPARK-4201 URL: https://issues.apache.org/jira/browse/SPARK-4201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.0.0 Environment: Hive 0.12+hadoop 2.4/hadoop 2.2 +spark 1.1 Reporter: dongxu Priority: Minor

The team uses Hive for queries, and we are trying to move to Spark SQL. When I run a query like the following:

select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929';

it fails in Spark SQL, but it works well in Hive. I have to rewrite the SQL as:

select count(1) from gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29;

Here is the error log:
{noformat}
14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929']
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
   HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
 Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
  HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
    ... 16 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
 HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
    at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
    at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
    ... 20 more
Caused by: org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
{noformat}
[jira] [Updated] (SPARK-4201) Can't use concat() on partition column in where condition (Hive compatibility problem)
[ https://issues.apache.org/jira/browse/SPARK-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dongxu updated SPARK-4201: -- Description: The team uses Hive for queries, and we are trying to move to Spark SQL. When I run a query like the following:

select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929';

it fails in Spark SQL, but it works well in Hive. I have to rewrite the SQL as:

select count(1) from gulfstream_day_driver_base_2 where year = 2014 and month = 09 and day = 29;

Here is the error log:
{noformat}
14/11/03 15:05:03 ERROR SparkSQLDriver: Failed in [select count(1) from gulfstream_day_driver_base_2 where concat(year,month,day) = '20140929']
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate false, [], [SUM(PartialCount#1390L) AS c_0#1337L]
 Exchange SinglePartition
  Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
   HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360)
    at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:415)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
 Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
  HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
    ... 16 more
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Aggregate true, [], [COUNT(1) AS PartialCount#1390L]
 HiveTableScan [], (MetastoreRelation default, gulfstream_day_driver_base_2, None), Some((HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(year#1339,month#1340,day#1341) = 20140929))
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
    at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
    at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:86)
    at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
    ... 20 more
Caused by: org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:128)
    at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
{noformat}
[jira] [Created] (SPARK-4202) DSL support for Scala UDF
Cheng Lian created SPARK-4202: - Summary: DSL support for Scala UDF Key: SPARK-4202 URL: https://issues.apache.org/jira/browse/SPARK-4202 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian

Using a Scala UDF with the current DSL API is quite verbose, e.g.:
{code}
case class KeyValue(key: Int, value: String)
val schemaRDD = sc.parallelize(1 to 10).map(i => KeyValue(i, i.toString)).toSchemaRDD

def foo = (a: Int, b: String) => a.toString + b

schemaRDD.select(                     // SELECT
  Star(None),                         //   *,
  ScalaUdf(                           //
    foo,                              //   foo(
    StringType,                       //
    'key.attr :: 'value.attr :: Nil)  //     key, value
).collect()                           // ) FROM ...
{code}
It would be good to add a DSL syntax to simplify UDF invocation.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
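For comparison, a DSL along the lines the ticket asks for might let the same query read roughly like the snippet below. This is purely an illustrative, editor-added sketch: the udf wrapper and the column-style invocation are hypothetical names, and nothing here describes the API that any actual pull request adds.
{code}
// Hypothetical DSL: wrap the Scala function once, then apply it like a column expression.
def foo = (a: Int, b: String) => a.toString + b

val fooUdf = udf(foo)                               // hypothetical helper
schemaRDD.select(Star(None), fooUdf('key, 'value))  // SELECT *, foo(key, value) FROM ...
  .collect()
{code}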
[jira] [Commented] (SPARK-4202) DSL support for Scala UDF
[ https://issues.apache.org/jira/browse/SPARK-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194341#comment-14194341 ] Apache Spark commented on SPARK-4202: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3067

DSL support for Scala UDF
- Key: SPARK-4202 URL: https://issues.apache.org/jira/browse/SPARK-4202 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org