[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2015-10-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975844#comment-14975844
 ] 

Sean Owen commented on SPARK-11327:
---

I may be being dense, but what is spark-dispatcher? Is this about code in Spark?

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>Reporter: Alan Braithwaite
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image...
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar;,
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar;,
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> {code}
> driver log:
> {code}
> 15/10/26 22:08:08 INFO SparkContext: Running Spark 

[jira] [Commented] (SPARK-11342) Allow to set hadoop profile when running dev/run_tests

2015-10-27 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975906#comment-14975906
 ] 

Jeff Zhang commented on SPARK-11342:


Yes, it would be ideal to allow setting any available profile for Spark. But 
considering hadoop is the most important profile, it would be nice to be able 
to customize it first. I notice the following hadoop profiles in 
run_test.py. They can be set in the Jenkins environment, so I think it should 
also be possible to do that in a local environment. 
{code}
sbt_maven_hadoop_profiles = {
"hadoop1.0": ["-Phadoop-1", "-Dhadoop.version=1.2.1"],
"hadoop2.0": ["-Phadoop-1", "-Dhadoop.version=2.0.0-mr1-cdh4.1.1"],
"hadoop2.2": ["-Pyarn", "-Phadoop-2.2"],
"hadoop2.3": ["-Pyarn", "-Phadoop-2.3", "-Dhadoop.version=2.3.0"],
"hadoop2.6": ["-Pyarn", "-Phadoop-2.6"],
}
{code}
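For reference, picking a profile is then just a lookup into that mapping; a minimal sketch, assuming the dict above is in scope (the real script's lookup and error handling may differ):
{code}
# Sketch: map a chosen profile key to the concrete sbt/maven build flags.
hadoop_version = "hadoop2.6"
build_flags = sbt_maven_hadoop_profiles[hadoop_version]  # ["-Pyarn", "-Phadoop-2.6"]
{code}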

> Allow to set hadoop profile when running dev/run_tests
> --
>
> Key: SPARK-11342
> URL: https://issues.apache.org/jira/browse/SPARK-11342
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Jeff Zhang
>Priority: Minor
>
> Usually I assemble Spark with Hadoop 2.6.0. But when I run dev/run_tests, it 
> uses hadoop-2.3. And when I run bin/spark-shell the next time, it complains 
> that there are multiple Spark assembly jars. It would be nice if I could 
> specify the hadoop profile when running dev/run_tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

2015-10-27 Thread Stephen De Gennaro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976017#comment-14976017
 ] 

Stephen De Gennaro commented on SPARK-10947:


Hi Yin, here is my Jira profile as you requested. Just a quick clarification for 
anyone looking at this ticket: nulls are still treated as NullType and not 
StringType.

> With schema inference from JSON into a Dataframe, add option to infer all 
> primitive object types as strings
> ---
>
> Key: SPARK-10947
> URL: https://issues.apache.org/jira/browse/SPARK-10947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ewan Leith
>Priority: Minor
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> Currently, when a schema is inferred from a JSON file using 
> sqlContext.read.json, the primitive object types are inferred as string, 
> long, boolean, etc.
> However, if the inferred type is too specific (JSON obviously does not 
> enforce types itself), this causes issues with merging dataframe schemas.
> Instead, we would like an option in the JSON inferField function to treat all 
> primitive objects as strings.
> We'll create and submit a pull request for this for review.
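A minimal pyspark sketch of the intended usage, assuming the option is exposed on the JSON reader as {{primitivesAsString}} (the option name here is an assumption; see the pull request for the final API):
{code}
# Sketch only: read JSON while treating every primitive value as a string.
# As noted in the comment above, nulls are still inferred as NullType.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="json-primitives-as-strings")
sqlContext = SQLContext(sc)

df = (sqlContext.read
      .option("primitivesAsString", "true")               # assumed option name
      .json("examples/src/main/resources/people.json"))   # any JSON path works
df.printSchema()  # numeric and boolean fields now show up as string
{code}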



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5569) Checkpoints cannot reference classes defined outside of Spark's assembly

2015-10-27 Thread Deming Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976023#comment-14976023
 ] 

Deming Zhu commented on SPARK-5569:
---

Since the patch has been merged, we may set the status to Fixed.

> Checkpoints cannot reference classes defined outside of Spark's assembly
> 
>
> Key: SPARK-5569
> URL: https://issues.apache.org/jira/browse/SPARK-5569
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Patrick Wendell
>
> Not sure if this is a bug or a feature, but it's not obvious, so wanted to 
> create a JIRA to make sure we document this behavior.
> First documented by Cody Koeninger:
> https://gist.github.com/koeninger/561a61482cd1b5b3600c
> {code}
> 15/01/12 16:07:07 INFO CheckpointReader: Attempting to load checkpoint from 
> file file:/var/tmp/cp/checkpoint-142110041.bk
> 15/01/12 16:07:07 WARN CheckpointReader: Error reading checkpoint from file 
> file:/var/tmp/cp/checkpoint-142110041.bk
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.rdd.kafka.KafkaRDDPartition
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1043)
> at 
> org.apache.spark.streaming.dstream.DStreamCheckpointData.readObject(DStreamCheckpointData.scala:146)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$readObject$1.apply$mcV$sp(DStreamGraph.scala:180)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1040)
> at 
> org.apache.spark.streaming.DStreamGraph.readObject(DStreamGraph.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoint.scala:251)
> at 
> org.apache.spark.streaming.CheckpointReader$$anonfun$read$2.apply(Checkpoint.scala:239)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at 
> scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
> at 
> 

[jira] [Updated] (SPARK-11343) Regression Imposes doubles on prediction/label columns

2015-10-27 Thread Dominik Dahlem (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Dahlem updated SPARK-11343:
---
Affects Version/s: 1.5.1
  Environment: all environments
  Description: 
Using pyspark.ml and DataFrames, the ALS recommender cannot be evaluated using 
the RegressionEvaluator because of a type mismatch between the model 
transformation and the evaluation APIs. One can work around this by casting the 
prediction column to double before passing it into the evaluator. However, 
this does not work with pipelines and cross validation.
Code and traceback below:

{code}
als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
itemCol='movieID', ratingCol='rating')
model = als.fit(training)
predictions = model.transform(validation)
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='rating')
validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 'rmse'})
{code}

Traceback:
validationRmse = evaluator.evaluate(predictions,
{evaluator.metricName: 'rmse'}
)
File 
"/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
 line 63, in evaluate
File 
"/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
 line 94, in _evaluate
File 
"/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
 line 813, in __call__
File 
"/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py", 
line 42, in deco
raise IllegalArgumentException(s.split(': ', 1)[1])
pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
prediction must be of type DoubleType but was actually FloatType.

  Component/s: ML
  Summary: Regression Imposes doubles on prediction/label columns  
(was: Regression Imposes doubles on prediciton)

> Regression Imposes doubles on prediction/label columns
> --
>
> Key: SPARK-11343
> URL: https://issues.apache.org/jira/browse/SPARK-11343
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
> Environment: all environments
>Reporter: Dominik Dahlem
>
> Using pyspark.ml and DataFrames, the ALS recommender cannot be evaluated 
> using the RegressionEvaluator because of a type mismatch between the model 
> transformation and the evaluation APIs. One can work around this by casting 
> the prediction column to double before passing it into the evaluator. 
> However, this does not work with pipelines and cross validation.
> Code and traceback below:
> {code}
> als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
> itemCol='movieID', ratingCol='rating')
> model = als.fit(training)
> predictions = model.transform(validation)
> evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='rating')
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
> 'rmse'})
> {code}
> Traceback:
> validationRmse = evaluator.evaluate(predictions,
> {evaluator.metricName: 'rmse'}
> )
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 63, in evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 94, in _evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
> File 
> "/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py",
>  line 42, in deco
> raise IllegalArgumentException(s.split(': ', 1)[1])
> pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
> prediction must be of type DoubleType but was actually FloatType.
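The workaround mentioned in the description (casting the prediction column to double before evaluation) looks roughly like this in pyspark; a sketch only, reusing the names from the snippet above:
{code}
# Sketch of the described workaround: cast the FloatType prediction column
# to double before handing the DataFrame to RegressionEvaluator.
from pyspark.sql.functions import col

predictions = predictions.withColumn('prediction', col('prediction').cast('double'))
validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 'rmse'})
{code}
As the description notes, this manual cast does not fit into pipelines and cross validation.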



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11342) Allow to set hadoop profile when running dev/run_tests

2015-10-27 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975888#comment-14975888
 ] 

Jeff Zhang commented on SPARK-11342:


[~sowen] Isn't it also for local testing?
{code}
if os.environ.get("AMPLAB_JENKINS"):
# if we're on the Amplab Jenkins build servers setup variables
# to reflect the environment settings
build_tool = os.environ.get("AMPLAB_JENKINS_BUILD_TOOL", "sbt")
hadoop_version = os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE", 
"hadoop2.3")
test_env = "amplab_jenkins"
# add path for Python3 in Jenkins if we're calling from a Jenkins 
machine
os.environ["PATH"] = "/home/anaconda/envs/py3k/bin:" + 
os.environ.get("PATH")
else:
# else we're running locally and can use local settings
build_tool = "sbt"
hadoop_version = "hadoop2.3"
test_env = "local"
{code}
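For illustration only (not the actual change in the pull request): one way to let local runs also pick the profile from the environment would be to read a variable in the local branch too, e.g. a hypothetical SPARK_HADOOP_PROFILE:
{code}
import os

# Hypothetical sketch -- not the actual patch from the pull request.
# Let local runs choose the Hadoop profile via an environment variable
# instead of the hard-coded "hadoop2.3" default.
build_tool = "sbt"
hadoop_version = os.environ.get("SPARK_HADOOP_PROFILE", "hadoop2.3")
test_env = "local"
{code}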

> Allow to set hadoop profile when running dev/run_tests
> --
>
> Key: SPARK-11342
> URL: https://issues.apache.org/jira/browse/SPARK-11342
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Jeff Zhang
>Priority: Minor
>
> Usually I assemble Spark with Hadoop 2.6.0. But when I run dev/run_tests, it 
> uses hadoop-2.3. And when I run bin/spark-shell the next time, it complains 
> that there are multiple Spark assembly jars. It would be nice if I could 
> specify the hadoop profile when running dev/run_tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11302:
--
Summary:  Multivariate Gaussian Model with Covariance  matrix returns 
incorrect answer in some cases   (was:  Multivariate Gaussian Model with 
Covariance  matrix return zero always )

>  Multivariate Gaussian Model with Covariance  matrix returns incorrect answer 
> in some cases 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an anomaly detection model using Spark MLlib. 
> As input, I feed the model a mean vector and a covariance matrix, assuming 
> my features are correlated (non-zero covariance).
> Here are my inputs for the model; the model returns zero for each data 
> point with this input.
> MU vector:
> 1054.8, 1069.8, 1.3, 1040.1
> Covariance matrix:
> 165496.0, 167996.0,  11.0, 163037.0
> 167996.0, 170631.0,  19.0, 165405.0
>     11.0,     19.0,   0.0,      2.0
> 163037.0, 165405.0,   2.0, 160707.0
> Conversely, for the non-covariance case, represented by this matrix, the 
> model works and returns results as expected:
> 165496.0,      0.0,   0.0,      0.0
>      0.0, 170631.0,   0.0,      0.0
>      0.0,      0.0,   0.8,      0.0
>      0.0,      0.0,   0.0, 160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973993#comment-14973993
 ] 

Yanbo Liang edited comment on SPARK-11303 at 10/27/15 8:07 AM:
---

When sampling and then filtering a DataFrame, the SQL optimizer will push the 
filter down into the sample and produce a wrong result. This is because the 
sampler is evaluated on the original rows rather than on the rows remaining 
after the filter. I think we should not allow the optimizer to do this 
optimization. [~marmbrus] [~rxin]


was (Author: yanboliang):
When sampling and then filtering a DataFrame, the SQL optimizer will push the 
filter down into the sample and produce a wrong result. This is because the 
sampler is evaluated on the original rows rather than on the rows remaining after the filter.

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> output:
> 14
> 7
> 8
> 14
> 7
> 7
> Thanks!
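One way to see the behaviour described in the comment above is to inspect the physical plan of the filtered sample; a pyspark sketch reusing the repro's DataFrame and assuming a running pyspark shell with sc and sqlContext available (whether the Filter appears below the Sample operator depends on the optimizer in your build):
{code}
# Sketch: check whether the Filter gets pushed below the Sample operator.
d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50), ['t'])
d_sampled = d.sample(False, 0.1, 1)
d_sampled.filter('t = 1').explain()  # look at where Filter sits relative to Sample
{code}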



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11338) HistoryPage not multi-tenancy enabled (app links not prefixed with APPLICATION_WEB_PROXY_BASE)

2015-10-27 Thread Christian Kadner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Kadner updated SPARK-11338:
-
Description: 
Links on {{HistoryPage}} are not prepended with {{uiRoot}} ({{export 
APPLICATION_WEB_PROXY_BASE=}}). This makes it 
impossible/impractical to expose the *History Server* in a multi-tenancy 
environment where each Spark service instance has one history server behind a 
multi-tenant enabled proxy server.  All other Spark web UI pages are correctly 
prefixed when the {{APPLICATION_WEB_PROXY_BASE}} environment variable is set.

*Repro steps:*\\
# Configure history log collection:
{code:title=conf/spark-defaults.conf|borderStyle=solid}
spark.eventLog.enabled true
spark.eventLog.dir logs/history
spark.history.fs.logDirectory  logs/history
{code}
...create the logs folders:
{code}
$ mkdir -p logs/history
{code}
# Start the Spark shell and run the word count example:
{code:java|borderStyle=solid}
$ bin/spark-shell
...
scala> sc.textFile("README.md").flatMap(_.split(" ")).map(w => (w, 
1)).reduceByKey(_ + _).collect
scala> sc.stop
{code}
# Set the web proxy root path path (i.e. {{/testwebuiproxy/..}}):
{code}
$ export APPLICATION_WEB_PROXY_BASE=/testwebuiproxy/..
{code}
# Start the history server:
{code}
$  sbin/start-history-server.sh
{code}
# Bring up the History Server web UI at {{localhost:18080}} and view the 
application link in the HTML source text:
{code:xml|borderColor=#c00}
...
<th>App ID</th><th>App Name</th>...
<tr>
  <td><a href="/history/local-1445896187531">local-1445896187531</a></td>
  <td>Spark shell</td>
  ...
{code}
*Notice*, application link "{{/history/local-1445896187531}}" does _not_ have 
the prefix {{/testwebuiproxy/..}} \\ \\
All site-relative links (URL starting with {{"/"}}) should have been prepended 
with the uiRoot prefix {{/testwebuiproxy/..}} like this ...
{code:xml|borderColor=#0c0}
...
<th>App ID</th><th>App Name</th>...
<tr>
  <td><a href="/testwebuiproxy/../history/local-1445896187531">local-1445896187531</a></td>
  <td>Spark shell</td>
  ...
{code}
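For illustration, the expected rewriting of site-relative links is just a prefix on hrefs that start with "/"; a small sketch (not Spark code):
{code}
# Sketch only: prepend the proxy base (uiRoot) to site-relative links.
def prepend_ui_root(href, ui_root="/testwebuiproxy/.."):
    # Site-relative links (starting with "/") get the proxy base prepended;
    # other URLs are left untouched.
    return ui_root + href if href.startswith("/") else href

print(prepend_ui_root("/history/local-1445896187531"))
# -> /testwebuiproxy/../history/local-1445896187531
{code}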

  was:
Links on {{HistoryPage}} are not prepended with {{uiRoot}} ({{export 
APPLICATION_WEB_PROXY_BASE=}}). This makes it 
impossible/impractical to expose the *History Server* in a multi-tenancy 
environment where each Spark service instance has one history server behind a 
multi-tenant enabled proxy server.  All other Spark web UI pages are correctly 
prefixed when the {{APPLICATION_WEB_PROXY_BASE}} variable is set.

*Repro steps:*\\
# Configure history log collection:
{code:title=conf/spark-defaults.conf|borderStyle=solid}
spark.eventLog.enabled true
spark.eventLog.dir logs/history
spark.history.fs.logDirectory  logs/history
{code}
...create the logs folders:
{code}
$ mkdir -p logs/history
{code}
# Start the Spark shell and run the word count example:
{code:java|borderStyle=solid}
$ bin/spark-shell
...
scala> sc.textFile("README.md").flatMap(_.split(" ")).map(w => (w, 
1)).reduceByKey(_ + _).collect
scala> sc.stop
{code}
# Set the web proxy root path path (i.e. {{/testwebuiproxy/..}}):
{code}
$ export APPLICATION_WEB_PROXY_BASE=/testwebuiproxy/..
{code}
# Start the history server:
{code}
$  sbin/start-history-server.sh
{code}
# Bring up the History Server web UI at {{localhost:18080}} and view the 
application link in the HTML source text:
{code:xml|borderColor=#c00}
...
<th>App ID</th><th>App Name</th>...
<tr>
  <td><a href="/history/local-1445896187531">local-1445896187531</a></td>
  <td>Spark shell</td>
  ...
{code}
*Notice*, application link "{{/history/local-1445896187531}}" does _not_ have 
the prefix {{/testwebuiproxy/..}} \\ \\
All site-relative links (URL starting with {{"/"}}) should have been prepended 
with the uiRoot prefix {{/testwebuiproxy/..}} like this ...
{code:xml|borderColor=#0c0}
...
<th>App ID</th><th>App Name</th>...
<tr>
  <td><a href="/testwebuiproxy/../history/local-1445896187531">local-1445896187531</a></td>
  <td>Spark shell</td>
  ...
{code}


> HistoryPage not multi-tenancy enabled (app links not prefixed with 
> APPLICATION_WEB_PROXY_BASE)
> --
>
> Key: SPARK-11338
> URL: https://issues.apache.org/jira/browse/SPARK-11338
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Christian Kadner
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Links on {{HistoryPage}} are not prepended with {{uiRoot}} ({{export 
> APPLICATION_WEB_PROXY_BASE=}}). This makes it 
> impossible/impractical to expose the *History Server* in a multi-tenancy 
> environment where each Spark service instance has one history server behind a 
> multi-tenant enabled proxy server.  All other Spark web UI pages are 
> correctly prefixed when the {{APPLICATION_WEB_PROXY_BASE}} environment 
> variable is set.
> *Repro steps:*\\
> # Configure history log collection:
> {code:title=conf/spark-defaults.conf|borderStyle=solid}
> spark.eventLog.enabled true
> spark.eventLog.dir logs/history
> 

[jira] [Resolved] (SPARK-11297) code example generated by include_example is not exactly the same with {% highlight %}

2015-10-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11297.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9265
[https://github.com/apache/spark/pull/9265]

> code example generated by include_example is not exactly the same with {% 
> highlight %}
> --
>
> Key: SPARK-11297
> URL: https://issues.apache.org/jira/browse/SPARK-11297
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xusen Yin
>Assignee: Xusen Yin
> Fix For: 1.6.0
>
>
> Code examples generated by include_example are a little different from the 
> previous {% highlight %} results, which causes a bigger font size for code 
> examples. We need to substitute "" with "", and add new code 
> tags to make it look the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11342) Allow to set hadoop profile when running dev/run_tests

2015-10-27 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11342:
--

 Summary: Allow to set hadoop profile when running dev/run_tests
 Key: SPARK-11342
 URL: https://issues.apache.org/jira/browse/SPARK-11342
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Reporter: Jeff Zhang
Priority: Minor


Usually I assemble Spark with Hadoop 2.6.0. But when I run dev/run_tests, it 
uses hadoop-2.3. And when I run bin/spark-shell the next time, it complains 
that there are multiple Spark assembly jars. It would be nice if I could 
specify the hadoop profile when running dev/run_tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11342) Allow to set hadoop profile when running dev/run_tests

2015-10-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975884#comment-14975884
 ] 

Sean Owen commented on SPARK-11342:
---

That's a tool for running tests on Jenkins, not really for developers to use. 
See 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests
 for how to run tests.

> Allow to set hadoop profile when running dev/run_tests
> --
>
> Key: SPARK-11342
> URL: https://issues.apache.org/jira/browse/SPARK-11342
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Jeff Zhang
>Priority: Minor
>
> Usually I assemble Spark with Hadoop 2.6.0. But when I run dev/run_tests, it 
> uses hadoop-2.3. And when I run bin/spark-shell the next time, it complains 
> that there are multiple Spark assembly jars. It would be nice if I could 
> specify the hadoop profile when running dev/run_tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11342) Allow to set hadoop profile when running dev/run_tests

2015-10-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975893#comment-14975893
 ] 

Sean Owen commented on SPARK-11342:
---

Hm... maybe so. If so, then I think this is more for running a default test 
profile. Why not build/test directly using Maven/SBT in the case where you need 
to test a specific set of flags? That is, this isn't the end of this 
requirement: what if you need to set a particular Hive version? In the end I 
think you want to run manually anyway.

> Allow to set hadoop profile when running dev/run_tests
> --
>
> Key: SPARK-11342
> URL: https://issues.apache.org/jira/browse/SPARK-11342
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Jeff Zhang
>Priority: Minor
>
> Usually I assemble Spark with Hadoop 2.6.0. But when I run dev/run_tests, it 
> uses hadoop-2.3. And when I run bin/spark-shell the next time, it complains 
> that there are multiple Spark assembly jars. It would be nice if I could 
> specify the hadoop profile when running dev/run_tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11276.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9244
[https://github.com/apache/spark/pull/9244]

> SizeEstimator prevents class unloading
> --
>
> Key: SPARK-11276
> URL: https://issues.apache.org/jira/browse/SPARK-11276
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.5.1
>Reporter: Sem Mulder
> Fix For: 1.6.0
>
>
> The SizeEstimator keeps a cache of ClassInfos, but this cache uses Class 
> objects as keys, which results in strong references to the Class objects. If 
> these classes are dynamically created, this prevents the corresponding 
> ClassLoader from being GCed, leading to PermGen exhaustion.
> An easy fix would be to use a WeakRef for the keys. A proposed fix can be 
> found here:
> [https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
> We are currently running this in production and it seems to resolve the issue.
> I will prepare a pull request ASAP.
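A Python analogy of the proposed fix (the real change is in Scala): key the per-class cache with weak references so cached entries do not keep classes, and hence their loader, strongly reachable. Sketch only, not the actual patch:
{code}
import weakref

# Weak keys: once a dynamically created class becomes otherwise unreachable,
# its cache entry goes away instead of pinning the class in memory.
class_info_cache = weakref.WeakKeyDictionary()

def get_class_info(cls):
    info = class_info_cache.get(cls)
    if info is None:
        info = {"name": cls.__name__}  # stand-in for the real ClassInfo
        class_info_cache[cls] = info
    return info
{code}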



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11276) SizeEstimator prevents class unloading

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11276:
--
Assignee: Sem Mulder

> SizeEstimator prevents class unloading
> --
>
> Key: SPARK-11276
> URL: https://issues.apache.org/jira/browse/SPARK-11276
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.5.1
>Reporter: Sem Mulder
>Assignee: Sem Mulder
> Fix For: 1.6.0
>
>
> The SizeEstimator keeps a cache of ClassInfos, but this cache uses Class 
> objects as keys, which results in strong references to the Class objects. If 
> these classes are dynamically created, this prevents the corresponding 
> ClassLoader from being GCed, leading to PermGen exhaustion.
> An easy fix would be to use a WeakRef for the keys. A proposed fix can be 
> found here:
> [https://github.com/Site2Mobile/spark/commit/21c572cbda5607d0c7c6643bfaf43e53c8aa6f8c]
> We are currently running this in production and it seems to resolve the issue.
> I will prepare a pull request ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11342) Allow to set hadoop profile when running dev/run_tests

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975974#comment-14975974
 ] 

Apache Spark commented on SPARK-11342:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9295

> Allow to set hadoop profile when running dev/run_tests
> --
>
> Key: SPARK-11342
> URL: https://issues.apache.org/jira/browse/SPARK-11342
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Jeff Zhang
>Priority: Minor
>
> Usually I assemble Spark with Hadoop 2.6.0. But when I run dev/run_tests, it 
> uses hadoop-2.3. And when I run bin/spark-shell the next time, it complains 
> that there are multiple Spark assembly jars. It would be nice if I could 
> specify the hadoop profile when running dev/run_tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11342) Allow to set hadoop profile when running dev/run_tests

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11342:


Assignee: (was: Apache Spark)

> Allow to set hadoop profile when running dev/run_tests
> --
>
> Key: SPARK-11342
> URL: https://issues.apache.org/jira/browse/SPARK-11342
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Jeff Zhang
>Priority: Minor
>
> Usually I assemble Spark with Hadoop 2.6.0. But when I run dev/run_tests, it 
> uses hadoop-2.3. And when I run bin/spark-shell the next time, it complains 
> that there are multiple Spark assembly jars. It would be nice if I could 
> specify the hadoop profile when running dev/run_tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11342) Allow to set hadoop profile when running dev/run_tests

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11342:


Assignee: Apache Spark

> Allow to set hadoop profile when running dev/run_tests
> --
>
> Key: SPARK-11342
> URL: https://issues.apache.org/jira/browse/SPARK-11342
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> Usually I assemble Spark with Hadoop 2.6.0. But when I run dev/run_tests, it 
> uses hadoop-2.3. And when I run bin/spark-shell the next time, it complains 
> that there are multiple Spark assembly jars. It would be nice if I could 
> specify the hadoop profile when running dev/run_tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-27 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975993#comment-14975993
 ] 

hujiayin commented on SPARK-11200:
--

sparkscore found that it has happened since commit cf2e0ae7, and it was resolved today.

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: hujiayin
>
> The endless messages "cannot send ${message}because RpcEnv is closed" are pop 
> up after start any of workloads in MLlib until a manual stop from person. The 
> environment is hadoop-cdh-5.3.2 Spark master version run in yarn client mode. 
> The error is from NettyRpcEnv.scala. I don't have enough time to look into 
> this issue at this time, but I can verify issue in environment with you if 
> you have fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11298) When driver sends message "GetExecutorLossReason" to AM, the AM stops.

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11298:
--
Target Version/s:   (was: 2+)
   Fix Version/s: (was: 1.6.0)
  (was: 2+)

[~KaiXinXIaoLei] There are a number of problems with the JIRA you created here: 
don't set Target or Fix version; there's no reason this is targeted at "2+" 
anyway. Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
making more JIRAs. This is also a duplicate, I believe

> When driver sends message "GetExecutorLossReason" to AM, the AM stops.
> --
>
> Key: SPARK-11298
> URL: https://issues.apache.org/jira/browse/SPARK-11298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: KaiXinXIaoLei
> Attachments: driver.log
>
>
> I got the latest code from GitHub, and just ran "bin/spark-shell --master yarn 
> --conf spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=1 --conf 
> spark.shuffle.service.enabled=true". There is this error info:
> 15/10/25 12:11:02 ERROR TransportChannelHandler: Connection to 
> /9.96.1.113:35066 has been quiet for 12 ms while there are outstanding 
> requests. Assuming connection is dead; please adjust spark.network.timeout if 
> this is wrong.
> 15/10/25 12:11:02 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from vm113/9.96.1.113:35066 is closed
> 15/10/25 12:11:02 WARN NettyRpcEndpointRef: Ignore message 
> Failure(java.io.IOException: Connection from vm113/9.96.1.113:35066 closed)
> 15/10/25 12:11:02 ERROR YarnScheduler: Lost executor 1 on vm111: Slave lost
> From the log, the error appears when the driver sends the message 
> "GetExecutorLossReason" to the AM. From the code, I think the AM gets this 
> message and should reply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11298) When driver sends message "GetExecutorLossReason" to AM, the AM stops.

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11298.
---
Resolution: Duplicate

> When driver sends message "GetExecutorLossReason" to AM, the AM stops.
> --
>
> Key: SPARK-11298
> URL: https://issues.apache.org/jira/browse/SPARK-11298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: KaiXinXIaoLei
> Attachments: driver.log
>
>
> I got the latest code from GitHub, and just ran "bin/spark-shell --master yarn 
> --conf spark.dynamicAllocation.enabled=true --conf 
> spark.dynamicAllocation.initialExecutors=1 --conf 
> spark.shuffle.service.enabled=true". There is this error info:
> 15/10/25 12:11:02 ERROR TransportChannelHandler: Connection to 
> /9.96.1.113:35066 has been quiet for 12 ms while there are outstanding 
> requests. Assuming connection is dead; please adjust spark.network.timeout if 
> this is wrong.
> 15/10/25 12:11:02 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from vm113/9.96.1.113:35066 is closed
> 15/10/25 12:11:02 WARN NettyRpcEndpointRef: Ignore message 
> Failure(java.io.IOException: Connection from vm113/9.96.1.113:35066 closed)
> 15/10/25 12:11:02 ERROR YarnScheduler: Lost executor 1 on vm111: Slave lost
> From the log, the error appears when the driver sends the message 
> "GetExecutorLossReason" to the AM. From the code, I think the AM gets this 
> message and should reply.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11302:


Assignee: (was: Apache Spark)

>  Multivariate Gaussian Model with Covariance  matrix returns incorrect answer 
> in some cases 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an anomaly detection model using Spark MLlib. 
> As input, I feed the model a mean vector and a covariance matrix, assuming 
> my features are correlated (non-zero covariance).
> Here are my inputs for the model; the model returns zero for each data 
> point with this input.
> MU vector:
> 1054.8, 1069.8, 1.3, 1040.1
> Covariance matrix:
> 165496.0, 167996.0,  11.0, 163037.0
> 167996.0, 170631.0,  19.0, 165405.0
>     11.0,     19.0,   0.0,      2.0
> 163037.0, 165405.0,   2.0, 160707.0
> Conversely, for the non-covariance case, represented by this matrix, the 
> model works and returns results as expected:
> 165496.0,      0.0,   0.0,      0.0
>      0.0, 170631.0,   0.0,      0.0
>      0.0,      0.0,   0.8,      0.0
>      0.0,      0.0,   0.0, 160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11302) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975902#comment-14975902
 ] 

Apache Spark commented on SPARK-11302:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9293

>  Multivariate Gaussian Model with Covariance  matrix returns incorrect answer 
> in some cases 
> 
>
> Key: SPARK-11302
> URL: https://issues.apache.org/jira/browse/SPARK-11302
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: eyal sharon
>Priority: Minor
>
> I have been trying to apply an anomaly detection model using Spark MLlib. 
> As input, I feed the model a mean vector and a covariance matrix, assuming 
> my features are correlated (non-zero covariance).
> Here are my inputs for the model; the model returns zero for each data 
> point with this input.
> MU vector:
> 1054.8, 1069.8, 1.3, 1040.1
> Covariance matrix:
> 165496.0, 167996.0,  11.0, 163037.0
> 167996.0, 170631.0,  19.0, 165405.0
>     11.0,     19.0,   0.0,      2.0
> 163037.0, 165405.0,   2.0, 160707.0
> Conversely, for the non-covariance case, represented by this matrix, the 
> model works and returns results as expected:
> 165496.0,      0.0,   0.0,      0.0
>      0.0, 170631.0,   0.0,      0.0
>      0.0,      0.0,   0.8,      0.0
>      0.0,      0.0,   0.0, 160594.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975918#comment-14975918
 ] 

Apache Spark commented on SPARK-11303:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9294

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> output:
> 14
> 7
> 8
> 14
> 7
> 7
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11303:


Assignee: (was: Apache Spark)

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> output:
> 14
> 7
> 8
> 14
> 7
> 7
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973993#comment-14973993
 ] 

Yanbo Liang edited comment on SPARK-11303 at 10/27/15 8:06 AM:
---

When sampling and then filtering a DataFrame, the SQL optimizer will push the 
filter down into the sample and produce a wrong result. This is because the 
sampler is evaluated on the original rows rather than on the rows remaining after the filter.


was (Author: yanboliang):
It looks like this bug is caused by a mutable-row-copy problem similar to 
SPARK-4963. But adding *copy* to *sample* still does not resolve this 
issue. I found that *map(_copy())* was removed by 
https://github.com/apache/spark/pull/8040/files. [~rxin] Could you tell us the 
motivation for removing *map(_copy())* for withReplacement = false in that PR?

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> output:
> 14
> 7
> 8
> 14
> 7
> 7
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11343) Regression Imposes doubles on prediciton

2015-10-27 Thread Dominik Dahlem (JIRA)
Dominik Dahlem created SPARK-11343:
--

 Summary: Regression Imposes doubles on prediciton
 Key: SPARK-11343
 URL: https://issues.apache.org/jira/browse/SPARK-11343
 Project: Spark
  Issue Type: Bug
Reporter: Dominik Dahlem






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11343) Regression Imposes doubles on prediction/label columns

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11343:


Assignee: (was: Apache Spark)

> Regression Imposes doubles on prediction/label columns
> --
>
> Key: SPARK-11343
> URL: https://issues.apache.org/jira/browse/SPARK-11343
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
> Environment: all environments
>Reporter: Dominik Dahlem
>
> Using pyspark.ml and DataFrames, the ALS recommender cannot be evaluated 
> using the RegressionEvaluator because of a type mismatch between the model 
> transformation and the evaluation APIs. One can work around this by casting 
> the prediction column to double before passing it into the evaluator. 
> However, this does not work with pipelines and cross validation.
> Code and traceback below:
> {code}
> als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
> itemCol='movieID', ratingCol='rating')
> model = als.fit(training)
> predictions = model.transform(validation)
> evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='rating')
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
> 'rmse'})
> {code}
> Traceback:
> validationRmse = evaluator.evaluate(predictions,
> {evaluator.metricName: 'rmse'}
> )
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 63, in evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 94, in _evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in _call_
> File 
> "/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py",
>  line 42, in deco
> raise IllegalArgumentException(s.split(': ', 1)[1])
> pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
> prediction must be of type DoubleType but was actually FloatType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11343) Regression Imposes doubles on prediction/label columns

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11343:


Assignee: Apache Spark

> Regression Imposes doubles on prediction/label columns
> --
>
> Key: SPARK-11343
> URL: https://issues.apache.org/jira/browse/SPARK-11343
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
> Environment: all environments
>Reporter: Dominik Dahlem
>Assignee: Apache Spark
>
> Using pyspark.ml and DataFrames, the ALS recommender cannot be evaluated 
> using the RegressionEvaluator because of a type mismatch between the model 
> transformation and the evaluation APIs. One can work around this by casting 
> the prediction column to double before passing it into the evaluator. 
> However, this does not work with pipelines and cross validation.
> Code and traceback below:
> {code}
> als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
> itemCol='movieID', ratingCol='rating')
> model = als.fit(training)
> predictions = model.transform(validation)
> evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='rating')
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
> 'rmse'})
> {code}
> Traceback:
> validationRmse = evaluator.evaluate(predictions,
> {evaluator.metricName: 'rmse'}
> )
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 63, in evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 94, in _evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
> File 
> "/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py",
>  line 42, in deco
> raise IllegalArgumentException(s.split(': ', 1)[1])
> pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
> prediction must be of type DoubleType but was actually FloatType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11343) Regression Imposes doubles on prediction/label columns

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976053#comment-14976053
 ] 

Apache Spark commented on SPARK-11343:
--

User 'dahlem' has created a pull request for this issue:
https://github.com/apache/spark/pull/9296

> Regression Imposes doubles on prediction/label columns
> --
>
> Key: SPARK-11343
> URL: https://issues.apache.org/jira/browse/SPARK-11343
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
> Environment: all environments
>Reporter: Dominik Dahlem
>
> Using pyspark.ml and DataFrames, the ALS recommender cannot be evaluated 
> with the RegressionEvaluator because of a type mismatch between the model 
> transformation and evaluation APIs. One can work around this by casting 
> the prediction column to double before passing it to the evaluator. 
> However, this does not work with pipelines and cross-validation.
> Code and traceback below:
> {code}
> als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
> itemCol='movieID', ratingCol='rating')
> model = als.fit(training)
> predictions = model.transform(validation)
> evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='rating')
> validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
> 'rmse'})
> {code}
> Traceback:
> validationRmse = evaluator.evaluate(predictions,
> {evaluator.metricName: 'rmse'}
> )
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 63, in evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
>  line 94, in _evaluate
> File 
> "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call__
> File 
> "/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py",
>  line 42, in deco
> raise IllegalArgumentException(s.split(': ', 1)[1])
> pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
> prediction must be of type DoubleType but was actually FloatType.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11325) Alias alias in Scala's DataFrame to as to match python

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11325:
--
Assignee: Nong Li

> Alias alias in Scala's DataFrame to as to match python
> --
>
> Key: SPARK-11325
> URL: https://issues.apache.org/jira/browse/SPARK-11325
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Nong Li
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10947:
--
Assignee: Stephen De Gennaro

> With schema inference from JSON into a Dataframe, add option to infer all 
> primitive object types as strings
> ---
>
> Key: SPARK-10947
> URL: https://issues.apache.org/jira/browse/SPARK-10947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ewan Leith
>Assignee: Stephen De Gennaro
>Priority: Minor
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> Currently, when a schema is inferred from a JSON file using 
> sqlContext.read.json, the primitive object types are inferred as string, 
> long, boolean, etc.
> However, if the inferred type is too specific (JSON obviously does not 
> enforce types itself), this causes issues with merging dataframe schemas.
> Instead, we would like an option in the JSON inferField function to treat all 
> primitive objects as strings.
> We'll create and submit a pull request for this for review.
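For illustration, a sketch of how such an option might be used from PySpark; 
the option name {{primitivesAsString}} and the file path are assumptions here, 
not part of the current API:
{code}
# Default behaviour: primitives inferred as string/long/boolean/...
df_typed = sqlContext.read.json("events.json")
df_typed.printSchema()

# Proposed behaviour: every primitive field inferred as StringType
df_strings = (sqlContext.read
              .option("primitivesAsString", "true")
              .json("events.json"))
df_strings.printSchema()
{code}
With all primitives read as strings, schemas inferred from different JSON files 
can be merged without conflicting numeric/boolean types.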



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976167#comment-14976167
 ] 

Apache Spark commented on SPARK-11305:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9298

> Remove Third-Party Hadoop Distributions Doc Page
> 
>
> Key: SPARK-11305
> URL: https://issues.apache.org/jira/browse/SPARK-11305
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Priority: Critical
>
> There is a fairly old page in our docs that contains a bunch of assorted 
> information regarding running Spark on Hadoop clusters. I think this page 
> should be removed and merged into other parts of the docs because the 
> information is largely redundant and somewhat outdated.
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> There are four sections:
> 1. Compile time Hadoop version - this information I think can be removed in 
> favor of that on the "building spark" page. These days most "advanced users" 
> are building without bundling Hadoop, so I'm not sure giving them a bunch of 
> different Hadoop versions sends the right message.
> 2. Linking against Hadoop - this doesn't seem to add much beyond what is in 
> the programming guide.
> 3. Where to run Spark - redundant with the hardware provisioning guide.
> 4. Inheriting cluster configurations - I think this would be better as a 
> section at the end of the configuration page. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11305:


Assignee: Apache Spark

> Remove Third-Party Hadoop Distributions Doc Page
> 
>
> Key: SPARK-11305
> URL: https://issues.apache.org/jira/browse/SPARK-11305
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Assignee: Apache Spark
>Priority: Critical
>
> There is a fairly old page in our docs that contains a bunch of assorted 
> information regarding running Spark on Hadoop clusters. I think this page 
> should be removed and merged into other parts of the docs because the 
> information is largely redundant and somewhat outdated.
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> There are four sections:
> 1. Compile time Hadoop version - this information I think can be removed in 
> favor of that on the "building spark" page. These days most "advanced users" 
> are building without bundling Hadoop, so I'm not sure giving them a bunch of 
> different Hadoop versions sends the right message.
> 2. Linking against Hadoop - this doesn't seem to add much beyond what is in 
> the programming guide.
> 3. Where to run Spark - redundant with the hardware provisioning guide.
> 4. Inheriting cluster configurations - I think this would be better as a 
> section at the end of the configuration page. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11305) Remove Third-Party Hadoop Distributions Doc Page

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11305:


Assignee: (was: Apache Spark)

> Remove Third-Party Hadoop Distributions Doc Page
> 
>
> Key: SPARK-11305
> URL: https://issues.apache.org/jira/browse/SPARK-11305
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Patrick Wendell
>Priority: Critical
>
> There is a fairly old page in our docs that contains a bunch of assorted 
> information regarding running Spark on Hadoop clusters. I think this page 
> should be removed and merged into other parts of the docs because the 
> information is largely redundant and somewhat outdated.
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> There are four sections:
> 1. Compile time Hadoop version - this information I think can be removed in 
> favor of that on the "building spark" page. These days most "advanced users" 
> are building without bundling Hadoop, so I'm not sure giving them a bunch of 
> different Hadoop versions sends the right message.
> 2. Linking against Hadoop - this doesn't seem to add much beyond what is in 
> the programming guide.
> 3. Where to run Spark - redundant with the hardware provisioning guide.
> 4. Inheriting cluster configurations - I think this would be better as a 
> section at the end of the configuration page. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11306) Executor JVM loss can lead to a hang in Standalone mode

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11306:
--
Component/s: Scheduler

> Executor JVM loss can lead to a hang in Standalone mode
> ---
>
> Key: SPARK-11306
> URL: https://issues.apache.org/jira/browse/SPARK-11306
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>
> This commit: 
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0
>  introduced a bug where, in Standalone mode, if a task fails and crashes the 
> JVM, the failure is considered a "normal failure" (meaning it's considered 
> unrelated to the task), so the failure isn't counted against the task's 
> maximum number of failures: 
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138.
>   As a result, if a task fails in a way that results in it crashing the JVM, 
> it will continuously be re-launched, resulting in a hang.
> Unfortunately this issue is difficult to reproduce because of a race 
> condition where we have multiple code paths that are used to handle executor 
> losses, and in the setup I'm using, Akka's notification that the executor was 
> lost always gets to the TaskSchedulerImpl first, so the task eventually gets 
> killed (see my recent email to the dev list).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11337) Make example code in user guide testable

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11337:
--
Component/s: Documentation

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
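As a sketch of what a marked example file could look like (the 
{{$example on$}} / {{$example off$}} marker syntax and the file path are 
assumptions for discussion, not a decided convention); the include_example tag 
would pull only the marked block into the generated `{% highlight %}` section:
{code}
# examples/src/main/python/ml/kmeans_example.py  (hypothetical path)
from pyspark import SparkContext
from pyspark.sql import SQLContext

if __name__ == "__main__":
    sc = SparkContext(appName="KMeansExample")
    sqlContext = SQLContext(sc)

    # $example on$
    from pyspark.ml.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    dataset = sqlContext.createDataFrame(
        [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
         (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
        ["features"])
    model = KMeans(k=2, seed=1).fit(dataset)
    print(model.clusterCenters())
    # $example off$

    sc.stop()
{code}
Jenkins could then build and run everything under examples/, while the user 
guide stays in sync by construction.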



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11317) YARN HBase token code shouldn't swallow invocation target exceptions

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11317:
--
Component/s: YARN

> YARN HBase token code shouldn't swallow invocation target exceptions
> 
>
> Key: SPARK-11317
> URL: https://issues.apache.org/jira/browse/SPARK-11317
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Steve Loughran
>
> As with SPARK-11265, the HBase token retrieval code of SPARK-6918:
> 1. swallows exceptions it should be rethrowing as serious problems (e.g. 
> NoSuchMethodException)
> 2. swallows any exception raised by the HBase client, without even logging 
> the details (it logs that an `InvocationTargetException` was caught, but not 
> its contents)
> As such it is potentially brittle to changes in the HDFS client code, and 
> absolutely not going to provide any assistance if HBase won't actually issue 
> tokens to the caller.
> The code in SPARK-11265 can be re-used to provide consistent and better 
> exception processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11332) WeightedLeastSquares should use ml features generic Instance class instead of private

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11332:
--
Component/s: ML

> WeightedLeastSquares should use ml features generic Instance class instead of 
> private
> -
>
> Key: SPARK-11332
> URL: https://issues.apache.org/jira/browse/SPARK-11332
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Assignee: DB Tsai
>Priority: Minor
>
> WeightedLeastSquares should use the common Instance class in ml.feature 
> instead of a private one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11206) Support SQL UI on the history server

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976061#comment-14976061
 ] 

Apache Spark commented on SPARK-11206:
--

User 'carsonwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9297

> Support SQL UI on the history server
> 
>
> Key: SPARK-11206
> URL: https://issues.apache.org/jira/browse/SPARK-11206
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Web UI
>Reporter: Carson Wang
>
> On the live web UI, there is a SQL tab which provides valuable information 
> for the SQL query. But once the workload is finished, we won't see the SQL 
> tab on the history server. It will be helpful if we support SQL UI on the 
> history server so we can analyze it even after its execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11206) Support SQL UI on the history server

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11206:


Assignee: (was: Apache Spark)

> Support SQL UI on the history server
> 
>
> Key: SPARK-11206
> URL: https://issues.apache.org/jira/browse/SPARK-11206
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Web UI
>Reporter: Carson Wang
>
> On the live web UI, there is a SQL tab which provides valuable information 
> for the SQL query. But once the workload is finished, we won't see the SQL 
> tab on the history server. It will be helpful if we support SQL UI on the 
> history server so we can analyze it even after its execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11206) Support SQL UI on the history server

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11206:


Assignee: Apache Spark

> Support SQL UI on the history server
> 
>
> Key: SPARK-11206
> URL: https://issues.apache.org/jira/browse/SPARK-11206
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Web UI
>Reporter: Carson Wang
>Assignee: Apache Spark
>
> On the live web UI, there is a SQL tab which provides valuable information 
> for the SQL query. But once the workload is finished, we won't see the SQL 
> tab on the history server. It will be helpful if we support SQL UI on the 
> history server so we can analyze it even after its execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11303:
-
Description: 
When sampling and then filtering a DataFrame from Python, we get inconsistent 
results when not caching the sampled DataFrame. This bug doesn't appear in 
Spark 1.4.1.

{code}
d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
d_sampled = d.sample(False, 0.1, 1)
print d_sampled.count()
print d_sampled.filter('t = 1').count()
print d_sampled.filter('t != 1').count()
d_sampled.cache()
print d_sampled.count()
print d_sampled.filter('t = 1').count()
print d_sampled.filter('t != 1').count()
{code}

output:
{code}
14
7
8
14
7
7
{code}

  was:
When sampling and then filtering DataFrame from python, we get inconsistent 
result when not caching the sampled DataFrame. This bug  doesn't appear in 
spark 1.4.1.

d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
d_sampled = d.sample(False, 0.1, 1)
print d_sampled.count()
print d_sampled.filter('t = 1').count()
print d_sampled.filter('t != 1').count()
d_sampled.cache()
print d_sampled.count()
print d_sampled.filter('t = 1').count()
print d_sampled.filter('t != 1').count()

output:
14
7
8
14
7
7

Thanks!


> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11341) Given non-zero ordinal toRow in the encoders of primitive types will cause problem

2015-10-27 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-11341.
---
Resolution: Not A Problem

> Given non-zero ordinal toRow in the encoders of primitive types will cause 
> problem
> --
>
> Key: SPARK-11341
> URL: https://issues.apache.org/jira/browse/SPARK-11341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> The toRow methods in LongEncoder and IntEncoder write to the given ordinal of 
> an unsafe row that has only one field. Since the ordinal is parametric, 
> passing a non-zero ordinal will cause problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11295) Add packages to JUnit output for Python tests

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11295:
--
Component/s: Tests

> Add packages to JUnit output for Python tests
> -
>
> Key: SPARK-11295
> URL: https://issues.apache.org/jira/browse/SPARK-11295
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Gabor Liptak
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11303.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9294
[https://github.com/apache/spark/pull/9294]

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
> Fix For: 1.6.0
>
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-27 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11277.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9247
[https://github.com/apache/spark/pull/9247]

> sort_array throws exception scala.MatchError
> 
>
> Key: SPARK-11277
> URL: https://issues.apache.org/jira/browse/SPARK-11277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Jia Li
> Fix For: 1.6.0
>
>
> I was trying out the sort_array function and then hit this exception. 
> I looked into the Spark source code and found that the root cause is that 
> sort_array does not check for an array of NULLs. It's not meaningful to sort 
> an array of entirely NULLs anyway. A similar issue exists with an array of 
> struct type. 
> I already have a fix for this issue and I'm going to create a pull request 
> for it. 
> scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
> scala.MatchError: ArrayType(NullType,true) (of class 
> org.apache.spark.sql.types.ArrayType)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5737) Scanning duplicate columns from parquet table

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5737.
--
   Resolution: Cannot Reproduce
Fix Version/s: (was: 1.5.1)

> Scanning duplicate columns from parquet table
> -
>
> Key: SPARK-5737
> URL: https://issues.apache.org/jira/browse/SPARK-5737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Kevin Jung
>
> {quote}
> import org.apache.spark.sql._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val rdd = sqlContext.parquetFile("temp.parquet")
> rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
> {quote}
> The results of the above code have null values in the first column of each 
> duplicated pair.
> For example,
> {quote}
> [null,-5.7,null,121.05]
> [null,-61.17,null,108.91]
> [null,50.60,null,72.15]
> {quote}
> This happens only in ParquetTableScan. PhysicalRDD works fine and the rows 
> have the duplicated values, like...
> {quote}
> [-5.7,-5.7,121.05,121.05]
> [-61.17,-61.17,108.91,108.91]
> [50.60,50.60,72.15,72.15]
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-5737) Scanning duplicate columns from parquet table

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-5737:
--

[~kallsu] Since we can't point to a change that resolved this at this point, it 
should be Cannot Reproduce.

> Scanning duplicate columns from parquet table
> -
>
> Key: SPARK-5737
> URL: https://issues.apache.org/jira/browse/SPARK-5737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Kevin Jung
>
> {quote}
> import org.apache.spark.sql._
> val sqlContext = new SQLContext(sc)
> import sqlContext._
> val rdd = sqlContext.parquetFile("temp.parquet")
> rdd.select('d1,'d1,'d2,'d2).take(3).foreach(println)
> {quote}
> The results of the above code have null values in the first column of each 
> duplicated pair.
> For example,
> {quote}
> [null,-5.7,null,121.05]
> [null,-61.17,null,108.91]
> [null,50.60,null,72.15]
> {quote}
> This happens only in ParquetTableScan. PhysicalRDD works fine and the rows 
> have the duplicated values, like...
> {quote}
> [-5.7,-5.7,121.05,121.05]
> [-61.17,-61.17,108.91,108.91]
> [50.60,50.60,72.15,72.15]
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11344) ApplicationDescription should be immutable case class

2015-10-27 Thread Jacek Lewandowski (JIRA)
Jacek Lewandowski created SPARK-11344:
-

 Summary: ApplicationDescription should be immutable case class
 Key: SPARK-11344
 URL: https://issues.apache.org/jira/browse/SPARK-11344
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.5.1, 1.4.1
Reporter: Jacek Lewandowski


{{ApplicationDescription}} should be a case class. Currently it is not 
immutable because it has one {{var}} field. This is something which has to be 
refactored because it causes confusion and bugs - for example, SPARK-1706 
introduced an additional {{val}} to {{ApplicationDescription}} but it was 
missed in the {{copy}} method.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Yuval Tanny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976317#comment-14976317
 ] 

Yuval Tanny commented on SPARK-11303:
-

Is the fix going to be merged to 1.5 (and 1.5.2)?

Thanks

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
> Fix For: 1.6.0
>
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-10-27 Thread Yuval Tanny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976317#comment-14976317
 ] 

Yuval Tanny edited comment on SPARK-11303 at 10/27/15 1:10 PM:
---

Is the fix going to be in 1.5.2?

Thanks


was (Author: yuvalt):
Is the fix is going to be merged to 1.5 (and 1.5.2)?

Thanks

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
> Fix For: 1.6.0
>
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10592) deprecate weights and use coefficients instead in ML models

2015-10-27 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976369#comment-14976369
 ] 

Yanbo Liang commented on SPARK-10592:
-

I think you are on the right track. :) Looking forward to [~mengxr]'s comments. 

> deprecate weights and use coefficients instead in ML models
> ---
>
> Key: SPARK-10592
> URL: https://issues.apache.org/jira/browse/SPARK-10592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The name `weights` becomes confusing as we are supporting weighted instances. 
> As discussed in https://github.com/apache/spark/pull/7884, we want to 
> deprecate `weights` and use `coefficients` instead:
> * Deprecate but do not remove `weights`.
> * Only make changes under `spark.ml`.
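A minimal Python-side sketch of the deprecate-but-keep pattern proposed here 
(the class and attribute layout are illustrative assumptions, not the committed 
API):
{code}
import warnings

class LinearModelSketch(object):
    """Keeps `weights` as a deprecated alias for `coefficients`."""

    def __init__(self, coefficients, intercept):
        self._coefficients = coefficients
        self.intercept = intercept

    @property
    def coefficients(self):
        # New, preferred name
        return self._coefficients

    @property
    def weights(self):
        # Old name kept for compatibility, but warns on use
        warnings.warn("weights is deprecated; use coefficients instead",
                      DeprecationWarning)
        return self._coefficients
{code}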



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11345) Make HadoopFsRelation always outputs UnsafeRow

2015-10-27 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-11345:
--

 Summary: Make HadoopFsRelation always outputs UnsafeRow
 Key: SPARK-11345
 URL: https://issues.apache.org/jira/browse/SPARK-11345
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10592) deprecate weights and use coefficients instead in ML models

2015-10-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10592:

Comment: was deleted

(was: I think you are in the right way. :) Looking forward [~mengxr]'s 
comments. )

> deprecate weights and use coefficients instead in ML models
> ---
>
> Key: SPARK-10592
> URL: https://issues.apache.org/jira/browse/SPARK-10592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The name `weights` becomes confusing as we are supporting weighted instances. 
> As discussed in https://github.com/apache/spark/pull/7884, we want to 
> deprecate `weights` and use `coefficients` instead:
> * Deprecate but do not remove `weights`.
> * Only make changes under `spark.ml`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10592) deprecate weights and use coefficients instead in ML models

2015-10-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10592:

Comment: was deleted

(was: I think you are in the right way. :) Looking forward [~mengxr]'s 
comments. )

> deprecate weights and use coefficients instead in ML models
> ---
>
> Key: SPARK-10592
> URL: https://issues.apache.org/jira/browse/SPARK-10592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The name `weights` becomes confusing as we are supporting weighted instances. 
> As discussed in https://github.com/apache/spark/pull/7884, we want to 
> deprecate `weights` and use `coefficients` instead:
> * Deprecate but do not remove `weights`.
> * Only make changes under `spark.ml`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large

2015-10-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976364#comment-14976364
 ] 

Thomas Graves commented on SPARK-6951:
--

Unfortunately for us, I don't think the current timeline server will scale at 
this point. But I'll have to try it once things are ready.

There are other ways to make this faster to start up:

- we could simply do an ls and show applications without any details, then 
add details as they are loaded. Then, if the user clicks on one, load that 
one first.
- we could change the file name or directory structure to carry some basic 
information that could be displayed by a simple file listing (like the 
MapReduce history server does).

> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10592) deprecate weights and use coefficients instead in ML models

2015-10-27 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976372#comment-14976372
 ] 

Yanbo Liang commented on SPARK-10592:
-

I think you are on the right track. :) Looking forward to [~mengxr]'s comments. 

> deprecate weights and use coefficients instead in ML models
> ---
>
> Key: SPARK-10592
> URL: https://issues.apache.org/jira/browse/SPARK-10592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The name `weights` becomes confusing as we are supporting weighted instances. 
> As discussed in https://github.com/apache/spark/pull/7884, we want to 
> deprecate `weights` and use `coefficients` instead:
> * Deprecate but do not remove `weights`.
> * Only make changes under `spark.ml`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10592) deprecate weights and use coefficients instead in ML models

2015-10-27 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976374#comment-14976374
 ] 

Yanbo Liang commented on SPARK-10592:
-

I think you are on the right track. :) Looking forward to [~mengxr]'s comments. 

> deprecate weights and use coefficients instead in ML models
> ---
>
> Key: SPARK-10592
> URL: https://issues.apache.org/jira/browse/SPARK-10592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The name `weights` becomes confusing as we are supporting weighted instances. 
> As discussed in https://github.com/apache/spark/pull/7884, we want to 
> deprecate `weights` and use `coefficients` instead:
> * Deprecate but do not remove `weights`.
> * Only make changes under `spark.ml`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11344) ApplicationDescription should be immutable case class

2015-10-27 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976250#comment-14976250
 ] 

Jacek Lewandowski commented on SPARK-11344:
---

[~srowen] I was going to update it, but something interrupted me.

> ApplicationDescription should be immutable case class
> -
>
> Key: SPARK-11344
> URL: https://issues.apache.org/jira/browse/SPARK-11344
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{ApplicationDescription}} should be a case class. Currently it is not 
> immutable because it has one {{var}} field. This is something which has to be 
> refactored because it causes confusion and bugs - for example, SPARK-1706 
> introduced an additional {{val}} to {{ApplicationDescription}} but it was 
> missed in the {{copy}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11344) ApplicationDescription should be immutable case class

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11344:


Assignee: Apache Spark

> ApplicationDescription should be immutable case class
> -
>
> Key: SPARK-11344
> URL: https://issues.apache.org/jira/browse/SPARK-11344
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Jacek Lewandowski
>Assignee: Apache Spark
>Priority: Minor
>
> {{ApplicationDescription}} should be a case class. Currently it is not 
> immutable because it has one {{var}} field. This is something which has to be 
> refactored because it causes confusion and bugs - for example, SPARK-1706 
> introduced an additional {{val}} to {{ApplicationDescription}} but it was 
> missed in the {{copy}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11344) ApplicationDescription should be immutable case class

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11344:
--
Target Version/s:   (was: 1.6.0)
Priority: Minor  (was: Major)
  Issue Type: Improvement  (was: Bug)

[~jlewandowski] Don't set Target version. I also don't think you can call this 
a "major bug".

> ApplicationDescription should be immutable case class
> -
>
> Key: SPARK-11344
> URL: https://issues.apache.org/jira/browse/SPARK-11344
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{ApplicationDescription}} should be a case class. Currently it is not 
> immutable because it has one {{var}} field. This is something which has to be 
> refactored because it causes confusion and bugs - for example, SPARK-1706 
> introduced an additional {{val}} to {{ApplicationDescription}} but it was 
> missed in the {{copy}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11315) Add YARN extension service to publish Spark events to YARN timeline service

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976240#comment-14976240
 ] 

Apache Spark commented on SPARK-11315:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/8744

> Add YARN extension service to publish Spark events to YARN timeline service
> ---
>
> Key: SPARK-11315
> URL: https://issues.apache.org/jira/browse/SPARK-11315
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.6+
>Reporter: Steve Loughran
>
> Add an extension service (using SPARK-11314) to subscribe to Spark lifecycle 
> events, batch them and forward them to the YARN Application Timeline Service. 
> This data can then be retrieved by a new back end for the Spark History 
> Service, and by other analytics tools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11315) Add YARN extension service to publish Spark events to YARN timeline service

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11315:


Assignee: Apache Spark

> Add YARN extension service to publish Spark events to YARN timeline service
> ---
>
> Key: SPARK-11315
> URL: https://issues.apache.org/jira/browse/SPARK-11315
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.6+
>Reporter: Steve Loughran
>Assignee: Apache Spark
>
> Add an extension service (using SPARK-11314) to subscribe to Spark lifecycle 
> events, batch them and forward them to the YARN Application Timeline Service. 
> This data can then be retrieved by a new back end for the Spark History 
> Service, and by other analytics tools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11315) Add YARN extension service to publish Spark events to YARN timeline service

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11315:


Assignee: (was: Apache Spark)

> Add YARN extension service to publish Spark events to YARN timeline service
> ---
>
> Key: SPARK-11315
> URL: https://issues.apache.org/jira/browse/SPARK-11315
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Hadoop 2.6+
>Reporter: Steve Loughran
>
> Add an extension service (using SPARK-11314) to subscribe to Spark lifecycle 
> events, batch them and forward them to the YARN Application Timeline Service. 
> This data can then be retrieved by a new back end for the Spark History 
> Service, and by other analytics tools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11344) ApplicationDescription should be immutable case class

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11344:


Assignee: (was: Apache Spark)

> ApplicationDescription should be immutable case class
> -
>
> Key: SPARK-11344
> URL: https://issues.apache.org/jira/browse/SPARK-11344
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{ApplicationDescription}} should be a case class. Currently it is not 
> immutable because it has one {{var}} field. This is something which has to be 
> refactored because it causes confusion and bugs - for example, SPARK-1706 
> introduced an additional {{val}} to {{ApplicationDescription}} but it was 
> missed in the {{copy}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11344) ApplicationDescription should be immutable case class

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976256#comment-14976256
 ] 

Apache Spark commented on SPARK-11344:
--

User 'jacek-lewandowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/9299

> ApplicationDescription should be immutable case class
> -
>
> Key: SPARK-11344
> URL: https://issues.apache.org/jira/browse/SPARK-11344
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 1.4.1, 1.5.1
>Reporter: Jacek Lewandowski
>Priority: Minor
>
> {{ApplicationDescription}} should be a case class. Currently it is not 
> immutable because it has one {{var}} field. This is something which has to be 
> refactored because it causes confusion and bugs - for example, SPARK-1706 
> introduced an additional {{val}} to {{ApplicationDescription}} but it was 
> missed in the {{copy}} method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2015-10-27 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976595#comment-14976595
 ] 

Alan Braithwaite edited comment on SPARK-11327 at 10/27/15 3:56 PM:


I don't think it's in the startup script. We're running the java command 
directly using Marathon/Mesos (java -cp .. etc). It's worked fine on my laptop 
before, but I suspect that's because it's pulling 
spark.mesos.executor.docker.image from spark-defaults.conf instead of the CLI.


was (Author: abraithwaite):
I don't think it's in the startup script.  We're running the java command 
directly using mesos.  (java -cp .. etc).  It's worked fine on my laptop 
before, but I suspect that's because it's pulling 
spark.mesos.executor.docker.image from spark-defaults.conf instead of the CLI.

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Alan Braithwaite
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image...
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar;,
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar;,
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : 

[jira] [Assigned] (SPARK-11349) Support transform string label for RFormula

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11349:


Assignee: Apache Spark

> Support transform string label for RFormula
> ---
>
> Key: SPARK-11349
> URL: https://issues.apache.org/jira/browse/SPARK-11349
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Currently RFormula can only handle labels of NumericType or BinaryType (casting 
> them to DoubleType as the label for Linear Regression training); we should also 
> support labels of StringType, which are needed for Logistic Regression (glm with 
> family = "binomial"). 
> For a StringType label, we should use StringIndexer to transform it to a 
> 0-based index.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11351) support hive interval literal in sql parser

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11351:


Assignee: Apache Spark

> support hive interval literal in sql parser
> ---
>
> Key: SPARK-11351
> URL: https://issues.apache.org/jira/browse/SPARK-11351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11352) codegen.GeneratePredicate fails due to unquoted comment

2015-10-27 Thread Rares Mirica (JIRA)
Rares Mirica created SPARK-11352:


 Summary: codegen.GeneratePredicate fails due to unquoted comment
 Key: SPARK-11352
 URL: https://issues.apache.org/jira/browse/SPARK-11352
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Rares Mirica


Somehow the generated code ends up containing comments with unescaped 
comment terminators, e.g.:

/* ((input[35, StringType] <= 
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8) && 
(text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 <= input[36, 
StringType])) */

with emphasis on ... =0.9,*/...

This leads to an org.codehaus.commons.compiler.CompileException.
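
A minimal, self-contained Scala sketch of the failure mode (illustrative only, not Spark's actual codegen): an expression string containing "*/" that is embedded verbatim in a generated block comment terminates the comment early, while escaping the terminator first keeps the comment intact.

{code}
// Illustrative sketch only: why an unescaped "*/" in generated comment text breaks
// compilation, and one way to neutralise it before emitting the comment.
object CommentEscapeSketch {
  def main(args: Array[String]): Unit = {
    val exprText = """input[35, StringType] <= text/html,*/*;q=0.8"""  // contains "*/"

    // Naive emission: the "*/" inside exprText ends the comment early, leaving
    // "*;q=0.8) ..." behind as stray tokens, which Janino rejects.
    val broken = s"/* $exprText */ return true;"

    // Escaping the terminator keeps the whole expression inside the comment.
    val safe = s"/* ${exprText.replace("*/", "*\\/")} */ return true;"

    println(broken)
    println(safe)
  }
}
{code}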



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11349) Support transform string label for RFormula

2015-10-27 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11349:
---

 Summary: Support transform string label for RFormula
 Key: SPARK-11349
 URL: https://issues.apache.org/jira/browse/SPARK-11349
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Yanbo Liang


Currently RFormula can only handle labels of NumericType or BinaryType (casting 
them to DoubleType as the label for Linear Regression training); we should also 
support labels of StringType, which are needed for Logistic Regression (glm with 
family = "binomial"). 
For a StringType label, we should use StringIndexer to transform it to a 
0-based index.
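
As a rough sketch of the proposed handling (not code from this ticket; the DataFrame, the column names, and a spark-shell sqlContext are assumptions), the StringType label would go through StringIndexer to obtain a 0-based numeric label before training:

{code}
import org.apache.spark.ml.feature.StringIndexer

// Toy DataFrame with a string label; "label"/"indexedLabel" are made-up names.
val df = sqlContext.createDataFrame(Seq(
  (1.0, "yes"),
  (2.0, "no"),
  (3.0, "yes")
)).toDF("feature", "label")

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

// "yes"/"no" become 0.0/1.0 (indices ordered by label frequency), which the
// regression can then train on.
val indexed = indexer.fit(df).transform(df)
indexed.show()
{code}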



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11351) support hive interval literal in sql parser

2015-10-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11351:
---

 Summary: support hive interval literal in sql parser
 Key: SPARK-11351
 URL: https://issues.apache.org/jira/browse/SPARK-11351
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2015-10-27 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976595#comment-14976595
 ] 

Alan Braithwaite commented on SPARK-11327:
--

I don't think it's in the startup script.  We're running the java command 
directly using mesos.  (java -cp .. etc).  It's worked fine on my laptop 
before, but I suspect that's because it's pulling 
spark.mesos.executor.docker.image from spark-defaults.conf instead of the CLI.

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Alan Braithwaite
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image...
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar",
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar",
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {

[jira] [Assigned] (SPARK-11349) Support transform string label for RFormula

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11349:


Assignee: (was: Apache Spark)

> Support transform string label for RFormula
> ---
>
> Key: SPARK-11349
> URL: https://issues.apache.org/jira/browse/SPARK-11349
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>
> Currently RFormula can only handle labels of NumericType or BinaryType (casting 
> them to DoubleType as the label for Linear Regression training); we should also 
> support labels of StringType, which are needed for Logistic Regression (glm with 
> family = "binomial"). 
> For a StringType label, we should use StringIndexer to transform it to a 
> 0-based index.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11349) Support transform string label for RFormula

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976526#comment-14976526
 ] 

Apache Spark commented on SPARK-11349:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9302

> Support transform string label for RFormula
> ---
>
> Key: SPARK-11349
> URL: https://issues.apache.org/jira/browse/SPARK-11349
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>
> Currently RFormula can only handle labels of NumericType or BinaryType (casting 
> them to DoubleType as the label for Linear Regression training); we should also 
> support labels of StringType, which are needed for Logistic Regression (glm with 
> family = "binomial"). 
> For a StringType label, we should use StringIndexer to transform it to a 
> 0-based index.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2015-10-27 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976587#comment-14976587
 ] 

Alan Braithwaite commented on SPARK-11327:
--

Ah, sorry about that.  Thanks for the pointer.

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Alan Braithwaite
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image...
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar",
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar",
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> {code}
> driver log:
> {code}
> 15/10/26 22:08:08 INFO SparkContext: 

[jira] [Commented] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2015-10-27 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976530#comment-14976530
 ] 

Alan Braithwaite commented on SPARK-11327:
--

It's the cluster mode manager for mesos.

http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>Reporter: Alan Braithwaite
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image...
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar",
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar",
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> {code}
> driver log:
> {code}
> 

[jira] [Assigned] (SPARK-11345) Make HadoopFsRelation always outputs UnsafeRow

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11345:


Assignee: Apache Spark  (was: Cheng Lian)

> Make HadoopFsRelation always outputs UnsafeRow
> --
>
> Key: SPARK-11345
> URL: https://issues.apache.org/jira/browse/SPARK-11345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11348) Replace addOnCompleteCallback with addTaskCompletionListener() in UnsafeExternalSorter

2015-10-27 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-11348:
---
Attachment: spark-11348.txt

> Replace addOnCompleteCallback with addTaskCompletionListener() in 
> UnsafeExternalSorter
> --
>
> Key: SPARK-11348
> URL: https://issues.apache.org/jira/browse/SPARK-11348
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Trivial
> Attachments: spark-11348.txt
>
>
> When practicing the command from SPARK-11318, I got the following:
> {code}
> [WARNING] 
> /home/hbase/spark/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[141,15]
>  [deprecation]  
> addOnCompleteCallback(Function0) in TaskContext has been deprecated
> {code}
> addOnCompleteCallback should be replaced with addTaskCompletionListener()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11347) Support for joining two datasets, returning a tuple of objects

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11347:


Assignee: Apache Spark  (was: Michael Armbrust)

> Support for joining two datasets, returning a tuple of objects
> --
>
> Key: SPARK-11347
> URL: https://issues.apache.org/jira/browse/SPARK-11347
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11350) There is no best practice to handle warnings or messages produced by Executors in a distributed manner

2015-10-27 Thread Antonio Murgia (JIRA)
Antonio Murgia created SPARK-11350:
--

 Summary: There is no best practice to handle warnings or messages 
produced by Executors in a distributed manner
 Key: SPARK-11350
 URL: https://issues.apache.org/jira/browse/SPARK-11350
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Reporter: Antonio Murgia


I looked around on the web and I couldn’t find any way to deal, in a 
distributed way, with malformed/faulty records during computation. All I was 
able to find was the flatMap/Some/None technique + logging. 
I’m facing this problem because I have a processing algorithm that extracts 
more than one value from each record, but can fail in extracting one of those 
multiple values, and I want to keep track of them. Logging is not feasible 
because this “warning” happens so frequently that the logs would become 
overwhelming and impossible to read. 
Since I have 3 different possible outcomes from my processing, I modeled it with 
this class hierarchy: 

http://i.imgur.com/NIesYUm.png?1

That holds results and/or warnings. Since Result implements Traversable, it can 
be used in a flatMap, discarding all warnings and failure results; on the other 
hand, if we want to keep track of warnings, we can process them and output 
them if we need to.
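
For concreteness, here is one possible shape of that hierarchy, reconstructed only from the description above (the actual design is in the linked image, so the names and fields are guesses):

{code}
// Guessed reconstruction: Result is Traversable, so flatMap keeps successful
// values and drops warnings/failures, while the warnings stay available for a
// separate pass when we do want them.
sealed trait Result[+A] extends Traversable[A] {
  def warnings: Seq[String]
}

case class Success[A](value: A, warnings: Seq[String] = Nil) extends Result[A] {
  override def foreach[U](f: A => U): Unit = f(value)
}

case class Failure(warnings: Seq[String]) extends Result[Nothing] {
  override def foreach[U](f: Nothing => U): Unit = ()
}

// rdd.flatMap(process)                  -> only the successfully extracted values
// rdd.map(process).flatMap(_.warnings)  -> only the warnings, handled separately
{code}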



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11350) There is no best practice to handle warnings or messages produced by Executors in a distributed manner

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11350.
---
Resolution: Invalid

I think this is a question or discussion for u...@spark.apache.org -- I don't 
see a specific change to Spark being proposed.

> There is no best practice to handle warnings or messages produced by 
> Executors in a distributed manner
> --
>
> Key: SPARK-11350
> URL: https://issues.apache.org/jira/browse/SPARK-11350
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Reporter: Antonio Murgia
>  Labels: Suggestion, improvement
>
> I looked around on the web and I couldn’t find any way to deal, in a 
> distributed way, with malformed/faulty records during computation. All I was 
> able to find was the flatMap/Some/None technique + logging. 
> I’m facing this problem because I have a processing algorithm that extracts 
> more than one value from each record, but can fail in extracting one of those 
> multiple values, and I want to keep track of them. Logging is not feasible 
> because this “warning” happens so frequently that the logs would become 
> overwhelming and impossible to read. 
> Since I have 3 different possible outcomes from my processing, I modeled it 
> with this class hierarchy: 
> http://i.imgur.com/NIesYUm.png?1
> That holds results and/or warnings. Since Result implements Traversable, it can 
> be used in a flatMap, discarding all warnings and failure results; on the 
> other hand, if we want to keep track of warnings, we can process them and 
> output them if we need to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11351) support hive interval literal in sql parser

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11351:


Assignee: (was: Apache Spark)

> support hive interval literal in sql parser
> ---
>
> Key: SPARK-11351
> URL: https://issues.apache.org/jira/browse/SPARK-11351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11353) Writing to S3 buckets, which only support AWS4-HMAC-SHA256 fails

2015-10-27 Thread JIRA
Łukasz Piepiora created SPARK-11353:
---

 Summary: Writing to S3 buckets, which only support 
AWS4-HMAC-SHA256 fails
 Key: SPARK-11353
 URL: https://issues.apache.org/jira/browse/SPARK-11353
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.5.1, 1.3.1
Reporter: Łukasz Piepiora


For certain regions, such as Frankfurt (eu-central-1), AWS supports only 
[AWS Signature Version 
4|http://docs.aws.amazon.com/general/latest/gr/rande.html#d0e3788].

Currently Spark uses the jets3t library in version 0.9.3, which throws an 
exception when code tries to save files to S3 in eu-central-1.

{code}
Caused by: java.lang.RuntimeException: Failed to automatically set required 
header "x-amz-content-sha256" for request with entity 
org.jets3t.service.impl.rest.httpclient.RepeatableRequestEntity@1e4bc601
at 
org.jets3t.service.utils.SignatureUtils.awsV4GetOrCalculatePayloadHash(SignatureUtils.java:238)
at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.authorizeHttpRequest(RestStorageService.java:762)
at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:324)
at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277)
at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1143)
at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.createObjectImpl(RestStorageService.java:1954)
at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.putObjectWithRequestEntityImpl(RestStorageService.java:1875)
at 
org.jets3t.service.impl.rest.httpclient.RestStorageService.putObjectImpl(RestStorageService.java:1867)
at org.jets3t.service.StorageService.putObject(StorageService.java:840)
at org.jets3t.service.S3Service.putObject(S3Service.java:2212)
at org.jets3t.service.S3Service.putObject(S3Service.java:2356)
... 23 more
Caused by: java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
at java.io.BufferedInputStream.reset(BufferedInputStream.java:446)
at 
org.jets3t.service.utils.SignatureUtils.awsV4GetOrCalculatePayloadHash(SignatureUtils.java:236)
... 33 more
{code}

There is a newer version, jets3t 0.9.4, which seems to fix this issue 
(http://www.jets3t.org/RELEASE_NOTES.html).

Therefore I suggest upgrading the jets3t dependency from 0.9.3 to 0.9.4 in the 
Hadoop profiles.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11348) Replace addOnCompleteCallback with addTaskCompletionListener() in UnsafeExternalSorter

2015-10-27 Thread Ted Yu (JIRA)
Ted Yu created SPARK-11348:
--

 Summary: Replace addOnCompleteCallback with 
addTaskCompletionListener() in UnsafeExternalSorter
 Key: SPARK-11348
 URL: https://issues.apache.org/jira/browse/SPARK-11348
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu
Priority: Trivial


When running the command from SPARK-11318, I got the following:
{code}
[WARNING] 
/home/hbase/spark/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[141,15]
 [deprecation]  
addOnCompleteCallback(Function0) in TaskContext has been deprecated
{code}
addOnCompleteCallback should be replaced with addTaskCompletionListener()
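
For reference, the equivalent swap as seen from Scala (UnsafeExternalSorter itself is Java, so this only illustrates the API, not the actual patch):

{code}
import org.apache.spark.TaskContext

// Register a cleanup action for the end of the task with the non-deprecated API.
def registerCleanup(ctx: TaskContext, cleanup: () => Unit): Unit = {
  // Deprecated: ctx.addOnCompleteCallback(cleanup)
  // Replacement: the listener variant is handed the TaskContext it fires for.
  ctx.addTaskCompletionListener { _: TaskContext => cleanup() }
}
{code}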



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11347) Support for joining two datasets, returning a tuple of objects

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11347:


Assignee: Michael Armbrust  (was: Apache Spark)

> Support for joining two datasets, returning a tuple of objects
> --
>
> Key: SPARK-11347
> URL: https://issues.apache.org/jira/browse/SPARK-11347
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9492) LogisticRegression in R should provide model statistics

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976531#comment-14976531
 ] 

Apache Spark commented on SPARK-9492:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9303

> LogisticRegression in R should provide model statistics
> ---
>
> Key: SPARK-9492
> URL: https://issues.apache.org/jira/browse/SPARK-9492
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, R
>Reporter: Eric Liang
>
> Like ml LinearRegression, LogisticRegression should provide a training 
> summary including feature names and their coefficients.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11327) spark-dispatcher doesn't pass along some spark properties

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11327:
--
Component/s: Mesos

[~abraithwaite] setting the component would help here 
(https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)

> spark-dispatcher doesn't pass along some spark properties
> -
>
> Key: SPARK-11327
> URL: https://issues.apache.org/jira/browse/SPARK-11327
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Alan Braithwaite
>
> I haven't figured out exactly what's going on yet, but there's something in 
> the spark-dispatcher which is failing to pass along properties to the 
> spark-driver when using spark-submit in a clustered mesos docker environment.
> Most importantly, it's not passing along spark.mesos.executor.docker.image...
> cli:
> {code}
> docker run -t -i --rm --net=host 
> --entrypoint=/usr/local/spark/bin/spark-submit 
> docker.example.com/spark:2015.10.2 --conf spark.driver.memory=8G --conf 
> spark.mesos.executor.docker.image=docker.example.com/spark:2015.10.2 --master 
> mesos://spark-dispatcher.example.com:31262 --deploy-mode cluster 
> --properties-file /usr/local/spark/conf/spark-defaults.conf --class 
> com.example.spark.streaming.MyApp 
> http://jarserver.example.com:8000/sparkapp.jar zk1.example.com:2181 
> spark-testing my-stream 40
> {code}
> submit output:
> {code}
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending POST request to server 
> at http://compute1.example.com:31262/v1/submissions/create:
> {
>   "action" : "CreateSubmissionRequest",
>   "appArgs" : [ "zk1.example.com:2181", "spark-testing", "requests", "40" ],
>   "appResource" : "http://jarserver.example.com:8000/sparkapp.jar",
>   "clientSparkVersion" : "1.5.0",
>   "environmentVariables" : {
> "SPARK_SCALA_VERSION" : "2.10",
> "SPARK_CONF_DIR" : "/usr/local/spark/conf",
> "SPARK_HOME" : "/usr/local/spark",
> "SPARK_ENV_LOADED" : "1"
>   },
>   "mainClass" : "com.example.spark.streaming.MyApp",
>   "sparkProperties" : {
> "spark.serializer" : "org.apache.spark.serializer.KryoSerializer",
> "spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY" : 
> "/usr/local/lib/libmesos.so",
> "spark.history.fs.logDirectory" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.eventLog.enabled" : "true",
> "spark.driver.maxResultSize" : "0",
> "spark.mesos.deploy.recoveryMode" : "ZOOKEEPER",
> "spark.mesos.deploy.zookeeper.url" : 
> "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181,zk4.example.com:2181,zk5.example.com:2181",
> "spark.jars" : "http://jarserver.example.com:8000/sparkapp.jar",
> "spark.driver.supervise" : "false",
> "spark.app.name" : "com.example.spark.streaming.MyApp",
> "spark.driver.memory" : "8G",
> "spark.logConf" : "true",
> "spark.deploy.zookeeper.dir" : "/spark_mesos_dispatcher",
> "spark.mesos.executor.docker.image" : 
> "docker.example.com/spark-prod:2015.10.2",
> "spark.submit.deployMode" : "cluster",
> "spark.master" : "mesos://compute1.example.com:31262",
> "spark.executor.memory" : "8G",
> "spark.eventLog.dir" : "hdfs://hdfsha.example.com/spark/logs",
> "spark.mesos.docker.executor.network" : "HOST",
> "spark.mesos.executor.home" : "/usr/local/spark"
>   }
> }
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151026220353-0011. Polling submission state...
> 15/10/26 22:03:53 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151026220353-0011 in 
> mesos://compute1.example.com:31262.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Sending GET request to server 
> at 
> http://compute1.example.com:31262/v1/submissions/status/driver-20151026220353-0011.
> 15/10/26 22:03:53 DEBUG RestSubmissionClient: Response from the server:
> {
>   "action" : "SubmissionStatusResponse",
>   "driverState" : "QUEUED",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> 15/10/26 22:03:53 INFO RestSubmissionClient: State of driver 
> driver-20151026220353-0011 is now QUEUED.
> 15/10/26 22:03:53 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151026220353-0011",
>   "success" : true
> }
> {code}
> driver log:
> 

[jira] [Assigned] (SPARK-11345) Make HadoopFsRelation always outputs UnsafeRow

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11345:


Assignee: Cheng Lian  (was: Apache Spark)

> Make HadoopFsRelation always outputs UnsafeRow
> --
>
> Key: SPARK-11345
> URL: https://issues.apache.org/jira/browse/SPARK-11345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11117) PhysicalRDD.outputsUnsafeRows should return true when the underlying data source produces UnsafeRows

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976635#comment-14976635
 ] 

Apache Spark commented on SPARK-11117:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9305

> PhysicalRDD.outputsUnsafeRows should return true when the underlying data 
> source produces UnsafeRows
> 
>
> Key: SPARK-11117
> URL: https://issues.apache.org/jira/browse/SPARK-11117
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> {{PhysicalRDD}} doesn't override {{SparkPlan.outputsUnsafeRows}}, and thus 
> can't avoid {{ConvertToUnsafe}} when upper level operators only support 
> {{UnsafeRow}} even if the underlying data source produces {{UnsafeRow}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11345) Make HadoopFsRelation always outputs UnsafeRow

2015-10-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976636#comment-14976636
 ] 

Apache Spark commented on SPARK-11345:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9305

> Make HadoopFsRelation always outputs UnsafeRow
> --
>
> Key: SPARK-11345
> URL: https://issues.apache.org/jira/browse/SPARK-11345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11348) Replace addOnCompleteCallback with addTaskCompletionListener() in UnsafeExternalSorter

2015-10-27 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-11348:
---
Priority: Minor  (was: Trivial)

> Replace addOnCompleteCallback with addTaskCompletionListener() in 
> UnsafeExternalSorter
> --
>
> Key: SPARK-11348
> URL: https://issues.apache.org/jira/browse/SPARK-11348
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-11348.txt
>
>
> When practicing the command from SPARK-11318, I got the following:
> {code}
> [WARNING] 
> /home/hbase/spark/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[141,15]
>  [deprecation]  
> addOnCompleteCallback(Function0) in TaskContext has been deprecated
> {code}
> addOnCompleteCallback should be replaced with addTaskCompletionListener()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11353) Writing to S3 buckets, which only support AWS4-HMAC-SHA256 fails

2015-10-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11353:


Assignee: Apache Spark

> Writing to S3 buckets, which only support AWS4-HMAC-SHA256 fails
> 
>
> Key: SPARK-11353
> URL: https://issues.apache.org/jira/browse/SPARK-11353
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.3.1, 1.5.1
>Reporter: Łukasz Piepiora
>Assignee: Apache Spark
>
> For certain regions, such as Frankfurt (eu-central-1), AWS supports 
> only [AWS Signature Version 
> 4|http://docs.aws.amazon.com/general/latest/gr/rande.html#d0e3788].
> Currently Spark uses the jets3t library in version 0.9.3, which throws an 
> exception when code tries to save files to S3 in eu-central-1.
> {code}
> Caused by: java.lang.RuntimeException: Failed to automatically set required 
> header "x-amz-content-sha256" for request with entity 
> org.jets3t.service.impl.rest.httpclient.RepeatableRequestEntity@1e4bc601
>   at 
> org.jets3t.service.utils.SignatureUtils.awsV4GetOrCalculatePayloadHash(SignatureUtils.java:238)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.authorizeHttpRequest(RestStorageService.java:762)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:324)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1143)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.createObjectImpl(RestStorageService.java:1954)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.putObjectWithRequestEntityImpl(RestStorageService.java:1875)
>   at 
> org.jets3t.service.impl.rest.httpclient.RestStorageService.putObjectImpl(RestStorageService.java:1867)
>   at org.jets3t.service.StorageService.putObject(StorageService.java:840)
>   at org.jets3t.service.S3Service.putObject(S3Service.java:2212)
>   at org.jets3t.service.S3Service.putObject(S3Service.java:2356)
>   ... 23 more
> Caused by: java.io.IOException: Stream closed
>   at 
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
>   at java.io.BufferedInputStream.reset(BufferedInputStream.java:446)
>   at 
> org.jets3t.service.utils.SignatureUtils.awsV4GetOrCalculatePayloadHash(SignatureUtils.java:236)
>   ... 33 more
> {code}
> There is a newer version, jets3t 0.9.4, which seems to fix this issue 
> (http://www.jets3t.org/RELEASE_NOTES.html).
> Therefore I suggest upgrading the jets3t dependency from 0.9.3 to 0.9.4 in the 
> Hadoop profiles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11356) Option to refresh information about partitions

2015-10-27 Thread JIRA
Maciej Bryński created SPARK-11356:
--

 Summary: Option to refresh information about partitions
 Key: SPARK-11356
 URL: https://issues.apache.org/jira/browse/SPARK-11356
 Project: Spark
  Issue Type: Improvement
Reporter: Maciej Bryński


I have two apps:
1) The first one periodically appends data to parquet (which creates a new partition)
2) The second one executes queries on the data

Right now I can't find any way to force Spark to re-run partition 
discovery, so every query is executed on the same data.
I tried --conf spark.sql.parquet.cacheMetadata=false but without success.

Is there any option to make this happen?


App 1 - periodically (e.g. every hour)
{code}
df.write.partitionBy("day").mode("append").parquet("some_location")
{code}


App 2 
{code}
sqlContext.read.parquet("some_location").registerTempTable("t")
sqlContext.sql("select * from t where day = 20151027").count()
{code}
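
One unverified workaround sketch (not from this ticket): have App 2 rebuild the DataFrame before each query, so the parquet relation, and with it the partition listing, is created fresh instead of being reused:

{code}
import org.apache.spark.sql.SQLContext

// Re-read the path before every query; a fresh read should list the directories
// again, so partitions appended by App 1 since the last read become visible.
def countForDay(sqlContext: SQLContext, day: Int): Long = {
  val df = sqlContext.read.parquet("some_location")
  df.registerTempTable("t")   // replaces the previously registered "t"
  sqlContext.sql(s"select * from t where day = $day").count()
}
{code}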




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6488) Support addition/multiplication in PySpark's BlockMatrix

2015-10-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6488.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9139
[https://github.com/apache/spark/pull/9139]

> Support addition/multiplication in PySpark's BlockMatrix
> 
>
> Key: SPARK-6488
> URL: https://issues.apache.org/jira/browse/SPARK-6488
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Mike Dusenberry
> Fix For: 1.6.0
>
>
> This JIRA is to add addition/multiplication to BlockMatrix in PySpark. We 
> should reuse the Scala implementation instead of having a separate 
> implementation in Python.
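
For reference, a small sketch of the existing Scala API that the Python wrappers would delegate to (the 2x2 block below is made up, and a live SparkContext `sc` is assumed):

{code}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// A 2x2 matrix stored as a single block at grid position (0, 0).
val blocks = sc.parallelize(Seq(
  ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0)))
))
val a = new BlockMatrix(blocks, 2, 2)

val sum     = a.add(a)        // element-wise addition
val product = a.multiply(a)   // block-wise matrix multiplication
println(sum.toLocalMatrix())
{code}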



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11354) Expose custom log4j to executor page in Spark standalone cluster

2015-10-27 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-11354:
-
Attachment: custom log4j on executor page.png

> Expose custom log4j to executor page in Spark standalone cluster 
> -
>
> Key: SPARK-11354
> URL: https://issues.apache.org/jira/browse/SPARK-11354
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Yongjia Wang
> Attachments: custom log4j on executor page.png
>
>
> Spark uses log4j, which is very flexible. However, on the executor page in a 
> standalone cluster, only stdout and stderr are shown in the UI. In the 
> default log4j profile, all messages are forwarded to System.err, which is in 
> turn written to the stderr file in the executor directory. Similarly, stdout is 
> written to the stdout file in the executor directory. 
> It would be very useful to show all the file appenders configured in a custom 
> log4j profile. Right now, these file appenders write to the executor 
> directory, but they are not exposed in the UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11354) Expose custom log4j to executor page in Spark standalone cluster

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11354:
--
Priority: Minor  (was: Major)

> Expose custom log4j to executor page in Spark standalone cluster 
> -
>
> Key: SPARK-11354
> URL: https://issues.apache.org/jira/browse/SPARK-11354
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Yongjia Wang
>Priority: Minor
> Attachments: custom log4j on executor page.png
>
>
> Spark uses log4j, which is very flexible. However, on the executor page in a 
> standalone cluster, only stdout and stderr are shown in the UI. In the 
> default log4j profile, all messages are forwarded to System.err, which is in 
> turn written to the stderr file in the executor directory. Similarly, stdout is 
> written to the stdout file in the executor directory. 
> It would be very useful to show all the file appenders configured in a custom 
> log4j profile. Right now, these file appenders write to the executor 
> directory, but they are not exposed in the UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11356) Option to refresh information about parquet partitions

2015-10-27 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11356:
---
Description: 
I have two apps:
1) First one periodically append data to parquet (which creates new partition)
2) Second one executes query on data

Right now I can't find any possibility to force Spark to make partition 
discovery. So every query is executed on the same data.
I tried --conf spark.sql.parquet.cacheMetadata=false but without success.

Is there any option to make this happen ?


App 1 - periodically (eg. every hour)
{code}
df.write.partitionBy("day").mode("append").parquet("some_location")
{code}


App 2 - example
{code}
sqlContext.read.parquet("some_location").registerTempTable("t")
sqlContext.sql("select * from t where day = 20151027").count()
{code}


  was:
I have two apps:
1) First one periodically append data to parquet (which creates new partition)
2) Second one executes query on data

Right now I can't find any possibility to force Spark to make partition 
discovery. So every query is executed on the same data.
I tried --conf spark.sql.parquet.cacheMetadata=false but without success.

Is there any option to make this happen ?


App 1 - periodically (eg. every hour)
{code}
df.write.partitionBy("day").mode("append").parquet("some_location")
{code}


App 2 
{code}
sqlContext.read.parquet("some_location").registerTempTable("t")
sqlContext.sql("select * from t where day = 20151027").count()
{code}



> Option to refresh information about parquet partitions
> --
>
> Key: SPARK-11356
> URL: https://issues.apache.org/jira/browse/SPARK-11356
> Project: Spark
>  Issue Type: Improvement
>Reporter: Maciej Bryński
>
> I have two apps:
> 1) First one periodically append data to parquet (which creates new partition)
> 2) Second one executes query on data
> Right now I can't find any possibility to force Spark to make partition 
> discovery. So every query is executed on the same data.
> I tried --conf spark.sql.parquet.cacheMetadata=false but without success.
> Is there any option to make this happen ?
> App 1 - periodically (eg. every hour)
> {code}
> df.write.partitionBy("day").mode("append").parquet("some_location")
> {code}
> App 2 - example
> {code}
> sqlContext.read.parquet("some_location").registerTempTable("t")
> sqlContext.sql("select * from t where day = 20151027").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11355) Spark 1.5.1 compile failure with scala 2.11

2015-10-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11355.
---
Resolution: Cannot Reproduce

I can't reproduce this, and it looks like a failure from within the compiler 
plugin, not Spark. Try killing any stray zinc that's running.
(Also I'm not clear you ran the script to set up for 2.11 compilation, but 
that's not the problem.)

> Spark 1.5.1 compile failure with scala 2.11
> ---
>
> Key: SPARK-11355
> URL: https://issues.apache.org/jira/browse/SPARK-11355
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.1
> Environment: - ubuntu 15.04
> - maven 3.3.3
> - encrypted hard drive
> - no zinc server installed
>Reporter: Lauri Niskanen
>
> The log is from a checkout of v1.5.2-rc1, but it was no different from 1.5.1.
> mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package -X
> 
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 01:42 min
> [INFO] Finished at: 2015-10-27T18:45:04+02:00
> [INFO] Final Memory: 53M/726M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
> project spark-core_2.11: Execution scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute 
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
> (scala-compile-first) on project spark-core_2.11: Execution 
> scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:224)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
>   at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
>   at 
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
>   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
>   at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
>   at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
>   at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
>   at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
> scala-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
>   at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
>   at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
>   ... 20 more
> Caused by: Compile failed via zinc server
>   at 
> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
>   at 
> sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
>   at 
> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
>   at 
> scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
>   at 
> scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
>   at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
>   at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
>   ... 21 
