[jira] [Comment Edited] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623704#comment-14623704 ] Yanbo Liang edited comment on SPARK-9003 at 7/12/15 8:19 AM: - Yes, I can provide an example that shows the benefit of these functions. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vectors/Matrices, which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only write the following: val prediction = Vectors.dense(originalPrediction.toArray.map(x => x*10)) assert(prediction ~== expected absTol 0.01, "prediction error") If we support map/update for Vector, we can write: assert(originalPrediction.map(x => x*10) ~== expected absTol 0.01, "prediction error") was (Author: yanboliang): Yes, I can provide an example that shows the benefit of these functions. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vectors/Matrices, which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only write the following: val prediction = Vectors.dense(originalPrediction.toArry.map(x => x*10)) assert(prediction ~== expected absTol 0.01, "prediction error") If we support map/update for Vector, we can write: assert(originalPrediction.map(x => x*10) ~== expected absTol 0.01, "prediction error") Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector lacks map/update functions, which is inconvenient for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each element of a and get a Vector as the return value, we can only write: val b = Vectors.dense(a.toArray.map(math.log)) This snippet is not elegant; we would like to be able to write: val c = a.map(math.log) Also, MLlib/Matrix already implements map/update; I think Vector should have map/update as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623704#comment-14623704 ] Yanbo Liang commented on SPARK-9003: Yes, I can provide an example that shows the benefit of these functions. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vectors/Matrices, which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only write the following: val prediction = Vectors.dense(originalPrediction.toArray.map(x => x*10)) assert(prediction ~== expected absTol 0.01, "prediction error") If we support map/update for Vector, we can write: assert(originalPrediction.map(x => x*10) ~== expected absTol 0.01, "prediction error") Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector lacks map/update functions, which is inconvenient for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each element of a and get a Vector as the return value, we can only write: val b = Vectors.dense(a.toArray.map(math.log)) This snippet is not elegant; we would like to be able to write: val c = a.map(math.log) Also, MLlib/Matrix already implements map/update; I think Vector should have map/update as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
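For context, a minimal sketch of what such a map could look like as an extension on the current API. This is not an existing MLlib method: Vectors.dense and toArray are real MLlib calls, while VectorMapOps is a hypothetical helper used only for illustration.
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical extension method; MLlib's Vector does not provide map today.
implicit class VectorMapOps(v: Vector) {
  // Apply f to every element and return a new dense Vector.
  def map(f: Double => Double): Vector = Vectors.dense(v.toArray.map(f))
}

val originalPrediction = Vectors.dense(Array(1.0, 2.0, 3.0))
val scaled = originalPrediction.map(_ * 10)  // [10.0, 20.0, 30.0]
{code}
With a helper like this (or a built-in method), the test above can drop the explicit toArray round trip.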
[jira] [Created] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
Davies Liu created SPARK-9006: - Summary: TimestampType may lose a microsecond after a round trip in Python DataFrame Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-9006: -- Description: This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. (was: This bug causes SQLTests.test_time_with_timezone flaky.) TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9006: --- Assignee: Apache Spark (was: Davies Liu) TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9006: --- Assignee: Davies Liu (was: Apache Spark) TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624144#comment-14624144 ] Apache Spark commented on SPARK-9006: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7363 TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9007) start-slave.sh changed API in 1.4 and the documentation got updated to mention the old API
Jesper Lundgren created SPARK-9007: -- Summary: start-slave.sh changed API in 1.4 and the documentation got updated to mention the old API Key: SPARK-9007 URL: https://issues.apache.org/jira/browse/SPARK-9007 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0 Reporter: Jesper Lundgren In Spark versions before 1.4, start-slave.sh accepted two parameters: worker# and a list of master addresses. With Spark 1.4 the start-slave.sh worker# parameter was removed, which broke our custom standalone cluster setup. With Spark 1.4 the documentation was also updated to mention start-slave.sh (not previously mentioned), but it describes the old API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9008) Stop and remove driver from supervised mode in spark-master interface
Jesper Lundgren created SPARK-9008: -- Summary: Stop and remove driver from supervised mode in spark-master interface Key: SPARK-9008 URL: https://issues.apache.org/jira/browse/SPARK-9008 Project: Spark Issue Type: New Feature Reporter: Jesper Lundgren The cluster will automatically restart failing drivers when launched in supervised cluster mode. However, there is no official way for an operations team to stop and remove a driver from restarting in case it is malfunctioning. I know there is bin/spark-class org.apache.spark.deploy.Client kill, but this is undocumented and does not always work well. It would be great if there were a way to remove supervised mode to allow kill -9 to work on a driver program. The documentation surrounding this could also see some improvements. It would be nice to have some best practice examples on how to work with supervised mode, how to manage graceful shutdown, and how to catch TERM signals. (A TERM signal will end with the wrong exit code and trigger a restart when using supervised mode unless you change the exit code in the application logic.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624152#comment-14624152 ] Feynman Liang commented on SPARK-5571: -- [~a...@jivesoftware.com], are you still working on this? I wanted to point out [CountVectorizer|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala] was recently merged and seems appropriate for this task. If you aren't working on this anymore, I would be happy to take this task. LDA should handle text as well -- Key: SPARK-5571 URL: https://issues.apache.org/jira/browse/SPARK-5571 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings). This plan is sketched in the [original LDA design doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. There should be: * a runWithText() method which takes an RDD with a collection of Strings (bags of words). This will also index terms and compute a dictionary. * a dictionary parameter for when LDA is run with word count vectors * prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
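As a rough illustration of what a runWithText() helper would need to do internally, here is a hedged sketch using only existing MLlib APIs. It assumes whitespace tokenization, a vocabulary small enough to collect on the driver, and an input file path; the variable names are illustrative and not part of any proposed API.
{code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// Assumed input: one bag of words per document ("docs.txt" is a placeholder path).
val docs: RDD[Seq[String]] =
  sc.textFile("docs.txt").map(_.toLowerCase.split("\\s+").toSeq)

// Build the dictionary (term -> index) that runWithText() would compute internally.
val vocab: Map[String, Int] =
  docs.flatMap(identity).distinct().collect().zipWithIndex.toMap

// Convert each document to a sparse vector of term counts keyed by document id.
val corpus = docs.zipWithIndex.map { case (tokens, id) =>
  val counts = tokens.groupBy(vocab).mapValues(_.size.toDouble).toSeq
  (id, Vectors.sparse(vocab.size, counts))
}

val ldaModel = new LDA().setK(10).run(corpus)
{code}
A CountVectorizer-based Transformer would replace the manual dictionary and counting steps in an ML pipeline setting.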
[jira] [Commented] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch
[ https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624151#comment-14624151 ] Jesper Lundgren commented on SPARK-8941: I've created two new JIRA tickets; can you review them to see if they are OK? https://issues.apache.org/jira/browse/SPARK-9007 https://issues.apache.org/jira/browse/SPARK-9008 Thanks! Standalone cluster worker does not accept multiple masters on launch Key: SPARK-8941 URL: https://issues.apache.org/jira/browse/SPARK-8941 Project: Spark Issue Type: Bug Components: Deploy, Documentation Affects Versions: 1.4.0, 1.4.1 Reporter: Jesper Lundgren Priority: Critical Before 1.4 it was possible to launch a worker node using a comma-separated list of master nodes. ex: sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078 starting org.apache.spark.deploy.worker.Worker, logging to /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf. 15/07/09 12:33:06 INFO Utils: Shutdown hook called Spark 1.2 and 1.3.1 accept multiple masters in this format. update: start-slave.sh only expects master lists in 1.4 (no instance number) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9008) Stop and remove driver from supervised mode in spark-master interface
[ https://issues.apache.org/jira/browse/SPARK-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesper Lundgren updated SPARK-9008: --- Description: The cluster will automatically restart failing drivers when launched in supervised cluster mode. However, there is no official way for an operations team to stop and remove a driver from restarting in case it is malfunctioning. I know there is bin/spark-class org.apache.spark.deploy.Client kill, but this is undocumented and does not always work well. It would be great if there were a way to remove supervised mode to allow kill -9 to work on a driver program. The documentation surrounding this could also see some improvements. It would be nice to have some best practice examples on how to work with supervised mode, how to manage graceful shutdown, and how to catch TERM signals. (A TERM signal will end with an exit code that triggers a restart in supervised mode unless you change the exit code in the application logic.) was: The cluster will automatically restart failing drivers when launched in supervised cluster mode. However there is no official way for an operations team to stop and remove a driver from restarting in case it is malfunctioning. I know there is bin/spark-class org.apache.spark.deploy.Client kill but this is undocumented and does not always work so well. It would be great if there was a way to remove supervised mode to allow kill -9 to work on a driver program. The documentation surrounding this could also see some improvements. It would be nice to have some best practice examples on how to work with supervised mode, how to manage graceful shutdown and catch TERM signals. (TERM signal will end with wrong exit code and trigger restart when using supervised mode unless you change the exit code in the application logic) Stop and remove driver from supervised mode in spark-master interface - Key: SPARK-9008 URL: https://issues.apache.org/jira/browse/SPARK-9008 Project: Spark Issue Type: New Feature Reporter: Jesper Lundgren The cluster will automatically restart failing drivers when launched in supervised cluster mode. However, there is no official way for an operations team to stop and remove a driver from restarting in case it is malfunctioning. I know there is bin/spark-class org.apache.spark.deploy.Client kill, but this is undocumented and does not always work well. It would be great if there were a way to remove supervised mode to allow kill -9 to work on a driver program. The documentation surrounding this could also see some improvements. It would be nice to have some best practice examples on how to work with supervised mode, how to manage graceful shutdown, and how to catch TERM signals. (A TERM signal will end with an exit code that triggers a restart in supervised mode unless you change the exit code in the application logic.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9006. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7363 [https://github.com/apache/spark/pull/7363] TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.5.0 This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8488) HOG Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8488: - Description: Histogram of oriented gradients (HOG) is a method utilizing local orientation (gradients and edges) to transform images into dense image descriptors (Dalal & Triggs, CVPR 2005, http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). HOG in Spark ML pipelines can be implemented as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the transformer should output an Array[Array[Numeric]] of the HOG features for the provided image. HOG and SIFT are similar in that they both represent images using local orientation histograms. In contrast to SIFT, however, HOG uses overlapping spatial blocks and is computed densely across all pixels. was: Histogram of oriented gradients (HOG) is a method utilizing local orientation (gradients and edges) to transform images into dense image descriptors (Dalal & Triggs, CVPR 2005, http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). HOG in Spark ML pipelines can be implemented as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Array[Numeric]] of the HOG features for the provided image. HOG and SIFT are similar in that they both represent images using local orientation histograms. In contrast to SIFT, however, HOG uses overlapping spatial blocks and is computed densely across all pixels. HOG Feature Transformer --- Key: SPARK-8488 URL: https://issues.apache.org/jira/browse/SPARK-8488 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Priority: Minor Histogram of oriented gradients (HOG) is a method utilizing local orientation (gradients and edges) to transform images into dense image descriptors (Dalal & Triggs, CVPR 2005, http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). HOG in Spark ML pipelines can be implemented as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the transformer should output an Array[Array[Numeric]] of the HOG features for the provided image. HOG and SIFT are similar in that they both represent images using local orientation histograms. In contrast to SIFT, however, HOG uses overlapping spatial blocks and is computed densely across all pixels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
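A minimal sketch of the core computation such a Transformer would perform: a magnitude-weighted orientation histogram for one cell of a grayscale image given as Array[Array[Double]]. Block overlap, normalization, and the ml.Transformer wrapper are omitted, and cellHistogram is an illustrative name rather than an existing Spark API.
{code}
// Compute an orientation histogram for one cell of a grayscale image.
def cellHistogram(img: Array[Array[Double]], bins: Int = 9): Array[Double] = {
  val hist = new Array[Double](bins)
  for (y <- 1 until img.length - 1; x <- 1 until img(y).length - 1) {
    val gx = img(y)(x + 1) - img(y)(x - 1)   // horizontal gradient
    val gy = img(y + 1)(x) - img(y - 1)(x)   // vertical gradient
    val magnitude = math.sqrt(gx * gx + gy * gy)
    // Unsigned orientation in [0, Pi), mapped to a bin index.
    val angle = (math.atan2(gy, gx) + math.Pi) % math.Pi
    val bin = math.min(bins - 1, (angle / math.Pi * bins).toInt)
    hist(bin) += magnitude                   // magnitude-weighted vote
  }
  hist
}
{code}
A full HOG descriptor would concatenate such histograms over a dense grid of cells and normalize them within overlapping blocks.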
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623956#comment-14623956 ] Josh Rosen commented on SPARK-4879: --- [~darabos], do you think that this issue might have been resolved in an earlier Spark version but inadvertently broken in the upgrade to 1.4.0? If you have an easy reproduction, it might be helpful to see whether the problem occurs on 1.3.1. Missing output partitions after job completes with speculative execution Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 Attachments: speculation.txt, speculation2.txt When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing. h3. Reproduction This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}: {code} // Rig a job such that all but one of the tasks complete instantly // and one task runs for 20 seconds on its first attempt and instantly // on its second attempt: val numTasks = 100 sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) => if (ctx.partitionId == 0) { // If this is the one task that should run really slow if (ctx.attemptId == 0) { // If this is the first attempt, run slow Thread.sleep(20 * 1000) } } iter }.map(x => (x, x)).saveAsTextFile("/test4") {code} When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task: {code} [...] 
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100) 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at <console>:22) finished in 0.856 s 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:22, took 0.885438374 s 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} One interesting thing to note about this stack trace: if we look at {{FileOutputCommitter.java:160}} ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]), this point in the execution seems to correspond to a case where a task completes, attempts to commit its output, fails for some reason, then deletes the destination file, tries again, and fails: {code} if
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624000#comment-14624000 ] Daniel Darabos commented on SPARK-4879: --- Good idea! I'll try with 1.3.1 next week. Missing output partitions after job completes with speculative execution Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 Attachments: speculation.txt, speculation2.txt When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing. h3. Reproduction This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}: {code} // Rig a job such that all but one of the tasks complete instantly // and one task runs for 20 seconds on its first attempt and instantly // on its second attempt: val numTasks = 100 sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) => if (ctx.partitionId == 0) { // If this is the one task that should run really slow if (ctx.attemptId == 0) { // If this is the first attempt, run slow Thread.sleep(20 * 1000) } } iter }.map(x => (x, x)).saveAsTextFile("/test4") {code} When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task: {code} [...] 
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100) 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at <console>:22) finished in 0.856 s 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:22, took 0.885438374 s 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} One interesting thing to note about this stack trace: if we look at {{FileOutputCommitter.java:160}} ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]), this point in the execution seems to correspond to a case where a task completes, attempts to commit its output, fails for some reason, then deletes the destination file, tries again, and fails: {code} if (fs.isFile(taskOutput)) { 152 Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, 153 getTempTaskOutputPath(context)); 154 if (!fs.rename(taskOutput,
[jira] [Created] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
Feynman Liang created SPARK-9005: Summary: RegressionMetrics computing incorrect explainedVariance and r2 Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
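For reference, a small sketch of the Wikipedia quantities discussed above, computed directly from (prediction, label) pairs in plain Scala; this is illustrative only and is not the RegressionMetrics implementation, and the function name is made up for the example.
{code}
// Returns (fraction of variance unexplained, unadjusted R2) for (prediction, label) pairs.
def regressionSummary(pairs: Seq[(Double, Double)]): (Double, Double) = {
  val meanLabel = pairs.map(_._2).sum / pairs.size
  val ssRes = pairs.map { case (p, y) => math.pow(y - p, 2) }.sum         // residual sum of squares
  val ssTot = pairs.map { case (_, y) => math.pow(y - meanLabel, 2) }.sum // total sum of squares
  val fractionUnexplained = ssRes / ssTot   // Wikipedia's fraction of variance unexplained
  val r2 = 1.0 - ssRes / ssTot              // unadjusted R2, no degrees-of-freedom correction
  (fractionUnexplained, r2)
}
{code}
An adjusted R2 would additionally divide both sums of squares by their degrees of freedom, which requires knowing the number of regression variables.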
[jira] [Commented] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch
[ https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623952#comment-14623952 ] Josh Rosen commented on SPARK-8941: --- SGTM; do you want to open a new JIRA to follow up on the documentation issues, plus separate issues for the other problems you've identified? If you do this, just link the issues here and I'll close this one out. Thanks! Standalone cluster worker does not accept multiple masters on launch Key: SPARK-8941 URL: https://issues.apache.org/jira/browse/SPARK-8941 Project: Spark Issue Type: Bug Components: Deploy, Documentation Affects Versions: 1.4.0, 1.4.1 Reporter: Jesper Lundgren Priority: Critical Before 1.4 it was possible to launch a worker node using a comma-separated list of master nodes. ex: sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078 starting org.apache.spark.deploy.worker.Worker, logging to /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf. 15/07/09 12:33:06 INFO Utils: Shutdown hook called Spark 1.2 and 1.3.1 accept multiple masters in this format. update: start-slave.sh only expects master lists in 1.4 (no instance number) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623944#comment-14623944 ] Feynman Liang commented on SPARK-9005: -- I will be working on this. RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8450) PySpark write.parquet raises Unsupported datatype DecimalType()
[ https://issues.apache.org/jira/browse/SPARK-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623999#comment-14623999 ] Peter Hoffmann commented on SPARK-8450: --- I have tried it with today's spark-1.5.0-SNAPSHOT-bin-hadoop2.6 daily build from http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/ and was able to save DecimalType(16,2) as parquet in Python. Thanks for the quick fix! PySpark write.parquet raises Unsupported datatype DecimalType() --- Key: SPARK-8450 URL: https://issues.apache.org/jira/browse/SPARK-8450 Project: Spark Issue Type: Bug Components: PySpark, SQL Environment: Spark 1.4.0 on Debian Reporter: Peter Hoffmann Assignee: Davies Liu Fix For: 1.5.0 I'm getting an Exception when I try to save a DataFrame with a DecimalType as a parquet file. Minimal Example: {code} from decimal import Decimal from pyspark.sql import SQLContext from pyspark.sql.types import * sqlContext = SQLContext(sc) schema = StructType([ StructField('id', LongType()), StructField('value', DecimalType())]) rdd = sc.parallelize([[1, Decimal(0.5)],[2, Decimal(2.9)]]) df = sqlContext.createDataFrame(rdd, schema) df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite') {code} Stack Trace {code} --- Py4JJavaError Traceback (most recent call last) <ipython-input-19-a77dac8de5f3> in <module>() 1 sr.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite') /home/spark/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in parquet(self, path, mode) 367 :param mode: one of `append`, `overwrite`, `error`, `ignore` (default: error) 368 --> 369 return self._jwrite.mode(mode).parquet(path) 370 371 @since(1.4) /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o361.parquet. : org.apache.spark.SparkException: Job aborted. 
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:138) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:114) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at
[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624086#comment-14624086 ] Patrick Wendell commented on SPARK-2089: Yeah - we can open it again later if someone who maintains this code wants to work on this feature. I just want this JIRA to reflect the current status (i.e. for 5 versions there hasn't been any action in Spark), which is that it is not actively being fixed, and to make sure the documentation correctly reflects what we have now, to discourage the use of a feature that does not work. With YARN, preferredNodeLocalityData isn't honored --- Key: SPARK-2089 URL: https://issues.apache.org/jira/browse/SPARK-2089 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical When running in YARN cluster mode, apps can pass preferred locality data when constructing a Spark context that will dictate where to request executor containers. This is currently broken because of a race condition. The Spark-YARN code runs the user class and waits for it to start up a SparkContext. During its initialization, the SparkContext will create a YarnClusterScheduler, which notifies a monitor in the Spark-YARN code. The Spark-YARN code then immediately fetches the preferredNodeLocationData from the SparkContext and uses it to start requesting containers. But in the SparkContext constructor that takes the preferredNodeLocationData, setting preferredNodeLocationData comes after the rest of the initialization, so, if the Spark-YARN code comes around quickly enough after being notified, the data that's fetched is the empty unset version. This occurred during all of my runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9005: - Description: {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. (was: {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2.) RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed
[ https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624113#comment-14624113 ] Apache Spark commented on SPARK-8743: - User 'nssalian' has created a pull request for this issue: https://github.com/apache/spark/pull/7362 Deregister Codahale metrics for streaming when StreamingContext is closed -- Key: SPARK-8743 URL: https://issues.apache.org/jira/browse/SPARK-8743 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.1 Reporter: Tathagata Das Assignee: Neelesh Srinivas Salian Labels: starter Currently, when the StreamingContext is closed, the registered metrics are not deregistered. If another streaming context is started, it throws a warning saying that the metrics are already registered. The solution is to deregister the metrics when the StreamingContext is stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
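An illustrative sketch of the deregistration step at the Dropwizard/Codahale level (MetricRegistry.removeMatching and MetricFilter are real Codahale APIs); how and where this is wired into StreamingContext.stop() is left to the actual fix, and the method name below is an assumption.
{code}
import com.codahale.metrics.{Metric, MetricFilter, MetricRegistry}

// Illustrative only: remove all metrics registered under a streaming source's
// prefix when the StreamingContext stops, so a new context can re-register them.
def deregisterStreamingMetrics(registry: MetricRegistry, prefix: String): Unit = {
  registry.removeMatching(new MetricFilter {
    override def matches(name: String, metric: Metric): Boolean = name.startsWith(prefix)
  })
}
{code}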
[jira] [Resolved] (SPARK-8880) Fix confusing Stage.attemptId member variable
[ https://issues.apache.org/jira/browse/SPARK-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-8880. --- Resolution: Fixed Fix Version/s: 1.5.0 Fix confusing Stage.attemptId member variable - Key: SPARK-8880 URL: https://issues.apache.org/jira/browse/SPARK-8880 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.5.0 This variable very confusingly refers to the *next* stageId that should be used, making this code especially hard to understand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8956) Rollup produces incorrect result when group by contains expressions
[ https://issues.apache.org/jira/browse/SPARK-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624121#comment-14624121 ] Cheng Hao commented on SPARK-8956: -- Sorry, I didn't notice this jira issue when I created this issue SPARK-8972. Rollup produces incorrect result when group by contains expressions --- Key: SPARK-8956 URL: https://issues.apache.org/jira/browse/SPARK-8956 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yana Kadiyska Rollup produces incorrect results when group clause contains an expression {code}case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("foo") sqlContext.sql("select count(*) as cnt, key % 100 as key, GROUPING__ID from foo group by key%100 with rollup").show(100) {code} As a workaround, this works correctly: {code} val df1 = df.withColumn("newkey", df("key") % 100) df1.registerTempTable("foo1") sqlContext.sql("select count(*) as cnt, newkey as key, GROUPING__ID as grp from foo1 group by newkey with rollup").show(100) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8972) Incorrect result for rollup
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-8972: - Description: {code:java} import sqlContext.implicits._ case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("foo") sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100) // output +---+---+------------+ |cnt|_c1|GROUPING__ID| +---+---+------------+ | 1| 4| 0| | 1| 4| 1| | 1| 5| 0| | 1| 5| 1| | 1| 1| 0| | 1| 1| 1| | 1| 2| 0| | 1| 2| 1| | 1| 3| 0| | 1| 3| 1| +---+---+------------+ {code} After checking the code, it seems we don't support complex expressions (only simple column names) as GROUP BY keys for rollup, nor for cube. It does not even report an error if we have a complex expression in the rollup keys, hence we get the very confusing result in the example above. was: {code:java} import sqlContext.implicits._ case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("foo") sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100) // output +---+---+------------+ |cnt|_c1|GROUPING__ID| +---+---+------------+ | 1| 4| 0| | 1| 4| 1| | 1| 5| 0| | 1| 5| 1| | 1| 1| 0| | 1| 1| 1| | 1| 2| 0| | 1| 2| 1| | 1| 3| 0| | 1| 3| 1| +---+---+------------+ {code} Incorrect result for rollup --- Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Critical {code:java} import sqlContext.implicits._ case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("foo") sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100) // output +---+---+------------+ |cnt|_c1|GROUPING__ID| +---+---+------------+ | 1| 4| 0| | 1| 4| 1| | 1| 5| 0| | 1| 5| 1| | 1| 1| 0| | 1| 1| 1| | 1| 2| 0| | 1| 2| 1| | 1| 3| 0| | 1| 3| 1| +---+---+------------+ {code} After checking the code, it seems we don't support complex expressions (only simple column names) as GROUP BY keys for rollup, nor for cube. It does not even report an error if we have a complex expression in the rollup keys, hence we get the very confusing result in the example above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8415) Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock
[ https://issues.apache.org/jira/browse/SPARK-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624058#comment-14624058 ] Josh Rosen commented on SPARK-8415: --- I figured out how to configure AMPLab Jenkins to use a separate ivy cache for each pull request builder workspace. In the Jenkins environment / properties injection, I added the following lines {code} HOME=/home/sparkivy/${JOB_NAME}_${EXECUTOR_NUMBER} SBT_OPTS=-Duser.home=/home/sparkivy/${JOB_NAME}_${EXECUTOR_NUMBER} -Dsbt.ivy.home=/home/sparkivy/${JOB_NAME}_${EXECUTOR_NUMBER}/.ivy2 {code} Here, {{/home/sparkivy}} is a directory that's outside of the build workspace so it won't be deleted by the {{git clean -fdx}} in our Jenkins build. The substitutions ensure that each build gets its own independent directory. I'm going to mark this issue as resolved since I'm switching the main SparkPullRequestBuilder to use this configuration change. Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock --- Key: SPARK-8415 URL: https://issues.apache.org/jira/browse/SPARK-8415 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Josh Rosen When watching a pull request build, I noticed that the compilation + packaging + test compilation phases spent huge amounts of time waiting to acquire the Ivy cache lock. We should see whether we can tell SBT to skip the resolution steps for some of these commands, since this could speed up the compilation process when Jenkins is heavily loaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8415) Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock
[ https://issues.apache.org/jira/browse/SPARK-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8415. --- Resolution: Fixed Assignee: Josh Rosen Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock --- Key: SPARK-8415 URL: https://issues.apache.org/jira/browse/SPARK-8415 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Josh Rosen Assignee: Josh Rosen When watching a pull request build, I noticed that the compilation + packaging + test compilation phases spent huge amounts of time waiting to acquire the Ivy cache lock. We should see whether we can tell SBT to skip the resolution steps for some of these commands, since this could speed up the compilation process when Jenkins is heavily loaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8415) Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock
[ https://issues.apache.org/jira/browse/SPARK-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624061#comment-14624061 ] Josh Rosen commented on SPARK-8415: --- Oh, and I also added a {{mkdir -p $HOME}} to the execute shell command. Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock --- Key: SPARK-8415 URL: https://issues.apache.org/jira/browse/SPARK-8415 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Josh Rosen Assignee: Josh Rosen When watching a pull request build, I noticed that the compilation + packaging + test compilation phases spent huge amounts of time waiting to acquire the Ivy cache lock. We should see whether we can tell SBT to skip the resolution steps for some of these commands, since this could speed up the compilation process when Jenkins is heavily loaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9005: --- Assignee: (was: Apache Spark) RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624077#comment-14624077 ] Apache Spark commented on SPARK-9005: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7361 RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9005: --- Assignee: Apache Spark RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Assignee: Apache Spark {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9005: - Description: {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) where the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegresionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. was: {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) where the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegresionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) where the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegresionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
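To make the distinction concrete, here is a minimal plain-Scala sketch (illustrative only, not the {{RegressionMetrics}} implementation; the sample predictions and labels are made up). It contrasts the variance of the residuals, which is what {{summary.variance(1)}} gives, with the residual sum of squares that the fraction-of-variance-unexplained definition is built on, and computes an unadjusted R2 with no degrees-of-freedom correction:
{code}
// Toy (prediction, label) pairs, made up for illustration.
val predictionsAndLabels = Seq((2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0))

val n = predictionsAndLabels.size.toDouble
val labels = predictionsAndLabels.map(_._2)
val labelMean = labels.sum / n

// Residual sum of squares: what the Wikipedia FVU definition uses; on a summarizer
// over the residual column this is math.pow(summary.normL2(1), 2).
val ssErr = predictionsAndLabels.map { case (p, l) => (l - p) * (l - p) }.sum

// Total sum of squares of the labels around their mean.
val ssTot = labels.map(l => (l - labelMean) * (l - labelMean)).sum

// Sample variance of the residuals: what summary.variance(1) gives. Beyond the
// obvious scale factor, this only tracks ssErr when the residual mean is zero.
val residuals = predictionsAndLabels.map { case (p, l) => l - p }
val residualMean = residuals.sum / n
val residualVariance =
  residuals.map(r => (r - residualMean) * (r - residualMean)).sum / (n - 1)

// Unadjusted R2 per the usual definition, with no degrees-of-freedom adjustment.
val r2 = 1.0 - ssErr / ssTot

println(s"ssErr=$ssErr residualVariance=$residualVariance unadjustedR2=$r2")
{code}
The two quantities only coincide (up to scaling) when the residuals are centered at zero, which is the unbiased-predictor case the updated description calls out.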
[jira] [Comment Edited] (SPARK-8997) Improve LocalPrefixSpan performance
[ https://issues.apache.org/jira/browse/SPARK-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623292#comment-14623292 ] Feynman Liang edited comment on SPARK-8997 at 7/12/15 11:43 PM: Why PrimitiveKeyOpenHashMap if keys will be Array[Int] (and later Array[Array[Item]]), which are not primitive and will not benefit from @specialized annotations? I'm also not clear on what is meant by 3; aren't list and array both eager (did you mean to use a Stream (lazy) or ArrayBuffer (in-place update))? Which part of the code exactly are you referring to? was (Author: fliang): Why PrimitiveKeyOpenHashMap if keys will be Array[Int] (and later Array[Array[Item]]), which are not primitive and will not benefit from @specialized annotations? I'm also not clear on what is meant by 3; aren't list and array both eager (did you mean to use a Stream)? Which part of the code exactly are you referring to? Improve LocalPrefixSpan performance --- Key: SPARK-8997 URL: https://issues.apache.org/jira/browse/SPARK-8997 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Feynman Liang Original Estimate: 24h Remaining Estimate: 24h We can improve the performance by: 1. run should output Iterator instead of Array 2. Local count shouldn't use groupBy, which creates too many arrays. We can use PrimitiveKeyOpenHashMap 3. We can use list to avoid materialize frequent sequences -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
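For point 3 above, the difference between an eager and a lazy output can be shown with a small toy that is not LocalPrefixSpan code; {{expand}}, {{frequentEager}} and {{frequentLazy}} are made-up names standing in for "grow a prefix by one item" and for the two output strategies being discussed:
{code}
// Stand-in for extending a prefix by one candidate item.
def expand(prefix: List[Int]): Seq[List[Int]] =
  (1 to 3).map(item => item :: prefix)

// Eager: every candidate sequence at every depth is materialized up front.
def frequentEager(prefix: List[Int], depth: Int): List[List[Int]] =
  if (depth == 0) Nil
  else expand(prefix).toList.flatMap(p => p :: frequentEager(p, depth - 1))

// Lazy: an Iterator yields sequences on demand, so the caller can consume them
// (or write them out) without holding the whole result set in memory at once.
def frequentLazy(prefix: List[Int], depth: Int): Iterator[List[Int]] =
  if (depth == 0) Iterator.empty
  else expand(prefix).iterator.flatMap(p => Iterator(p) ++ frequentLazy(p, depth - 1))

// Only as many elements as requested are actually computed here.
frequentLazy(Nil, depth = 4).take(5).foreach(println)
{code}
A Stream would memoize what it produces while an Iterator does not, which is roughly the eager-versus-lazy trade-off the comment is asking about.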
[jira] [Assigned] (SPARK-8997) Improve LocalPrefixSpan performance
[ https://issues.apache.org/jira/browse/SPARK-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8997: --- Assignee: Apache Spark (was: Feynman Liang) Improve LocalPrefixSpan performance --- Key: SPARK-8997 URL: https://issues.apache.org/jira/browse/SPARK-8997 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Apache Spark Original Estimate: 24h Remaining Estimate: 24h We can improve the performance by: 1. run should output Iterator instead of Array 2. Local count shouldn't use groupBy, which creates too many arrays. We can use PrimitiveKeyOpenHashMap 3. We can use list to avoid materialize frequent sequences -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8997) Improve LocalPrefixSpan performance
[ https://issues.apache.org/jira/browse/SPARK-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624068#comment-14624068 ] Apache Spark commented on SPARK-8997: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7360 Improve LocalPrefixSpan performance --- Key: SPARK-8997 URL: https://issues.apache.org/jira/browse/SPARK-8997 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Feynman Liang Original Estimate: 24h Remaining Estimate: 24h We can improve the performance by: 1. run should output Iterator instead of Array 2. Local count shouldn't use groupBy, which creates too many arrays. We can use PrimitiveKeyOpenHashMap 3. We can use list to avoid materialize frequent sequences -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8997) Improve LocalPrefixSpan performance
[ https://issues.apache.org/jira/browse/SPARK-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8997: --- Assignee: Feynman Liang (was: Apache Spark) Improve LocalPrefixSpan performance --- Key: SPARK-8997 URL: https://issues.apache.org/jira/browse/SPARK-8997 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Feynman Liang Original Estimate: 24h Remaining Estimate: 24h We can improve the performance by: 1. run should output Iterator instead of Array 2. Local count shouldn't use groupBy, which creates too many arrays. We can use PrimitiveKeyOpenHashMap 3. We can use list to avoid materialize frequent sequences -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
kumar ranganathan created SPARK-9009: Summary: SPARK Encryption FileNotFoundException for truststore Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.4.0 Reporter: kumar ranganathan I got FileNotFoundException in the application master when running the SparkPi example in windows machine. The problem is that the truststore file found in C:\Spark\conf\spark.truststore location but getting below exception as {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.init(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} This exception throws from SecurityManager.scala at the line of openstream() shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore - fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when removed truststore property in spark-defaults.conf. When disabled the encryption property to spark.ssl.enabled as false then the job completed successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kumar ranganathan updated SPARK-9009: - Description: I got FileNotFoundException in the application master when running the SparkPi example in windows machine. The problem is that the truststore file found in C:\Spark\conf\spark.truststore location but getting below exception as {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.init(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If i change the truststore file location to different drive (d:\spark_conf\spark.truststore) then getting exception as {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception throws from SecurityManager.scala at the line of openstream() shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore - fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when removed truststore property in spark-defaults.conf. When disabled the encryption property to set spark.ssl.enabled as false then the job completed successfully. was: I got FileNotFoundException in the application master when running the SparkPi example in windows machine. 
The problem is that the truststore file found in C:\Spark\conf\spark.truststore location but getting below exception as {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.init(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at
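Since the stack trace shows the failure coming straight from opening the configured file, one low-effort check (a hypothetical diagnostic, not part of Spark) is to confirm, on the host where the ApplicationMaster container actually runs, whether that JVM can see the path configured for spark.ssl.trustStore at all:
{code}
import java.io.File

// Value taken from spark-defaults.conf; adjust to the path actually configured.
val trustStorePath = "C:\\Spark\\conf\\spark.truststore"
val f = new File(trustStorePath)
println(s"exists=${f.exists}, isFile=${f.isFile}, canRead=${f.canRead}")
{code}
If the file only exists on the submitting Windows machine while the ApplicationMaster runs on a different host, the path would not resolve there; that is an assumption about the environment rather than a confirmed diagnosis of this report.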
[jira] [Assigned] (SPARK-8761) Master.removeApplication is not thread-safe but is called from multiple threads
[ https://issues.apache.org/jira/browse/SPARK-8761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8761: --- Assignee: (was: Apache Spark) Master.removeApplication is not thread-safe but is called from multiple threads --- Key: SPARK-8761 URL: https://issues.apache.org/jira/browse/SPARK-8761 Project: Spark Issue Type: Bug Components: Deploy Reporter: Shixiong Zhu Master.removeApplication is not thread-safe. But it's called both in the message loop of Master and MasterPage.handleAppKillRequest which runs in threads of the Web server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8761) Master.removeApplication is not thread-safe but is called from multiple threads
[ https://issues.apache.org/jira/browse/SPARK-8761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624236#comment-14624236 ] Apache Spark commented on SPARK-8761: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/7364 Master.removeApplication is not thread-safe but is called from multiple threads --- Key: SPARK-8761 URL: https://issues.apache.org/jira/browse/SPARK-8761 Project: Spark Issue Type: Bug Components: Deploy Reporter: Shixiong Zhu Master.removeApplication is not thread-safe. But it's called both in the message loop of Master and MasterPage.handleAppKillRequest which runs in threads of the Web server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8761) Master.removeApplication is not thread-safe but is called from multiple threads
[ https://issues.apache.org/jira/browse/SPARK-8761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8761: --- Assignee: Apache Spark Master.removeApplication is not thread-safe but is called from multiple threads --- Key: SPARK-8761 URL: https://issues.apache.org/jira/browse/SPARK-8761 Project: Spark Issue Type: Bug Components: Deploy Reporter: Shixiong Zhu Assignee: Apache Spark Master.removeApplication is not thread-safe. But it's called both in the message loop of Master and MasterPage.handleAppKillRequest which runs in threads of the Web server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
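The hazard and one common fix can be sketched generically; {{ToyMaster}} and {{RemoveApplication}} below are made-up names, not the Spark Master code. The idea is to funnel all state mutation through the single message-loop thread, so a web handler only ever enqueues a request instead of touching shared state directly:
{code}
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

sealed trait MasterMessage
case class RemoveApplication(appId: String) extends MasterMessage

object ToyMaster {
  // Mutable state owned exclusively by the message-loop thread.
  private val apps = mutable.HashSet("app-1", "app-2")
  private val mailbox = new LinkedBlockingQueue[MasterMessage]()

  // Safe to call from any thread (e.g. a web handler): it only enqueues.
  def requestRemoval(appId: String): Unit = mailbox.put(RemoveApplication(appId))

  // Runs only on the message-loop thread, so `apps` is never touched concurrently.
  def runLoopOnce(): Unit = mailbox.take() match {
    case RemoveApplication(appId) => apps -= appId
  }
}
{code}
Whether the actual fix routes the kill request through the Master's existing message loop or simply synchronizes removeApplication is a design choice for the linked pull request; the sketch only illustrates the first option.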
[jira] [Updated] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9003: --- Description: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. was: MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623704#comment-14623704 ] Yanbo Liang edited comment on SPARK-9003 at 7/12/15 8:38 AM: - Yes, I agree that this is not supposed to become yet another vector/matrix libaray. But I think map/update function is important enough to become the interface of vector just like the foreachActive which is supported at present. I can also provide an example which may be benefit of these function. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vector/Matrix which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only code as following: val prediction = Vectors.dense(originalPrediction.toArray.map(x = x*10)) assert(prediction ~== expected absTol 0.01, prediction error) If we support map/update for Vector, we can code as: assert(originalPrediction.map(x = x*10) ~== expected absTol 0.01, prediction error) However, MLlib/Matrix has already supported map/update/foreachActive function, and we can compare two Matrices use ~== effortless. was (Author: yanboliang): Yes, I can provide an example which may be benefit of these function. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vector/Matrix which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only code as following: val prediction = Vectors.dense(originalPrediction.toArray.map(x = x*10)) assert(prediction ~== expected absTol 0.01, prediction error) If we support map/update for Vector, we can code as: assert(originalPrediction.map(x = x*10) ~== expected absTol 0.01, prediction error) Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623703#comment-14623703 ] Sean Owen commented on SPARK-8981: -- I think it's OK if you can do this via the slf4j API and it doesn't add overhead. I am not sure Logging is actually going to be removed; it's not to be used by apps though. Logging can't use a SparkContext; it isn't always used in places where a SparkContext is available. I don't think that's important. MDC has static methods. Are you proposing to change the default log message or just make these values available? It might be less intrusive not to change the log output. Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623704#comment-14623704 ] Yanbo Liang edited comment on SPARK-9003 at 7/12/15 9:36 AM: - Yes, I agree that this is not supposed to become yet another vector/matrix libaray. But I think map/update function is important enough to become the interface of vector just like foreachActive which is supported at present. I can also provide an example which may be benefit of these function. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vector/Matrix which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only code as following: val prediction = Vectors.dense(originalPrediction.toArray.map(x = x*10)) assert(prediction ~== expected absTol 0.01, prediction error) If we support map/update for Vector, we can code as: assert(originalPrediction.map(x = x*10) ~== expected absTol 0.01, prediction error) However, MLlib/Matrix has already supported map/update/foreachActive function, and we can compare two Matrices use ~== effortless. was (Author: yanboliang): Yes, I agree that this is not supposed to become yet another vector/matrix libaray. But I think map/update function is important enough to become the interface of vector just like the foreachActive which is supported at present. I can also provide an example which may be benefit of these function. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vector/Matrix which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only code as following: val prediction = Vectors.dense(originalPrediction.toArray.map(x = x*10)) assert(prediction ~== expected absTol 0.01, prediction error) If we support map/update for Vector, we can code as: assert(originalPrediction.map(x = x*10) ~== expected absTol 0.01, prediction error) However, MLlib/Matrix has already supported map/update/foreachActive function, and we can compare two Matrices use ~== effortless. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
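For readers following along, one possible shape of the proposed operation is an element-wise transform over the vector's values, shown here as a throwaway enrichment rather than a change to the {{Vector}} trait ({{VectorMapOps}} and {{mapValues}} are illustrative names; the actual PR may differ, and a real implementation would presumably handle sparse vectors more carefully, since f(0.0) need not be 0.0):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Illustrative wrapper only: essentially the toArray workaround behind a method.
implicit class VectorMapOps(v: Vector) {
  def mapValues(f: Double => Double): Vector =
    Vectors.dense(v.toArray.map(f))
}

val a = Vectors.dense(1.0, 2.0, 3.0)
val c = a.mapValues(math.log)  // the proposal would let this read as a.map(math.log)
{code}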
[jira] [Assigned] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9003: --- Assignee: Apache Spark Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Assignee: Apache Spark Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9004) Add s3 bytes read/written metrics
Abhishek Modi created SPARK-9004: Summary: Add s3 bytes read/written metrics Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Abhishek Modi Priority: Minor s3 read/write metrics can be pretty useful in finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Modi updated SPARK-9004: - Affects Version/s: (was: 1.4.0) Add s3 bytes read/written metrics - Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Reporter: Abhishek Modi Priority: Minor s3 read/write metrics can be pretty useful in finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9003: --- Description: MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. was: MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get a Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9003) Add map/update function to MLlib/Vector
Yanbo Liang created SPARK-9003: -- Summary: Add map/update function to MLlib/Vector Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get a Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9003: --- Description: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. was: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623719#comment-14623719 ] Walter Petersen commented on SPARK-3155: Ok, fine. Thanks a lot. Support DecisionTree pruning Key: SPARK-3155 URL: https://issues.apache.org/jira/browse/SPARK-3155 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Improvement: accuracy, computation Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways. DecisionTree does not currently support pruning. Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed. A naive implementation prunes as follows: (1) Train a depth K tree using a training set. (2) Compute the optimal prediction at each node (including internal nodes) based on the training set. (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).) (4) For each pair of leafs with the same parent, compare the total error on the validation set made by the leafs’ predictions with the error made by the parent’s predictions. Remove the leafs if the parent has lower error. A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch. It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
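Step (4) of the naive procedure described above can be illustrated with a toy tree; {{ToyNode}} is a made-up structure carrying each node's total validation-set error, not MLlib's tree Node class:
{code}
case class ToyNode(
    validationError: Double,
    left: Option[ToyNode] = None,
    right: Option[ToyNode] = None) {
  def isLeaf: Boolean = left.isEmpty && right.isEmpty
}

// Bottom-up pass: if both children are leaves and the parent's own prediction does
// no worse on the validation set, collapse the pair into the parent.
def prune(node: ToyNode): ToyNode = (node.left, node.right) match {
  case (Some(l), Some(r)) =>
    val prunedL = prune(l)
    val prunedR = prune(r)
    if (prunedL.isLeaf && prunedR.isLeaf &&
        node.validationError <= prunedL.validationError + prunedR.validationError) {
      node.copy(left = None, right = None)
    } else {
      node.copy(left = Some(prunedL), right = Some(prunedR))
    }
  case _ => node
}
{code}
The smarter in-training variant mentioned in the description makes the same comparison as nodes are grown, so branches that would be pruned are never trained further.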
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623726#comment-14623726 ] Sean Owen commented on SPARK-8981: -- The constructors? Have a look through org.apache.spark.executor. The app ID should be in env.conf Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8671) Add isotonic regression to the pipeline API
[ https://issues.apache.org/jira/browse/SPARK-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623786#comment-14623786 ] Martin Zapletal commented on SPARK-8671: I am on it. Add isotonic regression to the pipeline API --- Key: SPARK-8671 URL: https://issues.apache.org/jira/browse/SPARK-8671 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Original Estimate: 48h Remaining Estimate: 48h It is useful to have IsotonicRegression under the pipeline API for score calibration. The parameters should be the same as the implementation in spark.mllib package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
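For context, the description says the pipeline version should keep the same parameters as the existing spark.mllib implementation; that existing API looks like the following (shown for reference, assuming a SparkContext {{sc}} is in scope; the ml.Pipeline wrapper itself is what this ticket would add):
{code}
import org.apache.spark.mllib.regression.IsotonicRegression

// (label, feature, weight) triples; values are made up.
val data = sc.parallelize(Seq((1.0, 1.0, 1.0), (2.0, 2.0, 1.0), (1.5, 3.0, 1.0)))

val model = new IsotonicRegression()
  .setIsotonic(true)   // the main parameter a score-calibration use case would set
  .run(data)

val calibrated = model.predict(2.5)
{code}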
[jira] [Updated] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9004: - Target Version/s: (was: 1.4.0) Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Don't set Target version. This sounds specific to S3 though. Where are you proposing to change this? Add s3 bytes read/written metrics - Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Reporter: Abhishek Modi Priority: Minor s3 read/write metrics can be pretty useful in finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623712#comment-14623712 ] Sean Owen commented on SPARK-8981: -- Can MDC methods be invoked during executor initialization? where the app name is available? Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623721#comment-14623721 ] Walter Petersen commented on SPARK-3155: Ok, fine. Thanks a lot [~josephkb]. Support DecisionTree pruning Key: SPARK-3155 URL: https://issues.apache.org/jira/browse/SPARK-3155 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Improvement: accuracy, computation Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways. DecisionTree does not currently support pruning. Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed. A naive implementation prunes as follows: (1) Train a depth K tree using a training set. (2) Compute the optimal prediction at each node (including internal nodes) based on the training set. (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).) (4) For each pair of leafs with the same parent, compare the total error on the validation set made by the leafs’ predictions with the error made by the parent’s predictions. Remove the leafs if the parent has lower error. A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch. It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Petersen updated SPARK-3155: --- Comment: was deleted (was: Ok, fine. Thanks a lot.) Support DecisionTree pruning Key: SPARK-3155 URL: https://issues.apache.org/jira/browse/SPARK-3155 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Improvement: accuracy, computation Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways. DecisionTree does not currently support pruning. Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed. A naive implementation prunes as follows: (1) Train a depth K tree using a training set. (2) Compute the optimal prediction at each node (including internal nodes) based on the training set. (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).) (4) For each pair of leafs with the same parent, compare the total error on the validation set made by the leafs’ predictions with the error made by the parent’s predictions. Remove the leafs if the parent has lower error. A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch. It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623720#comment-14623720 ] Paweł Kopiczko commented on SPARK-8981: --- I think so. Would you mind pointing me to executor initialization code? Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623700#comment-14623700 ] Paweł Kopiczko commented on SPARK-8981: --- slf4j supports MDC as well: http://www.slf4j.org/api/org/slf4j/MDC.html I've analysed how the {{Logging}} trait is implemented. If I'm correct, every executor process calls the {{initializeLogging}} method because of the transient {{log_}} field. It looks to me that right now it's impossible to pass a {{SparkContext}} instance (or any other value) in there without breaking the API. Do you agree? Do you have any idea how to bypass that? In terms of this comment: ??This will likely be changed or removed in future releases.??, are you considering any change right now? Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623702#comment-14623702 ] Sean Owen commented on SPARK-9003: -- I think the idea was that this is not supposed to become yet another vector/matrix library, and that you can manipulate the underlying breeze vector if needed. I don't know how strong that convention is. The use case you show doesn't really benefit except for maybe saving a method call; is there a case where this would be a bigger win? Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623707#comment-14623707 ] Paweł Kopiczko commented on SPARK-8981: --- ??MDC has static methods?? Yes, but I'm not sure how to invoke these in executor thread. Any idea? ??Are you proposing to change the default log message or just make these values available??? Available only. I think it may be needed especially by standalone mode users. YARN users don't need that functionality, because CM stores logs in HDFS by applicationId. I'm not familiar with Mesos, but probably it has ability to store separated logs for each container. I believe the overhead is minimal since it's only two String values in a static map. Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623707#comment-14623707 ] Paweł Kopiczko edited comment on SPARK-8981 at 7/12/15 8:26 AM: ??MDC has static methods?? Yes, but I'm not sure how to invoke these in executor thread. Any idea? ??Are you proposing to change the default log message or just make these values available??? Available only. I think they may be needed especially by standalone mode users. YARN users don't need that functionality, because CM stores logs in HDFS by applicationId. I'm not familiar with Mesos, but probably it has ability to store separated logs for each container. I believe the overhead is minimal since it's only two String values in a static map. was (Author: kopiczko): ??MDC has static methods?? Yes, but I'm not sure how to invoke these in executor thread. Any idea? ??Are you proposing to change the default log message or just make these values available??? Available only. I think it may be needed especially by standalone mode users. YARN users don't need that functionality, because CM stores logs in HDFS by applicationId. I'm not familiar with Mesos, but probably it has ability to store separated logs for each container. I believe the overhead is minimal since it's only two String values in a static map. Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
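The MDC mechanism itself is small; the open question in this thread is only where to call it during executor startup. A sketch of the mechanism (illustrative values, with the call site deliberately left unspecified):
{code}
import org.slf4j.MDC

// Any thread that sets these keys before logging makes them visible to the appender.
MDC.put("applicationId", "app-20150712080000-0001") // illustrative value
MDC.put("appName", "MyJob")                          // illustrative value

// A log4j PatternLayout can then surface them without changing the default message,
// e.g. ConversionPattern=%d %p %c: [%X{applicationId}] [%X{appName}] %m%n
{code}
This matches the "make the values available only" option discussed above, since users who do not reference %X{...} in their pattern see no change to the log output.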
[jira] [Updated] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9003: --- Description: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. was: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623724#comment-14623724 ] Apache Spark commented on SPARK-9003: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7357 Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9003: --- Assignee: (was: Apache Spark) Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only supports the foreachActive function and lacks map/update, which is inconvenient for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each element of a and get a Vector as the return value, we can only write: val b = Vectors.dense(a.toArray.map(math.log)) or use toBreeze and fromBreeze to do the transformation with the Breeze API. The code snippet is not elegant; we would like to be able to write: val c = a.map(math.log) Also, MLlib/Matrix already implements map/update/foreachActive. I think Vector should also have map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623834#comment-14623834 ] Abhishek Modi commented on SPARK-9004: -- Hadoop separates HDFS bytes, local filesystem bytes and S3 bytes in its counters. Spark combines all of them in its metrics. Separating them could give a better idea of the IO distribution. Here's how it works in MR: 1. The client creates a Job object (org.apache.hadoop.mapreduce.Job) and submits it to the RM, which then launches the AM etc. 2. After job submission, the client continuously monitors the job to see if it has finished. 3. Once the job is finished, the client gets the counters of the job via the getCounters() function. 4. It logs them on the client using the Counters= format. I don't really know how to implement it. Can it be done by modifying NewHadoopRDD, because I guess that's where the Job object is being used? Add s3 bytes read/written metrics - Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Reporter: Abhishek Modi Priority: Minor S3 read/write metrics can be pretty useful for finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
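For reference, a minimal Scala sketch of the MR client-side flow described above, i.e. reading counters from a completed org.apache.hadoop.mapreduce.Job. The counter group and counter names used here ("FileSystemCounters", "S3_BYTES_READ") vary across Hadoop versions and filesystem schemes, so treat them as illustrative assumptions rather than stable identifiers.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.mapreduce.Job

  val job = Job.getInstance(new Configuration(), "example-job")
  // ... configure input/output formats and paths here ...
  if (job.waitForCompletion(true)) {
    val counters = job.getCounters
    // Per-filesystem byte counters; group and counter names are version-dependent.
    val s3BytesRead = counters.findCounter("FileSystemCounters", "S3_BYTES_READ").getValue
    println(s"Counters= S3_BYTES_READ: $s3BytesRead")
  }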
[jira] [Updated] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9004: - Component/s: Input/Output The metrics are tracked by the InputFormat / OutputFormat, right? That might already be available then, since Spark uses the same classes. I think you'd have to investigate and propose a PR if you want this done. NewHadoopRDD is not specific to S3, no. Add s3 bytes read/written metrics - Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Components: Input/Output Reporter: Abhishek Modi Priority: Minor S3 read/write metrics can be pretty useful for finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
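One possible starting point for that investigation, sketched below: the Hadoop client already keeps per-scheme FileSystem statistics in each JVM, which could in principle be surfaced as separate S3/HDFS/local byte metrics. This is only an illustration of where such numbers live in the Hadoop libraries, not a description of how Spark's task metrics are actually wired.

  import scala.collection.JavaConverters._
  import org.apache.hadoop.fs.FileSystem

  // Per-scheme read/write byte counts maintained by the Hadoop FileSystem layer
  // in the current JVM (e.g. "hdfs", "file", "s3n").
  FileSystem.getAllStatistics.asScala.foreach { stats =>
    println(s"${stats.getScheme}: bytesRead=${stats.getBytesRead} bytesWritten=${stats.getBytesWritten}")
  }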
[jira] [Updated] (SPARK-8982) Worker hostnames not showing in Master web ui when launched with start-slaves.sh
[ https://issues.apache.org/jira/browse/SPARK-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8982: - Target Version/s: (was: 1.4.0) Worker hostnames not showing in Master web ui when launched with start-slaves.sh Key: SPARK-8982 URL: https://issues.apache.org/jira/browse/SPARK-8982 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Ben Zimmer Priority: Minor If a --host argument is not provided to Worker, WorkerArguments uses Utils.localHostName to find the host name. SPARK-6440 changed the functionality of Utils.localHostName to retrieve the local IP address instead of host name. Since start-slave.sh does not provide the --host argument, clusters started with start-slaves.sh now show IP addresses instead of hostnames in the Master web UI. This is inconvenient when starting and debugging small clusters. A simple fix would be to find the local machine's hostname in start-slave.sh and pass it as the --host argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
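For context, a tiny Scala sketch of the distinction at play here, using plain java.net resolution rather than Spark's actual Utils code: after SPARK-6440 the Worker effectively registers with the address-style value instead of the hostname-style one, which is what now shows up in the Master web UI.

  import java.net.InetAddress

  val local = InetAddress.getLocalHost
  println(local.getHostAddress) // IP address, e.g. 192.168.1.10 - what the UI shows now
  println(local.getHostName)    // hostname, e.g. worker-01 - what it used to show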
[jira] [Assigned] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5159: --- Assignee: (was: Apache Spark) Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
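For background on the requested behaviour, here is a minimal sketch of the Hadoop proxy-user mechanism that HiveServer2's hive.server2.enable.doAs=true builds on. This illustrates the general technique only, not the Spark Thrift server's actual code, and it assumes the service principal is configured as a proxy user (hadoop.proxyuser.* settings in core-site.xml); "alice" is a placeholder for the submitting user.

  import java.security.PrivilegedExceptionAction
  import org.apache.hadoop.security.UserGroupInformation

  // The service's own (e.g. Kerberos-authenticated) identity.
  val serviceUgi = UserGroupInformation.getLoginUser
  // Impersonate the submitting user on top of the service identity.
  val clientUgi = UserGroupInformation.createProxyUser("alice", serviceUgi)

  clientUgi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = {
      // HDFS / metastore access performed here runs as "alice",
      // so her HDFS permissions are enforced instead of the hive user's.
    }
  })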
[jira] [Assigned] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5159: --- Assignee: Apache Spark Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray Assignee: Apache Spark I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623898#comment-14623898 ] Apache Spark commented on SPARK-5159: - User 'ilovesoup' has created a pull request for this issue: https://github.com/apache/spark/pull/7358 Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623909#comment-14623909 ] Ma Xiaoyu commented on SPARK-5159: -- The above is my first PR to Spark. I'm new to Spark and Scala, so please advise. Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org