[jira] [Assigned] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5681:
---

Assignee: (was: Apache Spark)

 Calling graceful stop() immediately after start() on StreamingContext should 
 not get stuck indefinitely
 ---

 Key: SPARK-5681
 URL: https://issues.apache.org/jira/browse/SPARK-5681
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Liang-Chi Hsieh

 Sometimes the receiver is registered with the tracker only after ssc.stop() has 
 been called, especially when stop() is called immediately after start(). The 
 receiver then never gets the StopReceiver message from the tracker, so calling 
 stop() in graceful mode gets stuck indefinitely.
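 A minimal sketch of the pattern that triggers the hang (a hedged illustration, 
 not a test from this ticket; master, host, port and batch interval are placeholders):
 {code}
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val conf = new SparkConf().setMaster("local[2]").setAppName("graceful-stop-repro")
 val ssc = new StreamingContext(conf, Seconds(1))
 ssc.socketTextStream("localhost", 9999).print()

 ssc.start()
 // Graceful stop right after start(): the receiver may register with the tracker
 // only after stop() has begun, never receive StopReceiver, and this call hangs.
 ssc.stop(stopSparkContext = true, stopGracefully = true)
 {code}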






[jira] [Assigned] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5681:
---

Assignee: Apache Spark

 Calling graceful stop() immediately after start() on StreamingContext should 
 not get stuck indefinitely
 ---

 Key: SPARK-5681
 URL: https://issues.apache.org/jira/browse/SPARK-5681
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Liang-Chi Hsieh
Assignee: Apache Spark

 Sometimes the receiver is registered with the tracker only after ssc.stop() has 
 been called, especially when stop() is called immediately after start(). The 
 receiver then never gets the StopReceiver message from the tracker, so calling 
 stop() in graceful mode gets stuck indefinitely.






[jira] [Assigned] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6694:
---

Assignee: (was: Apache Spark)

 SparkSQL CLI must be able to specify an option --database on the command line.
 --

 Key: SPARK-6694
 URL: https://issues.apache.org/jira/browse/SPARK-6694
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Jin Adachi

 SparkSQL CLI has an option --database, as shown below.
 But the --database option doesn't work properly.
 {code:}
 $ spark-sql --help
 :
 CLI options:
 :
 --database databasename Specify the database to use
 {code}






[jira] [Commented] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394191#comment-14394191
 ] 

Apache Spark commented on SPARK-6694:
-

User 'adachij2002' has created a pull request for this issue:
https://github.com/apache/spark/pull/5345

 SparkSQL CLI must be able to specify an option --database on the command line.
 --

 Key: SPARK-6694
 URL: https://issues.apache.org/jira/browse/SPARK-6694
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Jin Adachi

 SparkSQL CLI has an option --database, as shown below.
 But the --database option doesn't work properly.
 {code:}
 $ spark-sql --help
 :
 CLI options:
 :
 --database databasename Specify the database to use
 {code}






[jira] [Commented] (SPARK-6428) Add to style checker public method must have explicit type defined

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394121#comment-14394121
 ] 

Apache Spark commented on SPARK-6428:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5342

 Add to style checker public method must have explicit type defined
 

 Key: SPARK-6428
 URL: https://issues.apache.org/jira/browse/SPARK-6428
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin

 Otherwise it is too easy to accidentally leak or define an incorrect return 
 type in user-facing APIs.






[jira] [Assigned] (SPARK-6428) Add to style checker public method must have explicit type defined

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6428:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Add to style checker public method must have explicit type defined
 

 Key: SPARK-6428
 URL: https://issues.apache.org/jira/browse/SPARK-6428
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Reynold Xin
Assignee: Apache Spark

 Otherwise it is too easy to accidentally leak or define an incorrect return 
 type in user-facing APIs.






[jira] [Reopened] (SPARK-1095) Ensure all public methods return explicit types

2015-04-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-1095:

  Assignee: Reynold Xin  (was: prashant)

 Ensure all public methods return explicit types
 ---

 Key: SPARK-1095
 URL: https://issues.apache.org/jira/browse/SPARK-1095
 Project: Spark
  Issue Type: Sub-task
Reporter: Patrick Wendell
Assignee: Reynold Xin
 Fix For: 1.0.0


 This talk explains some of the challenges around typing for binary 
 compatibility:
 http://www.slideshare.net/mircodotta/managing-binary-compatibility-in-scala
 For public methods we should always declare the type. We've had this as a 
 guideline in the past but we need to make sure we obey it in all public 
 interfaces. Also, we should return the most general type possible.
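 A small illustrative sketch (the names are hypothetical, not Spark code) of why an 
 explicit, general return type matters:
 {code}
 object ReturnTypeExample {
   class CountingIterator(n: Int) extends Iterator[Int] {
     private var i = 0
     def hasNext: Boolean = i < n
     def next(): Int = { i += 1; i }
   }

   // Inferred return type is the concrete CountingIterator class, so renaming or
   // removing that class later breaks binary compatibility for callers.
   def values = new CountingIterator(3)

   // Declared, most general type: only Iterator[Int] is part of the public API.
   def valuesExplicit: Iterator[Int] = new CountingIterator(3)

   def main(args: Array[String]): Unit = {
     println(values.toList)          // List(1, 2, 3)
     println(valuesExplicit.toList)  // List(1, 2, 3)
   }
 }
 {code}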






[jira] [Updated] (SPARK-6692) Make it possible to kill AM in YARN cluster mode when the client is terminated

2015-04-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6692:
---
Assignee: Cheolsoo Park

 Make it possible to kill AM in YARN cluster mode when the client is terminated
 --

 Key: SPARK-6692
 URL: https://issues.apache.org/jira/browse/SPARK-6692
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor
  Labels: yarn

 I understand that the yarn-cluster mode is designed for a fire-and-forget 
 model; therefore, terminating the yarn client doesn't kill the AM.
 However, it is very common that users submit Spark jobs via a job scheduler 
 (e.g. Apache Oozie) or a remote job server (e.g. Netflix Genie), where it is 
 expected that killing the yarn client will terminate the AM. 
 It is true that the yarn-client mode can be used in such cases. But then the 
 yarn client sometimes needs lots of heap memory for big jobs if it runs in 
 yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs 
 because the AM can be given arbitrary heap memory, unlike the yarn client. So it 
 would be very useful to make it possible to kill the AM even in yarn-cluster 
 mode.
 In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon 
 as they're accepted (but not yet running). Although they're eventually 
 shut down after the AM timeout, it would be nice if the AM could be killed 
 immediately in such cases too.






[jira] [Commented] (SPARK-6693) add to string with max lines and width for matrix

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394146#comment-14394146
 ] 

Apache Spark commented on SPARK-6693:
-

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/5344

 add to string with max lines and width for matrix
 -

 Key: SPARK-6693
 URL: https://issues.apache.org/jira/browse/SPARK-6693
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 It's kind of annoying when debugging to find that you cannot print out the 
 matrix the way you want.
 The original toString of Matrix only prints something like the following:
 0.17810102596909183  0.5616906241468385    ... (100 total)
 0.9692861997823815   0.015558159784155756  ...
 0.8513015122819192   0.031523763918528847  ...
 0.5396875653953941   0.3267864552779176    ...
 A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, 
 logging and saving a matrix to files.
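 A rough sketch of what such a bounded toString could do (my own illustration, 
 using a plain Array[Array[Double]] in place of MLlib's Matrix; not Spark code):
 {code}
 object MatrixToStringSketch {
   // Print at most maxLines rows and truncate each rendered row to maxWidth characters.
   def toStringBounded(values: Array[Array[Double]], maxLines: Int, maxWidth: Int): String = {
     val shown = values.take(maxLines).map { row =>
       val line = row.mkString(" ")
       if (line.length > maxWidth) line.take(math.max(0, maxWidth - 3)) + "..." else line
     }
     val suffix = if (values.length > maxLines) s"\n... (${values.length} rows total)" else ""
     shown.mkString("\n") + suffix
   }

   def main(args: Array[String]): Unit = {
     val m = Array.fill(5, 4)(scala.util.Random.nextDouble())
     println(toStringBounded(m, maxLines = 3, maxWidth = 40))
   }
 }
 {code}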






[jira] [Assigned] (SPARK-6693) add to string with max lines and width for matrix

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6693:
---

Assignee: (was: Apache Spark)

 add to string with max lines and width for matrix
 -

 Key: SPARK-6693
 URL: https://issues.apache.org/jira/browse/SPARK-6693
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 It's kind of annoying when debugging to find that you cannot print out the 
 matrix the way you want.
 The original toString of Matrix only prints something like the following:
 0.17810102596909183  0.5616906241468385    ... (100 total)
 0.9692861997823815   0.015558159784155756  ...
 0.8513015122819192   0.031523763918528847  ...
 0.5396875653953941   0.3267864552779176    ...
 A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, 
 logging and saving a matrix to files.






[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-03 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394147#comment-14394147
 ] 

Florian Verhein commented on SPARK-6664:


I guess the other thing is - we can union RDDs, so why not be able to 'undo' 
that?

 Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
 --

 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Florian Verhein

 I can't find this functionality (if I missed something, apologies!), but it 
 would be very useful for evaluating ml models.  
 *Use case example* 
 suppose you have pre-processed web logs for a few months, and now want to 
 split them into a training set (where you train a model to predict some aspect 
 of site accesses, perhaps per user) and an out-of-time test set (where you 
 evaluate how well your model performs in the future). This example has just a 
 single split, but in general you could want more for cross-validation. You 
 may also want to have multiple overlapping intervals.   
 *Specification* 
 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
 return n+1 RDDs such that values in the ith RDD are within the (i-1)th and 
 ith boundary.
 2. More complex alternative (but similar under the hood): provide a sequence 
 of possibly overlapping intervals (ordered by the start key of the interval), 
 and return the RDDs containing values within those intervals. 
 *Implementation ideas / notes for 1*
 - The ordered RDDs are likely RangePartitioned (or there should be a simple 
 way to find ranges from partitions in an ordered RDD)
 - Find the partitions containing the boundary, and split them in two.  
 - Construct the new RDDs from the original partitions (and any split ones)
 I suspect this could be done by launching only a few jobs to split the 
 partitions containing the boundaries. 
 Alternatively, it might be possible to decorate these partitions and use them 
 in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
 Apply two decorators p' and p'', where p' masks out values above the ith 
 boundary, and p'' masks out values below the ith boundary. Any operations on 
 these partitions apply only to values not masked out. Then assign p' to the 
 ith output RDD and p'' to the (i+1)th output RDD.
 If I understand Spark correctly, this should not require any jobs. Not sure 
 whether it's worth trying this optimisation.
 *Implementation ideas / notes for 2*
 This is very similar, except that we have to handle entire partitions (or parts 
 of them) belonging to more than one output RDD, since they are no longer 
 mutually exclusive. But since RDDs are immutable(??), the decorator idea 
 should still work?
 Thoughts?
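 A hedged baseline sketch of variant 1 (the function name and Long key type are my 
 own; this is a plain filter per range, not the optimised partition-splitting 
 approach described above):
 {code}
 import org.apache.spark.rdd.RDD

 // Given a key-sorted RDD and n sorted boundary keys, return n+1 RDDs where the
 // i-th RDD holds keys in [edges(i), edges(i+1)). On a RangePartitioned RDD,
 // filterByRange could replace the plain filter to get partition pruning, at the
 // cost of its inclusive bounds.
 def splitByBoundaries[V](rdd: RDD[(Long, V)], boundaries: Seq[Long]): Seq[RDD[(Long, V)]] = {
   val edges = Long.MinValue +: boundaries :+ Long.MaxValue
   edges.sliding(2).map { case Seq(lo, hi) =>
     rdd.filter { case (k, _) => k >= lo && k < hi }
   }.toSeq
 }
 {code}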






[jira] [Created] (SPARK-6693) add to string with max lines and width for matrix

2015-04-03 Thread yuhao yang (JIRA)
yuhao yang created SPARK-6693:
-

 Summary: add to string with max lines and width for matrix
 Key: SPARK-6693
 URL: https://issues.apache.org/jira/browse/SPARK-6693
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: yuhao yang
Priority: Minor


It's kind of annoying when debugging to find that you cannot print out the matrix 
the way you want.

The original toString of Matrix only prints something like the following:
0.17810102596909183  0.5616906241468385    ... (100 total)
0.9692861997823815   0.015558159784155756  ...
0.8513015122819192   0.031523763918528847  ...
0.5396875653953941   0.3267864552779176    ...

A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, 
logging and saving a matrix to files.






[jira] [Assigned] (SPARK-6693) add to string with max lines and width for matrix

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6693:
---

Assignee: Apache Spark

 add to string with max lines and width for matrix
 -

 Key: SPARK-6693
 URL: https://issues.apache.org/jira/browse/SPARK-6693
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: yuhao yang
Assignee: Apache Spark
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 It's kind of annoying when debugging to find that you cannot print out the 
 matrix the way you want.
 The original toString of Matrix only prints something like the following:
 0.17810102596909183  0.5616906241468385    ... (100 total)
 0.9692861997823815   0.015558159784155756  ...
 0.8513015122819192   0.031523763918528847  ...
 0.5396875653953941   0.3267864552779176    ...
 A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, 
 logging and saving a matrix to files.






[jira] [Updated] (SPARK-6693) add toString with max lines and width for matrix

2015-04-03 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-6693:
--
Summary: add toString with max lines and width for matrix  (was: add to 
string with max lines and width for matrix)

 add toString with max lines and width for matrix
 

 Key: SPARK-6693
 URL: https://issues.apache.org/jira/browse/SPARK-6693
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: yuhao yang
Priority: Minor
   Original Estimate: 2h
  Remaining Estimate: 2h

 It's kind of annoying when debugging to find that you cannot print out the 
 matrix the way you want.
 The original toString of Matrix only prints something like the following:
 0.17810102596909183  0.5616906241468385    ... (100 total)
 0.9692861997823815   0.015558159784155756  ...
 0.8513015122819192   0.031523763918528847  ...
 0.5396875653953941   0.3267864552779176    ...
 A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, 
 logging and saving a matrix to files.






[jira] [Assigned] (SPARK-6211) Test Python Kafka API using Python unit tests

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6211:
---

Assignee: Saisai Shao  (was: Apache Spark)

 Test Python Kafka API using Python unit tests
 -

 Key: SPARK-6211
 URL: https://issues.apache.org/jira/browse/SPARK-6211
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Tathagata Das
Assignee: Saisai Shao
Priority: Critical

 This is tricky in Python because KafkaStreamSuiteBase (which has the 
 functionality for creating embedded Kafka clusters) is in the test package, 
 which is not on the Python path. To fix that, we have two ways: 
 1. Add the test jar to the classpath in the Python test. That's kind of tricky.
 2. Bring that class into the src package (maybe renamed as KafkaTestUtils), and 
 then wrap it in Python to use it from Python. 
 If (2) does not add any extra test dependencies to the main Kafka pom, then (2) 
 should be simpler to do.  






[jira] [Assigned] (SPARK-6211) Test Python Kafka API using Python unit tests

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6211:
---

Assignee: Apache Spark  (was: Saisai Shao)

 Test Python Kafka API using Python unit tests
 -

 Key: SPARK-6211
 URL: https://issues.apache.org/jira/browse/SPARK-6211
 Project: Spark
  Issue Type: Test
  Components: Streaming, Tests
Reporter: Tathagata Das
Assignee: Apache Spark
Priority: Critical

 This is tricky in Python because KafkaStreamSuiteBase (which has the 
 functionality for creating embedded Kafka clusters) is in the test package, 
 which is not on the Python path. To fix that, we have two ways: 
 1. Add the test jar to the classpath in the Python test. That's kind of tricky.
 2. Bring that class into the src package (maybe renamed as KafkaTestUtils), and 
 then wrap it in Python to use it from Python. 
 If (2) does not add any extra test dependencies to the main Kafka pom, then (2) 
 should be simpler to do.  






[jira] [Created] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming

2015-04-03 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-6691:
--

 Summary: Abstract and add a dynamic RateLimiter for Spark Streaming
 Key: SPARK-6691
 URL: https://issues.apache.org/jira/browse/SPARK-6691
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao


Flow control (or rate control) for input data is very important in a streaming 
system, especially for Spark Streaming to stay stable and up-to-date. An 
unexpected flood of incoming data, or an ingestion rate beyond the computation 
power of the cluster, will make the system unstable and increase the delay. 
Given Spark Streaming's job generation and processing pattern, this delay 
accumulates and introduces unacceptable exceptions.



Currently, in Spark Streaming's receiver-based input stream, there is a 
RateLimiter in BlockGenerator which controls the ingestion rate of input data, 
but the current implementation has several limitations:

# The max ingestion rate is set by the user through configuration beforehand; the 
user may lack the experience to pick an appropriate value before the 
application is running.
# This configuration is fixed for the lifetime of the application, which means 
you need to consider the worst-case scenario to set a reasonable value.
# Input streams like DirectKafkaInputStream need to maintain another solution to 
achieve the same functionality.
# The lack of slow-start control makes the whole system easily trapped in large 
processing and scheduling delays at the very beginning.



So here we propose a new dynamic RateLimiter, as well as a new interface for 
the RateLimiter, to improve the whole system's stability. The targets are:


* Dynamically adjust the ingestion rate according to the processing rate of 
previously finished jobs.
* Offer a uniform solution not only for receiver-based input streams, but also 
for direct streams like DirectKafkaInputStream and new ones.
* Use a slow-start rate to control network congestion when a job is started.
* Provide a pluggable framework to make extensions easier to maintain (a rough 
sketch of such an interface follows below).
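A rough sketch of the kind of pluggable interface this could take (the names and 
the smoothing rule below are hypothetical, not the design in the linked doc):

{code}
// Hypothetical interface: the stream asks for the currently allowed rate, and
// the scheduler feeds back statistics of each finished batch.
trait StreamRateLimiter {
  def currentRate: Long  // max records per second allowed right now
  def onBatchCompleted(processingTimeMs: Long, numRecords: Long): Unit
}

// Very rough dynamic limiter: start slowly and smooth towards the observed
// processing rate of the previous batch.
class SimpleDynamicRateLimiter(initialRate: Long) extends StreamRateLimiter {
  private var rate = initialRate
  override def currentRate: Long = rate
  override def onBatchCompleted(processingTimeMs: Long, numRecords: Long): Unit = {
    if (processingTimeMs > 0) {
      val processedPerSec = numRecords * 1000 / processingTimeMs
      rate = math.max(1L, (rate + processedPerSec) / 2)
    }
  }
}
{code}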



Here is the design doc 
(https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing)
 and working branch 
(https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).

Any comment would be greatly appreciated.






[jira] [Assigned] (SPARK-6428) Add to style checker public method must have explicit type defined

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6428:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Add to style checker public method must have explicit type defined
 

 Key: SPARK-6428
 URL: https://issues.apache.org/jira/browse/SPARK-6428
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Reynold Xin
Assignee: Reynold Xin

 Otherwise it is too easy to accidentally leak or define an incorrect return 
 type in user-facing APIs.






[jira] [Assigned] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5523:
---

Assignee: (was: Apache Spark)

 TaskMetrics and TaskInfo have innumerable copies of the hostname string
 ---

 Key: SPARK-5523
 URL: https://issues.apache.org/jira/browse/SPARK-5523
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Reporter: Tathagata Das

  TaskMetrics and TaskInfo objects carry the hostname associated with the task. 
 As these are created (directly or through deserialization of RPC messages), 
 each of them has a separate String object for the hostname, even though most 
 of them contain the same string data. This results in thousands of string 
 objects, increasing the memory requirement of the driver. 
 This can easily be deduped when deserializing a TaskMetrics object, or when 
 creating a TaskInfo object.
 This affects streaming particularly badly due to the rate of job/stage/task 
 generation. 
 For a solution, see how this dedup is done for StorageLevel: 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226
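 A small sketch of that dedup idea (mirroring the StorageLevel-style caching linked 
 above; the object name is hypothetical):
 {code}
 import java.util.concurrent.ConcurrentHashMap

 // Keep one canonical String per hostname and reuse it when deserializing
 // TaskMetrics/TaskInfo, so thousands of equal hostnames share a single object.
 object HostnameCache {
   private val cache = new ConcurrentHashMap[String, String]()
   def dedup(host: String): String = {
     val existing = cache.putIfAbsent(host, host)
     if (existing == null) host else existing
   }
 }
 {code}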
  






[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-03 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394141#comment-14394141
 ] 

Florian Verhein commented on SPARK-6664:


Thanks [~sowen]. I disagree :-) 

...If you think there's non-stationarity you most certainly want to see how 
well a model trained in the past holds up in the future (possibly with more 
than one out of time sample if one is used for pruning, etc), and you can do 
this for temporal data by adjusting the way you do cross validation... 
actually, the exact method you describe is one common approach in time series 
data, e.g. see 
http://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection
Doing this multiple times does exactly what it does for normal cross-validation 
- gives you a distribution of your error estimate, rather than a single value 
(a sample of it). So it's quite important. The size of the data isn't really 
relevant to this argument (also consider that I might like to employ larger 
datasets to remove the risk of overfitting a more complex but better fitting 
model, rather than to improve my error estimates). 

Note that this proposal doesn't define how the split RDDs are used (i.e. 
unioned) to create training sets and test sets. So the test set can be a single 
RDD, or multiple ones. It's entirely up to the user.

Allowing overlapping partitions (i.e. part 2) is a little different, because 
you probably wouldn't union the resulting RDDs due to duplication. It would be 
more useful as a primitive for bootstrapping the performance measures of 
streaming models or simulations (so, you're not resampling records, but 
resampling subsequences). 
Alternatively if you have big data but a class imbalance problem, you might 
need to resort to overlaps in the training sets to get multiple test sets with 
enough examples of your minority class.

From what I understand, MLUtils.kFold is standard randomised k-fold 
cross-validation *but without shuffling* (from a cursory look at the code, it looks 
like ordering will always be maintained... which should probably be documented 
if it is the case because it can lead to bad things... and adds another 
argument for #6665). Either way, since elements of its splits are 
non-consecutive, it's not applicable for time series. 

Do you know how the performance of filterByRange would compare? It should be 
pretty performant if and only if the data is RangePartitioned right? 


 Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
 --

 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Florian Verhein

 I can't find this functionality (if I missed something, apologies!), but it 
 would be very useful for evaluating ml models.  
 *Use case example* 
 suppose you have pre-processed web logs for a few months, and now want to 
 split them into a training set (where you train a model to predict some aspect 
 of site accesses, perhaps per user) and an out-of-time test set (where you 
 evaluate how well your model performs in the future). This example has just a 
 single split, but in general you could want more for cross-validation. You 
 may also want to have multiple overlapping intervals.   
 *Specification* 
 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
 return n+1 RDDs such that values in the ith RDD are within the (i-1)th and 
 ith boundary.
 2. More complex alternative (but similar under the hood): provide a sequence 
 of possibly overlapping intervals (ordered by the start key of the interval), 
 and return the RDDs containing values within those intervals. 
 *Implementation ideas / notes for 1*
 - The ordered RDDs are likely RangePartitioned (or there should be a simple 
 way to find ranges from partitions in an ordered RDD)
 - Find the partitions containing the boundary, and split them in two.  
 - Construct the new RDDs from the original partitions (and any split ones)
 I suspect this could be done by launching only a few jobs to split the 
 partitions containing the boundaries. 
 Alternatively, it might be possible to decorate these partitions and use them 
 in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
 Apply two decorators p' and p'', where p' masks out values above the ith 
 boundary, and p'' masks out values below the ith boundary. Any operations on 
 these partitions apply only to values not masked out. Then assign p' to the 
 ith output RDD and p'' to the (i+1)th output RDD.
 If I understand Spark correctly, this should not require any jobs. Not sure 
 whether it's worth trying this optimisation.
 *Implementation ideas / notes for 2*
 

[jira] [Created] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.

2015-04-03 Thread Jin Adachi (JIRA)
Jin Adachi created SPARK-6694:
-

 Summary: SparkSQL CLI must be able to specify an option --database 
on the command line.
 Key: SPARK-6694
 URL: https://issues.apache.org/jira/browse/SPARK-6694
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Jin Adachi


SparkSQL CLI has an option --database, as shown below.
But the --database option doesn't work properly.

{code:}
$ spark-sql --help
:
CLI options:
:
--database databasename Specify the database to use
{code}






[jira] [Created] (SPARK-6692) Make it possible to kill AM in YARN cluster mode when the client is terminated

2015-04-03 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created SPARK-6692:


 Summary: Make it possible to kill AM in YARN cluster mode when the 
client is terminated
 Key: SPARK-6692
 URL: https://issues.apache.org/jira/browse/SPARK-6692
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Priority: Minor


I understand that the yarn-cluster mode is designed for a fire-and-forget model; 
therefore, terminating the yarn client doesn't kill the AM.

However, it is very common that users submit Spark jobs via a job scheduler (e.g. 
Apache Oozie) or a remote job server (e.g. Netflix Genie), where it is expected 
that killing the yarn client will terminate the AM. 

It is true that the yarn-client mode can be used in such cases. But then the 
yarn client sometimes needs lots of heap memory for big jobs if it runs in 
yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs because 
the AM can be given arbitrary heap memory, unlike the yarn client. So it would be 
very useful to make it possible to kill the AM even in yarn-cluster mode.

In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon 
as they're accepted (but not yet running). Although they're eventually shut down 
after the AM timeout, it would be nice if the AM could be killed immediately in 
such cases too.






[jira] [Assigned] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5523:
---

Assignee: Apache Spark

 TaskMetrics and TaskInfo have innumerable copies of the hostname string
 ---

 Key: SPARK-5523
 URL: https://issues.apache.org/jira/browse/SPARK-5523
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Reporter: Tathagata Das
Assignee: Apache Spark

  TaskMetrics and TaskInfo objects carry the hostname associated with the task. 
 As these are created (directly or through deserialization of RPC messages), 
 each of them has a separate String object for the hostname, even though most 
 of them contain the same string data. This results in thousands of string 
 objects, increasing the memory requirement of the driver. 
 This can easily be deduped when deserializing a TaskMetrics object, or when 
 creating a TaskInfo object.
 This affects streaming particularly badly due to the rate of job/stage/task 
 generation. 
 For a solution, see how this dedup is done for StorageLevel: 
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226
  






[jira] [Commented] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.

2015-04-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394211#comment-14394211
 ] 

Sean Owen commented on SPARK-6694:
--

What problem do you encounter? You only showed the help message.

 SparkSQL CLI must be able to specify an option --database on the command line.
 --

 Key: SPARK-6694
 URL: https://issues.apache.org/jira/browse/SPARK-6694
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Jin Adachi

 SparkSQL CLI has an option --database, as shown below.
 But the --database option doesn't work properly.
 {code:}
 $ spark-sql --help
 :
 CLI options:
 :
 --database databasename Specify the database to use
 {code}






[jira] [Resolved] (SPARK-6560) PairRDDFunctions suppresses exceptions in writeFile

2015-04-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6560.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5223
[https://github.com/apache/spark/pull/5223]

 PairRDDFunctions suppresses exceptions in writeFile
 ---

 Key: SPARK-6560
 URL: https://issues.apache.org/jira/browse/SPARK-6560
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Stephen Haberman
Priority: Minor
 Fix For: 1.4.0


 In PairRDDFunctions, saveAsHadoopDataset uses a try/finally to manage 
 SparkHadoopWriter. Briefly:
 {code}
 try {
   ... writer.write(...)
 } finally {
   writer.close()
 }
 {code}
 However, if an exception happens in writer.write, and then writer.close is 
 called, and an exception in writer.close happens, the original (real) 
 exception from writer.write is suppressed.
 This makes debugging very painful, as the exception that is shown in the logs 
 (from writer.close) is spurious, and the original, real exception has been 
 lost and not logged.
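 A hedged sketch (not the actual change in the pull request) of the usual pattern 
 for preserving the original exception when close() also fails:
 {code}
 // If body() throws and close() then also throws, rethrow the exception from
 // body() and attach the close() failure as a suppressed exception, instead of
 // letting it replace the original one.
 def tryWithSafeClose[A <: AutoCloseable, B](resource: A)(body: A => B): B = {
   var primary: Throwable = null
   try {
     body(resource)
   } catch {
     case t: Throwable =>
       primary = t
       throw t
   } finally {
     try {
       resource.close()
     } catch {
       case closeEx: Throwable =>
         if (primary != null) primary.addSuppressed(closeEx) else throw closeEx
     }
   }
 }
 {code}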






[jira] [Commented] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path

2015-04-03 Thread Masayoshi TSUZUKI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394240#comment-14394240
 ] 

Masayoshi TSUZUKI commented on SPARK-6568:
--

{code}
bin\spark-shell.cmd --jars C:\Program Files\some\jar1.jar
{code}
{code}
Exception in thread "main" java.net.URISyntaxException: Illegal character in 
path at index 10: C:/Program Files/some/jar1.jar
at java.net.URI$Parser.fail(URI.java:2829)
at java.net.URI$Parser.checkChars(URI.java:3002)
at java.net.URI$Parser.parseHierarchical(URI.java:3086)
at java.net.URI$Parser.parse(URI.java:3034)
at java.net.URI.<init>(URI.java:595)
at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1721)
at 
org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1745)
at 
org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1745)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.util.Utils$.resolveURIs(Utils.scala:1745)
at 
org.apache.spark.deploy.SparkSubmitArguments.handle(SparkSubmitArguments.scala:367)
at 
org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:155)
at 
org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:92)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/04/02 14:23:46 DEBUG Utils: Shutdown hook called
{code}

 spark-shell.cmd --jars option does not accept the jar that has space in its 
 path
 

 Key: SPARK-6568
 URL: https://issues.apache.org/jira/browse/SPARK-6568
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Windows 8.1
Reporter: Masayoshi TSUZUKI

 The spark-shell.cmd --jars option does not accept a jar that has a space in its 
 path.
 The path of a jar sometimes contains spaces on Windows.
 {code}
 bin\spark-shell.cmd --jars C:\Program Files\some\jar1.jar
 {code}
 this gets
 {code}
 Exception in thread "main" java.net.URISyntaxException: Illegal character in 
 path at index 10: C:/Program Files/some/jar1.jar
 {code}
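 A hedged illustration of why the raw path is rejected: java.net.URI does not 
 accept an unescaped space, while building the URI from a File escapes it:
 {code}
 import java.io.File
 import java.net.URI

 val raw = "C:/Program Files/some/jar1.jar"
 // new URI(raw)  // throws URISyntaxException: Illegal character in path at index 10
 val escaped: URI = new File(raw).toURI  // on Windows: file:/C:/Program%20Files/some/jar1.jar
 println(escaped)
 {code}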






[jira] [Updated] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.

2015-04-03 Thread Jin Adachi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jin Adachi updated SPARK-6694:
--
Description: 
SparkSQL CLI has an option --database, as shown below.
But the --database option is ignored.

{code:}
$ spark-sql --help
:
CLI options:
:
--database databasename Specify the database to use
{code}

  was:
SparkSQL CLI has an option --database as follows.
But, an option --database doesn't work properly.

{code:}
$ spark-sql --help
:
CLI options:
:
--database databasename Specify the database to use
```
{code}


 SparkSQL CLI must be able to specify an option --database on the command line.
 --

 Key: SPARK-6694
 URL: https://issues.apache.org/jira/browse/SPARK-6694
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Jin Adachi

 SparkSQL CLI has an option --database, as shown below.
 But the --database option is ignored.
 {code:}
 $ spark-sql --help
 :
 CLI options:
 :
 --database databasename Specify the database to use
 {code}






[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-04-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394262#comment-14394262
 ] 

Sean Owen commented on SPARK-6569:
--

[~c...@koeninger.org] what do you think about just removing this log or making 
it debug level?

 Kafka directInputStream logs what appear to be incorrect warnings
 -

 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Priority: Minor

 During what appears to be normal operation of streaming from a Kafka topic, 
 the following log records are observed, logged periodically:
 {code}
 [Stage 391:==  (3 + 0) / 
 4]
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 {code}
 * the ${part.fromOffset} placeholder is not substituted with an actual value (see 
 the sketch below)
 * does this condition really mandate a warning being logged?
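 A hedged illustration of the likely cause of the first point: the message appears 
 to be built without Scala's s interpolator, so the placeholder is logged literally 
 (Part here is a stand-in for the real offset-range class):
 {code}
 case class Part(fromOffset: Long)
 val part = Part(42)

 // Without the `s` prefix the placeholder is not substituted:
 println("Beginning offset ${part.fromOffset} is the same as ending offset")
 // prints: Beginning offset ${part.fromOffset} is the same as ending offset

 // With the interpolator it is:
 println(s"Beginning offset ${part.fromOffset} is the same as ending offset")
 // prints: Beginning offset 42 is the same as ending offset
 {code}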






[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2015-04-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Attachment: (was: stages.png)

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: tasks.png


 I sometimes troubleshoot and analyse the cause of a long-running job.
 In that case, I find the stages which take a long time or fail, then I find 
 the tasks which take a long time or fail, and next I analyse the proportion of 
 each phase in a task.
 In another case, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it's helpful to visualize a timeline view of stages 
 / tasks / executors and to visualize the proportion of each activity within a 
 task.
 Now I'm developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.






[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2015-04-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Attachment: (was: taskDetails.png)

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: tasks.png


 I sometimes troubleshoot and analyse the cause of a long-running job.
 In that case, I find the stages which take a long time or fail, then I find 
 the tasks which take a long time or fail, and next I analyse the proportion of 
 each phase in a task.
 In another case, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it's helpful to visualize a timeline view of stages 
 / tasks / executors and to visualize the proportion of each activity within a 
 task.
 Now I'm developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.






[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2015-04-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Attachment: (was: stage-timeline.png)

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: tasks.png


 I sometimes troubleshoot and analyse the cause of a long-running job.
 In that case, I find the stages which take a long time or fail, then I find 
 the tasks which take a long time or fail, and next I analyse the proportion of 
 each phase in a task.
 In another case, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it's helpful to visualize a timeline view of stages 
 / tasks / executors and to visualize the proportion of each activity within a 
 task.
 Now I'm developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.






[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2015-04-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Attachment: (was: executors.png)

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: tasks.png


 I sometimes troubleshoot and analyse the cause of a long-running job.
 In that case, I find the stages which take a long time or fail, then I find 
 the tasks which take a long time or fail, and next I analyse the proportion of 
 each phase in a task.
 In another case, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it's helpful to visualize a timeline view of stages 
 / tasks / executors and to visualize the proportion of each activity within a 
 task.
 Now I'm developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.






[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2015-04-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Attachment: (was: tasks.png)

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta

 I sometimes troubleshoot and analyse the cause of a long-running job.
 In that case, I find the stages which take a long time or fail, then I find 
 the tasks which take a long time or fail, and next I analyse the proportion of 
 each phase in a task.
 In another case, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it's helpful to visualize a timeline view of stages 
 / tasks / executors and to visualize the proportion of each activity within a 
 task.
 Now I'm developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.






[jira] [Issue Comment Deleted] (SPARK-3468) WebUI Timeline-View feature

2015-04-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Comment: was deleted

(was: Sorry for leaving this ticket pending for a long time. I've reconsidered 
how and what to visualize.
One of my ideas, a timeline-based visualization for each task in a stage, is 
taking shape and is almost implemented.

!stage-timeline.png!

This feature is integrated into the existing WebUI, and is zoomable and scrollable.
The code of this feature is a little bit messy right now, but I'll clean it up and 
show the code soon.)

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: tasks.png


 I sometimes troubleshoot and analyse the cause of a long-running job.
 In that case, I find the stages which take a long time or fail, then I find 
 the tasks which take a long time or fail, and next I analyse the proportion of 
 each phase in a task.
 In another case, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it's helpful to visualize a timeline view of stages 
 / tasks / executors and to visualize the proportion of each activity within a 
 task.
 Now I'm developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.






[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature

2015-04-03 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3468:
--
Attachment: TaskAssignmentTimelineView.png
JobTimelineView.png
ApplicationTimeliView.png

I've attached screenshots of this feature.

 WebUI Timeline-View feature
 ---

 Key: SPARK-3468
 URL: https://issues.apache.org/jira/browse/SPARK-3468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Kousuke Saruta
 Attachments: ApplicationTimeliView.png, JobTimelineView.png, 
 TaskAssignmentTimelineView.png


 I sometimes troubleshoot and analyse the cause of a long-running job.
 In that case, I find the stages which take a long time or fail, then I find 
 the tasks which take a long time or fail, and next I analyse the proportion of 
 each phase in a task.
 In another case, I find executors which take a long time to run a task and 
 analyse the details of that task.
 In such situations, I think it's helpful to visualize a timeline view of stages 
 / tasks / executors and to visualize the proportion of each activity within a 
 task.
 Now I'm developing prototypes like the captures I attached.
 I'll integrate these viewers into the WebUI.






[jira] [Assigned] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6489:
---

Assignee: (was: Apache Spark)

 Optimize lateral view with explode to not read unnecessary columns
 --

 Key: SPARK-6489
 URL: https://issues.apache.org/jira/browse/SPARK-6489
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Konstantin Shaposhnikov
  Labels: starter

 Currently a query with lateral view explode(...) results in an execution 
 plan that reads all columns of the underlying RDD.
 E.g. given that the *ppl* table is a DataFrame created from the Person case class:
 {code}
 case class Person(val name: String, val age: Int, val data: Array[Int])
 {code}
 the following SQL:
 {code}
 select name, sum(d) from ppl lateral view explode(data) d as d group by name
 {code}
 executes as follows:
 {noformat}
 == Physical Plan ==
 Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L]
  Exchange (HashPartitioning [name#0], 200)
   Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS 
 PartialSum#38L]
Project [name#0,d#21]
 Generate explode(data#2), true, false
  InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation 
 [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), 
 (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at 
 ExistingRDD.scala:35), Some(ppl))
 {noformat}
 Note that the *age* column is not needed to produce the output, but it is still 
 read from the underlying RDD.
 A sample program to demonstrate the issue:
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.hive.HiveContext

 case class Person(val name: String, val age: Int, val data: Array[Int])
 object ExplodeDemo extends App {
   val ppl = Array(
     Person("A", 20, Array(10, 12, 19)),
     Person("B", 25, Array(7, 8, 4)),
     Person("C", 19, Array(12, 4, 232)))

   val conf = new SparkConf().setMaster("local[2]").setAppName("sql")
   val sc = new SparkContext(conf)
   val sqlCtx = new HiveContext(sc)
   import sqlCtx.implicits._
   val df = sc.makeRDD(ppl).toDF
   df.registerTempTable("ppl")
   sqlCtx.cacheTable("ppl") // cache the table, otherwise ExistingRDD (which does not support column pruning) is used
   val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name")
   s.explain(true)
 }
 {code}






[jira] [Updated] (SPARK-6695) Add an external iterator: a hadoop-like output collector

2015-04-03 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-6695:

Description: 
In practical use, we usually need to create a big iterator, meaning one that is too 
big in memory usage or too long in array size. On the one hand, it leads to too 
much memory consumption. On the other hand, one `Array` may not be able to hold all 
the elements, as Java array indices are of type 'int' (4 bytes, or 32 bits). So, 
IMHO, we could provide a `collector`, which has a buffer (100MB or any other size) 
and can spill data to disk. The use case may look like:

{code: borderStyle=solid}

   rdd.mapPartition { it = 
  ...
  val collector = new ExternalCollector()
  collector.collect(a)
  ...
  collector.iterator
  }
   
{code}

I have done some related works, and I need your opinions, thanks!

  was:
In practical use, we usually need to create a big iterator, which means too big 
in `memory usage` or too long in `array size`. On the one hand, it leads to too 
much memory consumption. On the other hand, one `Array` may not hold all the 
elements, as java array indices are of type 'int' (4 bytes or 32 bits). So, 
IMHO, we may provide a `collector`, which has a buffer, 100MB or any others, 
and could spill data into disk. The use case may like:

```

   rdd.mapPartition { it = 
  ...
  val collector = new ExternalCollector()
  collector.collect(a)
  ...
  collector.iterator
  }
   
```

I have done some related works, and I need your opinions, thanks!


 Add an external iterator: a hadoop-like output collector
 

 Key: SPARK-6695
 URL: https://issues.apache.org/jira/browse/SPARK-6695
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: uncleGen

 In practical use, we usually need to create a very big iterator, one that is too 
 big in `memory usage` or too long in `array size`. On the one hand, it leads 
 to too much memory consumption. On the other hand, a single `Array` may not hold 
 all the elements, as Java array indices are of type 'int' (4 bytes, or 32 
 bits). So, IMHO, we could provide a `collector` with a buffer (100MB, or any other 
 size) that can spill data to disk. The use case may look like:
 {code: borderStyle=solid}
 rdd.mapPartitions { it =>
   ...
   val collector = new ExternalCollector()
   collector.collect(a)
   ...
   collector.iterator
 }
 {code}
 I have done some related work, and I would like your opinions. Thanks!
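 As a purely editorial illustration of the collector described above (names, buffer size, and serialization choice are all made up; not code from this issue), a minimal spill-to-disk collector could look roughly like this:
 {code}
 import java.io._
 import scala.collection.mutable.ArrayBuffer

 // Minimal editorial sketch of a spill-to-disk collector: buffer elements in memory
 // and flush them to a temp file once `maxBuffered` items accumulate. Elements must
 // be serializable; single spill file, no cleanup or compression, illustration only.
 class ExternalCollector[T](maxBuffered: Int = 100000) {
   private val buffer = new ArrayBuffer[T]()
   private val spillFile = File.createTempFile("external-collector", ".bin")
   private val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(spillFile)))
   private var spilled = 0L

   def collect(elem: T): Unit = {
     buffer += elem
     if (buffer.size >= maxBuffered) spill()
   }

   private def spill(): Unit = {
     buffer.foreach(e => out.writeObject(e.asInstanceOf[AnyRef]))
     out.reset()                    // keep the stream's back-reference table from growing
     spilled += buffer.size
     buffer.clear()
   }

   // Flush what is left, then stream the spilled elements back from disk.
   def iterator: Iterator[T] = {
     spill()
     out.close()
     val in = new ObjectInputStream(new BufferedInputStream(new FileInputStream(spillFile)))
     new Iterator[T] {
       private var remaining = spilled
       def hasNext: Boolean = remaining > 0
       def next(): T = { remaining -= 1; in.readObject().asInstanceOf[T] }
     }
   }
 }
 {code}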



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6568:
---

Assignee: (was: Apache Spark)

 spark-shell.cmd --jars option does not accept the jar that has space in its 
 path
 

 Key: SPARK-6568
 URL: https://issues.apache.org/jira/browse/SPARK-6568
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Windows 8.1
Reporter: Masayoshi TSUZUKI

 spark-shell.cmd --jars option does not accept the jar that has space in its 
 path.
 The path of a jar sometimes contains spaces on Windows.
 {code}
 bin\spark-shell.cmd --jars C:\Program Files\some\jar1.jar
 {code}
 This produces:
 {code}
 Exception in thread main java.net.URISyntaxException: Illegal character in 
 path at index 10: C:/Program Files/some/jar1.jar
 {code}
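 As an editorial note on the likely cause (an assumption, not stated in this report): the path appears to be handed to java.net.URI without escaping, which rejects the space, whereas building the URI from a File escapes it:
 {code}
 import java.io.File
 import java.net.URI

 // new URI("C:/Program Files/some/jar1.jar") throws URISyntaxException
 // ("Illegal character in path at index 10") because of the unescaped space,
 // matching the error above. Building the URI from a File escapes it
 // (on Windows: file:/C:/Program%20Files/some/jar1.jar).
 val uri: URI = new File("C:\\Program Files\\some\\jar1.jar").toURI
 println(uri)
 {code}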



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6568:
---

Assignee: Apache Spark

 spark-shell.cmd --jars option does not accept the jar that has space in its 
 path
 

 Key: SPARK-6568
 URL: https://issues.apache.org/jira/browse/SPARK-6568
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Windows 8.1
Reporter: Masayoshi TSUZUKI
Assignee: Apache Spark

 spark-shell.cmd --jars option does not accept the jar that has space in its 
 path.
 The path of a jar sometimes contains spaces on Windows.
 {code}
 bin\spark-shell.cmd --jars C:\Program Files\some\jar1.jar
 {code}
 This produces:
 {code}
 Exception in thread main java.net.URISyntaxException: Illegal character in 
 path at index 10: C:/Program Files/some/jar1.jar
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394247#comment-14394247
 ] 

Apache Spark commented on SPARK-6239:
-

User 'kretes' has created a pull request for this issue:
https://github.com/apache/spark/pull/5246

 Spark MLlib fpm#FPGrowth minSupport should use long instead
 ---

 Key: SPARK-6239
 URL: https://issues.apache.org/jira/browse/SPARK-6239
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Littlestar
Priority: Minor

 Spark MLlib fpm#FPGrowth minSupport should use long instead
 ==
 val minCount = math.ceil(minSupport * count).toLong
 because:
 1. count (the number of records in the dataset) is not known before the data is read.
 2. minSupport is subject to double-precision rounding.
 From mahout's FPGrowthDriver.java:
 addOption("minSupport", "s", "(Optional) The minimum number of times a 
 co-occurrence must be present."
   + " Default Value: 3", "3");
 I just want to set minCount=2 for a test.
 Thanks.
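 To make the double-precision point concrete (an editorial example, not from the report): deriving minSupport from a target count and multiplying back does not reliably round-trip:
 {code}
 // Aiming for minCount = 2 on a dataset whose size is only known after it is read.
 val count = 12345L
 val minSupport = 2.0 / count                        // fraction chosen to target a count of 2
 val minCount = math.ceil(minSupport * count).toLong
 // Depending on floating-point rounding, minSupport * count may land a hair above 2.0,
 // in which case ceil() yields 3 rather than the intended 2.
 println(minCount)
 {code}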



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6687) In the hadoop 0.23 profile, hadoop pulls in an older version of netty which conflicts with akka's netty

2015-04-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394265#comment-14394265
 ] 

Sean Owen commented on SPARK-6687:
--

Does this cause any problem? I expect a lot of things to be different. This is 
also a very old version of Hadoop.

 In the hadoop 0.23 profile, hadoop pulls in an older version of netty which 
 conflicts with akka's netty 
 

 Key: SPARK-6687
 URL: https://issues.apache.org/jira/browse/SPARK-6687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Sai Nishanth Parepally

 excerpt from mvn -Dverbose dependency:tree of spark-core, note the 
 org.jboss.netty:netty dependency:
 [INFO] |  |  +- 
 org.apache.hadoop:hadoop-mapreduce-client-app:jar:0.23.10:compile
 [INFO] |  |  |  +- 
 org.apache.hadoop:hadoop-mapreduce-client-common:jar:0.23.10:compile
 [INFO] |  |  |  |  +- 
 (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for 
 duplicate)
 [INFO] |  |  |  |  +- 
 (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted 
 for duplicate)
 [INFO] |  |  |  |  +- 
 org.apache.hadoop:hadoop-yarn-server-common:jar:0.23.10:compile
 [INFO] |  |  |  |  |  +- 
 (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for 
 duplicate)
 [INFO] |  |  |  |  |  +- (org.apache.zookeeper:zookeeper:jar:3.4.5:compile - 
 version managed from 3.4.2; omitted for duplicate)
 [INFO] |  |  |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - 
 version managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  |  +- (org.jboss.netty:netty:jar:3.2.4.Final:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  |  +- 
 (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate)
 [INFO] |  |  |  |  |  +- (commons-io:commons-io:jar:2.1:compile - omitted for 
 duplicate)
 [INFO] |  |  |  |  |  +- (com.google.inject:guice:jar:3.0:compile - omitted 
 for duplicate)
 [INFO] |  |  |  |  |  +- 
 (com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.8:compile
  - omitted for duplicate)
 [INFO] |  |  |  |  |  +- (com.sun.jersey:jersey-server:jar:1.8:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  |  \- 
 (com.sun.jersey.contribs:jersey-guice:jar:1.8:compile - omitted for duplicate)
 [INFO] |  |  |  |  +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  +- (org.apache.hadoop:hadoop-hdfs:jar:1.23.10:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - 
 omitted for duplicate)
 [INFO] |  |  |  +- 
 org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:0.23.10:compile
 [INFO] |  |  |  |  +- 
 (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted 
 for duplicate)
 [INFO] |  |  |  |  +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - 
 omitted for duplicate)
 [INFO] |  |  |  +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - 
 omitted for duplicate)
 [INFO] |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed 
 from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - 
 omitted for duplicate)
 [INFO] |  |  |  \- org.jboss.netty:netty:jar:3.2.4.Final:compile
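 As an editorial aside (not part of this report, and the dependency coordinates below are illustrative only): if the older org.jboss.netty artifact does cause runtime conflicts in an application build, it can be excluded from the Hadoop dependency, e.g. in build.sbt:
 {code}
 // Illustrative build.sbt fragment: keep the old org.jboss.netty artifact off the
 // application classpath by excluding it from the Hadoop dependency.
 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "0.23.10" exclude("org.jboss.netty", "netty")
 {code}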



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6681) JAVA_HOME error with upgrade to Spark 1.3.0

2015-04-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394290#comment-14394290
 ] 

Sean Owen commented on SPARK-6681:
--

That literal doesn't occur in Spark. That looks like how YARN writes 
placeholders to be expanded locally (see {{ApplicationConstants}}). My guess is 
that you don't have {{JAVA_HOME}} exposed to the local YARN workers, or, 
somehow you have some YARN version mismatch, maybe caused by bundling YARN with 
your app. YARN stuff changed in general and might have uncovered a problem; at 
this point I doubt it's a Spark issue as otherwise YARN wouldn't really work at 
all.

 JAVA_HOME error with upgrade to Spark 1.3.0
 ---

 Key: SPARK-6681
 URL: https://issues.apache.org/jira/browse/SPARK-6681
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.3.0
 Environment: Client is Mac OS X version 10.10.2, cluster is running 
 HDP 2.1 stack.
Reporter: Ken Williams

 I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 
 1.3.0, so I changed my `build.sbt` like so:
 {code}
 -libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
 +libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
 {code}
 then make an `assembly` jar, and submit it:
 {code}
 HADOOP_CONF_DIR=/etc/hadoop/conf \
 spark-submit \
 --driver-class-path=/etc/hbase/conf \
 --conf spark.hadoop.validateOutputSpecs=false \
 --conf 
 spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \
 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
 --deploy-mode=cluster \
 --master=yarn \
 --class=TestObject \
 --num-executors=54 \
 target/scala-2.11/myapp-assembly-1.2.jar
 {code}
 The job fails to submit, with the following exception in the terminal:
 {code}
 15/03/19 10:30:07 INFO yarn.Client: 
 15/03/19 10:20:03 INFO yarn.Client: 
client token: N/A
diagnostics: Application application_1420225286501_4698 failed 2 times 
 due to AM 
  Container for appattempt_1420225286501_4698_02 exited with  
 exitCode: 127 
  due to: Exception from container-launch: 
 org.apache.hadoop.util.Shell$ExitCodeException: 
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
   at org.apache.hadoop.util.Shell.run(Shell.java:379)
   at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
   at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {code}
 Finally, I go and check the YARN app master’s web interface (since the job is 
 there, I know it at least made it that far), and the only logs it shows are 
 these:
 {code}
 Log Type: stderr
 Log Length: 61
 /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory
 
 Log Type: stdout
 Log Length: 0
 {code}
 I’m not sure how to interpret that - is {{ {{JAVA_HOME}} }} a literal 
 (including the brackets) that’s somehow making it into a script?  Is this 
 coming from the worker nodes or the driver?  Anything I can do to experiment 
 and troubleshoot?
 I do have {{JAVA_HOME}} set in the hadoop config files on all the nodes of 
 the cluster:
 {code}
 % grep JAVA_HOME /etc/hadoop/conf/*.sh
 /etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
 /etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31
 {code}
 Has this behavior changed in 1.3.0 since 1.2.1?  Using 1.2.1 and making no 
 other changes, the job completes fine.
 (Note: I originally posted this on the Spark mailing list and also on Stack 
 Overflow, I'll update both places if/when I find a solution.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming

2015-04-03 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-6691:
---
Issue Type: Improvement  (was: New Feature)

 Abstract and add a dynamic RateLimiter for Spark Streaming
 --

 Key: SPARK-6691
 URL: https://issues.apache.org/jira/browse/SPARK-6691
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao

 Flow control (or rate control) for input data is very important in a streaming 
 system, especially for Spark Streaming to stay stable and up-to-date. An 
 unexpected flood of incoming data, or an ingestion rate beyond the computation 
 power of the cluster, will make the system unstable and increase the delay. 
 Because of Spark Streaming’s job generation and processing pattern, this delay 
 accumulates and eventually introduces unacceptable exceptions.
 
 Currently, in Spark Streaming’s receiver-based input stream, there is a 
 RateLimiter in BlockGenerator which controls the ingestion rate of input 
 data, but the current implementation has several limitations:
 # The max ingestion rate is set by the user through configuration beforehand; 
 users may lack the experience to choose an appropriate value before the 
 application is running.
 # This configuration is fixed for the lifetime of the application, which 
 means you need to consider the worst-case scenario to pick a reasonable 
 value.
 # Input streams like DirectKafkaInputStream need to maintain a separate 
 solution to achieve the same functionality.
 # The lack of slow-start control makes the whole system easily trapped in large 
 processing and scheduling delays at the very beginning.
 
 So here we propose a new dynamic RateLimiter, as well as a new RateLimiter 
 interface, to improve the whole system's stability. The targets are:
 * Dynamically adjust the ingestion rate according to the processing rate of 
 previously finished jobs.
 * Offer a uniform solution not only for receiver-based input streams, but 
 also for direct streams like DirectKafkaInputStream and new ones.
 * Use a slow-start rate to control network congestion when a job is started.
 * Provide a pluggable framework to make extensions easier to maintain.
 
 Here is the design doc 
 (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing)
  and working branch 
 (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).
 Any comment would be greatly appreciated.
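 As a purely editorial sketch of the kind of pluggable interface being proposed (names and signatures below are hypothetical, not the API from the design doc):
 {code}
 // Hypothetical shape of a pluggable, dynamic rate limiter: the streaming engine
 // reports how the last batch went, and the limiter answers with the ingestion
 // rate (records/sec) to apply to the next batch.
 trait DynamicRateLimiter {
   def computeRate(batchRecords: Long, processingDelayMs: Long, batchIntervalMs: Long): Long
 }

 // Simple example policy: track the observed throughput, never exceed a hard cap,
 // and begin from a low "slow start" rate.
 class ThroughputBasedRateLimiter(maxRate: Long, slowStartRate: Long) extends DynamicRateLimiter {
   private var currentRate: Long = slowStartRate

   override def computeRate(batchRecords: Long, processingDelayMs: Long, batchIntervalMs: Long): Long = {
     if (processingDelayMs > 0) {
       val observed = batchRecords * 1000 / processingDelayMs   // records/sec actually processed
       currentRate = math.min(maxRate, math.max(slowStartRate, observed))
     }
     currentRate
   }
 }
 {code}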



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6692) Make it possible to kill AM in YARN cluster mode when the client is terminated

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394135#comment-14394135
 ] 

Apache Spark commented on SPARK-6692:
-

User 'piaozhexiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/5343

 Make it possible to kill AM in YARN cluster mode when the client is terminated
 --

 Key: SPARK-6692
 URL: https://issues.apache.org/jira/browse/SPARK-6692
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Priority: Minor
  Labels: yarn

 I understand that the yarn-cluster mode is designed for a fire-and-forget 
 model; therefore, terminating the yarn client doesn't kill the AM.
 However, it is very common for users to submit Spark jobs via a job scheduler 
 (e.g. Apache Oozie) or a remote job server (e.g. Netflix Genie), where it is 
 expected that killing the yarn client will terminate the AM. 
 It is true that the yarn-client mode can be used in such cases. But then the 
 yarn client sometimes needs lots of heap memory for big jobs if it runs in 
 the yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs 
 because the AM can be given arbitrary heap memory, unlike the yarn client. So it 
 would be very useful to make it possible to kill the AM even in the yarn-cluster 
 mode.
 In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon 
 as they're accepted (but not yet running). Although they are eventually 
 shut down after the AM timeout, it would be nice if the AM could be killed 
 immediately in such cases too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6692) Make it possible to kill AM in YARN cluster mode when the client is terminated

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6692:
---

Assignee: (was: Apache Spark)

 Make it possible to kill AM in YARN cluster mode when the client is terminated
 --

 Key: SPARK-6692
 URL: https://issues.apache.org/jira/browse/SPARK-6692
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Priority: Minor
  Labels: yarn

 I understand that the yarn-cluster mode is designed for a fire-and-forget 
 model; therefore, terminating the yarn client doesn't kill the AM.
 However, it is very common for users to submit Spark jobs via a job scheduler 
 (e.g. Apache Oozie) or a remote job server (e.g. Netflix Genie), where it is 
 expected that killing the yarn client will terminate the AM. 
 It is true that the yarn-client mode can be used in such cases. But then the 
 yarn client sometimes needs lots of heap memory for big jobs if it runs in 
 the yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs 
 because the AM can be given arbitrary heap memory, unlike the yarn client. So it 
 would be very useful to make it possible to kill the AM even in the yarn-cluster 
 mode.
 In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon 
 as they're accepted (but not yet running). Although they are eventually 
 shut down after the AM timeout, it would be nice if the AM could be killed 
 immediately in such cases too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array

2015-04-03 Thread Ishaaq Chandy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394326#comment-14394326
 ] 

Ishaaq Chandy commented on SPARK-2489:
--

I see [~joesu]'s pull request got closed without being merged in. Does this 
mean that there is currently no solution/workaround to this issue?

 Unsupported parquet datatype optional fixed_len_byte_array
 --

 Key: SPARK-2489
 URL: https://issues.apache.org/jira/browse/SPARK-2489
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Pei-Lun Lee

 tested against commit 9fe693b5
 {noformat}
 scala sqlContext.parquetFile(/tmp/foo)
 java.lang.RuntimeException: Unsupported parquet datatype optional 
 fixed_len_byte_array(4) b
   at scala.sys.package$.error(package.scala:27)
   at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58)
   at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109)
   at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282)
   at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279)
 {noformat}
 example avro schema
 {noformat}
 protocol Test {
 fixed Bytes4(4);
 record Foo {
 union {null, Bytes4} b;
 }
 }
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6689) MiniYarnCLuster still test failed with hadoop-2.2

2015-04-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6689:
-
Priority: Minor  (was: Major)

I imagine this is a problem because you are building with SBT, and it can't 
fully parse the Maven build. The fix depends on some Maven profiles which may 
not fully affect the SBT build in the same way. I'm not 100% sure, but I know 
there is some difference.

The build also fails for me with your build command, but succeeds with Maven. 
Since Maven is the build of reference I am not sure if this is such a big deal 
except to developers who have to work specifically with Hadoop 2.2 and want to 
use SBT.

It'd be great if you can figure out a fix but it's not affecting the main build.

 MiniYarnCLuster still test failed with hadoop-2.2
 -

 Key: SPARK-6689
 URL: https://issues.apache.org/jira/browse/SPARK-6689
 Project: Spark
  Issue Type: Test
  Components: Tests, YARN
Affects Versions: 1.3.0
Reporter: Zhang, Liye
Priority: Minor

 When running the unit test *YarnClusterSuite* with *hadoop-2.2*, an exception is 
 thrown because of *Timed out waiting for RM to come up*. Some previously related 
 discussion can be traced in 
 [spark-3710|https://issues.apache.org/jira/browse/SPARK-3710] 
 ([PR2682|https://github.com/apache/spark/pull/2682]) and 
 [spark-2778|https://issues.apache.org/jira/browse/SPARK-2778] 
 ([PR2605|https://github.com/apache/spark/pull/2605]). 
 With command *build/sbt -Pyarn -Phadoop-2.2 test-only 
 org.apache.spark.deploy.yarn.YarnClusterSuite*, will get following 
 exceptions: 
 {noformat}
 [info] Exception encountered when attempting to run a suite with class name: 
 org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** (15 seconds, 
 799 milliseconds)
 [info]   java.lang.IllegalStateException: Timed out waiting for RM to come up.
 [info]   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:114)
 [info]   at 
 org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
 [info]   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:44)
 [info]   at 
 org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
 [info]   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.run(YarnClusterSuite.scala:44)
 [info]   at 
 org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
 [info]   at 
 org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
 [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
 [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
 [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 [info]   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 [info]   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 [info]   at java.lang.Thread.run(Thread.java:745)
 {noformat}
 And without *-Phadoop-2.2*, or replacing it with *-Dhadoop.version* (e.g. 
 build/sbt -Pyarn test-only org.apache.spark.deploy.yarn.YarnClusterSuite), 
 more info comes out:
 {noformat}
 Exception in thread Thread-7 java.lang.NoClassDefFoundError: 
 org/mortbay/jetty/servlet/Context
   at org.apache.hadoop.yarn.webapp.WebApps.$for(WebApps.java:309)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:602)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:655)
   at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
   at 
 org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper$2.run(MiniYARNCluster.java:219)
 Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.servlet.Context
   at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 [info] Resolving org.apache.hadoop#hadoop-yarn-server-common;2.2.0 ...
 Exception in thread Thread-18 java.lang.NoClassDefFoundError: 
 org/mortbay/jetty/servlet/Context
   at org.apache.hadoop.yarn.webapp.WebApps.$for(WebApps.java:309)
   at 
 org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer.serviceStart(WebServer.java:62)
   at 
 org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
   at 
 org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
   at 

[jira] [Created] (SPARK-6697) PeriodicGraphCheckpointer is not clear Edges.

2015-04-03 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-6697:
--

 Summary: PeriodicGraphCheckpointer is not clear Edges.
 Key: SPARK-6697
 URL: https://issues.apache.org/jira/browse/SPARK-6697
 Project: Spark
  Issue Type: Bug
  Components: GraphX, MLlib
Affects Versions: 1.3.0
Reporter: Guoqiang Li


When I run this [branch (lrGraphxSGD)|https://github.com/witgo/spark/tree/lrGraphxSGD], 
PeriodicGraphCheckpointer only clears the vertices.

{code} 
def run(iterations: Int): Unit = {
  for (iter <- 1 to iterations) {
    logInfo(s"Start train (Iteration $iter/$iterations)")
    val margin = forward()
    margin.setName(s"margin-$iter").persist(storageLevel)
    println(s"train (Iteration $iter/$iterations) cost : ${error(margin)}")
    var gradient = backward(margin)
    gradient = updateDeltaSum(gradient, iter)

    dataSet = updateWeight(gradient, iter)
    dataSet.vertices.setName(s"vertices-$iter")
    dataSet.edges.setName(s"edges-$iter")
    dataSet.persist(storageLevel)
    graphCheckpointer.updateGraph(dataSet)

    margin.unpersist(blocking = false)
    gradient.unpersist(blocking = false)
    logInfo(s"End train (Iteration $iter/$iterations)")
    innerIter += 1
  }
  graphCheckpointer.deleteAllCheckpoints()
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)

2015-04-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394231#comment-14394231
 ] 

Sean Owen commented on SPARK-6664:
--

Yes, _k_ estimates are better than one; this is both more expensive and less 
important when the data size is large. But, yes it has value and I can see the 
argument that it's more important if the response is actually time-dependent. I 
wasn't suggesting that {{MLUtils.kFold}} implements this, but that it was a 
related piece of code. If ordering matters and the input has an ordering that 
biases the result, then yes you would randomly permute the partition or RDD. 
This isn't true for every algorithm but for some. 

Same thing here really, I think you can order, bucket by range, and union in 
the straightforward way and it will be as performant as anything I can think 
of. You have to write some code, but it's flexible. 

The question is how much it's worth adding another method versus how often this 
is used. I can see this being useful for time series. I suppose that if it 
turns out there's a much faster way to do this but it's complex, and it is 
used, then it does need to be wrapped up in a utility method.
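
As a rough editorial sketch of that straightforward "order, bucket by range" approach (made-up data and boundaries; not code from this thread):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Split a sorted pair RDD into range buckets by filtering against the boundaries.
// Each resulting RDD covers [lo, hi); this launches one filter per bucket rather
// than the partition-splitting optimisation discussed above.
object SplitByBoundariesDemo extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("split-demo"))
  val data = sc.parallelize(Seq(1 -> "a", 5 -> "b", 10 -> "c", 42 -> "d")).sortByKey()
  val boundaries = Seq(5, 20)
  val ranges = (Int.MinValue +: boundaries).zip(boundaries :+ Int.MaxValue)
  val buckets = ranges.map { case (lo, hi) => data.filter { case (k, _) => k >= lo && k < hi } }
  buckets.zipWithIndex.foreach { case (rdd, i) => println(s"bucket $i: ${rdd.collect().toSeq}") }
  sc.stop()
}
{code}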

 Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
 --

 Key: SPARK-6664
 URL: https://issues.apache.org/jira/browse/SPARK-6664
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Florian Verhein

 I can't find this functionality (if I missed something, apologies!), but it 
 would be very useful for evaluating ml models.  
 *Use case example* 
 suppose you have pre-processed web logs for a few months, and now want to 
 split it into a training set (where you train a model to predict some aspect 
 of site accesses, perhaps per user) and an out of time test set (where you 
 evaluate how well your model performs in the future). This example has just a 
 single split, but in general you could want more for cross validation. You 
 may also want to have multiple overlapping intervals.   
 *Specification* 
 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), 
 return n+1 RDDs such that values in the ith RDD are within the (i-1)th and 
 ith boundary.
 2. More complex alternative (but similar under the hood): provide a sequence 
 of possibly overlapping intervals (ordered by the start key of the interval), 
 and return the RDDs containing values within those intervals. 
 *Implementation ideas / notes for 1*
 - The ordered RDDs are likely RangePartitioned (or there should be a simple 
 way to find ranges from partitions in an ordered RDD)
 - Find the partitions containing the boundary, and split them in two.  
 - Construct the new RDDs from the original partitions (and any split ones)
 I suspect this could be done by launching only a few jobs to split the 
 partitions containing the boundaries. 
 Alternatively, it might be possible to decorate these partitions and use them 
 in more than one RDD. I.e. let one of these partitions (for boundary i) be p. 
 Apply two decorators p' and p'', where p' is masks out values above the ith 
 boundary, and p'' masks out values below the ith boundary. Any operations on 
 these partitions apply only to values not masked out. Then assign p' to the 
 ith output RDD and p'' to the (i+1)th output RDD.
 If I understand Spark correctly, this should not require any jobs. Not sure 
 whether it's worth trying this optimisation.
 *Implementation ideas / notes for 2*
 This is very similar, except that we have to handle entire (or parts) of 
 partitions belonging to more than one output RDD, since they are no longer 
 mutually exclusive. But since RDDs are immutable(??), the decorator idea 
 should still work?
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6695) Add an external iterator: a hadoop-like output collector

2015-04-03 Thread uncleGen (JIRA)
uncleGen created SPARK-6695:
---

 Summary: Add an external iterator: a hadoop-like output collector
 Key: SPARK-6695
 URL: https://issues.apache.org/jira/browse/SPARK-6695
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: uncleGen


In practical use, we usually need to create a big iterator, which means too big 
in `memory usage` or too long in `array size`. On the one hand, it leads to too 
much memory consumption. On the other hand, one `Array` may not hold all the 
elements, as java array indices are of type 'int' (4 bytes or 32 bits). So, 
IMHO, we may provide a `collector`, which has a buffer, 100MB or any others, 
and could spill data into disk. The use case may like:

```

   rdd.mapPartition { it = 
  ...
  val collector = new ExteranalCollector()
  collector.collect(a)
  ...
  collector.iterator
  }
   
```

I have done some related works, and I need your opinions, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6695) Add an external iterator: a hadoop-like output collector

2015-04-03 Thread uncleGen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-6695:

Description: 
In practical use, we usually need to create a big iterator, which means too big 
in `memory usage` or too long in `array size`. On the one hand, it leads to too 
much memory consumption. On the other hand, one `Array` may not hold all the 
elements, as java array indices are of type 'int' (4 bytes or 32 bits). So, 
IMHO, we may provide a `collector`, which has a buffer, 100MB or any others, 
and could spill data into disk. The use case may like:

```

   rdd.mapPartition { it = 
  ...
  val collector = new ExternalCollector()
  collector.collect(a)
  ...
  collector.iterator
  }
   
```

I have done some related works, and I need your opinions, thanks!

  was:
In practical use, we usually need to create a big iterator, which means too big 
in `memory usage` or too long in `array size`. On the one hand, it leads to too 
much memory consumption. On the other hand, one `Array` may not hold all the 
elements, as java array indices are of type 'int' (4 bytes or 32 bits). So, 
IMHO, we may provide a `collector`, which has a buffer, 100MB or any others, 
and could spill data into disk. The use case may like:

```

   rdd.mapPartition { it = 
  ...
  val collector = new ExteranalCollector()
  collector.collect(a)
  ...
  collector.iterator
  }
   
```

I have done some related works, and I need your opinions, thanks!


 Add an external iterator: a hadoop-like output collector
 

 Key: SPARK-6695
 URL: https://issues.apache.org/jira/browse/SPARK-6695
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: uncleGen

 In practical use, we usually need to create a big iterator, which means too 
 big in `memory usage` or too long in `array size`. On the one hand, it leads 
 to too much memory consumption. On the other hand, one `Array` may not hold 
 all the elements, as java array indices are of type 'int' (4 bytes or 32 
 bits). So, IMHO, we may provide a `collector`, which has a buffer, 100MB or 
 any others, and could spill data into disk. The use case may like:
 ```
rdd.mapPartition { it = 
   ...
   val collector = new ExternalCollector()
   collector.collect(a)
   ...
   collector.iterator
   }

 ```
 I have done some related works, and I need your opinions, thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6695) Add an external iterator: a hadoop-like output collector

2015-04-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394236#comment-14394236
 ] 

Sean Owen commented on SPARK-6695:
--

I am not sure what the use case is here. You already have an iterator; why does 
spilling it to disk then re-iterating over it help?

 Add an external iterator: a hadoop-like output collector
 

 Key: SPARK-6695
 URL: https://issues.apache.org/jira/browse/SPARK-6695
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: uncleGen

 In practical use, we usually need to create a very big iterator, one that is too 
 big in `memory usage` or too long in `array size`. On the one hand, it leads 
 to too much memory consumption. On the other hand, a single `Array` may not hold 
 all the elements, as Java array indices are of type 'int' (4 bytes, or 32 
 bits). So, IMHO, we could provide a `collector` with a buffer (100MB, or any other 
 size) that can spill data to disk. The use case may look like:
 {code: borderStyle=solid}
 rdd.mapPartitions { it =>
   ...
   val collector = new ExternalCollector()
   collector.collect(a)
   ...
   collector.iterator
 }
 {code}
 I have done some related work, and I would like your opinions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.

2015-04-03 Thread Jin Adachi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394251#comment-14394251
 ] 

Jin Adachi commented on SPARK-6694:
---

The SparkSQL CLI's --database option doesn't work; the database is forced to 
'default'.
For example, suppose I have a database 'test_db' with a table 't_user'.
I get the following error.

{code:}
$ spark-sql --database test_db -e 'select * from t_user order by id'
:
15/04/03 13:26:30 ERROR metadata.Hive: 
NoSuchObjectException(message:default.t_user table not found)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036)
at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy9.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:180)
at 
org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:252)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
at 
org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:252)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:175)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:187)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:182)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:186)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:236)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:192)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at 

[jira] [Commented] (SPARK-6665) Randomly Shuffle an RDD

2015-04-03 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394291#comment-14394291
 ] 

Florian Verhein commented on SPARK-6665:


Fair enough. I'll have to implement it because I need it so may as well report 
back when I've had the chance to (perhaps there's a better place for it - e.g. 
not in the core API). 


 Randomly Shuffle an RDD 
 

 Key: SPARK-6665
 URL: https://issues.apache.org/jira/browse/SPARK-6665
 Project: Spark
  Issue Type: New Feature
  Components: Spark Shell
Reporter: Florian Verhein
Priority: Minor

 *Use case* 
 RDD created in a way that has some ordering, but you need to shuffle it 
 because the ordering would cause problems downstream. E.g.
 - will be used to train a ML algorithm that makes stochastic assumptions 
 (like SGD) 
 - used as input for cross validation. e.g. after the shuffle, you could just 
 grab partitions (or part files if saved to hdfs) as folds
 Related question in mailing list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html
 *Possible implementation*
 As mentioned by [~sowen] in the above thread, one could sort by (a good hash of 
 (the element, or the key if it's paired, plus a random salt)). 
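 An editorial sketch of that sort-by-salted-hash idea (made-up data; the salt is fixed for one shuffle; not code from this issue):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import scala.util.Random
 import scala.util.hashing.MurmurHash3

 object RandomShuffleDemo extends App {
   val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("shuffle-demo"))
   val rdd = sc.parallelize(1 to 100)            // an RDD with an unwanted ordering
   val salt = Random.nextInt()                   // one salt per shuffle
   val shuffled = rdd
     .map(x => (MurmurHash3.productHash((x, salt)), x))   // key each element by hash of (element, salt)
     .sortByKey()
     .values
   println(shuffled.take(10).mkString(", "))
   sc.stop()
 }
 {code}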



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6638) optimize StringType in SQL

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394614#comment-14394614
 ] 

Apache Spark commented on SPARK-6638:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/5350

 optimize StringType in SQL
 --

 Key: SPARK-6638
 URL: https://issues.apache.org/jira/browse/SPARK-6638
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu

 java.lang.String is encoded in UTF-16, it's not efficient for IO. We could 
 change to use Array[Byte] of UTF-8 internally for better performance.
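 For illustration only (editorial, not from the issue): the size difference is easy to see for ASCII-heavy data, and decoding back is lossless:
 {code}
 import java.nio.charset.StandardCharsets

 // An ASCII string: 5 bytes as UTF-8, but 5 two-byte code units (10 bytes) as UTF-16.
 val s = "Spark"
 val utf8: Array[Byte] = s.getBytes(StandardCharsets.UTF_8)
 val utf16: Array[Byte] = s.getBytes(StandardCharsets.UTF_16BE)
 println(s"UTF-8: ${utf8.length} bytes, UTF-16: ${utf16.length} bytes")   // UTF-8: 5, UTF-16: 10
 val roundTrip = new String(utf8, StandardCharsets.UTF_8)                 // == "Spark"
 {code}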



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6330) newParquetRelation gets incorrect FileSystem

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394839#comment-14394839
 ] 

Apache Spark commented on SPARK-6330:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5353

 newParquetRelation gets incorrect FileSystem
 

 Key: SPARK-6330
 URL: https://issues.apache.org/jira/browse/SPARK-6330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
 Fix For: 1.3.1, 1.4.0


 Here's a snippet from newParquet.scala:
 def refresh(): Unit = {
   val fs = FileSystem.get(sparkContext.hadoopConfiguration)
   // Support either reading a collection of raw Parquet part-files, or a 
 collection of folders
   // containing Parquet files (e.g. partitioned Parquet table).
   val baseStatuses = paths.distinct.map { p =>
  val qualified = fs.makeQualified(new Path(p))
  if (!fs.exists(qualified) && maybeSchema.isDefined) {
   fs.mkdirs(qualified)
   prepareMetadata(qualified, maybeSchema.get, 
 sparkContext.hadoopConfiguration)
 }
 fs.getFileStatus(qualified)
   }.toArray
 If we are running this locally and path points to S3, fs would be incorrect. 
 A fix is to construct fs for each file separately.
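 An editorial sketch of the suggested fix, reusing the names from the snippet above (not the actual patch):
 {code}
 import org.apache.hadoop.fs.Path

 // Resolve the FileSystem from each path (S3, HDFS, local, ...) instead of the
 // single default FileSystem obtained from the configuration.
 val baseStatuses = paths.distinct.map { p =>
   val path = new Path(p)
   val fs = path.getFileSystem(sparkContext.hadoopConfiguration)   // per-path FileSystem
   val qualified = fs.makeQualified(path)
   if (!fs.exists(qualified) && maybeSchema.isDefined) {
     fs.mkdirs(qualified)
     prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration)
   }
   fs.getFileStatus(qualified)
 }.toArray
 {code}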



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6697) PeriodicGraphCheckpointer is not clear edges.

2015-04-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394785#comment-14394785
 ] 

Joseph K. Bradley commented on SPARK-6697:
--

Thanks for pointing this out.  I don't recall this happening in LDA, but I 
think that's because LDA's edges do not change.  We may need to add an option 
for this to PeriodicGraphCheckpointer in order to make it generally usable 
beyond LDA.

 PeriodicGraphCheckpointer is not clear edges.
 -

 Key: SPARK-6697
 URL: https://issues.apache.org/jira/browse/SPARK-6697
 Project: Spark
  Issue Type: Bug
  Components: GraphX, MLlib
Affects Versions: 1.3.0
Reporter: Guoqiang Li
 Attachments: QQ20150403-1.png


 When I run this [branch (lrGraphxSGD)|https://github.com/witgo/spark/tree/lrGraphxSGD], 
 PeriodicGraphCheckpointer only clears the vertices.
 {code} 
 def run(iterations: Int): Unit = {
   for (iter <- 1 to iterations) {
     logInfo(s"Start train (Iteration $iter/$iterations)")
     val margin = forward()
     margin.setName(s"margin-$iter").persist(storageLevel)
     println(s"train (Iteration $iter/$iterations) cost : ${error(margin)}")
     var gradient = backward(margin)
     gradient = updateDeltaSum(gradient, iter)
     dataSet = updateWeight(gradient, iter)
     dataSet.vertices.setName(s"vertices-$iter")
     dataSet.edges.setName(s"edges-$iter")
     dataSet.persist(storageLevel)
     graphCheckpointer.updateGraph(dataSet)
     margin.unpersist(blocking = false)
     gradient.unpersist(blocking = false)
     logInfo(s"End train (Iteration $iter/$iterations)")
     innerIter += 1
   }
   graphCheckpointer.deleteAllCheckpoints()
 }

 // Updater for L1 regularized problems
 private def updateWeight(delta: VertexRDD[Double], iter: Int): Graph[VD, ED] = {
   val thisIterStepSize = if (useAdaGrad) stepSize else stepSize / sqrt(iter)
   val thisIterL1StepSize = stepSize / sqrt(iter)
   val newVertices = dataSet.vertices.leftJoin(delta) { (_, attr, gradient) =>
     gradient match {
       case Some(gard) => {
         var weight = attr
         weight -= thisIterStepSize * gard
         if (regParam > 0.0 && weight != 0.0) {
           val shrinkageVal = regParam * thisIterL1StepSize
           weight = signum(weight) * max(0.0, abs(weight) - shrinkageVal)
         }
         assert(!weight.isNaN)
         weight
       }
       case None => attr
     }
   }
   GraphImpl(newVertices, dataSet.edges)
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6615) Add missing methods to Word2Vec's Python API

2015-04-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6615:
-
Assignee: Kai Sasaki

 Add missing methods to Word2Vec's Python API
 

 Key: SPARK-6615
 URL: https://issues.apache.org/jira/browse/SPARK-6615
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Assignee: Kai Sasaki
Priority: Minor
  Labels: MLLib,, Python
 Fix For: 1.4.0


 This is the sub-task of 
 [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254].
 Wrap missing method for {{Word2Vec}} and {{Word2VecModel}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6615) Python API for Word2Vec

2015-04-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6615.
--
Resolution: Fixed

Issue resolved by pull request 5296
[https://github.com/apache/spark/pull/5296]

 Python API for Word2Vec
 ---

 Key: SPARK-6615
 URL: https://issues.apache.org/jira/browse/SPARK-6615
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
  Labels: MLLib,, Python
 Fix For: 1.4.0


 This is the sub-task of 
 [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254].
 Wrap missing method for {{Word2Vec}} and {{Word2VecModel}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6615) Add missing methods to Word2Vec's Python API

2015-04-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6615:
-
Summary: Add missing methods to Word2Vec's Python API  (was: Python API for 
Word2Vec)

 Add missing methods to Word2Vec's Python API
 

 Key: SPARK-6615
 URL: https://issues.apache.org/jira/browse/SPARK-6615
 Project: Spark
  Issue Type: Task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
  Labels: MLLib,, Python
 Fix For: 1.4.0


 This is the sub-task of 
 [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254].
 Wrap missing method for {{Word2Vec}} and {{Word2VecModel}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK

2015-04-03 Thread Michael Bieniosek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Bieniosek updated SPARK-6698:
-
Attachment: SPARK-6698.patch

Attaching proposed patch to copy StorageLevel from input RDD

 RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
 --

 Key: SPARK-6698
 URL: https://issues.apache.org/jira/browse/SPARK-6698
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Michael Bieniosek
 Attachments: SPARK-6698.patch


 In RandomForest.scala the feature input is persisted with 
 StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging 
 rate is set at 100%.  This forces the RDD to be stored unserialized, which 
 causes major JVM GC headaches if the RDD is sizable.  
 Something similar happens in NodeIdCache.scala though I believe in this case 
 the RDD is smaller.
 A simple fix would be to use the same StorageLevel as the input RDD. 
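 An editorial sketch of that suggestion (a hypothetical helper, not the attached patch):
 {code}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.storage.StorageLevel

 // Persist derived data with the caller's storage level instead of a hardcoded one,
 // falling back to MEMORY_AND_DISK only when the input is not persisted at all.
 def persistLikeInput[T](derived: RDD[T], input: RDD[_]): RDD[T] = {
   val level =
     if (input.getStorageLevel == StorageLevel.NONE) StorageLevel.MEMORY_AND_DISK
     else input.getStorageLevel
   derived.persist(level)
 }
 {code}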



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK

2015-04-03 Thread Michael Bieniosek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Bieniosek updated SPARK-6698:
-
Attachment: (was: SPARK-6698.patch)

 RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
 --

 Key: SPARK-6698
 URL: https://issues.apache.org/jira/browse/SPARK-6698
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Michael Bieniosek
Priority: Minor

 In RandomForest.scala the feature input is persisted with 
 StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging 
 rate is set at 100%.  This forces the RDD to be stored unserialized, which 
 causes major JVM GC headaches if the RDD is sizable.  
 Something similar happens in NodeIdCache.scala though I believe in this case 
 the RDD is smaller.
 A simple fix would be to use the same StorageLevel as the input RDD. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6682:
-
Description: 
In MLlib, we have for some time been unofficially moving away from the old 
static train() methods and moving towards builder patterns.  This JIRA is to 
discuss this move and (hopefully) make it official.

Old static train() API:
{code}
val myModel = NaiveBayes.train(myData, ...)
{code}

New builder pattern API:
{code}
val nb = new NaiveBayes().setLambda(0.1)
val myModel = nb.train(myData)
{code}

Pros of the builder pattern:
* Much less code when algorithms have many parameters.  Since Java does not 
support default arguments, we required *many* duplicated static train() methods 
(for each prefix set of arguments).
* Helps to enforce default parameters.  Users should ideally not have to even 
think about setting parameters if they just want to try an algorithm quickly.
* Matches spark.ml API

Cons of the builder pattern:
* In Python APIs, static train methods are more Pythonic.

Proposal:
* Scala/Java: We should start deprecating the old static train() methods.  We 
must keep them for API stability, but deprecating will help with API 
consistency, making it clear that everyone should use the builder pattern.  As 
we deprecate them, we should make sure that the builder pattern supports all 
parameters.
* Python: Keep static train methods.

CC: [~mengxr]

  was:
In MLlib, we have for some time been unofficially moving away from the old 
static train() methods and moving towards builder patterns.  This JIRA is to 
discuss this move and (hopefully) make it official.

Old static train() API:
{code}
val myModel = NaiveBayes.train(myData, ...)
{code}

New builder pattern API:
{code}
val nb = new NaiveBayes().setLambda(0.1)
val myModel = nb.train(myData)
{code}

Pros of the builder pattern:
* Much less code when algorithms have many parameters.  Since Java does not 
support default arguments, we required *many* duplicated static train() methods 
(for each prefix set of arguments).
* Helps to enforce default parameters.  Users should ideally not have to even 
think about setting parameters if they just want to try an algorithm quickly.
* Matches spark.ml API

Cons:
* In Python APIs, static train methods are more Pythonic.

Proposal:
* Scala/Java: We should start deprecating the old static train() methods.  We 
must keep them for API stability, but deprecating will help with API 
consistency, making it clear that everyone should use the builder pattern.  As 
we deprecate them, we should make sure that the builder pattern supports all 
parameters.
* Python: Keep static train methods.

CC: [~mengxr]


 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6330) newParquetRelation gets incorrect FileSystem

2015-04-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-6330.
-
Resolution: Fixed

 newParquetRelation gets incorrect FileSystem
 

 Key: SPARK-6330
 URL: https://issues.apache.org/jira/browse/SPARK-6330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
Priority: Blocker
 Fix For: 1.3.1, 1.4.0


 Here's a snippet from newParquet.scala:
 def refresh(): Unit = {
   val fs = FileSystem.get(sparkContext.hadoopConfiguration)
   // Support either reading a collection of raw Parquet part-files, or a 
 collection of folders
   // containing Parquet files (e.g. partitioned Parquet table).
   val baseStatuses = paths.distinct.map { p =>
 val qualified = fs.makeQualified(new Path(p))
 if (!fs.exists(qualified) && maybeSchema.isDefined) {
   fs.mkdirs(qualified)
   prepareMetadata(qualified, maybeSchema.get, 
 sparkContext.hadoopConfiguration)
 }
 fs.getFileStatus(qualified)
   }.toArray
 If we are running this locally and path points to S3, fs would be incorrect. 
 A fix is to construct fs for each file separately.
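
 A minimal sketch of that fix idea (illustrative only, not the actual patch; it assumes the same {{paths}} and Hadoop {{Configuration}} are in scope): resolve the FileSystem per path, so an s3n:// path gets an S3 FileSystem even when fs.defaultFS points at a different filesystem.
 {code}
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileStatus, Path}

 def baseStatuses(paths: Seq[String], conf: Configuration): Array[FileStatus] =
   paths.distinct.map { p =>
     val path = new Path(p)
     // per-path FileSystem instead of a single FileSystem.get(conf)
     val fs = path.getFileSystem(conf)
     fs.getFileStatus(fs.makeQualified(path))
   }.toArray
 {code}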



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6330) newParquetRelation gets incorrect FileSystem

2015-04-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394864#comment-14394864
 ] 

Yin Huai commented on SPARK-6330:
-

Please ignore my comment. 

 newParquetRelation gets incorrect FileSystem
 

 Key: SPARK-6330
 URL: https://issues.apache.org/jira/browse/SPARK-6330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
Priority: Blocker
 Fix For: 1.3.1, 1.4.0


 Here's a snippet from newParquet.scala:
 def refresh(): Unit = {
   val fs = FileSystem.get(sparkContext.hadoopConfiguration)
   // Support either reading a collection of raw Parquet part-files, or a 
 collection of folders
   // containing Parquet files (e.g. partitioned Parquet table).
   val baseStatuses = paths.distinct.map { p =>
 val qualified = fs.makeQualified(new Path(p))
 if (!fs.exists(qualified) && maybeSchema.isDefined) {
   fs.mkdirs(qualified)
   prepareMetadata(qualified, maybeSchema.get, 
 sparkContext.hadoopConfiguration)
 }
 fs.getFileStatus(qualified)
   }.toArray
 If we are running this locally and path points to S3, fs would be incorrect. 
 A fix is to construct fs for each file separately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394776#comment-14394776
 ] 

Joseph K. Bradley commented on SPARK-6682:
--

Note: We could keep 2 APIs for Scala/Java, but this is not a great solution for 
2 reasons:
* 2 APIs means more code to maintain, and they are confusing to users figuring 
out which API to use and whether the APIs are the same.
* The static train() methods are not workable for some algorithms with > 10 
parameters (because of Scala style constraints).

Also, once we add SparkR, we will not be able to keep uniform APIs everywhere 
since R has such different syntax.  We can make a best effort, but I feel we 
should tailor it to the particular language when it makes sense.
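
A toy sketch of the contrast (hypothetical {{MyAlgo}}, not an MLlib class): with a builder, each new parameter is one extra setter, whereas the static-train style needs a separate overload for every prefix of parameters to stay Java-friendly.
{code}
class MyAlgo {
  private var lambda = 1.0
  private var maxIter = 100
  def setLambda(value: Double): this.type = { lambda = value; this }
  def setMaxIter(value: Int): this.type = { maxIter = value; this }
  // placeholder "training" so the example is self-contained
  def run(data: Seq[Double]): Double = data.sum * lambda / maxIter
}

val model = new MyAlgo().setLambda(0.1).setMaxIter(50).run(Seq(1.0, 2.0, 3.0))
{code}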

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-03 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394843#comment-14394843
 ] 

Yu Ishikawa commented on SPARK-6682:


Hi [~josephkb],
Thank you for your proposal. That sounds good. However, I think the way users 
call the Python train() method should mirror the Scala/Java builder API.
It would be nice if there were a mechanism to keep the builder methods consistent 
between Scala/Java and Python automatically. If that is very difficult or 
impossible, though, I totally agree with your proposal.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6330) newParquetRelation gets incorrect FileSystem

2015-04-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reopened SPARK-6330:
-

I am reopening the issue since, for s3n, {{fs.makeQualified(qualifiedPath)}} 
does not work. It throws a very confusing error message:
{code}
java.lang.IllegalArgumentException: Wrong FS: s3n://ID:KEY@bucket/path, 
expected: s3n://ID:KEY@bucket.
{code}

When I pass a relative path, it is fine. It is also fine if I use 
{{qualifiedPath.makeQualified(fs)}}.

 newParquetRelation gets incorrect FileSystem
 

 Key: SPARK-6330
 URL: https://issues.apache.org/jira/browse/SPARK-6330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
 Fix For: 1.3.1, 1.4.0


 Here's a snippet from newParquet.scala:
 def refresh(): Unit = {
   val fs = FileSystem.get(sparkContext.hadoopConfiguration)
   // Support either reading a collection of raw Parquet part-files, or a 
 collection of folders
   // containing Parquet files (e.g. partitioned Parquet table).
   val baseStatuses = paths.distinct.map { p =>
 val qualified = fs.makeQualified(new Path(p))
 if (!fs.exists(qualified) && maybeSchema.isDefined) {
   fs.mkdirs(qualified)
   prepareMetadata(qualified, maybeSchema.get, 
 sparkContext.hadoopConfiguration)
 }
 fs.getFileStatus(qualified)
   }.toArray
 If we are running this locally and path points to S3, fs would be incorrect. 
 A fix is to construct fs for each file separately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies

2015-04-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6492.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5277
[https://github.com/apache/spark/pull/5277]

 SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
 ---

 Key: SPARK-6492
 URL: https://issues.apache.org/jira/browse/SPARK-6492
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.4.0
Reporter: Josh Rosen
Priority: Critical
 Fix For: 1.4.0


 A deadlock can occur when DAGScheduler death causes a SparkContext to be shut 
 down while user code is concurrently racing to stop the SparkContext in a 
 finally block.
 For example:
 {code}
 try {
   sc = new SparkContext("local", "test")
   // start running a job that causes the DAGSchedulerEventProcessor to crash
   someRDD.doStuff()
 } finally {
   // stop the SparkContext once the failure in DAGScheduler causes the above job to fail
   sc.stop()
 }
 {code}
 This leads to a deadlock.  The event processor thread tries to lock on the 
 {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because 
 the thread that holds that lock is waiting for the event processor thread to 
 join:
 {code}
 dag-scheduler-event-loop daemon prio=5 tid=0x7ffa69456000 nid=0x9403 
 waiting for monitor entry [0x0001223ad000]
java.lang.Thread.State: BLOCKED (on object monitor)
   at org.apache.spark.SparkContext.stop(SparkContext.scala:1398)
   - waiting to lock 0x0007f5037b08 (a java.lang.Object)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52)
 {code}
 {code}
 pool-1-thread-1-ScalaTest-running-SparkContextSuite prio=5 
 tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000]
java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x0007f4b28000 (a 
 org.apache.spark.util.EventLoop$$anon$1)
   at java.lang.Thread.join(Thread.java:1281)
   - locked 0x0007f4b28000 (a 
 org.apache.spark.util.EventLoop$$anon$1)
   at java.lang.Thread.join(Thread.java:1355)
   at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79)
   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352)
   at org.apache.spark.SparkContext.stop(SparkContext.scala:1405)
   - locked 0x0007f5037b08 (a java.lang.Object)
 [...]
 {code}
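
 The deadlock shape, boiled down to a self-contained sketch (illustrative, not Spark code): one thread holds a lock and joins a worker thread, while the worker needs that same lock to finish.
 {code}
 val lock = new Object
 val worker = new Thread(new Runnable {
   def run(): Unit = lock.synchronized { println("worker got the lock") }
 })
 lock.synchronized {
   worker.start()
   worker.join()  // never returns: the worker is blocked on `lock`, which this thread still holds
 }
 {code}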



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies

2015-04-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6492:
-
Assignee: Ilya Ganelin

 SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
 ---

 Key: SPARK-6492
 URL: https://issues.apache.org/jira/browse/SPARK-6492
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.4.0
Reporter: Josh Rosen
Assignee: Ilya Ganelin
Priority: Critical
 Fix For: 1.4.0


 A deadlock can occur when DAGScheduler death causes a SparkContext to be shut 
 down while user code is concurrently racing to stop the SparkContext in a 
 finally block.
 For example:
 {code}
 try {
   sc = new SparkContext("local", "test")
   // start running a job that causes the DAGSchedulerEventProcessor to crash
   someRDD.doStuff()
 } finally {
   // stop the SparkContext once the failure in DAGScheduler causes the above job to fail
   sc.stop()
 }
 {code}
 This leads to a deadlock.  The event processor thread tries to lock on the 
 {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because 
 the thread that holds that lock is waiting for the event processor thread to 
 join:
 {code}
 dag-scheduler-event-loop daemon prio=5 tid=0x7ffa69456000 nid=0x9403 
 waiting for monitor entry [0x0001223ad000]
java.lang.Thread.State: BLOCKED (on object monitor)
   at org.apache.spark.SparkContext.stop(SparkContext.scala:1398)
   - waiting to lock 0x0007f5037b08 (a java.lang.Object)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52)
 {code}
 {code}
 pool-1-thread-1-ScalaTest-running-SparkContextSuite prio=5 
 tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000]
java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on 0x0007f4b28000 (a 
 org.apache.spark.util.EventLoop$$anon$1)
   at java.lang.Thread.join(Thread.java:1281)
   - locked 0x0007f4b28000 (a 
 org.apache.spark.util.EventLoop$$anon$1)
   at java.lang.Thread.join(Thread.java:1355)
   at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79)
   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352)
   at org.apache.spark.SparkContext.stop(SparkContext.scala:1405)
   - locked 0x0007f5037b08 (a java.lang.Object)
 [...]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4205) Timestamp and Date objects with comparison operators

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4205:
---

Assignee: Apache Spark

 Timestamp and Date objects with comparison operators
 

 Key: SPARK-4205
 URL: https://issues.apache.org/jira/browse/SPARK-4205
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Marc Culler
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4205) Timestamp and Date objects with comparison operators

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4205:
---

Assignee: (was: Apache Spark)

 Timestamp and Date objects with comparison operators
 

 Key: SPARK-4205
 URL: https://issues.apache.org/jira/browse/SPARK-4205
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Marc Culler





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK

2015-04-03 Thread Michael Bieniosek (JIRA)
Michael Bieniosek created SPARK-6698:


 Summary: RandomForest.scala (et al) hardcodes usage of 
StorageLevel.MEMORY_AND_DISK
 Key: SPARK-6698
 URL: https://issues.apache.org/jira/browse/SPARK-6698
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Michael Bieniosek


In RandomForest.scala the feature input is persisted with 
StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging rate 
is set at 100%.  This forces the RDD to be stored unserialized, which causes 
major JVM GC headaches if the RDD is sizable.  

Something similar happens in NodeIdCache.scala though I believe in this case 
the RDD is smaller.

A simple fix would be to use the same StorageLevel as the input RDD. 
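
A minimal sketch of that fix idea (assumed helper name, not the actual RandomForest.scala change): reuse the input RDD's StorageLevel and fall back to MEMORY_AND_DISK only when the input was not persisted at all.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def persistLikeInput[T](derived: RDD[T], input: RDD[_]): RDD[T] = {
  // copy the StorageLevel from the input RDD instead of hardcoding one
  val level =
    if (input.getStorageLevel == StorageLevel.NONE) StorageLevel.MEMORY_AND_DISK
    else input.getStorageLevel
  derived.persist(level)
}
{code}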




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5203) union with different decimal type report error

2015-04-03 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-5203.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4004
[https://github.com/apache/spark/pull/4004]

 union with different decimal type report error
 --

 Key: SPARK-5203
 URL: https://issues.apache.org/jira/browse/SPARK-5203
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: guowei
 Fix For: 1.4.0


 Test case like this:
 {code:sql}
 create table test (a decimal(10,1));
 select a from test union all select a*2 from test;
 {code}
 Exception thown:
 {noformat}
 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union 
 all select a*2 from test]
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
 attributes: *, tree:
 'Project [*]
  'Subquery _u1
   'Union 
Project [a#1]
 MetastoreRelation default, test, None
Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), 
 DecimalType())), DecimalType(21,1)) AS _c0#0]
 MetastoreRelation default, test, None
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
   at 
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
   at 
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421)
   at 
 org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369)
   at 
 org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6330) newParquetRelation gets incorrect FileSystem

2015-04-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6330:

Priority: Blocker  (was: Major)

 newParquetRelation gets incorrect FileSystem
 

 Key: SPARK-6330
 URL: https://issues.apache.org/jira/browse/SPARK-6330
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
Priority: Blocker
 Fix For: 1.3.1, 1.4.0


 Here's a snippet from newParquet.scala:
 def refresh(): Unit = {
   val fs = FileSystem.get(sparkContext.hadoopConfiguration)
   // Support either reading a collection of raw Parquet part-files, or a 
 collection of folders
   // containing Parquet files (e.g. partitioned Parquet table).
   val baseStatuses = paths.distinct.map { p =>
 val qualified = fs.makeQualified(new Path(p))
 if (!fs.exists(qualified) && maybeSchema.isDefined) {
   fs.mkdirs(qualified)
   prepareMetadata(qualified, maybeSchema.get, 
 sparkContext.hadoopConfiguration)
 }
 fs.getFileStatus(qualified)
   }.toArray
 If we are running this locally and path points to S3, fs would be incorrect. 
 A fix is to construct fs for each file separately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4258) NPE with new Parquet Filters

2015-04-03 Thread Yash Datta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394670#comment-14394670
 ] 

Yash Datta commented on SPARK-4258:
---

[~yhuai] No, it does not. I fixed this in Parquet master and am waiting for Parquet 
to release the next version. The current version is 1.6.0rc3 (the one being used in Spark).

 NPE with new Parquet Filters
 

 Key: SPARK-4258
 URL: https://issues.apache.org/jira/browse/SPARK-4258
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.2.0


 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in 
 stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): 
 java.lang.NullPointerException: 
 parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206)
 parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$Or.accept(Operators.java:302)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$And.accept(Operators.java:290)
 
 parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52)
 parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46)
 parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
 {code}
 This occurs when reading Parquet data encoded with the older version of the 
 library for TPC-DS query 34. I will work on coming up with a smaller 
 reproduction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6640) Executor may connect to HeartbeartReceiver before it's setup in the driver side

2015-04-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-6640.

  Resolution: Fixed
   Fix Version/s: 1.4.0
Target Version/s: 1.4.0

 Executor may connect to HeartbeartReceiver before it's setup in the driver 
 side
 ---

 Key: SPARK-6640
 URL: https://issues.apache.org/jira/browse/SPARK-6640
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
 Fix For: 1.4.0


 Here is the current code about starting LocalBackend and creating 
 HeartbeatReceiver:
 {code}
   // Create and start the scheduler
   private[spark] var (schedulerBackend, taskScheduler) =
 SparkContext.createTaskScheduler(this, master)
   private val heartbeatReceiver = env.actorSystem.actorOf(
     Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
 {code}
 When creating LocalBackend, it will start `LocalActor`. `LocalActor` will 
 create Executor, and Executor's constructor will retrieve `HeartbeatReceiver`.
 So we should make sure this line:
 {code}
 private val heartbeatReceiver = env.actorSystem.actorOf(
   Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver")
 {code}
 happens before creating LocalActor.
 However, the current code cannot guarantee that, so creating the Executor 
 sometimes crashes. The issue was reported by sparkdi shopaddr1...@dubna.us in 
 http://apache-spark-user-list.1001560.n3.nabble.com/Actor-not-found-td22265.html#a22324
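
 The race, reduced to a self-contained sketch (illustrative, not Spark code): the "executor" side looks up a receiver that the "driver" side registers a moment too late.
 {code}
 import java.util.concurrent.ConcurrentHashMap

 val registry = new ConcurrentHashMap[String, AnyRef]()
 val executorStartup = new Thread(new Runnable {
   def run(): Unit = {
     // the lookup the Executor does in its constructor, modeled as a map lookup
     if (registry.get("HeartbeatReceiver") == null) println("Actor not found")
   }
 })
 executorStartup.start()                        // may run before the registration below
 registry.put("HeartbeatReceiver", new Object)  // registering after start() loses the race
 executorStartup.join()
 {code}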



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK

2015-04-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6698:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

(Open a PR; changes aren't managed by patches here.)

 RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
 --

 Key: SPARK-6698
 URL: https://issues.apache.org/jira/browse/SPARK-6698
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Michael Bieniosek
Priority: Minor
 Attachments: SPARK-6698.patch


 In RandomForest.scala the feature input is persisted with 
 StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging 
 rate is set at 100%.  This forces the RDD to be stored unserialized, which 
 causes major JVM GC headaches if the RDD is sizable.  
 Something similar happens in NodeIdCache.scala though I believe in this case 
 the RDD is smaller.
 A simple fix would be to use the same StorageLevel as the input RDD. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6698:
---

Assignee: (was: Apache Spark)

 RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
 --

 Key: SPARK-6698
 URL: https://issues.apache.org/jira/browse/SPARK-6698
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Michael Bieniosek
Priority: Minor
 Attachments: SPARK-6698.patch


 In RandomForest.scala the feature input is persisted with 
 StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging 
 rate is set at 100%.  This forces the RDD to be stored unserialized, which 
 causes major JVM GC headaches if the RDD is sizable.  
 Something similar happens in NodeIdCache.scala though I believe in this case 
 the RDD is smaller.
 A simple fix would be to use the same StorageLevel as the input RDD. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6698:
---

Assignee: Apache Spark

 RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
 --

 Key: SPARK-6698
 URL: https://issues.apache.org/jira/browse/SPARK-6698
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Michael Bieniosek
Assignee: Apache Spark
Priority: Minor
 Attachments: SPARK-6698.patch


 In RandomForest.scala the feature input is persisted with 
 StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging 
 rate is set at 100%.  This forces the RDD to be stored unserialized, which 
 causes major JVM GC headaches if the RDD is sizable.  
 Something similar happens in NodeIdCache.scala though I believe in this case 
 the RDD is smaller.
 A simple fix would be to use the same StorageLevel as the input RDD. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394681#comment-14394681
 ] 

Apache Spark commented on SPARK-6698:
-

User 'bien' has created a pull request for this issue:
https://github.com/apache/spark/pull/5351

 RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
 --

 Key: SPARK-6698
 URL: https://issues.apache.org/jira/browse/SPARK-6698
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Michael Bieniosek
Priority: Minor
 Attachments: SPARK-6698.patch


 In RandomForest.scala the feature input is persisted with 
 StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging 
 rate is set at 100%.  This forces the RDD to be stored unserialized, which 
 causes major JVM GC headaches if the RDD is sizable.  
 Something similar happens in NodeIdCache.scala though I believe in this case 
 the RDD is smaller.
 A simple fix would be to use the same StorageLevel as the input RDD. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6688) EventLoggingListener should always operate on resolved URIs

2015-04-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-6688.

   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1
 Assignee: Marcelo Vanzin

 EventLoggingListener should always operate on resolved URIs
 ---

 Key: SPARK-6688
 URL: https://issues.apache.org/jira/browse/SPARK-6688
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.3.1, 1.4.0


 A small bug was introduced in 1.3.0, where a check in 
 EventLoggingListener.scala is performed on the non-resolved log path. This 
 means that if fs.defaultFS is not the local filesystem, and the user is 
 trying to store logs in the local filesystem by providing a path with no 
 file: protocol, things will fail.
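
 A minimal sketch of the "resolve once, then check the resolved form" idea (assumed helper, not the actual Spark code): a scheme-less path is treated as a local file: URI before any filesystem checks happen.
 {code}
 import java.io.File
 import java.net.URI

 def resolveLogDir(raw: String): URI = {
   val uri = new URI(raw)
   if (uri.getScheme != null) uri
   else new File(raw).getAbsoluteFile.toURI  // "/tmp/spark-events" -> "file:/tmp/spark-events"
 }
 {code}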



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter

2015-04-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6647.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5309
[https://github.com/apache/spark/pull/5309]

 Make trait StringComparison as BinaryPredicate and throw error when Predicate 
 can't translate to data source Filter
 ---

 Key: SPARK-6647
 URL: https://issues.apache.org/jira/browse/SPARK-6647
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
 Fix For: 1.4.0


 Now trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should 
 be a {{BinaryPredicate}}.
 By making {{StringComparison}} a {{BinaryPredicate}}, we can throw an error 
 when an {{expressions.Predicate}} can't be translated to a data source {{Filter}} 
 in the {{selectFilters}} function.
 Without this modification, because we wrap a {{Filter}} around the scanned 
 results in {{pruneFilterProjectRaw}}, we can't detect that something is wrong 
 when translating predicates to filters in {{selectFilters}}.
 The unit test of SPARK-6625 demonstrates this problem: in that PR, even though 
 {{expressions.Contains}} is not properly translated to 
 {{sources.StringContains}}, the filtering is still performed by the 
 {{Filter}}, so the test passes.
 Of course, with this modification, every {{expressions.Predicate}} class 
 needs a corresponding data source {{Filter}}.
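
 A simplified sketch of that {{selectFilters}} behaviour (toy case classes, not Catalyst's real ones): every predicate must map to a data source filter, and anything untranslatable fails loudly instead of being silently handled by a post-scan {{Filter}}.
 {code}
 sealed trait Pred
 case class Contains(attr: String, value: String) extends Pred
 case class EqualTo(attr: String, value: String) extends Pred
 case class Not(child: Pred) extends Pred  // has no source counterpart in this toy

 sealed trait SourceFilter
 case class StringContains(attr: String, value: String) extends SourceFilter
 case class SourceEqualTo(attr: String, value: String) extends SourceFilter

 def selectFilters(preds: Seq[Pred]): Seq[SourceFilter] = preds.map {
   case Contains(a, v) => StringContains(a, v)
   case EqualTo(a, v)  => SourceEqualTo(a, v)
   case other          => sys.error(s"Cannot translate predicate to a data source filter: $other")
 }

 // selectFilters(Seq(Contains("name", "a"), Not(EqualTo("age", "1")))) fails fast on Not(...)
 {code}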



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6683) Handling feature scaling properly for GLMs

2015-04-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6683:
-
Description: 
GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
* improves optimization behavior (essentially always improves behavior in 
practice)
* changes the optimal solution (often for the better in terms of standardizing 
feature importance)

Current problems:
* Inefficient implementation: We make a rescaled copy of the data.
* Surprising API: For algorithms which use feature scaling, users may get 
different solutions than with R or other libraries.  (Note: Feature scaling 
could be handled without changing the solution.)
* Inconsistent API: Not all algorithms have the same default for feature 
scaling, and not all expose the option.

This is a proposal discussed with [~mengxr] for an ideal solution.  This 
solution will require some breaking API changes, but I'd argue these are 
necessary for the long-term since it's the best API we have thought of.

Proposal:
* Implementation: Change to avoid making a rescaled copy of the data (described 
below).  No API issues here.
* API:
** Hide featureScaling from API. (breaking change)
** Internally, handle feature scaling to improve optimization, but modify it so 
it does not change the optimal solution. (breaking change, in terms of 
algorithm behavior)
** Externally, users who want to rescale features (to change the solution) 
should do that scaling as a preprocessing step.

Details on implementation:
* GradientDescent could instead scale the step size separately for each feature 
(and adjust regularization as needed; see the PR linked above).  This would 
require storing a vector of length numFeatures, rather than making a full copy 
of the data.
* I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in 
here.


  was:
GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
* improves optimization behavior (essentially always improves behavior in 
practice)
* changes the optimal solution (often for the better in terms of standardizing 
feature importance)

Current problems:
* Inefficient implementation: We make a rescaled copy of the data.
* Surprising API: For algorithms which use feature scaling, users may get 
different solutions than with R or other libraries.  (Note: Feature scaling 
could be handled without changing the solution.)
* Inconsistent API: Not all algorithms have the same default for feature 
scaling, and not all expose the option.

This is a proposal discussed with [~mengxr] for an ideal solution.  This 
solution will require some breaking API changes, but I'll argue these are 
necessary for the long-term.

Proposal:
* Implementation: Change to avoid making a rescaled copy of the data (described 
below).  No API issues here.
* API:
** Hide featureScaling from API. (breaking change)
** Internally, handle feature scaling to improve optimization, but modify it so 
it does not change the optimal solution. (breaking change, in terms of 
algorithm behavior)
** Externally, users who want to rescale feature (to change the solution) 
should do that scaling as a preprocessing step.

Details on implementation:
* GradientDescent could instead scale the step size separately for each feature 
(and adjust regularization as needed; see the PR linked above).  This would 
require storing a vector of length numFeatures, rather than making a full copy 
of the data.
* I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in 
here.



 Handling feature scaling properly for GLMs
 --

 Key: SPARK-6683
 URL: https://issues.apache.org/jira/browse/SPARK-6683
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
 * improves optimization behavior (essentially always improves behavior in 
 practice)
 * changes the optimal solution (often for the better in terms of 
 standardizing feature importance)
 Current problems:
 * Inefficient implementation: We make a rescaled copy of the data.
 * Surprising API: For algorithms which use feature scaling, users may get 
 different solutions than with R or other libraries.  (Note: Feature scaling 
 could be handled without changing the solution.)
 * Inconsistent API: Not all algorithms have the same default for feature 
 scaling, and not all expose the option.
 This is a proposal discussed with [~mengxr] for an ideal solution.  This 
 solution will require some breaking API changes, but I'd argue these are 
 necessary for the long-term since it's the best API we have thought of.
 Proposal:
 * Implementation: Change to avoid making a rescaled copy of the data 
 (described below).  No API issues here.
 * API:
 ** 

[jira] [Commented] (SPARK-6683) Handling feature scaling properly for GLMs

2015-04-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395122#comment-14395122
 ] 

Joseph K. Bradley commented on SPARK-6683:
--

Great, it sounds like we're in agreement about the API and algorithm behavior.  
W.r.t. implementation, I haven't thought through it too carefully.  I would 
have thought squared error would be the easiest loss to handle since (I 
believe) it would reduce to scaling stepSize for each feature (applied to the 
loss gradient, not the regularization gradient).  I'm not sure about the 
others...
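
For the squared-error case, a toy sketch of that idea (not MLlib's GradientDescent): keep one vector of per-feature standard deviations and divide each feature's step by its variance, which matches running plain gradient descent on the rescaled data without ever copying it.
{code}
def step(weights: Array[Double],
         lossGradient: Array[Double],
         featureStd: Array[Double],
         stepSize: Double): Array[Double] = {
  weights.indices.map { j =>
    val variance = if (featureStd(j) != 0.0) featureStd(j) * featureStd(j) else 1.0
    // scale the loss gradient only; regularization would need separate handling
    weights(j) - stepSize * lossGradient(j) / variance
  }.toArray
}
{code}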

 Handling feature scaling properly for GLMs
 --

 Key: SPARK-6683
 URL: https://issues.apache.org/jira/browse/SPARK-6683
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
 * improves optimization behavior (essentially always improves behavior in 
 practice)
 * changes the optimal solution (often for the better in terms of 
 standardizing feature importance)
 Current problems:
 * Inefficient implementation: We make a rescaled copy of the data.
 * Surprising API: For algorithms which use feature scaling, users may get 
 different solutions than with R or other libraries.  (Note: Feature scaling 
 could be handled without changing the solution.)
 * Inconsistent API: Not all algorithms have the same default for feature 
 scaling, and not all expose the option.
 This is a proposal discussed with [~mengxr] for an ideal solution.  This 
 solution will require some breaking API changes, but I'd argue these are 
 necessary for the long-term since it's the best API we have thought of.
 Proposal:
 * Implementation: Change to avoid making a rescaled copy of the data 
 (described below).  No API issues here.
 * API:
 ** Hide featureScaling from API. (breaking change)
 ** Internally, handle feature scaling to improve optimization, but modify it 
 so it does not change the optimal solution. (breaking change, in terms of 
 algorithm behavior)
 ** Externally, users who want to rescale features (to change the solution) 
 should do that scaling as a preprocessing step.
 Details on implementation:
 * GradientDescent could instead scale the step size separately for each 
 feature (and adjust regularization as needed; see the PR linked above).  This 
 would require storing a vector of length numFeatures, rather than making a 
 full copy of the data.
 * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in 
 here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6700:
---

Assignee: Lianhui Wang  (was: Apache Spark)

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395127#comment-14395127
 ] 

Apache Spark commented on SPARK-6700:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/5356

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44)
   at 
 

[jira] [Assigned] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6700:
---

Assignee: Apache Spark  (was: Lianhui Wang)

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Apache Spark
Priority: Critical

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Updated] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-6700:
--
Labels: test yarn  (was: )

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
  Labels: test, yarn

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394998#comment-14394998
 ] 

Joseph K. Bradley commented on SPARK-6682:
--

I don't know of an automatic mechanism.  It might be possible to do code 
generation, but that's a bit hacky and might be more trouble than it is worth.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]
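
 As a rough illustration of the Scala/Java proposal above, here is a minimal, hypothetical sketch (toy class and method names, not the real MLlib API) of how a deprecated static train() could simply delegate to the builder so both entry points share one code path:
 {code}
 // Hedged sketch with a toy estimator (illustrative names only).
 class MyNaiveBayes {
   private var lambda: Double = 1.0                      // default smoothing
   def setLambda(value: Double): this.type = { lambda = value; this }
   def run(data: Seq[(Double, Array[Double])]): String =
     s"model(lambda=$lambda, n=${data.size})"
 }

 object MyNaiveBayes {
   // Old static entry point, kept for compatibility but deprecated;
   // it just delegates to the builder, so behavior stays identical.
   @deprecated("Use new MyNaiveBayes().setLambda(...).run(...) instead", "x.y.z")
   def train(data: Seq[(Double, Array[Double])], lambda: Double): String =
     new MyNaiveBayes().setLambda(lambda).run(data)
 }
 {code}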



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6577) SparseMatrix should be supported in PySpark

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6577:
---

Assignee: Apache Spark

 SparseMatrix should be supported in PySpark
 ---

 Key: SPARK-6577
 URL: https://issues.apache.org/jira/browse/SPARK-6577
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6577) SparseMatrix should be supported in PySpark

2015-04-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6577:
---

Assignee: (was: Apache Spark)

 SparseMatrix should be supported in PySpark
 ---

 Key: SPARK-6577
 URL: https://issues.apache.org/jira/browse/SPARK-6577
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Manoj Kumar





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark

2015-04-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395099#comment-14395099
 ] 

Apache Spark commented on SPARK-6577:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/5355

 SparseMatrix should be supported in PySpark
 ---

 Key: SPARK-6577
 URL: https://issues.apache.org/jira/browse/SPARK-6577
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Manoj Kumar





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-03 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-6700.

Resolution: Fixed

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
  Labels: test, yarn

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Commented] (SPARK-6683) Handling feature scaling properly for GLMs

2015-04-03 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395088#comment-14395088
 ] 

DB Tsai commented on SPARK-6683:


I have this implemented in our lab, including handling the intercept
without adding a bias column to the training dataset, which improves
performance a lot without doing extra caching.

In logistic regression, the objective function is a sum of log
probabilities and is invariant under this transformation, so instead of
rescaling x we can get the same result by rescaling the gradient. As a
result, this can be done right before optimization.

However, in linear regression the objective value does change under the
transformation as well, so it has to be handled differently.

As a result, it will be challenging to come up with one framework that
works for all the different types of generalized linear models.

I would like to have them implemented separately in the new SparkML
codebase instead of sharing the same GLM base class. What do you
think?
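
To make the gradient-rescaling idea concrete, here is a minimal, self-contained sketch (plain Scala, hypothetical helper name, not actual MLlib code) of applying the per-feature scaling to the gradient right before the optimizer step instead of rescaling the data:
{code}
// Hypothetical sketch: rather than standardizing every row of x, apply
// the per-feature scaling to the gradient just before the update.
// featureStd(j) is the standard deviation of feature j, computed once.
def rescaleGradient(gradient: Array[Double],
                    featureStd: Array[Double]): Array[Double] =
  gradient.indices.map { j =>
    if (featureStd(j) != 0.0) gradient(j) / featureStd(j) else 0.0
  }.toArray
{code}
This is the transformation described in the comment above; it avoids materializing a rescaled copy of the dataset.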


 Handling feature scaling properly for GLMs
 --

 Key: SPARK-6683
 URL: https://issues.apache.org/jira/browse/SPARK-6683
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
 * improves optimization behavior (essentially always improves behavior in 
 practice)
 * changes the optimal solution (often for the better in terms of 
 standardizing feature importance)
 Current problems:
 * Inefficient implementation: We make a rescaled copy of the data.
 * Surprising API: For algorithms which use feature scaling, users may get 
 different solutions than with R or other libraries.  (Note: Feature scaling 
 could be handled without changing the solution.)
 * Inconsistent API: Not all algorithms have the same default for feature 
 scaling, and not all expose the option.
 This is a proposal discussed with [~mengxr] for an ideal solution.  This 
 solution will require some breaking API changes, but I'd argue these are 
 necessary for the long-term since it's the best API we have thought of.
 Proposal:
 * Implementation: Change to avoid making a rescaled copy of the data 
 (described below).  No API issues here.
 * API:
 ** Hide featureScaling from API. (breaking change)
 ** Internally, handle feature scaling to improve optimization, but modify it 
 so it does not change the optimal solution. (breaking change, in terms of 
 algorithm behavior)
 ** Externally, users who want to rescale features (to change the solution) 
 should do that scaling as a preprocessing step.
 Details on implementation:
 * GradientDescent could instead scale the step size separately for each 
 feature (and adjust regularization as needed; see the PR linked above).  This 
 would require storing a vector of length numFeatures, rather than making a 
 full copy of the data.
 * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in 
 here.
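
 Regarding the GradientDescent bullet above, a hedged sketch of what a per-feature step size could look like (plain Scala, illustrative names only; for the unregularized case, shrinking the step along coordinate j by 1/sigma_j^2 mimics gradient descent on standardized features without copying the data):
 {code}
 // Illustrative update with a per-feature step size (not MLlib code).
 // featureStd(j) is the standard deviation of feature j.
 def step(weights: Array[Double], gradient: Array[Double],
          stepSize: Double, featureStd: Array[Double]): Array[Double] =
   weights.indices.map { j =>
     val sigma = featureStd(j)
     val scale = if (sigma != 0.0) 1.0 / (sigma * sigma) else 0.0
     weights(j) - stepSize * scale * gradient(j)
   }.toArray
 {code}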



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-03 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-6703:
--

 Summary: Provide a way to discover existing SparkContext's
 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell


Right now it is difficult to write a Spark application in a way that can be run 
independently and also be composed with other Spark applications in an 
environment such as the JobServer, notebook servers, etc where there is a 
shared SparkContext.

It would be nice to have a way to write an application where you can get or 
create a SparkContext and have some standard type of synchronization point 
application authors can access. The simplest and most surgical way I see to do 
this is to have an optional static SparkContext singleton that people can 
retrieve as follows:

{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}

And you could also have a setter where some outer framework/server can set it 
for use by multiple downstream applications.

A more advanced version of this would have some named registry or something, 
but since we only support a single SparkContext in one JVM at this point 
anyway, this seems sufficient and much simpler. Another advanced option would 
be to allow plugging in some other notion of configuration you'd pass when 
retrieving an existing context.
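
A minimal sketch of what such a getOrCreate/setter singleton could look like, using a stand-in Context class and hypothetical names rather than the real SparkContext:

{code}
object ContextRegistry {
  final class Context(val conf: Map[String, String])

  @volatile private var active: Option[Context] = None

  // An outer framework (JobServer, notebook server, ...) can install a
  // shared context for downstream applications to pick up.
  def setActive(ctx: Context): Unit = synchronized { active = Some(ctx) }

  // Applications either reuse the shared context or create one lazily.
  def getOrCreate(conf: => Map[String, String]): Context = synchronized {
    active.getOrElse {
      val ctx = new Context(conf)
      active = Some(ctx)
      ctx
    }
  }
}
{code}

An application would then call ContextRegistry.getOrCreate(...) regardless of whether it runs standalone or inside a shared server.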




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-04-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5992:
-
Shepherd: Xiangrui Meng

 Locality Sensitive Hashing (LSH) for MLlib
 --

 Key: SPARK-5992
 URL: https://issues.apache.org/jira/browse/SPARK-5992
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
 great to discuss some possible algorithms here, choose an API, and make a PR 
 for an initial algorithm.
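
 As one concrete candidate to seed the discussion, a hedged sketch of random-hyperplane (sign) LSH for cosine similarity, in plain Scala with illustrative names (not an API proposal):
 {code}
 import scala.util.Random

 // Sign LSH: each hash bit is the sign of a dot product with a random
 // Gaussian hyperplane; vectors with high cosine similarity collide on
 // many bits with high probability.
 class SignRandomProjectionLSH(numBits: Int, dim: Int, seed: Long = 42L) {
   private val rng = new Random(seed)
   private val planes: Array[Array[Double]] =
     Array.fill(numBits)(Array.fill(dim)(rng.nextGaussian()))

   def hash(v: Array[Double]): Seq[Int] =
     planes.map { p =>
       val dot = p.indices.map(i => p(i) * v(i)).sum
       if (dot >= 0.0) 1 else 0
     }.toSeq
 }
 {code}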



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6701) Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application

2015-04-03 Thread Andrew Or (JIRA)
Andrew Or created SPARK-6701:


 Summary: Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python 
application
 Key: SPARK-6701
 URL: https://issues.apache.org/jira/browse/SPARK-6701
 Project: Spark
  Issue Type: Bug
  Components: Tests, YARN
Affects Versions: 1.3.0
Reporter: Andrew Or
Priority: Critical


Observed in Master and 1.3, both in SBT and in Maven (with YARN).

{code}
Error Message

Process 
List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
 --master, yarn-cluster, --num-executors, 1, --properties-file, 
/tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties,
 --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, 
/tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, 
/tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) 
exited with code 1

sbt.ForkMain$ForkError: Process 
List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
 --master, yarn-cluster, --num-executors, 1, --properties-file, 
/tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties,
 --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, 
/tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, 
/tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) 
exited with code 1
at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
at 
org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
at 
org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
at 
org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
at 
org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6683) Handling feature scaling properly for GLMs

2015-04-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395145#comment-14395145
 ] 

Joseph K. Bradley commented on SPARK-6683:
--

If you're referring to what I was saying about needing to rescale both step 
size and regularization for least squares, I agree.

 Handling feature scaling properly for GLMs
 --

 Key: SPARK-6683
 URL: https://issues.apache.org/jira/browse/SPARK-6683
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
 * improves optimization behavior (essentially always improves behavior in 
 practice)
 * changes the optimal solution (often for the better in terms of 
 standardizing feature importance)
 Current problems:
 * Inefficient implementation: We make a rescaled copy of the data.
 * Surprising API: For algorithms which use feature scaling, users may get 
 different solutions than with R or other libraries.  (Note: Feature scaling 
 could be handled without changing the solution.)
 * Inconsistent API: Not all algorithms have the same default for feature 
 scaling, and not all expose the option.
 This is a proposal discussed with [~mengxr] for an ideal solution.  This 
 solution will require some breaking API changes, but I'd argue these are 
 necessary for the long-term since it's the best API we have thought of.
 Proposal:
 * Implementation: Change to avoid making a rescaled copy of the data 
 (described below).  No API issues here.
 * API:
 ** Hide featureScaling from API. (breaking change)
 ** Internally, handle feature scaling to improve optimization, but modify it 
 so it does not change the optimal solution. (breaking change, in terms of 
 algorithm behavior)
 ** Externally, users who want to rescale features (to change the solution) 
 should do that scaling as a preprocessing step.
 Details on implementation:
 * GradientDescent could instead scale the step size separately for each 
 feature (and adjust regularization as needed; see the PR linked above).  This 
 would require storing a vector of length numFeatures, rather than making a 
 full copy of the data.
 * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in 
 here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows

2015-04-03 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395001#comment-14395001
 ] 

Alexander Ulanov commented on SPARK-6673:
-

Probably a similar issue: I am trying to run the MLlib unit tests with 
LocalClusterSparkContext on Windows 7. I get a bunch of errors in the log 
saying: Cannot find any assembly build directories. If I set 
SPARK_SCALA_VERSION=2.10, I instead get: No assemblies found in 
'C:\dev\spark\mllib\.\assembly\target\scala-2.10'

 spark-shell.cmd can't start even when spark was built in Windows
 

 Key: SPARK-6673
 URL: https://issues.apache.org/jira/browse/SPARK-6673
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.3.0
Reporter: Masayoshi TSUZUKI
Assignee: Masayoshi TSUZUKI
Priority: Blocker

 spark-shell.cmd can't start.
 {code}
 bin\spark-shell.cmd --master local
 {code}
 will get
 {code}
 Failed to find Spark assembly JAR.
 You need to build Spark before running this program.
 {code}
 even when we have built spark.
 This is because of the lack of the environment {{SPARK_SCALA_VERSION}} which 
 is used in {{spark-class2.cmd}}.
 In Linux scripts, this value is set to {{2.10}} or {{2.11}} by default in 
 {{load-spark-env.sh}}, but there is no equivalent script on Windows.
 As a workaround, by executing
 {code}
 set SPARK_SCALA_VERSION=2.10
 {code}
 before running spark-shell.cmd, we can start it successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


