[jira] [Updated] (SPARK-6636) Use public DNS hostname everywhere in spark_ec2.py

2015-04-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6636:
--
Assignee: Matt Aasted

> Use public DNS hostname everywhere in spark_ec2.py
> --
>
> Key: SPARK-6636
> URL: https://issues.apache.org/jira/browse/SPARK-6636
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Matt Aasted
>Assignee: Matt Aasted
>Priority: Minor
> Fix For: 1.3.2, 1.4.0
>
>
> The spark_ec2.py script uses public_dns_name everywhere except when testing SSH 
> availability, which is done using the public IP address of the instances. This 
> breaks the script for users who deploy the cluster with a private-network-only 
> security group. The fix is to use public_dns_name in the remaining place.
> I am submitting a pull request alongside this bug report.






[jira] [Resolved] (SPARK-6636) Use public DNS hostname everywhere in spark_ec2.py

2015-04-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6636.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.2

Issue resolved by pull request 5302
[https://github.com/apache/spark/pull/5302]

> Use public DNS hostname everywhere in spark_ec2.py
> --
>
> Key: SPARK-6636
> URL: https://issues.apache.org/jira/browse/SPARK-6636
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Matt Aasted
>Priority: Minor
> Fix For: 1.3.2, 1.4.0
>
>
> The spark_ec2.py script uses public_dns_name everywhere except when testing SSH 
> availability, which is done using the public IP address of the instances. This 
> breaks the script for users who deploy the cluster with a private-network-only 
> security group. The fix is to use public_dns_name in the remaining place.
> I am submitting a pull request alongside this bug report.






[jira] [Resolved] (SPARK-6716) Change SparkContext.DRIVER_IDENTIFIER from '<driver>' to 'driver'

2015-04-06 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6716.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5372
[https://github.com/apache/spark/pull/5372]

> Change SparkContext.DRIVER_IDENTIFIER from '<driver>' to 'driver'
> -
>
> Key: SPARK-6716
> URL: https://issues.apache.org/jira/browse/SPARK-6716
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.4.0
>
>
> Currently, the driver's executorId is set to {{<driver>}}. This choice of ID was 
> present in older Spark versions, but it has started to cause problems now that 
> executorIds are used in more contexts, such as Ganglia metric names or driver 
> thread-dump links in the web UI. The angle brackets must be escaped when embedding 
> this ID in XML or as part of URLs, and this has led to multiple problems:
> - https://issues.apache.org/jira/browse/SPARK-6484
> - https://issues.apache.org/jira/browse/SPARK-4313
> The simplest solution seems to be to change this id to something that does 
> not contain any special characters, such as {{driver}}. 
> I'm not sure whether we can perform this change in a patch release, since 
> this ID may be considered a stable API by metrics users, but it's probably 
> okay to do this in a major release as long as we document it in the release 
> notes.
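
To make the escaping problem concrete, here is a small illustrative Scala sketch (not the actual Spark source; the object and helper names are made up) showing why an executor ID containing angle brackets must be escaped when embedded in a URL, while a plain {{driver}} string can be embedded verbatim:

{code:scala}
import java.net.URLEncoder

// Illustrative only: contrast the old angle-bracket ID with the proposed plain one.
object DriverIdExample {
  val oldId = "<driver>" // needs escaping in URLs (%3Cdriver%3E) and XML (&lt;driver&gt;)
  val newId = "driver"   // safe to embed as-is

  // Hypothetical helper that builds a thread-dump link for a given executor ID.
  def threadDumpLink(executorId: String): String =
    s"/executors/threadDump/?executorId=${URLEncoder.encode(executorId, "UTF-8")}"
}

// DriverIdExample.threadDumpLink(DriverIdExample.oldId) == "/executors/threadDump/?executorId=%3Cdriver%3E"
// DriverIdExample.threadDumpLink(DriverIdExample.newId) == "/executors/threadDump/?executorId=driver"
{code}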






[jira] [Commented] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482685#comment-14482685
 ] 

Apache Spark commented on SPARK-6691:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5385

> Abstract and add a dynamic RateLimiter for Spark Streaming
> --
>
> Key: SPARK-6691
> URL: https://issues.apache.org/jira/browse/SPARK-6691
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Saisai Shao
>
> Flow control (or rate control) for input data is very important in a streaming 
> system, especially for Spark Streaming, to keep it stable and up to date. An 
> unexpected flood of incoming data, or an ingestion rate beyond the computation 
> power of the cluster, will make the system unstable and increase the delay. With 
> Spark Streaming's job generation and processing pattern, this delay accumulates 
> and eventually leads to unacceptable failures.
>
> Currently, in Spark Streaming's receiver-based input stream, there is a 
> RateLimiter in BlockGenerator which controls the ingestion rate of input data, 
> but the current implementation has several limitations:
> # The maximum ingestion rate is set by the user through configuration beforehand; 
> users may lack the experience to choose an appropriate value before the 
> application is running.
> # The configuration is fixed for the lifetime of the application, which means it 
> has to be chosen for the worst-case scenario.
> # Input streams like DirectKafkaInputStream need to maintain a separate solution 
> to achieve the same functionality.
> # The lack of slow-start control easily traps the whole system in large 
> processing and scheduling delays at the very beginning.
>
> So here we propose a dynamic RateLimiter, along with a new RateLimiter interface, 
> to improve the whole system's stability. The goals are:
> * Dynamically adjust the ingestion rate according to the processing rate of 
> previously finished jobs.
> * Offer a uniform solution not only for receiver-based input streams but also for 
> direct streams like DirectKafkaInputStream and new ones.
> * Slow-start the rate to control network congestion when a job is started.
> * Provide a pluggable framework so that extensions are easier to maintain.
> 
> Here is the design doc 
> (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing)
>  and working branch 
> (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).
> Any comment would be greatly appreciated.
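
To illustrate the proposal, here is a minimal, hypothetical Scala sketch of a pluggable rate limiter that adjusts its rate from the processing rate of previously finished batches. The trait, class, and smoothing rule are illustrative only and are not taken from the design doc:

{code:scala}
// Hypothetical interface: the streaming engine asks for the current allowed rate and
// reports back how each finished batch went.
trait StreamRateLimiter {
  def currentRate: Double                                               // allowed records/second
  def updateRate(processedRecords: Long, processingTimeMs: Long): Unit  // called per finished batch
}

class DynamicRateLimiter(initialRate: Double, maxRate: Double) extends StreamRateLimiter {
  @volatile private var rate = initialRate   // start low, i.e. slow start

  override def currentRate: Double = rate

  // Move the ingestion rate toward the observed processing rate of the last batch,
  // smoothed so a single slow or fast batch does not cause wild swings.
  override def updateRate(processedRecords: Long, processingTimeMs: Long): Unit = {
    if (processingTimeMs > 0) {
      val processingRate = processedRecords * 1000.0 / processingTimeMs
      rate = math.min(maxRate, 0.8 * rate + 0.2 * processingRate)
    }
  }
}
{code}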






[jira] [Assigned] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6691:
---

Assignee: (was: Apache Spark)

> Abstract and add a dynamic RateLimiter for Spark Streaming
> --
>
> Key: SPARK-6691
> URL: https://issues.apache.org/jira/browse/SPARK-6691
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Saisai Shao
>
> Flow control (or rate control) for input data is very important in a streaming 
> system, especially for Spark Streaming, to keep it stable and up to date. An 
> unexpected flood of incoming data, or an ingestion rate beyond the computation 
> power of the cluster, will make the system unstable and increase the delay. With 
> Spark Streaming's job generation and processing pattern, this delay accumulates 
> and eventually leads to unacceptable failures.
>
> Currently, in Spark Streaming's receiver-based input stream, there is a 
> RateLimiter in BlockGenerator which controls the ingestion rate of input data, 
> but the current implementation has several limitations:
> # The maximum ingestion rate is set by the user through configuration beforehand; 
> users may lack the experience to choose an appropriate value before the 
> application is running.
> # The configuration is fixed for the lifetime of the application, which means it 
> has to be chosen for the worst-case scenario.
> # Input streams like DirectKafkaInputStream need to maintain a separate solution 
> to achieve the same functionality.
> # The lack of slow-start control easily traps the whole system in large 
> processing and scheduling delays at the very beginning.
>
> So here we propose a dynamic RateLimiter, along with a new RateLimiter interface, 
> to improve the whole system's stability. The goals are:
> * Dynamically adjust the ingestion rate according to the processing rate of 
> previously finished jobs.
> * Offer a uniform solution not only for receiver-based input streams but also for 
> direct streams like DirectKafkaInputStream and new ones.
> * Slow-start the rate to control network congestion when a job is started.
> * Provide a pluggable framework so that extensions are easier to maintain.
> 
> Here is the design doc 
> (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing)
>  and working branch 
> (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).
> Any comment would be greatly appreciated.






[jira] [Assigned] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6691:
---

Assignee: Apache Spark

> Abstract and add a dynamic RateLimiter for Spark Streaming
> --
>
> Key: SPARK-6691
> URL: https://issues.apache.org/jira/browse/SPARK-6691
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>
> Flow control (or rate control) for input data is very important in a streaming 
> system, especially for Spark Streaming, to keep it stable and up to date. An 
> unexpected flood of incoming data, or an ingestion rate beyond the computation 
> power of the cluster, will make the system unstable and increase the delay. With 
> Spark Streaming's job generation and processing pattern, this delay accumulates 
> and eventually leads to unacceptable failures.
>
> Currently, in Spark Streaming's receiver-based input stream, there is a 
> RateLimiter in BlockGenerator which controls the ingestion rate of input data, 
> but the current implementation has several limitations:
> # The maximum ingestion rate is set by the user through configuration beforehand; 
> users may lack the experience to choose an appropriate value before the 
> application is running.
> # The configuration is fixed for the lifetime of the application, which means it 
> has to be chosen for the worst-case scenario.
> # Input streams like DirectKafkaInputStream need to maintain a separate solution 
> to achieve the same functionality.
> # The lack of slow-start control easily traps the whole system in large 
> processing and scheduling delays at the very beginning.
>
> So here we propose a dynamic RateLimiter, along with a new RateLimiter interface, 
> to improve the whole system's stability. The goals are:
> * Dynamically adjust the ingestion rate according to the processing rate of 
> previously finished jobs.
> * Offer a uniform solution not only for receiver-based input streams but also for 
> direct streams like DirectKafkaInputStream and new ones.
> * Slow-start the rate to control network congestion when a job is started.
> * Provide a pluggable framework so that extensions are easier to maintain.
> 
> Here is the design doc 
> (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing)
>  and working branch 
> (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter).
> Any comment would be greatly appreciated.






[jira] [Created] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.

2015-04-06 Thread Twinkle Sachdeva (JIRA)
Twinkle Sachdeva created SPARK-6735:
---

 Summary: Provide options to make maximum executor failure count ( 
which kills the application ) relative to a window duration or disable it.
 Key: SPARK-6735
 URL: https://issues.apache.org/jira/browse/SPARK-6735
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit, YARN
Affects Versions: 1.3.0, 1.2.1, 1.2.0
Reporter: Twinkle Sachdeva


Currently there is a setting (spark.yarn.max.executor.failures) which sets the 
maximum number of executor failures, after which the application fails.
For long-running applications, a user may want the application never to be killed 
for this reason, or may want the limit to apply only within a window of time. This 
improvement is to provide options to make the maximum executor failure count (which 
kills the application) relative to a window duration, or to disable it entirely.
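
A minimal, hypothetical sketch of what such a windowed limit could look like (the class and its parameters are illustrative, not an actual Spark API):

{code:scala}
import scala.collection.mutable

// Count executor failures only within a sliding time window; maxFailures = None means
// "never kill the application because of executor failures".
class WindowedFailureTracker(maxFailures: Option[Int], windowMs: Long) {
  private val failureTimes = mutable.Queue[Long]()

  def recordFailure(nowMs: Long = System.currentTimeMillis()): Unit = {
    failureTimes.enqueue(nowMs)
    // Drop failures that fell out of the window.
    while (failureTimes.nonEmpty && nowMs - failureTimes.head > windowMs) {
      failureTimes.dequeue()
    }
  }

  def shouldFailApplication: Boolean =
    maxFailures.exists(limit => failureTimes.size > limit)
}
{code}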






[jira] [Assigned] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6733:
---

Assignee: Apache Spark

> Suppression of usage of Scala existential code should be done
> -
>
> Key: SPARK-6733
> URL: https://issues.apache.org/jira/browse/SPARK-6733
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.3.0
> Environment: OS: OSX Yosemite
> Hardware: Intel Core i7 with 16 GB RAM
>Reporter: Raymond Tay
>Assignee: Apache Spark
>
> The inclusion of this statement in the file 
> {code:scala}
> import scala.language.existentials
> {code}
> should have suppressed all warnings regarding the use of Scala existential code.






[jira] [Assigned] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6733:
---

Assignee: (was: Apache Spark)

> Suppression of usage of Scala existential code should be done
> -
>
> Key: SPARK-6733
> URL: https://issues.apache.org/jira/browse/SPARK-6733
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.3.0
> Environment: OS: OSX Yosemite
> Hardware: Intel Core i7 with 16 GB RAM
>Reporter: Raymond Tay
>
> The inclusion of this statement in the file 
> {code:scala}
> import scala.language.existentials
> {code}
> should have suppressed all warnings regarding the use of Scala existential code.






[jira] [Commented] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482630#comment-14482630
 ] 

Apache Spark commented on SPARK-6733:
-

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/5384

> Suppression of usage of Scala existential code should be done
> -
>
> Key: SPARK-6733
> URL: https://issues.apache.org/jira/browse/SPARK-6733
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.3.0
> Environment: OS: OSX Yosemite
> Hardware: Intel Core i7 with 16 GB RAM
>Reporter: Raymond Tay
>
> The inclusion of this statement in the file 
> {code:scala}
> import scala.language.existentials
> {code}
> should have suppressed all warnings regarding the use of Scala existential code.






[jira] [Assigned] (SPARK-6734) Support GenericUDTF.close for Generate

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6734:
---

Assignee: (was: Apache Spark)

> Support GenericUDTF.close for Generate
> --
>
> Key: SPARK-6734
> URL: https://issues.apache.org/jira/browse/SPARK-6734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> Some third-party UDTF extensions generate additional rows in the 
> "GenericUDTF.close()" method, which is supported by Hive.
> https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
> However, Spark SQL ignores "GenericUDTF.close()", which causes bugs when porting 
> jobs from Hive to Spark SQL.






[jira] [Assigned] (SPARK-6734) Support GenericUDTF.close for Generate

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6734:
---

Assignee: Apache Spark

> Support GenericUDTF.close for Generate
> --
>
> Key: SPARK-6734
> URL: https://issues.apache.org/jira/browse/SPARK-6734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Apache Spark
>
> Some third-party UDTF extensions generate additional rows in the 
> "GenericUDTF.close()" method, which is supported by Hive.
> https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
> However, Spark SQL ignores "GenericUDTF.close()", which causes bugs when porting 
> jobs from Hive to Spark SQL.






[jira] [Commented] (SPARK-6734) Support GenericUDTF.close for Generate

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482567#comment-14482567
 ] 

Apache Spark commented on SPARK-6734:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/5383

> Support GenericUDTF.close for Generate
> --
>
> Key: SPARK-6734
> URL: https://issues.apache.org/jira/browse/SPARK-6734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> Some third-party UDTF extensions generate additional rows in the 
> "GenericUDTF.close()" method, which is supported by Hive.
> https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
> However, Spark SQL ignores "GenericUDTF.close()", which causes bugs when porting 
> jobs from Hive to Spark SQL.






[jira] [Created] (SPARK-6734) Support GenericUDTF.close for Generate

2015-04-06 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-6734:


 Summary: Support GenericUDTF.close for Generate
 Key: SPARK-6734
 URL: https://issues.apache.org/jira/browse/SPARK-6734
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao


Some third-party UDTF extensions generate additional rows in the 
"GenericUDTF.close()" method, which is supported by Hive.

https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF

However, Spark SQL ignores "GenericUDTF.close()", which causes bugs when porting 
jobs from Hive to Spark SQL.
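
As an illustration of the pattern involved, here is a hypothetical Hive UDTF written in Scala that forwards one extra summary row from close(); if the engine never invokes close(), that row is silently dropped, which is the behaviour described above. The class is a sketch, not part of Spark or Hive:

{code:scala}
import java.util.{Arrays => JArrays}

import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Hypothetical UDTF: echoes each input value and emits one summary row from close().
class EchoWithSummary extends GenericUDTF {
  private var seen = 0L

  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector =
    ObjectInspectorFactory.getStandardStructObjectInspector(
      JArrays.asList("value"),
      JArrays.asList(PrimitiveObjectInspectorFactory.javaStringObjectInspector: ObjectInspector))

  override def process(args: Array[AnyRef]): Unit = {
    seen += 1
    forward(Array[AnyRef](String.valueOf(args(0))))
  }

  // Hive calls close() after the last input row; rows forwarded here are lost
  // if the engine ignores close(), which is the bug reported above.
  override def close(): Unit =
    forward(Array[AnyRef](s"rows_seen=$seen"))
}
{code}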






[jira] [Created] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-06 Thread Raymond Tay (JIRA)
Raymond Tay created SPARK-6733:
--

 Summary: Suppression of usage of Scala existential code should be 
done
 Key: SPARK-6733
 URL: https://issues.apache.org/jira/browse/SPARK-6733
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.3.0
 Environment: OS: OSX Yosemite
Hardware: Intel Core i7 with 16 GB RAM
Reporter: Raymond Tay


The inclusion of this statement in the file 

{code:scala}
import scala.language.existentials
{code}

should have suppressed all warnings regarding the use of Scala existential code.
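
For context, here is a small self-contained example (not taken from the Spark code base) of the kind of declaration that triggers the existentials feature warning under -feature unless the language import is in scope:

{code:scala}
import scala.language.existentials

object ExistentialExample {
  // T appears twice, so this existential cannot be rewritten with simple wildcards;
  // without the import above, compiling with -feature reports an existentials warning here.
  type KeyedValues = Map[Class[T], List[T]] forSome { type T }

  val empty: KeyedValues = Map.empty[Class[String], List[String]]
}
{code}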






[jira] [Created] (SPARK-6732) Scala existentials warning during compilation

2015-04-06 Thread Raymond Tay (JIRA)
Raymond Tay created SPARK-6732:
--

 Summary: Scala existentials warning during compilation
 Key: SPARK-6732
 URL: https://issues.apache.org/jira/browse/SPARK-6732
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
 Environment: operating system: OSX Yosemite
scala version: 2.10.4
hardware: 2.7 GHz Intel Core i7, 16 GB 1600 MHz DDR3

Reporter: Raymond Tay
Priority: Minor


Certain parts of the Scala code were detected to use existentials; the Scala 
language import (scala.language.existentials) can be included in the source file to 
prevent such warnings.






[jira] [Assigned] (SPARK-6343) Make doc more explicit regarding network connectivity requirements

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6343:
---

Assignee: Apache Spark

> Make doc more explicit regarding network connectivity requirements
> --
>
> Key: SPARK-6343
> URL: https://issues.apache.org/jira/browse/SPARK-6343
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Peter Parente
>Assignee: Apache Spark
>Priority: Minor
>
> As a new user of Spark, I read through the official documentation before 
> attempting to stand-up my own cluster and write my own driver application. 
> But only after attempting to run my app remotely against my cluster did I 
> realize that full network connectivity (layer 3) is necessary between my 
> driver program and worker nodes (i.e., my driver was *listening* for 
> connections from my workers).
> I returned to the documentation to see how I had missed this requirement. On 
> a second read-through, I saw that the doc hints at it in a few places (e.g., 
> [driver 
> config|http://spark.apache.org/docs/1.2.0/configuration.html#networking], 
> [submitting applications 
> suggestion|http://spark.apache.org/docs/1.2.0/submitting-applications.html], 
> [cluster overview|http://spark.apache.org/docs/1.2.0/cluster-overview.html])  
> but never outright says it.
> I think it would help would-be users better understand how Spark works to 
> state the network connectivity requirements right up-front in the overview 
> section of the doc. I suggest revising the diagram and accompanying text 
> found on the [overview 
> page|http://spark.apache.org/docs/1.2.0/cluster-overview.html]:
> !http://spark.apache.org/docs/1.2.0/img/cluster-overview.png!
> so that it depicts at least the directionality of the network connections 
> initiated (perhaps like so):
> !http://i.imgur.com/2dqGbCr.png!
> and states that the driver must listen for and accept connections from other 
> Spark components on a variety of ports.
> Please treat my diagram and text as strawmen: I expect more experienced Spark 
> users and developers will have better ideas on how to convey these 
> requirements.
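
As one concrete illustration of the requirement (not part of the proposed doc change), the sketch below pins the driver-side ports via SparkConf so that firewall rules can be opened for the worker-to-driver connections described above. The hostname and port numbers are made up, and the exact set of configurable ports depends on the Spark version:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object PinnedDriverPorts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("pinned-driver-ports")
      .set("spark.driver.host", "driver.internal.example.com") // address the workers must reach
      .set("spark.driver.port", "7001")                        // driver RPC endpoint
      .set("spark.blockManager.port", "7003")                  // block transfers to/from the driver
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).sum()) // forces executors to connect back to the driver
    } finally {
      sc.stop()
    }
  }
}
{code}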






[jira] [Assigned] (SPARK-6343) Make doc more explicit regarding network connectivity requirements

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6343:
---

Assignee: (was: Apache Spark)

> Make doc more explicit regarding network connectivity requirements
> --
>
> Key: SPARK-6343
> URL: https://issues.apache.org/jira/browse/SPARK-6343
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Peter Parente
>Priority: Minor
>
> As a new user of Spark, I read through the official documentation before 
> attempting to stand-up my own cluster and write my own driver application. 
> But only after attempting to run my app remotely against my cluster did I 
> realize that full network connectivity (layer 3) is necessary between my 
> driver program and worker nodes (i.e., my driver was *listening* for 
> connections from my workers).
> I returned to the documentation to see how I had missed this requirement. On 
> a second read-through, I saw that the doc hints at it in a few places (e.g., 
> [driver 
> config|http://spark.apache.org/docs/1.2.0/configuration.html#networking], 
> [submitting applications 
> suggestion|http://spark.apache.org/docs/1.2.0/submitting-applications.html], 
> [cluster overview|http://spark.apache.org/docs/1.2.0/cluster-overview.html])  
> but never outright says it.
> I think it would help would-be users better understand how Spark works to 
> state the network connectivity requirements right up-front in the overview 
> section of the doc. I suggest revising the diagram and accompanying text 
> found on the [overview 
> page|http://spark.apache.org/docs/1.2.0/cluster-overview.html]:
> !http://spark.apache.org/docs/1.2.0/img/cluster-overview.png!
> so that it depicts at least the directionality of the network connections 
> initiated (perhaps like so):
> !http://i.imgur.com/2dqGbCr.png!
> and states that the driver must listen for and accept connections from other 
> Spark components on a variety of ports.
> Please treat my diagram and text as strawmen: I expect more experienced Spark 
> users and developers will have better ideas on how to convey these 
> requirements.






[jira] [Commented] (SPARK-6343) Make doc more explicit regarding network connectivity requirements

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482496#comment-14482496
 ] 

Apache Spark commented on SPARK-6343:
-

User 'parente' has created a pull request for this issue:
https://github.com/apache/spark/pull/5382

> Make doc more explicit regarding network connectivity requirements
> --
>
> Key: SPARK-6343
> URL: https://issues.apache.org/jira/browse/SPARK-6343
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Peter Parente
>Priority: Minor
>
> As a new user of Spark, I read through the official documentation before 
> attempting to stand-up my own cluster and write my own driver application. 
> But only after attempting to run my app remotely against my cluster did I 
> realize that full network connectivity (layer 3) is necessary between my 
> driver program and worker nodes (i.e., my driver was *listening* for 
> connections from my workers).
> I returned to the documentation to see how I had missed this requirement. On 
> a second read-through, I saw that the doc hints at it in a few places (e.g., 
> [driver 
> config|http://spark.apache.org/docs/1.2.0/configuration.html#networking], 
> [submitting applications 
> suggestion|http://spark.apache.org/docs/1.2.0/submitting-applications.html], 
> [cluster overview|http://spark.apache.org/docs/1.2.0/cluster-overview.html])  
> but never outright says it.
> I think it would help would-be users better understand how Spark works to 
> state the network connectivity requirements right up-front in the overview 
> section of the doc. I suggest revising the diagram and accompanying text 
> found on the [overview 
> page|http://spark.apache.org/docs/1.2.0/cluster-overview.html]:
> !http://spark.apache.org/docs/1.2.0/img/cluster-overview.png!
> so that it depicts at least the directionality of the network connections 
> initiated (perhaps like so):
> !http://i.imgur.com/2dqGbCr.png!
> and states that the driver must listen for and accept connections from other 
> Spark components on a variety of ports.
> Please treat my diagram and text as strawmen: I expect more experienced Spark 
> users and developers will have better ideas on how to convey these 
> requirements.






[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-06 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482479#comment-14482479
 ] 

Lianhui Wang commented on SPARK-6700:
-

I tested with Hadoop 2.3.0 and it works. Can you attach 
spark/yarn/target/unit-test.log? I think it has more information about the 
failure.

> flaky test: run Python application in yarn-cluster mode 
> 
>
> Key: SPARK-6700
> URL: https://issues.apache.org/jira/browse/SPARK-6700
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Davies Liu
>Assignee: Lianhui Wang
>Priority: Critical
>  Labels: test, yarn
>
> org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
> yarn-cluster mode
> Failing for the past 1 build (Since Failed#2025 )
> Took 12 sec.
> Error Message
> {code}
> Process 
> List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
>  --master, yarn-cluster, --num-executors, 1, --properties-file, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
>  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
>  exited with code 1
> Stacktrace
> sbt.ForkMain$ForkError: Process 
> List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
>  --master, yarn-cluster, --num-executors, 1, --properties-file, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
>  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
>  exited with code 1
>   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 

[jira] [Commented] (SPARK-6731) Upgrade Apache commons-math3 to 3.4.1

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482460#comment-14482460
 ] 

Apache Spark commented on SPARK-6731:
-

User 'punya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5380

> Upgrade Apache commons-math3 to 3.4.1
> -
>
> Key: SPARK-6731
> URL: https://issues.apache.org/jira/browse/SPARK-6731
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Punya Biswal
>
> Spark depends on Apache commons-math3 version 3.1.1, which is 2 years old. 
> The current version (3.4.1) includes approximate percentile statistics (among 
> other things).






[jira] [Assigned] (SPARK-6731) Upgrade Apache commons-math3 to 3.4.1

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6731:
---

Assignee: (was: Apache Spark)

> Upgrade Apache commons-math3 to 3.4.1
> -
>
> Key: SPARK-6731
> URL: https://issues.apache.org/jira/browse/SPARK-6731
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Punya Biswal
>
> Spark depends on Apache commons-math3 version 3.1.1, which is 2 years old. 
> The current version (3.4.1) includes approximate percentile statistics (among 
> other things).






[jira] [Assigned] (SPARK-6731) Upgrade Apache commons-math3 to 3.4.1

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6731:
---

Assignee: Apache Spark

> Upgrade Apache commons-math3 to 3.4.1
> -
>
> Key: SPARK-6731
> URL: https://issues.apache.org/jira/browse/SPARK-6731
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Punya Biswal
>Assignee: Apache Spark
>
> Spark depends on Apache commons-math3 version 3.1.1, which is 2 years old. 
> The current version (3.4.1) includes approximate percentile statistics (among 
> other things).






[jira] [Created] (SPARK-6731) Upgrade Apache commons-math3 to 3.4.1

2015-04-06 Thread Punya Biswal (JIRA)
Punya Biswal created SPARK-6731:
---

 Summary: Upgrade Apache commons-math3 to 3.4.1
 Key: SPARK-6731
 URL: https://issues.apache.org/jira/browse/SPARK-6731
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Punya Biswal


Spark depends on Apache commons-math3 version 3.1.1, which is 2 years old. The 
current version (3.4.1) includes approximate percentile statistics (among other 
things).
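
For example, assuming commons-math3 3.3 or later is on the classpath, the streaming estimator below computes an approximate percentile without storing all values (a small standalone sketch, not Spark code):

{code:scala}
import org.apache.commons.math3.stat.descriptive.rank.PSquarePercentile

object ApproxMedian {
  def main(args: Array[String]): Unit = {
    val median = new PSquarePercentile(50.0)   // estimate the 50th percentile
    (1 to 1000000).foreach(i => median.increment(i.toDouble))
    println(s"approximate median = ${median.getResult}")
  }
}
{code}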






[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-06 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482455#comment-14482455
 ] 

Chris Fregly commented on SPARK-6514:
-

We may want to inspect the streamURL for the region.

Otherwise, we would need to make the new regionName param more explicit about its 
meaning (i.e. dynamoRegion), but that exposes the implementation, which is not good.
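
A hypothetical sketch of the first option, assuming the stream is identified by a standard Kinesis endpoint URL (the helper and regex are illustrative, not existing Spark code):

{code:scala}
// Derive the region name from an endpoint such as "https://kinesis.us-east-1.amazonaws.com",
// so the same region can be reused for the KCL's DynamoDB checkpoint table.
object KinesisRegion {
  private val EndpointPattern = """https?://kinesis\.([a-z0-9-]+)\.amazonaws\.com/?""".r

  def fromEndpoint(endpointUrl: String): Option[String] = endpointUrl match {
    case EndpointPattern(region) => Some(region)
    case _                       => None
  }
}

// KinesisRegion.fromEndpoint("https://kinesis.us-east-1.amazonaws.com") == Some("us-east-1")
{code}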

> For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
> the Kinesis stream itself  
> 
>
> Key: SPARK-6514
> URL: https://issues.apache.org/jira/browse/SPARK-6514
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Chris Fregly
>
> context:  i started the original Kinesis impl with KCL 1.0 (not supported), 
> then finished on KCL 1.1 (supported) without realizing that it's supported.
> also, we should upgrade to the latest Kinesis Client Library (KCL) which is 
> currently v1.2 right now, i believe.






[jira] [Created] (SPARK-6730) Can't have table as identifier in OPTIONS

2015-04-06 Thread Alex Liu (JIRA)
Alex Liu created SPARK-6730:
---

 Summary: Can't have table as identifier in OPTIONS
 Key: SPARK-6730
 URL: https://issues.apache.org/jira/browse/SPARK-6730
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Alex Liu


The following query fails because there is an identifier "table" in OPTIONS:

{code}
CREATE TEMPORARY TABLE ddlTable
USING org.apache.spark.sql.cassandra
OPTIONS (
 table "test1",
 keyspace "test"
)
{code}

The following error occurs:

{code}

]   java.lang.RuntimeException: [1.2] failure: ``insert'' expected but 
identifier CREATE found
[info] 
[info]  CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra 
OPTIONS (  table "test1",  keyspace "dstest"  )   
[info]  ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
[info]   at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
{code}
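
Until the parser accepts "table" as an option key, one possible workaround (a sketch against the Spark 1.3 API; the option keys and values are only illustrative) is to register the relation through the programmatic data-source API instead of the DDL parser:

{code:scala}
import org.apache.spark.sql.SQLContext

object LoadWithoutDdl {
  def registerCassandraTable(sqlContext: SQLContext): Unit = {
    // The options map never goes through the SQL/DDL parser, so "table" is just a key.
    val df = sqlContext.load(
      "org.apache.spark.sql.cassandra",
      Map("table" -> "test1", "keyspace" -> "test"))
    df.registerTempTable("ddlTable")
  }
}
{code}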






[jira] [Updated] (SPARK-6730) Can't have table as identifier in OPTIONS

2015-04-06 Thread Alex Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Liu updated SPARK-6730:

Description: 
The following query fails because there is an identifier "table" in OPTIONS:

{code}
CREATE TEMPORARY TABLE ddlTable
USING org.apache.spark.sql.cassandra
OPTIONS (
 table "test1",
 keyspace "test"
)
{code} 

The following error occurs:

{code}

]   java.lang.RuntimeException: [1.2] failure: ``insert'' expected but 
identifier CREATE found
[info] 
[info]  CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra 
OPTIONS (  table "test1",  keyspace "dstest"  )   
[info]  ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
[info]   at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
{code}

  was:
The following query fails because there is an  identifier "table" in OPTIONS

{code}
CREATE TEMPORARY TABLE ddlTable
USING org.apache.spark.sql.cassandra
OPTIONS (
 table "test1",
 keyspace "test"
{code} 

The following error

{code}

]   java.lang.RuntimeException: [1.2] failure: ``insert'' expected but 
identifier CREATE found
[info] 
[info]  CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra 
OPTIONS (  table "test1",  keyspace "dstest"  )   
[info]  ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at

[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-04-06 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482414#comment-14482414
 ] 

Kostas Sakellis commented on SPARK-6506:


I ran into this issue too by running:
bq. spark-submit  --master yarn-cluster examples/pi.py 4

It looks like I only had to set spark.yarn.appMasterEnv.SPARK_HOME=/bogus to 
get it going:
bq. spark-submit --conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus --master 
yarn-cluster pi.py 4


> python support yarn cluster mode requires SPARK_HOME to be set
> --
>
> Key: SPARK-6506
> URL: https://issues.apache.org/jira/browse/SPARK-6506
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> We added support for python running in yarn cluster mode in 
> https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
> SPARK_HOME be set in the environment variables for application master and 
> executor. It doesn't have to be set to anything real, but it fails if it's not 
> set. See the command at the end of: https://github.com/apache/spark/pull/3976






[jira] [Updated] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-06 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6514:

Description: 
context:  i started the original Kinesis impl with KCL 1.0 (not supported), 
then finished on KCL 1.1 (supported) without realizing that it's supported.

also, we should upgrade to the latest Kinesis Client Library (KCL) which is 
currently v1.2 right now, i believe.

  was:
context:  i started the original Kinesis impl with KCL 1.0 (not supported), 
then finished on KCL 1.1 (supported).

also, we should upgrade to the latest Kinesis Client Library (KCL) which is 
currently v1.2 right now, i believe.


> For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
> the Kinesis stream itself  
> 
>
> Key: SPARK-6514
> URL: https://issues.apache.org/jira/browse/SPARK-6514
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Chris Fregly
>
> context:  i started the original Kinesis impl with KCL 1.0 (not supported), 
> then finished on KCL 1.1 (supported) without realizing that it's supported.
> also, we should upgrade to the latest Kinesis Client Library (KCL) which is 
> currently v1.2 right now, i believe.






[jira] [Updated] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-06 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6514:

Description: 
context:  i started the original Kinesis impl with KCL 1.0 (not supported), 
then finished on KCL 1.1 (supported).

also, we should upgrade to the latest Kinesis Client Library (KCL) which is 
currently v1.2 right now, i believe.

  was:
this was not supported when i originally wrote this receiver.

this is now supported.  also, upgrade to the latest Kinesis Client Library 
(KCL) which is 1.2, i believe.


> For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
> the Kinesis stream itself  
> 
>
> Key: SPARK-6514
> URL: https://issues.apache.org/jira/browse/SPARK-6514
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Chris Fregly
>
> context:  i started the original Kinesis impl with KCL 1.0 (not supported), 
> then finished on KCL 1.1 (supported).
> also, we should upgrade to the latest Kinesis Client Library (KCL) which is 
> currently v1.2 right now, i believe.






[jira] [Updated] (SPARK-6599) Improve reliability and usability of Kinesis-based Spark Streaming

2015-04-06 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6599:

Summary: Improve reliability and usability of Kinesis-based Spark Streaming 
 (was: Add Kinesis Direct API)

> Improve reliability and usability of Kinesis-based Spark Streaming
> --
>
> Key: SPARK-6599
> URL: https://issues.apache.org/jira/browse/SPARK-6599
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>







[jira] [Updated] (SPARK-2960) Spark executables fail to start via symlinks

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2960:
-
Component/s: Deploy

> Spark executables fail to start via symlinks
> 
>
> Key: SPARK-2960
> URL: https://issues.apache.org/jira/browse/SPARK-2960
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Shay Rojansky
>Priority: Minor
>
> The current scripts (e.g. pyspark) fail to run when they are executed via 
> symlinks. A common Linux scenario would be to have Spark installed somewhere 
> (e.g. /opt) and have a symlink to it in /usr/bin.






[jira] [Updated] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-06 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6514:

Target Version/s: 1.4.0  (was: 1.3.1)

> For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
> the Kinesis stream itself  
> 
>
> Key: SPARK-6514
> URL: https://issues.apache.org/jira/browse/SPARK-6514
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Chris Fregly
>
> this was not supported when i originally wrote this receiver.
> this is now supported.  also, upgrade to the latest Kinesis Client Library 
> (KCL) which is 1.2, i believe.






[jira] [Commented] (SPARK-6721) IllegalStateException

2015-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482367#comment-14482367
 ] 

Sean Owen commented on SPARK-6721:
--

(Also "IllegalStateException" isn't a useful JIRA name -- please edit it to 
something more meaningful, like including "mongo")

> IllegalStateException
> -
>
> Key: SPARK-6721
> URL: https://issues.apache.org/jira/browse/SPARK-6721
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
>Reporter: Luis Rodríguez Trejo
>  Labels: MongoDB, java.lang.IllegalStateexception, 
> saveAsNewAPIHadoopFile
>
> I get the following exception when using saveAsNewAPIHadoopFile:
> {code}
> 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
> 10.0.2.15): java.lang.IllegalStateException: open
> at org.bson.util.Assertions.isTrue(Assertions.java:36)
> at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
> at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
> at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
> at com.mongodb.DBCollection.insert(DBCollection.java:161)
> at com.mongodb.DBCollection.insert(DBCollection.java:107)
> at com.mongodb.DBCollection.save(DBCollection.java:1049)
> at com.mongodb.DBCollection.save(DBCollection.java:1014)
> at 
> com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Before Spark 1.3.0 this would crash the application, but now the data just
> remains unprocessed.
> There is no "close" call anywhere in the application code.
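
For context, below is a minimal sketch of the kind of write path that hits this code, assuming Spark 1.3, an existing SparkContext {{sc}}, and the mongo-hadoop connector on the classpath; the URI, field names, and collection are illustrative only, not taken from the reporter's application.

{code}
// Illustrative sketch only -- not the reporter's code. Writes simple documents
// to MongoDB via saveAsNewAPIHadoopFile and the mongo-hadoop MongoOutputFormat.
import org.apache.hadoop.conf.Configuration
import org.bson.{BSONObject, BasicBSONObject}
import com.mongodb.hadoop.MongoOutputFormat

val outputConfig = new Configuration()
// Illustrative URI: database "test", collection "example"
outputConfig.set("mongo.output.uri", "mongodb://127.0.0.1:27017/test.example")

val docs = sc.parallelize(1 to 100).map { i =>
  val doc: BSONObject = new BasicBSONObject()
  doc.put("value", Integer.valueOf(i))
  (null.asInstanceOf[Object], doc)   // the key is ignored by MongoOutputFormat
}

docs.saveAsNewAPIHadoopFile(
  "file:///unused-path",             // the path argument is not used by MongoOutputFormat
  classOf[Object],
  classOf[BSONObject],
  classOf[MongoOutputFormat[Object, BSONObject]],
  outputConfig)
{code}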



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6721) IllegalStateException

2015-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482366#comment-14482366
 ] 

Sean Owen commented on SPARK-6721:
--

Isn't this an error / config problem in Mongo rather than Spark?

> IllegalStateException
> -
>
> Key: SPARK-6721
> URL: https://issues.apache.org/jira/browse/SPARK-6721
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
>Reporter: Luis Rodríguez Trejo
>  Labels: MongoDB, java.lang.IllegalStateexception, 
> saveAsNewAPIHadoopFile
>
> I get the following exception when using saveAsNewAPIHadoopFile:
> {code}
> 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
> 10.0.2.15): java.lang.IllegalStateException: open
> at org.bson.util.Assertions.isTrue(Assertions.java:36)
> at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
> at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
> at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
> at com.mongodb.DBCollection.insert(DBCollection.java:161)
> at com.mongodb.DBCollection.insert(DBCollection.java:107)
> at com.mongodb.DBCollection.save(DBCollection.java:1049)
> at com.mongodb.DBCollection.save(DBCollection.java:1014)
> at 
> com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Before Spark 1.3.0 this would crash the application, but now the data just
> remains unprocessed.
> There is no "close" call anywhere in the application code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6729) DriverQuirks get can get OutOfBounds exception in some cases

2015-04-06 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-6729:
--
Assignee: Volodymyr Lyubinets

> DriverQuirks get can get OutOfBounds exception in some cases
> 
>
> Key: SPARK-6729
> URL: https://issues.apache.org/jira/browse/SPARK-6729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Volodymyr Lyubinets
>Assignee: Volodymyr Lyubinets
>Priority: Minor
> Fix For: 1.4.0
>
>
> The function uses .substring(0, X), which will trigger OutOfBoundsException 
> if string length is less than X. A better way to do this is to use 
> startsWith, which won't error out in this case. I'll propose a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6729) DriverQuirks get can get OutOfBounds exception in some cases

2015-04-06 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-6729.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

> DriverQuirks get can get OutOfBounds exception in some cases
> 
>
> Key: SPARK-6729
> URL: https://issues.apache.org/jira/browse/SPARK-6729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Volodymyr Lyubinets
>Assignee: Volodymyr Lyubinets
>Priority: Minor
> Fix For: 1.4.0
>
>
> The function uses .substring(0, X), which will trigger OutOfBoundsException 
> if string length is less than X. A better way to do this is to use 
> startsWith, which won't error out in this case. I'll propose a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-04-06 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482329#comment-14482329
 ] 

Michael Armbrust commented on SPARK-5281:
-

I'll add that this is the trick we use if you run {{build/sbt sparkShell}} from 
the spark distribution.

> Registering table on RDD is giving MissingRequirementError
> --
>
> Key: SPARK-5281
> URL: https://issues.apache.org/jira/browse/SPARK-5281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: sarsol
>Priority: Critical
>
> The application crashes on this line, {{rdd.registerTempTable("temp")}}, in 
> version 1.2 when using sbt or the Eclipse Scala IDE
> Stacktrace:
> {code}
> Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
> class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
> primordial classloader with boot classpath 
> [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
>  Files\Java\jre7\lib\resources.jar;C:\Program 
> Files\Java\jre7\lib\rt.jar;C:\Program 
> Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
> Files\Java\jre7\lib\jsse.jar;C:\Program 
> Files\Java\jre7\lib\jce.jar;C:\Program 
> Files\Java\jre7\lib\charsets.jar;C:\Program 
> Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
>   at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
>   at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
>   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
>   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
>   at 
> com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2015-04-06 Thread Sai Nishanth Parepally (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482297#comment-14482297
 ] 

Sai Nishanth Parepally commented on SPARK-3219:
---

[~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering 
going to be merged into MLlib? I would like to use "Jaccard distance" as a 
distance metric for k-means clustering.

> K-Means clusterer should support Bregman distance functions
> ---
>
> Key: SPARK-3219
> URL: https://issues.apache.org/jira/browse/SPARK-3219
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>  Labels: clustering
>
> The K-Means clusterer supports the Euclidean distance metric.  However, it is 
> rather straightforward to support Bregman 
> (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
> distance functions which would increase the utility of the clusterer 
> tremendously.
> I have modified the clusterer to support pluggable distance functions.  
> However, I notice that there are hundreds of outstanding pull requests.  If 
> someone is willing to work with me to sponsor the work through the process, I 
> will create a pull request.  Otherwise, I will just keep my own fork.
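
For readers unfamiliar with the term, here is a minimal sketch of what a pluggable Bregman divergence could look like (illustrative only, not code from the fork mentioned above); squared Euclidean distance is the special case phi(v) = ||v||^2.

{code}
// Illustrative sketch: a Bregman divergence is defined by a convex function phi
// via D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>.
trait BregmanDivergence {
  def phi(v: Array[Double]): Double
  def gradPhi(v: Array[Double]): Array[Double]

  def divergence(x: Array[Double], y: Array[Double]): Double = {
    val diff = x.zip(y).map { case (a, b) => a - b }
    phi(x) - phi(y) - gradPhi(y).zip(diff).map { case (g, d) => g * d }.sum
  }
}

// phi(v) = ||v||^2 recovers the squared Euclidean distance used by K-Means today.
object SquaredEuclidean extends BregmanDivergence {
  def phi(v: Array[Double]): Double = v.map(x => x * x).sum
  def gradPhi(v: Array[Double]): Array[Double] = v.map(_ * 2.0)
}
{code}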



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-04-06 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482291#comment-14482291
 ] 

William Benton commented on SPARK-5281:
---

As [~marmbrus] recently pointed out on the user list, this happens when you 
don't have all of the dependencies for Scala reflection loaded by the 
primordial classloader.  For running apps from sbt, setting {{fork := true}} 
should do the trick.  For running a REPL from sbt, try [this 
workaround|http://chapeau.freevariable.com/2015/04/spark-sql-repl.html].  
(Sorry to not have a solution for Eclipse.)
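
For reference, a minimal build.sbt sketch of the sbt workaround mentioned above; the project name, Scala version, and dependency line are illustrative assumptions, and the relevant setting is {{fork := true}}:

{code}
// build.sbt -- minimal sketch. Forking a separate JVM for `run` lets the Scala
// reflection dependencies load through a normal classloader instead of sbt's own.
name := "spark-sql-repro"        // illustrative

scalaVersion := "2.10.4"         // illustrative

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.0"

// The workaround: run the application in a forked JVM.
fork := true
{code}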

> Registering table on RDD is giving MissingRequirementError
> --
>
> Key: SPARK-5281
> URL: https://issues.apache.org/jira/browse/SPARK-5281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: sarsol
>Priority: Critical
>
> The application crashes on this line, {{rdd.registerTempTable("temp")}}, in 
> version 1.2 when using sbt or the Eclipse Scala IDE
> Stacktrace:
> {code}
> Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
> class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
> primordial classloader with boot classpath 
> [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
>  Files\Java\jre7\lib\resources.jar;C:\Program 
> Files\Java\jre7\lib\rt.jar;C:\Program 
> Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
> Files\Java\jre7\lib\jsse.jar;C:\Program 
> Files\Java\jre7\lib\jce.jar;C:\Program 
> Files\Java\jre7\lib\charsets.jar;C:\Program 
> Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
>   at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
>   at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
>   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
>   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
>   at 
> com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-04-06 Thread Patrick Walsh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482244#comment-14482244
 ] 

Patrick Walsh commented on SPARK-5281:
--

I also have this issue with Spark 1.3.0.  Even example snippets where case 
classes are used in the RDDs trigger the problem.  For me, this happens from 
Eclipse and from sbt.

> Registering table on RDD is giving MissingRequirementError
> --
>
> Key: SPARK-5281
> URL: https://issues.apache.org/jira/browse/SPARK-5281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: sarsol
>Priority: Critical
>
> The application crashes on this line, {{rdd.registerTempTable("temp")}}, in 
> version 1.2 when using sbt or the Eclipse Scala IDE
> Stacktrace:
> {code}
> Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
> class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
> primordial classloader with boot classpath 
> [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
>  Files\Java\jre7\lib\resources.jar;C:\Program 
> Files\Java\jre7\lib\rt.jar;C:\Program 
> Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
> Files\Java\jre7\lib\jsse.jar;C:\Program 
> Files\Java\jre7\lib\jce.jar;C:\Program 
> Files\Java\jre7\lib\charsets.jar;C:\Program 
> Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
>   at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
>   at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
>   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
>   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
>   at 
> com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6729) DriverQuirks get can get OutOfBounds exception in some cases

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6729:
---

Assignee: Apache Spark

> DriverQuirks get can get OutOfBounds exception in some cases
> 
>
> Key: SPARK-6729
> URL: https://issues.apache.org/jira/browse/SPARK-6729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Volodymyr Lyubinets
>Assignee: Apache Spark
>Priority: Minor
>
> The function uses .substring(0, X), which will trigger OutOfBoundsException 
> if string length is less than X. A better way to do this is to use 
> startsWith, which won't error out in this case. I'll propose a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6729) DriverQuirks get can get OutOfBounds exception in some cases

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482193#comment-14482193
 ] 

Apache Spark commented on SPARK-6729:
-

User 'vlyubin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5378

> DriverQuirks get can get OutOfBounds exception in some cases
> 
>
> Key: SPARK-6729
> URL: https://issues.apache.org/jira/browse/SPARK-6729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Volodymyr Lyubinets
>Priority: Minor
>
> The function uses .substring(0, X), which will trigger OutOfBoundsException 
> if string length is less than X. A better way to do this is to use 
> startsWith, which won't error out in this case. I'll propose a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6729) DriverQuirks get can get OutOfBounds exception in some cases

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6729:
---

Assignee: (was: Apache Spark)

> DriverQuirks get can get OutOfBounds exception in some cases
> 
>
> Key: SPARK-6729
> URL: https://issues.apache.org/jira/browse/SPARK-6729
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Volodymyr Lyubinets
>Priority: Minor
>
> The function uses .substring(0, X), which will trigger OutOfBoundsException 
> if string length is less than X. A better way to do this is to use 
> startsWith, which won't error out in this case. I'll propose a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6729) DriverQuirks get can get OutOfBounds exception in some cases

2015-04-06 Thread Volodymyr Lyubinets (JIRA)
Volodymyr Lyubinets created SPARK-6729:
--

 Summary: DriverQuirks get can get OutOfBounds exception in some cases
 Key: SPARK-6729
 URL: https://issues.apache.org/jira/browse/SPARK-6729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Volodymyr Lyubinets
Priority: Minor


The function uses .substring(0, X), which will trigger OutOfBoundsException if 
string length is less than X. A better way to do this is to use startsWith, 
which won't error out in this case. I'll propose a patch shortly.
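
As a minimal illustration of the change described above (not the actual DriverQuirks code), compare substring-based prefix matching with startsWith on a short JDBC URL:

{code}
// Illustrative sketch only. "jdbc:mysql" has length 10, so substring(0, 10)
// throws StringIndexOutOfBoundsException for any shorter URL string.
def looksLikeMySqlUnsafe(url: String): Boolean = url.substring(0, 10) == "jdbc:mysql"

// startsWith simply returns false for shorter strings instead of throwing.
def looksLikeMySqlSafe(url: String): Boolean = url.startsWith("jdbc:mysql")

// looksLikeMySqlUnsafe("jdbc:h2")  -> throws StringIndexOutOfBoundsException
// looksLikeMySqlSafe("jdbc:h2")    -> false
{code}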



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482157#comment-14482157
 ] 

Apache Spark commented on SPARK-6229:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5377

> Support SASL encryption in network/common module
> 
>
> Key: SPARK-6229
> URL: https://issues.apache.org/jira/browse/SPARK-6229
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>
> After SASL support has been added to network/common, supporting encryption 
> should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. 
> Since the latter requires a valid kerberos login to work (and so doesn't 
> really work with executors), encryption would require the use of DIGEST-MD5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6229) Support SASL encryption in network/common module

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6229:
---

Assignee: Apache Spark

> Support SASL encryption in network/common module
> 
>
> Key: SPARK-6229
> URL: https://issues.apache.org/jira/browse/SPARK-6229
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> After SASL support has been added to network/common, supporting encryption 
> should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. 
> Since the latter requires a valid kerberos login to work (and so doesn't 
> really work with executors), encryption would require the use of DIGEST-MD5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6229) Support SASL encryption in network/common module

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6229:
---

Assignee: (was: Apache Spark)

> Support SASL encryption in network/common module
> 
>
> Key: SPARK-6229
> URL: https://issues.apache.org/jira/browse/SPARK-6229
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>
> After SASL support has been added to network/common, supporting encryption 
> should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. 
> Since the latter requires a valid kerberos login to work (and so doesn't 
> really work with executors), encryption would require the use of DIGEST-MD5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6728) Improve performance of py4j for large bytearray

2015-04-06 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6728:

Affects Version/s: 1.3.0

> Improve performance of py4j for large bytearray
> ---
>
> Key: SPARK-6728
> URL: https://issues.apache.org/jira/browse/SPARK-6728
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Davies Liu
>
> PySpark relies on py4j to transfer function arguments and return values between 
> Python and the JVM, and it is very slow to pass a large bytearray (larger than 10M). 
> In MLlib, it is possible to have a Vector with more than 100M bytes, which 
> will need a few GB of memory and may crash.
> The reason is that py4j uses a text protocol: it encodes the bytearray as 
> base64 and does multiple string concatenations. 
> A binary protocol would help a lot; an issue has been created for py4j: 
> https://github.com/bartdag/py4j/issues/159



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6728) Improve performance of py4j for large bytearray

2015-04-06 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6728:

Priority: Critical  (was: Major)
Target Version/s: 1.4.0

> Improve performance of py4j for large bytearray
> ---
>
> Key: SPARK-6728
> URL: https://issues.apache.org/jira/browse/SPARK-6728
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.3.0
>Reporter: Davies Liu
>Priority: Critical
>
> PySpark relies on py4j to transfer function arguments and return values between 
> Python and the JVM, and it is very slow to pass a large bytearray (larger than 10M). 
> In MLlib, it is possible to have a Vector with more than 100M bytes, which 
> will need a few GB of memory and may crash.
> The reason is that py4j uses a text protocol: it encodes the bytearray as 
> base64 and does multiple string concatenations. 
> A binary protocol would help a lot; an issue has been created for py4j: 
> https://github.com/bartdag/py4j/issues/159



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6728) Improve performance of py4j for large bytearray

2015-04-06 Thread Davies Liu (JIRA)
Davies Liu created SPARK-6728:
-

 Summary: Improve performance of py4j for large bytearray
 Key: SPARK-6728
 URL: https://issues.apache.org/jira/browse/SPARK-6728
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu


PySpark relies on py4j to transfer function arguments and return values between 
Python and the JVM, and it is very slow to pass a large bytearray (larger than 10M). 

In MLlib, it is possible to have a Vector with more than 100M bytes, which will 
need a few GB of memory and may crash.

The reason is that py4j uses a text protocol: it encodes the bytearray as 
base64 and does multiple string concatenations. 

A binary protocol would help a lot; an issue has been created for py4j: 
https://github.com/bartdag/py4j/issues/159



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6710) Wrong initial bias in GraphX SVDPlusPlus

2015-04-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14482063#comment-14482063
 ] 

Reynold Xin commented on SPARK-6710:


[~michaelmalak] would you like to submit a pull request for this?

> Wrong initial bias in GraphX SVDPlusPlus
> 
>
> Key: SPARK-6710
> URL: https://issues.apache.org/jira/browse/SPARK-6710
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.0
>Reporter: Michael Malak
>  Labels: easyfix
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In the initialization portion of GraphX SVDPlusPlus, the initialization of 
> biases appears to be incorrect. Specifically, in line 
> https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96
>  
> instead of 
> (vd._1, vd._2, msg.get._2 / msg.get._1, 1.0 / scala.math.sqrt(msg.get._1)) 
> it should probably be 
> (vd._1, vd._2, msg.get._2 / msg.get._1 - u, 1.0 / 
> scala.math.sqrt(msg.get._1)) 
> That is, the biases bu and bi (both represented as the third component of the 
> Tuple4[] above, depending on whether the vertex is a user or an item), 
> described in equation (1) of the Koren paper, are supposed to be small 
> offsets to the mean (represented by the variable u, signifying the Greek 
> letter mu) to account for peculiarities of individual users and items. 
> Initializing these biases to wrong values should theoretically not matter 
> given enough iterations of the algorithm, but some quick empirical testing 
> shows it has trouble converging at all, even after many orders of magnitude 
> additional iterations. 
> This perhaps could be the source of previously reported trouble with 
> SVDPlusPlus. 
> http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-SVDPlusPlus-problem-td12885.html
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6704) integrate SparkR docs build tool into Spark doc build

2015-04-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481972#comment-14481972
 ] 

Davies Liu commented on SPARK-6704:
---

Great, thanks!

> integrate SparkR docs build tool into Spark doc build
> -
>
> Key: SPARK-6704
> URL: https://issues.apache.org/jira/browse/SPARK-6704
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Davies Liu
>Priority: Blocker
>
> We should integrate the SparkR docs build tool into Spark one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6727) Model export/import for spark.ml: HashingTF

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6727:


 Summary: Model export/import for spark.ml: HashingTF
 Key: SPARK-6727
 URL: https://issues.apache.org/jira/browse/SPARK-6727
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6726) Model export/import for spark.ml: LogisticRegression

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6726:


 Summary: Model export/import for spark.ml: LogisticRegression
 Key: SPARK-6726
 URL: https://issues.apache.org/jira/browse/SPARK-6726
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6725) Model export/import for Pipeline API

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6725:


 Summary: Model export/import for Pipeline API
 Key: SPARK-6725
 URL: https://issues.apache.org/jira/browse/SPARK-6725
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical


This is an umbrella JIRA for adding model export/import to the spark.ml API.  
This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
format, not for other formats like PMML.

This will require the following steps:
* Add export/import for all PipelineStages supported by spark.ml
** This will include some Transformers which are not Models.
** These can use almost the same format as the spark.mllib model save/load 
functions, but the model metadata must store a different class name (marking 
the class as a spark.ml class).
* After all PipelineStages support save/load, add an interface that forces 
future additions to support save/load (a minimal sketch follows below).
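
As a rough illustration of the interface mentioned in the last step, here is a minimal sketch; the trait names are hypothetical placeholders, not the final spark.ml API:

{code}
// Hypothetical sketch only -- names and shapes are illustrative.
trait MLWritable {
  /** Save this stage's params and data under `path`, plus metadata recording
    * the spark.ml class name and a format version. */
  def save(path: String): Unit
}

trait MLReadable[T] {
  /** Load a previously saved stage from `path`, checking the stored class
    * name and format version. */
  def load(path: String): T
}

// A PipelineStage that mixes in MLWritable is then forced by the compiler to
// implement save(), which is how future additions would be kept compliant.
{code}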




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6724) Model import/export for FPGrowth

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6724:


 Summary: Model import/export for FPGrowth
 Key: SPARK-6724
 URL: https://issues.apache.org/jira/browse/SPARK-6724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6723) Model import/export for ChiSqSelector

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6723:


 Summary: Model import/export for ChiSqSelector
 Key: SPARK-6723
 URL: https://issues.apache.org/jira/browse/SPARK-6723
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6722) Model import/export for StreamingKMeansModel

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6722:


 Summary: Model import/export for StreamingKMeansModel
 Key: SPARK-6722
 URL: https://issues.apache.org/jira/browse/SPARK-6722
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


CC: [~freeman-lab] Is this API stable enough to merit adding import/export 
(which will require supporting the model format version from now on)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5988) Model import/export for PowerIterationClusteringModel

2015-04-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481891#comment-14481891
 ] 

Joseph K. Bradley commented on SPARK-5988:
--

Feel free to go ahead!  I just assigned it to you.  Thanks!

> Model import/export for PowerIterationClusteringModel
> -
>
> Key: SPARK-5988
> URL: https://issues.apache.org/jira/browse/SPARK-5988
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> Add save/load for PowerIterationClusteringModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5988) Model import/export for PowerIterationClusteringModel

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5988:
-
Assignee: Xusen Yin

> Model import/export for PowerIterationClusteringModel
> -
>
> Key: SPARK-5988
> URL: https://issues.apache.org/jira/browse/SPARK-5988
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> Add save/load for PowerIterationClusteringModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6718) Improve the test on normL1/normL2 of summary statistics

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-6718.

Resolution: Duplicate

> Improve the test on normL1/normL2 of summary statistics
> ---
>
> Key: SPARK-6718
> URL: https://issues.apache.org/jira/browse/SPARK-6718
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Kai Sasaki
>Priority: Minor
>
> As discussed on https://github.com/apache/spark/pull/5359, we should improve 
> the unit test there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Component/s: PySpark

> PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
> --
>
> Key: SPARK-6720
> URL: https://issues.apache.org/jira/browse/SPARK-6720
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Kai Sasaki
>Priority: Minor
>
> Implement correct normL1 and normL2 test.
> continuation: https://github.com/apache/spark/pull/5359



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Assignee: Kai Sasaki

> PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
> --
>
> Key: SPARK-6720
> URL: https://issues.apache.org/jira/browse/SPARK-6720
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Kai Sasaki
>Assignee: Kai Sasaki
>Priority: Minor
>
> Implement correct normL1 and normL2 test.
> continuation: https://github.com/apache/spark/pull/5359



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Target Version/s: 1.4.0
   Fix Version/s: (was: 1.4.0)

> PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
> --
>
> Key: SPARK-6720
> URL: https://issues.apache.org/jira/browse/SPARK-6720
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Kai Sasaki
>Priority: Minor
>
> Implement correct normL1 and normL2 test.
> continuation: https://github.com/apache/spark/pull/5359



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Affects Version/s: (was: 1.3.0)
   1.4.0

> PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
> --
>
> Key: SPARK-6720
> URL: https://issues.apache.org/jira/browse/SPARK-6720
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Kai Sasaki
>Priority: Minor
>
> Implement correct normL1 and normL2 test.
> continuation: https://github.com/apache/spark/pull/5359



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Issue Type: Improvement  (was: Bug)

> PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
> --
>
> Key: SPARK-6720
> URL: https://issues.apache.org/jira/browse/SPARK-6720
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Kai Sasaki
>Priority: Minor
>
> Implement correct normL1 and normL2 test.
> continuation: https://github.com/apache/spark/pull/5359



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-06 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481874#comment-14481874
 ] 

Burak Yavuz commented on SPARK-6407:


I actually worked on this over the weekend for fun and have a streaming, 
gradient-descent-based matrix factorization model implemented here: 
https://github.com/brkyvz/streaming-matrix-factorization

It is a very naive implementation, but it might be something to work on top of. 
I will publish a Spark Package for it as soon as I get the tests in. The model 
it uses for predicting ratings for user `u` and product `p` is:
{code}
r = U(u) * P^T(p) + bu(u) + bp(p) + mu
{code}
where U(u) is the u'th row of the user matrix, P(p) is the p'th row of the 
product matrix, bu(u) is the u'th element of the user bias vector, bp(p) is the 
p'th element of the product bias vector, and mu is the global average.
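
For concreteness, a small self-contained sketch of that prediction rule (illustrative only, not the API of the linked project):

{code}
// r = U(u) * P^T(p) + bu(u) + bp(p) + mu, for a single (user, product) pair.
def predictRating(
    userRow: Array[Double],     // U(u): u'th row of the user factor matrix
    productRow: Array[Double],  // P(p): p'th row of the product factor matrix
    userBias: Double,           // bu(u)
    productBias: Double,        // bp(p)
    globalAverage: Double       // mu
  ): Double = {
  val dot = userRow.zip(productRow).map { case (a, b) => a * b }.sum
  dot + userBias + productBias + globalAverage
}
{code}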

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6713) Iterators in columnSimilarities to allow flatMap spill

2015-04-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6713:
-
Assignee: Reza Zadeh

> Iterators in columnSimilarities to allow flatMap spill
> --
>
> Key: SPARK-6713
> URL: https://issues.apache.org/jira/browse/SPARK-6713
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
>Assignee: Reza Zadeh
> Fix For: 1.4.0
>
>
> We should use Iterators in columnSimilarities to allow mapPartitionsWithIndex 
> to spill to disk. This matters for a dense and large column: this way Spark 
> can spill the pairs onto disk instead of all the pairs being built in memory 
> before they are handed to Spark.
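
A minimal, self-contained sketch of the idea (not the actual columnSimilarities code): returning a lazy Iterator from mapPartitionsWithIndex lets Spark spill the emitted pairs instead of requiring them all to be materialized up front.

{code}
// Illustrative sketch only, runnable in local mode.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("iterator-spill-sketch").setMaster("local[*]"))
val rows = sc.parallelize(1 to 1000, 4)

val pairs = rows.mapPartitionsWithIndex { (partitionIndex, iter) =>
  // flatMap over an Iterator stays lazy: pairs are produced on demand and can
  // be spilled by Spark rather than accumulated in a local buffer first.
  iter.flatMap(i => Iterator((partitionIndex, i), (partitionIndex, i * 2)))
}
println(pairs.count())
{code}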



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6713) Iterators in columnSimilarities to allow flatMap spill

2015-04-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6713.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5364
[https://github.com/apache/spark/pull/5364]

> Iterators in columnSimilarities to allow flatMap spill
> --
>
> Key: SPARK-6713
> URL: https://issues.apache.org/jira/browse/SPARK-6713
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
> Fix For: 1.4.0
>
>
> We should use Iterators in columnSimilarities to allow mapPartitionsWithIndex 
> to spill to disk. This matters for a dense and large column: this way Spark 
> can spill the pairs onto disk instead of all the pairs being built in memory 
> before they are handed to Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6711) Support parallelized online matrix factorization for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-6711.

Resolution: Duplicate

> Support parallelized online matrix factorization for Collaborative Filtering 
> -
>
> Key: SPARK-6711
> URL: https://issues.apache.org/jira/browse/SPARK-6711
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chunnan Yao
>   Original Estimate: 840h
>  Remaining Estimate: 840h
>
> Online Collaborative Filtering (CF) has been widely used and studied. 
> Re-training a CF model from scratch every time new data comes in is very 
> inefficient 
> (http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model).
>  However, in the Spark community we see little discussion about collaborative 
> filtering on streaming data. Given streaming k-means, streaming logistic 
> regression, and the ongoing incremental model training of the Naive Bayes 
> classifier (SPARK-4144), we think it is worthwhile to consider streaming 
> Collaborative Filtering support in MLlib. 
> We have been considering this issue during the past week. We plan to follow 
> this paper
> (https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on 
> SGD instead of ALS, which is easier to handle on streaming data. 
> Fortunately, the authors of this paper have implemented their algorithm as a 
> GitHub project based on Storm:
> https://github.com/MrChrisJohnson/CollabStream



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481711#comment-14481711
 ] 

Xiangrui Meng commented on SPARK-6407:
--

Attached the comment from Chunnan Yao in SPARK-6711:

Online Collaborative Filtering (CF) has been widely used and studied. 
Re-training a CF model from scratch every time new data comes in is very 
inefficient 
(http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model).
 However, in the Spark community we see little discussion about collaborative 
filtering on streaming data. Given streaming k-means, streaming logistic 
regression, and the ongoing incremental model training of the Naive Bayes 
classifier (SPARK-4144), we think it is worthwhile to consider streaming 
Collaborative Filtering support in MLlib. 

We have been considering this issue during the past week. We plan to follow 
this paper
(https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on 
SGD instead of ALS, which is easier to handle on streaming data. 

Fortunately, the authors of this paper have implemented their algorithm as a 
GitHub project based on Storm:
https://github.com/MrChrisJohnson/CollabStream

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481639#comment-14481639
 ] 

Apache Spark commented on SPARK-6606:
-

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/4145

> Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd 
> object.
> -
>
> Key: SPARK-6606
> URL: https://issues.apache.org/jira/browse/SPARK-6606
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: SuYan
>
> 1. Using code like the example below, we found that the accumulator is deserialized twice.
> first: 
> {code}
> task = ser.deserialize[Task[Any]](taskBytes, 
> Thread.currentThread.getContextClassLoader)
> {code}
> second:
> {code}
> val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
>   ByteBuffer.wrap(taskBinary.value), 
> Thread.currentThread.getContextClassLoader)
> {code}
> The first deserialization is not what is expected, because a ResultTask or 
> ShuffleMapTask will have a partition object.
> In the class 
> {code}
> CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: 
> Partitioner)
> {code}, the CoGroupPartition may contain a CoGroupSplitDep:
> {code}
> NarrowCoGroupSplitDep(
> rdd: RDD[_],
> splitIndex: Int,
> var split: Partition
>   ) extends CoGroupSplitDep {
> {code}
> in that *NarrowCoGroupSplitDep*, it will bring into rdd object, which result 
> into the first deserialized.
> example:
> {code}
>val acc1 = sc.accumulator(0, "test1")
> val acc2 = sc.accumulator(0, "test2")
> val rdd1 = sc.parallelize((1 to 10).toSeq, 3)
> val rdd2 = sc.parallelize((1 to 10).toSeq, 3)
> val combine1 = rdd1.map { case a => (a, 1)}.combineByKey(a => {
>   acc1 += 1
>   a
> }, (a: Int, b: Int) => {
>   a + b
> },
>   (a: Int, b: Int) => {
> a + b
>   }, new HashPartitioner(3), mapSideCombine = false)
> val combine2 = rdd2.map { case a => (a, 1)}.combineByKey(
>   a => {
> acc2 += 1
> a
>   },
>   (a: Int, b: Int) => {
> a + b
>   },
>   (a: Int, b: Int) => {
> a + b
>   }, new HashPartitioner(3), mapSideCombine = false)
> combine1.cogroup(combine2, new HashPartitioner(3)).count()
> {code}
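
To make the mechanism easier to see outside of Spark, the following is a small, 
self-contained illustration with simplified stand-in classes (these are not the real 
Spark types): a partition that keeps a reference to its parent "RDD" drags the parent, 
and anything the parent captures, into serialization.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hedged illustration only. PartitionLike plays the role of CoGroupPartition /
// NarrowCoGroupSplitDep, and ParentLike plays the role of the captured RDD: because
// the partition references its parent, serializing the partition also serializes the
// parent and everything it holds (in the real code path, that includes accumulators).
object PartitionCaptureDemo {
  class ParentLike(val payload: String) extends Serializable
  class PartitionLike(val parent: ParentLike, val index: Int) extends Serializable

  private def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val withoutParent = serialize(new PartitionLike(null, 0)).length
    val withParent = serialize(new PartitionLike(new ParentLike("x" * 10000), 0)).length
    println(s"serialized size without parent: $withoutParent bytes, with parent: $withParent bytes")
  }
}
{code}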



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6692) Add an option for client to kill AM when it is killed

2015-04-06 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated SPARK-6692:
-
Summary: Add an option for client to kill AM when it is killed  (was: Make 
it possible to kill AM in YARN cluster mode when the client is terminated)

> Add an option for client to kill AM when it is killed
> -
>
> Key: SPARK-6692
> URL: https://issues.apache.org/jira/browse/SPARK-6692
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
>  Labels: yarn
>
> I understand that the yarn-cluster mode is designed for a fire-and-forget 
> model; therefore, terminating the yarn client doesn't kill the AM.
> However, it is very common for users to submit Spark jobs via a job scheduler 
> (e.g. Apache Oozie) or a remote job server (e.g. Netflix Genie), where it is 
> expected that killing the yarn client will terminate the AM.
> It is true that the yarn-client mode can be used in such cases. But the yarn 
> client sometimes needs a lot of heap memory for big jobs when it runs in 
> yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs 
> because the AM can be given arbitrary heap memory, unlike the yarn client. So 
> it would be very useful to make it possible to kill the AM even in 
> yarn-cluster mode.
> In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon 
> as they're accepted (but not yet running). Although they're eventually shut 
> down after the AM timeout, it would be nice if the AM could be killed 
> immediately in such cases too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed

2015-04-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6222:
---
Fix Version/s: 1.4.0
   1.3.1

> [STREAMING] All data may not be recovered from WAL when driver is killed
> 
>
> Key: SPARK-6222
> URL: https://issues.apache.org/jira/browse/SPARK-6222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Hari Shreedharan
>Priority: Blocker
> Fix For: 1.3.1, 1.4.0
>
> Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch
>
>
> When testing for our next release, our internal tests written by [~wypoon] 
> caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs 
> FlumePolling stream to read data from Flume, then kills the Application 
> Master. Once YARN restarts it, the test waits until no more data is to be 
> written and verifies the original against the data on HDFS. This was passing 
> in 1.2.0, but is failing now.
> Since the test ties into Cloudera's internal infrastructure and build 
> process, it cannot be directly run on an Apache build. But I have been 
> working on isolating the commit that may have caused the regression. I have 
> confirmed that it was caused by SPARK-5147 (PR # 
> [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several 
> times using the test and the failure is consistently reproducible. 
> To re-confirm, I reverted just this one commit (and Clock consolidation one 
> to avoid conflicts), and the issue was no longer reproducible.
> Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0
> /cc [~tdas], [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481534#comment-14481534
 ] 

Davies Liu commented on SPARK-6700:
---

There is one failure here: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2036/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/run_Python_application_in_yarn_cluster_mode/

and here: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2025/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/run_Python_application_in_yarn_cluster_mode/

Is it related to hadoop2.3?

> flaky test: run Python application in yarn-cluster mode 
> 
>
> Key: SPARK-6700
> URL: https://issues.apache.org/jira/browse/SPARK-6700
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Davies Liu
>Assignee: Lianhui Wang
>Priority: Critical
>  Labels: test, yarn
>
> org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
> yarn-cluster mode
> Failing for the past 1 build (Since Failed#2025 )
> Took 12 sec.
> Error Message
> {code}
> Process 
> List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
>  --master, yarn-cluster, --num-executors, 1, --properties-file, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
>  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
>  exited with code 1
> Stacktrace
> sbt.ForkMain$ForkError: Process 
> List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
>  --master, yarn-cluster, --num-executors, 1, --properties-file, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
>  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
> /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
>  exited with code 1
>   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
>   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scal

[jira] [Updated] (SPARK-6721) IllegalStateException

2015-04-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Rodríguez Trejo updated SPARK-6721:

Description: 
I get the following exception when using saveAsNewAPIHadoopFile:
{code}
15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
10.0.2.15): java.lang.IllegalStateException: open
at org.bson.util.Assertions.isTrue(Assertions.java:36)
at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
at com.mongodb.DBCollection.insert(DBCollection.java:161)
at com.mongodb.DBCollection.insert(DBCollection.java:107)
at com.mongodb.DBCollection.save(DBCollection.java:1049)
at com.mongodb.DBCollection.save(DBCollection.java:1014)
at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Before Spark 1.3.0 this would result in the application crashing, but now the 
data just remains unprocessed.

There is no "close" instruction at any part of the code.

  was:
I get the following exception when using saveAsNewAPIHadoopFile:
bq. 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
10.0.2.15): java.lang.IllegalStateException: open
at org.bson.util.Assertions.isTrue(Assertions.java:36)
at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
at com.mongodb.DBCollection.insert(DBCollection.java:161)
at com.mongodb.DBCollection.insert(DBCollection.java:107)
at com.mongodb.DBCollection.save(DBCollection.java:1049)
at com.mongodb.DBCollection.save(DBCollection.java:1014)
at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Before Spark 1.3.0 this would result in the application crashing, but now the 
data just remains unprocessed.

There is no "close" instruction at any part of the code.


> IllegalStateException
> -
>
> Key: SPARK-6721
> URL: https://issues.apache.org/jira/browse/SPARK-6721
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
>Reporter: Luis Rodríguez Trejo
>  Labels: MongoDB, java.lang.IllegalStateexception, 
> saveAsNewAPIHadoopFile
>
> I get the following exception when using saveAsNewAPIHadoopFile:
> {code}
> 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
> 10.0.2.15): java.lang.IllegalStateException: open
> at org.bson.util.Assertions.isTrue(Assertions.java:36)
> at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
> at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
> at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
> at com.mongodb.DBCollection.insert(DBCollection.java:161)
> at com.mongodb.DBCollection.insert(DBCollection.java:107)
> at com.mongodb.DBCollection.save(DBCollection.java:1049)
> at com.mongodb.DBCollection.save(DBCollection.java:1014)
> at 
> com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at 
> java.util.concurrent.ThreadPoolExecutor.r

[jira] [Created] (SPARK-6721) IllegalStateException

2015-04-06 Thread JIRA
Luis Rodríguez Trejo created SPARK-6721:
---

 Summary: IllegalStateException
 Key: SPARK-6721
 URL: https://issues.apache.org/jira/browse/SPARK-6721
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.3.0, 1.2.1, 1.2.0
 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
Reporter: Luis Rodríguez Trejo


I get the following exception when using saveAsNewAPIHadoopFile:
bq. 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
10.0.2.15): java.lang.IllegalStateException: open
at org.bson.util.Assertions.isTrue(Assertions.java:36)
at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
at com.mongodb.DBCollection.insert(DBCollection.java:161)
at com.mongodb.DBCollection.insert(DBCollection.java:107)
at com.mongodb.DBCollection.save(DBCollection.java:1049)
at com.mongodb.DBCollection.save(DBCollection.java:1014)
at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Before Spark 1.3.0 this would result in the application crashing, but now the 
data just remains unprocessed.

There is no "close" instruction at any part of the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2015-04-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481464#comment-14481464
 ] 

Joseph K. Bradley commented on SPARK-3702:
--

Using Vector types is better since they store values as Array[Double], which 
avoids creating an object for every value.  If you're thinking about feature 
names/metadata, the Metadata capability in DataFrame will be able to handle 
metadata for each feature in Vector columns.
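
As a quick, hedged illustration of the storage point (assuming only spark-mllib on the 
classpath; the object and variable names are just for the example):

{code}
import org.apache.spark.mllib.linalg.Vectors

// A dense MLlib Vector wraps a single Array[Double], so a row of d features costs one
// array object rather than d boxed java.lang.Double instances.
object VectorStorageSketch {
  def main(args: Array[String]): Unit = {
    val features = Vectors.dense(Array(1.0, 0.0, 3.5))
    println(features.size)                    // 3
    println(features.toArray.mkString(", "))  // 1.0, 0.0, 3.5
  }
}
{code}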

> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481455#comment-14481455
 ] 

Joseph K. Bradley commented on SPARK-6682:
--

As you're suggesting, a wrapper mechanism like that won't be an acceptable solution, 
since it would be a confusing, difficult-to-document API.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]
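
For illustration only, a minimal sketch of what the proposed deprecation could look 
like, using simplified stand-in types rather than the actual MLlib classes or 
signatures (the deprecation message and version string are placeholders):

{code}
// Hedged sketch: the old static train() stays for API stability but is marked
// deprecated and simply delegates to the builder-style API.
object BuilderPatternSketch {
  class NaiveBayes {
    private var lambda: Double = 1.0
    def setLambda(value: Double): this.type = { lambda = value; this }
    def train(data: Seq[(Double, Array[Double])]): String = s"NaiveBayesModel(lambda=$lambda)"
  }

  object NaiveBayes {
    @deprecated("Use new NaiveBayes().setLambda(...).train(...) instead", "1.4.0")
    def train(data: Seq[(Double, Array[Double])], lambda: Double): String =
      new NaiveBayes().setLambda(lambda).train(data)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((0.0, Array(1.0, 2.0)), (1.0, Array(0.5, 0.1)))
    println(new NaiveBayes().setLambda(0.1).train(data))  // builder-style call
  }
}
{code}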



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark

2015-04-06 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481395#comment-14481395
 ] 

Manoj Kumar commented on SPARK-6577:


Let us please take the discussion to the Pull Request. Thanks!

> SparseMatrix should be supported in PySpark
> ---
>
> Key: SPARK-6577
> URL: https://issues.apache.org/jira/browse/SPARK-6577
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Manoj Kumar
>Assignee: Manoj Kumar
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-5261:
---
Description: 
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
"s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res3: Float = 1661285.2 


 val word2Vec = new Word2Vec()
 word2Vec.
setVectorSize(100).
setSeed(42L).
setNumIterations(5).
setNumPartitions(1)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
 0.13889
{code}

  was:
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
"s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res3: Float = 1661285.2 
{code}
The average absolute value of the word's vector representation is 60731.8

{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(1)
{code}
The average  absolute value of the word's vector representation is 0.13889


> In some cases ,The value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> Get data:
> {code:none}
> normalize_text() {
>   awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
> "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
>   -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ 
> ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
>   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
> 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
>   -e 's/«/ /g' | tr 0-9 " "
> }
> wget 
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> gzip -d news.2013.en.shuffled.gz
> normalize_text < news.2013.en.shuffled > data.txt
> {code}
> {code:none}
> import org.apache.spark.mllib.feature.Word2Vec
> val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36).
>   setMinCount(100)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
> res1: Float = 375059.84
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(4

[jira] [Updated] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-5261:
---
Description: 
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
"s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res3: Float = 1661285.2 


 val word2Vec = new Word2Vec()
 word2Vec.
setVectorSize(100).
setSeed(42L).
setNumIterations(5).
setNumPartitions(1)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
 0.13889
{code}

  was:
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
"s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res3: Float = 1661285.2 


 val word2Vec = new Word2Vec()
 word2Vec.
setVectorSize(100).
setSeed(42L).
setNumIterations(5).
setNumPartitions(1)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
 0.13889
{code}


> In some cases ,The value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> Get data:
> {code:none}
> normalize_text() {
>   awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
> "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
>   -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ 
> ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
>   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
> 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
>   -e 's/«/ /g' | tr 0-9 " "
> }
> wget 
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> gzip -d news.2013.en.shuffled.gz
> normalize_text < news.2013.en.shuffled > data.txt
> {code}
> {code:none}
> import org.apache.spark.mllib.feature.Word2Vec
> val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36).
>   setMinCount(5)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
> res1: Float = 375059.84
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitio

[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481378#comment-14481378
 ] 

Guoqiang Li commented on SPARK-5261:


Sorry, the latter one's minCount is 100.

> In some cases ,The value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> Get data:
> {code:none}
> normalize_text() {
>   awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
> "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
>   -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ 
> ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
>   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
> 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
>   -e 's/«/ /g' | tr 0-9 " "
> }
> wget 
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> gzip -d news.2013.en.shuffled.gz
> normalize_text < news.2013.en.shuffled > data.txt
> {code}
> {code:none}
> import org.apache.spark.mllib.feature.Word2Vec
> val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36).
>   setMinCount(100)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
> res1: Float = 375059.84
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36).
>   setMinCount(5)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
> res3: Float = 1661285.2 
> {code}
> The average absolute value of the word's vector representation is 60731.8
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(1)
> {code}
> The average  absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-5261:
---
Description: 
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
"s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res3: Float = 1661285.2 
{code}
The average absolute value of the word's vector representation is 60731.8

{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(1)
{code}
The average  absolute value of the word's vector representation is 0.13889

  was:
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
"s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
=> 
res3: Float = 1661285.2 
{code}
The average absolute value of the word's vector representation is 60731.8

{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(1)
{code}
The average  absolute value of the word's vector representation is 0.13889


> In some cases ,The value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> Get data:
> {code:none}
> normalize_text() {
>   awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
> "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
>   -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ 
> ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
>   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
> 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
>   -e 's/«/ /g' | tr 0-9 " "
> }
> wget 
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> gzip -d news.2013.en.shuffled.gz
> normalize_text < news.2013.en.shuffled > data.txt
> {code}
> {code:none}
> import org.apache.spark.mllib.feature.Word2Vec
> val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36).
>   setMinCount(100)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
> res1: Float = 375059.84
> val word2Vec = new Word2Vec()
> wor

[jira] [Comment Edited] (SPARK-3702) Standardize MLlib classes for learners, models

2015-04-06 Thread Peter Rudenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481342#comment-14481342
 ] 

Peter Rudenko edited comment on SPARK-3702 at 4/6/15 4:06 PM:
--

For tree-based algorithms, I'm curious whether there would be a performance benefit 
(assuming a reimplementation of the decision tree) from passing DataFrame columns 
directly rather than a single column of vector type. E.g.:

{code}
class GBT extends Estimator with HasInputCols

val model = new GBT().setInputCols("col1", "col2", "col3", ...)
{code}

and split the dataset using the DataFrame API.




was (Author: prudenko):
For trees based algorithms curious whether there would be performance benefit 
by passing directly Dataframe columns rather than single column with vector 
type. E.g.:

{code}
class GBT extends Estimator with HasInputCols

val model = new GBT.setInputCols("col1","col2", "col3, ...)
{code}





> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2015-04-06 Thread Peter Rudenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481342#comment-14481342
 ] 

Peter Rudenko commented on SPARK-3702:
--

For tree-based algorithms, I'm curious whether there would be a performance benefit 
from passing DataFrame columns directly rather than a single column of vector type. 
E.g.:

{code}
class GBT extends Estimator with HasInputCols

val model = new GBT().setInputCols("col1", "col2", "col3", ...)
{code}





> Standardize MLlib classes for learners, models
> --
>
> Key: SPARK-3702
> URL: https://issues.apache.org/jira/browse/SPARK-3702
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
> of subtasks).  See the "requires" links below for subtasks.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-2960) Spark executables fail to start via symlinks

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-2960:
--

Not sure what happened there -- probably my fault in any event -- but this one 
is the ticket being duplicated, rather than the duplicate. There was a PR, but it 
wasn't accepted, so that shouldn't resolve it either. As far as I know, the issue is 
still valid.

> Spark executables fail to start via symlinks
> 
>
> Key: SPARK-2960
> URL: https://issues.apache.org/jira/browse/SPARK-2960
> Project: Spark
>  Issue Type: Bug
>Reporter: Shay Rojansky
>Priority: Minor
>
> The current scripts (e.g. pyspark) fail to run when they are executed via 
> symlinks. A common Linux scenario would be to have Spark installed somewhere 
> (e.g. /opt) and have a symlink to it in /usr/bin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6205) UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6205:
-
Fix Version/s: 1.3.2

> UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError
> ---
>
> Key: SPARK-6205
> URL: https://issues.apache.org/jira/browse/SPARK-6205
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.3.2, 1.4.0
>
>
> {code}
> mvn -DskipTests -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 clean 
> install
> mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 test 
> -DwildcardSuites=org.apache.spark.ui.UISeleniumSuite -Dtest=none -pl core/ 
> {code}
> will produce:
> {code}
> UISeleniumSuite:
> *** RUN ABORTED ***
>   java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
>   ...
> {code}
> It doesn't seem to happen without the various profiles set above.
> The fix is simple, although it sounds weird: Selenium's dependency on 
> {{xml-apis:xml-apis}} must be manually included in core's test dependencies. 
> This probably has something to do with Hadoop 2 vs 1 dependency changes and 
> the fact that Maven test deps aren't transitive, AFAIK.
> PR coming...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream

2015-04-06 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481266#comment-14481266
 ] 

Cody Koeninger commented on SPARK-6431:
---

I think this got mis-diagnosed on the mailing list, sorry for the confusion.

The only way I've been able to reproduce that exception is by trying to start a 
stream for a topic that doesn't exist at all.  Alberto, did you actually run 
kafka-topics.sh --create before starting the job, or in some other way create 
the topic?  Pretty sure what happened here is that your topic didn't exist the 
first time you ran the job.  Your brokers were set to auto-create topics, so it 
did exist the next time you ran the job.  Putting a message into the topic 
didn't have anything to do with it.

Here's why I think that's what happened. The following console session is an 
example, where the topic "empty" existed prior to starting the console but had no 
messages, the topic "hasonemessage" existed and had one message in it, and the topic 
"doesntexistyet" didn't exist at the beginning of the console session.

The metadata apis return the same info for existing-but-empty topics as they do 
for topics with messages in them:

scala> kc.getPartitions(Set("empty")).right
res0: 
scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]]
 = RightProjection(Right(
Set([empty,0], [empty,1])))

scala> kc.getPartitions(Set("hasonemessage")).right
res1: 
scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]]
 = RightProjection(Right(Set([hasonemessage,0], [hasonemessage,1])))


Leader offsets are both 0 for the empty topic, as you'd expect:

scala> kc.getLatestLeaderOffsets(kc.getPartitions(Set("empty")).right.get)
res5: 
Either[org.apache.spark.streaming.kafka.KafkaCluster.Err,Map[kafka.common.TopicAndPartition,org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset]]
 = Right(Map([empty,1] -> LeaderOffset(localhost,9094,0), [empty,0] -> 
LeaderOffset(localhost,9093,0)))

And one of the leader offsets is 1 for the topic with one message:

scala> 
kc.getLatestLeaderOffsets(kc.getPartitions(Set("hasonemessage")).right.get)
res6: 
Either[org.apache.spark.streaming.kafka.KafkaCluster.Err,Map[kafka.common.TopicAndPartition,org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset]]
 = Right(Map([hasonemessage,0] -> LeaderOffset(localhost,9092,1), 
[hasonemessage,1] -> LeaderOffset(localhost,9093,0)))


The first time a metadata request is made against the non-existing topic, it 
returns empty:

kc.getPartitions(Set("doesntexistyet")).right
res2: 
scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]]
 = RightProjection(Right(Set()))


But if your brokers are configured with auto.create.topics.enable set to true, 
that metadata request alone is enough to trigger creation of the topic.  
Requesting it again shows that the topic has been created:

scala> kc.getPartitions(Set("doesntexistyet")).right
res3: 
scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]]
 = RightProjection(Right(Set([doesntexistyet,0], [doesntexistyet,1])))


If you don't think that explains what happened, please let me know if you have 
a way of reproducing that exception against an existing-but-empty topic, 
because I can't.

As far as what to do about this, my instinct is to just improve the error 
handling for the getPartitions call. If the topic doesn't exist yet, it 
shouldn't return an empty set; it should return an error.
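
A hedged sketch of that error handling, in the same spark-shell style as the session 
above (the helper name and message are illustrative, not actual Spark code; it relies 
only on the KafkaCluster.getPartitions call already shown):

{code}
import kafka.common.TopicAndPartition
import org.apache.spark.SparkException

// Sketch only: fail fast with a clear error when the metadata request returns no
// partitions (i.e. the topic does not exist yet), instead of continuing and hitting
// "Couldn't find leader offsets for Set()" later.
def partitionsOrFail(kc: org.apache.spark.streaming.kafka.KafkaCluster,
                     topics: Set[String]): Set[TopicAndPartition] = {
  val partitions = kc.getPartitions(topics).right.getOrElse(Set.empty[TopicAndPartition])
  if (partitions.isEmpty) {
    throw new SparkException(s"No partitions found for topics $topics; do the topics exist?")
  }
  partitions
}
{code}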


> Couldn't find leader offsets exception when creating KafkaDirectStream
> --
>
> Key: SPARK-6431
> URL: https://issues.apache.org/jira/browse/SPARK-6431
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Alberto
>
> When I try to create an InputDStream using the createDirectStream method of 
> the KafkaUtils class and the kafka topic does not have any messages yet am 
> getting the following error:
> org.apache.spark.SparkException: Couldn't find leader offsets for Set()
> org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't 
> find leader offsets for Set()
>   at 
> org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413)
> If I put a message in the topic before creating the DirectStream everything 
> works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks

2015-04-06 Thread Danil Mironov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481228#comment-14481228
 ] 

Danil Mironov commented on SPARK-2960:
--

This has now formed a loop of three tickets (SPARK-2960, SPARK-3482 and 
SPARK-4162), all three resolved as duplicates; two PRs (#1875 and #2386) are 
closed but not merged. Apparently this issue isn't progressing at all.

Is there anything that can be done to break through?

I could draft a new PR; can this ticket be re-opened?

> Spark executables fail to start via symlinks
> 
>
> Key: SPARK-2960
> URL: https://issues.apache.org/jira/browse/SPARK-2960
> Project: Spark
>  Issue Type: Bug
>Reporter: Shay Rojansky
>Priority: Minor
>
> The current scripts (e.g. pyspark) fail to run when they are executed via 
> symlinks. A common Linux scenario would be to have Spark installed somewhere 
> (e.g. /opt) and have a symlink to it in /usr/bin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2991) RDD transforms for scan and scanLeft

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2991:
---

Assignee: Erik Erlandson  (was: Apache Spark)

> RDD transforms for scan and scanLeft 
> -
>
> Key: SPARK-2991
> URL: https://issues.apache.org/jira/browse/SPARK-2991
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: features
>
> Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) 
> and scanLeft(z)(f) (sequential prefix scan)
> Discussion of a scanLeft implementation:
> http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/
> Discussion of scan:
> http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/
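
For intuition, a hedged sketch of the scanLeft idea using plain Scala collections as 
stand-ins for RDD partitions (this is not the implementation from the linked posts): 
run a prefix scan within each partition, then shift each partition's results by the 
accumulated totals of the partitions before it.

{code}
// Sketch only, specialized to prefix sums for clarity.
object ScanLeftSketch {
  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))

    // Per-"partition" prefix sums, each starting from the zero element.
    val local = partitions.map(_.scanLeft(0)(_ + _).tail)

    // Running totals of the partitions' final values; entry i is the offset for partition i.
    val offsets = local.map(_.lastOption.getOrElse(0)).scanLeft(0)(_ + _)

    // Combine: shift every partition's prefix sums by its partition offset.
    val result = local.zip(offsets).flatMap { case (xs, off) => xs.map(_ + off) }
    println(result)  // List(1, 3, 6, 10, 15, 21)
  }
}
{code}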



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2991) RDD transforms for scan and scanLeft

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2991:
---

Assignee: Apache Spark  (was: Erik Erlandson)

> RDD transforms for scan and scanLeft 
> -
>
> Key: SPARK-2991
> URL: https://issues.apache.org/jira/browse/SPARK-2991
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Apache Spark
>Priority: Minor
>  Labels: features
>
> Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) 
> and scanLeft(z)(f) (sequential prefix scan)
> Discussion of a scanLeft implementation:
> http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/
> Discussion of scan:
> http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6720:
---

Assignee: (was: Apache Spark)

> PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
> --
>
> Key: SPARK-6720
> URL: https://issues.apache.org/jira/browse/SPARK-6720
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Kai Sasaki
>Priority: Minor
> Fix For: 1.4.0
>
>
> Implement correct normL1 and normL2 test.
> continuation: https://github.com/apache/spark/pull/5359



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481176#comment-14481176
 ] 

Apache Spark commented on SPARK-6720:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/5374

> PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
> --
>
> Key: SPARK-6720
> URL: https://issues.apache.org/jira/browse/SPARK-6720
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Kai Sasaki
>Priority: Minor
> Fix For: 1.4.0
>
>
> Implement correct normL1 and normL2 test.
> continuation: https://github.com/apache/spark/pull/5359



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6720:
---

Assignee: Apache Spark

> PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
> --
>
> Key: SPARK-6720
> URL: https://issues.apache.org/jira/browse/SPARK-6720
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Kai Sasaki
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 1.4.0
>
>
> Implement correct normL1 and normL2 test.
> continuation: https://github.com/apache/spark/pull/5359



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-6720:
-

 Summary: PySpark MultivariateStatisticalSummary unit test for 
normL1 and normL2
 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
 Fix For: 1.4.0


Implement correct normL1 and normL2 test.

continuation: https://github.com/apache/spark/pull/5359
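
For context, normL1 and normL2 on MultivariateStatisticalSummary are the column-wise L1 and L2 norms, so the PySpark test should assert the same values the Scala API returns. A small Scala sketch with made-up data (illustration only, not the test itself; sc is the shell's SparkContext):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Toy input: two rows, two columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, -2.0),
  Vectors.dense(3.0, 4.0)))

val summary = Statistics.colStats(rows)
summary.normL1  // column-wise sum of absolute values: [4.0, 6.0]
summary.normL2  // column-wise Euclidean norm: [sqrt(10.0), sqrt(20.0)]
{code}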



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481132#comment-14481132
 ] 

Sean Owen commented on SPARK-5261:
--

In the new code you pasted, I don't see a difference between the two runs. Is 
the point that the result isn't deterministic even with a fixed seed? That it 
might be sensitive to the order in which it encounters the words?

> In some cases ,The value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> Get data:
> {code:none}
> normalize_text() {
>   awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
> "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
>   -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ 
> ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
>   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
> 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
>   -e 's/«/ /g' | tr 0-9 " "
> }
> wget 
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> gzip -d news.2013.en.shuffled.gz
> normalize_text < news.2013.en.shuffled > data.txt
> {code}
> {code:none}
> import org.apache.spark.mllib.feature.Word2Vec
> val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36).
>   setMinCount(5)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
> res1: Float = 375059.84
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36).
>   setMinCount(5)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
> res3: Float = 1661285.2 
> {code}
> The average absolute value of the word's vector representation is 60731.8
> {code}
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(1)
> {code}
> The average absolute value of the word's vector representation is 0.13889



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6687) In the hadoop 0.23 profile, hadoop pulls in an older version of netty which conflicts with akka's netty

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6687.
--
Resolution: Not A Problem

I'm not sure what the problem is here, so closing until there's any follow-up.

> In the hadoop 0.23 profile, hadoop pulls in an older version of netty which 
> conflicts with akka's netty 
> 
>
> Key: SPARK-6687
> URL: https://issues.apache.org/jira/browse/SPARK-6687
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Sai Nishanth Parepally
>
> Excerpt from mvn -Dverbose dependency:tree for spark-core; note the 
> org.jboss.netty:netty dependency:
> [INFO] |  |  +- 
> org.apache.hadoop:hadoop-mapreduce-client-app:jar:0.23.10:compile
> [INFO] |  |  |  +- 
> org.apache.hadoop:hadoop-mapreduce-client-common:jar:0.23.10:compile
> [INFO] |  |  |  |  +- 
> (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for 
> duplicate)
> [INFO] |  |  |  |  +- 
> (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted 
> for duplicate)
> [INFO] |  |  |  |  +- 
> org.apache.hadoop:hadoop-yarn-server-common:jar:0.23.10:compile
> [INFO] |  |  |  |  |  +- 
> (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for 
> duplicate)
> [INFO] |  |  |  |  |  +- (org.apache.zookeeper:zookeeper:jar:3.4.5:compile - 
> version managed from 3.4.2; omitted for duplicate)
> [INFO] |  |  |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version 
> managed from 1.6.1; omitted for duplicate)
> [INFO] |  |  |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - 
> version managed from 1.6.1; omitted for duplicate)
> [INFO] |  |  |  |  |  +- (org.jboss.netty:netty:jar:3.2.4.Final:compile - 
> omitted for duplicate)
> [INFO] |  |  |  |  |  +- 
> (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate)
> [INFO] |  |  |  |  |  +- (commons-io:commons-io:jar:2.1:compile - omitted for 
> duplicate)
> [INFO] |  |  |  |  |  +- (com.google.inject:guice:jar:3.0:compile - omitted 
> for duplicate)
> [INFO] |  |  |  |  |  +- 
> (com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.8:compile
>  - omitted for duplicate)
> [INFO] |  |  |  |  |  +- (com.sun.jersey:jersey-server:jar:1.8:compile - 
> omitted for duplicate)
> [INFO] |  |  |  |  |  \- 
> (com.sun.jersey.contribs:jersey-guice:jar:1.8:compile - omitted for duplicate)
> [INFO] |  |  |  |  +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - 
> omitted for duplicate)
> [INFO] |  |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version 
> managed from 1.6.1; omitted for duplicate)
> [INFO] |  |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version 
> managed from 1.6.1; omitted for duplicate)
> [INFO] |  |  |  |  +- (org.apache.hadoop:hadoop-hdfs:jar:1.23.10:compile - 
> omitted for duplicate)
> [INFO] |  |  |  |  \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - 
> omitted for duplicate)
> [INFO] |  |  |  +- 
> org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:0.23.10:compile
> [INFO] |  |  |  |  +- 
> (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted 
> for duplicate)
> [INFO] |  |  |  |  +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - 
> omitted for duplicate)
> [INFO] |  |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version 
> managed from 1.6.1; omitted for duplicate)
> [INFO] |  |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version 
> managed from 1.6.1; omitted for duplicate)
> [INFO] |  |  |  |  +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - 
> omitted for duplicate)
> [INFO] |  |  |  |  \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - 
> omitted for duplicate)
> [INFO] |  |  |  +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - 
> omitted for duplicate)
> [INFO] |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed 
> from 1.6.1; omitted for duplicate)
> [INFO] |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version 
> managed from 1.6.1; omitted for duplicate)
> [INFO] |  |  |  +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - 
> omitted for duplicate)
> [INFO] |  |  |  \- org.jboss.netty:netty:jar:3.2.4.Final:compile
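
For anyone who does hit this conflict, a hedged build sketch of one common workaround (illustrative coordinates, not a change made in Spark itself): exclude the legacy org.jboss.netty artifact that the hadoop 0.23 modules pull in transitively, so only the netty version akka expects remains on the classpath.

{code}
// build.sbt sketch: drop the transitive org.jboss.netty:netty:3.2.4.Final
// brought in by the hadoop 0.23 artifacts.
libraryDependencies += ("org.apache.hadoop" % "hadoop-client" % "0.23.10")
  .excludeAll(ExclusionRule(organization = "org.jboss.netty"))
{code}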



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6630.
--
Resolution: Won't Fix

The idea was good, but it probably can't be reconciled with binary compatibility 
at this point without significantly more change, so closing. If there's a 
particularly expensive computation we want to avoid, we can fix that call site 
directly by checking for the property's existence before computing and setting 
a new value.

> SparkConf.setIfMissing should only evaluate the assigned value if indeed 
> missing
> 
>
> Key: SPARK-6630
> URL: https://issues.apache.org/jira/browse/SPARK-6630
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Svend Vanderveken
>Priority: Minor
>
> The method setIfMissing() in SparkConf currently always evaluates the 
> right-hand side of the assignment, even when the value is not used. This 
> leads to unnecessary computation, as in:
> {code}
>   conf.setIfMissing("spark.driver.host", Utils.localHostName())
> {code}
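
To make the trade-off concrete, a minimal sketch (simplified, not SparkConf's actual code) of the two shapes discussed: the current by-value parameter is always evaluated, while a by-name parameter would defer evaluation but compiles to a different signature (the argument becomes a Function0), which is the binary-compatibility problem.

{code}
import scala.collection.mutable

// Illustration only; names and structure are simplified.
class Conf {
  private val settings = mutable.Map[String, String]()
  def contains(key: String): Boolean = settings.contains(key)
  def set(key: String, value: String): this.type = { settings(key) = value; this }

  // Current shape: `value` is computed even when the key is already set.
  def setIfMissing(key: String, value: String): this.type = {
    if (!contains(key)) set(key, value)
    this
  }

  // By-name variant: `value` is evaluated only when the key is missing,
  // but the bytecode signature changes (value becomes a Function0).
  def setIfMissingLazily(key: String, value: => String): this.type = {
    if (!contains(key)) set(key, value)
    this
  }
}

// Caller-side workaround along the lines suggested above:
// if (!conf.contains("spark.driver.host")) conf.set("spark.driver.host", Utils.localHostName())
{code}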



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


