[jira] [Created] (SPARK-8891) Calling aggregation expressions on null literals fails at runtime
Josh Rosen created SPARK-8891: - Summary: Calling aggregation expressions on null literals fails at runtime Key: SPARK-8891 URL: https://issues.apache.org/jira/browse/SPARK-8891 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.3.1, 1.4.1, 1.5.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Minor Queries that call aggregate expressions with null literals, such as {{select avg(null)}} or {{select sum(null)}} fail with various errors due to mishandling of the internal NullType type. For instance, with codegen disabled on a recent 1.5 master: {code} scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$cast(Cast.scala:407) at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:426) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:426) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:428) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:48) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:268) at org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:48) at org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:147) at org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:536) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:132) at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:125) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} When codegen is enabled, the resulting code fails to compile. The fix for this issue involves changes to Cast and Sum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
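For reference, the failure is reproducible straight from SQL; a minimal sketch, assuming a running SparkContext {{sc}} on one of the affected versions:

{code}
// Minimal reproduction sketch for the report above (assumes a plain
// SQLContext; `sc` is an existing SparkContext).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Both aggregates operate on a NullType literal. Per the report, with
// codegen disabled this throws the MatchError on NullType shown above,
// and with codegen enabled the generated code fails to compile.
sqlContext.sql("select avg(null)").collect()
sqlContext.sql("select sum(null)").collect()
{code}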
[jira] [Assigned] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8600: --- Assignee: Apache Spark (was: Yanbo Liang) Naive Bayes API for spark.ml Pipelines -- Key: SPARK-8600 URL: https://issues.apache.org/jira/browse/SPARK-8600 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Apache Spark Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the existing NaiveBayes implementation under spark.mllib package. Should also keep the parameter names consistent. The output columns could include both the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
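For context, a spark.ml NaiveBayes would be used through the Pipelines API roughly as below. This is an illustrative sketch only: the final class, parameter, and column names were settled in the pull request, so the names here ({{NaiveBayes}}, {{setSmoothing}}, {{training}}) are assumptions based on spark.ml conventions.

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes  // assumed class name

// Hypothetical usage sketch, not the final API.
val nb = new NaiveBayes()
  .setSmoothing(1.0)          // would mirror spark.mllib's lambda parameter
  .setFeaturesCol("features")
  .setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(nb))
// `training` is assumed to be a DataFrame with label and features columns;
// transform() on the fitted model would append prediction (and confidence) columns.
val model = pipeline.fit(training)
{code}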
[jira] [Commented] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618219#comment-14618219 ] Apache Spark commented on SPARK-8600: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7284 Naive Bayes API for spark.ml Pipelines -- Key: SPARK-8600 URL: https://issues.apache.org/jira/browse/SPARK-8600 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Yanbo Liang Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the existing NaiveBayes implementation under spark.mllib package. Should also keep the parameter names consistent. The output columns could include both the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8600: --- Assignee: Yanbo Liang (was: Apache Spark) Naive Bayes API for spark.ml Pipelines -- Key: SPARK-8600 URL: https://issues.apache.org/jira/browse/SPARK-8600 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Yanbo Liang Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the existing NaiveBayes implementation under spark.mllib package. Should also keep the parameter names consistent. The output columns could include both the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7050) Fix Python Kafka test assembly jar not found issue under Maven build
[ https://issues.apache.org/jira/browse/SPARK-7050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7050. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 5632 [https://github.com/apache/spark/pull/5632] Fix Python Kafka test assembly jar not found issue under Maven build Key: SPARK-7050 URL: https://issues.apache.org/jira/browse/SPARK-7050 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.1 Reporter: Saisai Shao Priority: Minor Fix For: 1.5.0 The behavior of {{mvn package}} and {{sbt kafka-assembly/assembly}} under the kafka-assembly module is different: sbt will generate an assembly jar under target/scala-<version>/, while mvn generates this jar under target/, which makes the Python Kafka streaming unit tests fail to find the related jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8894) Example code errors in SparkR documentation
Sun Rui created SPARK-8894: -- Summary: Example code errors in SparkR documentation Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui There are errors in SparkR-related documentation. 1. in https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. in https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop
Aniket Bhatnagar created SPARK-8895: --- Summary: MetricsSystem.removeSource not called in StreamingContext.stop Key: SPARK-8895 URL: https://issues.apache.org/jira/browse/SPARK-8895 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar Priority: Minor StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
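The fix implied by the title is to deregister the source on shutdown. A sketch of the direction only, assuming the source object is kept in a field; the exact structure of StreamingContext.stop is an assumption:

{code}
// Sketch only: register the source in the constructor, keep the reference,
// and remove it in stop() so that a new StreamingContext with the same
// application name can re-register metrics in the same JVM.
private val streamingSource = new StreamingSource(this)
env.metricsSystem.registerSource(streamingSource)

def stop(): Unit = {
  env.metricsSystem.removeSource(streamingSource)  // the missing call
  // ... existing shutdown logic ...
}
{code}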
[jira] [Created] (SPARK-8896) StreamingSource should choose a unique name
Aniket Bhatnagar created SPARK-8896: --- Summary: StreamingSource should choose a unique name Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
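One way to satisfy the title's requirement is to derive the source name from a per-JVM counter. A self-contained sketch of the naming scheme only; the suffix format is an assumption, not the committed fix:

{code}
import java.util.concurrent.atomic.AtomicInteger

// Illustrative naming helper: the first StreamingContext keeps the old
// name, later ones get a numeric suffix, so MetricRegistry.register
// never sees two sources with the same name.
object StreamingSourceNaming {
  private val counter = new AtomicInteger(0)

  def uniqueSourceName(appName: String): String = {
    val n = counter.getAndIncrement()
    if (n == 0) s"$appName.StreamingMetrics"
    else s"$appName.StreamingMetrics.$n"
  }
}
{code}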
[jira] [Updated] (SPARK-7050) Fix Python Kafka test assembly jar not found issue under Maven build
[ https://issues.apache.org/jira/browse/SPARK-7050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7050: - Assignee: Saisai Shao Fix Python Kafka test assembly jar not found issue under Maven build Key: SPARK-7050 URL: https://issues.apache.org/jira/browse/SPARK-7050 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.1 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Minor Fix For: 1.5.0 The behavior of {{mvn package}} and {{sbt kafka-assembly/assembly}} under the kafka-assembly module is different: sbt will generate an assembly jar under target/scala-<version>/, while mvn generates this jar under target/, which makes the Python Kafka streaming unit tests fail to find the related jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8894: --- Assignee: (was: Apache Spark) Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui There are errors in SparkR-related documentation. 1. in https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. in https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618508#comment-14618508 ] Apache Spark commented on SPARK-8894: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/7287 Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui There are errors in SparkR-related documentation. 1. in https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. in https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8894: --- Assignee: Apache Spark Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Assignee: Apache Spark There are errors in SparkR-related documentation. 1. in https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. in https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8896: Description: If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] was: If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. {quote} [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported.
[ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop
[ https://issues.apache.org/jira/browse/SPARK-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8895: Description: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: ?? [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] ?? was: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: {{ [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] }} MetricsSystem.removeSource not called in StreamingContext.stop -- Key: SPARK-8895 URL: https://issues.apache.org/jira/browse/SPARK-8895 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar Priority: Minor StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: ??
[info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] ?? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8896: Description: If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. {quote} [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} was: If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported.
{quote} [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8068) Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618403#comment-14618403 ] Apache Spark commented on SPARK-8068: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7286 Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib -- Key: SPARK-8068 URL: https://issues.apache.org/jira/browse/SPARK-8068 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Priority: Minor There is no confusionMatrix method at class MulticlassMetrics in pyspark/mllib. This method is actually implemented in scala mllib. To achieve this, we just need to add a function call to the corresponding one in scala mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8068) Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8068: --- Assignee: Apache Spark Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib -- Key: SPARK-8068 URL: https://issues.apache.org/jira/browse/SPARK-8068 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Assignee: Apache Spark Priority: Minor There is no confusionMatrix method at class MulticlassMetrics in pyspark/mllib. This method is actually implemented in scala mllib. To achieve this, we just need to add a function call to the corresponding one in scala mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8895) MetricsSystem.removeSource not called in StreamingContext.stop
[ https://issues.apache.org/jira/browse/SPARK-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar updated SPARK-8895: Description: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: {quote} [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} was: StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource. Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] MetricsSystem.removeSource not called in StreamingContext.stop -- Key: SPARK-8895 URL: https://issues.apache.org/jira/browse/SPARK-8895 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar Priority: Minor StreamingContext calls env.metricsSystem.registerSource during its construction but does not call env.metricsSystem.removeSource.
Therefore, if a user attempts to restart a Streaming job in the same JVM by creating a new instance of StreamingContext with the same application name, it results in exceptions like the following in the log: {quote} [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named .StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8881) Scheduling fails if num_executors < num_workers
[ https://issues.apache.org/jira/browse/SPARK-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618313#comment-14618313 ] Sean Owen commented on SPARK-8881: -- Yes, the punchline is that each worker is asked for 48/4 = 12 cores, but 12 is less than the 16 cores each executor needs, so for every worker, 0 executors are allocated. Grabbing cores in chunks of 16 in this case works, as does only considering 3 workers to allocate 3 executors, since the problem is that it never makes sense to try allocating N executors over M > N workers. Scheduling fails if num_executors < num_workers --- Key: SPARK-8881 URL: https://issues.apache.org/jira/browse/SPARK-8881 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0, 1.5.0 Reporter: Nishkam Ravi Current scheduling algorithm (in Master.scala) has two issues: 1. cores are allocated one at a time instead of spark.executor.cores at a time 2. when spark.cores.max/spark.executor.cores < num_workers, executors are not launched and the app hangs (due to 1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
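The arithmetic in the comment above, together with the chunked allocation it suggests, in a small self-contained sketch (illustrative only, not the code in Master.scala):

{code}
val coresMax = 48        // spark.cores.max
val executorCores = 16   // spark.executor.cores
val numWorkers = 4

// One-core-at-a-time allocation spreads 48 cores evenly, 12 per worker,
// and 12 / 16 = 0 executors fit on every worker, so nothing launches.
val perWorker = coresMax / numWorkers               // 12
val executorsPerWorker = perWorker / executorCores  // 0

// Allocating in executor-sized chunks hands out whole executors instead:
// 48 / 16 = 3 executors, placed on 3 of the 4 workers.
val totalExecutors = coresMax / executorCores       // 3
{code}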
[jira] [Commented] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618352#comment-14618352 ] Kashif Rasul commented on SPARK-8872: - Ok, should be ready for review. Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor Labels: starter Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Darabos updated SPARK-8893: -- Description: What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. was: What does {{sc.parallelize(1 to 3).repartition(x).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{x}}. But if {{x}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{x}} < 0. But the behavior for {{x}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
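The proposed guard is a one-liner. A standalone sketch; in Spark it would live in RDD.repartition / coalesce, which is an assumption here:

{code}
// Illustrative validation helper mirroring the proposed behavior: reject
// non-positive partition counts instead of silently returning Array().
def validatedPartitionCount(numPartitions: Int): Int = {
  require(numPartitions > 0,
    s"Number of partitions ($numPartitions) must be positive.")
  numPartitions
}

validatedPartitionCount(3)       // ok
validatedPartitionCount(12 / 25) // rounds down to 0 -> IllegalArgumentException
{code}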
[jira] [Commented] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618142#comment-14618142 ] Apache Spark commented on SPARK-8866: - User 'yijieshen' has created a pull request for this issue: https://github.com/apache/spark/pull/7283 Use 1 microsecond (us) precision for TimestampType -- Key: SPARK-8866 URL: https://issues.apache.org/jira/browse/SPARK-8866 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Yijie Shen 100ns is slightly weird to compute. Let's use 1us to be more consistent with other systems (e.g. Postgres) and less error prone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
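For a sense of why a 100ns unit is awkward, compare the conversions from java.sql.Timestamp under each unit; illustrative only (not Spark's internal code), assuming non-negative timestamps:

{code}
// 1us precision: microseconds since the epoch.
def toMicros(t: java.sql.Timestamp): Long =
  t.getTime * 1000L + (t.getNanos / 1000L) % 1000L

// 100ns precision needs the odd factor-of-10000 arithmetic instead.
def toHundredNanos(t: java.sql.Timestamp): Long =
  t.getTime * 10000L + (t.getNanos / 100L) % 10000L
{code}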
[jira] [Assigned] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8866: --- Assignee: Apache Spark (was: Yijie Shen) Use 1 microsecond (us) precision for TimestampType -- Key: SPARK-8866 URL: https://issues.apache.org/jira/browse/SPARK-8866 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Apache Spark 100ns is slightly weird to compute. Let's use 1us to be more consistent with other systems (e.g. Postgres) and less error prone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-8864: - Comment: was deleted (was: Thanks for explanation. The design looks good to me now.) Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7917. -- Resolution: Duplicate A-ha, that's the ticket. I knew there was something like this already resolved. Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. It's a pretty simple repro: Run a job that does some shuffling, wait for the shuffle files to get cleaned up, go and look on disk at spark.local.dir and notice that the directory(s) are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8881) Scheduling fails if num_executors < num_workers
[ https://issues.apache.org/jira/browse/SPARK-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618362#comment-14618362 ] Nishkam Ravi commented on SPARK-8881: - There's more to it. Consider the following: three workers with num_cores (8, 8, 2). spark.cores.max = 12, spark.executor.cores = 4. Core allocation would be (5, 5, 2). num_executors = num_workers and nothing gets launched! Problem isn't that num_workers > num_executors (that's just a place it manifests in practice). Problem is we are allocating one core at a time and ignoring spark.executor.cores during allocation. Scheduling fails if num_executors < num_workers --- Key: SPARK-8881 URL: https://issues.apache.org/jira/browse/SPARK-8881 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0, 1.5.0 Reporter: Nishkam Ravi Current scheduling algorithm (in Master.scala) has two issues: 1. cores are allocated one at a time instead of spark.executor.cores at a time 2. when spark.cores.max/spark.executor.cores < num_workers, executors are not launched and the app hangs (due to 1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618402#comment-14618402 ] Apache Spark commented on SPARK-8893: - User 'darabos' has created a pull request for this issue: https://github.com/apache/spark/pull/7285 Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8893: --- Assignee: Apache Spark Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Assignee: Apache Spark Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8893: --- Assignee: (was: Apache Spark) Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618172#comment-14618172 ] Qian Huang commented on SPARK-8514: --- [~zhaoxiangyu] Not yet. [~shivaram] shared a communication-efficient version, which looks like a better version than ScaLAPACK's. I am reading the paper and working on this. LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
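For reference, the block-wise scheme discussed above factors the matrix recursively; in 2x2 block form (standard block LU, stated here only for context):

{code}
A = | A11 A12 | = | L11  0  | | U11 U12 |
    | A21 A22 |   | L21 L22 | |  0  U22 |

A11 = L11 U11            -- local LU of the corner block
U12 = inv(L11) A12       -- triangular solve
L21 = A21 inv(U11)       -- triangular solve
L22 U22 = A22 - L21 U12  -- LU of the Schur complement; recurse / pipeline
{code}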
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618200#comment-14618200 ] Cheng Hao commented on SPARK-8864: -- Thanks for explanation. The design looks good to me now. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618201#comment-14618201 ] Cheng Hao commented on SPARK-8864: -- Thanks for explanation. The design looks good to me now. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7442) Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access
[ https://issues.apache.org/jira/browse/SPARK-7442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618222#comment-14618222 ] Vinay commented on SPARK-7442: -- Tried and tested. Steps to submit a spark job when the jar file resides in S3: Step 1: Add these dependencies in the pom file A. hadoop-common.jar (optional if already present in classpath) B. hadoop-aws.jar C. aws-java-sdk.jar These steps have to be followed for both master & slaves Step 2: export AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY in bash export AWS_ACCESS_KEY_ID= export AWS_SECRET_ACCESS_KEY= Note: These properties can also be set as per AWS environment. Step 3: Add the below dependencies in spark-env.sh; these steps are to be followed on the slaves SPARK_CLASSPATH=../lib/hadoop-aws-2.6.0.jar SPARK_CLASSPATH=$SPARK_CLASSPATH:../lib/aws-java-sdk-1.7.4.jar SPARK_CLASSPATH=$SPARK_CLASSPATH:../lib/guava-11.0.2.jar Note: guava will be available in Hadoop, or else you can download it Step 4: When running the Spark job, append --deploy-mode cluster Sample command to submit a spark job: spark-submit --class com.x.y.z --master spark://master:7077 --deploy-mode cluster s3://bucket name/xyz.jar args Spark 1.3.1 / Hadoop 2.6 prebuilt package has broken S3 filesystem access - Key: SPARK-7442 URL: https://issues.apache.org/jira/browse/SPARK-7442 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.1 Environment: OS X Reporter: Nicholas Chammas # Download Spark 1.3.1 pre-built for Hadoop 2.6 from the [Spark downloads page|http://spark.apache.org/downloads.html]. # Add {{localhost}} to your {{slaves}} file and {{start-all.sh}} # Fire up PySpark and try reading from S3 with something like this: {code}sc.textFile('s3n://bucket/file_*').count(){code} # You will get an error like this: {code}py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : java.io.IOException: No FileSystem for scheme: s3n{code} {{file:///...}} works. Spark 1.3.1 prebuilt for Hadoop 2.4 works. Spark 1.3.0 works. It's just the combination of Spark 1.3.1 prebuilt for Hadoop 2.6 accessing S3 that doesn't work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618277#comment-14618277 ] Ma Xiaoyu commented on SPARK-5159: -- I was investigating this issue and it seems doAs in the HiveServer2 code was working. The problem is that when it forwards events in DAGScheduler, the event goes through a different thread, and the ticket in the receiving-side thread is not the same as on the sending side. The proxy user becomes the real user who started the HiveServer2 service. Is that the root cause? I might make a patch if so. Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618155#comment-14618155 ] zhaoxiangyu commented on SPARK-8514: I want to know if you have implemented the LU factorization on BlockMatrix. LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618191#comment-14618191 ] zhaoxiangyu commented on SPARK-8514: Can you give me the link to Shivaram Venkataraman's work? LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8892) Column.cast(LongType) does not work for large values
Jason Moore created SPARK-8892: -- Summary: Column.cast(LongType) does not work for large values Key: SPARK-8892 URL: https://issues.apache.org/jira/browse/SPARK-8892 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Jason Moore Casting a column from String to Long seems to go through an intermediate step of being cast to a Double (it hits Cast.scala line 328 in castToDecimal). The result is that for large values, the wrong value is returned. This test reveals the bug:
{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FlatSpec

import scala.util.Random

class DataFrameCastBug extends FlatSpec {
  "DataFrame" should "cast StringType to LongType correctly" in {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("app"))
    val qc = new SQLContext(sc)

    val values = Seq.fill(10)(Random.nextLong)
    val source = qc.createDataFrame(
      sc.parallelize(values.map(v => Row(v))),
      StructType(Seq(StructField("value", LongType))))

    val result = source.select(
      source("value"),
      source("value").cast(StringType).cast(LongType).as("castValue"))

    assert(result.where(result("value") !== result("castValue")).count() === 0)
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
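A quick illustration of the suspected mechanism (my own sketch, not from the report): a Long routed through a Double loses precision above 2^53, which matches the "wrong value for large values" symptom:
{code}
val v = 8999999999999999999L      // > 2^53, not exactly representable as a Double
val roundTripped = v.toDouble.toLong
// roundTripped == 9000000000000000000, not the original value
assert(roundTripped != v)
{code}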
[jira] [Resolved] (SPARK-8885) libgplcompression.so already loaded in another classloader
[ https://issues.apache.org/jira/browse/SPARK-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8885. -- Resolution: Invalid [~cenyuhai] There is a lot wrong with this JIRA, most importantly that it's a question for user@. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first before opening a JIRA libgplcompression.so already loaded in another classloader -- Key: SPARK-8885 URL: https://issues.apache.org/jira/browse/SPARK-8885 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: CentOS 6.2 JDK 1.7.0_51 Spark version: 1.4.0 Hadoop version: 2.2.0 hadoop native lib 32bit Reporter: cen yuhai Fix For: 1.4.2, 1.5.0 Hi all, I found an exception when using spark-sql: java.lang.UnsatisfiedLinkError: Native Library /data/lib/native/libgplcompression.so already loaded in another classloader ... I set spark.sql.hive.metastore.jars=. in spark-defaults.conf. It does not happen every time. Who knows why? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8893) Require positive partition counts in RDD.repartition
Daniel Darabos created SPARK-8893: - Summary: Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(x).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{x}}. But if {{x}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{x}} < 0. But the behavior for {{x}} = 0 is also error-prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
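A minimal sketch of the guard being proposed (the wrapper name is hypothetical; the real change would live inside RDD.repartition/coalesce):
{code}
import org.apache.spark.rdd.RDD

def checkedRepartition[T](rdd: RDD[T], numPartitions: Int): RDD[T] = {
  // Fail fast on 0 or negative counts instead of silently returning
  // an empty RDD.
  require(numPartitions > 0,
    s"Number of partitions ($numPartitions) must be positive")
  rdd.repartition(numPartitions)
}
{code}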
[jira] [Assigned] (SPARK-8068) Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8068: --- Assignee: (was: Apache Spark) Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib -- Key: SPARK-8068 URL: https://issues.apache.org/jira/browse/SPARK-8068 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.1 Reporter: Ai He Priority: Minor There is no confusionMatrix method at class MulticlassMetrics in pyspark/mllib. This method is actually implemented in scala mllib. To achieve this, we just need add a function call to the corresponding one in scala mllib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8897) SparkR DataFrame fail to return data of float type
Sun Rui created SPARK-8897: -- Summary: SparkR DataFrame fail to return data of float type Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6266: --- Assignee: (was: Apache Spark) PySpark SparseVector missing doc for size, indices, values -- Key: SPARK-6266 URL: https://issues.apache.org/jira/browse/SPARK-6266 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Need to add doc for size, indices, values attributes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6266: --- Assignee: Apache Spark PySpark SparseVector missing doc for size, indices, values -- Key: SPARK-6266 URL: https://issues.apache.org/jira/browse/SPARK-6266 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor Need to add doc for size, indices, values attributes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8898) Jets3t hangs with more than 1 core
Daniel Darabos created SPARK-8898: - Summary: Jets3t hangs with more than 1 core Key: SPARK-8898 URL: https://issues.apache.org/jira/browse/SPARK-8898 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: S3 Reporter: Daniel Darabos If I have an RDD that reads from S3 ({{newAPIHadoopFile}}) and try to write it back to S3 ({{saveAsNewAPIHadoopFile}}), it hangs if I have more than 1 core per executor. It sounds like a race condition, but so far I have seen it trigger 100% of the time. If it were a race for a limited number of connections, I would expect at least one task to succeed at least some of the time. But I never saw a single completed task, except when running with 1-core executors. All executor threads hang with one of the following two stack traces: {noformat:title=Stack trace 1} java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:342) at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestHead(RestS3Service.java:718) at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1599) at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectDetailsImpl(RestS3Service.java:1535) at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1987) at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1332) at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:107) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83) at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown Source) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:341) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:851) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:832) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:731) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1030) at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {noformat} {noformat:title=Stack trace 2} java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618624#comment-14618624 ] Greg Senia commented on SPARK-5159: --- Yes, that is the exact issue. It doesn't execute as the proxy user. This works correctly with the native HiveServer2 in Hive, but not with the Spark SQL thrift server. Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions, as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8897: --- Assignee: Apache Spark SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Assignee: Apache Spark This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618567#comment-14618567 ] Sean Owen commented on SPARK-8896: -- Why do you have multiple contexts? I think that's not allowed, right? StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618607#comment-14618607 ] Apache Spark commented on SPARK-8897: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/7289 SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8897: --- Assignee: (was: Apache Spark) SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8600) Naive Bayes API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-8600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8600: - Shepherd: Joseph K. Bradley Naive Bayes API for spark.ml Pipelines -- Key: SPARK-8600 URL: https://issues.apache.org/jira/browse/SPARK-8600 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Yanbo Liang Create a NaiveBayes API for the spark.ml Pipelines API. This should wrap the existing NaiveBayes implementation under spark.mllib package. Should also keep the parameter names consistent. The output columns could include both the prediction and confidence scores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8896) StreamingSource should choose a unique name
[ https://issues.apache.org/jira/browse/SPARK-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618764#comment-14618764 ] Aniket Bhatnagar commented on SPARK-8896: - As per the documentation (scala docs), only one SparkContext per JVM is allowed. However, no such documentation exists for StreamingContext. SparkContext makes an effort to ensure only one SparkContext instance exists in the JVM using the contextBeingConstructed variable. However, no such effort is made in StreamingContext. This led me to believe that multiple StreamingContexts are allowed in a JVM. I can understand why only one SparkContext is allowed per JVM (global state, et al), but I don't think that's true for StreamingContext. There may be genuine use cases for having 2 StreamingContext instances using the same SparkContext to leverage the same workers and have different batch intervals. StreamingSource should choose a unique name --- Key: SPARK-8896 URL: https://issues.apache.org/jira/browse/SPARK-8896 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Aniket Bhatnagar If 2 instances of StreamingContext are created and run using the same SparkContext, it results in the following exception in the logs and causes the latter StreamingContext's metrics to go unreported. [ForkJoinPool-2-worker-7] [info] o.a.s.m.MetricsSystem - Metrics already registered java.lang.IllegalArgumentException: A metric named AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime already exists at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.registerAll(MetricRegistry.java:385) ~[metrics-core-3.1.0.jar:3.1.0] at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:85) ~[metrics-core-3.1.0.jar:3.1.0] at org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:148) ~[spark-core_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:199) [spark-streaming_2.11-1.4.0.jar:1.4.0] at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:71) [spark-streaming_2.11-1.4.0.jar:1.4.0] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
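The collision itself is easy to reproduce outside Spark with the codahale registry; this sketch (mine, not Spark code) registers the same gauge name twice and fails exactly like the log above:
{code}
import com.codahale.metrics.{Gauge, MetricRegistry}

val registry = new MetricRegistry()
val name = "AppName.StreamingMetrics.streaming.lastReceivedBatch_processingEndTime"

def register(): Unit =
  registry.register(name, new Gauge[Long] { override def getValue: Long = 0L })

register()  // ok
register()  // throws IllegalArgumentException: A metric named ... already exists
{code}
So any fix presumably amounts to making each StreamingContext contribute a unique component to its metrics source name.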
[jira] [Commented] (SPARK-8899) remove duplicated equals method for Row
[ https://issues.apache.org/jira/browse/SPARK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618744#comment-14618744 ] Apache Spark commented on SPARK-8899: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7291 remove duplicated equals method for Row --- Key: SPARK-8899 URL: https://issues.apache.org/jira/browse/SPARK-8899 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8899) remove duplicated equals method for Row
[ https://issues.apache.org/jira/browse/SPARK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8899: --- Assignee: (was: Apache Spark) remove duplicated equals method for Row --- Key: SPARK-8899 URL: https://issues.apache.org/jira/browse/SPARK-8899 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8899) remove duplicated equals method for Row
[ https://issues.apache.org/jira/browse/SPARK-8899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8899: --- Assignee: Apache Spark remove duplicated equals method for Row --- Key: SPARK-8899 URL: https://issues.apache.org/jira/browse/SPARK-8899 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8898) Jets3t hangs with more than 1 core
[ https://issues.apache.org/jira/browse/SPARK-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618747#comment-14618747 ] Sean Owen commented on SPARK-8898: -- Yeah, this is a jets3t problem. You will have to manage to update it in your build, or get EC2 + Hadoop 2 to work, which I think can be done. At least, this is just a subset of "EC2 should support Hadoop 2" and/or "the EC2 support should move out of Spark" anyway. I don't know that there's another action to take in Spark. Jets3t hangs with more than 1 core -- Key: SPARK-8898 URL: https://issues.apache.org/jira/browse/SPARK-8898 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: S3 Reporter: Daniel Darabos If I have an RDD that reads from S3 ({{newAPIHadoopFile}}) and try to write it back to S3 ({{saveAsNewAPIHadoopFile}}), it hangs if I have more than 1 core per executor. It sounds like a race condition, but so far I have seen it trigger 100% of the time. If it were a race for a limited number of connections, I would expect at least one task to succeed at least some of the time. But I never saw a single completed task, except when running with 1-core executors. All executor threads hang with one of the following two stack traces: {noformat:title=Stack trace 1} java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518) - locked 0x0007759cae70 (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:342) at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestHead(RestS3Service.java:718) at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1599) at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectDetailsImpl(RestS3Service.java:1535) at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1987) at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1332) at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:107) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83) at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown Source) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:341) at 
org.apache.hadoop.fs.FileSystem.create(FileSystem.java:851) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:832) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:731) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1030) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at
[jira] [Created] (SPARK-8899) remove duplicated equals method for Row
Wenchen Fan created SPARK-8899: -- Summary: remove duplicated equals method for Row Key: SPARK-8899 URL: https://issues.apache.org/jira/browse/SPARK-8899 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8118: - Target Version/s: 1.5.0 (was: 1.4.1, 1.5.0) Turn off noisy log output produced by Parquet 1.7.0 --- Key: SPARK-8118 URL: https://issues.apache.org/jira/browse/SPARK-8118 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.4.1, 1.5.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Fix For: 1.5.0 Parquet 1.7.0 renames package name to org.apache.parquet, need to adjust {{ParquetRelation.enableLogForwarding}} accordingly to avoid noisy log output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8872: - Assignee: Kashif Rasul Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Kashif Rasul Priority: Minor Labels: starter Fix For: 1.5.0 Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8873) Support cleaning up shuffle files for drivers launched with Mesos
[ https://issues.apache.org/jira/browse/SPARK-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8873: - Priority: Minor (was: Major) Component/s: Mesos Support cleaning up shuffle files for drivers launched with Mesos - Key: SPARK-8873 URL: https://issues.apache.org/jira/browse/SPARK-8873 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Priority: Minor Labels: mesos With dynamic allocation enabled with Mesos, drivers can launch with shuffle data cached in the external shuffle service. However, there is no reliable way to let the shuffle service clean up the shuffle data when the driver exits, since it may crash before it notifies the shuffle service and shuffle data will be cached forever. We need to implement a reliable way to detect driver termination and clean up shuffle data accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8872. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7269 [https://github.com/apache/spark/pull/7269] Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor Labels: starter Fix For: 1.5.0 Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618822#comment-14618822 ] Kashif Rasul commented on SPARK-8872: - [~mengxr] my jira username is krasul Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor Labels: starter Fix For: 1.5.0 Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix
[ https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618835#comment-14618835 ] Shivaram Venkataraman commented on SPARK-8514: -- I posted a link to a paper on communication-efficient LU before (http://dl.acm.org/citation.cfm?id=1413400). There might be an MPI implementation at https://github.com/solomonik/NuLAB that could also be helpful. LU factorization on BlockMatrix --- Key: SPARK-8514 URL: https://issues.apache.org/jira/browse/SPARK-8514 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Labels: advanced LU is the most common method to solve a general linear system or invert a general matrix. A distributed version could be implemented block-wise with pipelining. A reference implementation is provided in ScaLAPACK: http://netlib.org/scalapack/slug/node178.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
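As a pointer for implementers, the block-wise scheme the description alludes to is the standard 2x2 block LU recursion (my transcription of the textbook formulation, not material from this ticket):
{noformat}
A = [ A11  A12 ] = [ L11   0  ] [ U11  U12 ]
    [ A21  A22 ]   [ L21  L22 ] [  0   U22 ]

L11 * U11 = A11                    (LU of the leading block)
U12 = inv(L11) * A12               (lower-triangular solve)
L21 = A21 * inv(U11)               (upper-triangular solve)
L22 * U22 = A22 - L21 * U12        (LU of the Schur complement, recursively)
{noformat}
The pipelining mentioned in the description comes from overlapping the triangular solves with the trailing Schur-complement update.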
[jira] [Updated] (SPARK-8893) Require positive partition counts in RDD.repartition
[ https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8893: - Component/s: Spark Core Require positive partition counts in RDD.repartition Key: SPARK-8893 URL: https://issues.apache.org/jira/browse/SPARK-8893 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Daniel Darabos Priority: Trivial What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}. I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 is also error-prone. In fact that's how I found this strange behavior. I used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded down to zero and the results surprised me. I'd prefer an exception instead of unexpected (corrupt) results. I'm happy to send a pull request for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8897. -- Resolution: Duplicate SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618863#comment-14618863 ] Shivaram Venkataraman commented on SPARK-8596: -- Thanks for the PR. Will review this today. And we don't have anything like this open for ipython as far as I know. You can open a new JIRA and discuss this. Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman This will make it convenient for R users to use SparkR from their browsers -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8897) SparkR DataFrame fail to return data of float type
[ https://issues.apache.org/jira/browse/SPARK-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618849#comment-14618849 ] Shivaram Venkataraman commented on SPARK-8897: -- [~sunrui] I'm closing this as we already have a PR and an issue open in SPARK-8840 SparkR DataFrame fail to return data of float type -- Key: SPARK-8897 URL: https://issues.apache.org/jira/browse/SPARK-8897 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui This is an issue reported by Evgeny Sinelnikov esinelni...@griddynamics.com in the Spark mailing list: {quote} I've got a problem with float type coercion on SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "jobj" to a data.frame {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8866: --- Assignee: Yijie Shen (was: Apache Spark) Use 1 microsecond (us) precision for TimestampType -- Key: SPARK-8866 URL: https://issues.apache.org/jira/browse/SPARK-8866 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Yijie Shen 100ns is slightly weird to compute. Let's use 1us to be more consistent with other systems (e.g. Postgres) and less error prone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
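As a concrete reading of the 1us representation: a value would be microseconds since the epoch, derivable from java.sql.Timestamp as in this sketch (mine, assuming non-negative timestamps; not the actual Spark conversion code):
{code}
import java.sql.Timestamp

def toMicros(t: Timestamp): Long =
  (t.getTime / 1000L) * 1000000L +  // whole seconds -> microseconds
    t.getNanos / 1000L              // fractional part, truncated to microseconds
{code}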
[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618179#comment-14618179 ] Mingyu Kim commented on SPARK-7917: --- This looks like a duplicate of SPARK-5970, which is fixed in Spark 1.4. Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs, however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. Its a pretty simple repro: Run a job that does some shuffling, wait for the shuffle files to get cleaned up, go and look on disk at spark.local.dir and notice that the directory(s) are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8881) Scheduling fails if num_executors < num_workers
[ https://issues.apache.org/jira/browse/SPARK-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618369#comment-14618369 ] Nishkam Ravi commented on SPARK-8881: - This isn't the best example, because the third worker will get screened out. Consider the following instead: three workers with num_cores (8, 8, 3), spark.cores.max=8, spark.executor.cores=2. Core allocation would be (3, 3, 2): 3 executors launched instead of 4. You get the drift. Scheduling fails if num_executors < num_workers --- Key: SPARK-8881 URL: https://issues.apache.org/jira/browse/SPARK-8881 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0, 1.5.0 Reporter: Nishkam Ravi Current scheduling algorithm (in Master.scala) has two issues: 1. cores are allocated one at a time instead of spark.executor.cores at a time 2. when spark.cores.max/spark.executor.cores < num_workers, executors are not launched and the app hangs (due to 1) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
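To make the arithmetic concrete, here is a toy version (mine, not Master.scala) of allocating in chunks of spark.executor.cores rather than one core at a time; with workers offering (8, 8, 3) cores, executor.cores=2 and cores.max=8 it yields four 2-core executors instead of three:
{code}
def assignCores(freeCores: Array[Int], coresPerExecutor: Int, maxCores: Int): Array[Int] = {
  val assigned = Array.fill(freeCores.length)(0)
  var remaining = maxCores
  var progress = true
  // Round-robin over workers, but always in whole-executor chunks.
  while (remaining >= coresPerExecutor && progress) {
    progress = false
    for (i <- freeCores.indices
         if remaining >= coresPerExecutor &&
            freeCores(i) - assigned(i) >= coresPerExecutor) {
      assigned(i) += coresPerExecutor  // one more executor on worker i
      remaining -= coresPerExecutor
      progress = true
    }
  }
  assigned
}

assignCores(Array(8, 8, 3), 2, 8)  // Array(4, 2, 2): four 2-core executors
{code}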
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619098#comment-14619098 ] shane knapp commented on SPARK-8571: ok, upon auditing all of the spark builds, i think i found the culprits:
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-pre-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.4-Maven-with-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.4-Maven-pre-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.3-Maven-with-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.3-Maven-pre-YARN/configure
* #!/bin/bash is NOT set (defaulting to behavior where if mvn fails, the lsof/xargs kill commands will never run)
* the lsof/xargs kill line will potentially pollute the exit code of the build block
* set -e looks to be impossible to set due to the lsof/xargs kill being the last line of the block
proposed:
* store the retcodes of the mvn commands, and if either one fails, fail the build after the lsof/xargs kill command
* add #!/bin/bash
spark streaming hanging processes upon build exit - Key: SPARK-8571 URL: https://issues.apache.org/jira/browse/SPARK-8571 Project: Spark Issue Type: Bug Components: Build, Streaming Environment: centos 6.6 amplab build system Reporter: shane knapp Assignee: shane knapp Priority: Minor Labels: build, test over the past 3 months i've been noticing that there are occasionally hanging processes on our build system workers after various spark builds have finished. these are all spark streaming processes. today i noticed a 3+ hour spark build that was timed out after 200 minutes (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/), and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on amp-jenkins-worker-02. after the timeout, it left the following process (and all of its children) hanging. the process' CLI command was: {quote} [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714 jenkins1714 733 2.7 21342148 3642740 ?Sl 07:52 1713:41 java -Dderby.system.durability=test -Djava.awt.headless=true -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp -Dspark.driver.allowMultipleContexts=true -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos -Dspark.testing=1 -Dspark.ui.enabled=false -Dspark.ui.showConsoleProgress=false -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m org.scalatest.tools.Runner -R /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes -o -f /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt -u /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/. 
{quote} stracing that process doesn't give us much: {quote} [root@amp-jenkins-worker-02 ~]# strace -p 1714 Process 1714 attached - interrupt to quit futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL {quote} stracing its children gives us a *little* bit more... some loop like this: {quote} snip futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0 futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0 futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0 futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, ) = -1 ETIMEDOUT (Connection timed out) {quote} and others loop on ptrace_attach (no such process) or restart_syscall (resuming interrupted call). even though this behavior has been solidly pinned to jobs timing out (which end w/an aborted, not failed, build), i've seen it happen for failed builds as well. if i see any hanging processes from failed (not aborted) builds, i will investigate them and update this bug as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe,
[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet
[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619100#comment-14619100 ] Matt Massie commented on SPARK-7263: The Spark shuffle manager APIs, in their current state, don't support a standalone shuffle implementation. If you like, I can split my pull request into two parts: (a) changes to Spark, e.g. [serializing class info|https://github.com/massie/spark/commit/fc03c0bd29fa71ff390b86a8f6fd31c1cbef960f], making APIs public, etc., and (b) the new Parquet implementation. I think your comment that we're creating a whole new shuffle subsystem for one data type is technically correct, but it misses the bigger point. The currently supported data type, {{IndexedRecord}}, is the base type for all Avro objects and includes three methods -- {{get}}, {{put}} and {{getSchema}} -- the primitives necessary for describing, storing and building objects. Since Parquet supports Thrift and Protobuf too, it would be straightforward to add their base types here too, which perform similar functions. I reached out to Michael Armbrust and looked at the Spark SQL code, in depth, before I wrote this. I had hoped to piggyback on the Spark SQL work but found that it wasn't a good match. If you like, I can list all the issues that I found. I'd like to know why you think this would be a maintenance nightmare. I think otherwise, but of course I wrote this. Can you be more specific with your concerns around maintenance? Add new shuffle manager which stores shuffle blocks in Parquet -- Key: SPARK-7263 URL: https://issues.apache.org/jira/browse/SPARK-7263 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Matt Massie I have a working prototype of this feature that can be viewed at https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1 Setting the spark.shuffle.manager to parquet enables this shuffle manager. The dictionary support that Parquet provides appreciably reduces the amount of memory that objects use; however, once Parquet data is shuffled, all the dictionary information is lost and the column-oriented data is written to shuffle blocks in a record-oriented fashion. This shuffle manager addresses this issue by reading and writing all shuffle blocks in the Parquet format. If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a Parquet schema and used directly; otherwise, the Parquet schema is generated via reflection. Currently, the only non-Avro keys supported are primitive types. The reflection code can be improved (or replaced) to support complex records. The ParquetShufflePair class allows the shuffle key and value to be stored in Parquet blocks as a single record with a single schema. This commit adds the following new Spark configuration options: spark.shuffle.parquet.compression - sets the Parquet compression codec spark.shuffle.parquet.blocksize - sets the Parquet block size spark.shuffle.parquet.pagesize - sets the Parquet page size spark.shuffle.parquet.enabledictionary - turns dictionary encoding on/off Parquet does not (and has no plans to) support a streaming API. Metadata sections are scattered through a Parquet file, making a streaming API difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch the entire contents of map outputs into temporary blocks before loading the data into the reducer. 
Interesting future asides: o There is no need to define a data serializer (although Spark requires it) o Parquet supports predicate pushdown and projection, which could be used between shuffle stages to improve performance in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
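For readers unfamiliar with the Avro base type the comment leans on: IndexedRecord's getSchema/get/put are indeed the whole contract, and any generic record satisfies it. A small sketch of my own, with a made-up schema:
{code}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

val schema = new Schema.Parser().parse(
  """{"type": "record", "name": "KV", "fields": [
    |  {"name": "key", "type": "long"},
    |  {"name": "value", "type": "string"}
    |]}""".stripMargin)

val rec: GenericRecord = new GenericData.Record(schema)  // an IndexedRecord
rec.put("key", 1L)
rec.put("value", "a")
// Enough to describe (getSchema), build (put) and read back (get) the object.
assert(rec.getSchema == schema && rec.get("key") == 1L)
{code}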
[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet
[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619105#comment-14619105 ] Reynold Xin commented on SPARK-7263: Can you please list the issues that made Spark SQL a bad fit for this? Add new shuffle manager which stores shuffle blocks in Parquet -- Key: SPARK-7263 URL: https://issues.apache.org/jira/browse/SPARK-7263 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Matt Massie I have a working prototype of this feature that can be viewed at https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1 Setting the spark.shuffle.manager to parquet enables this shuffle manager. The dictionary support that Parquet provides appreciably reduces the amount of memory that objects use; however, once Parquet data is shuffled, all the dictionary information is lost and the column-oriented data is written to shuffle blocks in a record-oriented fashion. This shuffle manager addresses this issue by reading and writing all shuffle blocks in the Parquet format. If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a Parquet schema and used directly; otherwise, the Parquet schema is generated via reflection. Currently, the only non-Avro keys supported are primitive types. The reflection code can be improved (or replaced) to support complex records. The ParquetShufflePair class allows the shuffle key and value to be stored in Parquet blocks as a single record with a single schema. This commit adds the following new Spark configuration options: spark.shuffle.parquet.compression - sets the Parquet compression codec spark.shuffle.parquet.blocksize - sets the Parquet block size spark.shuffle.parquet.pagesize - sets the Parquet page size spark.shuffle.parquet.enabledictionary - turns dictionary encoding on/off Parquet does not (and has no plans to) support a streaming API. Metadata sections are scattered through a Parquet file, making a streaming API difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch the entire contents of map outputs into temporary blocks before loading the data into the reducer. Interesting future asides: o There is no need to define a data serializer (although Spark requires it) o Parquet supports predicate pushdown and projection, which could be used between shuffle stages to improve performance in the future -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8657) Fail to upload conf archive to viewfs
[ https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8657. -- Resolution: Fixed Fail to upload conf archive to viewfs - Key: SPARK-8657 URL: https://issues.apache.org/jira/browse/SPARK-8657 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: spark-1.4.2 hadoop-2.5.0-cdh5.3.2 Reporter: Tao Li Assignee: Tao Li Priority: Minor Labels: distributed_cache, viewfs Fix For: 1.4.2, 1.5.0 When I run in spark-1.4 yarn-client mode, it throws the following exception when trying to upload the conf archive to viewfs: 15/06/26 17:56:37 INFO yarn.Client: Uploading resource file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661.zip -> viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory .sparkStaging/application_1434370929997_191242 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Wrong FS: hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip, expected: viewfs://nsX/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) at org.apache.spark.SparkContext.<init>(SparkContext.scala:497) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017) at $line3.$read$$iwC$$iwC.<init>(<console>:9) at $line3.$read$$iwC.<init>(<console>:18) at $line3.$read.<init>(<console>:20) at $line3.$read$.<init>(<console>:24) at $line3.$read$.<clinit>(<console>) at $line3.$eval$.<init>(<console>:7) at $line3.$eval$.<clinit>(<console>) at $line3.$eval.$print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) The bug is easy to fix: we should pass the correct file system object to addResource. A similar issue is: https://github.com/apache/spark/pull/1483. I will attach my bug fix PR very soon. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
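As a hedged sketch of the fix direction described above (not the actual patch): the key point is to resolve the FileSystem from the destination path itself, so that a viewfs:// path yields a viewfs FileSystem instead of the default hdfs:// one. The path value below is illustrative.
{code}
// Hedged sketch of the fix direction only; the path is illustrative.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val destPath = new Path("viewfs://nsX/user/ultraman/.sparkStaging/app_1/__hadoop_conf__.zip")

// Resolve the FileSystem from the destination path itself; this is the
// object that should be passed down to addResource, per the report above.
val destFs: FileSystem = destPath.getFileSystem(hadoopConf)
{code}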
[jira] [Updated] (SPARK-8657) Fail to upload conf archive to viewfs
[ https://issues.apache.org/jira/browse/SPARK-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8657: - Assignee: Tao Li Target Version/s: (was: 1.5.0) Priority: Minor (was: Major) Fix Version/s: 1.5.0 1.4.2 Fail to upload conf archive to viewfs - Key: SPARK-8657 URL: https://issues.apache.org/jira/browse/SPARK-8657 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: spark-1.4.2 hadoop-2.5.0-cdh5.3.2 Reporter: Tao Li Assignee: Tao Li Priority: Minor Labels: distributed_cache, viewfs Fix For: 1.4.2, 1.5.0 When I run in spark-1.4 yarn-client mode, it throws the following exception when trying to upload the conf archive to viewfs: 15/06/26 17:56:37 INFO yarn.Client: Uploading resource file:/tmp/spark-095ec3d2-5dad-468c-8d46-2c813457404d/__hadoop_conf__8436284925771788661.zip -> viewfs://nsX/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip 15/06/26 17:56:38 INFO yarn.Client: Deleting staging directory .sparkStaging/application_1434370929997_191242 15/06/26 17:56:38 ERROR spark.SparkContext: Error initializing SparkContext. java.lang.IllegalArgumentException: Wrong FS: hdfs://SunshineNameNode2:8020/user/ultraman/.sparkStaging/application_1434370929997_191242/__hadoop_conf__8436284925771788661.zip, expected: viewfs://nsX/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:117) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:346) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:341) at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:338) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:338) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:559) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:58) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141) at org.apache.spark.SparkContext.<init>(SparkContext.scala:497) at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017) at $line3.$read$$iwC$$iwC.<init>(<console>:9) at $line3.$read$$iwC.<init>(<console>:18) at $line3.$read.<init>(<console>:20) at $line3.$read$.<init>(<console>:24) at $line3.$read$.<clinit>(<console>) at $line3.$eval$.<init>(<console>:7) at $line3.$eval$.<clinit>(<console>) at $line3.$eval.$print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) The bug is easy to fix: we should pass the correct file system object to addResource. A similar issue is: https://github.com/apache/spark/pull/1483. I will attach my bug fix PR very soon. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619135#comment-14619135 ] shane knapp commented on SPARK-8571: basically the code would look something like:
#!/bin/bash
rm -rf ./work
git clean -fdx
export BLAH
build/mvn BLAH BLAH
retcode1=$?
build/mvn WHEE ZOMG
retcode2=$?
lsof -t | xargs kill  # -t prints bare PIDs so kill receives valid arguments
if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
  exit 1
fi
spark streaming hanging processes upon build exit - Key: SPARK-8571 URL: https://issues.apache.org/jira/browse/SPARK-8571 Project: Spark Issue Type: Bug Components: Build, Streaming Environment: centos 6.6 amplab build system Reporter: shane knapp Assignee: shane knapp Priority: Minor Labels: build, test over the past 3 months i've been noticing that there are occasionally hanging processes on our build system workers after various spark builds have finished. these are all spark streaming processes. today i noticed a 3+ hour spark build that was timed out after 200 minutes (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/), and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on amp-jenkins-worker-02. after the timeout, it left the following process (and all of its children) hanging. the process' CLI command was: {quote} [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714 jenkins 1714 733 2.7 21342148 3642740 ? Sl 07:52 1713:41 java -Dderby.system.durability=test -Djava.awt.headless=true -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp -Dspark.driver.allowMultipleContexts=true -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos -Dspark.testing=1 -Dspark.ui.enabled=false -Dspark.ui.showConsoleProgress=false -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m org.scalatest.tools.Runner -R /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes -o -f /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt -u /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/. {quote} stracing that process doesn't give us much: {quote} [root@amp-jenkins-worker-02 ~]# strace -p 1714 Process 1714 attached - interrupt to quit futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL {quote} stracing its children gives us a *little* bit more... 
some loop like this: {quote}
<snip>
futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, ) = -1 ETIMEDOUT (Connection timed out)
{quote} and others loop on ptrace_attach (no such process) or restart_syscall (resuming interrupted call). even though this behavior has been solidly pinned to jobs timing out (which ends w/an aborted, not failed, build), i've seen it happen for failed builds as well. if i see any hanging processes from failed (not aborted) builds, i will investigate them and update this bug as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619144#comment-14619144 ] RJ Nowling commented on SPARK-3644: --- [~joshrosen] The issue and corresponding PR you reference only seem to provide read-only access. Is that correct? If so, then are there open issues to address the needs of the users above? Thanks! REST API for Spark application info (jobs / stages / tasks / storage info) -- Key: SPARK-3644 URL: https://issues.apache.org/jira/browse/SPARK-3644 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Reporter: Josh Rosen Assignee: Imran Rashid Fix For: 1.4.0 This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first principles. We can discuss which URLs / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8894. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.2 Issue resolved by pull request 7287 [https://github.com/apache/spark/pull/7287] Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Fix For: 1.4.2, 1.5.0 There are errors in SparkR-related documentation. 1. In https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. In https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8894) Example code errors in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8894: - Assignee: Sun Rui Example code errors in SparkR documentation --- Key: SPARK-8894 URL: https://issues.apache.org/jira/browse/SPARK-8894 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Assignee: Sun Rui Fix For: 1.4.2, 1.5.0 There are errors in SparkR-related documentation. 1. In https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables, for the R part, {code} results = sqlContext.sql("FROM src SELECT key, value").collect() {code} should be {code} results <- collect(sql(sqlContext, "FROM src SELECT key, value")) {code} 2. In https://spark.apache.org/docs/latest/sparkr.html#from-hive-tables, {code} results <- hiveContext.sql("FROM src SELECT key, value") {code} should be {code} results <- sql(hiveContext, "FROM src SELECT key, value") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8900) sparkPackages flag name is wrong in the documentation
[ https://issues.apache.org/jira/browse/SPARK-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618928#comment-14618928 ] Apache Spark commented on SPARK-8900: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/7293 sparkPackages flag name is wrong in the documentation - Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8900) sparkPackages flag name is wrong in the documentation
[ https://issues.apache.org/jira/browse/SPARK-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8900: --- Assignee: (was: Apache Spark) sparkPackages flag name is wrong in the documentation - Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8785: -- Assignee: Liang-Chi Hsieh Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Currently, Parquet schema merging (ParquetRelation2.readSchema) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge those later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
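A hedged sketch of the optimization proposed above: deduplicate the per-file schemas before the pairwise merge, so each distinct schema is merged only once. mergeSchemas is a hypothetical stand-in for whatever merge ParquetRelation2.readSchema actually performs, not a real API call.
{code}
// Hedged sketch only; mergeSchemas is a hypothetical stand-in for the
// pairwise merge used by ParquetRelation2.readSchema.
import org.apache.spark.sql.types.StructType

def mergeSchemas(a: StructType, b: StructType): StructType = ??? // placeholder

def readSchema(schemas: Seq[StructType]): Option[StructType] =
  schemas.distinct.reduceOption(mergeSchemas) // merge each distinct schema once
{code}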
[jira] [Resolved] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8785. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7182 [https://github.com/apache/spark/pull/7182] Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Currently, Parquet schema merging (ParquetRelation2.readSchema) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge those later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6912. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7257 [https://github.com/apache/spark/pull/7257] Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro Fix For: 1.5.0 The current implementation can't handle Map<K,V> as a return type in a Hive UDF. We assume a UDF like the one below: public class UDFToIntIntMap extends UDF { public Map<Integer, Integer> evaluate(Object o); } Hive supports this type; see the links below for details: https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
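For reference, a Scala rendering of the UDF shape the issue describes might look like the sketch below; only the class name and the evaluate signature come from the issue, and the body is invented.
{code}
// Hedged Scala sketch of the UDF shape named in the issue; the body is invented.
import org.apache.hadoop.hive.ql.exec.UDF

class UDFToIntIntMap extends UDF {
  def evaluate(o: AnyRef): java.util.Map[Integer, Integer] = {
    val m = new java.util.HashMap[Integer, Integer]()
    m.put(Integer.valueOf(1), Integer.valueOf(1)) // placeholder content
    m
  }
}
{code}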
[jira] [Created] (SPARK-8900) sparkPackages flag name is wrong in the documentation
Shivaram Venkataraman created SPARK-8900: Summary: sparkPackages flag name is wrong in the documentation Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8900) sparkPackages flag name is wrong in the documentation
[ https://issues.apache.org/jira/browse/SPARK-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8900: --- Assignee: Apache Spark sparkPackages flag name is wrong in the documentation - Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman Assignee: Apache Spark The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6266: - Shepherd: Xiangrui Meng PySpark SparseVector missing doc for size, indices, values -- Key: SPARK-6266 URL: https://issues.apache.org/jira/browse/SPARK-6266 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Need to add doc for size, indices, values attributes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7736) Exception not failing Python applications (in yarn cluster mode)
[ https://issues.apache.org/jira/browse/SPARK-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618885#comment-14618885 ] Neelesh Srinivas Salian commented on SPARK-7736: My 2 cents: to have a YARN job marked as failed, the ApplicationMaster running the driver needs to fail. Scenario: 1) It fails once; YARN retries and succeeds if the exception has been handled correctly. This results in a successful YARN job (assuming the child tasks (executors) succeeded). 2) The retries fail and the YARN job fails completely. You need the Spark application to cause a failure in YARN for it to be marked as a failure. Moreover, the ApplicationMaster.java code from /hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java in the Hadoop project should help. Reference: [1] http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html So, I would say this is expected behavior. Hope that helps. Please add/correct me if needed. Exception not failing Python applications (in yarn cluster mode) Key: SPARK-7736 URL: https://issues.apache.org/jira/browse/SPARK-7736 Project: Spark Issue Type: Bug Components: YARN Environment: Spark 1.3.1, Yarn 2.7.0, Ubuntu 14.04 Reporter: Shay Rojansky It seems that exceptions thrown in Python spark apps after the SparkContext is instantiated don't cause the application to fail, at least in Yarn: the application is marked as SUCCEEDED. Note that any exception right before the SparkContext correctly places the application in FAILED state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8785) Improve Parquet schema merging
[ https://issues.apache.org/jira/browse/SPARK-8785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8785: -- Shepherd: Cheng Lian Improve Parquet schema merging -- Key: SPARK-8785 URL: https://issues.apache.org/jira/browse/SPARK-8785 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Currently, Parquet schema merging (ParquetRelation2.readSchema) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge those later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6266) PySpark SparseVector missing doc for size, indices, values
[ https://issues.apache.org/jira/browse/SPARK-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6266: - Assignee: Kai Sasaki PySpark SparseVector missing doc for size, indices, values -- Key: SPARK-6266 URL: https://issues.apache.org/jira/browse/SPARK-6266 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Kai Sasaki Priority: Minor Need to add doc for size, indices, values attributes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6912: Assignee: Takeshi Yamamuro Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Fix For: 1.5.0 The current implementation can't handle Map<K,V> as a return type in a Hive UDF. We assume a UDF like the one below: public class UDFToIntIntMap extends UDF { public Map<Integer, Integer> evaluate(Object o); } Hive supports this type; see the links below for details: https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception
[ https://issues.apache.org/jira/browse/SPARK-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5707. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7272 [https://github.com/apache/spark/pull/7272] Enabling spark.sql.codegen throws ClassNotFound exception - Key: SPARK-5707 URL: https://issues.apache.org/jira/browse/SPARK-5707 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.1 Environment: yarn-client mode, spark.sql.codegen=true Reporter: Yi Yao Assignee: Ram Sriharsha Priority: Blocker Fix For: 1.5.0 Exception thrown: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 133.0 (TID 3066, cdh52-node2): java.io.IOException: com.esotericsoftware.kryo.KryoException: Unable to find class: __wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1 Serialization trace: hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62) at org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at
[jira] [Commented] (SPARK-8900) sparkPackages flag name is wrong in the documentation
[ https://issues.apache.org/jira/browse/SPARK-8900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619036#comment-14619036 ] Shivaram Venkataraman commented on SPARK-8900: -- [~bashyal] You can take a look at the PR I have open. sparkPackages flag name is wrong in the documentation - Key: SPARK-8900 URL: https://issues.apache.org/jira/browse/SPARK-8900 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.1 Reporter: Shivaram Venkataraman The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5427) Add support for floor function in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-5427: -- Description: floor() function is supported in Hive SQL. This issue is to add floor() function to Spark SQL. Related thread: http://search-hadoop.com/m/JW1q563fc22 was: floor() function is supported in Hive SQL. This issue is to add floor() function to Spark SQL. Related thread: http://search-hadoop.com/m/JW1q563fc22 Add support for floor function in Spark SQL --- Key: SPARK-5427 URL: https://issues.apache.org/jira/browse/SPARK-5427 Project: Spark Issue Type: Improvement Components: SQL Reporter: Ted Yu Labels: math floor() function is supported in Hive SQL. This issue is to add floor() function to Spark SQL. Related thread: http://search-hadoop.com/m/JW1q563fc22 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
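Once implemented, the intended usage would presumably look like the sketch below; sqlContext is assumed to be an existing SQLContext, and the expected result is an assumption based on Hive's floor() semantics.
{code}
// Hedged sketch of the requested behavior; sqlContext is an existing SQLContext.
val df = sqlContext.sql("SELECT floor(3.7) AS f")
df.show() // expected to print 3 once floor() is supported, matching Hive SQL
{code}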
[jira] [Updated] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8571: -- Assignee: shane knapp spark streaming hanging processes upon build exit - Key: SPARK-8571 URL: https://issues.apache.org/jira/browse/SPARK-8571 Project: Spark Issue Type: Bug Components: Build, Streaming Environment: centos 6.6 amplab build system Reporter: shane knapp Assignee: shane knapp Priority: Minor Labels: build, test over the past 3 months i've been noticing that there are occasionally hanging processes on our build system workers after various spark builds have finished. these are all spark streaming processes. today i noticed a 3+ hour spark build that was timed out after 200 minutes (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/), and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on amp-jenkins-worker-02. after the timeout, it left the following process (and all of its children) hanging. the process' CLI command was: {quote} [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714 jenkins 1714 733 2.7 21342148 3642740 ? Sl 07:52 1713:41 java -Dderby.system.durability=test -Djava.awt.headless=true -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp -Dspark.driver.allowMultipleContexts=true -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos -Dspark.testing=1 -Dspark.ui.enabled=false -Dspark.ui.showConsoleProgress=false -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m org.scalatest.tools.Runner -R /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes -o -f /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt -u /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/. {quote} stracing that process doesn't give us much: {quote} [root@amp-jenkins-worker-02 ~]# strace -p 1714 Process 1714 attached - interrupt to quit futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL {quote} stracing its children gives us a *little* bit more... some loop like this: {quote}
<snip>
futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, ) = -1 ETIMEDOUT (Connection timed out)
{quote} and others loop on ptrace_attach (no such process) or restart_syscall (resuming interrupted call). even though this behavior has been solidly pinned to jobs timing out (which ends w/an aborted, not failed, build), i've seen it happen for failed builds as well. if i see any hanging processes from failed (not aborted) builds, i will investigate them and update this bug as well. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8901) [SparkR] Documentation Incorrect for sparkR.init
Pradeep Bashyal created SPARK-8901: -- Summary: [SparkR] Documentation Incorrect for sparkR.init Key: SPARK-8901 URL: https://issues.apache.org/jira/browse/SPARK-8901 Project: Spark Issue Type: Documentation Components: R Affects Versions: 1.4.1 Environment: R Reporter: Pradeep Bashyal Priority: Minor The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8901) [SparkR] Documentation Incorrect for sparkR.init
[ https://issues.apache.org/jira/browse/SPARK-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-8901. --- Resolution: Duplicate [SparkR] Documentation Incorrect for sparkR.init Key: SPARK-8901 URL: https://issues.apache.org/jira/browse/SPARK-8901 Project: Spark Issue Type: Documentation Components: R Affects Versions: 1.4.1 Environment: R Reporter: Pradeep Bashyal Priority: Minor Labels: documentation Original Estimate: 1h Remaining Estimate: 1h The SparkR documentation example in http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html is incorrect. sc <- sparkR.init(packages="com.databricks:spark-csv_2.11:1.0.3") should be sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8753) Create an IntervalType data type
[ https://issues.apache.org/jira/browse/SPARK-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8753. Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 1.5.0 Create an IntervalType data type Key: SPARK-8753 URL: https://issues.apache.org/jira/browse/SPARK-8753 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan Fix For: 1.5.0 We should create an IntervalType data type that represents time intervals. Internally, we can use a long value to store it, similar to Timestamp (i.e. 100ns precision). This data type initially cannot be stored externally; it can only be used in expressions. 1. Add the IntervalType data type. 2. Add parser support for SQL expressions of the form {code} INTERVAL [number] [unit] {code} where unit can be YEAR[S], MONTH[S], WEEK[S], DAY[S], HOUR[S], MINUTE[S], SECOND[S], MILLISECOND[S], MICROSECOND[S], or NANOSECOND[S]. 3. Add a check in the analyzer to make sure we throw an exception to prevent saving a dataframe/table with IntervalType out to external systems. Related Hive ticket: https://issues.apache.org/jira/browse/HIVE-9792 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
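A hedged sketch of how the proposed literal syntax from item 2 might appear in queries; the table and column names are made up, and sqlContext is assumed to be an existing SQLContext.
{code}
// Hedged sketch of the proposed INTERVAL literal grammar; names are made up.
val shifted = sqlContext.sql("SELECT ts + INTERVAL 3 DAYS FROM events")
val ninety  = sqlContext.sql("SELECT INTERVAL 90 MINUTES")
{code}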
[jira] [Commented] (SPARK-3164) Store DecisionTree Split.categories as Set
[ https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619215#comment-14619215 ] Rekha Joshi commented on SPARK-3164: It's all good, but thanks for taking the feedback, [~josephkb]. Overall I am delighted by responsive, helpful reviewers :-) Store DecisionTree Split.categories as Set -- Key: SPARK-3164 URL: https://issues.apache.org/jira/browse/SPARK-3164 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.4.0 Improvement: computation For categorical features with many categories, it could be more efficient to store Split.categories as a Set, not a List. (It is currently a List.) A Set might be more scalable (for log n lookups), though tests would need to be done to ensure that Sets do not incur too much more overhead than Lists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
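The trade-off described in the issue can be sketched directly; this is a toy illustration, not actual DecisionTree code.
{code}
// Toy illustration of the proposed change, not actual DecisionTree code.
val categoriesList: List[Double] = List(0.0, 2.0, 5.0)
val categoriesSet: Set[Double] = categoriesList.toSet

categoriesList.contains(5.0) // linear scan through the list
categoriesSet.contains(5.0)  // hashed (or log n for tree-based sets) lookup
{code}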
[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet
[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619246#comment-14619246 ] Matt Massie commented on SPARK-7263: Also, let me know if you'd like me to split the PR into (a) Spark changes and (b) the Parquet implementation. Add new shuffle manager which stores shuffle blocks in Parquet -- Key: SPARK-7263 URL: https://issues.apache.org/jira/browse/SPARK-7263 Project: Spark Issue Type: New Feature Components: Block Manager Reporter: Matt Massie I have a working prototype of this feature that can be viewed at https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1 Setting the spark.shuffle.manager to parquet enables this shuffle manager. The dictionary support that Parquet provides appreciably reduces the amount of memory that objects use; however, once Parquet data is shuffled, all the dictionary information is lost and the column-oriented data is written to shuffle blocks in a record-oriented fashion. This shuffle manager addresses this issue by reading and writing all shuffle blocks in the Parquet format. If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a Parquet schema and used directly; otherwise, the Parquet schema is generated via reflection. Currently, the only non-Avro keys supported are primitive types. The reflection code can be improved (or replaced) to support complex records. The ParquetShufflePair class allows the shuffle key and value to be stored in Parquet blocks as a single record with a single schema. This commit adds the following new Spark configuration options:
spark.shuffle.parquet.compression - sets the Parquet compression codec
spark.shuffle.parquet.blocksize - sets the Parquet block size
spark.shuffle.parquet.pagesize - sets the Parquet page size
spark.shuffle.parquet.enabledictionary - turns dictionary encoding on/off
Parquet does not (and has no plans to) support a streaming API. Metadata sections are scattered through a Parquet file, making a streaming API difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch the entire contents of map outputs into temporary blocks before loading the data into the reducer. Interesting future asides:
o There is no need to define a data serializer (although Spark requires it)
o Parquet supports predicate pushdown and projection, which could be used between shuffle stages to improve performance in the future
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8905) Spark Streaming receiving socket data sporadically on Mesos 0.22.1
[ https://issues.apache.org/jira/browse/SPARK-8905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14619256#comment-14619256 ] Brandon Bradley commented on SPARK-8905: I also tested Spark 1.3.1 against Mesos 0.22.1. Both examples are still exhibiting the same behavior. Spark Streaming receiving socket data sporadically on Mesos 0.22.1 -- Key: SPARK-8905 URL: https://issues.apache.org/jira/browse/SPARK-8905 Project: Spark Issue Type: Bug Affects Versions: 1.4.0 Reporter: Brandon Bradley Hello! When submitting the PySpark Streaming job network_wordcount.py from the examples to a Mesos cluster, I am encountering a few 'Return message: null' errors and many py4j 'Connection channel' errors afterwards. If I let the job run for a long time, there are more py4j errors than my ~2000 line buffer can hold. {code} 15/07/08 14:27:16 ERROR JobScheduler: Error running job streaming job 1436383595000 ms.0 py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: null at py4j.Protocol.getReturnValue(Protocol.java:417) at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113) at com.sun.proxy.$Proxy14.call(Unknown Source) at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/07/08 14:27:16 ERROR JobScheduler: Error running job streaming job 1436383596000 ms.0 py4j.Py4JException: An exception was raised by the Python Proxy. 
Return Message: null at py4j.Protocol.getReturnValue(Protocol.java:417) at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113) at com.sun.proxy.$Proxy14.call(Unknown Source) at org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156) at org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at