[jira] [Resolved] (SPARK-2247) Data frame (or Pandas) like API for structured data
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2247. Resolution: Duplicate Assignee: Reynold Xin Target Version/s: 1.3.0 Data frame (or Pandas) like API for structured data --- Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Assignee: Reynold Xin Labels: features It would be nice to have R- or Python-pandas-like data frames on Spark: 1) to be able to access the RDD data frame from Python with pandas; 2) to be able to access the RDD data frame from R; 3) to be able to access the RDD data frame from Scala's Saddle.
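For context, the round trip being asked for can be sketched with the Python DataFrame API that later shipped in Spark 1.3; the names below (SQLContext, createDataFrame, toPandas) are from that later API, not anything this ticket itself guarantees:
{code}
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "pandas-interop-sketch")
sqlContext = SQLContext(sc)

# pandas -> Spark: distribute a local pandas DataFrame
pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": ["a", "b", "c"]})
sdf = sqlContext.createDataFrame(pdf)

# Spark -> pandas: collect a (small) result back to the driver
result = sdf.filter(sdf.x > 1.0).toPandas()
print(result)
{code}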
[jira] [Updated] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5097: --- Description: SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable. was: SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes to make the SchemaRDD DSL API more usable and stable. Adding data frame APIs to SchemaRDD --- Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable.
[jira] [Commented] (SPARK-2247) Data frame (or Pandas) like API for structured data
[ https://issues.apache.org/jira/browse/SPARK-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265784#comment-14265784 ] Reynold Xin commented on SPARK-2247: OK, I uploaded a design doc at https://issues.apache.org/jira/browse/SPARK-5097 Data frame (or Pandas) like API for structured data --- Key: SPARK-2247 URL: https://issues.apache.org/jira/browse/SPARK-2247 Project: Spark Issue Type: New Feature Components: PySpark, Spark Core, SQL Affects Versions: 1.0.0 Reporter: venu k tangirala Assignee: Reynold Xin Labels: features It would be nice to have R- or Python-pandas-like data frames on Spark: 1) to be able to access the RDD data frame from Python with pandas; 2) to be able to access the RDD data frame from R; 3) to be able to access the RDD data frame from Scala's Saddle.
[jira] [Updated] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5097: --- Description: SchemaRDD, through its DSL, already provides common data frame functionalities. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable. was: SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable. Adding data frame APIs to SchemaRDD --- Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf SchemaRDD, through its DSL, already provides common data frame functionalities. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable.
[jira] [Resolved] (SPARK-4843) Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module
[ https://issues.apache.org/jira/browse/SPARK-4843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4843. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3696 [https://github.com/apache/spark/pull/3696] Squash ExecutorRunnable and ExecutorRunnableUtil hierarchy in yarn module - Key: SPARK-4843 URL: https://issues.apache.org/jira/browse/SPARK-4843 Project: Spark Issue Type: Improvement Reporter: Kostas Sakellis Assignee: Kostas Sakellis Fix For: 1.3.0 ExecutorRunnableUtil is a parent of ExecutorRunnable because of the yarn-alpha and yarn-stable split. Now that yarn-alpha is gone, we can squash the unnecessary hierarchy.
[jira] [Updated] (SPARK-5098) Number of running tasks becomes negative after tasks are lost
[ https://issues.apache.org/jira/browse/SPARK-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-5098: -- Description: 15/01/06 07:26:58 ERROR TaskSchedulerImpl: Lost executor 6 on spark-worker-002.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:26:58 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-002.c.lofty-inn-754.internal:32852] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:26:58 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 55, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 WARN TaskSetManager: Lost task 7.2 in stage 0.0 (TID 52, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 [Stage 0:===(44 + -14) / 40] 15/01/06 07:27:10 ERROR TaskSchedulerImpl: Lost executor 2 on spark-worker-003.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:27:10 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-003.c.lofty-inn-754.internal:39188] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:27:10 WARN TaskSetManager: Lost task 16.1 in stage 0.0 (TID 60, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 [Stage 0:==(45 + -29) / 40] was: 15/01/06 07:26:58 ERROR TaskSchedulerImpl: Lost executor 6 on spark-worker-002.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:26:58 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-002.c.lofty-inn-754.internal:32852] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:26:58 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 55, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 WARN TaskSetManager: Lost task 7.2 in stage 0.0 (TID 52, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 [Stage 0:=(44 + -14) / 40] 15/01/06 07:27:10 ERROR TaskSchedulerImpl: Lost executor 2 on spark-worker-003.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:27:10 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-003.c.lofty-inn-754.internal:39188] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
15/01/06 07:27:10 WARN TaskSetManager: Lost task 16.1 in stage 0.0 (TID 60, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 [Stage 0:=(45 + -29) / 40] Number of running tasks becomes negative after tasks are lost Key: SPARK-5098 URL: https://issues.apache.org/jira/browse/SPARK-5098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Critical 15/01/06 07:26:58 ERROR TaskSchedulerImpl: Lost executor 6 on spark-worker-002.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:26:58 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-002.c.lofty-inn-754.internal:32852] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06
[jira] [Commented] (SPARK-3452) Maven build should skip publishing artifacts people shouldn't depend on
[ https://issues.apache.org/jira/browse/SPARK-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265749#comment-14265749 ] Aniket Bhatnagar commented on SPARK-3452: - I would like this to be revisited. The issue I am facing is that while people may not depend on some modules at compile time, they may depend on them at runtime. For example, I am building a spark server that lets users submit spark jobs using convenient REST endpoints. This used to work great, even in yarn-client mode. However, once I migrated to 1.2.0, this breaks because I can no longer add a dependency from my spark server to the spark-yarn module, which is used when submitting jobs to a YARN cluster. Maven build should skip publishing artifacts people shouldn't depend on --- Key: SPARK-3452 URL: https://issues.apache.org/jira/browse/SPARK-3452 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Patrick Wendell Assignee: Prashant Sharma Priority: Critical Fix For: 1.2.0 I think it's easy to do this by just adding a skip configuration somewhere. We shouldn't be publishing repl, yarn, assembly, tools, repl-bin, or examples.
[jira] [Created] (SPARK-5097) Adding data frame APIs to SchemaRDD
Reynold Xin created SPARK-5097: -- Summary: Adding data frame APIs to SchemaRDD Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes to make the SchemaRDD DSL API more usable and stable.
[jira] [Updated] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5097: --- Attachment: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf Adding data frame APIs to SchemaRDD --- Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf SchemaRDD, through its DSL, already has many of the functionalities provided by common data frame operations. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes to make the SchemaRDD DSL API more usable and stable.
[jira] [Created] (SPARK-5098) Number of running tasks becomes negative after tasks are lost
Davies Liu created SPARK-5098: - Summary: Number of running tasks becomes negative after tasks are lost Key: SPARK-5098 URL: https://issues.apache.org/jira/browse/SPARK-5098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Critical 15/01/06 07:26:58 ERROR TaskSchedulerImpl: Lost executor 6 on spark-worker-002.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:26:58 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-002.c.lofty-inn-754.internal:32852] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:26:58 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 55, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 WARN TaskSetManager: Lost task 7.2 in stage 0.0 (TID 52, spark-worker-002.c.lofty-inn-754.internal): ExecutorLostFailure (executor 6 lost) 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 15/01/06 07:26:58 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 6 [Stage 0:=(44 + -14) / 40] 15/01/06 07:27:10 ERROR TaskSchedulerImpl: Lost executor 2 on spark-worker-003.c.lofty-inn-754.internal: remote Akka client disassociated 15/01/06 07:27:10 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-worker-003.c.lofty-inn-754.internal:39188] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/01/06 07:27:10 WARN TaskSetManager: Lost task 16.1 in stage 0.0 (TID 60, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 WARN TaskSetManager: Lost task 12.0 in stage 0.0 (TID 12, spark-worker-003.c.lofty-inn-754.internal): ExecutorLostFailure (executor 2 lost) 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 15/01/06 07:27:10 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 [Stage 0:=(45 + -29) / 40]
[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265796#comment-14265796 ] Timothy Chen commented on SPARK-5095: - Instead of configuring the number of executors to launch per slave, I think it's more ideal to configure the amount of cpu/mem per executor. My current thoughts for an implementation are to introduce two more configs: spark.mesos.coarse.executors.max -- the maximum number of executors launched per slave (applies to coarse-grained mode); spark.mesos.coarse.cores.max -- the maximum number of cpus to use per executor. Memory is already configurable through spark.executor.memory. With these, you can choose to launch two executors by specifying two max executors and also capping the max cpus to half the amount. These configurations can also fix SPARK-4940. Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse-grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node.
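For illustration, here is how the proposed settings might look from an application. Note that the two spark.mesos.coarse.* keys are only *proposed* in the comment above and do not exist in Spark; spark.executor.memory is a real option:
{code}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.mesos.coarse.executors.max", "2")  # proposed, hypothetical
        .set("spark.mesos.coarse.cores.max", "4")      # proposed, hypothetical
        .set("spark.executor.memory", "8g"))           # existing option
{code}
With two executors per slave, each capped at half the cores, an 8-core slave would run two 4-core JVMs instead of one 8-core JVM, which is the GC-friendly split the comment is after.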
[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265802#comment-14265802 ] Timothy Chen commented on SPARK-4940: - So I assume you're specifying coarse-grained mode, right? And how are the streaming consumers launched? I know that on the scheduler side it is launching spark executors/drivers, and we simply launch one spark executor per slave that runs multiple spark tasks. My assumption was that it is the amount of resources allocated to each slave's executor that is disproportionate. Support more evenly distributing cores for Mesos mode - Key: SPARK-4940 URL: https://issues.apache.org/jira/browse/SPARK-4940 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse-grained mode the spark scheduler simply takes all the resources it can on each node, which can cause uneven distribution based on the resources available on each slave.
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1426#comment-1426 ] Tathagata Das commented on SPARK-4960: -- Both [~c...@koeninger.org]'s and [~jerryshao]'s points are valid. And I looked at the document; it's quite a good proposal. However, there are still some corner cases that are very confusing. What happens when the user accidentally tries to do something like this:
val input = ssc.socketStream(...)
val intercepted = input.interceptor(...)
and then actually uses `input` for further processing? Since the `input` stream gets deregistered, there will not be any data. To avoid this kind of situation, here is a more limited idea. For the generic interceptor pattern applicable to all receivers, let's assume that the function can be of the form T => Iterator[T]. This eliminates the need for changing data types, and probably addresses corner cases like the one I raised. For M Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers.
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264450#comment-14264450 ] Tathagata Das commented on SPARK-4960: -- [~ted.m] This interceptor pattern discussion should be of interest to you! Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers.
[jira] [Comment Edited] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1426#comment-1426 ] Tathagata Das edited comment on SPARK-4960 at 1/5/15 10:15 AM: --- Both [~c...@koeninger.org]'s and [~jerryshao]'s points are valid. And I looked at the document; it's quite a good proposal. However, there are still some corner cases that are very confusing. What happens when the user accidentally tries to do something like this:
val input = ssc.socketStream(...)
val intercepted = input.interceptor(...)
and then actually uses `input` for further processing? Since the `input` stream gets deregistered, there will not be any data. To avoid this kind of situation, here is a more limited idea. For the generic interceptor pattern applicable to all receivers, let's assume that the function can be of the form T => Iterator[T]. This eliminates the need for changing data types, and probably addresses corner cases like the one I raised. We can leave the general case of T => Iterator[M] for users to implement in their own receivers, in the same way as [~jerryshao] has suggested in his doc. That is quite hacky (type-casting Receiver[T] to Receiver[M]), so it's probably not the best to have that available by default with Spark Streaming. was (Author: tdas): Both [~c...@koeninger.org]'s and [~jerryshao]'s points are valid. And I looked at the document; it's quite a good proposal. However, there are still some corner cases that are very confusing. What happens when the user accidentally tries to do something like this:
val input = ssc.socketStream(...)
val intercepted = input.interceptor(...)
and then actually uses `input` for further processing? Since the `input` stream gets deregistered, there will not be any data. To avoid this kind of situation, here is a more limited idea. For the generic interceptor pattern applicable to all receivers, let's assume that the function can be of the form T => Iterator[T]. This eliminates the need for changing data types, and probably addresses corner cases like the one I raised. For M Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers.
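The restricted form TDas proposes (a function of type T => Iterator[T]) is easy to model outside Spark; a minimal Python sketch, independent of the actual receiver API:
{code}
def apply_interceptor(records, intercept):
    """Apply an interceptor of the restricted form T => Iterator[T]:
    each record may be dropped (empty iterator), kept, or expanded,
    but its element type never changes -- which is the point of the
    restriction."""
    for record in records:
        for out in intercept(record):
            yield out

# Example: drop empty lines, split the rest into words.
words = list(apply_interceptor(["a b", "", "c"],
                               lambda s: iter(s.split())))
print(words)  # ['a', 'b', 'c']
{code}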
[jira] [Commented] (SPARK-4908) Spark SQL built for Hive 13 fails under concurrent metadata queries
[ https://issues.apache.org/jira/browse/SPARK-4908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264469#comment-14264469 ] Cheng Lian commented on SPARK-4908: --- Would like to add a comment about the root cause of this issue. When serving a HiveQL query, Spark SQL's {{HiveContext.runHive}} method gets an {{org.apache.hadoop.hive.ql.Driver}} instance via {{CommandProcessorFactory.get}}, which creates and caches {{Driver}} instances. In the case of {{HiveThriftServer2}}, {{HiveContext.runHive}} is called by multiple threads owned by a threaded executor of the Thrift server. However, {{Driver}} is not thread safe, and the cached {{Driver}} instance can be accessed by multiple threads, which causes the problem. PR #3834 fixes this issue by synchronizing {{HiveContext.runHive}}, which is valid. On the other hand, HiveServer2 actually creates a new {{Driver}} instance for every served SQL query when initializing a {{SQLOperation}}. [~dyross] When built against Hive 0.12.0, Spark SQL 1.2.0 also suffers from this issue. The snippet doesn't show this because the Hive 0.12.0 JDBC driver doesn't execute a {{USE db}} statement to switch the current database even if the JDBC connection URL specifies a database name. If you replace the lines in the {{try}} block with:
{code}
val conn = DriverManager.getConnection(url)
val stmt = conn.createStatement()
stmt.execute("use hello;")
stmt.close()
println("Finished: " + i)
{code}
you'll see exactly the same exceptions. Spark SQL built for Hive 13 fails under concurrent metadata queries --- Key: SPARK-4908 URL: https://issues.apache.org/jira/browse/SPARK-4908 Project: Spark Issue Type: Bug Components: SQL Reporter: David Ross Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.0, 1.2.1 We are on trunk: {{1.3.0-SNAPSHOT}}, as of this commit: https://github.com/apache/spark/commit/3d0c37b8118f6057a663f959321a79b8061132b6 We are using Spark built for Hive 13, using this option: {{-Phive-0.13.1}} In single-threaded mode, normal operations look fine. However, under concurrency, with at least 2 concurrent connections, metadata queries fail. For example, {{USE some_db}}, {{SHOW TABLES}}, and the implicit {{USE}} statement when you pass a default schema in the JDBC URL, all fail. {{SELECT}} queries like {{SELECT * FROM some_table}} do not have this issue. Here is some example code:
{code}
object main extends App {
  import java.sql._
  import scala.concurrent._
  import scala.concurrent.duration._
  import scala.concurrent.ExecutionContext.Implicits.global

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val host = "localhost" // update this
  val url = s"jdbc:hive2://${host}:10511/some_db" // update this
  val future = Future.traverse(1 to 3) { i => Future {
    println("Starting: " + i)
    try {
      val conn = DriverManager.getConnection(url)
    } catch {
      case e: Throwable =>
        e.printStackTrace()
        println("Failed: " + i)
    }
    println("Finishing: " + i)
  }}
  Await.result(future, 2.minutes)
  println("done!")
}
{code}
Here is the output:
{code}
Starting: 1 Starting: 3 Starting: 2 java.sql.SQLException: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Operation cancelled at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:121) at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:109) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:231) at org.apache.hive.jdbc.HiveConnection.configureConnection(HiveConnection.java:451) at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:195) at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:270) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply$mcV$sp(ConnectionManager.scala:896) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at com.atscale.engine.connection.pool.main$$anonfun$30$$anonfun$apply$2.apply(ConnectionManager.scala:893) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at
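The race Cheng Lian describes, and the synchronization fix, can be modeled in a few lines. This is a Python analogue for illustration only; the real fix in PR #3834 synchronizes the Scala method HiveContext.runHive, and FakeDriver below is a made-up stand-in:
{code}
import threading

class FakeDriver:
    """Stand-in for Hive's non-thread-safe Driver (illustrative only)."""
    def __init__(self):
        self.current_db = "default"

    def run(self, sql):
        if sql.upper().startswith("USE "):
            self.current_db = sql.split()[1]
        return self.current_db

_cached_driver = FakeDriver()    # one cached instance shared by all threads
_driver_lock = threading.Lock()

def run_hive(sql):
    # Serializing access prevents concurrent Thrift-server threads from
    # mutating the shared driver's state mid-query.
    with _driver_lock:
        return _cached_driver.run(sql)
{code}
The alternative Cheng Lian points out (what HiveServer2 does) is to construct a fresh driver per query instead of sharing a cached one.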
[jira] [Updated] (SPARK-5068) When the path is not found in HDFS, we can't get the result
[ https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jeanlyn updated SPARK-5068: --- Fix Version/s: (was: 1.2.1) When the path is not found in HDFS, we can't get the result --- Key: SPARK-5068 URL: https://issues.apache.org/jira/browse/SPARK-5068 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn When a partition path is found in the metastore but not in HDFS, it causes problems, as follows:
{noformat}
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
{noformat}
{noformat}
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4
{noformat}
When I run the SQL
{noformat}
select * from partition_test limit 10
{noformat}
in *hive*, I get no problem, but when I run it in *spark-sql* I get the following error:
{noformat}
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.collect(RDD.scala:780) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at org.apache.spark.sql.hive.testpartition$.main(test.scala:23) at org.apache.spark.sql.hive.testpartition.main(test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{noformat}
[jira] [Commented] (SPARK-5066) Cannot get all keys with the same hashcode when reading keys ordered from different streams.
[ https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264508#comment-14264508 ] Sean Owen commented on SPARK-5066: -- I'm not clear what this issue is trying to report. This is code from {{ExternalAppendOnlyMap}}, right? The javadoc says:
{code}
 * Fill a buffer with the next set of keys with the same hash code from a given iterator. We
 * read streams one hash code at a time to ensure we don't miss elements when they are merged.
 *
 * Assumes the given iterator is in sorted order of hash code.
{code}
The behavior and code you describe seem correct then. k4 and k5 would be read from the stream for file 2 first, since they have the lowest hashes. Next, k1 would be read from both files. Where are you saying that this breaks down? Cannot get all keys with the same hashcode when reading keys ordered from different streams. --- Key: SPARK-5066 URL: https://issues.apache.org/jira/browse/SPARK-5066 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: DoingDone9 Priority: Critical When spilling is enabled, data ordered by hash code will be spilled to disk. We need to get all keys that have the same hash code from the different tmp files when merging values, but it just reads the keys that have the minimum hash code within a tmp file, so we cannot read all keys. Example: if file1 has [k1, k2, k3] and file2 has [k4, k5, k1], and hashcode of k4 < hashcode of k5 < hashcode of k1 < hashcode of k2 < hashcode of k3, we just read k1 from file1 and k4 from file2. We cannot read all k1 entries. Code:
{code}
private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered)

inputStreams.foreach { it =>
  val kcPairs = new ArrayBuffer[(K, C)]
  readNextHashCode(it, kcPairs)
  if (kcPairs.length > 0) {
    mergeHeap.enqueue(new StreamBuffer(it, kcPairs))
  }
}

private def readNextHashCode(it: BufferedIterator[(K, C)], buf: ArrayBuffer[(K, C)]): Unit = {
  if (it.hasNext) {
    var kc = it.next()
    buf += kc
    val minHash = hashKey(kc)
    while (it.hasNext && it.head._1.hashCode() == minHash) {
      kc = it.next()
      buf += kc
    }
  }
}
{code}
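Sean's reading can be checked with a small, self-contained model of the merge. This is not Spark's code, just the same idea with integers standing in for hash codes:
{code}
import heapq
from itertools import groupby

def merge_by_hash(*streams):
    """Each stream yields (hash, key) pairs sorted by hash. Pairs are
    merged across streams and grouped one hash code at a time."""
    merged = heapq.merge(*streams, key=lambda kv: kv[0])
    for h, group in groupby(merged, key=lambda kv: kv[0]):
        yield h, [k for _, k in group]

# file1 = [k1, k2, k3], file2 = [k4, k5, k1], with
# hash(k4) < hash(k5) < hash(k1) < hash(k2) < hash(k3)
file1 = [(3, "k1"), (4, "k2"), (5, "k3")]
file2 = [(1, "k4"), (2, "k5"), (3, "k1")]
for h, keys in merge_by_hash(iter(file1), iter(file2)):
    print(h, keys)
# 1 ['k4']
# 2 ['k5']
# 3 ['k1', 'k1']   <- both k1 entries surface together
# 4 ['k2']
# 5 ['k3']
{code}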
[jira] [Resolved] (SPARK-5082) Minor typo in the Tuning Spark document about Data Serialization
[ https://issues.apache.org/jira/browse/SPARK-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5082. -- Resolution: Not a Problem This was already fixed by https://github.com/apache/spark/commit/4da1039840182e8e8bc836b89cda7b77fe7356d9 Minor typo in the Tuning Spark document about Data Serialization Key: SPARK-5082 URL: https://issues.apache.org/jira/browse/SPARK-5082 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: yangping wu Priority: Trivial Original Estimate: 8h Remaining Estimate: 8h The latest documentation for *Tuning Spark* (http://spark.apache.org/docs/latest/tuning.html) has an erroneous entry in the *Data Serialization* section:
{code}
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Seq(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
{code}
The *registerKryoClasses* method's parameter type should be *Array[Class[_]]*, not *Seq[Class[_]]*. The correct snippet is:
{code}
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
{code}
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264527#comment-14264527 ] Sean Owen commented on SPARK-1529: -- OK, but, why do these files *have* to go on a non-local disk? It sounds like you're saying Spark doesn't work at all on MapR right now, but that can't be the case. They *can* go on a non-local disk, I'm sure. What's the value of that, given that Spark is transporting the files itself? Still, as you say, this proprietary setup works already through the java.io+NFS and HDFS APIs, with no change. If it's just not as fast, is that a problem that Spark should be solving? Just don't do it. Or it's up to the vendor to optimize. Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Cheng Lian In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location.
[jira] [Commented] (SPARK-5073) spark.storage.memoryMapThreshold has two default values
[ https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264535#comment-14264535 ] Sean Owen commented on SPARK-5073: -- The documentation suggests that 8192 is the intended default, and that's consistent with the javadoc that says this should be a limit near the OS page size, which is indeed, I think, 4KB on modern OSes (?). So open a PR to fix the default in {{TransportConf}}? spark.storage.memoryMapThreshold has two default values - Key: SPARK-5073 URL: https://issues.apache.org/jira/browse/SPARK-5073 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Yuan Jianhui Priority: Minor In org.apache.spark.storage.DiskStore:
{code}
val minMemoryMapBytes = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
{code}
In org.apache.spark.network.util.TransportConf:
{code}
public int memoryMapBytes() {
  return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024);
}
{code}
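Until the two defaults are reconciled, an application can sidestep the ambiguity by setting the property explicitly. A sketch; the 8192 value follows the documented default mentioned above:
{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-map-threshold-example")
        # Pin the threshold so DiskStore and TransportConf agree.
        .set("spark.storage.memoryMapThreshold", "8192"))
sc = SparkContext(conf=conf)
{code}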
[jira] [Updated] (SPARK-5009) allCaseVersions function in SqlLexical leads to StackOverflow Exception
[ https://issues.apache.org/jira/browse/SPARK-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shengli updated SPARK-5009: --- Description: Recently I found a bug while adding a new feature to SqlParser, which is: if I define a Keyword that has a long name, like
{code}
protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")
{code}
then, since the all-case-versions enumeration is implemented by a recursive function, when the implicit {{asParser}} function is called and stack memory is very small, it leads to a StackOverflowError:
{code}
java.lang.StackOverflowError at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
{code}
was: Recently I found a bug while adding a new feature to SqlParser, which is: if I define a Keyword that has a long name, like {{protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")}}, then, since the all-case-versions enumeration is implemented by a recursive function, when the implicit {{asParser}} function is called and stack memory is very small, it leads to a StackOverflowError. allCaseVersions function in SqlLexical leads to StackOverflow Exception - Key: SPARK-5009 URL: https://issues.apache.org/jira/browse/SPARK-5009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.0 Reporter: shengli Fix For: 1.3.0, 1.2.1 Original Estimate: 96h Remaining Estimate: 96h Recently I found a bug while adding a new feature to SqlParser, which is: if I define a Keyword that has a long name, like {{protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")}}, then, since the all-case-versions enumeration is implemented by a recursive function, when the implicit {{asParser}} function is called and stack memory is very small, it leads to a StackOverflowError:
{code}
java.lang.StackOverflowError at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
{code}
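For what it's worth, the case-version enumeration itself does not need recursion. A small Python sketch of an iterative equivalent (not Spark's code) also shows why long keywords are expensive:
{code}
from itertools import product

def all_case_versions(word):
    # Enumerate every upper/lower-case variant of `word` iteratively,
    # so stack depth stays constant regardless of keyword length.
    for letters in product(*[(c.lower(), c.upper()) for c in word]):
        yield "".join(letters)

versions = list(all_case_versions("SERDEPROPERTIES"))
print(len(versions))  # 2 ** 15 = 32768 variants for a 15-letter keyword
{code}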
[jira] [Commented] (SPARK-5073) spark.storage.memoryMapThreshold has two default values
[ https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264554#comment-14264554 ] Apache Spark commented on SPARK-5073: - User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/3900 spark.storage.memoryMapThreshold has two default values - Key: SPARK-5073 URL: https://issues.apache.org/jira/browse/SPARK-5073 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Yuan Jianhui Priority: Minor In org.apache.spark.storage.DiskStore:
{code}
val minMemoryMapBytes = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
{code}
In org.apache.spark.network.util.TransportConf:
{code}
public int memoryMapBytes() {
  return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024);
}
{code}
[jira] [Commented] (SPARK-4940) Document or Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264579#comment-14264579 ] Gerard Maas commented on SPARK-4940: From the perspective of evenly allocating Spark Streaming consumers (network-bound), the ideal solution would be to explicitly set the number of hosts. With the current resource allocation policy, we can have e.g. (4),(1),(1) consumers over 3 hosts, instead of the ideal (2),(2),(2). Given that the resource allocation is dynamic at job startup time, this results in variable performance characteristics for the job being submitted. In practice, we have been restarting the job (using Marathon) until we get a favorable resource allocation. Not sure how well the requirement of a fixed number of executors would fit with the node transparency offered by Mesos. I'm just trying to elaborate on the requirements from the Spark Streaming job perspective. Document or Support more evenly distributing cores for Mesos mode - Key: SPARK-4940 URL: https://issues.apache.org/jira/browse/SPARK-4940 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse-grained mode the spark scheduler simply takes all the resources it can on each node, which can cause uneven distribution based on the resources available on each slave.
[jira] [Commented] (SPARK-5073) spark.storage.memoryMapThreshold has two default values
[ https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264558#comment-14264558 ] Kai Sasaki commented on SPARK-5073: --- I did not notice the above comment. Sorry, I've just created a PR for this issue. spark.storage.memoryMapThreshold has two default values - Key: SPARK-5073 URL: https://issues.apache.org/jira/browse/SPARK-5073 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Yuan Jianhui Priority: Minor In org.apache.spark.storage.DiskStore:
{code}
val minMemoryMapBytes = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
{code}
In org.apache.spark.network.util.TransportConf:
{code}
public int memoryMapBytes() {
  return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 1024);
}
{code}
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264643#comment-14264643 ] Matthew Cornell commented on SPARK-4897: Please!! Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc., and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}:
{code}
[joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
Python 3.4.2 (default, Oct 19 2014, 17:52:17)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in <module>
    import pyspark
  File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in <module>
    from pyspark.context import SparkContext
  File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in <module>
    from pyspark import accumulators
  File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in <module>
    from pyspark.cloudpickle import CloudPickler
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in <module>
    class CloudPickler(pickle.Pickler):
  File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler
    dispatch = pickle.Pickler.dispatch.copy()
AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
{code}
This code looks like it will be difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization.
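The {{dispatch}} failure is easy to reproduce outside Spark: in CPython 3, {{pickle.Pickler}} is the C-accelerated {{_pickle.Pickler}}, which has no class-level dispatch table, while the pure-Python implementation still does. A quick check (behavior as of CPython 3.4, per the traceback above):
{code}
import pickle

print(hasattr(pickle.Pickler, "dispatch"))   # False: C implementation
print(hasattr(pickle._Pickler, "dispatch"))  # True: pure-Python fallback
{code}
So cloudpickle's {{pickle.Pickler.dispatch.copy()}} would have to target the pure-Python pickler or be rewritten, which is part of why switching serializers is being floated.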
[jira] [Updated] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Haberman updated SPARK-5085: Description: In Spark 1.2.0, the netty backend is causing our report's cluster to lock up with connection timeouts, ~75% of the way through the job. It happens with both the external shuffle server and the non-external/executor-hosted shuffle server, but if I change the shuffle service from netty to nio, it immediately works. Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster have basically the same pattern of ~15m of silence then timeouts): {code} // lots of log output, doing fine... 15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056 // note 15m of silence here... 
15/01/03 05:48:13 WARN [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780 java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812408, chunkIndex=52}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/2d/shuffle_4_520_0.data, offset=2214139, length=20607}} to /10.33.166.218:42780; closing connection java.nio.channels.ClosedChannelException 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812408, chunkIndex=53}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/10/shuffle_4_524_0.data, offset=2215548, length=23998}} to /10.33.166.218:42780; closing connection java.nio.channels.ClosedChannelException 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812408, chunkIndex=54}, buffer=FileSegmentManagedBuffer{file=/mnt/spark/local/spark-local-20150103040327-4f92/32/shuffle_4_532_0.data, offset=2248230, length=20580}} to /10.33.166.218:42780; closing connection java.nio.channels.ClosedChannelException // lots more of these... {code} Note how, up through 5:33, everything was fine, then after ~15 minutes of silence, at 5:48, the shuffle-server connection times out, and all of that server-7's requests fail. Here is shuffle-server-1 from the same stdout (with 1 last ClosedChannelException
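The reporter's workaround (falling back from netty to nio) is a one-line configuration change. A sketch, assuming the Spark 1.2-era key spark.shuffle.blockTransferService (removed in later releases):
{code}
from pyspark import SparkConf

conf = (SparkConf()
        # Work around the timeouts by reverting to the NIO-based
        # transfer service while the netty issue is investigated.
        .set("spark.shuffle.blockTransferService", "nio"))
{code}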
[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-5089: -- Description: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (the cast is not in place):
{code:python}
if ar.dtype != np.float64:
    ar.astype(np.float64)
{code}
Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results:
{code:python}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
5 # should be 10!
{code}
But this works fine:
{code:python}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
10 # this is correct
{code}
The fix is trivial, I'll submit a PR shortly. was: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (the cast is not in place). `` if ar.dtype != np.float64: ar.astype(np.float64) `` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. Vector conversion broken for non-float64 arrays --- Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (the cast is not in place):
{code:python}
if ar.dtype != np.float64:
    ar.astype(np.float64)
{code}
Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results:
{code:python}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
5 # should be 10!
{code}
But this works fine:
{code:python}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
10 # this is correct
{code}
The fix is trivial, I'll submit a PR shortly.
[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-5089: -- Description: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (casting is not inplace). {code:none} if ar.dtype != np.float64: ar.astype(np.float64) {code} Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: {code:none} from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! {code} But this works fine: {code:none} data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct {code} The fix is trivial, I'll submit a PR shortly. was: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (casting is not inplace). {code:python} if ar.dtype != np.float64: ar.astype(np.float64) {code} Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: {code:python} from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! {code} But this works fine: {code:python} data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct {code} The fix is trivial, I'll submit a PR shortly. Vector conversion broken for non-float64 arrays --- Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (casting is not inplace). {code:none} if ar.dtype != np.float64: ar.astype(np.float64) {code} Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: {code:none} from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! 
{code} But this works fine: {code:none} data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct {code} The fix is trivial, I'll submit a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5089) Vector conversion broken for non-float64 arrays
Jeremy Freeman created SPARK-5089: - Summary: Vector conversion broken for non-float64 arrays Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). ``` if ar.dtype != np.float64: ar.astype(np.float64) ``` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-5089: -- Description: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). `` if ar.dtype != np.float64: ar.astype(np.float64) `` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. was: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). ``` if ar.dtype != np.float64: ar.astype(np.float64) ``` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. Vector conversion broken for non-float64 arrays --- Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). `` if ar.dtype != np.float64: ar.astype(np.float64) `` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
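For reference, the fix implied by the descriptions above is simply to assign the result of {{astype}}: numpy's {{astype}} returns an upcast copy rather than casting in place, so the unassigned call is a no-op. A minimal sketch of the broken and corrected lines (the surrounding PySpark conversion code is omitted):

{code:python}
import numpy as np

ar = np.random.randn(100, 10).astype('float32')

# buggy: astype returns a new array, so the result is silently discarded
if ar.dtype != np.float64:
    ar.astype(np.float64)
assert ar.dtype == np.float32  # still float32

# fixed: rebind the name to the upcast copy
if ar.dtype != np.float64:
    ar = ar.astype(np.float64)
assert ar.dtype == np.float64
{code}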
[jira] [Commented] (SPARK-4898) Replace cloudpickle with Dill
[ https://issues.apache.org/jira/browse/SPARK-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264801#comment-14264801 ] Nicholas Chammas commented on SPARK-4898: - cc [~davies] Replace cloudpickle with Dill - Key: SPARK-4898 URL: https://issues.apache.org/jira/browse/SPARK-4898 Project: Spark Issue Type: Bug Components: PySpark Reporter: Josh Rosen We should consider replacing our modified version of {{cloudpickle}} with [Dill|https://github.com/uqfoundation/dill], since it supports both Python 2 and 3 and might do a better job of handling certain corner-cases. I attempted to do this a few months ago but ran into cases where Dill had issues pickling objects defined in doctests, which broke our test suite: https://github.com/uqfoundation/dill/issues/50. This issue may have been resolved now; I haven't checked. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
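For context on why Dill (or Spark's modified cloudpickle) is needed at all: the standard library {{pickle}} cannot serialize the closures PySpark routinely ships to workers. A minimal illustration, assuming {{dill}} is installed (this is not Spark code):

{code:python}
import pickle
import dill

f = lambda x: x * 2

# the standard library refuses lambdas, since unpickling relies on
# looking the function up by name
try:
    pickle.dumps(f)
except Exception as e:
    print("pickle failed:", e)

# dill serializes the function body itself
g = dill.loads(dill.dumps(f))
print(g(21))  # 42
{code}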
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265228#comment-14265228 ] Hari Shreedharan commented on SPARK-4905: - I can't reproduce this - but once Jenkins is back, I will look at the logs to see if there is any info there. Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at
[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265194#comment-14265194 ] Stephen Haberman commented on SPARK-5085: - I think I've found a good clue: {code} [ 2527.610744] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2555.073922] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2606.652438] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2615.427676] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2626.008450] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2742.996355] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2744.434263] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2744.623440] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.204023] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.470360] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.517182] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.616516] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2818.547464] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2824.525844] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2831.868035] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2833.644154] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2895.396963] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2896.939451] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2901.464337] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2902.461459] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2904.840728] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2908.156252] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2908.925033] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2933.240541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2975.218843] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2975.333279] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2980.533872] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2984.017055] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2984.107575] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2991.252054] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2993.965474] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3000.521793] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3057.080236] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3067.674541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3077.984465] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3139.590085] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3140.145975] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3217.729824] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3249.614154] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3252.775976] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3261.234940] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3302.538848] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3325.811720] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3332.873067] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3340.724759] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3349.646235] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3354.857573] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3361.728122] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3373.623622] xen_netfront: xennet: skb 
rides the rocket: 19 slots [ 3384.29] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3394.701554] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3402.048682] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3408.972487] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3415.697781] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3415.746289] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3428.234060] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3438.317541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3467.061761] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3592.827300] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3598.320551] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3601.487113] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3636.656200] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3670.347676] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3672.573875] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3720.392902] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3748.385374] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3763.997229] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3776.472560] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3783.343165] xen_netfront: xennet: skb rides the rocket: 19 slots [
[jira] [Commented] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265267#comment-14265267 ] Matt Cheah commented on SPARK-4737: --- I will be out of the office with limited access to e-mail from January 05 to January 06. If there are specifically urgent matters requiring my assistance and you have other means of contacting me, please use those other channels. Sorry for the inconvenience. Thanks, -Matt Cheah Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager, the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate - i.e. an RDD can have one partition that can serialize cleanly but not others. To do this in the proper way we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
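To make the failure mode in SPARK-4737 concrete: probing one partition can succeed while another still fails to serialize, so the error must be caught where serialization actually happens. A hypothetical Python illustration (not the DAGScheduler code; {{threading.Lock}} stands in for any unpicklable object):

{code:python}
import pickle
import threading

# partition 0 serializes cleanly; partition 1 holds an unpicklable lock
partitions = [[1, 2, 3], [threading.Lock()]]

# probing only the first partition reports success...
pickle.dumps(partitions[0])

# ...so the exception has to be caught at actual serialization time
for i, part in enumerate(partitions):
    try:
        pickle.dumps(part)
    except Exception as e:
        print("partition %d failed to serialize: %s" % (i, e))
{code}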
[jira] [Updated] (SPARK-5055) Minor typos on the downloads page
[ https://issues.apache.org/jira/browse/SPARK-5055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neelesh Srinivas Salian updated SPARK-5055: --- Attachment: Spark_DownloadsPage_FixedTypos.html Here is the HTML page with the typos corrected. Please let me know if this is the right way to do it. Thank you. Minor typos on the downloads page - Key: SPARK-5055 URL: https://issues.apache.org/jira/browse/SPARK-5055 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.0 Reporter: Marko Bonaci Priority: Trivial Labels: documentation, newbie Attachments: Spark_DownloadsPage_FixedTypos.html Original Estimate: 1m Remaining Estimate: 1m The _Downloads_ page uses the word "Chose" where the present tense is intended. It should say "Choose". http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers
[ https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265290#comment-14265290 ] Matthew Farrellee commented on SPARK-927: - PR #2313 was subsumed by PR #3351, which resolved SPARK-4477 and this issue; the resolution was to remove the use of numpy altogether. PySpark sample() doesn't work if numpy is installed on master but not on workers Key: SPARK-927 URL: https://issues.apache.org/jira/browse/SPARK-927 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2 Reporter: Josh Rosen Assignee: Matthew Farrellee Priority: Minor PySpark's sample() method crashes with ImportErrors on the workers if numpy is installed on the driver machine but not on the workers. I'm not sure what's the best way to fix this. A general mechanism for automatically shipping libraries from the master to the workers would address this, but that could be complicated to implement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
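The resolution above removed numpy from {{sample()}} entirely; the general pattern for tolerating an optional dependency on the workers looks like the following sketch (not the actual PySpark code):

{code:python}
# import-time fallback: prefer numpy's RNG when available, use the
# stdlib otherwise, so the same code runs whether or not workers have numpy
try:
    import numpy
    def uniform():
        return float(numpy.random.random())
except ImportError:
    import random
    def uniform():
        return random.random()

print(uniform())
{code}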
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265157#comment-14265157 ] Michael Armbrust commented on SPARK-4943: - I wouldn't say that the notion of {{tableIdentifier: Seq\[String\]}} is unclear. Instead I would say that it is deliberately unspecified in order to be flexible. Some systems have {{clusters}}, some systems have {{databases}}, some systems have {{schema}}, some have {{tables}}. Thus, this API gives us one interface for communicating between the parser and the underlying datastore that makes no assumption about how that datastore is laid out. If we do make this change then yes, I agree that we should also make it in the catalog. In general our handling of this has always been a little clunky, since there is a whole bunch of code that just ignores the database field. One question is which parts of the table identifier Spark SQL should handle and which parts we should pass on to the datasource. A goal here should be to be able to connect to and join data from multiple sources. Here is what I would propose as an addition to the current API, which only lets you register individual tables: - Users can register external catalogs which are responsible for producing {{BaseRelation}}s. - Each external catalog has a user-specified name that is given when registering. - There is a notion of the current catalog, which can be changed with {{USE}}. By default, we pass all the {{tableIdentifiers}} to this default catalog, and it's up to it to determine what each part means. - Users can also specify fully qualified tables when joining multiple data sources. We detect this case when the first {{tableIdentifier}} matches one of the registered catalogs. In this case we strip off the catalog name and pass the remaining {{tableIdentifiers}} to the specified catalog. What do you think? Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It worked in Spark 1.1.0. Basically we use a table name containing a dot to include the database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.'
found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at
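Reconstructed from the trace above, the failing call has the following shape as a PySpark-style sketch (the report uses {{CassandraSQLContext}}; treating the stock {{SQLContext}} parser as failing identically is an assumption here):

{code:python}
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

# in 1.2.0 the dotted database.table qualifier trips the parser before
# any catalog lookup happens:
#   java.lang.RuntimeException: failure: ``UNION'' expected but `.' found
sqlContext.sql(
    "SELECT test1.a FROM sql_test.test1 AS test1 "
    "UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2")
{code}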
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265212#comment-14265212 ] Alex Liu commented on SPARK-4943: - The catalog part of the table identifier should be handled by Spark SQL, which calls the registered catalog Context to connect to the underlying datasources. cluster/database/schema/table should be handled by the datasource (the Cassandra Spark SQL integration can handle cluster-, database- and table-level joins). e.g. {code} SELECT test1.a, test1.b, test2.c FROM cassandra.cluster.database.table1 AS test1 LEFT OUTER JOIN mySql.cluster.database.table2 AS test2 ON test1.a = test2.a {code} so cluster.database.table1 is passed to the cassandra catalog datasource: cassandra is handled by Spark SQL, which calls cassandraContext, which then calls the underlying datasource. cluster.database.table2 is passed to the mySql catalog datasource: mySql is handled by Spark SQL, which calls the mySqlContext, which then calls the underlying datasource. If the USE command is used, then all tableIdentifiers are passed to the datasource. e.g. {code} USE cassandra SELECT test1.a, test1.b, test2.c FROM cluster1.database.table AS test1 LEFT OUTER JOIN cluster2.database.table AS test2 ON test1.a = test2.a {code} cluster1.database.table and cluster2.database.table are passed to the cassandra datasource Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It worked in Spark 1.1.0. Basically we use a table name containing a dot to include the database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.'
found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53) [info] at
[jira] [Resolved] (SPARK-927) PySpark sample() doesn't work if numpy is installed on master but not on workers
[ https://issues.apache.org/jira/browse/SPARK-927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Farrellee resolved SPARK-927. - Resolution: Fixed Fix Version/s: 1.2.0 PySpark sample() doesn't work if numpy is installed on master but not on workers Key: SPARK-927 URL: https://issues.apache.org/jira/browse/SPARK-927 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.8.0, 0.9.1, 1.0.2, 1.1.2 Reporter: Josh Rosen Assignee: Matthew Farrellee Priority: Minor Fix For: 1.2.0 PySpark's sample() method crashes with ImportErrors on the workers if numpy is installed on the driver machine but not on the workers. I'm not sure what's the best way to fix this. A general mechanism for automatically shipping libraries from the master to the workers would address this, but that could be complicated to implement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265304#comment-14265304 ] Tathagata Das commented on SPARK-4905: -- What is the reason behind such a behavior where the number of records received is same as sent, but all the records are empty? TD Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at
[jira] [Created] (SPARK-5094) Python API for gradient-boosted trees
Xiangrui Meng created SPARK-5094: Summary: Python API for gradient-boosted trees Key: SPARK-5094 URL: https://issues.apache.org/jira/browse/SPARK-5094 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
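No description is attached, but the scope is presumably a PySpark wrapper mirroring the Scala {{GradientBoostedTrees}} API added to MLlib in 1.2. A sketch of what the requested API might look like; the method and parameter names are guesses, not an implemented interface:

{code:python}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import GradientBoostedTrees  # hypothetical at filing time

data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])

# train a small boosted ensemble and score one point
model = GradientBoostedTrees.trainClassifier(
    data, categoricalFeaturesInfo={}, numIterations=10)
print(model.predict([1.0, 0.0]))
{code}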
[jira] [Comment Edited] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265194#comment-14265194 ] Stephen Haberman edited comment on SPARK-5085 at 1/5/15 10:14 PM: -- I think I've found a good clue (from the dmesg logs): {code} [ 2527.610744] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2555.073922] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2606.652438] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2615.427676] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2626.008450] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2742.996355] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2744.434263] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2744.623440] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.204023] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.470360] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.517182] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2745.616516] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2818.547464] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2824.525844] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2831.868035] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2833.644154] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2895.396963] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2896.939451] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2901.464337] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2902.461459] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2904.840728] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2908.156252] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2908.925033] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2933.240541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2975.218843] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2975.333279] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2980.533872] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2984.017055] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2984.107575] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2991.252054] xen_netfront: xennet: skb rides the rocket: 19 slots [ 2993.965474] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3000.521793] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3057.080236] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3067.674541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3077.984465] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3139.590085] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3140.145975] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3217.729824] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3249.614154] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3252.775976] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3261.234940] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3302.538848] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3325.811720] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3332.873067] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3340.724759] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3349.646235] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3354.857573] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3361.728122] xen_netfront: xennet: skb rides the rocket: 19 
slots [ 3373.623622] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3384.29] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3394.701554] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3402.048682] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3408.972487] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3415.697781] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3415.746289] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3428.234060] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3438.317541] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3467.061761] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3592.827300] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3598.320551] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3601.487113] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3636.656200] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3670.347676] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3672.573875] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3720.392902] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3748.385374] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3763.997229] xen_netfront: xennet: skb rides the rocket: 19 slots [ 3776.472560] xen_netfront: xennet: skb rides the rocket: 19 slots [
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265224#comment-14265224 ] Alex Liu commented on SPARK-4943: - For each catalog, the configuration settings should start with catalog name. e.g. {noformat} set cassandra.cluster.database.table.ttl = 1000 set cassandra.database.table.ttl =1000 (default cluster) set mysql.cluster.database.table.xxx = 200 {noformat} If there's no catalog in the setting string, use the default catalog. Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It was working for Spark 1.1.0 version. Basically we use the table name having dot to include database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at scala.Option.getOrElse(Option.scala:120) [info] at 
org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647) [info] at
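The prefix convention proposed above amounts to a two-step lookup; a hypothetical helper sketching it (a plain dict stands in for a real conf object, and all names here are illustrative only):

{code:python}
def get_catalog_setting(conf, catalog, key):
    # look up "catalog.key" first, falling back to the bare key,
    # which covers settings written against the default catalog
    return conf.get("%s.%s" % (catalog, key), conf.get(key))

conf = {"cassandra.cluster.database.table.ttl": "1000"}
print(get_catalog_setting(conf, "cassandra", "cluster.database.table.ttl"))  # 1000
{code}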
[jira] [Updated] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4737: --- Affects Version/s: 1.0.2 1.1.1 Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager, the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate - i.e. an RDD can have one partition that can serialize cleanly but not others. To do this in the proper way we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5093) Make network related timeouts consistent
[ https://issues.apache.org/jira/browse/SPARK-5093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5093. Resolution: Fixed Fix Version/s: 1.3.0 Make network related timeouts consistent Key: SPARK-5093 URL: https://issues.apache.org/jira/browse/SPARK-5093 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 We have a few instances of spark.network.timeout that are using different default settings (e.g. 45s in block manager and 100s in shuffle). This proposes to make them consistent at 120s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
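With the timeouts unified, the per-subsystem values fall back to the single property named in the issue; setting it from user code might look like this (whether the suffixed "120s" string form is accepted in this release is an assumption):

{code:python}
from pyspark import SparkConf, SparkContext

# one knob instead of separate block manager / shuffle timeouts
conf = SparkConf().setAppName("timeouts").set("spark.network.timeout", "120s")
sc = SparkContext(conf=conf)
{code}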
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265322#comment-14265322 ] Hari Shreedharan commented on SPARK-4905: - I am not sure. It might have something to do with the encoding/decoding. The events even have the headers, but the body is empty (there is comparison of the headers for each event) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at
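One mundane way to get exactly this symptom (the right record count and intact headers, but empty bodies) is re-reading a buffer whose position was already consumed. A minimal illustration of that general pattern, offered as a guess consistent with the encoding/decoding hypothesis rather than a diagnosis of the Flume suite:

{code:python}
import io

buf = io.BytesIO(b"event body")
print(buf.read())  # b'event body' -- the position is now at the end
print(buf.read())  # b'' -- a second read of the same buffer is empty
buf.seek(0)        # rewinding restores the data
print(buf.read())  # b'event body'
{code}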
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265335#comment-14265335 ] Michael Armbrust commented on SPARK-4943: - Thanks for your comments Alex. Are you proposing any changes to what I said? Another thing I'm confused about is your comment regarding joins. As of now there is no public API for passing that kind of information down into a datasource. Regarding the configuration. We will pass the datasource a SQLContext and you can do {{.getConf}} using whatever arbitrary string you want. I don't think Spark SQL needs to have any control here. Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It was working for Spark 1.1.0 version. Basically we use the table name having dot to include database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at
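The {{.getConf}} remark in the comment above can be made concrete. A minimal sketch, assuming Spark 1.2's SQLContext {{setConf}}/{{getConf}} methods; the key name {{spark.cassandra.cluster}} is illustrative only, not a real configuration property:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object ConfLookupSketch {
  def main(args: Array[String]): Unit = {
    val sqlContext = new SQLContext(new SparkContext("local", "conf-lookup-sketch"))
    // The user sets any arbitrary string key they like...
    sqlContext.setConf("spark.cassandra.cluster", "cluster1")
    // ...and the datasource, handed the SQLContext, reads it back with a default.
    val cluster = sqlContext.getConf("spark.cassandra.cluster", "default")
    println(cluster) // prints "cluster1"
  }
}
{code}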
[jira] [Commented] (SPARK-5055) Minor typos on the downloads page
[ https://issues.apache.org/jira/browse/SPARK-5055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265331#comment-14265331 ] Sean Owen commented on SPARK-5055: -- Normally the right way to propose a change is to open a pull request against master at github.com/apache/spark. This site is hosted, however, in Apache SVN. You can post a patch against HEAD here for site changes. But, in this case, the change is so trivial that the person who will commit the change will just make the 3 edits and be done. Minor typos on the downloads page - Key: SPARK-5055 URL: https://issues.apache.org/jira/browse/SPARK-5055 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.2.0 Reporter: Marko Bonaci Priority: Trivial Labels: documentation, newbie Attachments: Spark_DownloadsPage_FixedTypos.html Original Estimate: 1m Remaining Estimate: 1m The _Downloads_ page uses the word "Chose" where the present tense "Choose" is intended. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265344#comment-14265344 ] Bert Greevenbosch edited comment on SPARK-2352 at 1/5/15 11:36 PM: --- Hi Nathan, Great to hear of your interest! The pull request is quite active. You can see it and related discussion here: https://github.com/apache/spark/pull/1290 Best regards, Bert was (Author: bgreeven): Hi Nathan, Great to year of your interest! The pull request is quite active. You can see it and related discussion here: https://github.com/apache/spark/pull/1290 Best regards, Bert [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5040) Support expressing unresolved attributes using $attribute name notation in SQL DSL
[ https://issues.apache.org/jira/browse/SPARK-5040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5040. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3862 [https://github.com/apache/spark/pull/3862] Support expressing unresolved attributes using $attribute name notation in SQL DSL Key: SPARK-5040 URL: https://issues.apache.org/jira/browse/SPARK-5040 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 The SQL DSL uses Scala symbols to represent attributes, e.g. 'attr. Symbols, however, cannot capture attributes with spaces or uncommon characters. Here we suggest supporting {{$"attribute name"}} as an alternative. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
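A minimal sketch of how such a notation can be wired up with a Scala string interpolator; {{UnresolvedAttribute}} is Catalyst's class, but the implicit below is illustrative rather than the exact code merged in the pull request:
{code}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

object InterpolatorSketch {
  implicit class AttributeInterpolator(val sc: StringContext) extends AnyVal {
    // $"first name" builds an unresolved attribute whose name contains a
    // space, which the Symbol-based notation ('attr) cannot express.
    def $(args: Any*): UnresolvedAttribute = UnresolvedAttribute(sc.s(args: _*))
  }
}

// usage:
// import InterpolatorSketch._
// val attr = $"first name"
{code}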
[jira] [Updated] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated SPARK-4940: Summary: Support more evenly distributing cores for Mesos mode (was: Document or Support more evenly distributing cores for Mesos mode) Support more evenly distributing cores for Mesos mode - Key: SPARK-4940 URL: https://issues.apache.org/jira/browse/SPARK-4940 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently, in coarse-grained mode, the Spark scheduler simply takes all the resources it can on each node, which can cause uneven distribution depending on the resources available on each slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
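A minimal, self-contained sketch of the even-distribution idea, independent of the actual MesosSchedulerBackend code ({{Offer}} below is a stand-in for a Mesos resource offer): instead of draining each offer in turn, take one core per slave per round.
{code}
case class Offer(slaveId: String, cores: Int)

def distributeEvenly(offers: Seq[Offer], coresToAcquire: Int): Map[String, Int] = {
  val granted = scala.collection.mutable.Map(offers.map(_.slaveId -> 0): _*)
  var remaining = coresToAcquire
  var madeProgress = true
  // One core per slave per round, so no slave is exhausted while others sit idle.
  while (remaining > 0 && madeProgress) {
    madeProgress = false
    for (o <- offers if remaining > 0 && granted(o.slaveId) < o.cores) {
      granted(o.slaveId) += 1
      remaining -= 1
      madeProgress = true
    }
  }
  granted.toMap
}
{code}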
[jira] [Updated] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated SPARK-5095: Description: Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node was: Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. This is also applicable when users want to specify the number of executors to be launched on each node -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265344#comment-14265344 ] Bert Greevenbosch commented on SPARK-2352: -- Hi Nathan, Great to hear of your interest! The pull request is quite active. You can see it and related discussion here: https://github.com/apache/spark/pull/1290 Best regards, Bert [MLLIB] Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch Assignee: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265331#comment-14265331 ] Sean Owen commented on SPARK-1517: -- Recap: the old URL was building-with-maven.html, the new URL is building-spark.html, to match a rename and content change of the page itself a few months ago. There should be a redirect from the former to the latter. Until the 1.2.0 site was published, there was no building-spark.html page live on the site. So README.md had to link to building-with-maven.html, with the intent that after 1.2.0 this would just redirect to building-spark.html. I'm not sure why, but the redirect isn't working. It redirects to http://spark.apache.org/building-spark.html . It seems like this is some default mechanism, and the redirector that the plugin is supposed to generate isn't present or something. Could somehow be my mistake, but I certainly recall it worked on my local build of the site or else I never would have proposed it. So, yes, one direct hotfix is to change links to the old page into links to the new page. Only one of the two links in README.md was updated. It's easy to fix the other. The README.md that you see on github.com is always going to be from master, but people are going to encounter the page and sometimes expect it to correspond to the latest stable release. (You can always view README.md from the branch you want of course, if you know what you're doing.) Yes, for this reason I agree that it's best to make it mostly pointers to other information, and I think that was already the intent of changes that included the renaming I alluded to above. IIRC there was a desire to not strip down README.md further and leave some minimal, hopefully fairly unchanging, info there. Whether there should be nightly builds of the site is a different question. If you linked to nightly instead of latest I suppose you'd have more of the same problem, no? People finding the github site and perhaps thinking they are seeing the latest stable docs? On the other hand, it would at least be more internally consistent. On the other other hand, would you have to change the links to the stable URLs for release and then back as part of the release process? I had thought just linking to the latest stable release docs was simple and fine. Publish nightly snapshots of documentation, maven artifacts, and binary builds -- Key: SPARK-1517 URL: https://issues.apache.org/jira/browse/SPARK-1517 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Patrick Wendell Priority: Blocker Should be pretty easy to do with Jenkins. The only thing I can think of that would be tricky is to set up credentials so that jenkins can publish this stuff somewhere on apache infra. Ideally we don't want to have to put a private key on every jenkins box (since they are otherwise pretty stateless). One idea is to encrypt these credentials with a passphrase and post them somewhere publicly visible. Then the jenkins build can download the credentials provided we set a passphrase in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4296: Priority: Blocker (was: Critical) Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare an Alias expression with a non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
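A minimal, self-contained sketch (not Catalyst's actual classes) of the comparison mechanism the last sentence asks for: strip {{Alias}} nodes before testing equality, so {{Upper(birthday.date)}} and {{Upper(birthday.date) AS c1}} compare equal.
{code}
sealed trait Expr
case class Attr(name: String) extends Expr
case class Upper(child: Expr) extends Expr
case class Alias(child: Expr, name: String) extends Expr

// Recursively drop Alias wrappers anywhere in the tree.
def stripAliases(e: Expr): Expr = e match {
  case Alias(child, _) => stripAliases(child)
  case Upper(child)    => Upper(stripAliases(child))
  case other           => other
}

def semanticEquals(a: Expr, b: Expr): Boolean = stripAliases(a) == stripAliases(b)

// semanticEquals(Upper(Attr("date")), Alias(Upper(Attr("date")), "c1")) == true
{code}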
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265351#comment-14265351 ] Alex Liu commented on SPARK-4943: - No changes to your approach. Regarding passing cluster1.database.table1 and cluster2.database.table to datasources: they are set as the tableIdentifier, and the tableIdentifier is passed to the catalog.lookupRelation method, where the datasource can use it. Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It was working for Spark 1.1.0 version. Basically we use the table name having dot to include database name {code} [info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found [info] [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2 [info] ^ [info] at scala.sys.package$.error(package.scala:27) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174) [info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) [info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) [info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) [info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) [info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) [info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) [info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83) [info] at scala.Option.getOrElse(Option.scala:120) [info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83) [info] at 
org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53) [info] at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265533#comment-14265533 ] Tathagata Das commented on SPARK-4960: -- The reason I am suggesting the limited solution is that there are use cases where even T => Iterator[T] helps. People have asked me before whether the data received from a source can be filtered even before it has been inserted, to reduce memory usage. Others have also asked if they can do very low latency stuff like pushing received data out to some other store immediately, for greater reliability. This simplified interceptor pattern can solve those. However, yes, it does not solve [~c...@koeninger.org]'s requirement. That should best be solved using a new Receiver and InputDStream. This limited solution can be implemented without another receiver. The interceptor function, if set, can be applied by the BlockGenerator to every record it gets. And since we want everyone to use the BlockGenerator, all receivers will be able to take advantage of this interceptor. Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
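A minimal sketch of the limited interceptor described above ({{BlockGeneratorSketch}} is a stand-in, not the real BlockGenerator): the {{T => Iterator[T]}} function runs on each record before it is buffered, so both early filtering and early expansion happen before anything is stored in Spark.
{code}
class BlockGeneratorSketch[T](interceptor: T => Iterator[T]) {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[T]

  def addData(record: T): Unit = {
    // Empty iterator = record dropped before storage; several elements =
    // record expanded before storage.
    interceptor(record).foreach(buffer += _)
  }

  def currentBuffer: Seq[T] = buffer
}

// e.g. dropping blank lines to reduce memory usage:
val gen = new BlockGeneratorSketch[String](s => if (s.trim.nonEmpty) Iterator(s) else Iterator.empty)
{code}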
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265355#comment-14265355 ] Ryan Williams commented on SPARK-1517: -- Hey [~pwendell], any updates here? The disconnect between the content of the github README and the /latest/ published docs leading up to the 1.2.0 release continues to cast a shadow and new divergence is set to begin as we move further from having just cut a release. As was [recently pointed out on the dev list|http://apache-spark-developers-list.1001551.n3.nabble.com/Starting-with-Spark-tp9908p9925.html], [my|https://github.com/apache/spark/commit/4ceb048b38949dd0a909d2ee6777607341c9c93a#diff-04c6e90faac2675aa89e2176d2eec7d8] and [Reynold's|https://github.com/apache/spark/commit/342b57db66e379c475daf5399baf680ff42b87c2#diff-04c6e90faac2675aa89e2176d2eec7d8] fixes to previously-broken links in the README became broken links when the 1.2.0 docs were cut, as [~srowen] [warned would happen|https://github.com/apache/spark/commit/342b57db66e379c475daf5399baf680ff42b87c2#commitcomment-8250912] (one is fixed [here|https://github.com/apache/spark/pull/3802/files], the other remains broken on the README today). I still believe that the correct fix is to have the README point at docs that are published with each Jenkins build, per this JIRA and [our previous discussion about it|http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-tt9560.html#a9568]. Even better would be to publish nightly docs *and* remove any pretense that the github README is a canonical source of documentation, in favor of just linking to the /latest/ published docs. Let me know if you want me to file that as a sub-task here. Publish nightly snapshots of documentation, maven artifacts, and binary builds -- Key: SPARK-1517 URL: https://issues.apache.org/jira/browse/SPARK-1517 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Patrick Wendell Priority: Blocker Should be pretty easy to do with Jenkins. The only thing I can think of that would be tricky is to set up credentials so that jenkins can publish this stuff somewhere on apache infra. Ideally we don't want to have to put a private key on every jenkins box (since they are otherwise pretty stateless). One idea is to encrypt these credentials with a passphrase and post them somewhere publicly visible. Then the jenkins build can download the credentials provided we set a passphrase in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
Timothy Chen created SPARK-5095: --- Summary: Support launching multiple mesos executors in coarse grained mesos mode Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Reporter: Timothy Chen Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode
[ https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated SPARK-5095: Component/s: Mesos Support launching multiple mesos executors in coarse grained mesos mode --- Key: SPARK-5095 URL: https://issues.apache.org/jira/browse/SPARK-5095 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Currently in coarse grained mesos mode, it's expected that we only launch one Mesos executor that launches one JVM process to launch multiple Spark executors. However, this becomes a problem when the launched JVM process is larger than an ideal size (30GB is the recommended value from Databricks), which causes the GC problems reported on the mailing list. We should support launching multiple executors when large enough resources are available for Spark to use, and these resources are still under the configured limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah updated SPARK-4737: -- Comment: was deleted (was: I will be out of the office with limited access to e-mail from January 05 to January 06. If there are specifically urgent matters requiring my assistance and you have other means of contacting me, please use those other channels. Sorry for the inconvenience. Thanks, -Matt Cheah ) Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager that the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate - i.e. an RDD can have one partition that can serialize cleanly but not others. To do this in the proper way we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
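A minimal sketch of the direction described: attempt the serialization per task and route any failure to a stage-abort handler rather than letting it crash the scheduler ({{abortStage}} here is a stand-in for the DAGScheduler's real handler).
{code}
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import scala.util.control.NonFatal

def trySerializeTask(task: AnyRef)(abortStage: Throwable => Unit): Option[Array[Byte]] = {
  try {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(task) // may fail for only *some* partitions' tasks
    out.close()
    Some(bytes.toByteArray)
  } catch {
    // Propagate at serialization time instead of assuming it cannot fail.
    case e: NotSerializableException => abortStage(e); None
    case NonFatal(e) => abortStage(e); None
  }
}
{code}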
[jira] [Commented] (SPARK-4687) SparkContext#addFile doesn't keep file folder information
[ https://issues.apache.org/jira/browse/SPARK-4687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265416#comment-14265416 ] Patrick Wendell commented on SPARK-4687: I spent some more time looking at this and talking with [~sandyr] and [~joshrosen]. I think having some limited version of this is fine given that, from what I can tell, this is pretty difficult to implement outside of Spark. I am going to post further comments on the JIRA. SparkContext#addFile doesn't keep file folder information - Key: SPARK-4687 URL: https://issues.apache.org/jira/browse/SPARK-4687 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Jimmy Xiang Files added with SparkContext#addFile are loaded with Utils#fetchFile before a task starts. However, Utils#fetchFile puts all files under the Spark root on the worker node. We should have an option to keep the folder information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
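To make the flattening concrete, a minimal sketch using the public API (the path {{data/subdir/lookup.txt}} is illustrative only): today the worker-side lookup is by bare file name, so the {{subdir}} component is lost.
{code}
import org.apache.spark.{SparkContext, SparkFiles}

val sc = new SparkContext("local", "addfile-sketch")
sc.addFile("data/subdir/lookup.txt")
sc.parallelize(1 to 2).foreach { _ =>
  // Fetched under the Spark root on the worker; folder information is gone.
  println(SparkFiles.get("lookup.txt"))
}
{code}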
[jira] [Commented] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265419#comment-14265419 ] Ryan Williams commented on SPARK-1517: -- Agreed that the redirect you speak of should exist / be fixed; a separate JIRA should be filed for that. bq. Whether there should be nightly builds of the site is a different question. My understanding is that there has been consensus at a few points that this is a good idea. The main concern you've voiced is the risk that people will land on the github README when looking for stable/release docs, and: # find nightly info directly in the README content (and not understand it to be incorrect (too up-to-date) for their purposes), or # inadvertently follow a link to published nightly docs. re: 1, in my last post I suggested doing away with the pretense that the github README will directly contain Spark documentation, and replacing its current content with links to the relevant published docs, potentially *both* nightly and stable. re: 2, as long as the README's links to nightly and stable docs sites are clearly marked, this should not be a problem. Users already must have a minimal level of understanding of what version of Spark docs they want to look at. Publish nightly snapshots of documentation, maven artifacts, and binary builds -- Key: SPARK-1517 URL: https://issues.apache.org/jira/browse/SPARK-1517 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Patrick Wendell Priority: Blocker Should be pretty easy to do with Jenkins. The only thing I can think of that would be tricky is to set up credentials so that jenkins can publish this stuff somewhere on apache infra. Ideally we don't want to have to put a private key on every jenkins box (since they are otherwise pretty stateless). One idea is to encrypt these credentials with a passphrase and post them somewhere publicly visible. Then the jenkins build can download the credentials provided we set a passphrase in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265472#comment-14265472 ] Stephen Haberman commented on SPARK-5085: - Looks like this is probably a known EC2/Ubuntu/Linux/Xen issue: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811 http://www.spinics.net/lists/netdev/msg282340.html I'm trying to run with those flags (tso/sg) off to see if that fixes it. netty shuffle service causing connection timeouts - Key: SPARK-5085 URL: https://issues.apache.org/jira/browse/SPARK-5085 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Environment: EMR, transient cluster of 10 m3.2xlarges, spark 1.2.0 Here's our spark-defaults: {code} spark.master spark://$MASTER_IP:7077 spark.eventLog.enabled true spark.eventLog.dir /mnt/spark/work/history spark.serializer org.apache.spark.serializer.KryoSerializer spark.executor.memory ${EXECUTOR_MEM}m spark.core.connection.ack.wait.timeout 600 spark.storage.blockManagerSlaveTimeoutMs 6 spark.shuffle.consolidateFiles true spark.shuffle.service.enabled false spark.shuffle.blockTransferService nio # works with nio, fails with netty # Use snappy because LZF uses ~100-300k buffer per block spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec spark.shuffle.file.buffer.kb 10 spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -Xss2m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRati... spark.akka.logLifecycleEvents true spark.akka.timeout 360 spark.akka.askTimeout 120 spark.akka.lookupTimeout 120 spark.akka.frameSize 100 spark.files.userClassPathFirst true spark.shuffle.memoryFraction 0.5 spark.storage.memoryFraction 0.2 {code} Reporter: Stephen Haberman In Spark 1.2.0, the netty backend is causing our report's cluster to lock up with connection timeouts, ~75% of the way through the job. It happens with both the external shuffle server and the non-external/executor-hosted shuffle server, but if I change the shuffle service from netty to nio, it immediately works. Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster have basically the same pattern of ~15m of silence then timeouts): {code} // lots of log output, doing fine... 15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056 // note 15m of silence here... 
15/01/03 05:48:13 WARN [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780 java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265508#comment-14265508 ] Saisai Shao commented on SPARK-4960: Hi TD, thanks a lot for your suggestions. What concerns me about changing the interceptor pattern to T => Iterator[T] is that it will narrow the usage scenarios of this functionality; at least it cannot meet the requirement [~c...@koeninger.org] mentioned for KafkaInputDStream. Is it necessary to add just the T => Iterator[T] pattern in the receiver rather than moving it into a DStream transformation? For the problem you mentioned, I think maybe we can create an interceptor receiver to ship the data from the actual receiver and do some transformation before storing, so each actual receiver will have a shadow interceptor receiver to intercept the data; this might fix the problem but will increase the implementation complexity. I will re-think about my design and try to figure out a better way if possible. Thanks a lot. Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5082) Minor typo in the Tuning Spark document about Data Serialization
[ https://issues.apache.org/jira/browse/SPARK-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265549#comment-14265549 ] yangping wu commented on SPARK-5082: Yes, I also found the pull after I created the issue. Minor typo in the Tuning Spark document about Data Serialization Key: SPARK-5082 URL: https://issues.apache.org/jira/browse/SPARK-5082 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: yangping wu Priority: Trivial Original Estimate: 8h Remaining Estimate: 8h The latest documentation for *Tuning Spark* has an erroneous entry (http://spark.apache.org/docs/latest/tuning.html) in the *Data Serialization* section: {code} val conf = new SparkConf().setMaster(...).setAppName(...) conf.registerKryoClasses(Seq(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) {code} The *registerKryoClasses* method's parameter type should be *Array[Class[_]]*, not *Seq[Class[_]]*. The correct code snippet is {code} val conf = new SparkConf().setMaster(...).setAppName(...) conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265548#comment-14265548 ] Saisai Shao commented on SPARK-4960: Thanks TD, this sounds reasonable. I will refactor the design doc, and I think we should take SPARK-5042 into consideration. Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5082) Minor typo in the Tuning Spark document about Data Serialization
[ https://issues.apache.org/jira/browse/SPARK-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated SPARK-5082: --- Comment: was deleted (was: Yes, I also found the pull after I created the issue.) Minor typo in the Tuning Spark document about Data Serialization Key: SPARK-5082 URL: https://issues.apache.org/jira/browse/SPARK-5082 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.2.0 Reporter: yangping wu Priority: Trivial Original Estimate: 8h Remaining Estimate: 8h The latest documentation for *Tuning Spark* has an erroneous entry (http://spark.apache.org/docs/latest/tuning.html) in the *Data Serialization* section: {code} val conf = new SparkConf().setMaster(...).setAppName(...) conf.registerKryoClasses(Seq(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) {code} The *registerKryoClasses* method's parameter type should be *Array[Class[_]]*, not *Seq[Class[_]]*. The correct code snippet is {code} val conf = new SparkConf().setMaster(...).setAppName(...) conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2])) val sc = new SparkContext(conf) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-4258) NPE with new Parquet Filters
[ https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-4258: - Oops this got closed by your documentation fix. Reopening. NPE with new Parquet Filters Key: SPARK-4258 URL: https://issues.apache.org/jira/browse/SPARK-4258 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): java.lang.NullPointerException: parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206) parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Or.accept(Operators.java:302) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$And.accept(Operators.java:290) parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) {code} This occurs when reading parquet data encoded with the older version of the library for TPC-DS query 34. Will work on coming up with a smaller reproduction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265570#comment-14265570 ] Cody Koeninger commented on SPARK-4960: --- At that point, it sounds like you're talking about an early filter rather than an early flatmap. Should it just be T => Option[T]? Since this ticket no longer solves the problem raised by SPARK-3146, and it seems unlikely that patch will ever get merged, what is the concrete plan for giving users of receiver-based kafka implementations early access to MessageAndMetadata? Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify / do something with the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
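A minimal sketch of the relationship between the two shapes discussed above: an early filter ({{T => Option[T]}}) is the special case of an early flatmap ({{T => Iterator[T]}}) that emits at most one record, so either shape can be offered without losing the other.
{code}
// Adapt a filter-shaped interceptor into the flatmap-shaped one.
def asFlatMap[T](filter: T => Option[T]): T => Iterator[T] =
  (t: T) => filter(t).iterator

// e.g. dropping blank lines before storage:
val dropBlanks: String => Option[String] =
  s => if (s.trim.nonEmpty) Some(s) else None

val interceptor: String => Iterator[String] = asFlatMap(dropBlanks)
{code}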
[jira] [Created] (SPARK-5096) SparkBuild.scala assumes you are at the spark root dir
Michael Armbrust created SPARK-5096: --- Summary: SparkBuild.scala assumes you are at the spark root dir Key: SPARK-5096 URL: https://issues.apache.org/jira/browse/SPARK-5096 Project: Spark Issue Type: Bug Components: Build Reporter: Michael Armbrust Assignee: Michael Armbrust This is bad because it breaks compiling spark as an external project ref and is generally bad SBT practice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5096) SparkBuild.scala assumes you are at the spark root dir
[ https://issues.apache.org/jira/browse/SPARK-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265666#comment-14265666 ] Apache Spark commented on SPARK-5096: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3905 SparkBuild.scala assumes you are at the spark root dir -- Key: SPARK-5096 URL: https://issues.apache.org/jira/browse/SPARK-5096 Project: Spark Issue Type: Bug Components: Build Reporter: Michael Armbrust Assignee: Michael Armbrust This is bad because it breaks compiling spark as an external project ref and is generally bad SBT practice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
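A minimal sbt 0.13 sketch of the fix direction (illustrative settings, not the actual SparkBuild.scala change): resolve paths against {{baseDirectory.value}} rather than the JVM working directory, so the build still works when Spark is referenced as an external project.
{code}
import sbt._
import Keys._

object ExampleBuild extends Build {
  lazy val core = Project("core", file("core")).settings(
    // Bad: new File("conf") resolves against the process's current directory.
    // Good: anchor the path on the project's own base directory.
    unmanagedResourceDirectories in Compile += baseDirectory.value / "conf"
  )
}
{code}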
[jira] [Commented] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265642#comment-14265642 ] Stephen Haberman commented on SPARK-5085: - I've confirmed that adding {{sudo ethtool -K eth0 tso off}} and {{sudo ethtool -K eth0 sg off}} to our cluster setup script fixed the issue, and the job runs perfectly now. (I had initially tried only tso off; that did not fix it, but tso off + sg off did. I have not tried only sg off; from running ethtool, my naive/quick interpretation was that sg off implies tso off, but don't hold me to that.) Closing this ticket. netty shuffle service causing connection timeouts - Key: SPARK-5085 URL: https://issues.apache.org/jira/browse/SPARK-5085 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Environment: EMR, transient cluster of 10 m3.2xlarges, spark 1.2.0 Here's our spark-defaults: {code} spark.master spark://$MASTER_IP:7077 spark.eventLog.enabled true spark.eventLog.dir /mnt/spark/work/history spark.serializer org.apache.spark.serializer.KryoSerializer spark.executor.memory ${EXECUTOR_MEM}m spark.core.connection.ack.wait.timeout 600 spark.storage.blockManagerSlaveTimeoutMs 6 spark.shuffle.consolidateFiles true spark.shuffle.service.enabled false spark.shuffle.blockTransferService nio # works with nio, fails with netty # Use snappy because LZF uses ~100-300k buffer per block spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec spark.shuffle.file.buffer.kb 10 spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -Xss2m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRati... spark.akka.logLifecycleEvents true spark.akka.timeout 360 spark.akka.askTimeout 120 spark.akka.lookupTimeout 120 spark.akka.frameSize 100 spark.files.userClassPathFirst true spark.shuffle.memoryFraction 0.5 spark.storage.memoryFraction 0.2 {code} Reporter: Stephen Haberman In Spark 1.2.0, the netty backend is causing our report's cluster to lock up with connection timeouts, ~75% of the way through the job. It happens with both the external shuffle server and the non-external/executor-hosted shuffle server, but if I change the shuffle service from netty to nio, it immediately works. Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster have basically the same pattern of ~15m of silence then timeouts): {code} // lots of log output, doing fine... 
15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056 // note 15m of silence here... 15/01/03 05:48:13 WARN [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780 java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
[jira] [Closed] (SPARK-5085) netty shuffle service causing connection timeouts
[ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Haberman closed SPARK-5085. --- Resolution: Invalid netty shuffle service causing connection timeouts - Key: SPARK-5085 URL: https://issues.apache.org/jira/browse/SPARK-5085 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Environment: EMR, transient cluster of 10 m3.2xlarges, spark 1.2.0 Here's our spark-defaults: {code} spark.master spark://$MASTER_IP:7077 spark.eventLog.enabled true spark.eventLog.dir /mnt/spark/work/history spark.serializer org.apache.spark.serializer.KryoSerializer spark.executor.memory ${EXECUTOR_MEM}m spark.core.connection.ack.wait.timeout 600 spark.storage.blockManagerSlaveTimeoutMs 6 spark.shuffle.consolidateFiles true spark.shuffle.service.enabled false spark.shuffle.blockTransferService nio # works with nio, fails with netty # Use snappy because LZF uses ~100-300k buffer per block spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec spark.shuffle.file.buffer.kb 10 spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -Xss2m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRati... spark.akka.logLifecycleEvents true spark.akka.timeout 360 spark.akka.askTimeout 120 spark.akka.lookupTimeout 120 spark.akka.frameSize 100 spark.files.userClassPathFirst true spark.shuffle.memoryFraction 0.5 spark.storage.memoryFraction 0.2 {code} Reporter: Stephen Haberman In Spark 1.2.0, the netty backend is causing our report's cluster to lock up with connection timeouts, ~75% of the way through the job. It happens with both the external shuffle server and the non-external/executor-hosted shuffle server, but if I change the shuffle service from netty to nio, it immediately works. Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster have basically the same pattern of ~15m of silence then timeouts): {code} // lots of log output, doing fine... 15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170} 15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056 // note 15m of silence here... 
15/01/03 05:48:13 WARN [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780 java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result
[jira] [Commented] (SPARK-4960) Interceptor pattern in receivers
[ https://issues.apache.org/jira/browse/SPARK-4960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265589#comment-14265589 ] Tathagata Das commented on SPARK-4960: -- That is a good question. What we could do is provide a new API in KafkaUtils for that. Though SPARK-4964 also tries to add more interfaces to KafkaUtils. I have to brainstorm a little bit about how all this should be organized. Interceptor pattern in receivers Key: SPARK-4960 URL: https://issues.apache.org/jira/browse/SPARK-4960 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Tathagata Das Sometimes it is good to intercept a message received through a receiver and modify or otherwise process the message before it is stored into Spark. This is often referred to as the interceptor pattern. There should be a general way to specify an interceptor function that gets applied to all receivers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
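To make the requested feature concrete, here is a minimal sketch of the interceptor pattern itself, written in Python purely for brevity; the real receiver API is JVM-side, and every name below is illustrative rather than a proposed Spark API.
{code}
# Hypothetical illustration: wrap a receiver's store() callback so that a
# user-supplied intercept() runs on every message before it is stored.
def with_interceptor(store, intercept):
    def store_intercepted(message):
        transformed = intercept(message)
        if transformed is not None:  # returning None lets the interceptor drop a message
            store(transformed)
    return store_intercepted

# Usage: every received message passes through the interceptor first.
received = []
store = with_interceptor(received.append, lambda m: m.strip().lower())
store("  Hello ")
store("WORLD")
print(received)  # ['hello', 'world']
{code}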
[jira] [Updated] (SPARK-5090) The improvement of python converter for hbase
[ https://issues.apache.org/jira/browse/SPARK-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gen TANG updated SPARK-5090: Description: The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and moreover it loses the other information in the record, such as column:cell and timestamp. Here we would like to propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. We would also like to make some improvements to hbase_inputformat.py was: The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and moreover it loses the other information in the record, such as column:cell and timestamp. Here we would like to propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. The improvement of python converter for hbase - Key: SPARK-5090 URL: https://issues.apache.org/jira/browse/SPARK-5090 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Gen TANG Labels: hbase, python Fix For: 1.2.1 Original Estimate: 168h Remaining Estimate: 168h The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and moreover it loses the other information in the record, such as column:cell and timestamp. Here we would like to propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. We would also like to make some improvements to hbase_inputformat.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
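To make the proposal concrete, here is a sketch of how hbase_inputformat.py might consume such an enriched record. The delimiter format and field layout below are assumptions invented for illustration only; they are not the actual converter output.
{code}
# Hypothetical enriched record: one string per row of ';'-separated cells,
# each cell carrying "columnFamily:qualifier,timestamp,value".
record = "cf:name,1420416000000,alice;cf:age,1420416000123,42"

cells = []
for part in record.split(";"):
    column, timestamp, value = part.split(",", 2)
    cells.append({"column": column, "timestamp": int(timestamp), "value": value})

print(cells)
# [{'column': 'cf:name', 'timestamp': 1420416000000, 'value': 'alice'},
#  {'column': 'cf:age', 'timestamp': 1420416000123, 'value': '42'}]
{code}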
[jira] [Resolved] (SPARK-5057) Log message in failed askWithReply attempts
[ https://issues.apache.org/jira/browse/SPARK-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5057. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3875 [https://github.com/apache/spark/pull/3875] Log message in failed askWithReply attempts --- Key: SPARK-5057 URL: https://issues.apache.org/jira/browse/SPARK-5057 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: WangTaoTheTonic Priority: Minor Fix For: 1.3.0 As askWithReply is used in many cases, it is better for analysis to print the contents of the message after an attempt fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
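A minimal sketch of the improvement being described, in Python for brevity; the real code lives in Spark's Scala core, and the names here are illustrative, not Spark's actual API.
{code}
import logging

def ask_with_reply(send, message, attempts=3):
    """Retry send(message), logging the message contents on each failure
    so a failed attempt can actually be diagnosed from the logs."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send(message)
        except Exception as e:  # illustrative catch-all for the sketch
            last_error = e
            # The improvement: log the message itself, not just the error.
            logging.warning("Attempt %d failed to get reply for message %r: %s",
                            attempt, message, e)
    raise RuntimeError(f"All {attempts} attempts failed for {message!r}") from last_error
{code}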
[jira] [Updated] (SPARK-5057) Log message in failed askWithReply attempts
[ https://issues.apache.org/jira/browse/SPARK-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5057: -- Summary: Log message in failed askWithReply attempts (was: Add more details in log when using actor to get infos) Log message in failed askWithReply attempts --- Key: SPARK-5057 URL: https://issues.apache.org/jira/browse/SPARK-5057 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: WangTaoTheTonic Priority: Minor Fix For: 1.3.0 As askWithReply is used in many cases, it is better for analysis to print the contents of the message after an attempt fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5057) Log message in failed askWithReply attempts
[ https://issues.apache.org/jira/browse/SPARK-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5057: -- Assignee: WangTaoTheTonic Log message in failed askWithReply attempts --- Key: SPARK-5057 URL: https://issues.apache.org/jira/browse/SPARK-5057 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Fix For: 1.3.0 As askWithReply is used in many cases, it is better for analysis to print the contents of the message after an attempt fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4465) runAsSparkUser doesn't affect TaskRunner in Mesos environment at all.
[ https://issues.apache.org/jira/browse/SPARK-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4465: -- Assignee: Jongyoul Lee runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Key: SPARK-4465 URL: https://issues.apache.org/jira/browse/SPARK-4465 Project: Spark Issue Type: Bug Components: Deploy, Input/Output, Mesos Affects Versions: 1.2.0 Reporter: Jongyoul Lee Assignee: Jongyoul Lee Priority: Critical Fix For: 1.3.0, 1.2.1 runAsSparkUser enables classes that use the Hadoop library to change the active user to the Spark user; however, in the Mesos environment the function is called before the task runs within JNI, so runAsSparkUser doesn't affect anything and the code is meaningless. Fix the appropriate scope of the function and remove the meaningless code. This is a bug because the program runs incorrectly. It is related to SPARK-3223, but setting the framework user was not a perfect solution in my tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4465) runAsSparkUser doesn't affect TaskRunner in Mesos environment at all.
[ https://issues.apache.org/jira/browse/SPARK-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4465. --- Resolution: Fixed Issue resolved by pull request 3741 [https://github.com/apache/spark/pull/3741] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Key: SPARK-4465 URL: https://issues.apache.org/jira/browse/SPARK-4465 Project: Spark Issue Type: Bug Components: Deploy, Input/Output, Mesos Affects Versions: 1.2.0 Reporter: Jongyoul Lee Priority: Critical Fix For: 1.3.0, 1.2.1 runAsSparkUser enables classes that use the Hadoop library to change the active user to the Spark user; however, in the Mesos environment the function is called before the task runs within JNI, so runAsSparkUser doesn't affect anything and the code is meaningless. Fix the appropriate scope of the function and remove the meaningless code. This is a bug because the program runs incorrectly. It is related to SPARK-3223, but setting the framework user was not a perfect solution in my tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265025#comment-14265025 ] Apache Spark commented on SPARK-5089: - User 'freeman-lab' has created a pull request for this issue: https://github.com/apache/spark/pull/3902 Vector conversion broken for non-float64 arrays --- Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (the cast is not performed in place, so its result is discarded).
{code:none}
if ar.dtype != np.float64:
    ar.astype(np.float64)
{code}
Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results:
{code:none}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
5 # should be 10!
{code}
But this works fine:
{code:none}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
10 # this is correct
{code}
The fix is trivial, I'll submit a PR shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
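A sketch of the one-line fix the reporter describes: since numpy's {{astype}} returns a new array rather than casting in place, the result must be assigned back.
{code}
import numpy as np

ar = np.arange(5, dtype=np.float16)
if ar.dtype != np.float64:
    ar = ar.astype(np.float64)  # assign the result; astype does not cast in place
assert ar.dtype == np.float64
{code}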
[jira] [Created] (SPARK-5091) Hooks for PySpark tasks
Davies Liu created SPARK-5091: - Summary: Hooks for PySpark tasks Key: SPARK-5091 URL: https://issues.apache.org/jira/browse/SPARK-5091 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Davies Liu Currently, it's not convenient to add a package on an executor to PYTHONPATH (we do not assume the environments of the driver and executors are identical). It would be nice to have a hook called before/after every task; the user could then manipulate sys.path in a pre-task hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
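A hypothetical sketch of what such a pre-task hook might look like; the hook registration shown in the comment is invented for illustration, since no such interface exists yet (that is exactly what this issue proposes).
{code}
import sys

# Hypothetical pre-task hook: make an extra package directory importable
# on the executor before the task body runs.
def add_deps_to_path():
    extra = "/opt/spark-python-deps"  # illustrative path, not a Spark convention
    if extra not in sys.path:
        sys.path.insert(0, extra)

# Hypothetical registration API (does not exist; this is the proposal):
# sc.addTaskHook(pre=add_deps_to_path, post=None)
{code}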
[jira] [Created] (SPARK-5092) Selecting from a nested structure with SparkSQL should return a nested structure
Brad Willard created SPARK-5092: --- Summary: Selecting from a nested structure with SparkSQL should return a nested structure Key: SPARK-5092 URL: https://issues.apache.org/jira/browse/SPARK-5092 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brad Willard Priority: Minor When running a SparkSQL query like this (at least on a json dataset): select rid, meta_data.name from a_table the rows returned lose the nested structure. I receive a row like Row(rid='123', name='delete') instead of Row(rid='123', meta_data=Row(name='data')). I personally think this is confusing, especially when programmatically building and executing queries and then parsing the results to find your data in a new structure. I could understand how that's less desirable in some situations, but you could get around it by supporting 'as': if you wanted to skip the nested structure, simply write select rid, meta_data.name as name from a_table -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264862#comment-14264862 ] Cheng Lian commented on SPARK-4850: --- [~debugger87] The query you provided in the description is actually invalid, because no aggregation function is involved. The expanded {{*}} clearly contains expressions that are not in the group by clause (all attributes except for {{a}}). If you try something like:
{code}
val t = sqlContext.sql("select a, count(*) from t group by a")
{code}
or
{code}
val t = sqlContext.sql("select arr, count(*) from t group by arr")
{code}
then everything is fine. GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in the Spark shell as follows:
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).register("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect
{code}
Let's look into the schema of `Table`:
{code}
root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)
{code}
The following exception will be thrown:
{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
[jira] [Closed] (SPARK-4850) GROUP BY can't work if the schema of SchemaRDD contains struct or array type
[ https://issues.apache.org/jira/browse/SPARK-4850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian closed SPARK-4850. - Resolution: Invalid Not a bug. GROUP BY can't work if the schema of SchemaRDD contains struct or array type -- Key: SPARK-4850 URL: https://issues.apache.org/jira/browse/SPARK-4850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.2 Reporter: Chaozhong Yang Assignee: Cheng Lian Labels: group, sql Original Estimate: 96h Remaining Estimate: 96h Code in the Spark shell as follows:
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "path/to/json"
sqlContext.jsonFile(path).register("Table")
val t = sqlContext.sql("select * from Table group by a")
t.collect
{code}
Let's look into the schema of `Table`:
{code}
root
 |-- a: integer (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- createdAt: string (nullable = true)
 |-- f: struct (nullable = true)
 |    |-- __type: string (nullable = true)
 |    |-- className: string (nullable = true)
 |    |-- objectId: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- s: string (nullable = true)
 |-- updatedAt: string (nullable = true)
{code}
The following exception will be thrown:
{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: arr#9, tree:
Aggregate [a#8], [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14]
 Subquery TestImport
  LogicalRDD [a#8,arr#9,createdAt#10,f#11,objectId#12,s#13,updatedAt#14], MappedRDD[18] at map at JsonRDD.scala:47
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:126)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:125)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:106)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
at $iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC.<init>(<console>:24)
at $iwC.<init>(<console>:26)
at <init>(<console>:28)
at
[jira] [Resolved] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-4296. --- Resolution: Duplicate Fix Version/s: 1.2.0 Target Version/s: 1.2.0 (was: 1.3.0) This issue is a duplicate of SPARK-4322, which has already been fixed in 1.2.0. Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is in the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare an Alias expression with a non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
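A toy model of the comparison mechanism the reporter suggests, written in Python; the classes below are stand-ins for Catalyst's expression tree, purely for illustration: compare expressions after stripping the alias. (In the real tree the alias can also be nested inside another expression, so an actual fix would need to strip aliases recursively.)
{code}
from dataclasses import dataclass

@dataclass(frozen=True)
class Expr:
    sql: str          # stand-in for a real expression tree

@dataclass(frozen=True)
class Alias:
    child: Expr
    name: str

def strip_alias(e):
    # an aliased expression should compare equal to its un-aliased child
    return e.child if isinstance(e, Alias) else e

def same_expression(a, b):
    return strip_alias(a) == strip_alias(b)

grouped = Expr("Upper(birthday#1.date)")
selected = Alias(Expr("Upper(birthday#1.date)"), "c1#3")
assert same_expression(grouped, selected)  # a naive == on these would fail
{code}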
[jira] [Reopened] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reopened SPARK-4296: --- Sorry, missed David's comment below. Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is in the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare an Alias expression with a non-Alias expression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4688. Resolution: Fixed Fix Version/s: 1.3.0 Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
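A sketch of the compatibility fallback described above, in Python pseudocode; the helper name and the default value are illustrative assumptions, not Spark's actual resolution code.
{code}
# Resolve a legacy timeout against the proposed shared default:
# an explicitly-set old key wins; otherwise fall back to spark.network.timeout.
def resolve_timeout(conf, legacy_key, default="120s"):
    if legacy_key in conf:  # old conf still respected when set
        return conf[legacy_key]
    return conf.get("spark.network.timeout", default)

conf = {"spark.network.timeout": "300s"}
print(resolve_timeout(conf, "spark.akka.timeout"))  # 300s (shared default)
conf["spark.core.connection.ack.wait.timeout"] = "600s"
print(resolve_timeout(conf, "spark.core.connection.ack.wait.timeout"))  # 600s (legacy wins)
{code}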
[jira] [Commented] (SPARK-4387) Refactoring python profiling code to make it extensible
[ https://issues.apache.org/jira/browse/SPARK-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264919#comment-14264919 ] Apache Spark commented on SPARK-4387: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3901 Refactoring python profiling code to make it extensible --- Key: SPARK-4387 URL: https://issues.apache.org/jira/browse/SPARK-4387 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.1.0 Reporter: Yandu Oppacher SPARK-3478 introduced python profiling for workers, which is great, but it would be nice to be able to change the profiler and output formats as needed. This is a refactoring of the code to allow that to happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
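A hedged sketch of the kind of extensibility being proposed: a pluggable profiler base class whose subclasses control how profiles are collected and rendered. The class and method names are illustrative, not the final PySpark API.
{code}
import cProfile
import io
import pstats

class Profiler:
    """Illustrative base class: subclass to change profiling or output format."""
    def profile(self, func, *args, **kwargs):
        raise NotImplementedError
    def show(self):
        raise NotImplementedError

class BasicProfiler(Profiler):
    def __init__(self):
        self._profile = cProfile.Profile()
    def profile(self, func, *args, **kwargs):
        # run the task body under cProfile and return its result
        return self._profile.runcall(func, *args, **kwargs)
    def show(self):
        # render the collected stats; a subclass could dump to a file instead
        out = io.StringIO()
        pstats.Stats(self._profile, stream=out).sort_stats("cumulative").print_stats(5)
        print(out.getvalue())

p = BasicProfiler()
p.profile(sum, range(1_000_000))
p.show()
{code}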
[jira] [Commented] (SPARK-4905) Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream
[ https://issues.apache.org/jira/browse/SPARK-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264936#comment-14264936 ] Hari Shreedharan commented on SPARK-4905: - Taking a look now. Flaky FlumeStreamSuite test: org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream - Key: SPARK-4905 URL: https://issues.apache.org/jira/browse/SPARK-4905 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Josh Rosen Labels: flaky-test It looks like the org.apache.spark.streaming.flume.FlumeStreamSuite.flume input stream test might be flaky ([link|https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24647/testReport/junit/org.apache.spark.streaming.flume/FlumeStreamSuite/flume_input_stream/]): {code} Error Message The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). Stacktrace sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 106 times over 10.045097243 seconds. Last failure message: ArrayBuffer(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) was not equal to Vector(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100). 
at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:142) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply$mcV$sp(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$2.apply(FlumeStreamSuite.scala:62) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.streaming.flume.FlumeStreamSuite.org$scalatest$BeforeAndAfter$$super$runTest(FlumeStreamSuite.scala:46) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:200) at org.apache.spark.streaming.flume.FlumeStreamSuite.runTest(FlumeStreamSuite.scala:46) at
[jira] [Commented] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264942#comment-14264942 ] Varun Saxena commented on SPARK-4688: - Thanks [~rxin] for the commit. Mind assigning the issue to me? Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264942#comment-14264942 ] Varun Saxena edited comment on SPARK-4688 at 1/5/15 7:20 PM: - Thanks [~rxin] for the commit. Can you assign this issue to me? was (Author: varun_saxena): Thanks [~rxin] for the commit. Mind assigning the issue to me? Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4757) Yarn-client failed to start due to Wrong FS error in distCacheMgr.addResource
[ https://issues.apache.org/jira/browse/SPARK-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264979#comment-14264979 ] Chris Albright commented on SPARK-4757: --- Is there an ETA on when this might make it into a release? Or can we check out a commit and build locally? I don't see a commit hash for the fix. Yarn-client failed to start due to Wrong FS error in distCacheMgr.addResource - Key: SPARK-4757 URL: https://issues.apache.org/jira/browse/SPARK-4757 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0 Reporter: Jianshi Huang Fix For: 1.2.0, 1.3.0 I got the following error during Spark startup (Yarn-client mode):
14/12/04 19:33:58 INFO Client: Uploading resource file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar - hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar
java.lang.IllegalArgumentException: Wrong FS: hdfs://stampy/user/jianshuang/.sparkStaging/application_1404410683830_531767/datanucleus-api-jdo-3.2.6.jar, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:67)
at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:257)
at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:242)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:242)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:35)
at org.apache.spark.deploy.yarn.ClientBase$class.createContainerLaunchContext(ClientBase.scala:350)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:35)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:80)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:335)
at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:986)
at $iwC$$iwC.<init>(<console>:9)
at $iwC.<init>(<console>:18)
at <init>(<console>:20)
at .<init>(<console>:24)
According to Liancheng and Andrew, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4688: --- Assignee: Varun Saxena Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Varun Saxena Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4688) Have a single shared network timeout in Spark
[ https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264987#comment-14264987 ] Reynold Xin commented on SPARK-4688: Done - thanks for doing this. Have a single shared network timeout in Spark - Key: SPARK-4688 URL: https://issues.apache.org/jira/browse/SPARK-4688 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Varun Saxena Priority: Critical Fix For: 1.3.0 We have several different timeouts, but in most cases users just want to set something that is large enough that they can avoid GC pauses. We should have a single conf spark.network.timeout that is used as the default timeout for all network interactions. This can replace the following timeouts:
{code}
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs (undocumented)
spark.shuffle.io.connectionTimeout (undocumented)
{code}
Of course, for compatibility we should respect the old ones when they are set. This idea was proposed originally by [~rxin] and I'm paraphrasing his suggestion here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5090) The improvement of python converter for hbase
Gen TANG created SPARK-5090: --- Summary: The improvement of python converter for hbase Key: SPARK-5090 URL: https://issues.apache.org/jira/browse/SPARK-5090 Project: Spark Issue Type: Improvement Components: Examples Affects Versions: 1.2.0 Reporter: Gen TANG Fix For: 1.2.1 The python converter `HBaseResultToStringConverter` provided in HBaseConverter.scala returns only the value of the first column in the result. This limits the utility of the converter, because it returns only one value per row (there may be several versions in HBase) and moreover it loses the other information in the record, such as column:cell and timestamp. Here we would like to propose an improvement to the python converter so that it returns all the records in the result (in a single string) with more complete information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265089#comment-14265089 ] Alex Liu commented on SPARK-4943: - The approach of
{code}
case class UnresolvedRelation(tableIdentifier: Seq[String], alias: Option[String])
{code}
is a little unclear: simply reading the code does not tell you what is stored in tableIdentifier. Another approach is storing catalog.cluster.database in databaseName and the table name in tableName, keeping the case class unchanged:
{code}
case class UnresolvedRelation(databaseName: Option[String], tableName: String, alias: Option[String])
{code}
so there are no API changes. If we keep clusterName as a separate parameter, then the API changes to
{code}
case class UnresolvedRelation(clusterName: Option[String], databaseName: Option[String], tableName: String, alias: Option[String])
{code}
and the Catalog API needs to change accordingly. Parsing error for query with table name having dot -- Key: SPARK-4943 URL: https://issues.apache.org/jira/browse/SPARK-4943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Alex Liu When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It worked in Spark 1.1.0. Basically we use a table name containing a dot to include the database name:
{code}
[info] java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found
[info]
[info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2
[info] ^
[info] at scala.sys.package$.error(package.scala:27)
[info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
[info] at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info] at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info] at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info] at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info] at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[info] at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
[info] at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
[info] at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
[info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info] at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info] at scala.Option.getOrElse(Option.scala:120)
[info] at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
[info] at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53)
[info] at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56)
[info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169)
[info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info] at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info] at