[jira] [Resolved] (SPARK-2778) Add unit tests for Yarn integration
[ https://issues.apache.org/jira/browse/SPARK-2778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2778. Resolution: Fixed Fix Version/s: 1.2.0 Fixed by: https://github.com/apache/spark/pull/2257 Add unit tests for Yarn integration --- Key: SPARK-2778 URL: https://issues.apache.org/jira/browse/SPARK-2778 Project: Spark Issue Type: Test Components: YARN Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Fix For: 1.2.0 It would be nice to add some Yarn integration tests to the unit tests in Spark; Yarn provides a MiniYARNCluster class that can be used to spawn a cluster locally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
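For context, a minimal sketch of what a MiniYARNCluster-based test harness looks like. The cluster API (`MiniYARNCluster`, `init`, `start`, `getConfig`) is Hadoop's test API; the Spark-side wiring is elided and hypothetical:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.server.MiniYARNCluster

object YarnIntegrationSketch {
  def main(args: Array[String]): Unit = {
    // Spin up a one-NodeManager YARN cluster inside the test JVM.
    val yarnCluster = new MiniYARNCluster("spark-yarn-tests", 1, 1, 1)
    yarnCluster.init(new YarnConfiguration())
    yarnCluster.start()
    try {
      // The RM address is assigned dynamically; it must be handed to the
      // Spark application under test so it targets this cluster.
      val config: Configuration = yarnCluster.getConfig
      val rmAddress = config.get(YarnConfiguration.RM_ADDRESS)
      println(s"MiniYARNCluster RM at $rmAddress")
      // ... launch a Spark app in yarn-cluster mode, assert on final status ...
    } finally {
      yarnCluster.stop()
    }
  }
}
```

This requires the hadoop-yarn server test artifacts on the test classpath, which is presumably the bulk of the work in the linked pull request.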
[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147465#comment-14147465 ] Patrick Wendell commented on SPARK-3687: Can you perform a jstack on the executor when it is hanging? We usually only post things on JIRA like this once a specific issue has been debugged a bit more, but if you can produce a jstack of the hung executor we can keep it open :) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform flatMap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 80.
[jira] [Resolved] (SPARK-3576) Provide script for creating the Spark AMI from scratch
[ https://issues.apache.org/jira/browse/SPARK-3576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3576. Resolution: Fixed This was fixed in spark-ec2 itself Provide script for creating the Spark AMI from scratch -- Key: SPARK-3576 URL: https://issues.apache.org/jira/browse/SPARK-3576 Project: Spark Issue Type: Bug Components: EC2 Reporter: Patrick Wendell Assignee: Patrick Wendell
[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters
[ https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3288: --- Assignee: (was: Andrew Or) All fields in TaskMetrics should be private and use getters/setters --- Key: SPARK-3288 URL: https://issues.apache.org/jira/browse/SPARK-3288 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Labels: starter This is particularly bad because we expose this as a developer API. Technically a library could create a TaskMetrics object, change the values inside of it, and pass it on to someone else. It can be written pretty compactly like below:
{code}
/**
 * Number of bytes written for the shuffle by this task
 */
@volatile private var _shuffleBytesWritten: Long = _
def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
def shuffleBytesWritten = _shuffleBytesWritten
{code}
[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters
[ https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3288: --- Labels: starter (was: ) All fields in TaskMetrics should be private and use getters/setters --- Key: SPARK-3288 URL: https://issues.apache.org/jira/browse/SPARK-3288 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Labels: starter This is particularly bad because we expose this as a developer API. Technically a library could create a TaskMetrics object, change the values inside of it, and pass it on to someone else. It can be written pretty compactly like below:
{code}
/**
 * Number of bytes written for the shuffle by this task
 */
@volatile private var _shuffleBytesWritten: Long = _
def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
def shuffleBytesWritten = _shuffleBytesWritten
{code}
[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147504#comment-14147504 ] Ziv Huang commented on SPARK-3687: -- The following is the jstack dump of one executor when it hangs:
{code}
"File appending thread for /opt/spark-1.1.0-bin-hadoop2.4/work/app-20140925150845-0007/2/stderr" daemon prio=10 tid=0x7ffe0c002800 nid=0x18a3 runnable [0x7ffebc402000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:272)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    - locked 0xfaeee1d0 (a java.lang.UNIXProcess$ProcessPipeInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"File appending thread for /opt/spark-1.1.0-bin-hadoop2.4/work/app-20140925150845-0007/2/stdout" daemon prio=10 tid=0x7ffe0c004000 nid=0x18a2 runnable [0x7ffebc503000]
   java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:272)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    - locked 0xfaeec108 (a java.lang.UNIXProcess$ProcessPipeInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
    at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"process reaper" daemon prio=10 tid=0x7ffe0c001000 nid=0x1868 runnable [0x7ffecc0c7000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.UNIXProcess.waitForProcessExit(Native Method)
    at java.lang.UNIXProcess.access$500(UNIXProcess.java:54)
    at java.lang.UNIXProcess$4.run(UNIXProcess.java:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

"ExecutorRunner for app-20140925150845-0007/2" daemon prio=10 tid=0x7ffe7011b800 nid=0x1866 in Object.wait() [0x7ffebc705000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on 0xfaee9df8 (a java.lang.UNIXProcess)
    at java.lang.Object.wait(Object.java:503)
    at java.lang.UNIXProcess.waitFor(UNIXProcess.java:263)
    - locked 0xfaee9df8 (a java.lang.UNIXProcess)
    at org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:164)
    at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:63)

"Attach Listener" daemon prio=10 tid=0x7ffe84001000 nid=0x170f waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"sparkWorker-akka.actor.default-dispatcher-16" daemon prio=10 tid=0x7ffe68214800 nid=0x13a3 waiting on condition [0x7ffebc806000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for 0xfd614a78 (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
    at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"sparkWorker-akka.actor.default-dispatcher-15" daemon prio=10 tid=0x7ffe7011e000 nid=0x13a2 waiting on condition [0x7ffebc604000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native
{code}
[jira] [Created] (SPARK-3688) LogicalPlan can't resolve column correctly
Yi Tian created SPARK-3688: -- Summary: LogicalPlan can't resolve column correctly Key: SPARK-3688 URL: https://issues.apache.org/jira/browse/SPARK-3688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yi Tian How to reproduce this problem: create a table:
{quote}
create table test (a string, b string);
{quote}
execute sql:
{quote}
select a.b, count(1) from test a join test t group by a.b;
{quote}
[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147513#comment-14147513 ] Ziv Huang commented on SPARK-3687: -- Just a few minutes ago I ran a job twice, processing 203 sequence files. Both times I saw the job hanging with different behavior from before: 1. the Spark master web UI shows that the job finished with state FAILED after 3.x minutes; 2. the job stage web UI still hangs, and the execution duration is still accumulating. Hope this information helps debugging :) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform flatMap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 80.
[jira] [Resolved] (SPARK-3422) JavaAPISuite.getHadoopInputSplits isn't used anywhere
[ https://issues.apache.org/jira/browse/SPARK-3422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-3422. --- Resolution: Fixed JavaAPISuite.getHadoopInputSplits isn't used anywhere - Key: SPARK-3422 URL: https://issues.apache.org/jira/browse/SPARK-3422 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Trivial
[jira] [Commented] (SPARK-3688) LogicalPlan can't resolve column correctly
[ https://issues.apache.org/jira/browse/SPARK-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147520#comment-14147520 ] Yi Tian commented on SPARK-3688: As we know, Hive supports complex column data types like struct, so it is hard to resolve a reference like a.b.c. I think we should add some checks on the data type of the attribute, like:
{quote}
option.dataType.isInstanceOf[StructType]
{quote}
LogicalPlan can't resolve column correctly --- Key: SPARK-3688 URL: https://issues.apache.org/jira/browse/SPARK-3688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yi Tian How to reproduce this problem: create a table:
{quote}
create table test (a string, b string);
{quote}
execute sql:
{quote}
select a.b, count(1) from test a join test t group by a.b;
{quote}
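To illustrate the proposed check with a self-contained sketch (the `SimpleType` and `Attribute` classes below are hypothetical stand-ins for Catalyst's `DataType` and attribute types): resolution of the nested part `b` in `a.b` should only be attempted as struct-field access when the candidate attribute's type is actually a struct; otherwise `a` must be a table alias.

```scala
sealed trait SimpleType
case object StringType extends SimpleType
case class StructType(fields: Map[String, SimpleType]) extends SimpleType

case class Attribute(name: String, dataType: SimpleType)

object ResolveSketch {
  // Resolve a dotted reference against the available attributes, mimicking
  // the proposed fix: treat "a.b" as field access only for struct attributes.
  def resolve(path: Seq[String], attrs: Seq[Attribute]): Option[String] =
    path match {
      case Seq(name) =>
        attrs.find(_.name == name).map(_.name)
      case Seq(qualifier, field) =>
        attrs.find(_.name == qualifier) match {
          // Only descend into the attribute when it is actually a struct.
          case Some(Attribute(_, StructType(fields))) if fields.contains(field) =>
            Some(s"$qualifier.$field")
          case _ =>
            // Otherwise "a.b" must mean column b of table alias a,
            // which is resolved elsewhere against the alias's output.
            None
        }
      case _ => None
    }
}
```

With a plain-string column `a`, `resolve(Seq("a", "b"), …)` correctly refuses struct-field resolution, which is the case the reported query exercises.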
[jira] [Comment Edited] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147513#comment-14147513 ] Ziv Huang edited comment on SPARK-3687 at 9/25/14 8:36 AM: --- Just a few minutes ago I ran a job twice, processing 203 sequence files. Both times I saw the job hanging with different behavior than before: 1. the Spark master web UI shows that the job finished with state FAILED after 3.x minutes; 2. the job stage web UI still hangs, and the execution duration is still accumulating. Hope this information helps debugging :) was (Author: taqilabon): Just a few minutes ago I ran a job twice, processing 203 sequence files. Both times I saw the job hanging with different behavior from before: 1. the Spark master web UI shows that the job finished with state FAILED after 3.x minutes; 2. the job stage web UI still hangs, and the execution duration is still accumulating. Hope this information helps debugging :) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform flatMap to get another JavaRDD, and then use takeOrdered to get the result. It is quite often (but not always) that Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 80.
[jira] [Commented] (SPARK-3651) Consolidate executor maps in CoarseGrainedSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147619#comment-14147619 ] Apache Spark commented on SPARK-3651: - User 'tigerquoll' has created a pull request for this issue: https://github.com/apache/spark/pull/2533 Consolidate executor maps in CoarseGrainedSchedulerBackend -- Key: SPARK-3651 URL: https://issues.apache.org/jira/browse/SPARK-3651 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Dale Richardson In CoarseGrainedSchedulerBackend, we have:
{code}
private val executorActor = new HashMap[String, ActorRef]
private val executorAddress = new HashMap[String, Address]
private val executorHost = new HashMap[String, String]
private val freeCores = new HashMap[String, Int]
private val totalCores = new HashMap[String, Int]
{code}
We only ever put / remove stuff from these maps together. It would simplify the code if we consolidate these all into one map as we have done in JobProgressListener in https://issues.apache.org/jira/browse/SPARK-2299.
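A sketch of the proposed consolidation: one value class per executor instead of five parallel maps. The `ExecutorData` name is an assumption (it is simply the natural choice here), and Akka's `ActorRef`/`Address` types are stubbed as type aliases to keep the sketch self-contained:

```scala
import scala.collection.mutable.HashMap

object ConsolidatedBackendSketch {
  // Stand-ins for akka.actor.ActorRef / akka.actor.Address.
  type ActorRef = AnyRef
  type Address = AnyRef

  // All per-executor state lives in one record.
  case class ExecutorData(
      actor: ActorRef,
      address: Address,
      host: String,
      var freeCores: Int,
      totalCores: Int)

  private val executorDataMap = new HashMap[String, ExecutorData]

  def register(id: String, data: ExecutorData): Unit =
    executorDataMap(id) = data

  // Removal now touches a single map, so entries can never get out of sync.
  def remove(id: String): Option[ExecutorData] =
    executorDataMap.remove(id)
}
```

The payoff is exactly the failure mode the issue describes: a put or remove can no longer update four maps and forget the fifth.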
[jira] [Created] (SPARK-3689) FileLogger should create new instance of FileSystem regardless of its scheme
Kousuke Saruta created SPARK-3689: - Summary: FileLogger should create new instance of FileSystem regardless of its scheme Key: SPARK-3689 URL: https://issues.apache.org/jira/browse/SPARK-3689 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta FileLogger creates a new instance of FileSystem to avoid the effect of FileSystem#close from another module, but only HDFS is expected. We can use another filesystem for the directory where the event log is stored.
{code}
if (scheme == "hdfs") {
  conf.setBoolean("fs.hdfs.impl.disable.cache", true)
}
{code}
[jira] [Commented] (SPARK-3689) FileLogger should create new instance of FileSystem regardless of its scheme
[ https://issues.apache.org/jira/browse/SPARK-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147689#comment-14147689 ] Apache Spark commented on SPARK-3689: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2534 FileLogger should create new instance of FileSystem regardless of its scheme - Key: SPARK-3689 URL: https://issues.apache.org/jira/browse/SPARK-3689 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta FileLogger creates a new instance of FileSystem to avoid the effect of FileSystem#close from another module, but only HDFS is expected. We can use another filesystem for the directory where the event log is stored.
{code}
if (scheme == "hdfs") {
  conf.setBoolean("fs.hdfs.impl.disable.cache", true)
}
{code}
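Hadoop's cache-disable key follows the pattern `fs.<scheme>.impl.disable.cache`, so the hard-coded hdfs check can be generalized. A hedged sketch (the surrounding FileLogger code is paraphrased, not the actual patch):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object FileLoggerSketch {
  // Disable the FileSystem cache for whatever scheme the log directory uses,
  // not just hdfs, so a FileSystem#close in another module cannot affect us.
  def newFileSystem(logDir: String, conf: Configuration): FileSystem = {
    val uri = new URI(logDir)
    val scheme = Option(uri.getScheme).getOrElse("file")
    conf.setBoolean(s"fs.$scheme.impl.disable.cache", true)
    FileSystem.get(uri, conf)
  }
}
```

With the cache disabled for that scheme, `FileSystem.get` returns a fresh instance rather than the shared cached one.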
[jira] [Commented] (SPARK-3682) Add helpful warnings to the UI
[ https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147730#comment-14147730 ] Arun Ahuja commented on SPARK-3682: --- We've been running into a lot of these issues, so this would be very helpful. Could you explain this one, however: "Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be decreased"? Thanks! Add helpful warnings to the UI -- Key: SPARK-3682 URL: https://issues.apache.org/jira/browse/SPARK-3682 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.1.0 Reporter: Sandy Ryza Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed Spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI that users could go to for indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide:
* Warn that the tasks in a particular stage are spending a long time in GC.
* Warn that spark.shuffle.memoryFraction does not fit inside the young generation.
* Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased.
* Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be decreased.
* Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it.
To start, probably two kinds of warnings would be most helpful.
* Warnings at the app level that report on misconfigurations, issues with the general health of executors.
* Warnings at the job level that indicate why a job might be performing slowly.
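As a concrete example of the kind of heuristic the first bullet implies, a GC-time warning might look like the sketch below. The 10% threshold, the `TaskSummary` shape, and the wording are illustrative assumptions, not an actual Spark API:

```scala
object StageHealthSketch {
  case class TaskSummary(durationMs: Long, gcTimeMs: Long)

  // Warn when tasks in a stage spend more than `threshold` of their time in GC.
  def gcWarning(tasks: Seq[TaskSummary], threshold: Double = 0.10): Option[String] = {
    val total = tasks.map(_.durationMs).sum
    val gc = tasks.map(_.gcTimeMs).sum
    if (total > 0 && gc.toDouble / total > threshold)
      Some(f"Tasks spent ${100.0 * gc / total}%.1f%% of their time in GC; " +
        "consider increasing executor memory or the young generation size.")
    else None
  }
}
```

Each proposed warning reduces to a predicate like this over already-collected task metrics, which is what makes a central "health" page feasible.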
[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module
[ https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147751#comment-14147751 ] Apache Spark commented on SPARK-3638: - User 'aniketbhatnagar' has created a pull request for this issue: https://github.com/apache/spark/pull/2535 Commons HTTP client dependency conflict in extras/kinesis-asl module Key: SPARK-3638 URL: https://issues.apache.org/jira/browse/SPARK-3638 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.1.0 Reporter: Aniket Bhatnagar Labels: dependencies Followed instructions as mentioned @ https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md and when running the example, I get the following error:
{code}
Caused by: java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
    at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:114)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:99)
    at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
    at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
    at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:181)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
    at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:136)
    at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:117)
    at com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.<init>(AmazonKinesisAsyncClient.java:132)
{code}
I believe this
is due to the dependency conflict as described @ http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E
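`NoSuchMethodError` on a constructor like this usually means an older httpclient won the dependency resolution. One common application-side workaround is to exclude the transitive httpclient and pin the version the AWS SDK needs; a hedged sbt sketch (version numbers are illustrative, not a verified fix for this issue):

```scala
// build.sbt fragment: force a single httpclient version so the AWS SDK's
// DefaultClientConnectionOperator constructor lookup succeeds at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.1.0"
    exclude("org.apache.httpcomponents", "httpclient"),
  // A 4.2+ httpclient provides the (SchemeRegistry, DnsResolver) constructor.
  "org.apache.httpcomponents" % "httpclient" % "4.2.6"
)
```

The proper fix, presumably what the pull request does, is to align the versions in Spark's own build rather than in every user application.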
[jira] [Commented] (SPARK-3639) Kinesis examples set master as local
[ https://issues.apache.org/jira/browse/SPARK-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147758#comment-14147758 ] Apache Spark commented on SPARK-3639: - User 'aniketbhatnagar' has created a pull request for this issue: https://github.com/apache/spark/pull/2536 Kinesis examples set master as local Key: SPARK-3639 URL: https://issues.apache.org/jira/browse/SPARK-3639 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.0.2, 1.1.0 Reporter: Aniket Bhatnagar Priority: Minor Labels: examples The Kinesis examples set the master as local, thus not allowing the examples to be tested on a cluster.
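The usual fix is to leave the master unset in the example and let spark-submit supply it. A sketch of the idea (the class name and app name are illustrative, not the actual example code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KinesisExampleSketch {
  def main(args: Array[String]): Unit = {
    // No setMaster("local[...]") here: the master comes from
    // spark-submit --master, so the example also runs on a cluster.
    val conf = new SparkConf().setAppName("KinesisWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))
    // ... set up the Kinesis stream and start the context ...
  }
}
```

Run locally with `spark-submit --master local[4] ...` or against a cluster with `--master spark://host:7077`, with no code change.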
[jira] [Created] (SPARK-3690) Closing shuffle writers we swallow more important exception
Egor Pakhomov created SPARK-3690: Summary: Closing shuffle writers we swallow more important exception Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask, line 75:
{quote}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{quote}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me.
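A sketch of the fix: catch and log any secondary failure from the cleanup path so the original task exception propagates. The helper name `withCleanup` and the use of stderr for logging are assumptions; the structure mirrors the quoted ShuffleMapTask handler:

```scala
object SuppressSketch {
  // Run `body`; on failure run `cleanup`, but never let a cleanup
  // exception mask the primary one.
  def withCleanup[T](body: => T)(cleanup: => Unit): T =
    try body
    catch {
      case e: Exception =>
        try cleanup
        catch {
          case suppressed: Exception =>
            // Log and drop: the original exception is the useful one.
            System.err.println(s"Exception during cleanup (suppressed): $suppressed")
        }
        throw e
    }
}
```

Applied to the quoted code, `writer.stop(success = false)` becomes the `cleanup` argument, so a FileNotFoundException from stop() is logged instead of replacing the task's real failure.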
[jira] [Updated] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pakhomov updated SPARK-3690: - Description: ShuffleMapTask, line 75:
{code:title=Bar.java|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. was: ShuffleMapTask, line 75:
{quote}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{quote}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask, line 75:
{code:title=Bar.java|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me.
[jira] [Updated] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Egor Pakhomov updated SPARK-3690: - Description: ShuffleMapTask, line 75:
{code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. was: ShuffleMapTask, line 75:
{code:title=Bar.java|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask, line 75:
{code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me.
[jira] [Commented] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode
[ https://issues.apache.org/jira/browse/SPARK-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147835#comment-14147835 ] Thomas Graves commented on SPARK-3678: -- Also note that the spark-submit --name option doesn't work in client mode, at least not for the Spark examples. Yarn app name reported in RM is different between cluster and client mode - Key: SPARK-3678 URL: https://issues.apache.org/jira/browse/SPARK-3678 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves If you launch an application in yarn-cluster mode, the name of the application in the ResourceManager generally shows up as the full name org.apache.spark.examples.SparkHdfsLR. If you start the same app in client mode, it shows up as SparkHdfsLR. We should be consistent between them. I haven't looked at it in detail; perhaps it's only the examples, but I think I've seen this with customer apps also.
[jira] [Issue Comment Deleted] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Comment: was deleted (was: Just a few mins ago I ran a job twice, processing 203 sequence files. Both times I saw the job hang with different behavior than before: 1. the web UI of the Spark master shows that the job finished with state FAILED after 3.x mins 2. the job stage web UI still hangs, and the execution duration is still accumulating. Hope this information helps debugging :)) Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatmap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), Spark hangs while executing some of the 120th-150th tasks. In 1.0.2, the job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't kill it from the web UI. In 1.1.0, the job hangs for a couple of minutes (3.x minutes, actually), and then the web UI of the Spark master shows that the job finished with state FAILED. In addition, the job stage web UI still hangs, and the execution duration is still accumulating. For both 1.0.2 and 1.1.0, the job hangs with no error messages anywhere. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ziv Huang updated SPARK-3687: - Description: In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatmap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), Spark hangs while executing some of the 120th-150th tasks. In 1.0.2, the job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't kill it from the web UI. In 1.1.0, the job hangs for a couple of minutes (3.x minutes, actually), and then the web UI of the Spark master shows that the job finished with state FAILED. In addition, the job stage web UI still hangs, and the execution duration is still accumulating. For both 1.0.2 and 1.1.0, the job hangs with no error messages anywhere. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 100. was: In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatmap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), Spark hangs while executing some of the 110th-130th tasks. The job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't find any error message anywhere, and I can't kill the job from the web UI. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 80.
Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatmap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), Spark hangs while executing some of the 120th-150th tasks. In 1.0.2, the job can hang for several hours, maybe forever (I can't wait for its completion). When the Spark job hangs, I can't kill it from the web UI. In 1.1.0, the job hangs for a couple of minutes (3.x minutes, actually), and then the web UI of the Spark master shows that the job finished with state FAILED. In addition, the job stage web UI still hangs, and the execution duration is still accumulating. For both 1.0.2 and 1.1.0, the job hangs with no error messages anywhere. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never get a hung job if the number of partitions to be processed is no greater than 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
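The coalesce workaround described above simply caps how many partitions a stage has to process. A toy, Spark-free sketch of what coalescing 203 inputs down to at most 80 partitions does to the partition count (pure Python, illustrative only; the real RDD.coalesce also tries to preserve data locality, which this ignores):

```python
def coalesce(partitions, max_parts):
    """Group an input list of partitions into at most max_parts buckets,
    mimicking the effect of RDD.coalesce(max_parts) on partition count."""
    n = min(max_parts, len(partitions))
    buckets = [[] for _ in range(n)]
    for i, p in enumerate(partitions):
        buckets[i % n].append(p)  # round-robin assignment
    return buckets

files = [f"part-{i:05d}" for i in range(203)]
grouped = coalesce(files, 80)
print(len(grouped))  # 80 partitions instead of 203
```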
[jira] [Comment Edited] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147504#comment-14147504 ] Ziv Huang edited comment on SPARK-3687 at 9/25/14 3:09 PM: --- The following is the jstack dump of one executor when it hangs (the spark version is 1.1.0):
{code}
"File appending thread for /opt/spark-1.1.0-bin-hadoop2.4/work/app-20140925150845-0007/2/stderr" daemon prio=10 tid=0x7ffe0c002800 nid=0x18a3 runnable [0x7ffebc402000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:272)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
	- locked <0xfaeee1d0> (a java.lang.UNIXProcess$ProcessPipeInputStream)
	at java.io.FilterInputStream.read(FilterInputStream.java:107)
	at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
	at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"File appending thread for /opt/spark-1.1.0-bin-hadoop2.4/work/app-20140925150845-0007/2/stdout" daemon prio=10 tid=0x7ffe0c004000 nid=0x18a2 runnable [0x7ffebc503000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:272)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
	- locked <0xfaeec108> (a java.lang.UNIXProcess$ProcessPipeInputStream)
	at java.io.FilterInputStream.read(FilterInputStream.java:107)
	at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
	at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
	at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)

"process reaper" daemon prio=10 tid=0x7ffe0c001000 nid=0x1868 runnable [0x7ffecc0c7000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.UNIXProcess.waitForProcessExit(Native Method)
	at java.lang.UNIXProcess.access$500(UNIXProcess.java:54)
	at java.lang.UNIXProcess$4.run(UNIXProcess.java:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

"ExecutorRunner for app-20140925150845-0007/2" daemon prio=10 tid=0x7ffe7011b800 nid=0x1866 in Object.wait() [0x7ffebc705000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0xfaee9df8> (a java.lang.UNIXProcess)
	at java.lang.Object.wait(Object.java:503)
	at java.lang.UNIXProcess.waitFor(UNIXProcess.java:263)
	- locked <0xfaee9df8> (a java.lang.UNIXProcess)
	at org.apache.spark.deploy.worker.ExecutorRunner.fetchAndRunExecutor(ExecutorRunner.scala:164)
	at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:63)

"Attach Listener" daemon prio=10 tid=0x7ffe84001000 nid=0x170f waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

"sparkWorker-akka.actor.default-dispatcher-16" daemon prio=10 tid=0x7ffe68214800 nid=0x13a3 waiting on condition [0x7ffebc806000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0xfd614a78> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"sparkWorker-akka.actor.default-dispatcher-15" daemon prio=10 tid=0x7ffe7011e000 nid=0x13a2 waiting on condition [0x7ffebc604000]
{code}
[jira] [Commented] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147864#comment-14147864 ] Egor Pakhomov commented on SPARK-3690: -- https://github.com/apache/spark/pull/2537 Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask: line 75
{code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for my problems for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147867#comment-14147867 ] Apache Spark commented on SPARK-3690: - User 'epahomov' has created a pull request for this issue: https://github.com/apache/spark/pull/2537 Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask: line 75
{code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code}
An exception in writer.stop() swallows the important one. I couldn't find the reason for my problems for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) - there are plenty of poor guys like me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2516) Bootstrapping
[ https://issues.apache.org/jira/browse/SPARK-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147875#comment-14147875 ] Yu Ishikawa commented on SPARK-2516: Hi [~mengxr], I would like to work on this issue, if possible. If you have any design for bootstrapping, could you share it with me? Bootstrapping - Key: SPARK-2516 URL: https://issues.apache.org/jira/browse/SPARK-2516 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Support re-sampling and bootstrap estimators in MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147918#comment-14147918 ] Arun Ahuja commented on SPARK-3633: --- Which timeout values were increased to work around this? We have been seeing many more errors with FetchFailed(BlockManagerId(21, and I also see the following exception with that failure:
{code}
java.io.IOException: Failed to list files for dir: /data/09/mapred/local/yarn/nm/usercache/ahujaa01/appcache/application_1403901413406_1926/spark-local-20140925115858-c4a7
	at org.apache.spark.util.Utils$.listFilesSafely(Utils.scala:673)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:685)
{code}
Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. It turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14147939#comment-14147939 ] Zhan Zhang commented on SPARK-3633: --- Increasing the timeout does not help my case either. I still keep getting fetch errors. Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. It turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148004#comment-14148004 ] Arun Ahuja commented on SPARK-3633: --- Also, which timeout setting was useful: spark.akka.timeout or spark.core.connection.ack.wait.timeout? Using GC logging, I see this both when there are many full GCs and, on smaller datasets, when there are not; it is much more frequent in the former case. Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. It turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148010#comment-14148010 ] Mayank Bansal commented on SPARK-3561: -- Hi guys, we at eBay are having some issues with cluster utilization while running Spark-on-YARN alongside Hadoop batch workloads. I think this would be nice to try out to see if we overcome this issue. We would be interested in trying this out. Thanks, Mayank Native Hadoop/YARN integration for batch/ETL workloads -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large-scale batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperApi) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
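The pluggable-context idea above can be sketched as an abstract interface plus a default implementation that an integrator could replace. This is only an illustration of the proposal's shape in Python, not the actual Scala trait; everything except the four listed operation names is hypothetical:

```python
from abc import ABC, abstractmethod

class JobExecutionContext(ABC):
    """Gateway to an execution environment; mirrors the 4 operations the
    proposal lists (hadoopFile, newAPIHadoopFile, broadcast, runJob)."""
    @abstractmethod
    def hadoop_file(self, path): ...
    @abstractmethod
    def new_api_hadoop_file(self, path): ...
    @abstractmethod
    def broadcast(self, value): ...
    @abstractmethod
    def run_job(self, func, data): ...

class DefaultExecutionContext(JobExecutionContext):
    """Default behavior; a custom integrator would subclass or replace this."""
    def hadoop_file(self, path):
        return f"reading {path} with the old Hadoop API"
    def new_api_hadoop_file(self, path):
        return f"reading {path} with the new Hadoop API"
    def broadcast(self, value):
        return value  # no cluster here: a broadcast is just the value
    def run_job(self, func, data):
        return [func(x) for x in data]

def load_context(master_url):
    """Pick the context from a master URL, loosely following the proposed
    'execution-context:foo.bar.MyJobExecutionContext' scheme (simplified)."""
    if master_url.startswith("execution-context:"):
        raise NotImplementedError("custom contexts are not wired up in this sketch")
    return DefaultExecutionContext()

ctx = load_context("local")
print(ctx.run_job(lambda x: x * x, [1, 2, 3]))  # [1, 4, 9]
```

The point of the indirection is that SparkContext-level methods delegate to whichever context the master URL selects, so swapping execution environments requires no changes to user code.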
[jira] [Commented] (SPARK-3633) Fetches failure observed after SPARK-2711
[ https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148026#comment-14148026 ] Nishkam Ravi commented on SPARK-3633: - Increasing the value of spark.core.connection.ack.wait.timeout (to 600) worked in my case. Fetches failure observed after SPARK-2711 - Key: SPARK-3633 URL: https://issues.apache.org/jira/browse/SPARK-3633 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.1.0 Reporter: Nishkam Ravi Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. Recently upgraded to Spark 1.1. The workload fails with the following error message(s):
{code}
14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
{code}
In order to identify the problem, I carried out change set analysis. As I go back in time, the error message changes to:
{code}
14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, c1706.halxg.cloudera.com): java.io.FileNotFoundException: /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
	java.io.FileOutputStream.open(Native Method)
	java.io.FileOutputStream.<init>(FileOutputStream.java:221)
	org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
	org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
	org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
	org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
	org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	org.apache.spark.scheduler.Task.run(Task.scala:54)
	org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	java.lang.Thread.run(Thread.java:745)
{code}
All the way until Aug 4th. It turns out the problem changeset is 4fde28c. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
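For reference, the workaround reported above is a plain configuration change. One way to apply it, using the value from this thread (600 seconds) rather than a recommended setting:

```
# spark-defaults.conf
spark.core.connection.ack.wait.timeout   600
```

The same value can also be passed per job with spark-submit --conf spark.core.connection.ack.wait.timeout=600.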
[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148052#comment-14148052 ] Aaron Staple commented on SPARK-546: Hi, I think there are two features requested in this ticket: 1) full outer join 2) an RDD function to join 2 rdds in a single shuffle (e.g. multiJoin function) I’ve implemented #1 in my recent PR, but not #2. I’m happy to implement #2 as well though. Would it make sense to reopen this ticket? File a new ticket? Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin Assignee: Aaron Staple Fix For: 1.2.0 RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
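Full outer join semantics (feature #1 above) can be stated compactly: keep every key from either side, padding the missing side with None. A minimal Spark-free Python sketch of those semantics (this is not Spark's cogroup-based implementation; the function name and representation are illustrative):

```python
def full_outer_join(left, right):
    """left, right: lists of (key, value) pairs.
    Returns {key: [(left_value, right_value), ...]} keeping keys from
    both sides, with None on whichever side lacks the key."""
    grouped = {}
    for k, v in left:
        grouped.setdefault(k, ([], []))[0].append(v)
    for k, v in right:
        grouped.setdefault(k, ([], []))[1].append(v)
    # Pad an empty side with [None] so every key yields at least one pair,
    # mirroring outer-join behavior.
    return {
        k: [(lv, rv) for lv in (ls or [None]) for rv in (rs or [None])]
        for k, (ls, rs) in grouped.items()
    }

pairs = full_outer_join([("a", 1), ("b", 2)], [("b", 20), ("c", 30)])
print(pairs)  # {'a': [(1, None)], 'b': [(2, 20)], 'c': [(None, 30)]}
```

Feature #2 (joining several RDDs in one shuffle) would generalize the same grouping step to N inputs before producing the padded tuples.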
[jira] [Created] (SPARK-3691) Provide a mini cluster for testing system built on Spark
Xuefu Zhang created SPARK-3691: -- Summary: Provide a mini cluster for testing system built on Spark Key: SPARK-3691 URL: https://issues.apache.org/jira/browse/SPARK-3691 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 1.1.0 Reporter: Xuefu Zhang Most Hadoop components, such as MR, DFS, Tez, and YARN, provide a mini cluster that can be used to test external systems that rely on those frameworks, such as Pig and Hive. While Spark's local mode can be used for such testing and is friendly for debugging, it's too far from a real Spark cluster, and a lot of problems cannot be discovered. Thus, an equivalent of the Hadoop MR mini cluster in Spark would be very helpful in testing systems such as Hive/Pig on Spark. Spark's local-cluster was considered for this purpose, but it doesn't fit well because it requires a Spark installation on the box where the tests run. Also, local-cluster isn't exposed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2932) Move MasterFailureTest out of main source directory
[ https://issues.apache.org/jira/browse/SPARK-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2932: - Fix Version/s: 1.2.0 Move MasterFailureTest out of main source directory - Key: SPARK-2932 URL: https://issues.apache.org/jira/browse/SPARK-2932 Project: Spark Issue Type: Task Components: Streaming Reporter: Marcelo Vanzin Priority: Trivial Fix For: 1.2.0 Currently, MasterFailureTest.scala lives in streaming/src/main, which means it ends up in the published streaming jar. It's only used by other test code, and although it also provides a main() entry point, that's also only usable for testing, so the code should probably be moved to the test directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2932) Move MasterFailureTest out of main source directory
[ https://issues.apache.org/jira/browse/SPARK-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-2932. -- Resolution: Fixed Move MasterFailureTest out of main source directory - Key: SPARK-2932 URL: https://issues.apache.org/jira/browse/SPARK-2932 Project: Spark Issue Type: Task Components: Streaming Reporter: Marcelo Vanzin Priority: Trivial Fix For: 1.2.0 Currently, MasterFailureTest.scala lives in streaming/src/main, which means it ends up in the published streaming jar. It's only used by other test code, and although it also provides a main() entry point, that's also only usable for testing, so the code should probably be moved to the test directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3691) Provide a mini cluster for testing system built on Spark
[ https://issues.apache.org/jira/browse/SPARK-3691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148071#comment-14148071 ] Xuefu Zhang commented on SPARK-3691: cc [~sandyr] Provide a mini cluster for testing system built on Spark Key: SPARK-3691 URL: https://issues.apache.org/jira/browse/SPARK-3691 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 1.1.0 Reporter: Xuefu Zhang Most Hadoop components, such as MR, DFS, Tez, and YARN, provide a mini cluster that can be used to test external systems that rely on those frameworks, such as Pig and Hive. While Spark's local mode can be used for such testing and is friendly for debugging, it's too far from a real Spark cluster, and a lot of problems cannot be discovered. Thus, an equivalent of the Hadoop MR mini cluster in Spark would be very helpful in testing systems such as Hive/Pig on Spark. Spark's local-cluster was considered for this purpose, but it doesn't fit well because it requires a Spark installation on the box where the tests run. Also, local-cluster isn't exposed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3691) Provide a mini cluster for testing system built on Spark
[ https://issues.apache.org/jira/browse/SPARK-3691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148071#comment-14148071 ] Xuefu Zhang edited comment on SPARK-3691 at 9/25/14 6:21 PM: - cc [~sandyr], [~rxin] was (Author: xuefuz): cc [~sandyr] Provide a mini cluster for testing system built on Spark Key: SPARK-3691 URL: https://issues.apache.org/jira/browse/SPARK-3691 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 1.1.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148208#comment-14148208 ] Josh Rosen edited comment on SPARK-1823 at 9/25/14 7:42 PM: SPARK-3074 is a related issue for large group-bys in PySpark. was (Author: joshrosen): SPARK-3074 is a related issue for PySpark. ExternalAppendOnlyMap can still OOM if one key is very large Key: SPARK-1823 URL: https://issues.apache.org/jira/browse/SPARK-1823 Project: Spark Issue Type: Bug Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Or If the values for one key do not collectively fit into memory, then the map will still OOM when you merge the spilled contents back in. This is a problem especially for PySpark, since we hash the keys (Python objects) before a shuffle, and there are only so many integers out there in the world, so there could potentially be many collisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148208#comment-14148208 ] Josh Rosen commented on SPARK-1823: --- SPARK-3074 is a related issue for PySpark. ExternalAppendOnlyMap can still OOM if one key is very large Key: SPARK-1823 URL: https://issues.apache.org/jira/browse/SPARK-1823 Project: Spark Issue Type: Bug Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
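The collision mechanism the ticket describes can be seen with nothing but Java's own String hashing: the strings "Aa" and "BB" famously share a hash code, so any scheme that groups records by a 32-bit hash of their key merges them into one bucket. A minimal, illustrative sketch (plain Java, not PySpark or Spark code):

```java
import java.util.*;

// Illustrative only: distinct keys can share a hash code, so grouping by hashed
// keys (as PySpark does before a shuffle) can pile unrelated keys into one bucket.
public class HashCollisionDemo {
    // Group strings by their 32-bit hash, the way a hash-based shuffle buckets keys.
    static Map<Integer, List<String>> groupByHash(List<String> keys) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String k : keys) {
            buckets.computeIfAbsent(k.hashCode(), h -> new ArrayList<>()).add(k);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" both hash to 2112 in Java, so they land in the same bucket;
        // "Cc" hashes differently and stays alone.
        Map<Integer, List<String>> buckets = groupByHash(Arrays.asList("Aa", "BB", "Cc"));
        System.out.println(buckets.get("Aa".hashCode()));
    }
}
```

If one such merged bucket holds many values, the whole bucket must be buffered together when spilled contents are merged back in, which is exactly the OOM scenario above.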
[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148244#comment-14148244 ] Andrew Ash commented on SPARK-2546: --- Another proposed fix: extend JobConf as a shim and replace the Hadoop one with one that's threadsafe Configuration object thread safety issue Key: SPARK-2546 URL: https://issues.apache.org/jira/browse/SPARK-2546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Reporter: Andrew Ash Assignee: Josh Rosen Priority: Critical // observed in 0.9.1 but expected to exist in 1.0.1 as well This ticket is copy-pasted from a thread on the dev@ list: {quote} We discovered a very interesting bug in Spark at work last week in Spark 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. Let me explain: Observations - Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files) - 412 of 413 tasks completed, but the last task was hung in RUNNING state - The 412 successful tasks completed in median time 3.4s - The last hung task didn't finish even in 20 hours - The executor with the hung task was responsible for 100% of one core of CPU usage - Jstack of the executor attached (relevant thread pasted below) Diagnosis After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark. The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. 
When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.lang.HashMap, which isn't threadsafe… The below post is an excellent explanation of what happens in the situation where multiple threads insert into a HashMap at the same time. http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html The gist is that you have a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace. So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object. Proposed fix We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus Configuration object). The optimization of broadcasting the Configuration object across the cluster can remain, but on the other side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB so I would expect a clone of the object to be relatively speedy. Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix. Thanks! Andrew P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines. 
{noformat} Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193) at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:436) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:403) at
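The clone-per-task fix proposed in the thread can be sketched with a plain map standing in for Hadoop's Configuration (which is likewise HashMap-backed and not thread-safe). All names below are hypothetical stand-ins, not Spark's actual internals:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the clone-per-task fix from the thread: each task mutates its own
// copy of the configuration, so the shared HashMap is never written concurrently.
public class ClonedConfDemo {
    // One shared, broadcast-style configuration (stand-in for a Configuration/JobConf).
    static final Map<String, String> sharedConf =
        new HashMap<>(Map.of("fs.default.name", "hdfs://nn:8020"));

    // The fix: clone before mutating, the way HadoopRDD.getJobConf() was proposed to.
    static Map<String, String> confForTask(int taskId) {
        Map<String, String> taskConf = new HashMap<>(sharedConf);  // cheap defensive copy (~10KB per the thread)
        taskConf.put("task.id", String.valueOf(taskId));           // task-local mutation, like AvroRecordReader's RPC key
        return taskConf;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Map<String, String>>> tasks = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int id = i;
            tasks.add(pool.submit(() -> confForTask(id)));
        }
        for (Future<Map<String, String>> f : tasks) f.get();
        pool.shutdown();
        // The shared map was never written by the task threads, so no corrupted buckets
        // and no thread stuck looping over a cyclic HashMap chain.
        System.out.println("shared entries: " + sharedConf.size());
    }
}
```

The race in the report only needs two threads resizing the same HashMap at once; cloning removes the shared mutable state entirely rather than adding locking around it.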
[jira] [Commented] (SPARK-1241) Support sliding in RDD
[ https://issues.apache.org/jira/browse/SPARK-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148281#comment-14148281 ] Frens Jan Rumph commented on SPARK-1241: Hi, I'm investigating use of Spark for matching patterns in symbolically represented time series, where the sliding functionality available on Scala iterators would make life a lot easier. From the ticket I'd say this functionality is implemented (status resolved, fix version 1.0.0, ...), but I can't find it in the docs and the PR indicates that this functionality hasn't made it into Spark (just yet ...). Is this functionality available? Cheers, Frens Background: I want to compare strings of segments to other (larger) strings of segments. As segment strings may be split up over partitions, the more straightforward approaches I could come up with don't work. Support sliding in RDD -- Key: SPARK-1241 URL: https://issues.apache.org/jira/browse/SPARK-1241 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 Sliding is useful for operations like creating n-grams, calculating total variation, numerical integration, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
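For readers wondering what semantics the ticket asks for: Scala's Iterator.sliding(n) yields every contiguous window of n elements, and the implementation appears to have landed in MLlib's RDDFunctions rather than Spark core, which would explain why it is missing from the core docs. A local, plain-Java sketch of the windowing itself (illustrative, not the Spark API):

```java
import java.util.*;

// Local illustration of "sliding" semantics: every contiguous window of n
// elements, useful e.g. for n-grams over a token stream.
public class SlidingDemo {
    static <T> List<List<T>> sliding(List<T> xs, int n) {
        List<List<T>> out = new ArrayList<>();
        // One window per start index, stopping when a full window no longer fits.
        for (int i = 0; i + n <= xs.size(); i++) {
            out.add(new ArrayList<>(xs.subList(i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Bigrams over a token stream: [a, b], [b, c], [c, d]
        System.out.println(sliding(Arrays.asList("a", "b", "c", "d"), 2));
    }
}
```

On an RDD the hard part is exactly the commenter's concern: windows that straddle partition boundaries, which is why a distributed implementation must exchange the first n-1 elements of each partition with its predecessor.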
[jira] [Created] (SPARK-3692) RBF Kernel implementation to SVM
Ekrem Aksoy created SPARK-3692: -- Summary: RBF Kernel implementation to SVM Key: SPARK-3692 URL: https://issues.apache.org/jira/browse/SPARK-3692 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ekrem Aksoy Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3692) RBF Kernel implementation to SVM
[ https://issues.apache.org/jira/browse/SPARK-3692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekrem Aksoy updated SPARK-3692: --- Description: The Radial Basis Function is another type of kernel that can be used instead of a linear kernel in SVM. RBF Kernel implementation to SVM Key: SPARK-3692 URL: https://issues.apache.org/jira/browse/SPARK-3692 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ekrem Aksoy Priority: Minor The Radial Basis Function is another type of kernel that can be used instead of a linear kernel in SVM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
Xuefu Zhang created SPARK-3693: -- Summary: Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang While trying RDD caching, it's found that caching a Hadoop RDD causes data correctness issues. The following code snippet demonstrates the usage: {code} public final class Test { private static final Pattern SPACE = Pattern.compile(" "); public static void main(String[] args) throws Exception { SparkConf sparkConf = new SparkConf().setAppName("Test"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); ... JavaPairRDD<BytesWritable, BytesWritable> input = ctx.hadoopRDD(jobConf, CombineHiveInputClass.class, WritableComparable.class, Writable.class); input = input.cache(); input.foreach(new VoidFunction<Tuple2<BytesWritable, BytesWritable>>() { @Override public void call(Tuple2<BytesWritable, BytesWritable> row) throws Exception { if (row._1() != null) { System.out.println("Key: " + row._1()); } if (row._2() != null) { System.out.println("Value: " + row._2()); } } }); ctx.stop(); } } {code} In this case, row._2() always gives the same value. If we disable caching by removing input.cache(), the program gives the expected rows. Further analysis shows that MemoryStore (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L236) is storing references to the (key, value) pairs returned by HadoopRDD.getNext() (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L220), but this method always returns the same (key, value) object references; each getNext() call only updates the values inside those objects. When there are no more records, the (key, value) objects are filled with empty strings (no values) in CombineFileRecordReader. As all pairs in MemoryStore.vector refer to the same (key, value) objects, all values become NULL. MemoryStore should probably store a copy of the (key, value) pair rather than keeping a reference to it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
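The reuse mechanism in this report can be reproduced without Hadoop at all. The sketch below models a RecordReader that mutates one record object in place, and shows why caching references yields identical rows while copying first does not. All names are illustrative, not Spark's:

```java
import java.util.*;

// Minimal model of the bug: like Hadoop's RecordReader, this "reader" reuses one
// mutable record object and only overwrites its contents on each next().
public class ReusedRecordDemo {
    static class Record { String value; }

    static List<Record> readAndCache(String[] data, boolean copy) {
        Record reused = new Record();
        List<Record> cache = new ArrayList<>();
        for (String v : data) {
            reused.value = v;                  // getNext() mutates the record in place
            if (copy) {
                Record r = new Record();       // copy-before-cache workaround
                r.value = reused.value;
                cache.add(r);
            } else {
                cache.add(reused);             // MemoryStore-style: cache the reference
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        String[] rows = {"r1", "r2", "r3"};
        // Without copying, every cached element is the same object, so every row
        // shows the final value -- the "same value for every row" symptom above.
        for (Record r : readAndCache(rows, false)) System.out.println("no copy: " + r.value);
        for (Record r : readAndCache(rows, true))  System.out.println("copied:  " + r.value);
    }
}
```

In Spark terms, the copy branch corresponds to mapping each Writable to a fresh object before calling cache(), which is the workaround discussed in the comments on this issue.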
[jira] [Updated] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated SPARK-3693: --- Description: While trying RDD caching, it's found that caching a Hadoop RDD causes data correctness issues. The following code snippet demonstrates the usage: {code} public final class Test { public static void main(String[] args) throws Exception { SparkConf sparkConf = new SparkConf().setAppName("Test"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); ... JavaPairRDD<BytesWritable, BytesWritable> input = ctx.hadoopRDD(jobConf, CombineHiveInputClass.class, WritableComparable.class, Writable.class); input = input.cache(); input.foreach(new VoidFunction<Tuple2<BytesWritable, BytesWritable>>() { @Override public void call(Tuple2<BytesWritable, BytesWritable> row) throws Exception { if (row._1() != null) { System.out.println("Key: " + row._1()); } if (row._2() != null) { System.out.println("Value: " + row._2()); } } }); ctx.stop(); } } {code} In this case, row._2() always gives the same value. If we disable caching by removing input.cache(), the program gives the expected rows. Further analysis shows that MemoryStore (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L236) is storing references to the (key, value) pairs returned by HadoopRDD.getNext() (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L220), but this method always returns the same (key, value) object references; each getNext() call only updates the values inside those objects. When there are no more records, the (key, value) objects are filled with empty strings (no values) in CombineFileRecordReader. As all pairs in MemoryStore.vector refer to the same (key, value) objects, all values become NULL. MemoryStore should probably store a copy of the (key, value) pair rather than keeping a reference to it. 
was: (the same description, except that it also contained the unused line {{private static final Pattern SPACE = Pattern.compile(" ");}} at the top of Test)
Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang
[jira] [Commented] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148295#comment-14148295 ] Xuefu Zhang commented on SPARK-3693: cc [~rxin], [~sandyr] Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148303#comment-14148303 ] Sandy Ryza commented on SPARK-3693: --- Spark's documentation actually makes a note of this. {code} * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each * record, directly caching the returned RDD will create many references to the same object. * If you plan to directly cache Hadoop writable objects, you should first copy them using * a `map` function. {code} Having MemoryStore make a copy would degrade performance in situations where a copy isn't needed. While it's not pretty, I don't see a better approach than the current behavior, where the user needs to explicitly make a copy. Though we could possibly provide some utility to help if it requires a lot of user boilerplate? Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148306#comment-14148306 ] Reynold Xin commented on SPARK-3693: Just responded to you offline as well. This is a known issue. The problem is that not all record types implement copy. As a matter of fact, even some common ones like Avro records don't handle it well. For now the workaround is really for the user to call copy, since the user knows what record type they are using. Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3693. Resolution: Duplicate Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3693) Cached Hadoop RDD always return rows with the same value
[ https://issues.apache.org/jira/browse/SPARK-3693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148318#comment-14148318 ] Xuefu Zhang commented on SPARK-3693: Thanks, guys. We are fine with the workaround. Cached Hadoop RDD always return rows with the same value Key: SPARK-3693 URL: https://issues.apache.org/jira/browse/SPARK-3693 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xuefu Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners
[ https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Staple resolved SPARK-3550. - Resolution: Fixed Disable automatic rdd caching in python api for relevant learners - Key: SPARK-3550 URL: https://issues.apache.org/jira/browse/SPARK-3550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple The python mllib api automatically caches training rdds. However, the NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs. For these learners, we should disable the automatic caching in the python mllib api. See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-55637953 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners
[ https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148347#comment-14148347 ] Aaron Staple commented on SPARK-3550: - This has been addressed in another commit: https://github.com/apache/spark/commit/fce5e251d636c788cda91345867e0294280c074d See comment here: https://github.com/apache/spark/pull/2412#issuecomment-56865408 Disable automatic rdd caching in python api for relevant learners - Key: SPARK-3550 URL: https://issues.apache.org/jira/browse/SPARK-3550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple The python mllib api automatically caches training rdds. However, the NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs. For these learners, we should disable the automatic caching in the python mllib api. See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-55637953 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3488) cache deserialized python RDDs before iterative learning
[ https://issues.apache.org/jira/browse/SPARK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Staple resolved SPARK-3488. - Resolution: Won't Fix cache deserialized python RDDs before iterative learning Key: SPARK-3488 URL: https://issues.apache.org/jira/browse/SPARK-3488 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple When running an iterative learning algorithm, it makes sense that the input RDD be cached for improved performance. When learning is applied to a python RDD, currently the python RDD is always cached, then in scala that cached RDD is mapped to an uncached deserialized RDD, and the uncached RDD is passed to the learning algorithm. Instead the deserialized RDD should be cached. This was originally discussed here: https://github.com/apache/spark/pull/2347#issuecomment-55181535 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148365#comment-14148365 ] Josh Rosen commented on SPARK-3690: --- For additional context, here's the mailing list thread: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-FileNotFoundException-in-usercache-td15135.html Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask: line 75 {code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code} The exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) and you'll find plenty of poor guys like me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3690) Closing shuffle writers we swallow more important exception
[ https://issues.apache.org/jira/browse/SPARK-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3690. --- Resolution: Fixed Issue resolved by pull request 2537 [https://github.com/apache/spark/pull/2537] Closing shuffle writers we swallow more important exception --- Key: SPARK-3690 URL: https://issues.apache.org/jira/browse/SPARK-3690 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Egor Pakhomov Priority: Minor Fix For: 1.2.0 ShuffleMapTask: line 75 {code:title=ShuffleMapTask|borderStyle=solid}
case e: Exception =>
  if (writer != null) {
    writer.stop(success = false)
  }
  throw e
{code} The exception in writer.stop() swallows the important one. I couldn't find the reason for the problem for days. Search the internet for java.io.FileNotFoundException: /local/hd2/yarn/local/usercache/epahomov/appcache/application_1411219858924_12991/spark-local-20140924225309-03f5/21/shuffle_4_12_147 (No such file or directory) and you'll find plenty of poor guys like me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
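The bug pattern behind SPARK-3690 is generic: when cleanup code inside an error handler itself throws, the new exception replaces the original one, hiding the root cause. A small Python sketch of the problem and one common fix (catching and merely recording cleanup failures); `writer_stop` is a made-up stand-in for `writer.stop(success = false)`, not a Spark API:

```python
def writer_stop():
    # Stand-in for writer.stop(success=False), which itself fails here.
    raise OSError("shuffle file not found")


def task_swallowing():
    """The buggy shape: cleanup in the handler masks the real failure."""
    try:
        raise RuntimeError("the real failure")
    except RuntimeError:
        writer_stop()  # raises OSError, replacing the original exception
        raise


def task_preserving():
    """The fix: don't let a failing cleanup mask the root cause."""
    try:
        raise RuntimeError("the real failure")
    except RuntimeError:
        try:
            writer_stop()
        except OSError as cleanup_err:
            # Record the secondary failure instead of propagating it.
            print("ignoring cleanup failure:", cleanup_err)
        raise
```

Calling `task_swallowing()` surfaces the OSError from cleanup, exactly like the FileNotFoundException hiding the real error for days; `task_preserving()` re-raises the original RuntimeError. (On the JVM, `Throwable.addSuppressed` serves the same purpose.)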
[jira] [Updated] (SPARK-3661) spark.driver.memory is ignored in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3661: - Priority: Critical (was: Major) spark.driver.memory is ignored in cluster mode -- Key: SPARK-3661 URL: https://issues.apache.org/jira/browse/SPARK-3661 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical This is related to https://issues.apache.org/jira/browse/SPARK-3653, but for the config. Note that `spark.executor.memory` is fine because we pass the Spark system properties to the driver after it has started. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3682) Add helpful warnings to the UI
[ https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3682: -- Description: Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI users could go to that would provide indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide: * Warn that the tasks in a particular stage are spending a long time in GC. * Warn that spark.shuffle.memoryFraction does not fit inside the young generation. * Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased. * Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be increased. * Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it. To start, probably two kinds of warnings would be most helpful. * Warnings at the app level that report on misconfigurations, issues with the general health of executors. * Warnings at the job level that indicate why a job might be performing slowly. was: Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. 
It would be helpful to have some sort of central location on the UI users could go to that would provide indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide: * Warn that the tasks in a particular stage are spending a long time in GC. * Warn that spark.shuffle.memoryFraction does not fit inside the young generation. * Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased. * Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be decreased. * Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it. To start, probably two kinds of warnings would be most helpful. * Warnings at the app level that report on misconfigurations, issues with the general health of executors. * Warnings at the job level that indicate why a job might be performing slowly. Add helpful warnings to the UI -- Key: SPARK-3682 URL: https://issues.apache.org/jira/browse/SPARK-3682 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.1.0 Reporter: Sandy Ryza Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI users could go to that would provide indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide: * Warn that the tasks in a particular stage are spending a long time in GC. * Warn that spark.shuffle.memoryFraction does not fit inside the young generation. 
* Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased. * Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be increased. * Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it. To start, probably two kinds of warnings would be most helpful. * Warnings at the app level that report on misconfigurations, issues with the general health of executors. * Warnings at the job level that indicate why a job might be performing slowly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3682) Add helpful warnings to the UI
[ https://issues.apache.org/jira/browse/SPARK-3682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148379#comment-14148379 ] Sandy Ryza commented on SPARK-3682: --- Oops, that should have read increased. When a task fetches more shuffle data from the previous stage than it can fit in memory, it needs to spill the extra data to disk. Increasing the number of partitions makes it so that each task will be responsible for dealing with less data and will need to spill less. Add helpful warnings to the UI -- Key: SPARK-3682 URL: https://issues.apache.org/jira/browse/SPARK-3682 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.1.0 Reporter: Sandy Ryza Spark has a zillion configuration options and a zillion different things that can go wrong with a job. Improvements like incremental and better metrics and the proposed spark replay debugger provide more insight into what's going on under the covers. However, it's difficult for non-advanced users to synthesize this information and understand where to direct their attention. It would be helpful to have some sort of central location on the UI users could go to that would provide indications about why an app/job is failing or performing poorly. Some helpful messages that we could provide: * Warn that the tasks in a particular stage are spending a long time in GC. * Warn that spark.shuffle.memoryFraction does not fit inside the young generation. * Warn that tasks in a particular stage are very short, and that the number of partitions should probably be decreased. * Warn that tasks in a particular stage are spilling a lot, and that the number of partitions should probably be increased. * Warn that a cached RDD that gets a lot of use does not fit in memory, and a lot of time is being spent recomputing it. To start, probably two kinds of warnings would be most helpful. 
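The reasoning in that comment — more partitions means each task fetches less shuffle data, so it spills less — can be sketched with back-of-the-envelope arithmetic. All numbers and the `spilled_bytes` model below are illustrative, not Spark's actual accounting:

```python
def spilled_bytes(total_shuffle_bytes, num_partitions, task_memory_bytes):
    """Rough model: each task fetches total/partitions bytes and
    spills to disk whatever exceeds its memory budget."""
    per_task = total_shuffle_bytes / num_partitions
    return max(0, per_task - task_memory_bytes)


total = 100 * 1024**3   # 100 GiB of shuffle data from the previous stage
mem = 512 * 1024**2     # 512 MiB of memory per task
# With 100 partitions each task handles 1 GiB and spills half of it;
# with 400 partitions each task handles 256 MiB and spills nothing.
few = spilled_bytes(total, 100, mem)
many = spilled_bytes(total, 400, mem)
```

Under this toy model, quadrupling the partition count takes each task from spilling 512 MiB to spilling nothing — the intuition behind the "increase the number of partitions" warning.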
* Warnings at the app level that report on misconfigurations, issues with the general health of executors. * Warnings at the job level that indicate why a job might be performing slowly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148385#comment-14148385 ] Davies Liu commented on SPARK-2377: --- [~giwa] I have also started to work on this (based on your branch) and will send out a WIP PR soon. Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas Assignee: Kenichi Takagiwa [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be a great feature to add a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
Patrick Wendell created SPARK-3694: -- Summary: Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Patrick Wendell This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3694) Allow printing object graph of tasks/RDD's with a debug flag
[ https://issues.apache.org/jira/browse/SPARK-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3694: --- Description: This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. was: This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html Allow printing object graph of tasks/RDD's with a debug flag Key: SPARK-3694 URL: https://issues.apache.org/jira/browse/SPARK-3694 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Patrick Wendell This would be useful for debugging extra references inside of RDDs. Here is an example for inspiration: http://ehcache.org/xref/net/sf/ehcache/pool/sizeof/ObjectGraphWalker.html We'd want to print this trace for both the RDD serialization inside of the DAGScheduler and the task serialization in the TaskSetManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
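For illustration, a walker in the spirit of the linked ObjectGraphWalker can be sketched in a few lines of Python: a breadth-first search over object references, collecting everything reachable from a closure- or task-like root. The `Closure` class and `reachable_objects` helper are hypothetical stand-ins, not the proposed Spark implementation:

```python
from collections import deque


def reachable_objects(root):
    """BFS over an object graph via __dict__ and common containers,
    returning every object reachable from `root`."""
    seen, queue = {id(root)}, deque([root])
    found = []
    while queue:
        obj = queue.popleft()
        found.append(obj)
        children = []
        if hasattr(obj, "__dict__"):
            children.extend(obj.__dict__.values())
        elif isinstance(obj, (list, tuple, set)):
            children.extend(obj)
        elif isinstance(obj, dict):
            children.extend(obj.values())
        for child in children:
            if id(child) not in seen:   # dedupe by identity to handle cycles
                seen.add(id(child))
                queue.append(child)
    return found


class Closure:
    def __init__(self):
        self.big_ref = [0, 1, 2, 3, 4]  # an "extra reference" we want to spot


# Walking a closure-like object surfaces the accidentally captured list.
objs = reachable_objects(Closure())
```

Printing such a trace at serialization time would show exactly which extra references a task or RDD closure is dragging along.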
[jira] [Comment Edited] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148411#comment-14148411 ] Andrew Ash edited comment on SPARK-3032 at 9/25/14 10:35 PM: - This bug prevents people from doing further testing of sort-based shuffle on the rest of the 1.1.x series. Is this a good candidate for a backport to 1.1.1 or a later 1.1 hotfix? was (Author: aash): This bug prevents people from doing testing of sort-based shuffle on the rest of the 1.1.x series. Is this a good candidate for a backport to 1.1? Potential bug when running sort-based shuffle with sorting using TimSort Key: SPARK-3032 URL: https://issues.apache.org/jira/browse/SPARK-3032 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Blocker When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, with (String, String) as the key and value types, we always hit this issue: {noformat} java.lang.IllegalArgumentException: Comparison method violates its general contract! 
at org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) at org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) at org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) at org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} It seems the current partitionKeyComparator, which uses the String hashcode as the key comparator, breaks some sorting contracts. I also tested using Int as the key type, and that passes the test, since the hashcode of an Int is itself. So potentially partitionDiff + the String hashcode may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148411#comment-14148411 ] Andrew Ash commented on SPARK-3032: --- This bug prevents people from doing testing of sort-based shuffle on the rest of the 1.1.x series. Is this a good candidate for a backport to 1.1? Potential bug when running sort-based shuffle with sorting using TimSort Key: SPARK-3032 URL: https://issues.apache.org/jira/browse/SPARK-3032 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Blocker When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, with (String, String) as the key and value types, we always hit this issue: {noformat} java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) at org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) at org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) at org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at 
org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} It seems the current partitionKeyComparator, which uses the String hashcode as the key comparator, breaks some sorting contracts. I also tested using Int as the key type, and that passes the test, since the hashcode of an Int is itself. So potentially partitionDiff + the String hashcode may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
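The "Comparison method violates its general contract" error is TimSort detecting that a comparator is not a valid total order. One classic way a hashcode-based comparator breaks transitivity on the JVM is 32-bit overflow in the `a.hashCode() - b.hashCode()` idiom; the following Python snippet simulates Java's wraparound to show the effect (an illustration of the failure class, not Spark's actual comparator):

```python
def to_int32(x):
    """Wrap an integer to Java's 32-bit signed int range."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def subtract_compare(a, b):
    # The buggy pattern: compare hash codes by subtraction, which can
    # overflow a 32-bit int and flip the sign of the result.
    return to_int32(a - b)


a, b, c = -2_000_000_000, 0, 2_000_000_000
# subtract_compare says a < b and b < c, but a - c = -4e9 wraps around to a
# positive 32-bit value, so the comparator claims a > c. Transitivity is
# broken, and Java's TimSort throws IllegalArgumentException when it
# notices such an inconsistency during a merge.
```

With these hash values the three pairwise comparisons cannot all be satisfied by any total order, which is exactly the condition the exception reports.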
[jira] [Resolved] (SPARK-1484) MLlib should warn if you are using an iterative algorithm on non-cached data
[ https://issues.apache.org/jira/browse/SPARK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1484. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2347 [https://github.com/apache/spark/pull/2347] MLlib should warn if you are using an iterative algorithm on non-cached data Key: SPARK-1484 URL: https://issues.apache.org/jira/browse/SPARK-1484 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Matei Zaharia Fix For: 1.2.0 Not sure what the best way to warn is, but even printing to the log is probably fine. We may want to print at the end of the training run as well as the beginning to make it more visible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1484) MLlib should warn if you are using an iterative algorithm on non-cached data
[ https://issues.apache.org/jira/browse/SPARK-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1484: - Assignee: Aaron Staple MLlib should warn if you are using an iterative algorithm on non-cached data Key: SPARK-1484 URL: https://issues.apache.org/jira/browse/SPARK-1484 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Matei Zaharia Assignee: Aaron Staple Fix For: 1.2.0 Not sure what the best way to warn is, but even printing to the log is probably fine. We may want to print at the end of the training run as well as the beginning to make it more visible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3584) sbin/slaves doesn't work when we use password authentication for SSH
[ https://issues.apache.org/jira/browse/SPARK-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3584. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Kousuke Saruta Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) Resolved by: https://github.com/apache/spark/pull/2444 sbin/slaves doesn't work when we use password authentication for SSH Key: SPARK-3584 URL: https://issues.apache.org/jira/browse/SPARK-3584 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.2.0 In sbin/slaves, the ssh command runs in the background, but with password authentication a backgrounded ssh command doesn't work, so sbin/slaves doesn't work. I also suggest an improvement for sbin/slaves. In the current implementation, the slaves file is tracked by Git, but it can be edited by users, so we should provide slaves.template instead of slaves. The default slaves file has one entry, localhost, so we should use localhost as the default host list. I modified sbin/slaves to choose localhost as the default host list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2377) Create a Python API for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148512#comment-14148512 ] Apache Spark commented on SPARK-2377: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2538 Create a Python API for Spark Streaming --- Key: SPARK-2377 URL: https://issues.apache.org/jira/browse/SPARK-2377 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Nicholas Chammas Assignee: Kenichi Takagiwa [Spark Streaming|http://spark.apache.org/docs/latest/streaming-programming-guide.html] currently offers APIs in Scala and Java. It would be a great feature to add a Python API as well. This is probably a large task that will span many issues if undertaken. This ticket should provide some place to track overall progress towards an initial Python API for Spark Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148529#comment-14148529 ] Xiangrui Meng commented on SPARK-1405: -- [~Guoqiang Li] and [~pedrorodriguez], since there are already 4~5 implementations of LDA on Spark and [~dlwh] is also interested in one with partial labels, we do need to coordinate to avoid duplicate effort. I think the TODOs are: 0. Make progress updates frequently. 1. Test Joey's implementation and Guoqiang's (both on GraphX) on some common datasets. We also need to verify the correctness of the output, by comparing the result with some single machine solvers. 2. Discuss the public APIs in MLlib. Because GraphX is an alpha component, we should not expose GraphX APIs in MLlib. See my previous comments on the input and model types. 3. Have a standard implementation of LDA with Gibbs Sampling in MLlib. The target is v1.2, which means it should be merged by the end of Nov. Improvements can be made in future releases. Could you share your timeline? Thanks! parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Xusen Yin Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1405: - Assignee: Guoqiang Li (was: Xusen Yin) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1405: - Shepherd: Xiangrui Meng parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1241) Support sliding in RDD
[ https://issues.apache.org/jira/browse/SPARK-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148549#comment-14148549 ] Xiangrui Meng commented on SPARK-1241: -- This is implemented in MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala . You can check the discussion about where to put it here: https://github.com/apache/spark/pull/136 . Support sliding in RDD -- Key: SPARK-1241 URL: https://issues.apache.org/jira/browse/SPARK-1241 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 Sliding is useful for operations like creating n-grams, calculating total variation, numerical integration, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
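The sliding operation itself can be sketched independently of Spark: a window of fixed size slid one element at a time over a sequence, e.g. for building n-grams. A plain-Python illustration of the semantics (not the SlidingRDD API, which additionally handles windows that straddle partition boundaries):

```python
def sliding(seq, window):
    """All contiguous windows of the given size, in order."""
    return [tuple(seq[i:i + window]) for i in range(len(seq) - window + 1)]


# Trigrams over a token stream:
tokens = ["to", "be", "or", "not", "to", "be"]
trigrams = sliding(tokens, 3)
# The same primitive supports total variation (windows of size 2 and
# pairwise differences) and simple numerical integration schemes.
```

The distributed version has to do extra work only at partition edges, where a window needs the first `window - 1` elements of the next partition.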
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148555#comment-14148555 ] Pedro Rodriguez commented on SPARK-1405: [~mengxr], definitely a good idea to be coordinated about it. I have been working with Evan so have been giving status updates and making todos with him. I will post here on progress updates as well. I have been working on creating a design doc/reference which you can find here: https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing It is, for a large part, a way for us/me to keep notes while working, but I would like to take some of it and convert it to documentation. It primarily contains 1. Relevant links to papers/code/repositories 2. Thorough explanation/documentation of LDA and motivation behind the graph implementation (Joey's version) 3. Testing steps (which data sets on what) 4. Current todos (perhaps we should post them here primarily and update doc for consistency). 1. Currently I am working on testing (unit test functions and correctness testing), refactoring, and extending Joey's implementation. The objective for this week is to have the mini-test running (a set of ~10 documents which acts as a sanity check). Goal for early next week is to be running on NIPS. I think the majority of time to get there will be putting the dataset in a parseable format (remove equations, stop words...) and ensuring that the result looks correct. To that end, we plan on running the same datasets through Graphlab for benchmarking machine/ML performance and a python implementation for ML performance/correctness. Once we are there, the plan is to start looking at running on wikipedia. 2. 
The code I am currently working on lives here: https://github.com/EntilZha/spark https://github.com/EntilZha/spark/blob/LDA/graphx/src/main/scala/org/apache/spark/graphx/lib/TopicModeling.scala which is within GraphX, with the other graph based algorithms. 3. Prior to knowing about Joey's graph implementation, I wrote my own for a final project. I stopped working on it since the graph implementation should be more performant. Probably a good point of discussion if there should be a standard and graph implementation together. When you reference standard implementation, is there a particular implementation you are referring to that I can look at? TLDR timeline: End of this week: mini-dataset for sanity check + refactoring code + unit testing Next week: Format NIPS for input + run NIPS data set on Spark, GraphLab, and Python LDA. I will be away from Berkeley at a conference, but hope to still get those done. From there, we would like to get running on larger datasets for performance testing. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Labels: features Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
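The "Gibbs sampling core" mentioned above ultimately reduces, per word token, to drawing a topic index from an unnormalized categorical distribution. As a hedged sketch (in Java, not taken from any of the implementations discussed here), this is the inner sampling step; the weight formula in the comment is the standard collapsed-Gibbs conditional, stated for context only:

```java
import java.util.Random;

public class GibbsStep {
    // Draw index i with probability proportional to weights[i]. In collapsed
    // Gibbs for LDA, weights[k] is roughly (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta),
    // where the n_* are count statistics excluding the current token.
    static int sampleIndex(double[] weights, Random rng) {
        double total = 0;
        for (double w : weights) total += w;
        double u = rng.nextDouble() * total;
        double acc = 0;
        for (int i = 0; i < weights.length; i++) {
            acc += weights[i];
            if (u < acc) return i;
        }
        return weights.length - 1; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] w = {0.1, 0.7, 0.2};
        int[] counts = new int[3];
        for (int i = 0; i < 10000; i++) counts[sampleIndex(w, rng)]++;
        // Index 1 should dominate, matching its weight.
        System.out.println(counts[1] > counts[0] && counts[1] > counts[2]);
    }
}
```

A sampler pass resamples every token's topic this way and updates the count tables; the graph-based version distributes those count tables over GraphX vertices.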
[jira] [Resolved] (SPARK-2634) MapOutputTrackerWorker.mapStatuses should be thread-safe
[ https://issues.apache.org/jira/browse/SPARK-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2634. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 1541 [https://github.com/apache/spark/pull/1541] MapOutputTrackerWorker.mapStatuses should be thread-safe Key: SPARK-2634 URL: https://issues.apache.org/jira/browse/SPARK-2634 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Shixiong Zhu Labels: easyfix Fix For: 1.2.0 MapOutputTrackerWorker.mapStatuses will be used concurrently, so it should be a thread-safe Map.
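The fix class here is simple: a map read and written by multiple threads must be a concurrent map. As an illustrative sketch (types are simplified stand-ins, not Spark's actual `mapStatuses` signature), swapping `HashMap` for `ConcurrentHashMap` is the usual minimal change:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A plain HashMap mutated from fetch threads can corrupt its internal
// structure; ConcurrentHashMap makes put/get safe under concurrency.
public class ThreadSafeStatuses {
    private final Map<Integer, String> mapStatuses = new ConcurrentHashMap<>();

    public void update(int shuffleId, String status) { mapStatuses.put(shuffleId, status); }

    public String get(int shuffleId) { return mapStatuses.get(shuffleId); }

    public static void main(String[] args) throws InterruptedException {
        ThreadSafeStatuses t = new ThreadSafeStatuses();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) t.update(id * 1000 + j, "ok");
            });
            threads[i].start();
        }
        for (Thread th : threads) th.join();
        System.out.println(t.get(0)); // prints "ok"; no lost updates or corruption
    }
}
```

Note that `ConcurrentHashMap` only makes individual operations atomic; compound check-then-act sequences still need `putIfAbsent`/`compute` or external synchronization.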
[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148645#comment-14148645 ] Josh Rosen commented on SPARK-2546: --- JobConf has a _ton_ of methods and it's not clear whether we can get away with synchronizing only some of them. I'm going to look into using Scala macro annotations (http://docs.scala-lang.org/overviews/macros/annotations.html) to create a {{@synchronizeAll}} macro for adding synchronization to all methods of a class. Configuration object thread safety issue Key: SPARK-2546 URL: https://issues.apache.org/jira/browse/SPARK-2546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Reporter: Andrew Ash Assignee: Josh Rosen Priority: Critical // observed in 0.9.1 but expected to exist in 1.0.1 as well This ticket is copy-pasted from a thread on the dev@ list: {quote} We discovered a very interesting bug in Spark at work last week in Spark 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. Let me explain: Observations - Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files) - 412 of 413 tasks completed, but the last task was hung in RUNNING state - The 412 successful tasks completed in median time 3.4s - The last hung task didn't finish even in 20 hours - The executor with the hung task was responsible for 100% of one core of CPU usage - Jstack of the executor attached (relevant thread pasted below) Diagnosis After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark. 
The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.util.HashMap, which isn't threadsafe… The below post is an excellent explanation of what happens in the situation where multiple threads insert into a HashMap at the same time. http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html The gist is that you have a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace. So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object. Proposed fix We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus Configuration object). The optimization of broadcasting the Configuration object across the cluster can remain, but on the other side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB so I would expect a clone on the object to be relatively speedy. Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix. Thanks! Andrew P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines. 
{noformat} Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193) at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168) at
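The proposed fix above (clone the JobConf per task) can be sketched in miniature. This is a hedged illustration with a plain `Map` standing in for Hadoop's `Configuration`; the names (`getJobConf`, the config keys) mirror the discussion but are illustrative, not Spark's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Keep the broadcast optimization, but hand each task a defensive copy,
// so a record reader that mutates its config (as AvroRecordReader does)
// cannot race with other tasks sharing the executor JVM.
public class PerTaskConf {
    private final Map<String, String> broadcastConf = new HashMap<>();

    PerTaskConf() {
        broadcastConf.put("fs.defaultFS", "hdfs://namenode:8020"); // illustrative entry
    }

    // Called once per task; a Configuration is ~10KB, so copying is cheap.
    Map<String, String> getJobConf() {
        return new HashMap<>(broadcastConf);
    }

    public static void main(String[] args) {
        PerTaskConf shared = new PerTaskConf();
        Map<String, String> task1 = shared.getJobConf();
        Map<String, String> task2 = shared.getJobConf();
        task1.put("rpc.engine", "ProtobufRpcEngine"); // task-local mutation
        System.out.println(task2.containsKey("rpc.engine")); // prints false: no cross-task leak
    }
}
```

The trade-off is one copy per task versus eliminating the shared-mutable-state hazard entirely; since tasks mutate the config only rarely, copy-on-read is the simpler invariant to maintain than fine-grained locking.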
[jira] [Created] (SPARK-3695) Enable to show host and port in block fetch failure
Adrian Wang created SPARK-3695: -- Summary: Enable to show host and port in block fetch failure Key: SPARK-3695 URL: https://issues.apache.org/jira/browse/SPARK-3695 Project: Spark Issue Type: Bug Components: Input/Output Reporter: Adrian Wang Priority: Minor
[jira] [Commented] (SPARK-3695) Enable to show host and port in block fetch failure
[ https://issues.apache.org/jira/browse/SPARK-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148656#comment-14148656 ] Apache Spark commented on SPARK-3695: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/2539 Enable to show host and port in block fetch failure --- Key: SPARK-3695 URL: https://issues.apache.org/jira/browse/SPARK-3695 Project: Spark Issue Type: Bug Components: Input/Output Reporter: Adrian Wang Priority: Minor
[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148657#comment-14148657 ] Josh Rosen commented on SPARK-2546: --- A synchronization wrapper (whether written by hand or generated using macros) might introduce an unwanted runtime dependency on the exact compile-time version of Hadoop that we used. For example, say we compile against Hadoop 1.x and run on Hadoop 1.y (where y > x) and the runtime version of JobConf contains methods that were not present in the version that we wrapped at compile-time. What happens in this case? Before we explore this option, I should probably re-visit SPARK-2585 to see if I can understand why the patch seemed to introduce a performance regression, since that approach is Hadoop version agnostic. Configuration object thread safety issue Key: SPARK-2546 URL: https://issues.apache.org/jira/browse/SPARK-2546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1 Reporter: Andrew Ash Assignee: Josh Rosen Priority: Critical // observed in 0.9.1 but expected to exist in 1.0.1 as well This ticket is copy-pasted from a thread on the dev@ list: {quote} We discovered a very interesting bug in Spark at work last week in Spark 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. 
Let me explain: Observations - Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files) - 412 of 413 tasks completed, but the last task was hung in RUNNING state - The 412 successful tasks completed in median time 3.4s - The last hung task didn't finish even in 20 hours - The executor with the hung task was responsible for 100% of one core of CPU usage - Jstack of the executor attached (relevant thread pasted below) Diagnosis After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark. The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.util.HashMap, which isn't threadsafe… The below post is an excellent explanation of what happens in the situation where multiple threads insert into a HashMap at the same time. http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html The gist is that you have a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace. So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object. Proposed fix We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus Configuration object). 
The optimization of broadcasting the Configuration object across the cluster can remain, but on the other side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB so I would expect a clone on the object to be relatively speedy. Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix. Thanks! Andrew P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines. {noformat} Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) at
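The "{{@synchronizeAll}}" idea from the earlier comment (synchronize every method without enumerating them at compile time) can be illustrated on the JVM with a dynamic proxy. This is a hedged sketch, not the macro approach actually discussed: `java.lang.reflect.Proxy` only works for interfaces, and JobConf is a concrete class, which is exactly why a macro or a version-agnostic fix like per-task cloning was on the table. The `Conf`/`SimpleConf` names here are invented for the example:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

public class SynchronizeAll {
    // A toy stand-in for a configuration type; any interface works.
    interface Conf {
        void set(String key, String value);
        String get(String key);
    }

    static class SimpleConf implements Conf {
        private final Map<String, String> entries = new HashMap<>();
        public void set(String key, String value) { entries.put(key, value); }
        public String get(String key) { return entries.get(key); }
    }

    // Route every method of the interface through one lock, without listing
    // the methods; new methods added in a later version are covered too,
    // because dispatch is resolved reflectively at runtime.
    @SuppressWarnings("unchecked")
    static <T> T synchronizeAll(Class<T> iface, T target) {
        final Object lock = new Object();
        InvocationHandler handler = (proxy, method, args) -> {
            synchronized (lock) {
                return method.invoke(target, args);
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }

    public static void main(String[] args) {
        Conf conf = synchronizeAll(Conf.class, new SimpleConf());
        conf.set("rpc.engine", "x");
        System.out.println(conf.get("rpc.engine")); // prints x
    }
}
```

A compile-time wrapper (hand-written or macro-generated) would instead pin the wrapped method set to the compile-time Hadoop version, which is the version-skew hazard the comment raises.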
[jira] [Commented] (SPARK-2532) Fix issues with consolidated shuffle
[ https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148666#comment-14148666 ] Andrew Ash commented on SPARK-2532: --- [~pwendell] should we close this ticket and track the individual items separately? It sounds like we should expect consolidated shuffle to work in 1.1 and any issues should have separate tickets filed for them. I know [~mridulm80] has several fixes on his branch that should be cherry picked over at some point as well though. Fix issues with consolidated shuffle Key: SPARK-2532 URL: https://issues.apache.org/jira/browse/SPARK-2532 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.1.0 Environment: All Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical Will file PR with changes as soon as merge is done (earlier merge became outdated in 2 weeks unfortunately :) ). Consolidated shuffle is broken in multiple ways in spark : a) Task failure(s) can cause the state to become inconsistent. b) Multiple revert's or combination of close/revert/close can cause the state to be inconsistent. (As part of exception/error handling). c) Some of the api in block writer causes implementation issues - for example: a revert is always followed by close : but the implementation tries to keep them separate, resulting in surface for errors. d) Fetching data from consolidated shuffle files can go badly wrong if the file is being actively written to : it computes length by subtracting next offset from current offset (or length if this is last offset)- the latter fails when fetch is happening in parallel to write. Note, this happens even if there are no task failures of any kind ! This usually results in stream corruption or decompression errors.
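Item (d) above is worth making concrete. In consolidated shuffle, each map task's output is a segment of one shared file, and a segment's length is derived from the offset index; the last segment's length is derived from the current file length. As a hedged sketch (simplified data structures, not Spark's actual shuffle code), here is why a concurrent append corrupts the read:

```java
import java.util.ArrayList;
import java.util.List;

// One offset per map task in a shared consolidated file. Segment i's length
// is offsets[i+1] - offsets[i], or fileLength - offsets[i] for the last
// segment. Reading the *current* file length while a writer is still
// appending makes the last segment appear longer than it really is.
public class SegmentIndex {
    final List<Long> offsets = new ArrayList<>();
    long fileLength;

    long segmentLength(int i) {
        long end = (i + 1 < offsets.size()) ? offsets.get(i + 1) : fileLength;
        return end - offsets.get(i);
    }

    public static void main(String[] args) {
        SegmentIndex idx = new SegmentIndex();
        idx.offsets.add(0L);
        idx.offsets.add(100L);
        idx.fileLength = 250L;
        System.out.println(idx.segmentLength(1)); // prints 150: correct once writes finish

        idx.fileLength = 300L; // a concurrent append is in progress...
        System.out.println(idx.segmentLength(1)); // prints 200: the reader would over-read,
        // pulling half-written bytes -> stream corruption / decompression errors
    }
}
```

This matches the ticket's note that the failure needs no task failures at all: a fetch merely overlapping a write is enough.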
[jira] [Updated] (SPARK-3688) LogicalPlan can't resolve column correctly
[ https://issues.apache.org/jira/browse/SPARK-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-3688: --- Description: How to reproduce this problem: create a table: {code} create table test (a string, b string); {code} execute sql: {code} select a.b ,count(1) from test a join test t group by a.b; {code} was: How to reproduce this problem: create a table: {quote} create table test (a string, b string); {quote} execute sql: {quote} select a.b ,count(1) from test a join test t group by a.b; {quote} LogicalPlan can't resolve column correctly --- Key: SPARK-3688 URL: https://issues.apache.org/jira/browse/SPARK-3688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yi Tian How to reproduce this problem: create a table: {code} create table test (a string, b string); {code} execute sql: {code} select a.b ,count(1) from test a join test t group by a.b; {code}
[jira] [Resolved] (SPARK-3686) flume.SparkSinkSuite.Success is flaky
[ https://issues.apache.org/jira/browse/SPARK-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3686. Resolution: Fixed Resolved by: https://github.com/apache/spark/pull/2531 flume.SparkSinkSuite.Success is flaky - Key: SPARK-3686 URL: https://issues.apache.org/jira/browse/SPARK-3686 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Hari Shreedharan Priority: Blocker {code} Error Message 4000 did not equal 5000 Stacktrace sbt.ForkMain$ForkError: 4000 did not equal 5000 at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:498) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1559) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:416) at org.apache.spark.streaming.flume.sink.SparkSinkSuite.org$apache$spark$streaming$flume$sink$SparkSinkSuite$$assertChannelIsEmpty(SparkSinkSuite.scala:195) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply$mcV$sp(SparkSinkSuite.scala:54) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$1.apply(SparkSinkSuite.scala:40) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) at org.scalatest.Suite$class.withFixture(Suite.scala:1121) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200) at org.scalatest.FunSuite.runTests(FunSuite.scala:1559) at org.scalatest.Suite$class.run(Suite.scala:1423) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204) at org.scalatest.FunSuite.run(FunSuite.scala:1559) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:444) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:651) at sbt.ForkMain$Run$2.call(ForkMain.java:294) at sbt.ForkMain$Run$2.call(ForkMain.java:284) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Example test result 
(this will stop working in a few days): https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/719/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming.flume.sink/SparkSinkSuite/Success_with_ack/