[jira] [Closed] (SPARK-5232) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-13 Thread Jimmy Xiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jimmy Xiang closed SPARK-5232.
--
Resolution: Invalid

Wrong project.

 CombineFileInputFormatShim#getDirIndices is expensive
 -

 Key: SPARK-5232
 URL: https://issues.apache.org/jira/browse/SPARK-5232
 Project: Spark
  Issue Type: Improvement
Reporter: Jimmy Xiang

 [~lirui] found out that we spend quite some time in 
 CombineFileInputFormatShim#getDirIndices. I looked into it, and it seems to me 
 we should be able to get rid of this method completely if we enhance 
 CombineFileInputFormatShim a little.






[jira] [Resolved] (SPARK-5123) Stabilize Spark SQL data type API

2015-01-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5123.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Stabilize Spark SQL data type API
 -

 Key: SPARK-5123
 URL: https://issues.apache.org/jira/browse/SPARK-5123
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.3.0


 Having two versions of the data type APIs (one for Java, one for Scala) 
 requires downstream libraries to also have two versions of the APIs if the 
 library wants to support both Java and Scala. I took a look at the Scala 
 version of the data type APIs - it can actually work out pretty well for Java 
 out of the box. 
 The proposal is to move Spark SQL data type definitions from 
 org.apache.spark.sql.catalyst.types into org.apache.spark.sql.types, and make 
 the existing Scala type API usable in Java.
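
For illustration, here is a minimal sketch of what downstream code could look like under such a unified package layout (a hypothetical usage example based on the proposal above, not taken from the actual patch; the type names shown are the familiar Spark SQL ones):

{code}
// One set of imports usable from both Scala and Java callers
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
{code}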






[jira] [Commented] (SPARK-5220) keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver

2015-01-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276350#comment-14276350
 ] 

Saisai Shao commented on SPARK-5220:


Hi Max, as I said in the mail, this is expected behavior of the receiver and 
block generator because of the locking mechanism in BlockGenerator. The receiver 
blocks on the lock when adding data into BlockGenerator, and BlockGenerator is 
waiting for the pushing thread to put the data into HDFS and the BlockManager. 
Given the mismatched speeds, this is expected from my understanding.
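
To make the hand-off concrete, here is a generic illustration of the back-pressure being described, using plain java.util.concurrent primitives rather than BlockGenerator's actual internals (the queue size is an arbitrary assumption):

{code}
import java.util.concurrent.ArrayBlockingQueue

object BackPressureSketch {
  // The receiver acts as the producer, the block pushing thread as the consumer.
  val blocksForPushing = new ArrayBlockingQueue[Array[Byte]](10)

  // Receiver side: put() blocks once the queue is full, i.e. once the pushing
  // thread falls behind (or has died after an exception).
  def addBlock(block: Array[Byte]): Unit = blocksForPushing.put(block)

  // Pushing-thread side: if this loop ever terminates, addBlock() above will
  // eventually block forever.
  def keepPushing(store: Array[Byte] => Unit): Unit =
    while (true) store(blocksForPushing.take())
}
{code}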

 keepPushingBlocks in BlockGenerator terminated when an exception occurs, 
 which causes the block pushing thread to terminate and blocks receiver  
 -

 Key: SPARK-5220
 URL: https://issues.apache.org/jira/browse/SPARK-5220
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Max Xu

 I am running a Spark streaming application with ReliableKafkaReceiver. It 
 uses BlockGenerator to push blocks to BlockManager. However, writing WALs to 
 HDFS may time out, which causes keepPushingBlocks in BlockGenerator to 
 terminate.
 15/01/12 19:07:06 ERROR receiver.BlockGenerator: Error in block pushing thread
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at 
 org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:176)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:160)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:126)
 at 
 org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:124)
 at 
 org.apache.spark.streaming.kafka.ReliableKafkaReceiver.org$apache$spark$streaming$kafka$ReliableKafkaReceiver$$storeBlockAndCommitOffset(ReliableKafkaReceiver.scala:207)
 at 
 org.apache.spark.streaming.kafka.ReliableKafkaReceiver$GeneratedBlockHandler.onPushBlock(ReliableKafkaReceiver.scala:275)
 at 
 org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:181)
 at 
 org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:154)
 at 
 org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:86)
 Then the block pushing thread is done and no subsequent blocks can be pushed 
 into the BlockManager, which in turn blocks the receiver from receiving new data.
 So when the TimeoutException happens while my app is running, the 
 ReliableKafkaReceiver stays in ACTIVE status but doesn't do anything at all; 
 the application effectively hangs.






[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-13 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276380#comment-14276380
 ] 

RJ Nowling commented on SPARK-4894:
---

Hi @lmcguire,

Always happy to have more help! :)

I started looking through the Spark NB functions but I haven't started writing 
code yet.  The docs for NB mention that using binary features will cause the 
multinomial NB to act like Bernoulli NB.  I don't believe the documentation is 
correct, at least when smoothing is used, since P(0) != 1 - P(1). I was 
planning on comparing the sklearn implementation with the Spark implementation 
and showing that the docs were wrong.  Once verified, I think the changes will 
be very small to add a Bernoulli mode controlled by a flag in the constructor.

I won't get to this until next week, though.  If you have time now and want to 
tackle this, I'd be happy to hand it over to you and review any patches.  (I'm 
not a committer, though -- [~mengxr] would have to sign off.) Otherwise, if 
you want to wait until I have a patch and test it, that could work, too.  What 
do you think?
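
For reference, here is the usual textbook formulation of the two likelihoods (a sketch of the standard models, not of MLlib's code): an absent feature drops out of the multinomial product entirely, while it contributes an explicit (1 - p) factor in the Bernoulli model, so binarizing the features does not make the two interchangeable in general.

{code}
P_mult(x | c) \propto \prod_i \theta_{c,i}^{x_i}                    (x_i = 0 contributes nothing)
P_bern(x | c) = \prod_i p_{c,i}^{x_i} (1 - p_{c,i})^{1 - x_i}       (x_i = 0 contributes 1 - p_{c,i})
{code}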

 Add Bernoulli-variant of Naive Bayes
 

 Key: SPARK-4894
 URL: https://issues.apache.org/jira/browse/SPARK-4894
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.1
Reporter: RJ Nowling

 MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
 version of Naive Bayes is more useful for situations where the features are 
 binary values.






[jira] [Updated] (SPARK-1805) Error launching cluster when master and slaves machines are of different visualization types

2015-01-13 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-1805:

Issue Type: Bug  (was: Improvement)

 Error launching cluster when master and slaves machines are of different 
 visualization types
 

 Key: SPARK-1805
 URL: https://issues.apache.org/jira/browse/SPARK-1805
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Han JU
Priority: Minor

 In current EC2 script, the AMI image object is loaded only once. This is ok 
 when master and slave machines are of the same visualization type (pvm or 
 hvm). But this won't work if, say, master is pvm and slaves are hvm since the 
 AMI is not compatible between these two kinds of visualization. 






[jira] [Updated] (SPARK-1805) Error launching cluster when master and slave machines are of different virtualization types

2015-01-13 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-1805:

 Description: In the current EC2 script, the AMI image object is loaded 
only once. This is OK when the master and slave machines are of the same 
virtualization type (pv or hvm). But this won't work if, say, the master is pv 
and the slaves are hvm since the AMI is not compatible across these two kinds 
of virtualization.  (was: In current EC2 script, the AMI image object is loaded 
only once. This is ok when master and slave machines are of the same 
visualization type (pvm or hvm). But this won't work if, say, master is pvm and 
slaves are hvm since the AMI is not compatible between these two kinds of 
visualization. )
Target Version/s: 1.3.0
 Summary: Error launching cluster when master and slave machines 
are of different virtualization types  (was: Error launching cluster when 
master and slaves machines are of different visualization types)

 Error launching cluster when master and slave machines are of different 
 virtualization types
 

 Key: SPARK-1805
 URL: https://issues.apache.org/jira/browse/SPARK-1805
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.1, 1.2.0
Reporter: Han JU
Priority: Minor

 In the current EC2 script, the AMI image object is loaded only once. This is 
 OK when the master and slave machines are of the same virtualization type (pv 
 or hvm). But this won't work if, say, the master is pv and the slaves are hvm 
 since the AMI is not compatible across these two kinds of virtualization.






[jira] [Created] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf

2015-01-13 Thread Alex Baretta (JIRA)
Alex Baretta created SPARK-5235:
---

 Summary: java.io.NotSerializableException: 
org.apache.spark.sql.SQLConf
 Key: SPARK-5235
 URL: https://issues.apache.org/jira/browse/SPARK-5235
 Project: Spark
  Issue Type: Bug
Reporter: Alex Baretta


The SQLConf field in SQLContext is neither Serializable nor transient. Here's 
the stack trace I get when running SQL queries against a Parquet file.

Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted 
due to stage failure: Task not serializable: java.io.NotSerializableException: 
org.apache.spark.sql.SQLConf
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
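
One direction, sketched below with hypothetical stand-in classes (this is only an illustration of the idea, not necessarily the fix that was ultimately applied): a field that tasks do not need can be marked @transient so it is skipped when the enclosing object is pulled into a serialized closure, and recreated lazily on access instead of having to implement Serializable.

{code}
// Hypothetical stand-in classes, not Spark's code.
class MySQLConf {                      // stand-in for the non-serializable conf
  var codegenEnabled: Boolean = false
}

class MyContext extends Serializable {
  @transient private lazy val conf = new MySQLConf  // not serialized with closures

  def isCodegenEnabled: Boolean = conf.codegenEnabled
}
{code}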






[jira] [Commented] (SPARK-3678) Yarn app name reported in RM is different between cluster and client mode

2015-01-13 Thread WangTaoTheTonic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276429#comment-14276429
 ] 

WangTaoTheTonic commented on SPARK-3678:


In SparkHdfsLR there is {quote}val sparkConf = new 
SparkConf().setAppName("SparkHdfsLR"){quote}.

In client mode, the registration with YARN happens in YarnClientSchedulerBackend, 
which runs after the setAppName above, while in cluster mode the registration 
happens in yarn.Client, which runs before it.

So it is the registration order that makes the difference.
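
A small sketch of that ordering (the object name here is hypothetical, and the submit-time workaround at the end is only an illustration, not the fix):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object SparkHdfsLRLike {
  def main(args: Array[String]): Unit = {
    // In cluster mode yarn.Client has already registered the application with
    // the RM before this main() runs, so the RM falls back to the class name;
    // in client mode the name below is set before YarnClientSchedulerBackend registers.
    val sparkConf = new SparkConf().setAppName("SparkHdfsLR")
    val sc = new SparkContext(sparkConf)
    // ... job body ...
    sc.stop()
  }
}
// Workaround sketch: pass the name at submission time so it is known before the
// AM is submitted, e.g. spark-submit --name SparkHdfsLR --class ... (assumes the
// standard spark-submit --name option).
{code}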

 Yarn app name reported in RM is different between cluster and client mode
 -

 Key: SPARK-3678
 URL: https://issues.apache.org/jira/browse/SPARK-3678
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves

 If you launch an application in yarn cluster mode the name of the application 
 in the ResourceManager generally shows up as the full name 
 org.apache.spark.examples.SparkHdfsLR.  If you start the same app in client 
 mode it shows up as SparkHdfsLR.
 We should be consistent between them.  
 I haven't looked at it in detail; perhaps it's only the examples, but I think 
 I've seen this with customer apps also.






[jira] [Created] (SPARK-5234) examples for ml don't have sparkContext.stop

2015-01-13 Thread yuhao yang (JIRA)
yuhao yang created SPARK-5234:
-

 Summary: examples for ml don't have sparkContext.stop
 Key: SPARK-5234
 URL: https://issues.apache.org/jira/browse/SPARK-5234
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
 Environment: all
Reporter: yuhao yang
Priority: Trivial
 Fix For: 1.3.0


Not sure why sc.stop() is not called in the 
org.apache.spark.examples.ml examples {CrossValidatorExample, SimpleParamsExample, 
SimpleTextClassificationPipeline}. 

I can prepare a PR if omitting the call to stop is not intentional.
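
For what it's worth, the change being suggested is just the usual pattern below (a minimal hypothetical skeleton, not the contents of the actual example files):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object SimpleParamsExampleLike {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleParamsExample"))
    // ... build the ML pipeline, fit and transform ...
    sc.stop()  // release cluster resources once the example finishes
  }
}
{code}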






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276471#comment-14276471
 ] 

Nicholas Chammas commented on SPARK-3821:
-

[~shivaram] Are we ready to open a PR against {{mesos/spark-ec2}} and start a 
review discussion there?

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.






[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2015-01-13 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276436#comment-14276436
 ] 

Florian Verhein commented on SPARK-3185:


I'm also getting this, though with Server IPC version 9 now that I'm using 
hadoop 2.4.1 (modification of the various hadoop init.sh scripts). I'm also 
using spark 1.2.0.

My understanding is that spark-1.2.0-bin-hadoop2.4.tgz is built against hadoop 
2.4 and tachyon 0.4.1. 
But I suspect the tachyon 0.4.1 that is installed in the spark-ec2 scripts is 
built against hadoop 1...

Does this mean building tachyon against hadoop 2.4.1 would fix this?

 SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting 
 JOURNAL_FOLDER
 ---

 Key: SPARK-3185
 URL: https://issues.apache.org/jira/browse/SPARK-3185
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Amazon Linux AMI
 [ec2-user@ip-172-30-1-145 ~]$ uname -a
 Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
 The build I used (and MD5 verified):
 [ec2-user@ip-172-30-1-145 ~]$ wget 
 http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
Reporter: Jeremy Chambers

 {code}
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 {code}
 When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon 
 exception is thrown when Formatting JOURNAL_FOLDER.
 No exception occurs when I launch on Hadoop 1.
 Launch used:
 {code}
 ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
 --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
 sparkProd
 {code}
 {code}
 log snippet
 Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
 Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
 Exception in thread main java.lang.RuntimeException: 
 org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
 at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
 at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
 at tachyon.Format.main(Format.java:54)
 Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
 communicate with client version 4
 at org.apache.hadoop.ipc.Client.call(Client.java:1070)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
 at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
 at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
 at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
 at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
 at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
 ... 3 more
 Killed 0 processes
 Killed 0 processes
 ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
 ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
 ---end snippet---
 {code}
 *I don't have this problem when I launch without the 
 --hadoop-major-version=2 (which defaults to Hadoop 1.x).*






[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276505#comment-14276505
 ] 

Shivaram Venkataraman commented on SPARK-3821:
--

[~nchammas] Yes -- That sounds good

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.






[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276507#comment-14276507
 ] 

Apache Spark commented on SPARK-5235:
-

User 'alexbaretta' has created a pull request for this issue:
https://github.com/apache/spark/pull/4031

 java.io.NotSerializableException: org.apache.spark.sql.SQLConf
 --

 Key: SPARK-5235
 URL: https://issues.apache.org/jira/browse/SPARK-5235
 Project: Spark
  Issue Type: Bug
Reporter: Alex Baretta

 The SQLConf field in SQLContext is neither Serializable nor transient. Here's 
 the stack trace I get when running SQL queries against a Parquet file.
 Exception in thread Thread-43 org.apache.spark.SparkException: Job aborted 
 due to stage failure: Task not serializable: 
 java.io.NotSerializableException: org.apache.spark.sql.SQLConf
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
 at akka.dispatch.Mailbox.run(Mailbox.scala:220)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)






[jira] [Updated] (SPARK-5233) Error replay of WAL when recovered from driver failure

2015-01-13 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-5233:
---
Description: 
Spark Streaming writes all the events into the WAL for driver recovery; the 
sequence in the WAL may look like this:

{code}

BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 
BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- 
BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- 
BlockAdditionEvent --- BatchAllocationEvent

{code}

When the driver recovers from failure, it replays all the existing metadata WAL 
to get the right status. In this situation, the two BlockAdditionEvents from before 
the failure are put into the received block queue. After the driver has started, 
newly incoming blocks are also put into this queue, and a follow-up 
BatchAllocationEvent allocates an allocatedBlocks by draining the queue. So old 
data that does not belong to this batch also gets mixed into this batch; here is 
the partial log:

{code}
15/01/13 17:19:10 INFO KafkaInputDStream: block store result for batch 142114075 ms

15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-1421140653201,46704,480)
197757 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-1421140653201,47188,480)
197758 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-1421140653201,47672,480)
197759 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-1421140653201,48156,480)
197760 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-1421140653201,48640,480)
197761 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-1421140653201,49124,480)
197762 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-1421140807074,0,44184)
197763 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-1421140807074,44188,58536)
197764 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-1421140807074,102728,60168)
197765 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-1421140807074,162900,64584)
197766 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: WriteAheadLogFileSegment(file: /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-1421140807074,227488,51240)
{code}

The old log 
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-1421140653201 
is obviously far older than the current batch interval, and it will be fetched 
again and added for processing.

This issue is subtle, because in the previous code we never deleted the old 
received-data WAL. As far as I know, this will lead to unwanted results.

Basically, this happens because we miss some BatchAllocationEvents when recovering 
from failure. I think we need to replay and insert all the events correctly.

  was:
Spark Streaming will write all the event into WAL for driver recovery, the 
sequence in the WAL may be like this:

{code}

BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 
BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- 
BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- 
BlockAdditionEvent --- BatchAllocationEvent

{code}

When driver recovered from failure, it will replay all the existed metadata WAL 
to get the right status, in this situation, two BatchAdditionEvent before down 
will put into received block queue. After driver started, new incoming blocking 
will also put into this queue and a follow-up BlockAllocationEvent will 
allocate an allocatedBlocks with queue draining out. So old, not this 

[jira] [Created] (SPARK-5236) parquet.io.ParquetDecodingException: Can not read value at 0 in block 0

2015-01-13 Thread Alex Baretta (JIRA)
Alex Baretta created SPARK-5236:
---

 Summary: parquet.io.ParquetDecodingException: Can not read value 
at 0 in block 0
 Key: SPARK-5236
 URL: https://issues.apache.org/jira/browse/SPARK-5236
 Project: Spark
  Issue Type: Bug
Reporter: Alex Baretta


15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 
(TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value at 
0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableInt
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
at 
org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375)
at 
org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434)
at 
parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237)
at 
parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353)
at 
parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
... 27 more







[jira] [Commented] (SPARK-5233) Error replay of WAL when recovered from driver failure

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276513#comment-14276513
 ] 

Apache Spark commented on SPARK-5233:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/4032

 Error replay of WAL when recovered from driver failure
 -

 Key: SPARK-5233
 URL: https://issues.apache.org/jira/browse/SPARK-5233
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Saisai Shao

 Spark Streaming will write all the event into WAL for driver recovery, the 
 sequence in the WAL may be like this:
 {code}
 BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 
 BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- 
 BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- 
 BlockAdditionEvent --- BatchAllocationEvent
 {code}
 When driver recovered from failure, it will replay all the existed metadata 
 WAL to get the right status, in this situation, two BatchAdditionEvent before 
 down will put into received block queue. After driver started, new incoming 
 blocking will also put into this queue and a follow-up BlockAllocationEvent 
 will allocate an allocatedBlocks with queue draining out. So old, not this 
 batch's data will also mix into this batch, here is the partial log:
 {code}
 15/01/13 17:19:10 INFO KafkaInputDStream: block store result for 
 batch 142114075 ms
 
 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,46704,480)
 197757 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,47188,480)
 197758 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,47672,480)
 197759 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,48156,480)   

 197760 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,48640,480)
 197761 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,49124,480)
 197762 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,0,44184)
 197763 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,44188,58536)
 197764 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,102728,60168)
 197765 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,162900,64584)
 197766 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,227488,51240)
 {code}
 The old log 
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
  is obviously far older than current batch interval, and will fetch again to 
 add to process.
 This issue is subtle, because in the previous code we never delete the old 
 received data WAL. This will lead to unwanted result as I know.
 Basically because we miss some BlockAllocationEvent when recovered from 
 failure. I think we need to correctly replay and insert all the events 
 correctly.




[jira] [Created] (SPARK-5237) UDTFs don't work on Spark SQL

2015-01-13 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-5237:
--

 Summary: UDTFs don't work on Spark SQL
 Key: SPARK-5237
 URL: https://issues.apache.org/jira/browse/SPARK-5237
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou









[jira] [Updated] (SPARK-5233) Error replay of WAL when recovered from driver failure

2015-01-13 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-5233:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5238

 Error replay of WAL when recovered from driver failure
 -

 Key: SPARK-5233
 URL: https://issues.apache.org/jira/browse/SPARK-5233
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Saisai Shao

 Spark Streaming will write all the event into WAL for driver recovery, the 
 sequence in the WAL may be like this:
 {code}
 BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 
 BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- 
 BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- 
 BlockAdditionEvent --- BatchAllocationEvent
 {code}
 When driver recovered from failure, it will replay all the existed metadata 
 WAL to get the right status, in this situation, two BatchAdditionEvent before 
 down will put into received block queue. After driver started, new incoming 
 blocking will also put into this queue and a follow-up BlockAllocationEvent 
 will allocate an allocatedBlocks with queue draining out. So old, not this 
 batch's data will also mix into this batch, here is the partial log:
 {code}
 15/01/13 17:19:10 INFO KafkaInputDStream: block store result for 
 batch 142114075 ms
 
 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,46704,480)
 197757 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,47188,480)
 197758 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,47672,480)
 197759 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,48156,480)   

 197760 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,48640,480)
 197761 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
53201,49124,480)
 197762 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,0,44184)
 197763 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,44188,58536)
 197764 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,102728,60168)
 197765 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,162900,64584)
 197766 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
 WriteAheadLogFileSegment(file:   
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
07074,227488,51240)
 {code}
 The old log 
 /home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
  is obviously far older than current batch interval, and will fetch again to 
 add to process.
 This issue is subtle, because in the previous code we never delete the old 
 received data WAL. This will lead to unwanted result as I know.
 Basically because we miss some BlockAllocationEvent when recovered from 
 failure. I think we need to correctly replay and insert all the events 
 correctly.






[jira] [Updated] (SPARK-5237) UDTFs don't work on Spark SQL

2015-01-13 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-5237:
---
Description: Hive queries with UDTFs don't work on Spark SQL

15/01/14 13:23:50 INFO ParseDriver: Parse Completed
15/01/14 13:23:50 WARN HiveConf: DEPRECATED: Configuration property 
hive.metastore.local no longer has any effect. Make sure to provide a valid 
value for hive.metastore.uris if you are connecting to a remote metastore.
15/01/14 13:23:50 INFO ParseDriver: Parsing command: INSERT INTO TABLE 
q10_spark_RUN_QUERY_0_result
SELECT extract_sentiment(pr_item_sk,pr_review_content) AS (pr_item_sk, 
review_sentence, sentiment, sentiment_word)
FROM product_reviews
15/01/14 13:23:50 INFO ParseDriver: Parse Completed
15/01/14 13:23:50 ERROR SparkSQLDriver: Failed in [

INSERT INTO TABLE ${hiveconf:RESULT_TABLE}
SELECT extract_sentiment(pr_item_sk,pr_review_content) AS (pr_item_sk, 
review_sentence, sentiment, sentiment_word)
FROM product_reviews
]
java.lang.RuntimeException:
Unsupported language features in query: INSERT INTO TABLE 
q10_spark_RUN_QUERY_0_result
SELECT extract_sentiment(pr_item_sk,pr_review_content) AS (pr_item_sk, 
review_sentence, sentiment, sentiment_word)
FROM product_reviews
TOK_QUERY
  TOK_FROM
TOK_TABREF
  TOK_TABNAME
product_reviews
  TOK_INSERT
TOK_INSERT_INTO
  TOK_TAB
TOK_TABNAME
  q10_spark_RUN_QUERY_0_result
TOK_SELECT
  TOK_SELEXPR
TOK_FUNCTION
  extract_sentiment
  TOK_TABLE_OR_COL
pr_item_sk
  TOK_TABLE_OR_COL
pr_review_content
pr_item_sk
review_sentence
sentiment
sentiment_word

scala.NotImplementedError: No parse rules for:
 TOK_SELEXPR
  TOK_FUNCTION
extract_sentiment
TOK_TABLE_OR_COL
  pr_item_sk
TOK_TABLE_OR_COL
  pr_review_content
  pr_item_sk
  review_sentence
  sentiment
  sentiment_word

org.apache.spark.sql.hive.HiveQl$.selExprNodeToExpr(HiveQl.scala:862)

at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:251)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:133)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:133)
at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
at 
org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 

[jira] [Updated] (SPARK-5238) Improve the robustness of Spark Streaming WAL mechanism

2015-01-13 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-5238:
---
Description: 
Several issues have been identified in Spark Streaming's WAL mechanism; this is 
an umbrella for all the related issues.


 Improve the robustness of Spark Streaming WAL mechanism
 ---

 Key: SPARK-5238
 URL: https://issues.apache.org/jira/browse/SPARK-5238
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Saisai Shao

 Several issues have been identified in Spark Streaming's WAL mechanism; this 
 is an umbrella for all the related issues.






[jira] [Created] (SPARK-5238) Improve the robustness of Spark Streaming WAL mechanism

2015-01-13 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-5238:
--

 Summary: Improve the robustness of Spark Streaming WAL mechanism
 Key: SPARK-5238
 URL: https://issues.apache.org/jira/browse/SPARK-5238
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Saisai Shao









[jira] [Updated] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called

2015-01-13 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-5147:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5238

 write ahead logs from streaming receiver are not purged because 
 cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
 --

 Key: SPARK-5147
 URL: https://issues.apache.org/jira/browse/SPARK-5147
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Max Xu
Priority: Blocker

 Hi all,
 We are running a Spark streaming application with ReliableKafkaReceiver. We 
 have spark.streaming.receiver.writeAheadLog.enable set to true so write 
 ahead logs (WALs) for received data are created under receivedData/streamId 
 folder in the checkpoint directory. 
 However, old WALs are never purged over time. receivedBlockMetadata and 
 checkpoint files are purged correctly, though. I went through the code: the 
 WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is 
 responsible for cleaning up the old blocks. It has the method cleanupOldBlocks, 
 which is never called by any class. The ReceiverSupervisorImpl class holds a 
 WriteAheadLogBasedBlockHandler instance; however, it only calls the storeBlock 
 method to create WALs and never calls the cleanupOldBlocks method to purge old 
 WALs.
 The size of the WAL folder increases constantly on HDFS. This is preventing 
 us from running the ReliableKafkaReceiver 24x7. Could somebody please take a 
 look?
 Thanks,
 Max






[jira] [Updated] (SPARK-5142) Possibly data may be ruined in Spark Streaming's WAL mechanism.

2015-01-13 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-5142:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5238

 Possibly data may be ruined in Spark Streaming's WAL mechanism.
 ---

 Key: SPARK-5142
 URL: https://issues.apache.org/jira/browse/SPARK-5142
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Saisai Shao

 Currently in Spark Streaming's WAL manager, data is written into HDFS 
 with multiple tries when a failure is encountered. Because there is no 
 transactional guarantee, the previously partially-written data is not rolled 
 back and the retried data is appended after it; this ruins the file and makes 
 WriteAheadLogReader fail when reading the data.
 Firstly, I think this problem is hard to fix because HDFS does not support a 
 truncate operation (HDFS-3107) or random writes at a specific offset.
 Secondly, I think if we hit such a write exception, it is better not to try 
 again; retrying will ruin the file and make reads abnormal.
 Sorry if I misunderstand anything.
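
A tiny generic sketch of the "fail fast instead of retrying the append" idea raised above (plain Scala for illustration, not Spark's WriteAheadLogManager):

{code}
// If an append fails partway through, retrying into the same file would place
// the retried record after the partially written bytes and corrupt the segment
// for the reader, so the write is attempted only once here.
def writeOnce[T](attempt: => T): Option[T] =
  try Some(attempt)
  catch {
    case e: java.io.IOException =>
      None  // surface the failure to the caller instead of appending again
  }
{code}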






[jira] [Created] (SPARK-5239) JdbcRDD throws java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z

2015-01-13 Thread Gankun Luo (JIRA)
Gankun Luo created SPARK-5239:
-

 Summary: JdbcRDD throws java.lang.AbstractMethodError: 
oracle.jdbc.driver.xx.isClosed()Z
 Key: SPARK-5239
 URL: https://issues.apache.org/jira/browse/SPARK-5239
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.1.1
 Environment: centos6.4 + ojdbc14
Reporter: Gankun Luo
Priority: Minor


I tried to use JdbcRDD to operate on a table in an Oracle database, but it failed. 
My test code is as follows:

{code}
import java.sql.DriverManager
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.SparkConf

object JdbcRDD4Oracle {
  def main(args: Array[String]) {
    val sc = new SparkContext(
      new SparkConf().setAppName("JdbcRDD4Oracle").setMaster("local[2]"))
    val rdd = new JdbcRDD(sc,
      () => getConnection, getSQL, 12987, 13055, 3,
      r => {
        (r.getObject("HISTORY_ID"), r.getObject("APPROVE_OPINION"))
      })
    println(rdd.collect.toList)

    sc.stop()
  }

  def getConnection() = {
    Class.forName("oracle.jdbc.driver.OracleDriver").newInstance()
    DriverManager.getConnection("jdbc:oracle:thin:@hadoop000:1521/ORCL",
      "scott", "tiger")
  }

  def getSQL() = {
    "select HISTORY_ID,APPROVE_OPINION from CI_APPROVE_HISTORY WHERE HISTORY_ID >= ? AND HISTORY_ID <= ?"
  }
}
{code}

Run the example, I get the following exception:
{code}
09:56:48,302 [Executor task launch worker-0] ERROR Logging$class : Error in 
TaskCompletionListener
java.lang.AbstractMethodError: 
oracle.jdbc.driver.OracleResultSetImpl.isClosed()Z
at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala:99)
at 
org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
at 
org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71)
at 
org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71)
at 
org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:85)
at 
org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:110)
at 
org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:108)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContext.markTaskCompleted(TaskContext.scala:108)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:64)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
09:56:48,302 [Executor task launch worker-1] ERROR Logging$class : Error in 
TaskCompletionListener
{code}
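
One way to sidestep the error, sketched here only as a workaround idea rather than the JdbcRDD patch: ojdbc14 implements JDBC 3, which has no ResultSet.isClosed(), hence the AbstractMethodError, so avoid probing isClosed() and just close defensively.

{code}
// Hedged workaround sketch: close without calling isClosed(), swallowing the
// "already closed" case instead.
def closeQuietly(rs: java.sql.ResultSet): Unit =
  if (rs != null) {
    try rs.close()                 // close() exists in JDBC 3 drivers
    catch { case _: Exception => } // ignore double-close / driver quirks
  }
{code}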






[jira] [Commented] (SPARK-5239) JdbcRDD throws java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276540#comment-14276540
 ] 

Apache Spark commented on SPARK-5239:
-

User 'luogankun' has created a pull request for this issue:
https://github.com/apache/spark/pull/4033

 JdbcRDD throws java.lang.AbstractMethodError: 
 oracle.jdbc.driver.xx.isClosed()Z
 -

 Key: SPARK-5239
 URL: https://issues.apache.org/jira/browse/SPARK-5239
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1, 1.2.0
 Environment: centos6.4 + ojdbc14
Reporter: Gankun Luo
Priority: Minor

 I tried to use JdbcRDD to operate on a table in an Oracle database, but failed. My
 test code is as follows:
 {code}
 import java.sql.DriverManager
 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.JdbcRDD
 import org.apache.spark.SparkConf

 object JdbcRDD4Oracle {
   def main(args: Array[String]) {
     val sc = new SparkContext(new
       SparkConf().setAppName("JdbcRDD4Oracle").setMaster("local[2]"))
     val rdd = new JdbcRDD(sc,
       () => getConnection, getSQL, 12987, 13055, 3,
       r => {
         (r.getObject("HISTORY_ID"), r.getObject("APPROVE_OPINION"))
       })
     println(rdd.collect.toList)

     sc.stop()
   }

   def getConnection() = {
     Class.forName("oracle.jdbc.driver.OracleDriver").newInstance()
     DriverManager.getConnection("jdbc:oracle:thin:@hadoop000:1521/ORCL", "scott", "tiger")
   }

   def getSQL() = {
     "select HISTORY_ID,APPROVE_OPINION from CI_APPROVE_HISTORY WHERE HISTORY_ID >=? AND HISTORY_ID <=?"
   }
 }
 {code}
 Run the example and I get the following exception:
 {code}
 09:56:48,302 [Executor task launch worker-0] ERROR Logging$class : Error in 
 TaskCompletionListener
 java.lang.AbstractMethodError: 
 oracle.jdbc.driver.OracleResultSetImpl.isClosed()Z
   at org.apache.spark.rdd.JdbcRDD$$anon$1.close(JdbcRDD.scala:99)
   at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
   at 
 org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71)
   at 
 org.apache.spark.rdd.JdbcRDD$$anon$1$$anonfun$1.apply(JdbcRDD.scala:71)
   at 
 org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:85)
   at 
 org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:110)
   at 
 org.apache.spark.TaskContext$$anonfun$markTaskCompleted$1.apply(TaskContext.scala:108)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at org.apache.spark.TaskContext.markTaskCompleted(TaskContext.scala:108)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:64)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 09:56:48,302 [Executor task launch worker-1] ERROR Logging$class : Error in 
 TaskCompletionListener
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2015-01-13 Thread Alex Baretta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Baretta updated SPARK-5236:

Summary: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableInt  (was: 
parquet.io.ParquetDecodingException: Can not read value at 0 in block 0)

 java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableInt
 -

 Key: SPARK-5236
 URL: https://issues.apache.org/jira/browse/SPARK-5236
 Project: Spark
  Issue Type: Bug
Reporter: Alex Baretta

 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 
 (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value 
 at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
 at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
 at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
 at 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableInt
 at 
 org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
 at 
 org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375)
 at 
 org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434)
 at 
 parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237)
 at 
 parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353)
 at 
 parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
 at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
 ... 27 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276558#comment-14276558
 ] 

Apache Spark commented on SPARK-4923:
-

User 'rcsenkbeil' has created a pull request for this issue:
https://github.com/apache/spark/pull/4034

 Maven build should keep publishing spark-repl
 -

 Key: SPARK-4923
 URL: https://issues.apache.org/jira/browse/SPARK-4923
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.2.0
Reporter: Peng Cheng
Priority: Critical
  Labels: shell
 Attachments: 
 SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Spark-repl installation and deployment have been discontinued (see 
 SPARK-3452). But it's in the dependency list of a few projects that extend 
 its initialization process.
 Please remove the 'skip' setting in spark-repl and make it an 'official' API 
 to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5240) Adding `createDataSourceTable` interface to Catalog

2015-01-13 Thread wangfei (JIRA)
wangfei created SPARK-5240:
--

 Summary: Adding `createDataSourceTable` interface to Catalog
 Key: SPARK-5240
 URL: https://issues.apache.org/jira/browse/SPARK-5240
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei


Adding `createDataSourceTable` interface to Catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-13 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276562#comment-14276562
 ] 

Chip Senkbeil commented on SPARK-4923:
--

As the nice bot has stated, I created a pull request for this issue. I detailed 
why I marked each method/field public and provided Scaladocs for each of them 
to make the exposure of the REPL API a little nicer.

As stated in the pull request, I only tackled Scala 2.10 for now, as the Scala 
2.11 version did not appear to be ready, although I could easily be mistaken. I 
merely glanced at SparkIMain and noticed that it did not have the class 
server declaration to ship the compiled class files, nor was it - or any of the 
other classes - in the org.apache.spark.repl package.

 Maven build should keep publishing spark-repl
 -

 Key: SPARK-4923
 URL: https://issues.apache.org/jira/browse/SPARK-4923
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Affects Versions: 1.2.0
Reporter: Peng Cheng
Priority: Critical
  Labels: shell
 Attachments: 
 SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Spark-repl installation and deployment have been discontinued (see 
 SPARK-3452). But it's in the dependency list of a few projects that extend 
 its initialization process.
 Please remove the 'skip' setting in spark-repl and make it an 'official' API 
 to encourage more platforms to integrate with it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5006) spark.port.maxRetries doesn't work

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5006.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: WangTaoTheTonic
Target Version/s: 1.3.0

 spark.port.maxRetries doesn't work
 --

 Key: SPARK-5006
 URL: https://issues.apache.org/jira/browse/SPARK-5006
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.1.0
Reporter: WangTaoTheTonic
Assignee: WangTaoTheTonic
 Fix For: 1.3.0


 We normally configure spark.port.maxRetries in a properties file or SparkConf, but 
 Utils.scala reads it from SparkEnv's conf. SparkEnv is an object whose 
 env needs to be set after the JVM is launched, and Utils.scala is also an object, 
 so in most cases portMaxRetries will get the default value 16.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275600#comment-14275600
 ] 

Apache Spark commented on SPARK-3288:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/4020

 All fields in TaskMetrics should be private and use getters/setters
 ---

 Key: SPARK-3288
 URL: https://issues.apache.org/jira/browse/SPARK-3288
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Patrick Wendell
Assignee: Dale Richardson
  Labels: starter

 This is particularly bad because we expose this as a developer API. 
 Technically a library could create a TaskMetrics object and then change the 
 values inside of it and pass it onto someone else. It can be written pretty 
 compactly like below:
 {code}
 /**
  * Number of bytes written for the shuffle by this task
  */
 @volatile private var _shuffleBytesWritten: Long = _
 def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
 def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
 def shuffleBytesWritten = _shuffleBytesWritten
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API

2015-01-13 Thread Davies Liu (JIRA)
Davies Liu created SPARK-5223:
-

 Summary: Use pickle instead of MapConvert and ListConvert in MLlib 
Python API
 Key: SPARK-5223
 URL: https://issues.apache.org/jira/browse/SPARK-5223
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Reporter: Davies Liu
Priority: Critical


It will introduce problems if an object in a dict/list/tuple cannot be supported by 
py4j, such as Vector.

Also, pickle may have better performance for larger objects (fewer RPCs).

In cases where an object in a dict/list cannot be pickled (such as a 
JavaObject), we should still use MapConvert/ListConvert.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4697) System properties should override environment variables

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4697:
-
Affects Version/s: 1.0.0

 System properties should override environment variables
 ---

 Key: SPARK-4697
 URL: https://issues.apache.org/jira/browse/SPARK-4697
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: WangTaoTheTonic

 I found that some arguments in the YARN module take environment variables before 
 system properties, while the latter override the former in the core module.
 This should be changed in org.apache.spark.deploy.yarn.ClientArguments and 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.
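
 For illustration only, a hedged sketch of the intended precedence (the property and
 environment variable names below are placeholders, not actual Spark settings): read
 the system property first and fall back to the environment variable.
 {code}
 // Hypothetical precedence helper; "spark.example.setting" and "SPARK_EXAMPLE_SETTING"
 // are made-up names used purely for illustration.
 def resolveSetting(default: String): String =
   sys.props.get("spark.example.setting")
     .orElse(sys.env.get("SPARK_EXAMPLE_SETTING"))
     .getOrElse(default)
 {code}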



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5219) Race condition in TaskSchedulerImpl and TaskSetManager

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5219:
-
Assignee: Shixiong Zhu

 Race condition in TaskSchedulerImpl and TaskSetManager
 --

 Key: SPARK-5219
 URL: https://issues.apache.org/jira/browse/SPARK-5219
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu

 TaskSchedulerImpl.handleTaskGettingResult, TaskSetManager.canFetchMoreResults 
 and TaskSetManager.abort will access variables which are used in multiple 
 threads, but they don't use a correct lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5219) Race condition in TaskSchedulerImpl and TaskSetManager

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5219:
-
Affects Version/s: 1.2.0

 Race condition in TaskSchedulerImpl and TaskSetManager
 --

 Key: SPARK-5219
 URL: https://issues.apache.org/jira/browse/SPARK-5219
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Shixiong Zhu

 TaskSchedulerImpl.handleTaskGettingResult, TaskSetManager.canFetchMoreResults 
 and TaskSetManager.abort will access variables which are used in multiple 
 threads, but they don't use a correct lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3885) Provide mechanism to remove accumulators once they are no longer used

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275607#comment-14275607
 ] 

Apache Spark commented on SPARK-3885:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/4021

 Provide mechanism to remove accumulators once they are no longer used
 -

 Key: SPARK-3885
 URL: https://issues.apache.org/jira/browse/SPARK-3885
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0
Reporter: Josh Rosen

 Spark does not currently provide any mechanism to delete accumulators after 
 they are no longer used.  This can lead to OOMs for long-lived SparkContexts 
 that create many large accumulators.
 Part of the problem is that accumulators are registered in a global 
 {{Accumulators}} registry.  Maybe the fix would be as simple as using weak 
 references in the Accumulators registry so that accumulators can be GC'd once 
 they can no longer be used.
 In the meantime, here's a workaround that users can try:
 Accumulators have a public setValue() method that can be called (only by the 
 driver) to change an accumulator’s value.  You might be able to use this to 
 reset accumulators’ values to smaller objects (e.g. the “zero” object of 
 whatever your accumulator type is, or ‘null’ if you’re sure that the 
 accumulator will never be accessed again).
 This issue was originally reported by [~nkronenfeld] on the dev mailing list: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-Accumulator-question-td8709.html
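
 A short sketch of the setValue() workaround described above, assuming an existing
 SparkContext `sc` on the driver; the Seq[String] accumulator type is chosen purely
 for illustration:
 {code}
 // Hypothetical driver-side sketch of the workaround; this shrinks the retained value
 // but does not unregister the accumulator.
 import org.apache.spark.AccumulatorParam

 object SeqParam extends AccumulatorParam[Seq[String]] {
   def zero(v: Seq[String]): Seq[String] = Seq.empty
   def addInPlace(a: Seq[String], b: Seq[String]): Seq[String] = a ++ b
 }

 val acc = sc.accumulator(Seq.empty[String])(SeqParam)
 // ... run jobs that add to `acc` ...
 val collected = acc.value      // read the (potentially large) result on the driver
 acc.setValue(Seq.empty)        // reset to the zero value so the old result can be GC'd
 {code}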



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5222) YARN client and cluster modes have different app name behaviors

2015-01-13 Thread Andrew Or (JIRA)
Andrew Or created SPARK-5222:


 Summary: YARN client and cluster modes have different app name 
behaviors
 Key: SPARK-5222
 URL: https://issues.apache.org/jira/browse/SPARK-5222
 Project: Spark
  Issue Type: Bug
Reporter: Andrew Or
Assignee: WangTaoTheTonic


The behavior is summarized in a table produced by [~WangTaoTheTonic] here: 
https://github.com/apache/spark/pull/3557

SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. 
This results in the strange behavior where the app name changes if the user 
runs the same application but uses a different deploy mode from before.

We should make sure the app name behavior is consistent across deploy modes 
regardless of what variable or config is set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5222) YARN client and cluster modes have different app name behaviors

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5222:
-
Description: 
The behavior is summarized in a table produced by [~WangTaoTheTonic] here: 
https://github.com/apache/spark/pull/3557

SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. 
This results in the strange behavior where the app name changes if the user 
runs the same application but uses a different deploy mode from before. We 
should make sure the app name behavior is consistent across deploy modes 
regardless of what variable or config is set.

Additionally, it should be noted that because spark.app.name is required of 
all applications, the setting of SPARK_YARN_APP_NAME will not take effect 
unless we handle it preemptively in Spark submit.

  was:
The behavior is summarized in a table produced by [~WangTaoTheTonic] here: 
https://github.com/apache/spark/pull/3557

SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. 
This results in the strange behavior where the app name changes if the user 
runs the same application but uses a different deploy mode from before.

We should make sure the app name behavior is consistent across deploy modes 
regardless of what variable or config is set.


 YARN client and cluster modes have different app name behaviors
 ---

 Key: SPARK-5222
 URL: https://issues.apache.org/jira/browse/SPARK-5222
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: WangTaoTheTonic

 The behavior is summarized in a table produced by [~WangTaoTheTonic] here: 
 https://github.com/apache/spark/pull/3557
 SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. 
 This results in the strange behavior where the app name changes if the user 
 runs the same application but uses a different deploy mode from before. We 
 should make sure the app name behavior is consistent across deploy modes 
 regardless of what variable or config is set.
 Additionally, it should be noted that because spark.app.name is required of 
 all applications, the setting of SPARK_YARN_APP_NAME will not take effect 
 unless we handle it preemptively in Spark submit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275671#comment-14275671
 ] 

Apache Spark commented on SPARK-5223:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4023

 Use pickle instead of MapConvert and ListConvert in MLlib Python API
 

 Key: SPARK-5223
 URL: https://issues.apache.org/jira/browse/SPARK-5223
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Reporter: Davies Liu
Priority: Critical

 It will introduce problems if an object in a dict/list/tuple cannot be supported 
 by py4j, such as Vector.
 Also, pickle may have better performance for larger objects (fewer RPCs).
 In cases where an object in a dict/list cannot be pickled (such as a 
 JavaObject), we should still use MapConvert/ListConvert.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4697) System properties should override environment variables

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4697:
-
Assignee: WangTaoTheTonic

 System properties should override environment variables
 ---

 Key: SPARK-4697
 URL: https://issues.apache.org/jira/browse/SPARK-4697
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: WangTaoTheTonic
Assignee: WangTaoTheTonic

 I found that some arguments in the YARN module take environment variables before 
 system properties, while the latter override the former in the core module.
 This should be changed in org.apache.spark.deploy.yarn.ClientArguments and 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-733) Add documentation on use of accumulators in lazy transformation

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275608#comment-14275608
 ] 

Apache Spark commented on SPARK-733:


User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/4022

 Add documentation on use of accumulators in lazy transformation
 ---

 Key: SPARK-733
 URL: https://issues.apache.org/jira/browse/SPARK-733
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Josh Rosen

 Accumulator updates are side-effects of RDD computations.  Unlike RDDs, 
 accumulators do not carry lineage that would allow them to be computed when 
 their values are accessed on the master.
 This can lead to confusion when accumulators are used in lazy transformations 
 like `map`:
 {code}
 val acc = sc.accumulator(0)
 data.map(x => { acc += x; f(x) })
 // Here, acc is 0 because no actions have caused the `map` to be computed.
 {code}
 As far as I can tell, our documentation only includes examples of using 
 accumulators in `foreach`, for which this problem does not occur.
 This pattern of using accumulators in map() occurs in Bagel and other Spark 
 code found in the wild.
 It might be nice to document this behavior in the accumulators section of the 
 Spark programming guide.
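
 A small, self-contained illustration of the behavior described above (assumes an
 existing SparkContext `sc`; the numbers are arbitrary):
 {code}
 // Hypothetical example of the lazy-evaluation pitfall with accumulators in map().
 val acc = sc.accumulator(0)
 val data = sc.parallelize(1 to 100)
 val mapped = data.map { x => acc += x; x * 2 }  // transformation only: nothing runs yet
 println(acc.value)   // prints 0 - the map has not been computed
 mapped.count()       // an action forces the computation
 println(acc.value)   // now prints 5050 - the side-effect updates have been applied
 {code}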



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5222) YARN client and cluster modes have different app name behaviors

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5222:
-
Affects Version/s: 1.0.0

 YARN client and cluster modes have different app name behaviors
 ---

 Key: SPARK-5222
 URL: https://issues.apache.org/jira/browse/SPARK-5222
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: WangTaoTheTonic

 The behavior is summarized in a table produced by [~WangTaoTheTonic] here: 
 https://github.com/apache/spark/pull/3557
 SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. 
 This results in the strange behavior where the app name changes if the user 
 runs the same application but uses a different deploy mode from before.
 We should make sure the app name behavior is consistent across deploy modes 
 regardless of what variable or config is set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5008) Persistent HDFS does not recognize EBS Volumes

2015-01-13 Thread Brad Willard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275586#comment-14275586
 ] 

Brad Willard commented on SPARK-5008:
-

[~nchammas] I went ahead and created a cluster with this

./spark-ec2 -v 1.2.0 --wait 235 -k ... --copy-aws-credentials 
--hadoop-major-version 1 -z us-east-1c -s 2 -m c1.medium -t c1.medium launch 
spark-hdfs-bug --ebs-vol-size 10 --ebs-vol-type gp2 --ebs-vol-num 1

I updated the core-site.xml and switched /vol to /vol0, ran copy-dir, and 
restarted via stop-all.sh and start-all.sh.
That brings it up in a broken state. However, if I then modify the core-site.xml 
back to /vol on the master and restart, it works correctly.

So that's a partial solution. I assume this is because the master node doesn't 
get an ebs volume.

 Persistent HDFS does not recognize EBS Volumes
 --

 Key: SPARK-5008
 URL: https://issues.apache.org/jira/browse/SPARK-5008
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
 Environment: 8 Node Cluster Generated from 1.2.0 spark-ec2 script.
 -m c3.2xlarge -t c3.8xlarge --ebs-vol-size 300 --ebs-vol-type gp2 
 --ebs-vol-num 1
Reporter: Brad Willard

 The cluster is built with correctly sized EBS volumes. It creates the volume at 
 /dev/xvds and it is mounted to /vol0. However, when you start persistent HDFS 
 with the start-all script, it starts but isn't correctly configured to use the 
 EBS volume.
 I'm assuming some symlinks or expected mounts are not correctly configured.
 This has worked flawlessly on all previous versions of Spark.
 I have a stupid workaround of installing pssh and mucking with it by mounting 
 it to /vol, which worked; however, it doesn't work between restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5222) YARN client and cluster modes have different app name behaviors

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5222:
-
Component/s: YARN

 YARN client and cluster modes have different app name behaviors
 ---

 Key: SPARK-5222
 URL: https://issues.apache.org/jira/browse/SPARK-5222
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: WangTaoTheTonic

 The behavior is summarized in a table produced by [~WangTaoTheTonic] here: 
 https://github.com/apache/spark/pull/3557
 SPARK_YARN_APP_NAME is respected only in client mode but not in cluster mode. 
 This results in the strange behavior where the app name changes if the user 
 runs the same application but uses a different deploy mode from before.
 We should make sure the app name behavior is consistent across deploy modes 
 regardless of what variable or config is set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4955:
-
Summary: Dynamic allocation doesn't work in YARN cluster mode  (was: 
Executor does not get killed after configured interval.)

 Dynamic allocation doesn't work in YARN cluster mode
 

 Key: SPARK-4955
 URL: https://issues.apache.org/jira/browse/SPARK-4955
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Chengxiang Li

 With executor dynamic scaling enabled, in yarn-cluster mode, after a query 
 finishes and the spark.dynamicAllocation.executorIdleTimeout interval passes, the 
 executor number is not reduced to the configured minimum.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4955:
-
Priority: Critical  (was: Major)

 Dynamic allocation doesn't work in YARN cluster mode
 

 Key: SPARK-4955
 URL: https://issues.apache.org/jira/browse/SPARK-4955
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Chengxiang Li
Priority: Critical

 With executor dynamic scaling enabled, in yarn-cluster mode, after a query 
 finishes and the spark.dynamicAllocation.executorIdleTimeout interval passes, the 
 executor number is not reduced to the configured minimum.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode

2015-01-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4955:
-
Assignee: Lianhui Wang

 Dynamic allocation doesn't work in YARN cluster mode
 

 Key: SPARK-4955
 URL: https://issues.apache.org/jira/browse/SPARK-4955
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Chengxiang Li
Assignee: Lianhui Wang
Priority: Critical

 With executor dynamic scaling enabled, in yarn-cluster mode, after a query 
 finishes and the spark.dynamicAllocation.executorIdleTimeout interval passes, the 
 executor number is not reduced to the configured minimum.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5242) ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is available

2015-01-13 Thread Vladimir Grigor (JIRA)
Vladimir Grigor created SPARK-5242:
--

 Summary: ec2/spark_ec2.py launch does not work with VPC if no 
public DNS or IP is available
 Key: SPARK-5242
 URL: https://issues.apache.org/jira/browse/SPARK-5242
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Vladimir Grigor


How to reproduce: a user starting a cluster in a VPC has to wait forever:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
--spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
--subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
Setting up security groups...
Searching for existing cluster SparkByScript...
Spark AMI: ami-1ae0166d
Launching instances...
Launched 1 slaves in eu-west-1a, regid = r-e70c5502
Launched master in eu-west-1a, regid = r-bf0f565a
Waiting for cluster to enter 'ssh-ready' state..{forever}
{code}

The problem is that the current code makes the wrong assumption that a VPC instance 
has a public_dns_name or public ip_address. Actually, it is more common that a VPC 
instance has only a private_ip_address.


The bug is already fixed in my fork; I am going to submit a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-13 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276572#comment-14276572
 ] 

Florian Verhein commented on SPARK-3821:


Thanks [~nchammas], that makes sense.

Created #SPARK-5241.
I'm not sure about the pre-built scenario, but am guessing e.g. 
http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop2.4.tgz != 
http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-cdh4.tgz. So 
perhaps the intent is that the spark-ec2 scripts only support cdh 
distributions...  

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276581#comment-14276581
 ] 

Apache Spark commented on SPARK-5147:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/4037

 write ahead logs from streaming receiver are not purged because 
 cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
 --

 Key: SPARK-5147
 URL: https://issues.apache.org/jira/browse/SPARK-5147
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Max Xu
Priority: Blocker

 Hi all,
 We are running a Spark streaming application with ReliableKafkaReceiver. We 
 have spark.streaming.receiver.writeAheadLog.enable set to true so write 
 ahead logs (WALs) for received data are created under receivedData/streamId 
 folder in the checkpoint directory. 
 However, old WALs are never purged based on time. receivedBlockMetadata and 
 checkpoint files are purged correctly, though. I went through the code: the 
 WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is 
 responsible for cleaning up the old blocks. It has a method, cleanupOldBlocks, 
 which is never called by any class. The ReceiverSupervisorImpl class holds a 
 WriteAheadLogBasedBlockHandler instance. However, it only calls the storeBlock 
 method to create WALs but never calls the cleanupOldBlocks method to purge old 
 WALs.
 The size of the WAL folder increases constantly on HDFS. This is preventing 
 us from running the ReliableKafkaReceiver 24x7. Can somebody please take a 
 look.
 Thanks,
 Max



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster

2015-01-13 Thread yuhao yang (JIRA)
yuhao yang created SPARK-5243:
-

 Summary: Spark will hang if (driver memory + executor memory) 
exceeds limit on a 1-worker cluster
 Key: SPARK-5243
 URL: https://issues.apache.org/jira/browse/SPARK-5243
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor


Spark will hang when calling spark-submit under these conditions:

1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster

This usually happens to beginners during development.
There should be some exit mechanism, or at least a warning message in the output 
of spark-submit.

I am preparing a PR for this case, and I would like to know your opinions on whether 
a fix is needed and what the better fix options are.
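
For discussion, here is a hedged sketch of the kind of check that could produce such a warning; the method and parameter names are made up for illustration, and this is not the actual deploy code:

{code}
// Hypothetical helper: warn when no single worker can host both the driver and an executor.
def warnIfUnschedulable(driverMemMb: Long, executorMemMb: Long, largestWorkerMemMb: Long): Unit = {
  if (driverMemMb + executorMemMb > largestWorkerMemMb) {
    System.err.println(
      s"Warning: driver memory ($driverMemMb MB) + executor memory ($executorMemMb MB) " +
      s"exceeds the largest worker's memory ($largestWorkerMemMb MB); " +
      "in cluster deploy mode the application may hang waiting for resources.")
  }
}
{code}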





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5242) ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is available

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276590#comment-14276590
 ] 

Apache Spark commented on SPARK-5242:
-

User 'voukka' has created a pull request for this issue:
https://github.com/apache/spark/pull/4038

 ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is 
 available
 ---

 Key: SPARK-5242
 URL: https://issues.apache.org/jira/browse/SPARK-5242
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Vladimir Grigor
  Labels: easyfix

 How to reproduce: a user starting a cluster in a VPC has to wait forever:
 {code}
 ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
 --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
 Setting up security groups...
 Searching for existing cluster SparkByScript...
 Spark AMI: ami-1ae0166d
 Launching instances...
 Launched 1 slaves in eu-west-1a, regid = r-e70c5502
 Launched master in eu-west-1a, regid = r-bf0f565a
 Waiting for cluster to enter 'ssh-ready' state..{forever}
 {code}
 The problem is that the current code makes the wrong assumption that a VPC instance 
 has a public_dns_name or public ip_address. Actually, it is more common that a VPC 
 instance has only a private_ip_address.
 The bug is already fixed in my fork; I am going to submit a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276594#comment-14276594
 ] 

Apache Spark commented on SPARK-5236:
-

User 'alexbaretta' has created a pull request for this issue:
https://github.com/apache/spark/pull/4039

 java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableInt
 -

 Key: SPARK-5236
 URL: https://issues.apache.org/jira/browse/SPARK-5236
 Project: Spark
  Issue Type: Bug
Reporter: Alex Baretta

 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 
 (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value 
 at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
 at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
 at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
 at 
 org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1331)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableInt
 at 
 org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241)
 at 
 org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter.updateInt(ParquetConverter.scala:375)
 at 
 org.apache.spark.sql.parquet.CatalystPrimitiveConverter.addInt(ParquetConverter.scala:434)
 at 
 parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:237)
 at 
 parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:353)
 at 
 parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:402)
 at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
 ... 27 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5242) ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is available

2015-01-13 Thread Vladimir Grigor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276591#comment-14276591
 ] 

Vladimir Grigor commented on SPARK-5242:


This bug is fixed in https://github.com/apache/spark/pull/4038

 ec2/spark_ec2.py launch does not work with VPC if no public DNS or IP is 
 available
 ---

 Key: SPARK-5242
 URL: https://issues.apache.org/jira/browse/SPARK-5242
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Vladimir Grigor
  Labels: easyfix

 How to reproduce: a user starting a cluster in a VPC has to wait forever:
 {code}
 ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
 --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
 Setting up security groups...
 Searching for existing cluster SparkByScript...
 Spark AMI: ami-1ae0166d
 Launching instances...
 Launched 1 slaves in eu-west-1a, regid = r-e70c5502
 Launched master in eu-west-1a, regid = r-bf0f565a
 Waiting for cluster to enter 'ssh-ready' state..{forever}
 {code}
 The problem is that the current code makes the wrong assumption that a VPC instance 
 has a public_dns_name or public ip_address. Actually, it is more common that a VPC 
 instance has only a private_ip_address.
 The bug is already fixed in my fork; I am going to submit a pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5241) spark-ec2 spark init scripts do not handle all hadoop (or tachyon?) dependencies correctly

2015-01-13 Thread Florian Verhein (JIRA)
Florian Verhein created SPARK-5241:
--

 Summary: spark-ec2 spark init scripts do not handle all hadoop (or 
tachyon?) dependencies correctly
 Key: SPARK-5241
 URL: https://issues.apache.org/jira/browse/SPARK-5241
 Project: Spark
  Issue Type: Bug
  Components: Build, EC2
Reporter: Florian Verhein



spark-ec2/spark/init.sh doesn't handle the hadoop dependencies completely. This 
may also be an issue for the tachyon dependencies. Related: tachyon appears to require 
builds against the right version of hadoop as well (probably causes this: 
SPARK-3185). 

Applies to the spark build from git checkout in spark/init.sh (I suspect this 
should also be changed to using mvn as that's the reference build according to 
the docs?).

May apply to pre-built spark in spark/init.sh as well, but I'm not sure about 
this. E.g. I thought that the hadoop2.4 and cdh4.2 builds of spark are 
different.

Also note that hadoop native is built from hadoop 2.4.1 on the AMI, and this is 
used regardless of HADOOP_MAJOR_VERSION in the *-hdfs modules.

Tachyon is hard coded to 0.4.1 (which is probably built against hadoop1.x?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5240) Adding `createDataSourceTable` interface to Catalog

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276566#comment-14276566
 ] 

Apache Spark commented on SPARK-5240:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/4036

 Adding `createDataSourceTable` interface to Catalog
 ---

 Key: SPARK-5240
 URL: https://issues.apache.org/jira/browse/SPARK-5240
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei

 Adding `createDataSourceTable` interface to Catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode

2015-01-13 Thread Josh Devins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276607#comment-14276607
 ] 

Josh Devins commented on SPARK-5095:


Nice one, gonna try and test it this week.




 Support launching multiple mesos executors in coarse grained mesos mode
 ---

 Key: SPARK-5095
 URL: https://issues.apache.org/jira/browse/SPARK-5095
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Currently, in coarse-grained Mesos mode, it's expected that we only launch one 
 Mesos executor that launches one JVM process to launch multiple Spark 
 executors.
 However, this becomes a problem when the JVM process launched is larger than 
 an ideal size (30 GB is the recommended value from Databricks), which causes the GC 
 problems reported on the mailing list.
 We should support launching multiple executors when large enough resources 
 are available for Spark to use, and these resources are still under the 
 configured limit.
 This is also applicable when users want to specify the number of executors to be 
 launched on each node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5224) parallelize list/ndarray is really slow

2015-01-13 Thread Davies Liu (JIRA)
Davies Liu created SPARK-5224:
-

 Summary: parallelize list/ndarray is really slow
 Key: SPARK-5224
 URL: https://issues.apache.org/jira/browse/SPARK-5224
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Davies Liu
Priority: Blocker


After the default batchSize was changed to 0 (batching based on object size), 
parallelize() still uses BatchedSerializer with batchSize=1.

Also, BatchedSerializer does not work well with list and numpy.ndarray.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5224) parallelize list/ndarray is really slow

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275711#comment-14275711
 ] 

Apache Spark commented on SPARK-5224:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4024

 parallelize list/ndarray is really slow
 ---

 Key: SPARK-5224
 URL: https://issues.apache.org/jira/browse/SPARK-5224
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Davies Liu
Priority: Blocker

 After the default batchSize was changed to 0 (batching based on object size), 
 parallelize() still uses BatchedSerializer with batchSize=1.
 Also, BatchedSerializer does not work well with list and numpy.ndarray.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5225) Support coalesed Input Metrics from different sources

2015-01-13 Thread Kostas Sakellis (JIRA)
Kostas Sakellis created SPARK-5225:
--

 Summary: Support coalesed Input Metrics from different sources
 Key: SPARK-5225
 URL: https://issues.apache.org/jira/browse/SPARK-5225
 Project: Spark
  Issue Type: Bug
Reporter: Kostas Sakellis


Currently, if a task reads data from more than one block, and the blocks come from 
different read methods, we ignore the bytes from the second read method. For example:

      CoalescedRDD
           |
         Task1
        /  |  \
  hadoop hadoop cached

If Task1 starts reading from the hadoop blocks first, then the input metrics 
for Task1 will only contain input metrics from the hadoop blocks and ignore the 
input metrics from the cached blocks. We need to change the way we collect input 
metrics so that it is not a single value but rather a collection of input 
metrics for a task.
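
A rough sketch of that idea with made-up names (this is not the actual TaskMetrics API): keep one bytes-read entry per read method instead of a single value per task.

{code}
// Hypothetical sketch only; InputBytesByMethod is an illustrative name, not Spark code.
import scala.collection.mutable

class InputBytesByMethod {
  private val bytesByMethod = mutable.Map.empty[String, Long]  // e.g. "Hadoop", "Memory"

  def add(readMethod: String, bytes: Long): Unit =
    bytesByMethod(readMethod) = bytesByMethod.getOrElse(readMethod, 0L) + bytes

  def totalBytesRead: Long = bytesByMethod.values.sum
}
{code}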



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2909) Indexing for SparseVector in pyspark

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275741#comment-14275741
 ] 

Apache Spark commented on SPARK-2909:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/4025

 Indexing for SparseVector in pyspark
 

 Key: SPARK-2909
 URL: https://issues.apache.org/jira/browse/SPARK-2909
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Priority: Minor

 SparseVector in pyspark does not currently support indexing, except by 
 examining the internal representation.  Though indexing is a pricy operation, 
 it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) 
 and operating on a single feature.
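
 For reference, a small sketch of how indexing into a sparse representation typically
 works (illustrative Scala, not the pyspark SparseVector code): the stored index array
 is searched, and a miss means the value is zero.
 {code}
 // Hypothetical illustration: `indices` is sorted and parallel to `values`;
 // positions not present in `indices` are implicitly zero.
 def sparseApply(indices: Array[Int], values: Array[Double], i: Int): Double = {
   val pos = java.util.Arrays.binarySearch(indices, i)
   if (pos >= 0) values(pos) else 0.0
 }

 // Example: the sparse form of [0.0, 3.0, 0.0, 5.0]
 val idx = Array(1, 3)
 val vals = Array(3.0, 5.0)
 assert(sparseApply(idx, vals, 3) == 5.0)
 assert(sparseApply(idx, vals, 2) == 0.0)
 {code}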



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-01-13 Thread Zach Fry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272291#comment-14272291
 ] 

Zach Fry edited comment on SPARK-4879 at 1/13/15 7:53 PM:
--

Hey Josh,

I was able to reproduce the missing file using the speculation settings in my 
previous comment:

{code}
scala> 15/01/09 18:33:28 WARN scheduler.TaskSetManager: Lost task 42.1 in stage 
0.0 (TID 113, redacted-03): java.io.IOException: Failed to save output of 
task: attempt_201501091833__m_42_113

org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)

org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)

org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
15/01/09 18:33:47 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
(TID 0, redacted-03): java.io.IOException: The temporary job-output directory 
hdfs://redacted-01:8020/test2/_temporary doesn't exist!

org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)

org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240)

org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:980)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

Notice here that there are only 99 part files and part-00042 is missing (as 
seen in the stacktrace above)
{code}
  $ hadoop fs -ls /test2 | grep part | wc -l
99
 $ hadoop fs -ls /test2 | grep part-0004
-rw-r--r--   3 redacted supergroup  8 2015-01-09 18:33 
/test2/part-00040
-rw-r--r--   3 redacted supergroup  8 2015-01-09 18:33 
/test2/part-00041
-rw-r--r--   3 redacted supergroup  8 2015-01-09 18:33 
/test2/part-00043
-rw-r--r--   3 redacted supergroup  8 2015-01-09 18:33 
/test2/part-00044
-rw-r--r--   3 redacted supergroup  8 2015-01-09 18:33 
/test2/part-00045
-rw-r--r--   3 redacted supergroup  8 2015-01-09 18:33 
/test2/part-00046
-rw-r--r--   3 redacted supergroup  8 2015-01-09 18:33 
/test2/part-00047
-rw-r--r--   3 redacted supergroup  8 2015-01-09 18:33 
/test2/part-00048
-rw-r--r--   3 redacted supergroup  8 2015-01-09 18:33 
/test2/part-00049
{code}




was (Author: zfry):
Hey Josh,

I was able to reproduce the missing file using the speculation settings in my 
previous comment:

{code}
scala 15/01/09 18:33:28 WARN scheduler.TaskSetManager: Lost task 42.1 in stage 
0.0 (TID 113, redacted-03): java.io.IOException: Failed to save output of 
task: attempt_201501091833__m_42_113

org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)

org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)

org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)


[jira] [Updated] (SPARK-5225) Support coalesced Input Metrics from different sources

2015-01-13 Thread Kostas Sakellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kostas Sakellis updated SPARK-5225:
---
Description: 
Currently, if a task reads data from more than one block and the blocks come from 
different read methods, we ignore the bytes from the second read method. For example:
{noformat}
      CoalescedRDD
           |
         Task1
        /  |  \
  hadoop hadoop cached
{noformat}
If Task1 starts reading from the hadoop blocks first, then the input metrics 
for Task1 will only contain input metrics from the hadoop blocks and ignore the 
input metrics from the cached blocks. We need to change the way we collect input 
metrics so that it is not a single value but rather a collection of input 
metrics for a task.

  was:
Currently, If task reads data from more than one block and it is from different 
read methods we ignore the second read method bytes. For example:
 CoalescedRDD
| 
Task1 
  / |\
/   |  \   
  hadoop   hadoop  cached

if Task1 starts reading from the hadoop blocks first, then the input metrics 
for Task1 will only contain input metrics from the hadoop blocks and ignre the 
input metrics from cached blocks. We need to change the way we collect input 
metrics so that it is not a single value but rather a collection of input 
metrics for a task. 


 Support coalesced Input Metrics from different sources
 -

 Key: SPARK-5225
 URL: https://issues.apache.org/jira/browse/SPARK-5225
 Project: Spark
  Issue Type: Bug
Reporter: Kostas Sakellis

 Currently, if a task reads data from more than one block and the blocks come 
 from different read methods, we ignore the bytes from the second read method. 
 For example:
 {noformat}
       CoalescedRDD
            |
          Task1
         /  |  \
   hadoop hadoop cached
 {noformat}
 If Task1 starts reading from the hadoop blocks first, then the input metrics 
 for Task1 will only contain input metrics from the hadoop blocks and ignore 
 the input metrics from the cached blocks. We need to change the way we collect 
 input metrics so that it is not a single value but rather a collection of 
 input metrics for a task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5211) Restore HiveMetastoreTypes.toDataType

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275846#comment-14275846
 ] 

Apache Spark commented on SPARK-5211:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4026

 Restore HiveMetastoreTypes.toDataType
 -

 Key: SPARK-5211
 URL: https://issues.apache.org/jira/browse/SPARK-5211
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 It was a public API. Since developers are using it, we need to get it back.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2015-01-13 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3185:

Description: 
{code}
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate 
with client version 4
{code}

When I launch Spark 1.0.2 on Hadoop 2 in a new EC2 cluster, the above Tachyon 
exception is thrown when formatting the JOURNAL_FOLDER.

No exception occurs when I launch on Hadoop 1.

Launch used:
{code}
./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
--zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
sparkProd
{code}

{code}
log snippet
Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Exception in thread main java.lang.RuntimeException: 
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate 
with client version 4
at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
at tachyon.Format.main(Format.java:54)
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69)
... 3 more
Killed 0 processes
Killed 0 processes
ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
---end snippet---
{code}

*I don't have this problem when I launch without the --hadoop-major-version=2 
(which defaults to Hadoop 1.x).*

  was:
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate 
with client version 4

When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon 
exception is thrown when Formatting JOURNAL_FOLDER.

No exception occurs when I launch on Hadoop 1.

Launch used:
./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
--zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
sparkProd

log snippet
Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Exception in thread main java.lang.RuntimeException: 
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate 
with client version 4
at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73)
at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
at tachyon.Format.main(Format.java:54)
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at 

[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-13 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276263#comment-14276263
 ] 

Florian Verhein commented on SPARK-3821:


This is great stuff! It'll also help serve as some documentation for AMI 
requirements when using the spark-ec2 scripts.  

Re the above, I think everything in create_image.sh can be refactored to packer 
(+ duplicate removal - e.g. root login). I've attempted to do this in a fork of 
[~nchammas]'s work, but my use case is a bit different in that I need to go 
from a fresh centos6 minimal (rather than an amazon linux AMI) and then add 
other things.

Possibly related to AMI generation in general: I've noticed that the version 
dependencies in the spark-ec2 scripts are broken. I suspect this will need to 
be handled in both the image and the setup. For example:
- It looks like Spark needs to be built with the right hadoop profile to work, 
but this isn't adhered to. This applies when spark is built from a git checkout 
or from an existing build. This is likely also the case with Tachyon too. 
Probably the cause of https://issues.apache.org/jira/browse/SPARK-3185
- The hadoop native libs are built on the image using 2.4.1, but then copied 
into whatever hadoop build is downloaded in the ephemeral-hdfs and 
persistent-hdfs scripts. I suspect that could cause issues too. Since building 
hadoop is very time consuming, it's something you'd want on the image - hence 
creating a dependency. 
- The version dependencies for other things like ganglia aren't documented (I 
believe this is installed on the image but duplicated again in 
spark-ec2/ganglia). I've found that the ganglia config doesn't work for me (but 
recall I'm using a different base AMI, so I'll likely get a different ganglia 
version). I have a sneaky suspicion that the hadoop configs in spark-ec2 won't 
work across the hadoop versions either (but, fingers crossed!).

Re the above, I might try keeping the entire hadoop build (from the image 
creation) for the hdfs setup.

Sorry for the sidetrack, but I'm struggling through all this, so hoping it might ring 
a bell for someone.  

p.s. With the image automation, it might also be worth considering putting more 
on the image as an option (esp for people happy to build their own AMIs). For 
example, I see no reason why the module init.sh scripts can't be run from 
packer in order to speed start-up times of the cluster :) 


 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5167) Move Row into sql package and make it usable for Java

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276327#comment-14276327
 ] 

Apache Spark commented on SPARK-5167:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4030

 Move Row into sql package and make it usable for Java
 -

 Key: SPARK-5167
 URL: https://issues.apache.org/jira/browse/SPARK-5167
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 This will help us eliminate the duplicated Java code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-13 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276380#comment-14276380
 ] 

RJ Nowling edited comment on SPARK-4894 at 1/14/15 2:06 AM:


Hi [~lmcguire]

Always happy to have more help! :)

I started looking through the Spark NB functions but I haven't started writing 
code yet.  The docs for NB mention that using binary features will cause the 
multinomial NB to act like Bernoulli NB.  I don't believe the documentation is 
correct, at least when smoothing is used since P(0) != 1 - P(1). I was 
planning on comparing the sklearn implementation with the Spark implementation 
and showing that the docs were wrong.  Once verified, I think the changes will 
be very small to add a Bernoulli mode controlled by a flag in the constructor.

I won't get to this until next week, though.  If you have time now and want to 
tackle this, I'd be happy to hand it over to you and review any patches.  (I'm 
not a committer, though -- [~mengxr] would have to sign off.) Otherwise, if 
you want to wait until I have a patch and test it, that could work, too.  What 
do you think?


was (Author: rnowling):
Hi @lmcguire,

Always happy to have more help! :)

I started looking through the Spark NB functions but I haven't started writing 
code yet.  The docs for NB mention that using binary features will cause the 
multinomial NB to act like Bernoulli NB.  I don't believe the documentation is 
correct, at least when smoothing is used since P(0) != 1 - P(1).I was 
planning on comparing the sklearn implementation with the Spark implementation 
and showing that the docs were wrong.  Once verified, I think the changes will 
be very small to add a Bernoulli mode controlled by a flag in the constructor.

I won't get to this until next week, though.  If you have time now and want to 
tackle this, I'd be happy to hand it over to you and review any patches.  (I'm 
not a committer, though -- [~mengxr] would have to sign off.)Otherwise, if 
you want to wait until I have a patch and test it, that could work, too.  What 
do you think?

 Add Bernoulli-variant of Naive Bayes
 

 Key: SPARK-4894
 URL: https://issues.apache.org/jira/browse/SPARK-4894
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.1.1
Reporter: RJ Nowling

 MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
 version of Naive Bayes is more useful for situations where the features are 
 binary values.
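
As a quick, hedged illustration of the P(0) != 1 - P(1) point raised in the comment 
above (all numbers hypothetical, not Spark code), the two models handle a zero-valued 
binary feature differently once Laplace smoothing is applied:

{code}
// Toy comparison: multinomial NB with binary features vs. Bernoulli NB.
// A zero-valued feature contributes nothing under the multinomial model,
// but contributes (1 - p) under the Bernoulli model.
object NaiveBayesToy {
  def main(args: Array[String]): Unit = {
    val docsInClass   = 4.0
    val featureCounts = Array(3.0, 1.0) // times each feature was 1 in this class
    val lambda        = 1.0             // Laplace smoothing

    // Multinomial: theta_i = (count_i + lambda) / (sum(counts) + lambda * numFeatures)
    val total = featureCounts.sum
    val theta = featureCounts.map(c => (c + lambda) / (total + lambda * featureCounts.length))

    // Bernoulli: p_i = (count_i + lambda) / (docsInClass + 2 * lambda)
    val p = featureCounts.map(c => (c + lambda) / (docsInClass + 2 * lambda))

    // Likelihood of the binary vector x = (0, 1) under each model:
    val multinomialLikelihood = theta(1)          // the zero feature is simply skipped
    val bernoulliLikelihood   = (1 - p(0)) * p(1) // the zero feature still contributes

    println(s"multinomial: $multinomialLikelihood, bernoulli: $bernoulliLikelihood")
  }
}
{code}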



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause

2015-01-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276278#comment-14276278
 ] 

Cheng Lian commented on SPARK-4296:
---

Yeah, I think whenever we use expressions that are not {{NamedExpression}} in 
GROUP BY, this issue may be triggered, because an intermediate alias is 
introduced during the analysis phase. That's why I tried to fix all similar aliases 
in PR #3910 (but failed).

 Throw Expression not in GROUP BY when using same expression in group by 
 clause and  select clause
 ---

 Key: SPARK-4296
 URL: https://issues.apache.org/jira/browse/SPARK-4296
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1, 1.2.0
Reporter: Shixiong Zhu
Assignee: Cheng Lian
Priority: Blocker

 When the input data has a complex structure, using the same expression in the 
 GROUP BY clause and the SELECT clause will throw "Expression not in GROUP BY".
 {code:java}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Birthday(date: String)
 case class Person(name: String, birthday: Birthday)
 val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")),
   Person("Jim", Birthday("1980-02-28"))))
 people.registerTempTable("people")
 val year = sqlContext.sql(
   "select count(*), upper(birthday.date) from people group by upper(birthday.date)")
 year.collect
 {code}
 Here is the plan of year:
 {code:java}
 SchemaRDD[3] at RDD at SchemaRDD.scala:105
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
 not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
 Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date 
 AS date#9) AS c1#3]
  Subquery people
   LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at 
 ExistingRDD.scala:36
 {code}
 The bug is in the equality test between `Upper(birthday#1.date)` and 
 `Upper(birthday#1.date AS date#9)`.
 Maybe Spark SQL needs a mechanism to compare Alias expressions with non-Alias 
 expressions.
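
A hedged sketch of one possible direction (illustrative only; this is not the fix 
attempted in PR #3910, and it assumes Catalyst's Alias and TreeNode.transform are 
available under these names):

{code}
// Compare a SELECT-list expression against the GROUP BY expressions while
// ignoring any Alias wrappers the analyzer may have introduced.
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression}

def stripAliases(e: Expression): Expression = e.transform {
  case Alias(child, _) => child
}

def coveredByGroupBy(expr: Expression, groupingExprs: Seq[Expression]): Boolean =
  groupingExprs.exists(g => stripAliases(g) == stripAliases(expr))
{code}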



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5233) Error replay of WAL when recovered from driver failure

2015-01-13 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-5233:
--

 Summary: Error replay of WAL when recovered from driver failure
 Key: SPARK-5233
 URL: https://issues.apache.org/jira/browse/SPARK-5233
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Saisai Shao


Spark Streaming writes all the events into the WAL for driver recovery; the 
sequence in the WAL may look like this:

{code}

BlockAdditionEvent --- BlockAdditionEvent --- BlockAdditionEvent --- 
BatchAllocationEvent --- BatchCleanupEvent --- BlockAdditionEvent --- 
BlockAdditionEvent --- 'Driver Down Time' --- BlockAdditionEvent --- 
BlockAdditionEvent --- BatchAllocationEvent

{code}

When the driver recovers from a failure, it replays all the existing metadata WAL 
to rebuild the right status. In this situation, the two BlockAdditionEvents from 
before the failure are put into the received block queue. After the driver starts, 
newly incoming blocks are also put into this queue, and a follow-up 
BatchAllocationEvent allocates an allocatedBlocks by draining the whole queue. So 
old data that does not belong to this batch is also mixed into it; here is the 
partial log:

{code}
15/01/13 17:19:10 INFO KafkaInputDStream: block store result for batch 
142114075 ms

15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
   53201,46704,480)
197757 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
   53201,47188,480)
197758 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
   53201,47672,480)
197759 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
   53201,48156,480) 
 
197760 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
   53201,48640,480)
197761 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
   53201,49124,480)
197762 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
   07074,0,44184)
197763 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
   07074,44188,58536)
197764 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
   07074,102728,60168)
197765 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
   07074,162900,64584)
197766 15/01/13 17:19:10 INFO KafkaInputDStream: log segment: 
WriteAheadLogFileSegment(file:   
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140747074-14211408
   07074,227488,51240)
{code}

The old log 
/home/jerryshao/project/apache-spark/checkpoint-wal-test/receivedData/0/log-1421140593201-14211406
 is obviously far older than the current batch interval, and it will be fetched 
again and added for processing.

This issue was hidden before, because the previous code never deleted the old 
received-data WAL. As far as I know, this will lead to unwanted results.

Basically, we miss some BatchAllocationEvents when recovering from failure. I 
think we need to replay and apply all the events correctly.
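
A hedged sketch of the recovery idea (event and method names here are illustrative, 
not the actual ReceivedBlockTracker API): while replaying the WAL, only block 
additions that were never covered by a later BatchAllocationEvent should remain 
eligible for the next batch.

{code}
sealed trait TrackerEvent
case class BlockAdditionEvent(blockId: String) extends TrackerEvent
case class BatchAllocationEvent(batchTime: Long, blockIds: Seq[String]) extends TrackerEvent
case class BatchCleanupEvent(batchTimes: Seq[Long]) extends TrackerEvent

// Returns the blocks that are still unallocated after replaying the WAL.
def replay(events: Seq[TrackerEvent]): Seq[String] = {
  val unallocated = scala.collection.mutable.LinkedHashSet.empty[String]
  events.foreach {
    case BlockAdditionEvent(id)       => unallocated += id
    case BatchAllocationEvent(_, ids) => unallocated --= ids // already assigned to a batch
    case BatchCleanupEvent(_)         => ()                  // cleanup does not affect pending blocks
  }
  unallocated.toSeq
}
{code}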



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-01-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276411#comment-14276411
 ] 

Nicholas Chammas commented on SPARK-3821:
-

Hi [~florianverhein] and thanks for chiming in!

{quote}
Re the above, I think everything in create_image.sh can be refactored to packer 
(+ duplicate removal - e.g. root login).
{quote}

Definitely. I'm hoping to make as few changes as possible to the existing 
{{create_image.sh}} script to reduce the review burden, but after this initial 
proposal is accepted it makes sense to refactor these scripts. There is some 
related work proposed in [SPARK-5189].

Some of the things you call out regarding version mismatches and whatnot sound 
like they might merit their own JIRA issues.

For example:

{quote}
It looks like Spark needs to be built with the right hadoop profile to work, 
but this isn't adhered to. 
{quote}

I haven't tested this out, but from the Spark init script, it looks like the 
correct version of Spark is used in [the pre-built 
scenario|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/init.sh#L109].
 Not so in the [build-from-git 
scenario|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/init.sh#L21],
 so nice catch. Could you file a JIRA issue for that?

{quote}
For example, I see no reason why the module init.sh scripts can't be run from 
packer in order to speed start-up times of the cluster
{quote}

Regarding this and other ideas regarding pre-baking more on the images, [that's 
how this proposal started, 
actually|https://github.com/nchammas/spark-ec2/blob/9c28878694171ba085a10acd4405c702397d28ce/packer/README.md#base-vs-spark-pre-installed]
 (here's the [original Packer 
template|https://github.com/nchammas/spark-ec2/blob/9c28878694171ba085a10acd4405c702397d28ce/packer/spark-packer.json#L118-L133]).
 We decided to rip that out to reduce the complexity of the initial proposal 
and make it easier to specify different versions of Spark and Hadoop at launch 
time.

 Develop an automated way of creating Spark images (AMI, Docker, and others)
 ---

 Key: SPARK-3821
 URL: https://issues.apache.org/jira/browse/SPARK-3821
 Project: Spark
  Issue Type: Improvement
  Components: Build, EC2
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
 Attachments: packer-proposal.html


 Right now the creation of Spark AMIs or Docker containers is done manually. 
 With tools like [Packer|http://www.packer.io/], we should be able to automate 
 this work, and do so in such a way that multiple types of machine images can 
 be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5213) Support the SQL Parser Registry

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274908#comment-14274908
 ] 

Apache Spark commented on SPARK-5213:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/4015

 Support the SQL Parser Registry
 ---

 Key: SPARK-5213
 URL: https://issues.apache.org/jira/browse/SPARK-5213
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao

 Currently, the SQL parser dialect is hard-coded in SQLContext, which is not 
 easy to extend. We need to provide a SQL parser dialect factory util to 
 manage dialects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5213) Support the SQL Parser Registry

2015-01-13 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-5213:


 Summary: Support the SQL Parser Registry
 Key: SPARK-5213
 URL: https://issues.apache.org/jira/browse/SPARK-5213
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao


Currently, the SQL parser dialect is hard-coded in SQLContext, which is not easy 
to extend. We need to provide a SQL parser dialect factory util to manage dialects.
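
A minimal Scala sketch of what such a registry could look like (purely illustrative; 
not the API proposed in PR #4015):

{code}
import scala.collection.concurrent.TrieMap

// A dialect turns a SQL string into some parsed representation
// (a LogicalPlan in Spark SQL; AnyRef here to keep the sketch self-contained).
trait SQLDialect {
  def parse(sql: String): AnyRef
}

object SQLParserRegistry {
  private val dialects = TrieMap.empty[String, () => SQLDialect]

  def register(name: String, factory: () => SQLDialect): Unit =
    dialects.put(name, factory)

  def lookup(name: String): SQLDialect =
    dialects.getOrElse(name, sys.error(s"Unknown SQL dialect: $name"))()
}
{code}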



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1507) Spark on Yarn: Add support for user to specify # cores for ApplicationMaster

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275244#comment-14275244
 ] 

Apache Spark commented on SPARK-1507:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/4018

 Spark on Yarn: Add support for user to specify # cores for ApplicationMaster
 

 Key: SPARK-1507
 URL: https://issues.apache.org/jira/browse/SPARK-1507
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves

 Now that Hadoop 2.x can schedule cores as a resource we should allow the user 
 to specify the # of cores for the ApplicationMaster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5219) Race condition in TaskSchedulerImpl and TaskSetManager

2015-01-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-5219:
---

 Summary: Race condition in TaskSchedulerImpl and TaskSetManager
 Key: SPARK-5219
 URL: https://issues.apache.org/jira/browse/SPARK-5219
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu


TaskSchedulerImpl.handleTaskGettingResult, TaskSetManager.canFetchMoreResults 
and TaskSetManager.abort will access variables which are used in multiple 
threads, but they don't use a correct lock.
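
As a hedged sketch of the general pattern the fix needs (illustrative only, not the 
actual TaskSetManager code), the shared state should be read and written under the 
same lock:

{code}
// Guard the shared result-size bookkeeping with a single lock on both paths.
class ResultSizeTracker(maxResultSize: Long) {
  private val lock = new Object
  private var totalResultSize = 0L

  // Called when a task result arrives.
  def recordResult(size: Long): Unit = lock.synchronized {
    totalResultSize += size
  }

  // Checked before fetching another result; synchronizes on the same lock.
  def canFetchMoreResults(nextSize: Long): Boolean = lock.synchronized {
    maxResultSize <= 0 || totalResultSize + nextSize <= maxResultSize
  }
}
{code}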



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5219) Race condition in TaskSchedulerImpl and TaskSetManager

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275284#comment-14275284
 ] 

Apache Spark commented on SPARK-5219:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/4019

 Race condition in TaskSchedulerImpl and TaskSetManager
 --

 Key: SPARK-5219
 URL: https://issues.apache.org/jira/browse/SPARK-5219
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu

 TaskSchedulerImpl.handleTaskGettingResult, TaskSetManager.canFetchMoreResults 
 and TaskSetManager.abort will access variables which are used in multiple 
 threads, but they don't use a correct lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5215) concat support in sqlcontext

2015-01-13 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-5215:
--

 Summary: concat support in sqlcontext
 Key: SPARK-5215
 URL: https://issues.apache.org/jira/browse/SPARK-5215
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang


Define concat() following the rules in
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
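
One relevant detail of the Hive semantics (a hedged Scala sketch of the evaluation 
rule only, not the Catalyst expression added in PR #4017): concat returns NULL if 
any argument is NULL, otherwise the concatenation of its arguments.

{code}
// Illustrative evaluation rule for Hive-style concat.
def hiveConcat(args: Seq[Any]): Any =
  if (args.exists(_ == null)) null
  else args.mkString("")
{code}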



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5215) concat support in sqlcontext

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275011#comment-14275011
 ] 

Apache Spark commented on SPARK-5215:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4017

 concat support in sqlcontext
 

 Key: SPARK-5215
 URL: https://issues.apache.org/jira/browse/SPARK-5215
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang

 Define concat() following the rules in
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5218) Report per stage remaining time estimate for each stage.

2015-01-13 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-5218:
--

 Summary: Report per stage remaining time estimate for each stage.
 Key: SPARK-5218
 URL: https://issues.apache.org/jira/browse/SPARK-5218
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Web UI
Affects Versions: 1.3.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5217) Spark UI should report waiting stages during job execution.

2015-01-13 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-5217:
---
Attachment: waiting_stages.png

 Spark UI should report waiting stages during job execution.
 ---

 Key: SPARK-5217
 URL: https://issues.apache.org/jira/browse/SPARK-5217
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Web UI
Affects Versions: 1.3.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma
 Attachments: waiting_stages.png


 This is a first step. 
 The Spark listener already reports all the stages at the time of job submission, 
 of which we currently only show the active, failed and completed ones. This 
 addition has no overhead and seems straightforward to achieve.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5124) Standardize internal RPC interface

2015-01-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275014#comment-14275014
 ] 

Shixiong Zhu commented on SPARK-5124:
-

For 1), I prefer to finish it before this JIRA. See 
[#4016|https://github.com/apache/spark/pull/4016].

For 2), I will write some prototype code to see if the current API design is 
sufficient.

 Standardize internal RPC interface
 --

 Key: SPARK-5124
 URL: https://issues.apache.org/jira/browse/SPARK-5124
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu
 Attachments: Pluggable RPC - draft 1.pdf


 In Spark we use Akka as the RPC layer. It would be great if we can 
 standardize the internal RPC interface to facilitate testing. This will also 
 provide the foundation to try other RPC implementations in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5214) Add EventLoop and change DAGScheduler to an EventLoop

2015-01-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-5214:
---

 Summary: Add EventLoop and change DAGScheduler to an EventLoop
 Key: SPARK-5214
 URL: https://issues.apache.org/jira/browse/SPARK-5214
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu


As per the discussion in SPARK-5124, DAGScheduler can simply use a queue-based 
event loop to process events. It would also be helpful when we want to decouple 
from Akka in the future.
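
A minimal sketch of the queue-plus-thread idea (illustrative; not the EventLoop 
class from PR #4016):

{code}
import java.util.concurrent.LinkedBlockingQueue

// Events are posted to a queue and handled one at a time on a dedicated thread.
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var stopped = false

  private val thread = new Thread(name) {
    override def run(): Unit =
      try {
        while (!stopped) onReceive(queue.take())
      } catch {
        case _: InterruptedException => // interrupted by stop()
      }
  }

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true; thread.interrupt() }
  def post(event: E): Unit = queue.put(event)

  protected def onReceive(event: E): Unit
}
{code}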



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5216) Spark Ui should report estimated time remaining for each stage.

2015-01-13 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-5216:
---
Description: 
Per stage feedback on estimated remaining time can help user get a grasp on how 
much time the job is going to take. This will only require changes on the 
UI/JobProgressListener side of code since we already have most of the 
information needed. 

In the initial cut, plan is to estimate time based on statistics of running job 
i.e. average time taken by each task and number of task per stage. This will 
makes sense when jobs are long. And then if this makes sense, then more 
heuristics can be added like projected time saved if the rdd is cached and so 
on. 

More precise details will come as this evolves. In the meantime thoughts on 
alternate ways and suggestion on usefulness are welcome.

 Spark Ui should report estimated time remaining for each stage.
 ---

 Key: SPARK-5216
 URL: https://issues.apache.org/jira/browse/SPARK-5216
 Project: Spark
  Issue Type: Wish
  Components: Spark Core, Web UI
Affects Versions: 1.3.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma

 Per-stage feedback on estimated remaining time can help users get a grasp on 
 how much time the job is going to take. This only requires changes on the 
 UI/JobProgressListener side of the code, since we already have most of the 
 information needed. 
 In the initial cut, the plan is to estimate time based on statistics of the 
 running job, i.e. the average time taken by each task and the number of tasks 
 per stage. This makes sense when jobs are long. If that works out, more 
 heuristics can be added, like the projected time saved if the RDD is cached, 
 and so on. 
 More precise details will come as this evolves. In the meantime, thoughts on 
 alternative approaches and suggestions on usefulness are welcome.
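
A hedged sketch of the first-cut arithmetic described above (names are illustrative, 
not the JobProgressListener API):

{code}
// Extrapolate from the average time of the tasks that have already completed.
def estimatedRemainingMs(totalTasks: Int,
                         completedTasks: Int,
                         totalCompletedTaskTimeMs: Long): Option[Long] = {
  if (completedTasks <= 0) None // nothing to extrapolate from yet
  else {
    val avgTaskTimeMs = totalCompletedTaskTimeMs.toDouble / completedTasks
    Some((avgTaskTimeMs * (totalTasks - completedTasks)).toLong)
  }
}
{code}

This ignores parallelism across executors, so it is at best an upper-bound-style 
estimate; dividing by the number of concurrently running tasks would be a natural 
refinement.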



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5214) Add EventLoop and change DAGScheduler to an EventLoop

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275007#comment-14275007
 ] 

Apache Spark commented on SPARK-5214:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/4016

 Add EventLoop and change DAGScheduler to an EventLoop
 -

 Key: SPARK-5214
 URL: https://issues.apache.org/jira/browse/SPARK-5214
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu

 As per the discussion in SPARK-5124, DAGScheduler can simply use a queue-based 
 event loop to process events. It would also be helpful when we want to decouple 
 from Akka in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5216) Spark Ui should report estimated time remaining for each stage.

2015-01-13 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-5216:
--

 Summary: Spark Ui should report estimated time remaining for each 
stage.
 Key: SPARK-5216
 URL: https://issues.apache.org/jira/browse/SPARK-5216
 Project: Spark
  Issue Type: Wish
  Components: Spark Core, Web UI
Affects Versions: 1.3.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5220) keepPushingBlocks in BlockGenerator terminated when an exception occurs, which causes the block pushing thread to terminate and blocks receiver

2015-01-13 Thread Max Xu (JIRA)
Max Xu created SPARK-5220:
-

 Summary: keepPushingBlocks in BlockGenerator terminated when an 
exception occurs, which causes the block pushing thread to terminate and blocks 
receiver  
 Key: SPARK-5220
 URL: https://issues.apache.org/jira/browse/SPARK-5220
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Max Xu


I am running a Spark Streaming application with ReliableKafkaReceiver. It uses 
BlockGenerator to push blocks to the BlockManager. However, writing WALs to HDFS 
may time out, which causes keepPushingBlocks in BlockGenerator to terminate.

15/01/12 19:07:06 ERROR receiver.BlockGenerator: Error in block pushing thread
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:176)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:160)
at 
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:126)
at 
org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:124)
at 
org.apache.spark.streaming.kafka.ReliableKafkaReceiver.org$apache$spark$streaming$kafka$ReliableKafkaReceiver$$storeBlockAndCommitOffset(ReliableKafkaReceiver.scala:207)
at 
org.apache.spark.streaming.kafka.ReliableKafkaReceiver$GeneratedBlockHandler.onPushBlock(ReliableKafkaReceiver.scala:275)
at 
org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:181)
at 
org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:154)
at 
org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:86)

Then the block pushing thread is done and no subsequent blocks can be pushed 
into the BlockManager. In turn this blocks the receiver from receiving new data.

So when the TimeoutException happens while my app is running, the 
ReliableKafkaReceiver stays in ACTIVE status but doesn't do anything at all. 
The application effectively hangs.
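
A hedged sketch of one possible mitigation (illustrative only, not the actual 
BlockGenerator code, and it sidesteps the question of whether a failed block should 
be retried or reported back to the driver): catch per-block failures so a single 
timeout does not kill the pushing thread.

{code}
import scala.util.control.NonFatal

// Keep the pushing loop alive across individual block failures.
def keepPushingBlocks[T](blocks: Iterator[Seq[T]], push: Seq[T] => Unit): Unit = {
  blocks.foreach { block =>
    try {
      push(block)
    } catch {
      case NonFatal(e) =>
        System.err.println(s"Failed to push block, continuing: $e")
    }
  }
}
{code}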



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4794) Wrong parse of GROUP BY query

2015-01-13 Thread Damien Carol (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275464#comment-14275464
 ] 

Damien Carol commented on SPARK-4794:
-

[~marmbrus] Sorry for the late answer.

For the record, I've been testing this query on every commit (trunk branch on git) 
without success since I created this ticket.

Here are the details (EXPLAIN query):
{noformat}
explain [...]
{noformat}

{noformat}
== Physical Plan ==
Project [Annee#3676,Mois#3677,Jour#3678,Heure#3679,Societe#3680,Magasin#3681,CF 
Presentee#3682,CompteCarteFidelite#3683,NbCompteCarteFidelite#3684,DetentionCF#3685,NbCarteFidelite#3686,PlageDUCB#3687,NbCheque#3688L,CACheque#3689,NbImpaye#3690,NbEnsemble#3691L,NbCompte#3692,ResteDuImpaye#3693]
 !Sort [annee#3695 ASC,mois#3696 ASC,jour#3697 ASC,heure#3698 
ASC,nom_societe#3699 ASC,id_magasin#3700 ASC,CarteFidelitePresentee#3702 
ASC,CompteCarteFidelite#3705 ASC,NbCompteCarteFidelite#3706 
ASC,DetentionCF#3703 ASC,NbCarteFidelite#3704 ASC,Id_CF_Dim_DUCB#3707 ASC], true
  !Exchange (RangePartitioning [annee#3695 ASC,mois#3696 ASC,jour#3697 
ASC,heure#3698 ASC,nom_societe#3699 ASC,id_magasin#3700 
ASC,CarteFidelitePresentee#3702 ASC,CompteCarteFidelite#3705 
ASC,NbCompteCarteFidelite#3706 ASC,DetentionCF#3703 ASC,NbCarteFidelite#3704 
ASC,Id_CF_Dim_DUCB#3707 ASC], 200)
   !OutputFaker 
[Annee#3676,Mois#3677,Jour#3678,Heure#3679,Societe#3680,Magasin#3681,CF 
Presentee#3682,CompteCarteFidelite#3683,NbCompteCarteFidelite#3684,DetentionCF#3685,NbCarteFidelite#3686,PlageDUCB#3687,NbCheque#3688L,CACheque#3689,NbImpaye#3690,NbEnsemble#3691L,NbCompte#3692,ResteDuImpaye#3693,Mois#3677,Annee#3676,Jour#3678,id_magasin#3700,DetentionCF#3685,CompteCarteFidelite#3683,nom_societe#3699,NbCarteFidelite#3686,NbCompteCarteFidelite#3684,CarteFidelitePresentee#3702,Id_CF_Dim_DUCB#3707,Heure#3679]
Project [annee#3715 AS Annee#3676,mois#3716 AS Mois#3677,jour#3717 AS 
Jour#3678,heure#3718 AS Heure#3679,nom_societe#3719 AS 
Societe#3680,id_magasin#3720 AS Magasin#3681,CarteFidelitePresentee#3722 AS CF 
Presentee#3682,CompteCarteFidelite#3725 AS 
CompteCarteFidelite#3683,NbCompteCarteFidelite#3726 AS 
NbCompteCarteFidelite#3684,DetentionCF#3723 AS 
DetentionCF#3685,NbCarteFidelite#3724 AS 
NbCarteFidelite#3686,Id_CF_Dim_DUCB#3727 AS PlageDUCB#3687,NbCheque#3729L AS 
NbCheque#3688L,CACheque#3730 AS CACheque#3689,NbImpaye#3731 AS 
NbImpaye#3690,Id_Ensemble#3732L AS NbEnsemble#3691L,ZIBZIN#3734 AS 
NbCompte#3692,ResteDuImpaye#3733 AS 
ResteDuImpaye#3693,mois#3716,annee#3715,jour#3717,id_magasin#3720,DetentionCF#3723,CompteCarteFidelite#3725,nom_societe#3719,NbCarteFidelite#3724,NbCompteCarteFidelite#3726,CarteFidelitePresentee#3722,Id_CF_Dim_DUCB#3727,heure#3718]
 Filter ((((annee#3715 = 2014) && (mois#3716 = 1)) && (jour#3717 = 25)) && 
(id_magasin#3720 = 649))
  ParquetTableScan 
[Id_CF_Dim_DUCB#3727,ResteDuImpaye#3733,NbCarteFidelite#3724,heure#3718,mois#3716,CompteCarteFidelite#3725,annee#3715,CarteFidelitePresentee#3722,CACheque#3730,NbImpaye#3731,ZIBZIN#3734,NbCompteCarteFidelite#3726,DetentionCF#3723,id_magasin#3720,nom_societe#3719,Id_Ensemble#3732L,jour#3717,NbCheque#3729L],
 (ParquetRelation 
hdfs://nc-h07/user/hive/warehouse/testsimon3.db/cf_encaissement_fact_pq, 
Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, 
mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, 
hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@7db3bcc, []), []
{noformat}

Complete stack trace :
{noformat}


15/01/13 17:10:32 INFO SparkExecuteStatementOperation: Running query ' CACHE 
TABLE A_1421165432909 AS select
`cf_encaissement_fact_pq`.`annee` as `Annee`,
`cf_encaissement_fact_pq`.`mois` as `Mois`,
`cf_encaissement_fact_pq`.`jour` as `Jour`,
`cf_encaissement_fact_pq`.`heure` as `Heure`,
`cf_encaissement_fact_pq`.`nom_societe` as `Societe`,
`cf_encaissement_fact_pq`.`id_magasin` as `Magasin`,
`cf_encaissement_fact_pq`.`CarteFidelitePresentee` as `CF Presentee`,
`cf_encaissement_fact_pq`.`CompteCarteFidelite` as `CompteCarteFidelite`,
`cf_encaissement_fact_pq`.`NbCompteCarteFidelite` as 
`NbCompteCarteFidelite`,
`cf_encaissement_fact_pq`.`DetentionCF` as `DetentionCF`,
`cf_encaissement_fact_pq`.`NbCarteFidelite` as `NbCarteFidelite`,
`cf_encaissement_fact_pq`.`Id_CF_Dim_DUCB` as `PlageDUCB`,
`cf_encaissement_fact_pq`.`NbCheque` as `NbCheque`,
`cf_encaissement_fact_pq`.`CACheque` as `CACheque`,
`cf_encaissement_fact_pq`.`NbImpaye` as `NbImpaye`,
`cf_encaissement_fact_pq`.`Id_Ensemble` as `NbEnsemble`,
`cf_encaissement_fact_pq`.`ZIBZIN` as `NbCompte`,
`cf_encaissement_fact_pq`.`ResteDuImpaye` as `ResteDuImpaye`
from
`testsimon3`.`cf_encaissement_fact_pq` as `cf_encaissement_fact_pq`
where
`cf_encaissement_fact_pq`.`annee` = 2014
and
`cf_encaissement_fact_pq`.`mois` = 1
and
`cf_encaissement_fact_pq`.`jour` = 25
and
 

[jira] [Created] (SPARK-5221) FileInputDStream remember window in certain situations causes files to be ignored

2015-01-13 Thread Jem Tucker (JIRA)
Jem Tucker created SPARK-5221:
-

 Summary: FileInputDStream remember window in certain situations 
causes files to be ignored 
 Key: SPARK-5221
 URL: https://issues.apache.org/jira/browse/SPARK-5221
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0, 1.1.1
Reporter: Jem Tucker
Priority: Minor


When batch times are greater than 1 minute, a file may begin to be moved into a 
directory just before FileInputDStream.findNewFiles() is called but only become 
visible after it has executed, and therefore not be included in that batch. The 
file is then ignored in the following batch as well, because its mod time is 
less than the modTimeIgnoreThreshold. This causes Spark Streaming to ignore data 
that it shouldn't, especially when large files are being moved into the 
directory.
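
A hedged sketch of the selection window involved (simplified; the real 
FileInputDStream logic also tracks recently selected files):

{code}
// A file is only picked up if its modification time falls inside the current
// remember window; a file that appears just after one scan and whose mod time
// is just before the next scan's threshold is never selected.
def isSelected(modTime: Long, currentTime: Long, rememberWindowMs: Long): Boolean = {
  val modTimeIgnoreThreshold = currentTime - rememberWindowMs
  modTime > modTimeIgnoreThreshold && modTime <= currentTime
}
{code}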



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4796) Spark does not remove temp files

2015-01-13 Thread Fabian Gebert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275494#comment-14275494
 ] 

Fabian Gebert commented on SPARK-4796:
--

suffering from this issue as well and can't see any workaround

 Spark does not remove temp files
 

 Key: SPARK-4796
 URL: https://issues.apache.org/jira/browse/SPARK-4796
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.1.0
 Environment: I'm running Spark on Mesos and the Mesos slaves are Docker 
 containers. Spark 1.1.0, elasticsearch spark 2.1.0-Beta3, mesos 0.20.0, 
 docker 1.2.0.
Reporter: Ian Babrou

 I started a job that cannot fit into memory and got "no space left on 
 device". That was fair, because the Docker containers only have 10gb of disk 
 space and some is already taken by the OS.
 But then I found out that when the job failed it didn't release any disk space 
 and left the container without any free disk space.
 Then I decided to check whether Spark removes temp files at all, because many 
 Mesos slaves had /tmp/spark-local-*. Apparently some garbage stays after a 
 Spark task is finished. I attached strace to a running job:
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/12/temp_8a73fcc2-4baa-499a-8add-0161f918de8a)
  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/31/temp_47efd04b-d427-4139-8f48-3d5d421e9be4)
  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/15/temp_619a46dc-40de-43f1-a844-4db146a607c6)
  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/05/temp_d97d90a7-8bc1-4742-ba9b-41d74ea73c36
  unfinished ...
 [pid 30212] ... unlink resumed )  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/36/temp_a2deb806-714a-457a-90c8-5d9f3247a5d7)
  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/04/temp_afd558f1-2fd0-48d7-bc65-07b5f4455b22)
  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/32/temp_a7add910-8dc3-482c-baf5-09d5a187c62a
  unfinished ...
 [pid 30212] ... unlink resumed )  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/21/temp_485612f0-527f-47b0-bb8b-6016f3b9ec19)
  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/12/temp_bb2b4e06-a9dd-408e-8395-f6c5f4e2d52f)
  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/1e/temp_825293c6-9d3b-4451-9cb8-91e2abe5a19d
  unfinished ...
 [pid 30212] ... unlink resumed )  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/15/temp_43fbb94c-9163-4aa7-ab83-e7693b9f21fc)
  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/3d/temp_37f3629c-1b09-4907-b599-61b7df94b898
  unfinished ...
 [pid 30212] ... unlink resumed )  = 0
 [pid 30212] 
 unlink(/tmp/spark-local-20141209091330-48b5/35/temp_d18f49f6-1fb1-4c01-a694-0ee0a72294c0)
  = 0
 And after job is finished, some files are still there:
 /tmp/spark-local-20141209091330-48b5/
 /tmp/spark-local-20141209091330-48b5/11
 /tmp/spark-local-20141209091330-48b5/11/shuffle_0_1_4
 /tmp/spark-local-20141209091330-48b5/32
 /tmp/spark-local-20141209091330-48b5/04
 /tmp/spark-local-20141209091330-48b5/05
 /tmp/spark-local-20141209091330-48b5/0f
 /tmp/spark-local-20141209091330-48b5/0f/shuffle_0_1_2
 /tmp/spark-local-20141209091330-48b5/3d
 /tmp/spark-local-20141209091330-48b5/0e
 /tmp/spark-local-20141209091330-48b5/0e/shuffle_0_1_1
 /tmp/spark-local-20141209091330-48b5/15
 /tmp/spark-local-20141209091330-48b5/0d
 /tmp/spark-local-20141209091330-48b5/0d/shuffle_0_1_0
 /tmp/spark-local-20141209091330-48b5/36
 /tmp/spark-local-20141209091330-48b5/31
 /tmp/spark-local-20141209091330-48b5/12
 /tmp/spark-local-20141209091330-48b5/21
 /tmp/spark-local-20141209091330-48b5/10
 /tmp/spark-local-20141209091330-48b5/10/shuffle_0_1_3
 /tmp/spark-local-20141209091330-48b5/1e
 /tmp/spark-local-20141209091330-48b5/35
 If I look into my mesos slaves, there are mostly shuffle files, overall 
 picture for single node:
 root@web338:~# find /tmp/spark-local-20141* -type f | fgrep shuffle | wc -l
 781
 root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle | wc -l
 10
 root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle
 /tmp/spark-local-20141119144512-67c4/2d/temp_9056f380-3edb-48d6-a7df-d4896f1e1cc3
 /tmp/spark-local-20141119144512-67c4/3d/temp_e005659b-eddf-4a34-947f-4f63fcddf111
 /tmp/spark-local-20141119144512-67c4/16/temp_71eba702-36b4-4e1a-aebc-20d2080f1705
 /tmp/spark-local-20141119144512-67c4/0d/temp_8037b9db-2d8a-4786-a554-a8cad922bf5e
 /tmp/spark-local-20141119144512-67c4/24/temp_f0e4cc43-6cc9-42a7-882d-f8a031fa4dc3
 /tmp/spark-local-20141119144512-67c4/29/temp_a8bbe2cb-f590-4b71-8ef8-9c0324beddc7
 /tmp/spark-local-20141119144512-67c4/3a/temp_9fc08519-f23a-40ac-a3fd-e58df6871460
 

[jira] [Comment Edited] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-01-13 Thread Zach Fry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272291#comment-14272291
 ] 

Zach Fry edited comment on SPARK-4879 at 1/13/15 7:51 PM:
--

Hey Josh,

I was able to reproduce the missing file using the speculation settings in my 
previous comment:

{code}
scala 15/01/09 18:33:28 WARN scheduler.TaskSetManager: Lost task 42.1 in stage 
0.0 (TID 113, redacted-03): java.io.IOException: Failed to save output of 
task: attempt_201501091833__m_42_113

org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)

org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)

org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
15/01/09 18:33:47 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
(TID 0, redacted-03): java.io.IOException: The temporary job-output directory 
hdfs://redacted-01:8020/test2/_temporary doesn't exist!

org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)

org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:240)

org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:89)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:980)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

Notice here that there are only 99 part files and that part-00042 is missing (as 
seen in the stack trace above).
{code}
 $ hadoop fs -ls /test2 | grep part | wc -l
99
palantir@pd-support-01 (/home/palantir/homes/zfry) (master)
 $ hadoop fs -ls /test2 | grep part-0004
-rw-r--r--   3 palantir supergroup  8 2015-01-09 18:33 /test2/part-00040
-rw-r--r--   3 palantir supergroup  8 2015-01-09 18:33 /test2/part-00041
-rw-r--r--   3 palantir supergroup  8 2015-01-09 18:33 /test2/part-00043
-rw-r--r--   3 palantir supergroup  8 2015-01-09 18:33 /test2/part-00044
-rw-r--r--   3 palantir supergroup  8 2015-01-09 18:33 /test2/part-00045
-rw-r--r--   3 palantir supergroup  8 2015-01-09 18:33 /test2/part-00046
-rw-r--r--   3 palantir supergroup  8 2015-01-09 18:33 /test2/part-00047
-rw-r--r--   3 palantir supergroup  8 2015-01-09 18:33 /test2/part-00048
-rw-r--r--   3 palantir supergroup  8 2015-01-09 18:33 /test2/part-00049
{code}
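
A quick way to confirm this kind of gap is to scan the output directory for missing part-NNNNN indices. Below is a minimal Scala sketch using the Hadoop FileSystem API; the output path and the expected partition count are assumptions taken from the run above and would need to be adjusted.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object MissingPartCheck {
  def main(args: Array[String]): Unit = {
    // Illustrative assumptions: adjust the output path and the number of
    // partitions the job was expected to write.
    val outputDir = new Path("hdfs://redacted-01:8020/test2")
    val expectedParts = 100

    val fs = FileSystem.get(outputDir.toUri, new Configuration())

    // Collect the numeric suffixes of all part-NNNNN files that are present.
    val present = fs.listStatus(outputDir)
      .map(_.getPath.getName)
      .filter(_.startsWith("part-"))
      .map(_.stripPrefix("part-").toInt)
      .toSet

    // Report any expected partition index that has no corresponding file.
    val missing = (0 until expectedParts).filterNot(present.contains)
    println(s"missing partitions: ${missing.mkString(", ")}")
  }
}
{code}

Run against the listing above, this would report partition 42 as missing.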




was (Author: zfry):
Hey Josh,

I was able to reproduce the missing file using the speculation settings in my 
previous comment:

{code}
scala 15/01/09 18:33:28 WARN scheduler.TaskSetManager: Lost task 42.1 in stage 
0.0 (TID 113, redacted-03): java.io.IOException: Failed to save output of 
task: attempt_201501091833__m_42_113

org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)

org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)

org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)


[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode

2015-01-13 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276094#comment-14276094
 ] 

Timothy Chen commented on SPARK-5095:
-

[~joshdevins][~maasg]
I have a PR out now, I wonder if you guys can try it? 
https://github.com/apache/spark/pull/4027

 Support launching multiple mesos executors in coarse grained mesos mode
 ---

 Key: SPARK-5095
 URL: https://issues.apache.org/jira/browse/SPARK-5095
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Currently in coarse-grained Mesos mode, it's expected that we only launch one 
 Mesos executor, which launches one JVM process to run multiple Spark 
 executors.
 However, this becomes a problem when the launched JVM process is larger than 
 an ideal size (30 GB is the recommended value from Databricks), which causes the GC 
 problems reported on the mailing list.
 We should support launching multiple executors when large enough resources 
 are available for Spark to use and these resources are still under the 
 configured limit.
 This is also applicable when users want to specify the number of executors to be 
 launched on each node.
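
For reference, this is roughly how a coarse-grained Mesos application is configured today; the master URL, memory, and core values below are illustrative assumptions, not values taken from this issue. Until multiple executors per node are supported, all cores obtained on a node go to a single JVM whose heap is spark.executor.memory.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of today's coarse-grained Mesos setup (values are made up).
val conf = new SparkConf()
  .setMaster("mesos://mesos-master:5050")   // assumed Mesos master URL
  .setAppName("coarse-grained-example")
  .set("spark.mesos.coarse", "true")        // one long-running executor JVM per node
  .set("spark.executor.memory", "28g")      // keep the single JVM heap below ~30 GB
  .set("spark.cores.max", "64")             // cap the total cores acquired cluster-wide

val sc = new SparkContext(conf)
{code}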



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5095) Support launching multiple mesos executors in coarse grained mesos mode

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276093#comment-14276093
 ] 

Apache Spark commented on SPARK-5095:
-

User 'tnachen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4027

 Support launching multiple mesos executors in coarse grained mesos mode
 ---

 Key: SPARK-5095
 URL: https://issues.apache.org/jira/browse/SPARK-5095
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Currently in coarse-grained Mesos mode, it's expected that we only launch one 
 Mesos executor, which launches one JVM process to run multiple Spark 
 executors.
 However, this becomes a problem when the launched JVM process is larger than 
 an ideal size (30 GB is the recommended value from Databricks), which causes the GC 
 problems reported on the mailing list.
 We should support launching multiple executors when large enough resources 
 are available for Spark to use and these resources are still under the 
 configured limit.
 This is also applicable when users want to specify the number of executors to be 
 launched on each node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5228) Hide tables for Active Jobs/Completed Jobs/Failed Jobs when they are empty

2015-01-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276109#comment-14276109
 ] 

Apache Spark commented on SPARK-5228:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/4028

 Hide tables for Active Jobs/Completed Jobs/Failed Jobs when they are empty
 

 Key: SPARK-5228
 URL: https://issues.apache.org/jira/browse/SPARK-5228
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta

 In the current Web UI, the tables for Active Stages, Completed Stages, Skipped Stages 
 and Failed Stages are hidden when they are empty, while the tables for Active 
 Jobs, Completed Jobs and Failed Jobs are shown even when they are empty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API

2015-01-13 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-5223:
--
Description: 
It will introduce problems if an object inside a dict/list/tuple cannot be handled by 
py4j, such as a Vector.

Also, pickle may have better performance for larger objects (fewer RPCs).

In cases where an object in a dict/list cannot be pickled (such as a 
JavaObject), we should still use MapConvert/ListConvert.

discussion: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Python-to-Java-object-conversion-of-numpy-array-td10065.html

  was:
It will introduce problems if an object inside a dict/list/tuple cannot be handled by 
py4j, such as a Vector.

Also, pickle may have better performance for larger objects (fewer RPCs).

In cases where an object in a dict/list cannot be pickled (such as a 
JavaObject), we should still use MapConvert/ListConvert.


 Use pickle instead of MapConvert and ListConvert in MLlib Python API
 

 Key: SPARK-5223
 URL: https://issues.apache.org/jira/browse/SPARK-5223
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Reporter: Davies Liu
Priority: Critical

 It will introduce problems if an object inside a dict/list/tuple cannot be handled 
 by py4j, such as a Vector.
 Also, pickle may have better performance for larger objects (fewer RPCs).
 In cases where an object in a dict/list cannot be pickled (such as a 
 JavaObject), we should still use MapConvert/ListConvert.
 discussion: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Python-to-Java-object-conversion-of-numpy-array-td10065.html
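
To make the "fewer RPCs" point concrete, here is a rough JVM-side sketch using Pyrolite (net.razorvine.pickle), the library Spark's Python serde builds on: a whole collection crosses as one pickled byte array and is materialized in a single call, instead of being converted element by element through the py4j gateway. The payload below is purely illustrative.

{code}
import net.razorvine.pickle.{Pickler, Unpickler}

// Build one pickled payload for a whole collection (illustrative data only).
val pickler = new Pickler()
val payload: Array[Byte] = pickler.dumps(java.util.Arrays.asList(1.0, 2.0, 3.0))

// One call restores the entire structure on the JVM; no per-element round trips.
val restored = new Unpickler().loads(payload)
println(restored)
{code}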



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-13 Thread Mohit Jaggi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275960#comment-14275960
 ] 

Mohit Jaggi commented on SPARK-5097:


minor comment: mutate existing can do 
df(x) = df(x) 

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionality. However, the DSL was originally created for constructing 
 test cases, without much consideration for end-user usability and API stability. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.
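
For readers who have not used it, this is roughly how the existing SchemaRDD DSL reads. It is a sketch of the 1.2-era experimental API only: the case class, data, and column names are made up, and it assumes the SQLContext implicits are imported so the Symbol-based expressions resolve.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("dsl-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext._   // implicit RDD-to-SchemaRDD conversion and Symbol DSL helpers

val people = sc.parallelize(Seq(Person("alice", 15), Person("bob", 30)))

// Language-integrated query: equivalent to
// SELECT name FROM people WHERE age >= 13 AND age <= 19
val teenagers = people.where('age >= 13).where('age <= 19).select('name)
teenagers.collect().foreach(println)
{code}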



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-13 Thread Muhammad-Ali A'rabi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275981#comment-14275981
 ] 

Muhammad-Ali A'rabi commented on SPARK-5226:


Although I can't assign this task to myself, I am interested in working on it.

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Muhammad-Ali A'rabi
Priority: Minor
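
Since the issue body is empty, a plain single-machine sketch of what DBSCAN computes (density-based expansion of eps-neighbourhoods) is included below as a reference point. It is not an MLlib-ready implementation, which would need to be distributed and spatially indexed; all names and parameters are illustrative.

{code}
object DbscanSketch {
  type Point = Array[Double]

  // Euclidean distance between two points.
  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Returns one cluster id per point; -1 marks noise. */
  def dbscan(points: IndexedSeq[Point], eps: Double, minPts: Int): Array[Int] = {
    val Unvisited = -2
    val Noise = -1
    val labels = Array.fill(points.length)(Unvisited)
    var cluster = 0

    // Brute-force eps-neighbourhood query (a real version would use an index).
    def neighbours(i: Int): Seq[Int] =
      points.indices.filter(j => dist(points(i), points(j)) <= eps)

    for (i <- points.indices if labels(i) == Unvisited) {
      val seeds = neighbours(i)
      if (seeds.size < minPts) {
        labels(i) = Noise              // not a core point, provisionally noise
      } else {
        labels(i) = cluster            // start a new cluster and grow it
        val frontier = scala.collection.mutable.Queue(seeds: _*)
        while (frontier.nonEmpty) {
          val j = frontier.dequeue()
          if (labels(j) == Noise) labels(j) = cluster   // reclaim border point
          if (labels(j) == Unvisited) {
            labels(j) = cluster
            val jn = neighbours(j)
            if (jn.size >= minPts) frontier ++= jn      // j is also a core point
          }
        }
        cluster += 1
      }
    }
    labels
  }
}
{code}

On a toy dataset, DbscanSketch.dbscan(points, eps = 0.5, minPts = 4) returns one label per point, with -1 for points that end up as noise.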





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5179) Spark UI history job duration is wrong

2015-01-13 Thread Olivier Toupin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Toupin updated SPARK-5179:
--
Target Version/s: 1.2.1

 Spark UI history job duration is wrong
 --

 Key: SPARK-5179
 URL: https://issues.apache.org/jira/browse/SPARK-5179
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.2.0
Reporter: Olivier Toupin
Priority: Minor

 In the Web UI, job durations are wrong when reviewing a job through the 
 history. Stage durations are correct.
 Jobs are shown with durations of a few milliseconds, which is wrong. However, it's 
 only a history issue; while the job is running, the durations are correct.
 More details in that discussion on the mailing list:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-UI-history-job-duration-is-wrong-tc10010.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API

2015-01-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5223.
--
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

Issue resolved by pull request 4023
[https://github.com/apache/spark/pull/4023]

 Use pickle instead of MapConvert and ListConvert in MLlib Python API
 

 Key: SPARK-5223
 URL: https://issues.apache.org/jira/browse/SPARK-5223
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Reporter: Davies Liu
Priority: Critical
 Fix For: 1.3.0, 1.2.1


 It will introduce problems if an object inside a dict/list/tuple cannot be handled 
 by py4j, such as a Vector.
 Also, pickle may have better performance for larger objects (fewer RPCs).
 In cases where an object in a dict/list cannot be pickled (such as a 
 JavaObject), we should still use MapConvert/ListConvert.
 discussion: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Python-to-Java-object-conversion-of-numpy-array-td10065.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5223) Use pickle instead of MapConvert and ListConvert in MLlib Python API

2015-01-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5223:
-
Assignee: Davies Liu

 Use pickle instead of MapConvert and ListConvert in MLlib Python API
 

 Key: SPARK-5223
 URL: https://issues.apache.org/jira/browse/SPARK-5223
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical
 Fix For: 1.3.0, 1.2.1


 It will introduce problems if an object inside a dict/list/tuple cannot be handled 
 by py4j, such as a Vector.
 Also, pickle may have better performance for larger objects (fewer RPCs).
 In cases where an object in a dict/list cannot be pickled (such as a 
 JavaObject), we should still use MapConvert/ListConvert.
 discussion: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Python-to-Java-object-conversion-of-numpy-array-td10065.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4912) Persistent data source tables

2015-01-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4912.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3960
[https://github.com/apache/spark/pull/3960]

 Persistent data source tables
 -

 Key: SPARK-4912
 URL: https://issues.apache.org/jira/browse/SPARK-4912
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.3.0


 It would be good if tables created through the new data sources api could be 
 persisted to the hive metastore.
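
For context, the data sources DDL that already exists in 1.2 is sketched below; the table name, provider, and path are illustrative assumptions. The gist of this issue is that the non-TEMPORARY variant should be recorded in the Hive metastore so the table definition survives across sessions.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("ds-ddl-sketch"))
val hiveContext = new HiveContext(sc)

// Works today, but the table definition lives only in the current session.
hiveContext.sql("""
  CREATE TEMPORARY TABLE events
  USING org.apache.spark.sql.json
  OPTIONS (path '/data/events.json')
""")

hiveContext.sql("SELECT count(*) FROM events").collect().foreach(println)
{code}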



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


