[jira] [Commented] (SPARK-2982) Glitch of spark streaming

2014-08-11 Thread dai zhiyuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093816#comment-14093816
 ] 

dai zhiyuan commented on SPARK-2982:


[~srowen] Please see the attached file.

> Glitch of spark streaming
> -
>
> Key: SPARK-2982
> URL: https://issues.apache.org/jira/browse/SPARK-2982
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: dai zhiyuan
> Attachments: cpu.png, io.png, network.png
>
>
> Spark Streaming task start times are tightly clustered. This causes spikes 
> (glitches) in network and CPU usage, while the CPU and network sit idle much of 
> the rest of the time, which is wasteful of system resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2981) PartitionStrategy: VertexID hash overflow

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093813#comment-14093813
 ] 

Apache Spark commented on SPARK-2981:
-

User 'larryxiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/1902

> PartitionStrategy: VertexID hash overflow
> -
>
> Key: SPARK-2981
> URL: https://issues.apache.org/jira/browse/SPARK-2981
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>  Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In EdgePartition1D, a PartitionID is calculated by multiplying the VertexId by 
> a mixingPrime (1125899906842597L), casting to Int, and taking mod numParts.
> The Long overflows, and when cast to Int:
> {quote}
> scala> (1125899906842597L*1).toInt
> res1: Int = -27
> scala> (1125899906842597L*2).toInt
> res2: Int = -54
> scala> (1125899906842597L*3).toInt
> res3: Int = -81
> {quote}
> Because the cast produces numbers that are multiples of 3, the partitioning is 
> not usable when the number of partitions is a multiple of 3.
> For example, when partitioning into 6 or 9 parts:
> {quote}
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0))
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0)) 
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
> so the vertices land only in partitions 0 and 3 for 6 parts, and only in partition 0 for 9 parts
> {quote}
> I think the solution is to cast after the mod.
> {quote}
> scala> (1125899906842597L*3)
> res4: Long = 3377699720527791
> scala> (1125899906842597L*3) % 9
> res5: Long = 3
> scala> ((1125899906842597L*3) % 9).toInt
> res5: Int = 3
> {quote}
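
A minimal Scala sketch of the proposed fix (cast after the mod). The method shape below is illustrative only, not the actual PartitionStrategy.scala code:

{code}
// Illustrative only: the real EdgePartition1D logic lives in PartitionStrategy.scala.
val mixingPrime: Long = 1125899906842597L

// Current behavior as described above: the Long product is cast to Int before the
// mod, so it overflows and produces negative / collapsed partition ids.
def getPartitionOverflowing(vid: Long, numParts: Int): Int =
  (vid * mixingPrime).toInt % numParts

// Proposed fix: take the mod in Long arithmetic first, then cast the small result.
// A real implementation would also guard against a negative product (e.g. math.abs).
def getPartitionFixed(vid: Long, numParts: Int): Int =
  ((vid * mixingPrime) % numParts).toInt
{code}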



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow

2014-08-11 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-2981:
--

Description: 
In EdgePartition1D, a PartitionID is calculated by multiplying the VertexId by a 
mixingPrime (1125899906842597L), casting to Int, and taking mod numParts.

The Long overflows, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
Because the cast produces numbers that are multiples of 3, the partitioning is not 
usable when the number of partitions is a multiple of 3.

For example, when partitioning into 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 

so the vertices land only in partitions 0 and 3 for 6 parts, and only in partition 0 for 9 parts
{quote}

I think the solution is to cast after the mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}

  was:
In PartitionStrategy.scala a PartitionID is calculated by multiplying the VertexId 
with a mixingPrime (1125899906842597L), casting to Int, and taking mod numParts.

The Long overflows, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
Because the cast produces numbers that are multiples of 3, the partitioning is not 
usable when the number of partitions is a multiple of 3.

For example, when partitioning into 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 

so the vertices land only in partitions 0 and 3 for 6 parts, and only in partition 0 for 9 parts
{quote}

I think the solution is to cast after the mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}


> PartitionStrategy: VertexID hash overflow
> -
>
> Key: SPARK-2981
> URL: https://issues.apache.org/jira/browse/SPARK-2981
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>  Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In EdgePartition1D, a PartitionID is calculated by multiplying the VertexId by 
> a mixingPrime (1125899906842597L), casting to Int, and taking mod numParts.
> The Long overflows, and when cast to Int:
> {quote}
> scala> (1125899906842597L*1).toInt
> res1: Int = -27
> scala> (1125899906842597L*2).toInt
> res2: Int = -54
> scala> (1125899906842597L*3).toInt
> res3: Int = -81
> {quote}
> Because the cast produces numbers that are multiples of 3, the partitioning is 
> not usable when the number of partitions is a multiple of 3.
> For example, when partitioning into 6 or 9 parts:
> {quote}
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0))
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0)) 
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
> so the vertices land only in partitions 0 and 3 for 6 parts, and only in partition 0 for 9 parts
> {quote}
> I think the solution is to cast after the mod.
> {quote}
> scala> (1125899906842597L*3)
> res4: Long = 3377699720527791
> scala> (1125899906842597L*3) % 9
> res5: Long = 3
> scala> ((1125899906842597L*3) % 9).toInt
> res5: Int = 3
> {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2982) Glitch of spark streaming

2014-08-11 Thread dai zhiyuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dai zhiyuan updated SPARK-2982:
---

Attachment: network.png
io.png
cpu.png

> Glitch of spark streaming
> -
>
> Key: SPARK-2982
> URL: https://issues.apache.org/jira/browse/SPARK-2982
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: dai zhiyuan
> Attachments: cpu.png, io.png, network.png
>
>
> Spark Streaming task start times are tightly clustered. This causes spikes 
> (glitches) in network and CPU usage, while the CPU and network sit idle much of 
> the rest of the time, which is wasteful of system resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2982) Glitch of spark streaming

2014-08-11 Thread dai zhiyuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dai zhiyuan updated SPARK-2982:
---

Description: Spark Streaming task start times are tightly clustered. This causes 
spikes (glitches) in network and CPU usage, while the CPU and network sit idle 
much of the rest of the time, which is wasteful of system resources.  (was: Spark 
Streaming task start times are tightly clustered. This causes network and CPU 
spikes (glitches), while the CPU and network sit idle much of the rest of the 
time, which is very wasteful of system resources.)

> Glitch of spark streaming
> -
>
> Key: SPARK-2982
> URL: https://issues.apache.org/jira/browse/SPARK-2982
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: dai zhiyuan
>
> Spark Streaming task start times are tightly clustered. This causes spikes 
> (glitches) in network and CPU usage, while the CPU and network sit idle much of 
> the rest of the time, which is wasteful of system resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2982) Glitch of spark streaming

2014-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093794#comment-14093794
 ] 

Sean Owen commented on SPARK-2982:
--

I find it hard to understand the problem or solution that this is attempting to 
describe. Please provide much clearer detail.

> Glitch of spark streaming
> -
>
> Key: SPARK-2982
> URL: https://issues.apache.org/jira/browse/SPARK-2982
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: dai zhiyuan
>
> Spark Streaming task start times are tightly clustered. This causes network and 
> CPU spikes (glitches), while the CPU and network sit idle much of the rest of 
> the time, which is very wasteful of system resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2985) BlockGenerator not available.

2014-08-11 Thread dai zhiyuan (JIRA)
dai zhiyuan created SPARK-2985:
--

 Summary: BlockGenerator not available.
 Key: SPARK-2985
 URL: https://issues.apache.org/jira/browse/SPARK-2985
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0
Reporter: dai zhiyuan
Priority: Critical


If the ReceiverTracker crashes, the buffered data in BlockGenerator will be lost.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2650) Caching tables larger than memory causes OOMs

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093785#comment-14093785
 ] 

Apache Spark commented on SPARK-2650:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/1901

> Caching tables larger than memory causes OOMs
> -
>
> Key: SPARK-2650
> URL: https://issues.apache.org/jira/browse/SPARK-2650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.1.0
>
>
> The logic for setting up the initial column buffers is different for Spark SQL 
> compared to Shark, and I'm seeing OOMs when caching tables that are larger than 
> available memory (where Shark was okay).
> Two suspicious things: initialSize is always set to 0, so we always go with the 
> default, and the default looks like it was copied from code like 10 * 1024 * 
> 1024... but in Spark SQL it's 10 * 102 * 1024.
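
A hedged sketch of what the suspected constant fix amounts to; the name below is illustrative, not the actual Spark SQL identifier:

{code}
// Illustrative only: restore the apparent 10 MB intent of the default buffer size.
val DEFAULT_INITIAL_BUFFER_SIZE: Int = 10 * 1024 * 1024   // 10 MB
// instead of the suspected typo:
// val DEFAULT_INITIAL_BUFFER_SIZE: Int = 10 * 102 * 1024 // roughly 1 MB
{code}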



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2984) FileNotFoundException on _temporary directory

2014-08-11 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-2984:
--

Description: 
We've seen several stacktraces and threads on the user mailing list where 
people are having issues with a FileNotFoundException stemming from an HDFS 
path containing _temporary.

I think this may be related to spark.speculation.  I think the error condition 
might manifest in this circumstance:

1) task T starts on an executor E1
2) it takes a long time, so task T' is started on another executor E2
3) T finishes in E1, so it moves its data from _temporary to the final destination 
and deletes the _temporary directory during cleanup
4) T' finishes in E2 and attempts to move its data from _temporary, but those 
files no longer exist, so it hits the exception (a diagnostic sketch follows the 
list)
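
If speculation really is the trigger, a quick way to test the hypothesis is to turn it off and see whether the exception disappears. A minimal sketch, assuming a standalone driver program (diagnostic only, not a fix):

{code}
// spark.speculation is off by default; setting it explicitly just makes the
// diagnostic intent obvious. If the FileNotFoundException goes away with
// speculation disabled, the race described above becomes much more likely.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("speculation-off-diagnostic")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
{code}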

Some samples:

{noformat}
14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
140774430 ms.0
java.io.FileNotFoundException: File 
hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
 does not exist.
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
at 
org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
at 
org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
-- Chen Song at 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html



{noformat}
I am running a Spark Streaming job that uses saveAsTextFiles to save results 
into hdfs files. However, it has an exception after 20 batches

result-140631234/_temporary/0/task_201407251119__m_03 does not 
exist.
{noformat}
and
{noformat}
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. 
Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2766)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2674)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.ja

[jira] [Created] (SPARK-2984) FileNotFoundException on _temporary directory

2014-08-11 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-2984:
-

 Summary: FileNotFoundException on _temporary directory
 Key: SPARK-2984
 URL: https://issues.apache.org/jira/browse/SPARK-2984
 Project: Spark
  Issue Type: Bug
Reporter: Andrew Ash
Priority: Critical


We've seen several stacktraces and threads on the user mailing list where 
people are having issues with a FileNotFoundException stemming from an HDFS 
path containing _temporary.

I think this may be related to spark.speculation.  I think the error condition 
might manifest in this circumstance:

1) task T starts on an executor E1
2) it takes a long time, so task T' is started on another executor E2
3) T finishes in E1, so it moves its data from _temporary to the final destination 
and deletes the _temporary directory during cleanup
4) T' finishes in E2 and attempts to move its data from _temporary, but those 
files no longer exist, so it hits the exception

Some samples:

{noformat}
14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
140774430 ms.0
java.io.FileNotFoundException: File 
hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
 does not exist.
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
at 
org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
at 
org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
-- Chen Song at 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html



{noformat}
I am running a Spark Streaming job that uses saveAsTextFiles to save results 
into hdfs files. However, it has an exception after 20 batches

result-140631234/_temporary/0/task_201407251119__m_03 does not 
exist.
{noformat}
and
{noformat}
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. 
Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2766)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2674)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.a

[jira] [Created] (SPARK-2983) improve performance of sortByKey()

2014-08-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2983:
-

 Summary: improve performance of sortByKey()
 Key: SPARK-2983
 URL: https://issues.apache.org/jira/browse/SPARK-2983
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.2, 0.9.0, 1.1.0
Reporter: Davies Liu


For large datasets with many partitions (N), sortByKey() will be very slow, 
because it will take O(N) time in rangePartitioner.

This could be improved by using binary search, reducing the time to O(log N).
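
A sketch of the idea in Scala (illustrative only, not the PySpark code in question): given the sorted range bounds the partitioner already computes, the target partition can be found by binary search instead of a linear scan.

{code}
// Returns the index of the partition whose range contains `key`, assuming `bounds`
// holds the (numPartitions - 1) sorted upper bounds. Hypothetical helper, shown
// only to illustrate the O(log N) lookup.
def partitionFor[K](key: K, bounds: Array[K])(implicit ord: Ordering[K]): Int = {
  var lo = 0
  var hi = bounds.length
  while (lo < hi) {
    val mid = (lo + hi) >>> 1
    if (ord.lt(bounds(mid), key)) lo = mid + 1 else hi = mid
  }
  lo  // number of bounds strictly less than key == partition index
}
{code}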



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2923) Implement some basic linalg operations in MLlib

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2923.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1849
[https://github.com/apache/spark/pull/1849]

> Implement some basic linalg operations in MLlib
> ---
>
> Key: SPARK-2923
> URL: https://issues.apache.org/jira/browse/SPARK-2923
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.1.0
>
>
> We use breeze for linear algebra operations. Breeze operations are 
> user-friendly but there are some concerns:
> 1. creating temp objects, e.g., `val z = a * x + b * y`
> 2. multimethod dispatch is not used in some operators, e.g., `axpy`: if we pass a 
> SparseVector in as a generic Vector, it will use activeIterator, which is slow
> 3. native BLAS is called if it is available, which might not be a good fit for 
> level-1 methods
> Having some basic BLAS operations implemented in MLlib can help simplify the 
> current implementation and improve performance in some cases.
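
As an illustration of the kind of level-1 helper being described (not the actual MLlib signature), an in-place axpy on dense arrays avoids the temporary vectors that an expression like `val z = a * x + b * y` allocates:

{code}
// y := a * x + y, updated in place; no intermediate vector is created.
def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
  require(x.length == y.length, "vectors must have the same length")
  var i = 0
  while (i < x.length) {
    y(i) += a * x(i)
    i += 1
  }
}
{code}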



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-08-11 Thread Jianshi Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093746#comment-14093746
 ] 

Jianshi Huang commented on SPARK-2890:
--

My use case:

The result will be parsed into (id, type, start, end, properties) tuples. 
Properties may or may not contain any of (id, type, start, end), so it's easier 
just to list them at the end and not worry about duplicated names.

Jianshi

> Spark SQL should allow SELECT with duplicated columns
> -
>
> Key: SPARK-2890
> URL: https://issues.apache.org/jira/browse/SPARK-2890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Jianshi Huang
>
> Spark reported a java.lang.IllegalArgumentException with the message:
> java.lang.IllegalArgumentException: requirement failed: Found fields with the 
> same name.
> at scala.Predef$.require(Predef.scala:233)
> at 
> org.apache.spark.sql.catalyst.types.StructType.(dataTypes.scala:317)
> at 
> org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
> at 
> org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
> at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
> After trial and error, it seems it's caused by duplicated columns in my 
> select clause.
> I made the duplication on purpose for my code to parse correctly. I think we 
> should allow users to specify duplicated columns as return values.
> Jianshi
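
A hypothetical reproduction sketch of the reported failure; the Parquet path, table name, and column names below are made up:

{code}
// Selecting the same column twice makes the result schema contain duplicate field
// names, which trips the StructType requirement check shown in the stack trace.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
sqlContext.parquetFile("hdfs:///tmp/edges.parquet").registerTempTable("edges")

val result = sqlContext.sql("SELECT id, src, id, src FROM edges")
result.collect()  // java.lang.IllegalArgumentException: ... Found fields with the same name.
{code}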



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2982) Glitch of spark streaming

2014-08-11 Thread dai zhiyuan (JIRA)
dai zhiyuan created SPARK-2982:
--

 Summary: Glitch of spark streaming
 Key: SPARK-2982
 URL: https://issues.apache.org/jira/browse/SPARK-2982
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: dai zhiyuan


Spark Streaming task start times are tightly clustered. This causes network and 
CPU spikes (glitches), while the CPU and network sit idle much of the rest of the 
time, which is very wasteful of system resources.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2934) Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2934.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1862
[https://github.com/apache/spark/pull/1862]

> Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer  
> --
>
> Key: SPARK-2934
> URL: https://issues.apache.org/jira/browse/SPARK-2934
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2826) Reduce the Memory Copy for HashOuterJoin

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2826.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

> Reduce the Memory Copy for HashOuterJoin
> 
>
> Key: SPARK-2826
> URL: https://issues.apache.org/jira/browse/SPARK-2826
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
> Fix For: 1.1.0
>
>
> This is actually a follow up for 
> https://issues.apache.org/jira/browse/SPARK-2212 , the previous 
> implementation has potential memory copy.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2981) PartitionStrategy: VertexID hash overflow

2014-08-11 Thread Larry Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Larry Xiao updated SPARK-2981:
--

Description: 
In PartitionStrategy.scala a PartitionID is calculated by multiplying the VertexId 
with a mixingPrime (1125899906842597L), casting to Int, and taking mod numParts.

The Long overflows, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
Because the cast produces numbers that are multiples of 3, the partitioning is not 
usable when the number of partitions is a multiple of 3.

For example, when partitioning into 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 

so the vertices land only in partitions 0 and 3 for 6 parts, and only in partition 0 for 9 parts
{quote}

I think the solution is to cast after the mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}

  was:
In PartitionStrategy.scala a PartitionID is calculated by multiplying the VertexId 
with a mixingPrime (1125899906842597L), casting to Int, and taking mod numParts.

The Long overflows, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
Because the cast produces numbers that are multiples of 3, the partitioning is not 
usable when the number of partitions is a multiple of 3.

For example, when partitioning into 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
{quote}
I think the solution is to cast after the mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}


> PartitionStrategy: VertexID hash overflow
> -
>
> Key: SPARK-2981
> URL: https://issues.apache.org/jira/browse/SPARK-2981
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>  Labels: newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In PartitionStrategy.scala a PartitionID is calculated by multiplying the 
> VertexId with a mixingPrime (1125899906842597L), casting to Int, and taking mod 
> numParts.
> The Long overflows, and when cast to Int:
> {quote}
> scala> (1125899906842597L*1).toInt
> res1: Int = -27
> scala> (1125899906842597L*2).toInt
> res2: Int = -54
> scala> (1125899906842597L*3).toInt
> res3: Int = -81
> {quote}
> Because the cast produces numbers that are multiples of 3, the partitioning is 
> not usable when the number of partitions is a multiple of 3.
> For example, when partitioning into 6 or 9 parts:
> {quote}
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0))
> 14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), 
> (1,0), (2,0), (3,3832578), (4,0), (5,0)) 
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
> 14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), 
> (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
> so the vertices land only in partitions 0 and 3 for 6 parts, and only in partition 0 for 9 parts
> {quote}
> I think the solution is to cast after the mod.
> {quote}
> scala> (1125899906842597L*3)
> res4: Long = 3377699720527791
> scala> (1125899906842597L*3) % 9
> res5: Long = 3
> scala> ((1125899906842597L*3) % 9).toInt
> res5: Int = 3
> {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2981) PartitionStrategy: VertexID hash overflow

2014-08-11 Thread Larry Xiao (JIRA)
Larry Xiao created SPARK-2981:
-

 Summary: PartitionStrategy: VertexID hash overflow
 Key: SPARK-2981
 URL: https://issues.apache.org/jira/browse/SPARK-2981
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
Reporter: Larry Xiao


In PartitionStrategy.scala a PartitionID is calculated by multiplying the VertexId 
with a mixingPrime (1125899906842597L), casting to Int, and taking mod numParts.

The Long overflows, and when cast to Int:

{quote}
scala> (1125899906842597L*1).toInt
res1: Int = -27

scala> (1125899906842597L*2).toInt
res2: Int = -54

scala> (1125899906842597L*3).toInt
res3: Int = -81
{quote}
Because the cast produces numbers that are multiples of 3, the partitioning is not 
usable when the number of partitions is a multiple of 3.

For example, when partitioning into 6 or 9 parts:
{quote}
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0))
14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), 
(2,0), (3,3832578), (4,0), (5,0)) 

14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), 
(2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0)) 
{quote}
I think the solution is to cast after the mod.
{quote}
scala> (1125899906842597L*3)
res4: Long = 3377699720527791

scala> (1125899906842597L*3) % 9
res5: Long = 3

scala> ((1125899906842597L*3) % 9).toInt
res5: Int = 3
{quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2650) Caching tables larger than memory causes OOMs

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2650.
-

  Resolution: Fixed
   Fix Version/s: 1.1.0
Assignee: Michael Armbrust  (was: Cheng Lian)
Target Version/s: 1.1.0  (was: 1.2.0)

> Caching tables larger than memory causes OOMs
> -
>
> Key: SPARK-2650
> URL: https://issues.apache.org/jira/browse/SPARK-2650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.1.0
>
>
> The logic for setting up the initial column buffers is different for Spark SQL 
> compared to Shark, and I'm seeing OOMs when caching tables that are larger than 
> available memory (where Shark was okay).
> Two suspicious things: initialSize is always set to 0, so we always go with the 
> default, and the default looks like it was copied from code like 10 * 1024 * 
> 1024... but in Spark SQL it's 10 * 102 * 1024.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2968) Fix nullabilities of Explode.

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2968.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Takuya Ueshin

> Fix nullabilities of Explode.
> -
>
> Key: SPARK-2968
> URL: https://issues.apache.org/jira/browse/SPARK-2968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0
>
>
> Output nullabilities of {{Explode}} could be determined by 
> {{ArrayType.containsNull}} or {{MapType.valueContainsNull}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2965) Fix HashOuterJoin output nullabilities.

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2965.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Takuya Ueshin

> Fix HashOuterJoin output nullabilities.
> ---
>
> Key: SPARK-2965
> URL: https://issues.apache.org/jira/browse/SPARK-2965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0
>
>
> Output attributes of the opposite side of an {{OuterJoin}} should be nullable.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2590) Add config property to disable incremental collection used in Thrift server

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2590.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

> Add config property to disable incremental collection used in Thrift server
> ---
>
> Key: SPARK-2590
> URL: https://issues.apache.org/jira/browse/SPARK-2590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 1.1.0
>
>
> {{SparkSQLOperationManager}} uses {{RDD.toLocalIterator}} to collect the 
> result set one partition at a time. This is useful to avoid OOM when the 
> result is large, but introduces extra job scheduling costs as each partition 
> is collected with a separate job. Users may want to disable this when the 
> result set is expected to be small.
> *UPDATE* Incremental collection hurts performance because tasks of the last 
> stage of the RDD DAG generated from the SQL query plan are executed 
> sequentially. Thus we decided to disable it by default.
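
For background, a sketch of the two collection modes being traded off here; `results` stands for whatever RDD backs the query result and is only an assumed name:

{code}
// collect() runs a single job and pulls every partition to the driver at once:
// fast scheduling, but the whole result must fit in driver memory.
val all = results.collect()

// toLocalIterator streams one partition at a time, which bounds driver memory but
// schedules a separate job per partition, hence the sequential last-stage tasks
// mentioned above.
val it = results.toLocalIterator
it.foreach(println)
{code}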



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2844) Existing JVM Hive Context not correctly used in Python Hive Context

2014-08-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2844.
-

   Resolution: Fixed
Fix Version/s: 1.1.0

> Existing JVM Hive Context not correctly used in Python Hive Context
> ---
>
> Key: SPARK-2844
> URL: https://issues.apache.org/jira/browse/SPARK-2844
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Ahir Reddy
>Assignee: Ahir Reddy
> Fix For: 1.1.0
>
>
> Unlike the SQLContext, passing an existing JVM HiveContext object into the 
> Python HiveContext constructor does not actually re-use that object. Instead, 
> it will create a new HiveContext.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2934) Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2934:
-

Assignee: DB Tsai

> Adding LogisticRegressionWithLBFGS for training with LBFGS Optimizer  
> --
>
> Key: SPARK-2934
> URL: https://issues.apache.org/jira/browse/SPARK-2934
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2980) Python support for chi-squared test

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2980:
-

Assignee: (was: Doris Xin)

> Python support for chi-squared test
> ---
>
> Key: SPARK-2980
> URL: https://issues.apache.org/jira/browse/SPARK-2980
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Doris Xin
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2980) Python support for chi-squared test

2014-08-11 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2980:


 Summary: Python support for chi-squared test
 Key: SPARK-2980
 URL: https://issues.apache.org/jira/browse/SPARK-2980
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2515) Chi-squared test

2014-08-11 Thread Doris Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doris Xin updated SPARK-2515:
-

Summary: Chi-squared test  (was: Hypothesis testing)

> Chi-squared test
> 
>
> Key: SPARK-2515
> URL: https://issues.apache.org/jira/browse/SPARK-2515
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Doris Xin
> Fix For: 1.1.0
>
>
> Support common statistical tests in Spark MLlib.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-2515) Chi-squared test

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-2515.


  Resolution: Implemented
Target Version/s: 1.1.0

> Chi-squared test
> 
>
> Key: SPARK-2515
> URL: https://issues.apache.org/jira/browse/SPARK-2515
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Doris Xin
> Fix For: 1.1.0
>
>
> Support common statistical tests in Spark MLlib.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2515) Hypothesis testing

2014-08-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2515:
-

Fix Version/s: 1.1.0

> Hypothesis testing
> --
>
> Key: SPARK-2515
> URL: https://issues.apache.org/jira/browse/SPARK-2515
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Doris Xin
> Fix For: 1.1.0
>
>
> Support common statistical tests in Spark MLlib.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2979) Improve the convergence rate by minimizing the condition number in LOR with LBFGS

2014-08-11 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-2979:
---

Summary: Improve the convergence rate by minimizing the condition number in 
LOR with LBFGS  (was: Improve the convergence rate by minimize the condition 
number in LOR with LBFGS)

> Improve the convergence rate by minimizing the condition number in LOR with 
> LBFGS
> -
>
> Key: SPARK-2979
> URL: https://issues.apache.org/jira/browse/SPARK-2979
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>
> Scaling to minimize the condition number:
> 
> During the optimization process, the convergence (rate) depends on the 
> condition number of the training dataset. Scaling the variables often reduces 
> this condition number, thus improving the convergence rate dramatically. 
> Without reducing the condition number, some training datasets that mix columns 
> with different scales may not be able to converge.
>  
> GLMNET and LIBSVM packages perform the scaling to reduce the condition 
> number, and return the weights in the original scale.
> See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
>  
> Here, if useFeatureScaling is enabled, we will standardize the training 
> features by dividing each column by its variance (without subtracting the 
> mean), and train the model in the scaled space. Then we transform the 
> coefficients from the scaled space back to the original scale, as GLMNET and 
> LIBSVM do.
>
> Currently, it's only enabled in LogisticRegressionWithLBFGS
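
A minimal sketch of the scaling step described above, on plain RDDs of dense feature arrays. The helper names are illustrative, not the MLlib API, and the scale used here is the column standard deviation rather than the variance; either choice brings columns to comparable magnitude:

{code}
import org.apache.spark.rdd.RDD

// Per-column standard deviation, computed without centering the data.
def columnStdDevs(features: RDD[Array[Double]]): Array[Double] = {
  val n = features.count().toDouble
  val sums   = features.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  val sqSums = features.map(_.map(v => v * v))
                       .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  sums.indices.map { j =>
    val mean = sums(j) / n
    math.sqrt(math.max(sqSums(j) / n - mean * mean, 1e-12))  // guard against zero
  }.toArray
}

// Train on features divided column-wise by these factors; afterwards, weights in
// the original space are weightsScaled(j) / std(j), mirroring how GLMNET and
// LIBSVM report coefficients on the original scale.
def scale(features: RDD[Array[Double]], std: Array[Double]): RDD[Array[Double]] =
  features.map(row => Array.tabulate(row.length)(j => row(j) / std(j)))
{code}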



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2979) Improve the convergence rate by minimize the condition number in LOR with LBFGS

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093604#comment-14093604
 ] 

Apache Spark commented on SPARK-2979:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/1897

> Improve the convergence rate by minimize the condition number in LOR with 
> LBFGS
> ---
>
> Key: SPARK-2979
> URL: https://issues.apache.org/jira/browse/SPARK-2979
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>
> Scaling to minimize the condition number:
> 
> During the optimization process, the convergence (rate) depends on the 
> condition number of the training dataset. Scaling the variables often reduces 
> this condition number, thus improving the convergence rate dramatically. 
> Without reducing the condition number, some training datasets that mix columns 
> with different scales may not be able to converge.
>  
> GLMNET and LIBSVM packages perform the scaling to reduce the condition 
> number, and return the weights in the original scale.
> See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
>  
> Here, if useFeatureScaling is enabled, we will standardize the training 
> features by dividing each column by its variance (without subtracting the 
> mean), and train the model in the scaled space. Then we transform the 
> coefficients from the scaled space back to the original scale, as GLMNET and 
> LIBSVM do.
>
> Currently, it's only enabled in LogisticRegressionWithLBFGS



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2979) Improve the convergence rate by minimize the condition number in LOR with LBFGS

2014-08-11 Thread DB Tsai (JIRA)
DB Tsai created SPARK-2979:
--

 Summary: Improve the convergence rate by minimize the condition 
number in LOR with LBFGS
 Key: SPARK-2979
 URL: https://issues.apache.org/jira/browse/SPARK-2979
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai


Scaling to minimize the condition number:

During the optimization process, the convergence (rate) depends on the 
condition number of the training dataset. Scaling the variables often reduces 
this condition number, thus improving the convergence rate dramatically. Without 
reducing the condition number, some training datasets that mix columns with 
different scales may not be able to converge.
 
GLMNET and LIBSVM packages perform the scaling to reduce the condition number, 
and return the weights in the original scale.

See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
 
Here, if useFeatureScaling is enabled, we will standardize the training 
features by dividing each column by its variance (without subtracting the 
mean), and train the model in the scaled space. Then we transform the 
coefficients from the scaled space back to the original scale, as GLMNET and 
LIBSVM do.
   
Currently, it's only enabled in LogisticRegressionWithLBFGS




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-2978:
--

Description: 
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide a transformation with the 
semantics of the Hadoop MR shuffle, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", 
"hadoopStyleShuffle"
* Allow groupByKey to take an ordering param for keys within a partition

  was:
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide an MR-style shuffle 
transformation, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", 
"hadoopStyleShuffle"
* Allow groupByKey to take an ordering param for keys within a partition


> Provide an MR-style shuffle transformation
> --
>
> Key: SPARK-2978
> URL: https://issues.apache.org/jira/browse/SPARK-2978
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Sandy Ryza
>
> For Hive on Spark joins in particular, and for running legacy MR code in 
> general, I think it would be useful to provide a transformation with the 
> semantics of the Hadoop MR shuffle, i.e. one that
> * groups by key: provides (Key, Iterator[Value])
> * within each partition, provides keys in sorted order
> A couple of ways that could make sense to expose this (a rough sketch of the 
> first option follows the list):
> * Add a new operator.  "groupAndSortByKey", 
> "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle"
> * Allow groupByKey to take an ordering param for keys within a partition
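
A hypothetical sketch of the first option, built only from existing primitives (partitionBy plus a per-partition sort). The operator name is made up, and the in-memory sort is only there to show the intended semantics; a real implementation would sort during the shuffle:

{code}
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// Groups values by key and yields keys in sorted order within each partition,
// i.e. the Hadoop MR reduce-side contract: (Key, Iterable[Value]) with sorted keys.
def groupByKeyAndSortWithinPartition[K: Ordering : ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, Iterable[V])] = {
  rdd.partitionBy(new HashPartitioner(numPartitions))
     .mapPartitions({ iter =>
       // Materializing and sorting the whole partition here is just a semantic sketch.
       iter.toSeq
           .groupBy(_._1)
           .toSeq
           .sortBy(_._1)
           .iterator
           .map { case (k, kvs) => (k, kvs.map(_._2): Iterable[V]) }
     }, preservesPartitioning = true)
}
{code}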



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-2978:
--

Description: 
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide a transformation with the 
semantics of the Hadoop MR shuffle, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", 
"hadoopStyleShuffle", maybe?
* Allow groupByKey to take an ordering param for keys within a partition

  was:
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide a transformation with the 
semantics of the Hadoop MR shuffle, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", 
"hadoopStyleShuffle"
* Allow groupByKey to take an ordering param for keys within a partition


> Provide an MR-style shuffle transformation
> --
>
> Key: SPARK-2978
> URL: https://issues.apache.org/jira/browse/SPARK-2978
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Sandy Ryza
>
> For Hive on Spark joins in particular, and for running legacy MR code in 
> general, I think it would be useful to provide a transformation with the 
> semantics of the Hadoop MR shuffle, i.e. one that
> * groups by key: provides (Key, Iterator[Value])
> * within each partition, provides keys in sorted order
> A couple ways that could make sense to expose this:
> * Add a new operator.  "groupAndSortByKey", 
> "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe?
> * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-11 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-2978:
--

Description: 
For Hive on Spark joins in particular, and for running legacy MR code in 
general, I think it would be useful to provide an MR-style shuffle 
transformation, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", 
"hadoopStyleShuffle"
* Allow groupByKey to take an ordering param for keys within a partition

  was:
For Hive on Spark in particular, and running legacy MR code in general, I think 
it would be useful to provide an MR-style shuffle transformation, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", 
"hadoopStyleShuffle"
* Allow groupByKey to take an ordering param for keys within a partition


> Provide an MR-style shuffle transformation
> --
>
> Key: SPARK-2978
> URL: https://issues.apache.org/jira/browse/SPARK-2978
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Sandy Ryza
>
> For Hive on Spark joins in particular, and for running legacy MR code in 
> general, I think it would be useful to provide an MR-style shuffle 
> transformation, i.e. one that
> * groups by key: provides (Key, Iterator[Value])
> * within each partition, provides keys in sorted order
> A couple ways that could make sense to expose this:
> * Add a new operator.  "groupAndSortByKey", 
> "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle"
> * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2978) Provide an MR-style shuffle transformation

2014-08-11 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-2978:
-

 Summary: Provide an MR-style shuffle transformation
 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza


For Hive on Spark in particular, and running legacy MR code in general, I think 
it would be useful to provide an MR-style shuffle transformation, i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", 
"hadoopStyleShuffle"
* Allow groupByKey to take an ordering param for keys within a partition
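
As a rough illustration of the intended semantics only (this is not a proposed
implementation; the operator name is just one of the hypothetical names above,
and buffering a whole partition in memory to sort it is a known shortcoming of
this naive version):

{code}
import scala.reflect.ClassTag

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits
import org.apache.spark.rdd.RDD

// Sketch: group by key, then sort each partition's keys before emitting them,
// approximating the Hadoop MR shuffle contract described above.
def groupByKeyAndSortWithinPartition[K: Ordering : ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, Iterable[V])] = {
  rdd.groupByKey(new HashPartitioner(numPartitions))
    .mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator,
      preservesPartitioning = true)
}
{code}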



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2975) SPARK_LOCAL_DIRS may cause problems when running in local mode

2014-08-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2975:
--

Priority: Critical  (was: Minor)

I'm raising the priority of this issue to 'critical', since it causes problems 
when running on a cluster if some tasks are small enough to be run locally on 
the driver.

Here's an example exception:

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in 
stage 0.0 failed 1 times, most recent failure: Lost task 21.0 in stage 0.0 (TID 
21, localhost): java.io.IOException: No such file or directory
java.io.UnixFileSystem.createFileExclusively(Native Method)
java.io.File.createNewFile(File.java:1006)
java.io.File.createTempFile(File.java:1989)
org.apache.spark.util.Utils$.fetchFile(Utils.scala:335)

org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:342)

org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:340)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)

scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
scala.collection.mutable.HashMap.foreach(HashMap.scala:98)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:340)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:682)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1359)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

> SPARK_LOCAL_DIRS may cause problems when running in local mode
> --
>
> Key: SPARK-2975
> URL: https://issues.apache.org/jira/browse/SPARK-2975
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Josh Rosen
>Priority: Critical
>
> If we're running Spark in local mode and {{SPARK_LOCAL_DIRS}} is set, the 
> {{Executor}} modifies SparkConf so that this value overrides 
> {{spark.local.dir}}.  Normally, this is safe because the modification takes 
> place before SparkEnv is created.  In local mode, the Executor uses an 
> existing SparkEnv rather than creating a new one, so it winds up with a 
> DiskBlockManager that created local directories with the original 
> {{spark.local.dir}} setting, but other components attempt to use directories 
> specified in the _new_ {{spark.local.dir}}, leading to problems.

[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093468#comment-14093468
 ] 

Sean Owen commented on SPARK-1297:
--

Yes I think you'd need to reflect that in changes to the build instructions. 
They are under docs/

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-1297-v2.txt, spark-1297-v4.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093466#comment-14093466
 ] 

Ted Yu commented on SPARK-1297:
---

w.r.t. the build: by default, hbase-hadoop1 would be used.
If the user specifies any of the hadoop-2 profiles, hbase-hadoop2 should be 
specified as well.

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-1297-v2.txt, spark-1297-v4.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1065) PySpark runs out of memory with large broadcast variables

2014-08-11 Thread Vlad Frolov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093413#comment-14093413
 ] 

Vlad Frolov commented on SPARK-1065:


I am facing the same issue in my project, where I use PySpark. As proof that 
the big objects I have could easily fit into the nodes' memory, I am going to 
use the dummy solution of saving my big objects to HDFS and loading them on the 
Python nodes.

Does anybody have an idea how to fix the issue in a better way? I don't have 
enough Scala or Java knowledge to fix this in Spark core. However, I feel like 
broadcast variables could be reimplemented on the Python side, though it seems 
a bit dangerous because we don't want separate implementations of one thing in 
both languages. That would also save memory, because while we use broadcasts 
through Scala we have 1 copy in the JVM, 1 pickled copy in Python and 1 
constructed object copy in Python.

> PySpark runs out of memory with large broadcast variables
> -
>
> Key: SPARK-1065
> URL: https://issues.apache.org/jira/browse/SPARK-1065
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.7.3, 0.8.1, 0.9.0
>Reporter: Josh Rosen
>
> PySpark's driver components may run out of memory when broadcasting large 
> variables (say 1 gigabyte).
> Because PySpark's broadcast is implemented on top of Java Spark's broadcast 
> by broadcasting a pickled Python as a byte array, we may be retaining 
> multiple copies of the large object: a pickled copy in the JVM and a 
> deserialized copy in the Python driver.
> The problem could also be due to memory requirements during pickling.
> PySpark is also affected by broadcast variables not being garbage collected.  
> Adding an unpersist() method to broadcast variables may fix this: 
> https://github.com/apache/incubator-spark/pull/543.
> As a first step to fixing this, we should write a failing test to reproduce 
> the error.
> This was discovered by [~sandy]: ["trouble with broadcast variables on 
> pyspark"|http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-broadcast-variables-on-pyspark-tp1301.html].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093295#comment-14093295
 ] 

Apache Spark commented on SPARK-2931:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1896

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: scala-sort-by-key.err, test.patch
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2891) Daemon failed to launch worker

2014-08-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093265#comment-14093265
 ] 

Davies Liu edited comment on SPARK-2891 at 8/11/14 8:45 PM:


duplicated to 2898 https://issues.apache.org/jira/browse/SPARK-2898


was (Author: davies):
duplicated to 2898

> Daemon failed to launch worker
> --
>
> Key: SPARK-2891
> URL: https://issues.apache.org/jira/browse/SPARK-2891
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Priority: Critical
> Fix For: 1.1.0
>
>
> daviesliu@dm:~/work/spark-perf$ /Users/daviesliu/work/spark/bin/spark-submit 
> --master spark://dm:7077 pyspark-tests/tests.py SchedulerThroughputTest 
> --num-tasks=1 --num-trials=4 --inter-trial-wait=1
> 14/08/06 17:58:04 WARN JettyUtils: Failed to create UI on port 4040. Trying 
> again on port 4041. - Failure(java.net.BindException: Address already in use)
> Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily 
> unavailable
> 14/08/06 17:59:25 ERROR Executor: Exception in task 9777.0 in stage 1.0 (TID 
> 19777)
> java.lang.IllegalStateException: Python daemon failed to launch worker
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily 
> unavailable
> 14/08/06 17:59:25 ERROR Executor: Exception in task 9781.0 in stage 1.0 (TID 
> 19781)
> java.lang.IllegalStateException: Python daemon failed to launch worker
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/08/06 17:59:25 WARN TaskSetManager: Lost task 9777.0 in stage 1.0 (TID 
> 19777, localhost): java.lang.IllegalStateException: Python daemon failed to 
> launch worker
> 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71)
> 
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
> 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
> 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
> org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
>   

[jira] [Resolved] (SPARK-2891) Daemon failed to launch worker

2014-08-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-2891.
---

   Resolution: Duplicate
Fix Version/s: 1.1.0

duplicated to 2898

> Daemon failed to launch worker
> --
>
> Key: SPARK-2891
> URL: https://issues.apache.org/jira/browse/SPARK-2891
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Priority: Critical
> Fix For: 1.1.0
>
>
> daviesliu@dm:~/work/spark-perf$ /Users/daviesliu/work/spark/bin/spark-submit 
> --master spark://dm:7077 pyspark-tests/tests.py SchedulerThroughputTest 
> --num-tasks=1 --num-trials=4 --inter-trial-wait=1
> 14/08/06 17:58:04 WARN JettyUtils: Failed to create UI on port 4040. Trying 
> again on port 4041. - Failure(java.net.BindException: Address already in use)
> Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily 
> unavailable
> 14/08/06 17:59:25 ERROR Executor: Exception in task 9777.0 in stage 1.0 (TID 
> 19777)
> java.lang.IllegalStateException: Python daemon failed to launch worker
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Daemon failed to fork PySpark worker: [Errno 35] Resource temporarily 
> unavailable
> 14/08/06 17:59:25 ERROR Executor: Exception in task 9781.0 in stage 1.0 (TID 
> 19781)
> java.lang.IllegalStateException: Python daemon failed to launch worker
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/08/06 17:59:25 WARN TaskSetManager: Lost task 9777.0 in stage 1.0 (TID 
> 19777, localhost): java.lang.IllegalStateException: Python daemon failed to 
> launch worker
> 
> org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:71)
> 
> org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83)
> 
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82)
> 
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55)
> org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101)
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Thread

[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor

2014-08-11 Thread Jim Blomo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093219#comment-14093219
 ] 

Jim Blomo commented on SPARK-1284:
--

I will try to reproduce on the 1.1 branch later this week, thanks for the 
update!

> pyspark hangs after IOError on Executor
> ---
>
> Key: SPARK-1284
> URL: https://issues.apache.org/jira/browse/SPARK-1284
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Jim Blomo
>Assignee: Davies Liu
>
> When running a reduceByKey over a cached RDD, Python fails with an exception, 
> but the failure is not detected by the task runner.  Spark and the pyspark 
> shell hang waiting for the task to finish.
> The error is:
> {code}
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/home/hadoop/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/hadoop/spark/python/pyspark/serializers.py", line 182, in 
> dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/home/hadoop/spark/python/pyspark/serializers.py", line 118, in 
> dump_stream
> self._write_with_length(obj, stream)
>   File "/home/hadoop/spark/python/pyspark/serializers.py", line 130, in 
> _write_with_length
> stream.write(serialized)
> IOError: [Errno 104] Connection reset by peer
> 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 
> 4257 bytes in 47 ms
> Traceback (most recent call last):
>   File "/home/hadoop/spark/python/pyspark/daemon.py", line 117, in 
> launch_worker
> worker(listen_sock)
>   File "/home/hadoop/spark/python/pyspark/daemon.py", line 107, in worker
> outfile.flush()
> IOError: [Errno 32] Broken pipe
> {code}
> I can reproduce the error by running take(10) on the cached RDD before 
> running reduceByKey (which looks at the whole input file).
> Affects Version 1.0.0-SNAPSHOT (4d88030486)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2420) Dependency changes for compatibility with Hive

2014-08-11 Thread Brock Noland (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brock Noland updated SPARK-2420:


Labels: Hive  (was: )

> Dependency changes for compatibility with Hive
> --
>
> Key: SPARK-2420
> URL: https://issues.apache.org/jira/browse/SPARK-2420
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Xuefu Zhang
>  Labels: Hive
> Attachments: spark_1.0.0.patch
>
>
> During the prototyping of HIVE-7292, many library conflicts showed up because 
> the Spark build contains versions of libraries that are vastly different from 
> the current major Hadoop version. It would be nice if we could choose versions 
> that are in line with Hadoop, or shade them in the assembly. Here is the wish 
> list:
> 1. Upgrade the protobuf version to 2.5.0 from the current 2.4.1.
> 2. Shade Spark's jetty and servlet dependencies in the assembly.
> 3. The guava version difference. Spark is using a higher version. I'm not sure 
> what the best solution for this is.
> The list may grow as HIVE-7292 proceeds.
> For information only, the attached is a patch that we applied on Spark in 
> order to make Spark work with Hive. It gives an idea of the scope of changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2976) There are too many tabs in some source files

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093175#comment-14093175
 ] 

Apache Spark commented on SPARK-2976:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1895

> There are too many tabs in some source files
> 
>
> Key: SPARK-2976
> URL: https://issues.apache.org/jira/browse/SPARK-2976
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> Currently, there are tabs in some source files, which does not conform to the 
> coding style.
> I saw that the following 3 files contain tabs:
> * sorttable.js
> * JavaPageRank.java
> * JavaKinesisWordCountASL.java



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2101) Python unit tests fail on Python 2.6 because of lack of unittest.skipIf()

2014-08-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2101.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

> Python unit tests fail on Python 2.6 because of lack of unittest.skipIf()
> -
>
> Key: SPARK-2101
> URL: https://issues.apache.org/jira/browse/SPARK-2101
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Uri Laserson
>Assignee: Josh Rosen
> Fix For: 1.1.0
>
>
> PySpark tests fail with Python 2.6 because they currently depend on 
> {{unittest.skipIf}}, which was only introduced in Python 2.7.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2977) Fix handling of short shuffle manager names in ShuffleBlockManager

2014-08-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-2977:
-

 Summary: Fix handling of short shuffle manager names in 
ShuffleBlockManager
 Key: SPARK-2977
 URL: https://issues.apache.org/jira/browse/SPARK-2977
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Josh Rosen


Since we allow short names for {{spark.shuffle.manager}}, all code that reads 
that configuration property should be prepared to handle the short names.

See my comment at 
https://github.com/apache/spark/pull/1799#discussion_r16029607 (opening this as 
a JIRA so we don't forget to fix it).
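
As a sketch of the kind of normalization the reading code would need (the
mapping below mirrors the built-in managers; treat the helper itself as
illustrative, not the actual fix):

{code}
import org.apache.spark.SparkConf

// Sketch: resolve short spark.shuffle.manager values to class names so that
// "hash"/"sort" and fully-qualified names are handled uniformly wherever the
// property is read.
val shortShuffleMgrNames = Map(
  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")

def resolveShuffleManagerClass(conf: SparkConf): String = {
  val name = conf.get("spark.shuffle.manager", "hash")
  shortShuffleMgrNames.getOrElse(name.toLowerCase, name)
}
{code}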



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2910) Test with Python 2.6 on Jenkins

2014-08-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2910.
---


> Test with Python 2.6 on Jenkins
> ---
>
> Key: SPARK-2910
> URL: https://issues.apache.org/jira/browse/SPARK-2910
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, PySpark
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.1.0
>
>
> As long as we continue to support Python 2.6 in PySpark, Jenkins should test  
> with Python 2.6.
> We could downgrade the system Python to 2.6, but it might be easier / cleaner 
> to install 2.6 alongside the current Python and {{export 
> PYSPARK_PYTHON=python2.6}} in the test runner script.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2954) PySpark MLlib serialization tests fail on Python 2.6

2014-08-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2954.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

> PySpark MLlib serialization tests fail on Python 2.6
> 
>
> Key: SPARK-2954
> URL: https://issues.apache.org/jira/browse/SPARK-2954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.1.0
>
>
> The PySpark MLlib tests currently fail on Python 2.6 due to problems 
> unpacking data from bytearray using struct.unpack:
> {code}
> **
> File "pyspark/mllib/_common.py", line 181, in __main__._deserialize_double
> Failed example:
> _deserialize_double(_serialize_double(1L)) == 1.0
> Exception raised:
> Traceback (most recent call last):
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py",
>  line 1253, in __run
> compileflags, 1) in test.globs
>   File "", line 1, in 
> _deserialize_double(_serialize_double(1L)) == 1.0
>   File "pyspark/mllib/_common.py", line 194, in _deserialize_double
> return struct.unpack("d", ba[offset:])[0]
> error: unpack requires a string argument of length 8
> **
> File "pyspark/mllib/_common.py", line 184, in __main__._deserialize_double
> Failed example:
> _deserialize_double(_serialize_double(sys.float_info.max)) == x
> Exception raised:
> Traceback (most recent call last):
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py",
>  line 1253, in __run
> compileflags, 1) in test.globs
>   File "", line 1, in 
> _deserialize_double(_serialize_double(sys.float_info.max)) == x
>   File "pyspark/mllib/_common.py", line 194, in _deserialize_double
> return struct.unpack("d", ba[offset:])[0]
> error: unpack requires a string argument of length 8
> **
> File "pyspark/mllib/_common.py", line 187, in __main__._deserialize_double
> Failed example:
> _deserialize_double(_serialize_double(sys.float_info.max)) == y
> Exception raised:
> Traceback (most recent call last):
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py",
>  line 1253, in __run
> compileflags, 1) in test.globs
>   File "", line 1, in 
> _deserialize_double(_serialize_double(sys.float_info.max)) == y
>   File "pyspark/mllib/_common.py", line 194, in _deserialize_double
> return struct.unpack("d", ba[offset:])[0]
> error: unpack requires a string argument of length 8
> **
> {code}
> It looks like one solution is to wrap the {{bytearray}} with {{buffer()}}: 
> http://stackoverflow.com/a/15467046/590203



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2948) PySpark doesn't work on Python 2.6

2014-08-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2948.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

> PySpark doesn't work on Python 2.6
> --
>
> Key: SPARK-2948
> URL: https://issues.apache.org/jira/browse/SPARK-2948
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: CentOS 6.5 / Python 2.6.6
>Reporter: Kousuke Saruta
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
>
> In serializers.py, collections.namedtuple is redefined as follows.
> {code}
> def namedtuple(name, fields, verbose=False, rename=False):
>     cls = _old_namedtuple(name, fields, verbose, rename)
>     return _hack_namedtuple(cls)
> {code}
> The number of arguments is 4, but namedtuple for Python 2.6 takes only 3 
> arguments, so a mismatch occurs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2700) Hidden files (such as .impala_insert_staging) should be filtered out by sqlContext.parquetFile

2014-08-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093150#comment-14093150
 ] 

Yin Huai commented on SPARK-2700:
-

Can we resolve it?

> Hidden files (such as .impala_insert_staging) should be filtered out by 
> sqlContext.parquetFile
> --
>
> Key: SPARK-2700
> URL: https://issues.apache.org/jira/browse/SPARK-2700
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Teng Qiu
> Fix For: 1.1.0
>
>
> when creating a table in impala, a hidden folder .impala_insert_staging will 
> be created in the table's folder.
> if we want to load such a table using the Spark SQL API sqlContext.parquetFile, 
> this hidden folder causes trouble: spark tries to read metadata from it, and 
> you will see the exception:
> {code:borderStyle=solid}
> Caused by: java.io.IOException: Could not read footer for file 
> FileStatus{path=hdfs://xxx:8020/user/hive/warehouse/parquet_strings/.impala_insert_staging;
>  isDirectory=true; modification_time=1406333729252; access_time=0; 
> owner=hdfs; group=hdfs; permission=rwxr-xr-x; isSymlink=false}
> ...
> ...
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path is 
> not a file: /user/hive/warehouse/parquet_strings/.impala_insert_staging
> {code}
> and the impala side does not think this is their problem: 
> https://issues.cloudera.org/browse/IMPALA-837 (IMPALA-837 Delete 
> .impala_insert_staging directory after INSERT)
> so maybe we should filter out these hidden folders/files when reading parquet 
> tables
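
A minimal sketch of such a filter, assuming it would be applied wherever the
table directory is listed for metadata (the helper and its placement are
illustrative only, not the actual Spark code):

{code}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Sketch: skip hidden entries such as .impala_insert_staging or _SUCCESS when
// collecting Parquet footers/metadata from a table directory.
def listDataFiles(fs: FileSystem, tableDir: Path): Array[FileStatus] = {
  fs.listStatus(tableDir).filterNot { status =>
    val name = status.getPath.getName
    name.startsWith(".") || name.startsWith("_")
  }
}
{code}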



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2790) PySpark zip() doesn't work properly if RDDs have different serializers

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093133#comment-14093133
 ] 

Apache Spark commented on SPARK-2790:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1894

> PySpark zip() doesn't work properly if RDDs have different serializers
> --
>
> Key: SPARK-2790
> URL: https://issues.apache.org/jira/browse/SPARK-2790
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>Priority: Critical
>
> In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have 
> different serializers (e.g. batched vs. unbatched), even if those RDDs have 
> the same number of partitions and same numbers of elements.  This problem 
> occurs in the MLlib Python APIs, where we might want to zip a JavaRDD of 
> LabelledPoints with a JavaRDD of batch-serialized Python objects.
> This is problematic because whether zip() succeeds or errors depends on the 
> partitioning / batching strategy, and we don't want to surface the 
> serialization details to users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor

2014-08-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093137#comment-14093137
 ] 

Davies Liu commented on SPARK-1284:
---

[~jblomo], could you reproduce this on master or the 1.1 branch?

Maybe pyspark did not hang after this error message; the take() had finished 
successfully before the error message popped up. The noisy error messages have 
been fixed in PR https://github.com/apache/spark/pull/1625 

> pyspark hangs after IOError on Executor
> ---
>
> Key: SPARK-1284
> URL: https://issues.apache.org/jira/browse/SPARK-1284
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Jim Blomo
>Assignee: Davies Liu
>
> When running a reduceByKey over a cached RDD, Python fails with an exception, 
> but the failure is not detected by the task runner.  Spark and the pyspark 
> shell hang waiting for the task to finish.
> The error is:
> {code}
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "/home/hadoop/spark/python/pyspark/worker.py", line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/hadoop/spark/python/pyspark/serializers.py", line 182, in 
> dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/home/hadoop/spark/python/pyspark/serializers.py", line 118, in 
> dump_stream
> self._write_with_length(obj, stream)
>   File "/home/hadoop/spark/python/pyspark/serializers.py", line 130, in 
> _write_with_length
> stream.write(serialized)
> IOError: [Errno 104] Connection reset by peer
> 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 
> 4257 bytes in 47 ms
> Traceback (most recent call last):
>   File "/home/hadoop/spark/python/pyspark/daemon.py", line 117, in 
> launch_worker
> worker(listen_sock)
>   File "/home/hadoop/spark/python/pyspark/daemon.py", line 107, in worker
> outfile.flush()
> IOError: [Errno 32] Broken pipe
> {code}
> I can reproduce the error by running take(10) on the cached RDD before 
> running reduceByKey (which looks at the whole input file).
> Affects Version 1.0.0-SNAPSHOT (4d88030486)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2890) Spark SQL should allow SELECT with duplicated columns

2014-08-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093119#comment-14093119
 ] 

Yin Huai commented on SPARK-2890:
-

What are the semantics when you have columns with the same name?

> Spark SQL should allow SELECT with duplicated columns
> -
>
> Key: SPARK-2890
> URL: https://issues.apache.org/jira/browse/SPARK-2890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Jianshi Huang
>
> Spark reported error java.lang.IllegalArgumentException with messages:
> java.lang.IllegalArgumentException: requirement failed: Found fields with the 
> same name.
> at scala.Predef$.require(Predef.scala:233)
> at 
> org.apache.spark.sql.catalyst.types.StructType.(dataTypes.scala:317)
> at 
> org.apache.spark.sql.catalyst.types.StructType$.fromAttributes(dataTypes.scala:310)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToString(ParquetTypes.scala:306)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:83)
> at 
> org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
> at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:433)
> After trial and error, it seems to be caused by duplicated columns in my 
> select clause.
> I made the duplication on purpose so that my code parses correctly. I think we 
> should allow users to specify duplicated columns as return values.
> Jianshi



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2976) There are too many tabs in some source files

2014-08-11 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2976:
-

 Summary: There are too many tabs in some source files
 Key: SPARK-2976
 URL: https://issues.apache.org/jira/browse/SPARK-2976
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Minor


Currently, there are tabs in some source files, which does not conform to the 
coding style.

I saw that the following 3 files contain tabs:

* sorttable.js
* JavaPageRank.java
* JavaKinesisWordCountASL.java



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Summary: The description about building to use HiveServer and CLI is 
incomplete  (was: The description about building to use HiveServer and CLI is 
imcomplete)

> The description about building to use HiveServer and CLI is incomplete
> --
>
> Key: SPARK-2963
> URL: https://issues.apache.org/jira/browse/SPARK-2963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
> the -Phive-thriftserver option when building, but its description is incomplete.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2931:
---

Fix Version/s: (was: 1.1.0)

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: scala-sort-by-key.err, test.patch
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2931:
---

Target Version/s: 1.1.0

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: scala-sort-by-key.err, test.patch
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2975) SPARK_LOCAL_DIRS may cause problems when running in local mode

2014-08-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-2975:
-

 Summary: SPARK_LOCAL_DIRS may cause problems when running in local 
mode
 Key: SPARK-2975
 URL: https://issues.apache.org/jira/browse/SPARK-2975
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0
Reporter: Josh Rosen
Priority: Minor


If we're running Spark in local mode and {{SPARK_LOCAL_DIRS}} is set, the 
{{Executor}} modifies SparkConf so that this value overrides 
{{spark.local.dir}}.  Normally, this is safe because the modification takes 
place before SparkEnv is created.  In local mode, the Executor uses an existing 
SparkEnv rather than creating a new one, so it winds up with a DiskBlockManager 
that created local directories with the original {{spark.local.dir}} setting, 
but other components attempt to use directories specified in the _new_ 
{{spark.local.dir}}, leading to problems.

I discovered this issue while testing Spark 1.1.0-snapshot1, but I think it 
will also affect Spark 1.0 (haven't confirmed this, though).

(I posted some comments at 
https://github.com/apache/spark/pull/299#discussion-diff-15975800, but also 
opening this JIRA so this isn't forgotten.)
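
One possible shape of a fix, purely as a sketch (where such a guard would live,
and how local mode is detected, are assumptions):

{code}
import org.apache.spark.SparkConf

// Sketch: only let SPARK_LOCAL_DIRS override spark.local.dir when a fresh
// SparkEnv will be created; in local mode the existing SparkEnv's
// DiskBlockManager has already created directories under the old setting.
def applyLocalDirsOverride(conf: SparkConf, isLocal: Boolean): Unit = {
  if (!isLocal) {
    Option(System.getenv("SPARK_LOCAL_DIRS")).foreach { dirs =>
      conf.set("spark.local.dir", dirs)
    }
  }
}
{code}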



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck

2014-08-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2717:
---

Priority: Critical  (was: Major)

> BasicBlockFetchIterator#next should log when it gets stuck
> --
>
> Key: SPARK-2717
> URL: https://issues.apache.org/jira/browse/SPARK-2717
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
>Priority: Critical
>
> If this is stuck for a long time waiting for blocks, we should log what nodes 
> it is waiting for to help debugging. One way to do this is to call take() 
> with a timeout (e.g. 60 seconds) and when the timeout expires log a message 
> for the blocks it is still waiting for. This could all happen in a loop so 
> that the wait just restarts after the message is logged.
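
A minimal sketch of the take-with-timeout idea above (the queue type and names are 
placeholders, not BasicBlockFetchIterator's actual internals):

{code}
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Block with a timeout instead of blocking forever; each time the timeout
// expires, log which blocks are still outstanding, then keep waiting.
def nextResult[T <: AnyRef](results: LinkedBlockingQueue[T],
                            pendingBlocks: => Seq[String]): T = {
  var result = results.poll(60, TimeUnit.SECONDS)
  while (result == null) {
    println(s"Still waiting for blocks: ${pendingBlocks.mkString(", ")}")
    result = results.poll(60, TimeUnit.SECONDS)
  }
  result
}
{code}

In the real iterator the message would go through the usual logging facility rather 
than println.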



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck

2014-08-11 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2717:
---

Priority: Major  (was: Blocker)

> BasicBlockFetchIterator#next should log when it gets stuck
> --
>
> Key: SPARK-2717
> URL: https://issues.apache.org/jira/browse/SPARK-2717
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
>
> If this is stuck for a long time waiting for blocks, we should log what nodes 
> it is waiting for to help debugging. One way to do this is to call take() 
> with a timeout (e.g. 60 seconds) and when the timeout expires log a message 
> for the blocks it is still waiting for. This could all happen in a loop so 
> that the wait just restarts after the message is logged.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093018#comment-14093018
 ] 

Apache Spark commented on SPARK-1297:
-

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/1893

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-1297-v2.txt, spark-1297-v4.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2974) Utils.getLocalDir() may return non-existent spark.local.dir directory

2014-08-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-2974:
-

 Summary: Utils.getLocalDir() may return non-existent 
spark.local.dir directory
 Key: SPARK-2974
 URL: https://issues.apache.org/jira/browse/SPARK-2974
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Josh Rosen
Priority: Blocker


The patch for [SPARK-2324] modified Spark to ignore a certain number of invalid 
local directories.  Unfortunately, the {{Utils.getLocalDir()}} method returns 
the _first_ local directory from {{spark.local.dir}}, which might not exist.  
This can lead to confusing FileNotFound errors when executors attempt to fetch 
files. 

(I commented on this at 
https://github.com/apache/spark/pull/1274#issuecomment-51537965, but I'm 
opening a JIRA so we don't forget to fix it).
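
One way to express the kind of guard this calls for (a hypothetical helper, not the 
actual Utils.getLocalDir fix):

{code}
import java.io.File

// Return the first configured directory that actually exists, rather than
// blindly taking the first entry of spark.local.dir.
def firstExistingLocalDir(localDirSetting: String): Option[File] =
  localDirSetting.split(",").map(_.trim).map(new File(_)).find(_.isDirectory)

// e.g. firstExistingLocalDir("/mnt/spark,/mnt2/spark")
{code}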



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093012#comment-14093012
 ] 

Ted Yu commented on SPARK-1297:
---

https://github.com/apache/spark/pull/1893

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-1297-v2.txt, spark-1297-v4.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092988#comment-14092988
 ] 

Ted Yu commented on SPARK-1297:
---

The HBase client doesn't need to declare a dependency on hbase-hadoop1-compat or 
hbase-hadoop2-compat.

I can open a PR once there is positive feedback on the approach - I came from a 
project where reviews mostly happen on JIRA :-)

Can someone assign this issue to me?

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-1297-v2.txt, spark-1297-v4.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2973) Add a way to show tables without executing a job

2014-08-11 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2973:
-

 Summary: Add a way to show tables without executing a job
 Key: SPARK-2973
 URL: https://issues.apache.org/jira/browse/SPARK-2973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Aaron Davidson


Right now, sql("show tables").collect() will start a Spark job which shows up 
in the UI. There should be a way to get the list of tables without running a job.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2972) APPLICATION_COMPLETE not created in Python unless context explicitly stopped

2014-08-11 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2972:


 Summary: APPLICATION_COMPLETE not created in Python unless context 
explicitly stopped
 Key: SPARK-2972
 URL: https://issues.apache.org/jira/browse/SPARK-2972
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: Cloudera 5.1, yarn master on ubuntu precise
Reporter: Shay Rojansky


If you don't explicitly stop a SparkContext at the end of a Python application 
with sc.stop(), an APPLICATION_COMPLETE file isn't created and the job doesn't 
get picked up by the history server.

This can be easily reproduced with pyspark (but affects scripts as well).

The current workaround is to wrap the entire script with a try/finally and stop 
manually.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092967#comment-14092967
 ] 

Sean Owen commented on SPARK-1297:
--

I think you may want to open a PR rather than post patches. Code reviews happen 
on github.com

I see what you did there by triggering one or the other profile with the 
hbase.profile property. Yeah, that may be the least disruptive way to play 
this. But don't the profiles need to select the hadoop-compat module 
appropriate for Hadoop 1 vs Hadoop 2?

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-1297-v2.txt, spark-1297-v4.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-11 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092966#comment-14092966
 ] 

Josh Rosen commented on SPARK-2931:
---

Thanks for investigating and reproducing this issue.  Is someone planning to 
open a PR with a fix?  If not, I can probably do it later this afternoon, since 
this bug is a blocker for many of the spark-perf tests that I'm running.

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: scala-sort-by-key.err, test.patch
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-11 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-1297:
--

Attachment: spark-1297-v4.txt

Patch v4 adds two profiles to examples/pom.xml :

hbase-hadoop1 (default)
hbase-hadoop2

I verified that compilation passes with either profile active.

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Priority: Minor
> Attachments: spark-1297-v2.txt, spark-1297-v4.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Description: Currently, if we'd like to use HiveServer or CLI for SparkSQL, 
we need to use the -Phive-thriftserver option when building, but its description 
is incomplete.  (was: Currently, if we'd like to use HiveServer or CLI for 
SparkSQL, we need to use -Phive-thriftserver option when building but it's 
implicit.
I think we need to describe how to build.)

> The description about building to use HiveServer and CLI is incomplete
> --
>
> Key: SPARK-2963
> URL: https://issues.apache.org/jira/browse/SPARK-2963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
> the -Phive-thriftserver option when building, but its description is incomplete.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2963:
--

Summary: The description about building to use HiveServer and CLI is 
incomplete  (was: There no documentation about building to use HiveServer and 
CLI for SparkSQL)

> The description about building to use HiveServer and CLI is incomplete
> --
>
> Key: SPARK-2963
> URL: https://issues.apache.org/jira/browse/SPARK-2963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
> -Phive-thriftserver option when building but it's implicit.
> I think we need to describe how to build.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2963) The description about building to use HiveServer and CLI is incomplete

2014-08-11 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092894#comment-14092894
 ] 

Kousuke Saruta commented on SPARK-2963:
---

I've updated the title here and on GitHub.

> The description about building to use HiveServer and CLI is incomplete
> --
>
> Key: SPARK-2963
> URL: https://issues.apache.org/jira/browse/SPARK-2963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
> -Phive-thriftserver option when building but it's implicit.
> I think we need to describe how to build.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2963) There no documentation about building to use HiveServer and CLI for SparkSQL

2014-08-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092889#comment-14092889
 ] 

Cheng Lian edited comment on SPARK-2963 at 8/11/14 3:31 PM:


Actually [there 
is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server],
 but the Spark CLI part is incomplete. Would you mind updating the issue title 
and description? Thanks.


was (Author: lian cheng):
Actually [there 
is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server]
 but the Spark CLI part is incomplete. Would you mind to update the Issue title 
and description? Thanks.

> There no documentation about building to use HiveServer and CLI for SparkSQL
> 
>
> Key: SPARK-2963
> URL: https://issues.apache.org/jira/browse/SPARK-2963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
> -Phive-thriftserver option when building but it's implicit.
> I think we need to describe how to build.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2963) There no documentation about building to use HiveServer and CLI for SparkSQL

2014-08-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092889#comment-14092889
 ] 

Cheng Lian commented on SPARK-2963:
---

Actually [there 
is|https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md#running-the-thrift-jdbc-server]
 but the Spark CLI part is incomplete. Would you mind updating the issue title 
and description? Thanks.

> There no documentation about building to use HiveServer and CLI for SparkSQL
> 
>
> Key: SPARK-2963
> URL: https://issues.apache.org/jira/browse/SPARK-2963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> Currently, if we'd like to use HiveServer or CLI for SparkSQL, we need to use 
> -Phive-thriftserver option when building but it's implicit.
> I think we need to describe how to build.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2014-08-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092881#comment-14092881
 ] 

Thomas Graves commented on SPARK-2089:
--

Sandy, just wondering if you have an ETA on a fix for this?

> With YARN, preferredNodeLocalityData isn't honored 
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when 
> constructing a Spark context that will dictate where to request executor 
> containers.
> This is currently broken because of a race condition.  The Spark-YARN code 
> runs the user class and waits for it to start up a SparkContext.  During its 
> initialization, the SparkContext will create a YarnClusterScheduler, which 
> notifies a monitor in the Spark-YARN code that the SparkContext has started.  
> The Spark-YARN code then immediately fetches the preferredNodeLocationData from 
> the SparkContext and uses it to start requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, 
> setting preferredNodeLocationData comes after the rest of the initialization, 
> so, if the Spark-YARN code comes around quickly enough after being notified, 
> the data that's fetched is the empty, unset version.  This occurred during all 
> of my runs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092879#comment-14092879
 ] 

Kousuke Saruta commented on SPARK-2970:
---

[~liancheng] Thank you for pointing out my mistake. I've modified the description.

> spark-sql script ends with IOException when EventLogging is enabled
> ---
>
> Key: SPARK-2970
> URL: https://issues.apache.org/jira/browse/SPARK-2970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: CDH5.1.0 (Hadoop 2.3.0)
>Reporter: Kousuke Saruta
>
> When the spark-sql script is run with spark.eventLog.enabled set to true, it 
> ends with an IOException because FileLogger cannot create the 
> APPLICATION_COMPLETE file in HDFS.
> This is because the shutdown hook of SparkSQLCLIDriver is executed after the 
> shutdown hook of org.apache.hadoop.fs.FileSystem is executed.
> When spark.eventLog.enabled is true, the SparkSQLCLIDriver hook finally tries 
> to create a file to mark the application as finished, but the FileSystem hook 
> has already closed the FileSystem.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2970:
--

Description: 
When the spark-sql script is run with spark.eventLog.enabled set to true, it ends 
with an IOException because FileLogger cannot create the APPLICATION_COMPLETE 
file in HDFS.

This is because the shutdown hook of SparkSQLCLIDriver is executed after the 
shutdown hook of org.apache.hadoop.fs.FileSystem is executed.

When spark.eventLog.enabled is true, the SparkSQLCLIDriver hook finally tries to 
create a file to mark the application as finished, but the FileSystem hook has 
already closed the FileSystem.

  was:
When spark-sql script run with spark.eventLog.enabled set true, it ends with 
IOException because FileLogger can not create APPLICATION_COMPLETE file in HDFS.
I think it's because FIleSystem is closed by HiveSessionImplWithUGI.
It has a code as follows.

{code}
  public void close() throws HiveSQLException {
try {
acquire();
ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
cancelDelegationToken();
} finally {
  release();
  super.close();
}
  }
{code}

When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim 
which extends HadoopShimSecure.

HadoopShimSecure#closeAllForUGI is implemented as follows.

{code}
  @Override
  public void closeAllForUGI(UserGroupInformation ugi) {
try {
  FileSystem.closeAllForUGI(ugi);
} catch (IOException e) {
  LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
}
  }
{code}




> spark-sql script ends with IOException when EventLogging is enabled
> ---
>
> Key: SPARK-2970
> URL: https://issues.apache.org/jira/browse/SPARK-2970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: CDH5.1.0 (Hadoop 2.3.0)
>Reporter: Kousuke Saruta
>
> When the spark-sql script is run with spark.eventLog.enabled set to true, it 
> ends with an IOException because FileLogger cannot create the 
> APPLICATION_COMPLETE file in HDFS.
> This is because the shutdown hook of SparkSQLCLIDriver is executed after the 
> shutdown hook of org.apache.hadoop.fs.FileSystem is executed.
> When spark.eventLog.enabled is true, the SparkSQLCLIDriver hook finally tries 
> to create a file to mark the application as finished, but the FileSystem hook 
> has already closed the FileSystem.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092860#comment-14092860
 ] 

Cheng Lian commented on SPARK-2970:
---

[~sarutak] Would you mind updating the issue description? Otherwise it can be 
confusing for people who don't see your comments below. Thanks.

> spark-sql script ends with IOException when EventLogging is enabled
> ---
>
> Key: SPARK-2970
> URL: https://issues.apache.org/jira/browse/SPARK-2970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: CDH5.1.0 (Hadoop 2.3.0)
>Reporter: Kousuke Saruta
>
> When the spark-sql script is run with spark.eventLog.enabled set to true, it 
> ends with an IOException because FileLogger cannot create the 
> APPLICATION_COMPLETE file in HDFS.
> I think it's because the FileSystem is closed by HiveSessionImplWithUGI.
> It has the following code.
> {code}
>   public void close() throws HiveSQLException {
> try {
> acquire();
> ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
> cancelDelegationToken();
> } finally {
>   release();
>   super.close();
> }
>   }
> {code}
> When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim 
> which extends HadoopShimSecure.
> HadoopShimSecure#closeAllForUGI is implemented as follows.
> {code}
>   @Override
>   public void closeAllForUGI(UserGroupInformation ugi) {
> try {
>   FileSystem.closeAllForUGI(ugi);
> } catch (IOException e) {
>   LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1777) Pass "cached" blocks directly to disk if memory is not large enough

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092826#comment-14092826
 ] 

Apache Spark commented on SPARK-1777:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/1892

> Pass "cached" blocks directly to disk if memory is not large enough
> ---
>
> Key: SPARK-1777
> URL: https://issues.apache.org/jira/browse/SPARK-1777
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>Priority: Critical
> Fix For: 1.1.0
>
> Attachments: spark-1777-design-doc.pdf
>
>
> Currently in Spark we entirely unroll a partition and then check whether it 
> will cause us to exceed the storage limit. This has an obvious problem - if 
> the partition itself is enough to push us over the storage limit (and 
> eventually over the JVM heap), it will cause an OOM.
> This can happen in cases where a single partition is very large or when 
> someone is running examples locally with a small heap.
> https://github.com/apache/spark/blob/f6ff2a61d00d12481bfb211ae13d6992daacdcc2/core/src/main/scala/org/apache/spark/CacheManager.scala#L148
> We should think a bit about the most elegant way to fix this - it shares some 
> similarities with the external aggregation code.
> A simple idea is to periodically check the size of the buffer as we are 
> unrolling and see if we are over the memory limit. If we are we could prepend 
> the existing buffer to the iterator and write that entire thing out to disk.
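
A minimal sketch of the periodic-check idea from the last paragraph (illustrative 
only, not Spark's CacheManager; an element-count threshold stands in for a real 
memory estimate):

{code}
import scala.collection.mutable.ArrayBuffer

// Unroll incrementally; if the buffer grows past the limit, prepend what has
// been buffered to the remaining iterator and hand everything to a spill
// function instead of materializing the whole partition in memory.
def unrollOrSpill[T](values: Iterator[T], maxElements: Int)
                    (spill: Iterator[T] => Unit): Option[Seq[T]] = {
  val buffer = ArrayBuffer.empty[T]
  while (values.hasNext) {
    buffer += values.next()
    if (buffer.size > maxElements) {
      spill(buffer.iterator ++ values) // buffered elements first, then the rest
      return None                      // caller knows the partition went to disk
    }
  }
  Some(buffer)                         // fully unrolled in memory
}
{code}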



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark

2014-08-11 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092807#comment-14092807
 ] 

Mridul Muralidharan commented on SPARK-2962:


On further investigation:

a) The primary issue is a combination of SPARK-2089 and the current scheduler 
behavior for pendingTasksWithNoPrefs.
SPARK-2089 leads to very bad allocation of nodes - it particularly has an impact 
on bigger clusters.
It leaves a lot of blocks with no data-local or rack-local executors, causing 
them to end up in pendingTasksWithNoPrefs.

While loading data off DFS, when an executor is being scheduled, even though 
there might be rack-local schedules available for it (or, after waiting a while, 
data-local ones too - see (b) below), the current scheduler behavior schedules 
tasks from pendingTasksWithNoPrefs first, causing a large number of ANY tasks to 
be scheduled at the very onset.

The combination of these, with the lack of marginal alleviation via (b), is what 
caused the performance impact.

b) spark.scheduler.minRegisteredExecutorsRatio had not yet been used in the 
workload - using it might alleviate some of the non-deterministic waiting and 
ensure adequate executors are allocated! Thanks [~lirui]



> Suboptimal scheduling in spark
> --
>
> Key: SPARK-2962
> URL: https://issues.apache.org/jira/browse/SPARK-2962
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: All
>Reporter: Mridul Muralidharan
>
> In findTask, irrespective of the 'locality' specified, tasks in 
> pendingTasksWithNoPrefs are always scheduled with PROCESS_LOCAL.
> pendingTasksWithNoPrefs contains tasks which currently do not have any alive 
> locations - but which could gain them 'later': particularly relevant when the 
> Spark app is just coming up and containers are still being added.
> This causes a large number of non-node-local tasks to be scheduled, incurring 
> significant network transfers in the cluster when running with non-trivial 
> datasets.
> The comment "// Look for no-pref tasks after rack-local tasks since they can 
> run anywhere." in the method is misleading: locality levels start from 
> process_local down to any, so no-pref tasks get scheduled well before 
> rack-local ones.
> Also note that currentLocalityIndex is reset to the taskLocality returned by 
> this method - so returning PROCESS_LOCAL as the level will trigger wait times 
> again. (This was relevant before a recent change to the scheduler, and might be 
> again based on the resolution of this issue.)
> Found as part of writing a test for SPARK-2931.
>  
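
A toy model of the ordering being described (purely illustrative; the real logic 
lives in TaskSetManager.findTask and is more involved):

{code}
import scala.collection.mutable

object Locality extends Enumeration {
  val ProcessLocal, NodeLocal, RackLocal, AnyLocality = Value
}

val processLocal = mutable.Queue.empty[Int]
val noPrefs      = mutable.Queue(1, 2, 3)  // no alive preferred locations *yet*
val rackLocal    = mutable.Queue(4, 5)

// No-pref tasks are drained right after process-local ones and reported as
// PROCESS_LOCAL, so they launch before candidates that could soon become
// rack-local (or node-local) once more executors register.
def findTask(allowed: Locality.Value): Option[(Int, Locality.Value)] =
  if (processLocal.nonEmpty) Some((processLocal.dequeue(), Locality.ProcessLocal))
  else if (noPrefs.nonEmpty) Some((noPrefs.dequeue(), Locality.ProcessLocal))
  else if (allowed >= Locality.RackLocal && rackLocal.nonEmpty)
    Some((rackLocal.dequeue(), Locality.RackLocal))
  else None

// findTask(Locality.RackLocal) returns tasks 1, 2, 3 (as PROCESS_LOCAL) before 4 and 5.
{code}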



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2971) Orphaned YARN ApplicationMaster lingers forever

2014-08-11 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2971:


 Summary: Orphaned YARN ApplicationMaster lingers forever
 Key: SPARK-2971
 URL: https://issues.apache.org/jira/browse/SPARK-2971
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Python yarn client mode, Cloudera 5.1.0 on Ubuntu precise
Reporter: Shay Rojansky


We have cases where if CTRL-C is hit during a Spark job startup, a YARN 
ApplicationMaster is created but cannot connect to the driver (presumably 
because the driver has terminated). Once an AM enters this state it never exits 
it, and has to be manually killed in YARN.

Here's an excerpt from the AM logs:

{noformat}
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/yarn/nm/usercache/roji/filecache/40/spark-assembly-1.0.2-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/08/11 16:29:39 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/08/11 16:29:39 INFO SecurityManager: Changing view acls to: roji
14/08/11 16:29:39 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(roji)
14/08/11 16:29:40 INFO Slf4jLogger: Slf4jLogger started
14/08/11 16:29:40 INFO Remoting: Starting remoting
14/08/11 16:29:40 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkyar...@g024.grid.eaglerd.local:34075]
14/08/11 16:29:40 INFO RMProxy: Connecting to ResourceManager at 
master.grid.eaglerd.local/192.168.41.100:8030
14/08/11 16:29:40 INFO ExecutorLauncher: ApplicationAttemptId: 
appattempt_1407759736957_0014_01
14/08/11 16:29:40 INFO ExecutorLauncher: Registering the ApplicationMaster
14/08/11 16:29:40 INFO ExecutorLauncher: Waiting for Spark driver to be 
reachable.
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
14/08/11 16:29:40 ERROR ExecutorLauncher: Failed to connect to driver at 
master.grid.eaglerd.local:44911, retrying ...
{noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092710#comment-14092710
 ] 

Apache Spark commented on SPARK-2970:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1891

> spark-sql script ends with IOException when EventLogging is enabled
> ---
>
> Key: SPARK-2970
> URL: https://issues.apache.org/jira/browse/SPARK-2970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: CDH5.1.0 (Hadoop 2.3.0)
>Reporter: Kousuke Saruta
>
> When the spark-sql script is run with spark.eventLog.enabled set to true, it 
> ends with an IOException because FileLogger cannot create the 
> APPLICATION_COMPLETE file in HDFS.
> I think it's because the FileSystem is closed by HiveSessionImplWithUGI.
> It has the following code.
> {code}
>   public void close() throws HiveSQLException {
> try {
> acquire();
> ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
> cancelDelegationToken();
> } finally {
>   release();
>   super.close();
> }
>   }
> {code}
> When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim 
> which extends HadoopShimSecure.
> HadoopShimSecure#closeAllForUGI is implemented as follows.
> {code}
>   @Override
>   public void closeAllForUGI(UserGroupInformation ugi) {
> try {
>   FileSystem.closeAllForUGI(ugi);
> } catch (IOException e) {
>   LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092705#comment-14092705
 ] 

Kousuke Saruta commented on SPARK-2970:
---

I noticed it's not caused by the reason above.
It's caused by the shutdown hook of FileSystem.
I have already resolved it by making the shutdown hook that stops the 
SparkSQLContext run before the shutdown hook for FileSystem.
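
A sketch of that ordering fix, assuming Hadoop 2.x's priority-based 
ShutdownHookManager (FileSystem registers its own hook at priority 10); this is an 
illustration of the idea, not the actual patch:

{code}
import org.apache.hadoop.util.ShutdownHookManager

val finishEventLogHook = new Runnable {
  override def run(): Unit = {
    // stop the SQL context / SparkContext here so the event log gets its
    // APPLICATION_COMPLETE marker before the FileSystem is closed
  }
}
// A priority higher than FileSystem's makes this hook run first at shutdown.
ShutdownHookManager.get().addShutdownHook(finishEventLogHook, 50)
{code}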

> spark-sql script ends with IOException when EventLogging is enabled
> ---
>
> Key: SPARK-2970
> URL: https://issues.apache.org/jira/browse/SPARK-2970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: CDH5.1.0 (Hadoop 2.3.0)
>Reporter: Kousuke Saruta
>
> When the spark-sql script is run with spark.eventLog.enabled set to true, it 
> ends with an IOException because FileLogger cannot create the 
> APPLICATION_COMPLETE file in HDFS.
> I think it's because the FileSystem is closed by HiveSessionImplWithUGI.
> It has the following code.
> {code}
>   public void close() throws HiveSQLException {
> try {
> acquire();
> ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
> cancelDelegationToken();
> } finally {
>   release();
>   super.close();
> }
>   }
> {code}
> When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim 
> which extends HadoopShimSecure.
> HadoopShimSecure#closeAllForUGI is implemented as follows.
> {code}
>   @Override
>   public void closeAllForUGI(UserGroupInformation ugi) {
> try {
>   FileSystem.closeAllForUGI(ugi);
> } catch (IOException e) {
>   LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-11 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2970:
-

 Summary: spark-sql script ends with IOException when EventLogging 
is enabled
 Key: SPARK-2970
 URL: https://issues.apache.org/jira/browse/SPARK-2970
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: CDH5.1.0 (Hadoop 2.3.0)
Reporter: Kousuke Saruta


When the spark-sql script is run with spark.eventLog.enabled set to true, it ends 
with an IOException because FileLogger cannot create the APPLICATION_COMPLETE 
file in HDFS.
I think it's because the FileSystem is closed by HiveSessionImplWithUGI.
It has the following code.

{code}
  public void close() throws HiveSQLException {
try {
acquire();
ShimLoader.getHadoopShims().closeAllForUGI(sessionUgi);
cancelDelegationToken();
} finally {
  release();
  super.close();
}
  }
{code}

When using Hadoop 2.0+, ShimLoader.getHadoopShim above returns Hadoop23Shim 
which extends HadoopShimSecure.

HadoopShimSecure#closeAllForUGI is implemented as follows.

{code}
  @Override
  public void closeAllForUGI(UserGroupInformation ugi) {
try {
  FileSystem.closeAllForUGI(ugi);
} catch (IOException e) {
  LOG.error("Could not clean up file-system handles for UGI: " + ugi, e);
}
  }
{code}





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092677#comment-14092677
 ] 

Apache Spark commented on SPARK-2878:
-

User 'GrahamDennis' has created a pull request for this issue:
https://github.com/apache/spark/pull/1890

> Inconsistent Kryo serialisation with custom Kryo Registrator
> 
>
> Key: SPARK-2878
> URL: https://issues.apache.org/jira/browse/SPARK-2878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.2
> Environment: Linux RedHat EL 6, 4-node Spark cluster.
>Reporter: Graham Dennis
>
> The custom Kryo Registrator (a class with the 
> org.apache.spark.serializer.KryoRegistrator trait) is not used with every 
> Kryo instance created, and this causes inconsistent serialisation and 
> deserialisation.
> The Kryo Registrator is sometimes not used because of a ClassNotFound 
> exception that only occurs if it *isn't* the Worker thread (of an Executor) 
> that tries to create the KryoRegistrator.
> A complete description of the problem and a project reproducing the problem 
> can be found at https://github.com/GrahamDennis/spark-kryo-serialisation
> I have currently only tested this with Spark 1.0.0, but will try to test 
> against 1.0.2.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2878) Inconsistent Kryo serialisation with custom Kryo Registrator

2014-08-11 Thread Graham Dennis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092675#comment-14092675
 ] 

Graham Dennis commented on SPARK-2878:
--

I've created a pull request with work-in-progress changes that I'd like 
feedback on: https://github.com/apache/spark/pull/1890

> Inconsistent Kryo serialisation with custom Kryo Registrator
> 
>
> Key: SPARK-2878
> URL: https://issues.apache.org/jira/browse/SPARK-2878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.2
> Environment: Linux RedHat EL 6, 4-node Spark cluster.
>Reporter: Graham Dennis
>
> The custom Kryo Registrator (a class with the 
> org.apache.spark.serializer.KryoRegistrator trait) is not used with every 
> Kryo instance created, and this causes inconsistent serialisation and 
> deserialisation.
> The Kryo Registrator is sometimes not used because of a ClassNotFound 
> exception that only occurs if it *isn't* the Worker thread (of an Executor) 
> that tries to create the KryoRegistrator.
> A complete description of the problem and a project reproducing the problem 
> can be found at https://github.com/GrahamDennis/spark-kryo-serialisation
> I have currently only tested this with Spark 1.0.0, but will try to test 
> against 1.0.2.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092671#comment-14092671
 ] 

Apache Spark commented on SPARK-2969:
-

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1889

> Make ScalaReflection be able to handle MapType.containsNull and 
> MapType.valueContainsNull.
> --
>
> Key: SPARK-2969
> URL: https://issues.apache.org/jira/browse/SPARK-2969
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>
> Make {{ScalaReflection}} be able to handle types like:
> - Seq\[Int] as ArrayType(IntegerType, containsNull = false)
> - Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
> - Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
> - Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, 
> valueContainsNull = true)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.

2014-08-11 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-2969:
-

Description: 
Make {{ScalaReflection}} be able to handle types like:

- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull 
= true)

  was:
Make {{ScalaReflection}} be able to handle:

- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull 
= true)


> Make ScalaReflection be able to handle MapType.containsNull and 
> MapType.valueContainsNull.
> --
>
> Key: SPARK-2969
> URL: https://issues.apache.org/jira/browse/SPARK-2969
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>
> Make {{ScalaReflection}} be able to handle types like:
> - Seq\[Int] as ArrayType(IntegerType, containsNull = false)
> - Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
> - Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
> - Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, 
> valueContainsNull = true)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2969) Make ScalaReflection be able to handle MapType.containsNull and MapType.valueContainsNull.

2014-08-11 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2969:


 Summary: Make ScalaReflection be able to handle 
MapType.containsNull and MapType.valueContainsNull.
 Key: SPARK-2969
 URL: https://issues.apache.org/jira/browse/SPARK-2969
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin


Make {{ScalaReflection}} be able to handle:

- Seq\[Int] as ArrayType(IntegerType, containsNull = false)
- Seq\[java.lang.Integer] as ArrayType(IntegerType, containsNull = true)
- Map\[Int, Long] as MapType(IntegerType, LongType, valueContainsNull = false)
- Map\[Int, java.lang.Long] as MapType(IntegerType, LongType, valueContainsNull 
= true)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2968) Fix nullabilities of Explode.

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092645#comment-14092645
 ] 

Apache Spark commented on SPARK-2968:
-

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1888

> Fix nullabilities of Explode.
> -
>
> Key: SPARK-2968
> URL: https://issues.apache.org/jira/browse/SPARK-2968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> Output nullabilities of {{Explode}} could be determined by 
> {{ArrayType.containsNull}} or {{MapType.valueContainsNull}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2968) Fix nullabilities of Explode.

2014-08-11 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2968:


 Summary: Fix nullabilities of Explode.
 Key: SPARK-2968
 URL: https://issues.apache.org/jira/browse/SPARK-2968
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


Output nullabilities of {{Explode}} could be determined by 
{{ArrayType.containsNull}} or {{MapType.valueContainsNull}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2967) Several SQL unit test failed when sort-based shuffle is enabled

2014-08-11 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-2967:
--

 Summary: Several SQL unit test failed when sort-based shuffle is 
enabled
 Key: SPARK-2967
 URL: https://issues.apache.org/jira/browse/SPARK-2967
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Saisai Shao


Several SQLQuerySuite unit tests failed when sort-based shuffle is enabled. 
It seems the SQL tests use GenericMutableRow, which ultimately makes all entries 
in ExternalSorter's internal buffer refer to the same object because of the 
object's mutability. It seems the row should be copied when fed into 
ExternalSorter.
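
A plain-Scala illustration of the aliasing problem (MutableRow here is a stand-in 
class, not Spark's GenericMutableRow):

{code}
import scala.collection.mutable.ArrayBuffer

class MutableRow(var a: Int, var b: Int) {
  def copyRow(): MutableRow = new MutableRow(a, b)
  override def toString = s"($a,$b)"
}

val reused  = new MutableRow(0, 0)
val aliased = ArrayBuffer.empty[MutableRow]
val copied  = ArrayBuffer.empty[MutableRow]
for (i <- 1 to 3) {
  reused.a = i; reused.b = i
  aliased += reused           // every entry points at the same object: (3,3), (3,3), (3,3)
  copied  += reused.copyRow() // defensive copies keep (1,1), (2,2), (3,3)
}
{code}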

The errors are shown below; although there were many failures, I only pasted some 
of them:

{noformat}
 SQLQuerySuite:
 - SPARK-2041 column name equals tablename
 - SPARK-2407 Added Parser of SQL SUBSTR()
 - index into array
 - left semi greater than predicate
 - index into array of arrays
 - agg *** FAILED ***
   Results do not match for query:
   Aggregate ['a], ['a,SUM('b) AS c1#38]
UnresolvedRelation None, testData2, None
   
   == Analyzed Plan ==
   Aggregate [a#4], [a#4,SUM(CAST(b#5, LongType)) AS c1#38L]
SparkLogicalPlan (ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at 
mapPartitions at basicOperators.scala:215)
   
   == Physical Plan ==
   Aggregate false, [a#4], [a#4,SUM(PartialSum#40L) AS c1#38L]
Exchange (HashPartitioning [a#4], 200)
 Aggregate true, [a#4], [a#4,SUM(CAST(b#5, LongType)) AS PartialSum#40L]
  ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at 
basicOperators.scala:215
   
   == Results ==
   !== Correct Answer - 3 ==   == Spark Answer - 3 ==
   !Vector(1, 3)   [1,3]
   !Vector(2, 3)   [1,3]
   !Vector(3, 3)   [1,3] (QueryTest.scala:53)
 - aggregates with nulls
 - select *
 - simple select
 - sorting *** FAILED ***
   Results do not match for query:
   Sort ['a ASC,'b ASC]
Project [*]
 UnresolvedRelation None, testData2, None
   
   == Analyzed Plan ==
   Sort [a#4 ASC,b#5 ASC]
Project [a#4,b#5]
 SparkLogicalPlan (ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at 
mapPartitions at basicOperators.scala:215)
   
   == Physical Plan ==
   Sort [a#4 ASC,b#5 ASC], true
Exchange (RangePartitioning [a#4 ASC,b#5 ASC], 200)
 ExistingRdd [a#4,b#5], MapPartitionsRDD[7] at mapPartitions at 
basicOperators.scala:215
   
   == Results ==
   !== Correct Answer - 6 ==   == Spark Answer - 6 ==
   !Vector(1, 1)   [3,2]
   !Vector(1, 2)   [3,2]
   !Vector(2, 1)   [3,2]
   !Vector(2, 2)   [3,2]
   !Vector(3, 1)   [3,2]
   !Vector(3, 2)   [3,2] (QueryTest.scala:53)
 - limit
 - average
 - average overflow *** FAILED ***
   Results do not match for query:
   Aggregate ['b], [AVG('a) AS c0#90,'b]
UnresolvedRelation None, largeAndSmallInts, None
   
   == Analyzed Plan ==
   Aggregate [b#3], [AVG(CAST(a#2, LongType)) AS c0#90,b#3]
SparkLogicalPlan (ExistingRdd [a#2,b#3], MapPartitionsRDD[4] at 
mapPartitions at basicOperators.scala:215)
   
   == Physical Plan ==
   Aggregate false, [b#3], [(CAST(SUM(PartialSum#93L), DoubleType) / 
CAST(SUM(PartialCount#94L), DoubleType)) AS c0#90,b#3]
Exchange (HashPartitioning [b#3], 200)
 Aggregate true, [b#3], [b#3,COUNT(CAST(a#2, LongType)) AS 
PartialCount#94L,SUM(CAST(a#2, LongType)) AS PartialSum#93L]
  ExistingRdd [a#2,b#3], MapPartitionsRDD[4] at mapPartitions at 
basicOperators.scala:215
   
   == Results ==
   !== Correct Answer - 2 ==   == Spark Answer - 2 ==
   !Vector(2.0, 2) [2.147483645E9,1]
   !Vector(2.147483645E9, 1)   [2.147483645E9,1] (QueryTest.scala:53)
{noformat}





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib

2014-08-11 Thread Yu Ishikawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Ishikawa updated SPARK-2966:
---

Summary: Add an approximation algorithm for hierarchical clustering to 
MLlib  (was: Add an approximation algorithm for hierarchical clustering 
algorithm to MLlib)

> Add an approximation algorithm for hierarchical clustering to MLlib
> ---
>
> Key: SPARK-2966
> URL: https://issues.apache.org/jira/browse/SPARK-2966
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yu Ishikawa
>Priority: Minor
>
> A hierarchical clustering algorithm is a useful unsupervised learning method.
> Koga et al. proposed a highly scalable hierarchical clustering algorithm in 
> (1).
> I would like to implement this method.
> I suggest adding an approximate hierarchical clustering algorithm to MLlib.
> I'd like this to be assigned to me.
> h3. Reference
> # Fast agglomerative hierarchical clustering algorithm using 
> Locality-Sensitive Hashing
> http://dl.acm.org/citation.cfm?id=1266811



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2966) Add an approximation algorithm for hierarchical clustering algorithm to MLlib

2014-08-11 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-2966:
--

 Summary: Add an approximation algorithm for hierarchical 
clustering algorithm to MLlib
 Key: SPARK-2966
 URL: https://issues.apache.org/jira/browse/SPARK-2966
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor


A hierarchical clustering algorithm is a useful unsupervised learning method.
Koga et al. proposed a highly scalable hierarchical clustering algorithm in (1).
I would like to implement this method.

I suggest adding an approximate hierarchical clustering algorithm to MLlib.
I'd like this to be assigned to me.

h3. Reference

# Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive 
Hashing
http://dl.acm.org/citation.cfm?id=1266811



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2862) DoubleRDDFunctions.histogram() throws exception for some inputs

2014-08-11 Thread Chandan Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandan Kumar updated SPARK-2862:
-

Description: 
The histogram() method call throws an IndexOutOfBoundsException when the choice 
of bucketCount splits the value range into increments that are not exactly 
representable in floating point, e.g.

scala> val r = sc.parallelize(6 to 99)
r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at 
<console>:12

scala> r.histogram(9)
java.lang.IndexOutOfBoundsException: 9
at scala.collection.immutable.NumericRange.apply(NumericRange.scala:124)
at scala.collection.immutable.NumericRange$$anon$1.apply(NumericRange.scala:176)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:66)
at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:237)
at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54)
at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:241)
at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:105)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:249)
at scala.collection.AbstractTraversable.toArray(Traversable.scala:105)
at 
org.apache.spark.rdd.DoubleRDDFunctions.histogram(DoubleRDDFunctions.scala:116)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)

  was:
histogram method call throws the below stack trace when the choice of 
bucketCount partitions the RDD in irrational increments e.g. 

scala> val r = sc.parallelize(6 to 99)
r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at 
<console>:12

scala> r.histogram(9)
java.lang.IndexOutOfBoundsException: 9
at scala.collection.immutable.NumericRange.apply(NumericRange.scala:124)
at scala.collection.immutable.NumericRange$$anon$1.apply(NumericRange.scala:176)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:66)
at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:237)
at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54)
at scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:241)
at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:105)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:249)
at scala.collection.AbstractTraversable.toArray(Traversable.scala:105)
at 
org.apache.spark.rdd.DoubleRDDFunctions.histogram(DoubleRDDFunctions.scala:116)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
at $iwC$$iwC$$iwC.<init>(<console>:20)
at $iwC$$iwC.<init>(<console>:22)
at $iwC.<init>(<console>:24)
at <init>(<console>:26)


> DoubleRDDFunctions.histogram() throws exception for some inputs
> ---
>
> Key: SPARK-2862
> URL: https://issues.apache.org/jira/browse/SPARK-2862
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.0.1
> Environment: Scala version 2.9.2 (OpenJDK 64-Bit Server VM, Java 
> 1.7.0_55) running on Ubuntu 14.04
>Reporter: Chandan Kumar
>
> A histogram() call throws an IndexOutOfBoundsException when the chosen 
> bucketCount splits the RDD's range into non-integral increments, e.g. 
> scala> val r = sc.parallelize(6 to 99)
> r: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at 
> <console>:12
> scala> r.histogram(9)
> java.lang.IndexOutOfBoundsException: 9
> at scala.collection.immutable.NumericRange.apply(NumericRange.scala:124)
> at 
> scala.collection.immutable.NumericRange$$anon$1.apply(NumericRange.scala:176)
> at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:66)
> at scala.collection.IterableLike$class.copyToArray(IterableLike.scala:237)
> at scala.collection.AbstractIterable.copyToArray(Iterable.scala:54)
> at 
> scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:241)
> at scala.collection.AbstractTraversable.copyToArray(Traversable.scala:105)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:249)
> at scala.collection.AbstractTraversable.toArray(Traversable.scala:105)
> at 
> org.apache.spark.rdd.DoubleRDDFunctions.histogram(DoubleRDDFunctions.scala:116)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
> at $iwC$$iwC$$iwC.<init>(<console>:20)
> at $iwC$$iwC.<init>(<console>:22)
> at $iwC.<init>(<console>:24)
> at <init>(<console>:26)
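
Until this is fixed, a possible interim workaround is to build the bucket boundaries explicitly and pass them to the histogram(Array[Double]) overload, which avoids the NumericRange path taken by histogram(bucketCount). The sketch below mirrors a spark-shell session and assumes a running SparkContext `sc`; it is a workaround sketch, not the fix itself.

{noformat}
// Compute 9 bucket boundaries with plain Double arithmetic and hand them to
// histogram(Array[Double]) instead of relying on histogram(bucketCount).
val r = sc.parallelize(6 to 99).map(_.toDouble)

val bucketCount = 9
val lo = r.min()                                   // 6.0 for this RDD
val hi = r.max()                                   // 99.0 for this RDD
val step = (hi - lo) / bucketCount
val buckets = (0 to bucketCount).map(i => lo + i * step).toArray
buckets(bucketCount) = hi                          // pin the last boundary against FP drift

val counts = r.histogram(buckets)                  // Array[Long] with 9 bucket counts
{noformat}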



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2965) Fix HashOuterJoin output nullabilities.

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092584#comment-14092584
 ] 

Apache Spark commented on SPARK-2965:
-

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1887

> Fix HashOuterJoin output nullabilities.
> ---
>
> Key: SPARK-2965
> URL: https://issues.apache.org/jira/browse/SPARK-2965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> Output attributes from the opposite (non-preserved) side of an {{OuterJoin}} should be nullable.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2965) Fix HashOuterJoin output nullabilities.

2014-08-11 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-2965:


 Summary: Fix HashOuterJoin output nullabilities.
 Key: SPARK-2965
 URL: https://issues.apache.org/jira/browse/SPARK-2965
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


Output attributes from the opposite (non-preserved) side of an {{OuterJoin}} should be nullable.
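
To illustrate the intended behavior, here is a minimal self-contained Scala sketch. The {{Attr}} case class and {{hashOuterJoinOutput}} helper are hypothetical stand-ins, not the real Catalyst classes: for a left outer join the right-side attributes become nullable, for a right outer join the left-side ones, and for a full outer join both.

{noformat}
// Hypothetical schema attribute: just a name plus a nullability flag.
case class Attr(name: String, nullable: Boolean)

// Mark the null-supplying side(s) of an outer join as nullable.
def hashOuterJoinOutput(joinType: String,
                        left: Seq[Attr],
                        right: Seq[Attr]): Seq[Attr] = joinType match {
  case "LeftOuter"  => left ++ right.map(_.copy(nullable = true))
  case "RightOuter" => left.map(_.copy(nullable = true)) ++ right
  case "FullOuter"  => (left ++ right).map(_.copy(nullable = true))
  case _            => left ++ right
}

// Example: a left outer join keeps the left schema as-is but makes the
// right-side column nullable.
// hashOuterJoinOutput("LeftOuter", Seq(Attr("id", false)), Seq(Attr("name", false)))
//   == Seq(Attr("id", false), Attr("name", true))
{noformat}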



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2964) Wrong silent option in spark-sql script

2014-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092515#comment-14092515
 ] 

Apache Spark commented on SPARK-2964:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1886

> Wrong silent option in spark-sql script
> ---
>
> Key: SPARK-2964
> URL: https://issues.apache.org/jira/browse/SPARK-2964
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In the spark-sql script, the -s option is handled as the silent option, but 
> org.apache.hadoop.hive.cli.OptionProcessor interprets -S (upper case) as the 
> silent-mode option.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


