[jira] [Updated] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal

2014-09-05 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3410:
--
Summary: The priority of shutdownhook for ApplicationMaster should not be 
integer literal  (was: The priority of shutdownhook for ApplicationMaster 
should not be integer literal, rather than refer constant.)

 The priority of shutdownhook for ApplicationMaster should not be integer 
 literal
 

 Key: SPARK-3410
 URL: https://issues.apache.org/jira/browse/SPARK-3410
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Minor

 In ApplicationMaster, the priority of the shutdown hook is set to the integer literal 30, which is expected to be higher than the priority of o.a.h.FileSystem's shutdown hook.
 In FileSystem, that priority is exposed as a public constant named SHUTDOWN_HOOK_PRIORITY, so I think it's better to refer to this constant when setting the priority of ApplicationMaster's shutdown hook.
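
For illustration, a minimal Scala sketch (not the actual Spark patch; the offset and the hook body are placeholders) of defining the priority relative to that constant via Hadoop's ShutdownHookManager:
{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

// Derive the ApplicationMaster hook priority from FileSystem.SHUTDOWN_HOOK_PRIORITY
// instead of a bare literal, so it keeps running before FileSystem's own hook.
val amShutdownPriority = FileSystem.SHUTDOWN_HOOK_PRIORITY + 20

ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // placeholder: unregister from the ResourceManager, clean up the staging dir, ...
  }
}, amShutdownPriority)
{code}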






[jira] [Created] (SPARK-3412) Add Missing Types for Row API

2014-09-05 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3412:


 Summary: Add Missing Types for Row API
 Key: SPARK-3412
 URL: https://issues.apache.org/jira/browse/SPARK-3412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Minor









[jira] [Commented] (SPARK-2491) When an OOM is thrown, the executor does not stop properly.

2014-09-05 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122579#comment-14122579
 ] 

Guoqiang Li commented on SPARK-2491:


The executor runs multiple tasks at the same time. After {{System.exit}} is called, [DiskBlockManager.scala#L144|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L144] will delete the Spark local dirs.
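
To make the race concrete, here is a generic, self-contained Scala sketch (not Spark code; the temp directory is made up, and the file name mirrors the one in the executor log below): a shutdown hook wipes a local dir while another task thread is still writing into it, which is how the surviving tasks end up with {{FileNotFoundException}}.
{code}
import java.io.{File, FileOutputStream}

// Made-up stand-in for a Spark local dir such as .../spark-local-.../2e
val localDir = new File(System.getProperty("java.io.tmpdir"), "spark-local-demo-2e")
localDir.mkdirs()

// Analogous in spirit to DiskBlockManager's shutdown hook: wipe the local dirs.
sys.addShutdownHook {
  Option(localDir.listFiles()).foreach(_.foreach(_.delete()))
  localDir.delete()
  Thread.sleep(500) // keep the JVM alive briefly so the racing writer fails visibly
}

// A second "task" that is still using the directory when exit is triggered.
new Thread(new Runnable {
  override def run(): Unit = {
    Thread.sleep(100)
    // Fails with java.io.FileNotFoundException once the hook has removed the dir.
    new FileOutputStream(new File(localDir, "merged_shuffle_4_85_0")).close()
  }
}).start()

System.exit(1) // one failing task takes the local dirs away from the others
{code}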


 When an OOM is thrown, the executor does not stop properly.
 --

 Key: SPARK-2491
 URL: https://issues.apache.org/jira/browse/SPARK-2491
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Guoqiang Li

 The executor log:
 {code}
 #
 # java.lang.OutOfMemoryError: Java heap space
 # -XX:OnOutOfMemoryError=kill %p
 #   Executing /bin/sh -c kill 44942...
 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
 SIGTERM
 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
 in thread Thread[Connection manager future execution context-6,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
 at 
 org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
 at 
 org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
 at 
 org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close()
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 -
 14/07/15 10:38:30 INFO Executor: Running task ID 969
 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally
 14/07/15 10:38:30 INFO HadoopRDD: Input split: 
 hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537
 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969
 java.io.FileNotFoundException: 
 /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0
  (No such file or directory)
 at java.io.FileOutputStream.open(Native Method)
 at 

[jira] [Created] (SPARK-3413) Spark Blocked due to Executor lost in FIFO MODE

2014-09-05 Thread Patrick Liu (JIRA)
Patrick Liu created SPARK-3413:
--

 Summary: Spark Blocked due to Executor lost in FIFO MODE
 Key: SPARK-3413
 URL: https://issues.apache.org/jira/browse/SPARK-3413
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.2
Reporter: Patrick Liu


I run Spark on YARN.
The Spark scheduler is running in FIFO mode.
I have 80 worker instances set up. However, as time passes, some workers will be lost (killed by the JVM on OOM, etc.).
But some tasks are still assigned to those executors, so those tasks will never finish.
Then the stage will not finish, and the later stages will be blocked.







[jira] [Comment Edited] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2014-09-05 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122611#comment-14122611
 ] 

Lukas Nalezenec edited comment on SPARK-3369 at 9/5/14 8:44 AM:


Hi, it looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2?

(BTW: I found the issue; I had to do a nasty workaround in my code.)


was (Author: lukas.nalezenec):
Hi, it looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2?

(BTW: I found the issue; I had to do a nasty workaround in my code.)
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/#comments

 Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.






[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2014-09-05 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122611#comment-14122611
 ] 

Lukas Nalezenec commented on SPARK-3369:


Hi, it looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2?

(BTW: I found the issue; I had to do a nasty workaround in my code.)
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/#comments

 Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122699#comment-14122699
 ] 

Alexander Ulanov commented on SPARK-3403:
-

I managed to compile OpenBLAS with MINGW64 and `USE_THREAD=0`, which gave me a single-threaded dll. With this dll my tests didn't fail and seemed to execute properly. Thank you for the suggestion!
1) Do you think the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS through breeze?
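
For reference, a minimal check (assuming the standard netlib-java API that breeze delegates to) of which BLAS/LAPACK backend was actually loaded before comparing runs:
{code}
// Prints e.g. com.github.fommil.netlib.NativeSystemBLAS when a native
// (OpenBLAS-backed) implementation is picked up, or com.github.fommil.netlib.F2jBLAS
// when it falls back to the pure-Java reference implementation.
import com.github.fommil.netlib.{BLAS, LAPACK}

println(BLAS.getInstance().getClass.getName)
println(LAPACK.getInstance().getClass.getName)
{code}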


 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
 MinGW64 precompiled dlls.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 (0xC0000005) after displaying the first prediction






[jira] [Comment Edited] (SPARK-2491) When an OOM is thrown, the executor does not stop properly.

2014-09-05 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122579#comment-14122579
 ] 

Guoqiang Li edited comment on SPARK-2491 at 9/5/14 9:00 AM:


The executor runs multiple tasks at the same time. When one task throws an exception, {{System.exit}} is called and [DiskBlockManager.scala#L144|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L144] deletes the Spark local dirs. The other tasks will then throw {{FileNotFoundException}}.


was (Author: gq):
The executor runs multiple tasks at the same time. After {{System.exit}} is called, [DiskBlockManager.scala#L144|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L144] will delete the Spark local dirs.


 When an OOM is thrown, the executor does not stop properly.
 --

 Key: SPARK-2491
 URL: https://issues.apache.org/jira/browse/SPARK-2491
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Guoqiang Li

 The executor log:
 {code}
 #
 # java.lang.OutOfMemoryError: Java heap space
 # -XX:OnOutOfMemoryError=kill %p
 #   Executing /bin/sh -c kill 44942...
 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
 SIGTERM
 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
 in thread Thread[Connection manager future execution context-6,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
 at 
 org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
 at 
 org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
 at 
 org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close()
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 -
 14/07/15 10:38:30 INFO Executor: Running task ID 969
 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally
 14/07/15 10:38:30 INFO HadoopRDD: Input split: 
 

[jira] [Created] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names

2014-09-05 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3414:
-

 Summary: Case insensitivity breaks when unresolved relation 
contains attributes with upper case letter in their names
 Key: SPARK-3414
 URL: https://issues.apache.org/jira/browse/SPARK-3414
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Critical


Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
SELECT name, message
FROM rawLogs
JOIN (
  SELECT name
  FROM logFiles
) files
ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
 Subquery files
  Project [name#2]
   LowerCaseSchema
Subquery logfiles
 SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased.

The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed once.

When {{srdd}} is registered as temporary table {{boom}}, its original 
(unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
attributes referenced in the join operator are not lowercased yet.

And then, when {{select * from boom}} is being analyzed, the input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, 
which is later discovered by rule {{ResolveRelations}}:
{code}
=== Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]Project [*]
! UnresolvedRelation None, boom, NoneLowerCaseSchema
! Subquery boom
!  Project ['name,'message]
!   Join Inner, Some(('rawLogs.filename 
= 'files.name))
!LowerCaseSchema
! Subquery rawlogs
!  SparkLogicalPlan (ExistingRdd 
[filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
basicOperators.scala:208)
!Subquery files
! Project ['name]
!  LowerCaseSchema
!   Subquery logfiles
!SparkLogicalPlan (ExistingRdd 
[name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure.

A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables.






[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names

2014-09-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3414:
--
Description: 
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
SELECT name, message
FROM rawLogs
JOIN (
  SELECT name
  FROM logFiles
) files
ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
 Subquery files
  Project [name#2]
   LowerCaseSchema
Subquery logfiles
 SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased.

The reason is that, during analysis phase, the 
{{CaseInsensitiveAttributeReferences}} batch is only executed before the 
{{Resolution}} batch.

When {{srdd}} is registered as temporary table {{boom}}, its original 
(unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
attributes referenced in the join operator are not lowercased yet.

And then, when {{select * from boom}} is being analyzed, the input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, 
which is later discovered by rule {{ResolveRelations}}:
{code}
=== Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]Project [*]
! UnresolvedRelation None, boom, NoneLowerCaseSchema
! Subquery boom
!  Project ['name,'message]
!   Join Inner, Some(('rawLogs.filename 
= 'files.name))
!LowerCaseSchema
! Subquery rawlogs
!  SparkLogicalPlan (ExistingRdd 
[filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
basicOperators.scala:208)
!Subquery files
! Project ['name]
!  LowerCaseSchema
!   Subquery logfiles
!SparkLogicalPlan (ExistingRdd 
[name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure.

A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables.

  was:
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
SELECT name, message
FROM rawLogs
JOIN (
  SELECT name
  FROM logFiles
) files
ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
MapPartitionsRDD[1] 

[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names

2014-09-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3414:
--
Description: 
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
SELECT name, message
FROM rawLogs
JOIN (
  SELECT name
  FROM logFiles
) files
ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
 Subquery files
  Project [name#2]
   LowerCaseSchema
Subquery logfiles
 SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased.

The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed once.

When {{srdd}} is registered as temporary table {{boom}}, its original 
(unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
attributes referenced in the join operator are not lowercased yet.

And then, when {{select * from boom}} is being analyzed, the input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, 
which is later discovered by rule {{ResolveRelations}}:
{code}
=== Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]Project [*]
! UnresolvedRelation None, boom, NoneLowerCaseSchema
! Subquery boom
!  Project ['name,'message]
!   Join Inner, Some(('rawLogs.filename 
= 'files.name))
!LowerCaseSchema
! Subquery rawlogs
!  SparkLogicalPlan (ExistingRdd 
[filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
basicOperators.scala:208)
!Subquery files
! Project ['name]
!  LowerCaseSchema
!   Subquery logfiles
!SparkLogicalPlan (ExistingRdd 
[name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure.

A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables.

  was:
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
SELECT name, message
FROM rawLogs
JOIN (
  SELECT name
  FROM logFiles
) files
ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
MapPartitionsRDD[1] at mapPartitions at 

[jira] [Comment Edited] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2014-09-05 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122611#comment-14122611
 ] 

Lukas Nalezenec edited comment on SPARK-3369 at 9/5/14 9:17 AM:


Hi, it looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2?

(BTW: I found the issue; I had to do a workaround in my code.)


was (Author: lukas.nalezenec):
Hi, it looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2?

(BTW: I found the issue; I had to do a nasty workaround in my code.)

 Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.






[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

2014-09-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3414:
--
Summary: Case insensitivity breaks when unresolved relation contains 
attributes with uppercase letters in their names  (was: Case insensitivity 
breaks when unresolved relation contains attributes with upper case letter in 
their names)

 Case insensitivity breaks when unresolved relation contains attributes with 
 uppercase letters in their names
 

 Key: SPARK-3414
 URL: https://issues.apache.org/jira/browse/SPARK-3414
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Critical

 Paste the following snippet to {{spark-shell}} (need Hive support) to 
 reproduce this issue:
 {code}
 import org.apache.spark.sql.hive.HiveContext
 val hiveContext = new HiveContext(sc)
 import hiveContext._
 case class LogEntry(filename: String, message: String)
 case class LogFile(name: String)
 sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
 sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")
 val srdd = sql(
   """
 SELECT name, message
 FROM rawLogs
 JOIN (
   SELECT name
   FROM logFiles
 ) files
 ON rawLogs.filename = files.name
   """)
 srdd.registerTempTable("boom")
 sql("select * from boom")
 {code}
 Exception thrown:
 {code}
 SchemaRDD[7] at RDD at SchemaRDD.scala:103
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
 attributes: *, tree:
 Project [*]
  LowerCaseSchema
   Subquery boom
Project ['name,'message]
 Join Inner, Some(('rawLogs.filename = name#2))
  LowerCaseSchema
   Subquery rawlogs
SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
 MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
  Subquery files
   Project [name#2]
LowerCaseSchema
 Subquery logfiles
  SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
 mapPartitions at basicOperators.scala:208)
 {code}
 Notice that {{rawLogs}} in the join operator is not lowercased.
 The reason is that, during analysis phase, the 
 {{CaseInsensitiveAttributeReferences}} batch is only executed before the 
 {{Resolution}} batch.
 When {{srdd}} is registered as temporary table {{boom}}, its original 
 (unanalyzed) logical plan is stored into the catalog:
 {code}
 Join Inner, Some(('rawLogs.filename = 'files.name))
  UnresolvedRelation None, rawLogs, None
  Subquery files
   Project ['name]
UnresolvedRelation None, logFiles, None
 {code}
 notice that attributes referenced in the join operator (esp. {{rawLogs}}) are not lowercased yet.
 And then, when {{select * from boom}} is being analyzed, its input logical plan is:
 {code}
 Project [*]
  UnresolvedRelation None, boom, None
 {code}
 here the unresolved relation points to the unanalyzed logical plan of 
 {{srdd}}, which is later discovered by rule {{ResolveRelations}}:
 {code}
 === Applying Rule 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
  Project [*]Project [*]
 ! UnresolvedRelation None, boom, NoneLowerCaseSchema
 ! Subquery boom
 !  Project ['name,'message]
 !   Join Inner, 
 Some(('rawLogs.filename = 'files.name))
 !LowerCaseSchema
 ! Subquery rawlogs
 !  SparkLogicalPlan (ExistingRdd 
 [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
 basicOperators.scala:208)
 !Subquery files
 ! Project ['name]
 !  LowerCaseSchema
 !   Subquery logfiles
 !SparkLogicalPlan 
 (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:208)
 {code}
 Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure.
 A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables.





[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names

2014-09-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3414:
--
Description: 
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
SELECT name, message
FROM rawLogs
JOIN (
  SELECT name
  FROM logFiles
) files
ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
 Subquery files
  Project [name#2]
   LowerCaseSchema
Subquery logfiles
 SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased.

The reason is that, during analysis phase, the 
{{CaseInsensitiveAttributeReferences}} batch is only executed before the 
{{Resolution}} batch.

When {{srdd}} is registered as temporary table {{boom}}, its original 
(unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
notice that attributes referenced in the join operator (esp. {{rawLogs}}) are not lowercased yet.

And then, when {{select * from boom}} is being analyzed, its input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, 
which is later discovered by rule {{ResolveRelations}}:
{code}
=== Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]Project [*]
! UnresolvedRelation None, boom, NoneLowerCaseSchema
! Subquery boom
!  Project ['name,'message]
!   Join Inner, Some(('rawLogs.filename 
= 'files.name))
!LowerCaseSchema
! Subquery rawlogs
!  SparkLogicalPlan (ExistingRdd 
[filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
basicOperators.scala:208)
!Subquery files
! Project ['name]
!  LowerCaseSchema
!   Subquery logfiles
!SparkLogicalPlan (ExistingRdd 
[name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the {{Resolution}} batch, the attribute referenced in the join operator ({{rawLogs}}) is not lowercased, which causes the resolution failure.

A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables.

  was:
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
SELECT name, message
FROM rawLogs
JOIN (
  SELECT name
  FROM logFiles
) files
ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd 

[jira] [Commented] (SPARK-3412) Add Missing Types for Row API

2014-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122729#comment-14122729
 ] 

Apache Spark commented on SPARK-3412:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/2284

 Add Missing Types for Row API
 -

 Key: SPARK-3412
 URL: https://issues.apache.org/jira/browse/SPARK-3412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Minor








[jira] [Comment Edited] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-05 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122699#comment-14122699
 ] 

Alexander Ulanov edited comment on SPARK-3403 at 9/5/14 9:53 AM:
-

I managed to compile OpenBLAS with MINGW64 and `USE_THREAD=0`, which gave me a single-threaded dll. With this dll my tests didn't fail and seemed to execute properly. Thank you for the suggestion!
1) Do you think the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS through breeze?
3) I didn't get any performance improvement with native libraries versus Java arrays. My matrices are of size up to 10K-20K. Is that expected?


was (Author: avulanov):
I managed to compile OpenBLAS with MINGW64 and `USE_THREAD=0`, which gave me a single-threaded dll. With this dll my tests didn't fail and seemed to execute properly. Thank you for the suggestion!
1) Do you think the same issue will remain on Linux?
2) What are the performance implications of using single-threaded OpenBLAS through breeze?


 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
 MinGW64 precompiled dlls.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 (0xC0000005) after displaying the first prediction






[jira] [Comment Edited] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2014-09-05 Thread Lukas Nalezenec (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122611#comment-14122611
 ] 

Lukas Nalezenec edited comment on SPARK-3369 at 9/5/14 11:59 AM:
-

Hi, it looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2? API users would only need to add x.iterator() to their code.

(BTW: I found the issue; I had to do a workaround in my code.)


was (Author: lukas.nalezenec):
Hi, it looks like a serious issue to me. How about breaking backward compatibility and making it right in Spark 1.2?

(BTW: I found the issue; I had to do a workaround in my code.)

 Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.






[jira] [Created] (SPARK-3415) Using sys.stderr in pyspark results in error

2014-09-05 Thread Ward Viaene (JIRA)
Ward Viaene created SPARK-3415:
--

 Summary: Using sys.stderr in pyspark results in error
 Key: SPARK-3415
 URL: https://issues.apache.org/jira/browse/SPARK-3415
 Project: Spark
  Issue Type: Bug
Reporter: Ward Viaene


Using sys.stderr in pyspark results in: 
  File "/home/spark-1.1/dist/python/pyspark/cloudpickle.py", line 660, in save_file
from ..transport.adapter import SerializingAdapter
ValueError: Attempted relative import beyond toplevel package

Code to reproduce (copy paste the code in pyspark):

import sys
  
class TestClass(object):
def __init__(self, out = sys.stderr):
self.out = out
def getOne(self):
return 'one'
  

def f():
print type(t)
return 'ok'

  
t = TestClass()
a = [ 1 , 2, 3, 4, 5 ]
b = sc.parallelize(a)
b.map(lambda x: f()).first()







[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2014-09-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122921#comment-14122921
 ] 

Sean Owen commented on SPARK-3369:
--

The API change is unlikely to happen. Making a bunch of flatMap2 methods is 
really ugly.

I suppose you could try wrapping the Iterator in this:

{code}
import java.util.Iterator;

public class IteratorIterable<T> implements Iterable<T> {

  private final Iterator<T> iterator;
  private boolean consumed;

  public IteratorIterable(Iterator<T> iterator) {
    this.iterator = iterator;
  }

  @Override
  public Iterator<T> iterator() {
    if (consumed) {
      throw new IllegalStateException("Iterator already consumed");
    }
    consumed = true;
    return iterator;
  }
}
{code}

If, as I suspect, Spark actually only calls iterator() once, this will work, 
and this may be the most tolerable workaround until Spark 2.x. If it doesn't 
work, and iterator() is called multiple times, this will fail fast and at least 
we'd know. Can you try something like this?

 Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency though, because this seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunction}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.






[jira] [Updated] (SPARK-3415) Using sys.stderr in pyspark results in error

2014-09-05 Thread Ward Viaene (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ward Viaene updated SPARK-3415:
---
Component/s: PySpark

 Using sys.stderr in pyspark results in error
 

 Key: SPARK-3415
 URL: https://issues.apache.org/jira/browse/SPARK-3415
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Ward Viaene
  Labels: python

 Using sys.stderr in pyspark results in: 
   File "/home/spark-1.1/dist/python/pyspark/cloudpickle.py", line 660, in save_file
 from ..transport.adapter import SerializingAdapter
 ValueError: Attempted relative import beyond toplevel package
 Code to reproduce (copy paste the code in pyspark):
 import sys
   
 class TestClass(object):
 def __init__(self, out = sys.stderr):
 self.out = out
 def getOne(self):
 return 'one'
   
 
 def f():
 print type(t)
 return 'ok'
 
   
 t = TestClass()
 a = [ 1 , 2, 3, 4, 5 ]
 b = sc.parallelize(a)
 b.map(lambda x: f()).first()






[jira] [Updated] (SPARK-3415) Using sys.stderr in pyspark results in error

2014-09-05 Thread Ward Viaene (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ward Viaene updated SPARK-3415:
---
Labels: python  (was: )

 Using sys.stderr in pyspark results in error
 

 Key: SPARK-3415
 URL: https://issues.apache.org/jira/browse/SPARK-3415
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Ward Viaene
  Labels: python

 Using sys.stderr in pyspark results in: 
   File "/home/spark-1.1/dist/python/pyspark/cloudpickle.py", line 660, in save_file
 from ..transport.adapter import SerializingAdapter
 ValueError: Attempted relative import beyond toplevel package
 Code to reproduce (copy paste the code in pyspark):
 import sys
   
 class TestClass(object):
 def __init__(self, out = sys.stderr):
 self.out = out
 def getOne(self):
 return 'one'
   
 
 def f():
 print type(t)
 return 'ok'
 
   
 t = TestClass()
 a = [ 1 , 2, 3, 4, 5 ]
 b = sc.parallelize(a)
 b.map(lambda x: f()).first()






[jira] [Updated] (SPARK-3415) Using sys.stderr in pyspark results in error

2014-09-05 Thread Ward Viaene (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ward Viaene updated SPARK-3415:
---
Affects Version/s: 1.1.0
   1.0.2

 Using sys.stderr in pyspark results in error
 

 Key: SPARK-3415
 URL: https://issues.apache.org/jira/browse/SPARK-3415
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2, 1.1.0
Reporter: Ward Viaene
  Labels: python

 Using sys.stderr in pyspark results in: 
   File "/home/spark-1.1/dist/python/pyspark/cloudpickle.py", line 660, in save_file
 from ..transport.adapter import SerializingAdapter
 ValueError: Attempted relative import beyond toplevel package
 Code to reproduce (copy paste the code in pyspark):
 import sys
   
 class TestClass(object):
 def __init__(self, out = sys.stderr):
 self.out = out
 def getOne(self):
 return 'one'
   
 
 def f():
 print type(t)
 return 'ok'
 
   
 t = TestClass()
 a = [ 1 , 2, 3, 4, 5 ]
 b = sc.parallelize(a)
 b.map(lambda x: f()).first()






[jira] [Commented] (SPARK-2430) Standarized Clustering Algorithm API and Framework

2014-09-05 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123005#comment-14123005
 ] 

Yu Ishikawa commented on SPARK-2430:


Hi [~rnowling], 

 The community had suggested looking into scikit-learn's API so that is a good 
 idea.
I agree with that idea.

 May I suggest you work on SPARK-2966 / SPARK-2429 first?
All right, I will try it!

Thanks





 Standarized Clustering Algorithm API and Framework
 --

 Key: SPARK-2430
 URL: https://issues.apache.org/jira/browse/SPARK-2430
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor

 Recently, there has been a chorus of voices on the mailing lists about adding 
 new clustering algorithms to MLlib.  To support these additions, we should 
 develop a common framework and API to reduce code duplication and keep the 
 APIs consistent.
 At the same time, we can also expand the current API to incorporate requested 
 features such as arbitrary distance metrics or pre-computed distance matrices.
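
 For discussion, one possible shape such a shared API could take, along the 
 scikit-learn lines mentioned in the comments; every name below is illustrative 
 and none of them exist in MLlib:
 {code}
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD

 // A common contract that each clustering algorithm could implement, so that
 // callers can swap algorithms (and distance metrics) without code changes.
 trait ClusteringModel extends Serializable {
   def predict(point: Vector): Int               // cluster index for a point
 }

 trait ClusteringAlgorithm[M <: ClusteringModel] extends Serializable {
   def run(data: RDD[Vector]): M                 // train on distributed data
 }
 {code}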



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3377) Don't mix metrics from different applications otherwise we cannot distinguish

2014-09-05 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3377:
--
Summary: Don't mix metrics from different applications otherwise we cannot 
distinguish  (was: Don't mix metrics from different applications)

 Don't mix metrics from different applications otherwise we cannot distinguish
 -

 Key: SPARK-3377
 URL: https://issues.apache.org/jira/browse/SPARK-3377
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Critical

 I'm using the Codahale-based MetricsSystem of Spark with JMX or Graphite, and I 
 see the following 2 problems.
 (1) When applications that have the same spark.app.name run on the cluster at the 
 same time, some metric names jumble up together, e.g. 
 SparkPi.DAGScheduler.stage.failedStages.
 (2) When 2+ executors run on the same machine, the JVM metrics of each executor 
 jumble together, e.g. the current implementation cannot tell which executor a 
 metric such as jvm.memory belongs to.
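
 A minimal sketch of one way to keep them apart (the naming scheme below is an 
 assumption, not the actual Spark change): qualify every metric with the 
 application id and the executor/driver id that are already known at registration 
 time.
 {code}
 // Sketch only: build a fully qualified metric name from ids we already have.
 def qualifiedMetricName(appId: String, executorId: String,
                         source: String, name: String): String =
   Seq(appId, executorId, source, name).mkString(".")

 // e.g. qualifiedMetricName("app-20140905-0001", "3", "jvm", "memory")
 //  =>  "app-20140905-0001.3.jvm.memory" instead of an ambiguous "jvm.memory"
 {code}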



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3260) Yarn - pass acls along with executor launch

2014-09-05 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3260.
--
  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.2.0  (was: 1.1.1, 1.2.0)

 Yarn - pass acls along with executor launch
 ---

 Key: SPARK-3260
 URL: https://issues.apache.org/jira/browse/SPARK-3260
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 1.2.0


 In https://github.com/apache/spark/pull/1196 I added passing the Spark view 
 and modify ACLs into YARN. Unfortunately we are only passing them into the 
 application master, and I missed passing them in when we launch individual 
 containers (executors). 
 We need to modify ExecutorRunnable.startContainer to set the ACLs in the 
 ContainerLaunchContext.
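
 A rough sketch of what that amounts to (this is not the actual patch; the helper 
 below is hypothetical, but the YARN types are real):
 {code}
 import scala.collection.JavaConverters._
 import org.apache.hadoop.yarn.api.records.{ApplicationAccessType, ContainerLaunchContext}

 // Hypothetical helper: attach the view/modify ACLs to the executor's
 // ContainerLaunchContext, mirroring what is already done for the AM.
 def setApplicationAcls(ctx: ContainerLaunchContext,
                        viewAcls: String, modifyAcls: String): Unit = {
   val acls = Map(
     ApplicationAccessType.VIEW_APP -> viewAcls,
     ApplicationAccessType.MODIFY_APP -> modifyAcls)
   ctx.setApplicationACLs(acls.asJava)
 }
 {code}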



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3375) spark on yarn container allocation issues

2014-09-05 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3375.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 spark on yarn container allocation issues
 -

 Key: SPARK-3375
 URL: https://issues.apache.org/jira/browse/SPARK-3375
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Blocker
 Fix For: 1.2.0


 It looks like if YARN doesn't get the containers immediately it stops asking 
 for them, and the YARN application hangs without ever getting any executors.  
 This was introduced by https://github.com/apache/spark/pull/2169



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib

2014-09-05 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123029#comment-14123029
 ] 

Yu Ishikawa commented on SPARK-2966:


Hi [~rnowling],

{quote}
Based on my reading of the Spark contribution guidelines ( 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), I 
think that the Spark community would prefer to have one good implementation of 
an algorithm instead of multiple similar algorithms.
{quote}
I got it. Thank you for letting me know.

{quote}
would you like to take the lead on the hierarchical clustering?
{quote}
I'd love to! I read Freeman's example code. I like it, so I will try 
to implement it, and I would like you to review it.

thanks




 Add an approximation algorithm for hierarchical clustering to MLlib
 ---

 Key: SPARK-2966
 URL: https://issues.apache.org/jira/browse/SPARK-2966
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor

 A hierarchical clustering algorithm is a useful unsupervised learning method.
 Koga et al. proposed a highly scalable hierarchical clustering algorithm in 
 (1).
 I would like to implement this method.
 I suggest adding an approximate hierarchical clustering algorithm to MLlib.
 I'd like this to be assigned to me.
 h3. Reference
 # Fast agglomerative hierarchical clustering algorithm using 
 Locality-Sensitive Hashing
 http://dl.acm.org/citation.cfm?id=1266811



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3416) Add matrix operations for large data set

2014-09-05 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-3416:
--

 Summary: Add matrix operations for large data set
 Key: SPARK-3416
 URL: https://issues.apache.org/jira/browse/SPARK-3416
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa


I think matrix operations for large data sets would be helpful. There is a 
method to multiply an RDD-based matrix by a local matrix (see the sketch below 
the list). However, there is no method to operate on an RDD-based matrix and 
another one.

- multiplication
- addition / subtraction
- power
- scalar
- multiply
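
For reference, the existing RDD-by-local multiplication referred to above looks 
roughly like this minimal sketch (assuming a spark-shell {{sc}}); the operations 
listed would extend this to two distributed matrices:
{code}
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val A = new RowMatrix(rows)                              // distributed 2x2 matrix
val B = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))  // local identity matrix
val C = A.multiply(B)                                    // result is still a RowMatrix
{code}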



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3417) Use of old-style classes in pyspark

2014-09-05 Thread Matthew Rocklin (JIRA)
Matthew Rocklin created SPARK-3417:
--

 Summary: Use of old-style classes in pyspark
 Key: SPARK-3417
 URL: https://issues.apache.org/jira/browse/SPARK-3417
 Project: Spark
  Issue Type: Bug
Reporter: Matthew Rocklin
Priority: Minor


pyspark seems to use old-style classes

class Foo:

These are relatively ancient and should be replaced by

class Foo(object):

Many newer libraries depend on this change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3417) Use of old-style classes in pyspark

2014-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123059#comment-14123059
 ] 

Apache Spark commented on SPARK-3417:
-

User 'mrocklin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2288

 Use of old-style classes in pyspark
 ---

 Key: SPARK-3417
 URL: https://issues.apache.org/jira/browse/SPARK-3417
 Project: Spark
  Issue Type: Bug
Reporter: Matthew Rocklin
Priority: Minor

 pyspark seems to use old-style classes
 class Foo:
 These are relatively ancient and should be replaced by
 class Foo(object):
 Many newer libraries depend on this change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib

2014-09-05 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123077#comment-14123077
 ] 

RJ Nowling commented on SPARK-2966:
---

Wonderful!

If I can help or when you're ready for reviews, let me know!

 Add an approximation algorithm for hierarchical clustering to MLlib
 ---

 Key: SPARK-2966
 URL: https://issues.apache.org/jira/browse/SPARK-2966
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor

 A hierarchical clustering algorithm is a useful unsupervised learning method.
 Koga et al. proposed a highly scalable hierarchical clustering algorithm in 
 (1).
 I would like to implement this method.
 I suggest adding an approximate hierarchical clustering algorithm to MLlib.
 I'd like this to be assigned to me.
 h3. Reference
 # Fast agglomerative hierarchical clustering algorithm using 
 Locality-Sensitive Hashing
 http://dl.acm.org/citation.cfm?id=1266811



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3369) Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator

2014-09-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3369:
---
Priority: Critical  (was: Major)
Target Version/s: 1.2.0

 Java mapPartitions Iterator-Iterable is inconsistent with Scala's 
 Iterator-Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen
Priority: Critical
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
 an {{Iterator}} but is required to return an {{Iterable}}, which is a 
 stronger condition and appears inconsistent. It's a problematic inconsistency 
 because this seems to require copying all of the input into memory in 
 order to create an object that can be iterated many times, since the input 
 does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other 
 {{*FlatMapFunction}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.
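
 For concreteness, the hacky {{IteratorIterable}} workaround mentioned above could 
 look like this minimal sketch (written in Scala here; the Java version is 
 analogous), which is only safe under the promise that iterator() is called once:
 {code}
 // Wraps a single Iterator as an Iterable; valid only if iterator() is invoked
 // at most once, since the underlying iterator cannot be reset.
 class IteratorIterable[T](it: java.util.Iterator[T]) extends java.lang.Iterable[T] {
   override def iterator(): java.util.Iterator[T] = it
 }
 {code}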



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-05 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3174:
---
Assignee: Andrew Or

 Under YARN, add and remove executors based on load
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or

 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-05 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123137#comment-14123137
 ] 

Patrick Wendell commented on SPARK-3174:


We should come up with a design doc for this. One other feature I saw on the 
user list was allowing the user to ask for more executors directly... that 
could be an interesting hook to expose as well, and maybe we just have a 
heuristic that calls this hook. It's also worth seeing how well this translates 
to Mesos and Standalone mode.

 Under YARN, add and remove executors based on load
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or

 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123143#comment-14123143
 ] 

Davies Liu commented on SPARK-3399:
---

Could you give an example to show the problem?

 Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
 

 Key: SPARK-3399
 URL: https://issues.apache.org/jira/browse/SPARK-3399
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Kousuke Saruta

 Some tests for PySpark make temporary files on /tmp of the local file system, but 
 if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in 
 spark-env.sh, the tests expect the temporary files to be on the FileSystem 
 configured in core-site.xml even though the actual files are on the local file 
 system.
 I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR.
 If we need those variables in some tests, we should set those variables in 
 such tests specifically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123167#comment-14123167
 ] 

Kousuke Saruta commented on SPARK-3399:
---

Some tests for PySpark, for instance rdd.py, use NamedTemporaryFile for creating 
input data.
NamedTemporaryFile creates a temporary file on /tmp on the local filesystem.
rdd.py is kicked off by the pyspark script in python/run-tests.
If we set the environment variables HADOOP_CONF_DIR or YARN_CONF_DIR in 
spark-env.sh before testing, the pyspark command loads values from those variables.
After loading those values, Spark expects the input data to be on the filesystem 
configured by the environment variables.
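
In other words (a small sketch of the behavior, not of the patch, assuming a 
shell {{sc}}; Scala shown, PySpark resolves paths the same way): without an 
explicit scheme, a path is resolved against fs.defaultFS from the configuration 
directory those variables point to, so the tests stop reading the local files 
they just wrote.
{code}
sc.textFile("/tmp/test-input")          // resolved against fs.defaultFS (may be HDFS)
sc.textFile("file:///tmp/test-input")   // explicitly the local filesystem
{code}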


 Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
 

 Key: SPARK-3399
 URL: https://issues.apache.org/jira/browse/SPARK-3399
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Kousuke Saruta

 Some tests for PySpark make temporary files on /tmp of the local file system, but 
 if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in 
 spark-env.sh, the tests expect the temporary files to be on the FileSystem 
 configured in core-site.xml even though the actual files are on the local file 
 system.
 I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR.
 If we need those variables in some tests, we should set those variables in 
 such tests specifically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3418) Additional BLAS and Local Sparse Matrix support

2014-09-05 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-3418:
--

 Summary: Additional BLAS and Local Sparse Matrix support
 Key: SPARK-3418
 URL: https://issues.apache.org/jira/browse/SPARK-3418
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Burak Yavuz


Currently MLlib doesn't have Level-2 and Level-3 BLAS support. For Multi-Model 
training, adding support for Level-3 BLAS functions is vital.

In addition, as most real data is sparse, support for Local Sparse Matrices 
will also be added, as supporting sparse matrices will save a lot of memory and 
will lead to better performance. The ability to left multiply a dense matrix 
with a sparse matrix, i.e. `C := alpha * A * B + beta * C` where `A` is a 
sparse matrix will also be added. However, `B` and `C` will remain as Dense 
Matrices for now.

I will post performance comparisons with other libraries that support sparse 
matrices such as Breeze and Matrix-toolkits-JAVA (MTJ) in the comments.
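
To make the update rule concrete, here is a plain dense, column-major version of 
`C := alpha * A * B + beta * C` (illustrative only; this is not the MLlib code, 
which additionally has to handle the case where `A` is sparse):
{code}
// Dense, column-major gemm: C := alpha * A * B + beta * C
// A is m x k, B is k x n, C is m x n; all stored as flat column-major arrays.
def gemm(alpha: Double, a: Array[Double], m: Int, k: Int,
         b: Array[Double], n: Int, beta: Double, c: Array[Double]): Unit = {
  require(a.length == m * k && b.length == k * n && c.length == m * n)
  var j = 0
  while (j < n) {
    var i = 0
    while (i < m) {
      var sum = 0.0
      var l = 0
      while (l < k) { sum += a(l * m + i) * b(j * k + l); l += 1 }
      c(j * m + i) = alpha * sum + beta * c(j * m + i)
      i += 1
    }
    j += 1
  }
}
{code}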



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3418) [MLlib] Additional BLAS and Local Sparse Matrix support

2014-09-05 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-3418:
---
Summary: [MLlib] Additional BLAS and Local Sparse Matrix support  (was: 
Additional BLAS and Local Sparse Matrix support)

 [MLlib] Additional BLAS and Local Sparse Matrix support
 ---

 Key: SPARK-3418
 URL: https://issues.apache.org/jira/browse/SPARK-3418
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Burak Yavuz

 Currently MLlib doesn't have Level-2 and Level-3 BLAS support. For 
 Multi-Model training, adding support for Level-3 BLAS functions is vital.
 In addition, as most real data is sparse, support for Local Sparse Matrices 
 will also be added, as supporting sparse matrices will save a lot of memory 
 and will lead to better performance. The ability to left multiply a dense 
 matrix with a sparse matrix, i.e. `C := alpha * A * B + beta * C` where `A` 
 is a sparse matrix will also be added. However, `B` and `C` will remain as 
 Dense Matrices for now.
 I will post performance comparisons with other libraries that support sparse 
 matrices such as Breeze and Matrix-toolkits-JAVA (MTJ) in the comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123195#comment-14123195
 ] 

Davies Liu commented on SPARK-3399:
---

Thanks for the explanation. I still cannot reproduce the problem by putting 
HADOOP_CONF_DIR and YARN_CONF_DIR into conf/spark-env.sh; run-tests still runs 
successfully. Did I miss something?

 Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
 

 Key: SPARK-3399
 URL: https://issues.apache.org/jira/browse/SPARK-3399
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Kousuke Saruta

 Some tests for PySpark make temporary files on /tmp of the local file system, but 
 if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in 
 spark-env.sh, the tests expect the temporary files to be on the FileSystem 
 configured in core-site.xml even though the actual files are on the local file 
 system.
 I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR.
 If we need those variables in some tests, we should set those variables in 
 such tests specifically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

2014-09-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3414:
--
Description: 
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
    SELECT name, message
    FROM rawLogs
    JOIN (
      SELECT name
      FROM logFiles
    ) files
    ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
 Subquery files
  Project [name#2]
   LowerCaseSchema
Subquery logfiles
 SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased.

The reason is that, during analysis phase, the 
{{CaseInsensitiveAttributeReferences}} batch is only executed before the 
{{Resolution}} batch. And when {{srdd}} is registered as temporary table 
{{boom}}, its original (unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
notice that the attributes referenced in the join operator (esp. {{rawLogs}}) are 
not lowercased yet.

And then, when {{select * from boom}} is being analyzed, its input logical plan 
is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
here the unresolved relation points to the unanalyzed logical plan of {{srdd}} 
above, which is later discovered by rule {{ResolveRelations}}, thus not touched 
by {{CaseInsensitiveAttributeReferences}} at all:
{code}
=== Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]Project [*]
! UnresolvedRelation None, boom, NoneLowerCaseSchema
! Subquery boom
!  Project ['name,'message]
!   Join Inner, Some(('rawLogs.filename 
= 'files.name))
!LowerCaseSchema
! Subquery rawlogs
!  SparkLogicalPlan (ExistingRdd 
[filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
basicOperators.scala:208)
!Subquery files
! Project ['name]
!  LowerCaseSchema
!   Subquery logfiles
!SparkLogicalPlan (ExistingRdd 
[name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch happens before the 
{{Resolution}} batch, attribute referenced in the join operator ({{rawLogs}}) 
is not lowercased, and thus causes the resolution failure.

A reasonable fix for this could be to always register the analyzed logical plan 
in the catalog when registering temporary tables.

  was:
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
    SELECT name, message
    FROM rawLogs
    JOIN (
      SELECT name
      FROM logFiles
    ) files
    ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 

[jira] [Updated] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

2014-09-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3414:
--
Description: 
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
    SELECT name, message
    FROM rawLogs
    JOIN (
      SELECT name
      FROM logFiles
    ) files
    ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
 Subquery files
  Project [name#2]
   LowerCaseSchema
Subquery logfiles
 SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased.

The reason is that, during analysis phase, the 
{{CaseInsensitiveAttributeReferences}} batch is only executed before the 
{{Resolution}} batch. And when {{srdd}} is registered as temporary table 
{{boom}}, its original (unanalyzed) logical plan is stored into the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
notice that the attributes referenced in the join operator (esp. {{rawLogs}}) are 
not lowercased yet.

And then, when {{select * from boom}} is being analyzed, its input logical plan 
is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
here the unresolved relation points to the unanalyzed logical plan of {{srdd}} 
above, which is later discovered by rule {{ResolveRelations}}, thus not touched 
by {{CaseInsensitiveAttributeReferences}} at all, and {{rawLogs.filename}} is 
thus not lowercased:
{code}
=== Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 Project [*]Project [*]
! UnresolvedRelation None, boom, NoneLowerCaseSchema
! Subquery boom
!  Project ['name,'message]
!   Join Inner, Some(('rawLogs.filename 
= 'files.name))
!LowerCaseSchema
! Subquery rawlogs
!  SparkLogicalPlan (ExistingRdd 
[filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
basicOperators.scala:208)
!Subquery files
! Project ['name]
!  LowerCaseSchema
!   Subquery logfiles
!SparkLogicalPlan (ExistingRdd 
[name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}

A reasonable fix for this could be to always register the analyzed logical plan 
in the catalog when registering temporary tables.

  was:
Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce 
this issue:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._

case class LogEntry(filename: String, message: String)
case class LogFile(name: String)

sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")

val srdd = sql(
  """
    SELECT name, message
    FROM rawLogs
    JOIN (
      SELECT name
      FROM logFiles
    ) files
    ON rawLogs.filename = files.name
  """)

srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
Join Inner, Some(('rawLogs.filename = name#2))
 LowerCaseSchema
  Subquery rawlogs
   SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
 

[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123200#comment-14123200
 ] 

Kousuke Saruta commented on SPARK-3399:
---

Ah, did you set fs.defaultFS to something like hdfs:// in the core-site.xml on the 
path configured by HADOOP_CONF_DIR or YARN_CONF_DIR?

 Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
 

 Key: SPARK-3399
 URL: https://issues.apache.org/jira/browse/SPARK-3399
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Kousuke Saruta

 Some tests for PySpark make temporary files on /tmp of the local file system, but 
 if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in 
 spark-env.sh, the tests expect the temporary files to be on the FileSystem 
 configured in core-site.xml even though the actual files are on the local file 
 system.
 I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR.
 If we need those variables in some tests, we should set those variables in 
 such tests specifically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator

2014-09-05 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123203#comment-14123203
 ] 

Nicholas Chammas commented on SPARK-3369:
-

{quote}
The API change is unlikely to happen.
{quote}

Is that due to the stable API promise made in 1.0? 

I guess if we're stuck with this API, we should at least tag this issue somehow 
for review during Spark 2.0 development. API changes should be fair game then.

Also, would PySpark be affected at all by these changes?

 Java mapPartitions Iterator-Iterable is inconsistent with Scala's 
 Iterator-Iterator
 -

 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen
Priority: Critical
 Attachments: FlatMapIterator.patch


 {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
 {{Iterator}} to an {{Iterator}}: 
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
 In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
 an {{Iterator}} but is required to return an {{Iterable}}, which is a 
 stronger condition and appears inconsistent. It's a problematic inconsistency 
 because this seems to require copying all of the input into memory in 
 order to create an object that can be iterated many times, since the input 
 does not afford this itself.
 Similarly for other {{mapPartitions*}} methods and other 
 {{*FlatMapFunction}}s in Java.
 (Is there a reason for this difference that I'm overlooking?)
 If I'm right that this was inadvertent inconsistency, then the big issue here 
 is that of course this is part of a public API. Workarounds I can think of:
 Promise that Spark will only call {{iterator()}} once, so implementors can 
 use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
 Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
 desired signature, and deprecate existing ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large

2014-09-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123213#comment-14123213
 ] 

Andrew Ash commented on SPARK-1823:
---

// This was not fixed in Spark 1.1 and should be bumped to Spark 1.2

[~pwendell] about this issue on dev@ Aug 25th:
{quote}
We might create a new JIRA for it, but it doesn't exist yet. We'll create 
JIRA's for the major 1.2 issues at the beginning of September.
{quote}

 ExternalAppendOnlyMap can still OOM if one key is very large
 

 Key: SPARK-1823
 URL: https://issues.apache.org/jira/browse/SPARK-1823
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.1.0


 If the values for one key do not collectively fit into memory, then the map 
 will still OOM when you merge the spilled contents back in.
 This is a problem especially for PySpark, since we hash the keys (Python 
 objects) before a shuffle, and there are only so many integers out there in 
 the world, so there could potentially be many collisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123214#comment-14123214
 ] 

Davies Liu commented on SPARK-3399:
---

Given fs.defaultFS as hdfs://, saveAsTextFile() will save the files into HDFS, 
but other parts of the code assume that the files are saved on the local 
filesystem, so the test cases fail. Do I understand correctly?

Then this patch makes sense; thanks for your work!

 Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
 

 Key: SPARK-3399
 URL: https://issues.apache.org/jira/browse/SPARK-3399
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Kousuke Saruta

 Some tests for PySpark make temporary files on /tmp of the local file system, but 
 if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in 
 spark-env.sh, the tests expect the temporary files to be on the FileSystem 
 configured in core-site.xml even though the actual files are on the local file 
 system.
 I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR.
 If we need those variables in some tests, we should set those variables in 
 such tests specifically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large

2014-09-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-1823:
-
 Target Version/s: 1.2.0
Affects Version/s: (was: 1.0.0)
   1.1.0
   1.0.2
Fix Version/s: (was: 1.1.0)

 ExternalAppendOnlyMap can still OOM if one key is very large
 

 Key: SPARK-1823
 URL: https://issues.apache.org/jira/browse/SPARK-1823
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2, 1.1.0
Reporter: Andrew Or

 If the values for one key do not collectively fit into memory, then the map 
 will still OOM when you merge the spilled contents back in.
 This is a problem especially for PySpark, since we hash the keys (Python 
 objects) before a shuffle, and there are only so many integers out there in 
 the world, so there could potentially be many collisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large

2014-09-05 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123235#comment-14123235
 ] 

Andrew Or commented on SPARK-1823:
--

Thanks Andrew, I have updated the versions.

 ExternalAppendOnlyMap can still OOM if one key is very large
 

 Key: SPARK-1823
 URL: https://issues.apache.org/jira/browse/SPARK-1823
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2, 1.1.0
Reporter: Andrew Or

 If the values for one key do not collectively fit into memory, then the map 
 will still OOM when you merge the spilled contents back in.
 This is a problem especially for PySpark, since we hash the keys (Python 
 objects) before a shuffle, and there are only so many integers out there in 
 the world, so there could potentially be many collisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-05 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123238#comment-14123238
 ] 

Sandy Ryza commented on SPARK-3174:
---

I've been putting a little bit of thought into this - I'll work on a design doc 
and post it here.

 Under YARN, add and remove executors based on load
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or

 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123237#comment-14123237
 ] 

Kousuke Saruta commented on SPARK-3399:
---

Yes, I meant something like what you mentioned.

 Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
 

 Key: SPARK-3399
 URL: https://issues.apache.org/jira/browse/SPARK-3399
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Kousuke Saruta

 Some tests for PySpark make temporary files on /tmp of the local file system, but 
 if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in 
 spark-env.sh, the tests expect the temporary files to be on the FileSystem 
 configured in core-site.xml even though the actual files are on the local file 
 system.
 I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR.
 If we need those variables in some tests, we should set those variables in 
 such tests specifically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped

2014-09-05 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123253#comment-14123253
 ] 

Tathagata Das commented on SPARK-2892:
--

I intended it to be ERROR to catch such issues where receivers don't stop 
properly. If a program is supposed to shut down after stopping the streaming 
context, then this is probably not much of a problem, as everything in Spark is 
torn down anyway. But if the SparkContext is going to be reused, then this is 
indeed a problem. 

 Socket Receiver does not stop when streaming context is stopped
 ---

 Key: SPARK-2892
 URL: https://issues.apache.org/jira/browse/SPARK-2892
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical

 Running NetworkWordCount with
 {quote}  
 ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); 
 Thread.sleep(6)
 {quote}
 gives the following error
 {quote}
 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
 in 10047 ms on localhost (1/1)
 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at 
 ReceiverTracker.scala:275) finished in 10.056 s
 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
 have all completed, from pool
 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at 
 ReceiverTracker.scala:275, took 10.179263 s
 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been 
 terminated
 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not 
 deregistered, Map(0 - 
 ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,))
 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped
 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately
 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after 
 time 1407375433000
 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator
 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler
 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully
 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving
 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost:
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3399.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2270
[https://github.com/apache/spark/pull/2270]

 Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
 

 Key: SPARK-3399
 URL: https://issues.apache.org/jira/browse/SPARK-3399
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Kousuke Saruta
 Fix For: 1.2.0


 Some tests for PySpark make temporary files on /tmp of the local file system, but 
 if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in 
 spark-env.sh, the tests expect the temporary files to be on the FileSystem 
 configured in core-site.xml even though the actual files are on the local file 
 system.
 I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR.
 If we need those variables in some tests, we should set those variables in 
 such tests specifically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.

2014-09-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123326#comment-14123326
 ] 

Josh Rosen commented on SPARK-2491:
---

Ah, I see.  It looks like we don't want to display exceptions thrown from other 
tasks during a shutdown, since those exceptions are likely triggered by the 
unclean shutdown itself rather than real errors, and thus will be confusing to 
users who read the logs.

 When an OOM is thrown,the executor does not stop properly.
 --

 Key: SPARK-2491
 URL: https://issues.apache.org/jira/browse/SPARK-2491
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Guoqiang Li

 The executor log:
 {code}
 #
 # java.lang.OutOfMemoryError: Java heap space
 # -XX:OnOutOfMemoryError=kill %p
 #   Executing /bin/sh -c kill 44942...
 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
 SIGTERM
 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
 in thread Thread[Connection manager future execution context-6,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
 at 
 org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
 at 
 org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
 at 
 org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close()
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 -
 14/07/15 10:38:30 INFO Executor: Running task ID 969
 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally
 14/07/15 10:38:30 INFO HadoopRDD: Input split: 
 hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537
 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969
 java.io.FileNotFoundException: 
 /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0
  (No such file or directory)
 at java.io.FileOutputStream.open(Native Method)
 at 

[jira] [Created] (SPARK-3419) Scheduler shouldn't delay running a task when executors don't reside at any of its preferred locations

2014-09-05 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3419:
-

 Summary: Scheduler shouldn't delay running a task when executors 
don't reside at any of its preferred locations 
 Key: SPARK-3419
 URL: https://issues.apache.org/jira/browse/SPARK-3419
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3420) Using Sphinx to generate API docs for PySpark

2014-09-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3420:
-

 Summary: Using Sphinx to generate API docs for PySpark
 Key: SPARK-3420
 URL: https://issues.apache.org/jira/browse/SPARK-3420
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu


Sphinx can generate better documents than epydoc, so let's move on to Sphinx.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3160) Simplify DecisionTree data structure for training

2014-09-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3160:
-
Description: 
Improvement: code clarity

Currently, we maintain a tree structure, a flat array of nodes, and a 
parentImpurities array.

Proposed fix: Maintain everything within a growing tree structure.  For this, 
we could have a “LearningNode extends Node” setup where the LearningNode holds 
metadata for learning (such as impurities).  The test-time model could be 
extracted from this training-time model, so that extra information (such as 
impurities) does not have to be kept after training.

This would let us eliminate the flat array of nodes, thus saving storage when 
we do not grow a full tree.  It would also potentially make it easier to pass 
subtrees to compute nodes for local training.


  was:
Improvement: code clarity

Currently, we maintain a tree structure, a flat array of nodes, and a 
parentImpurities array.

Proposed fix: Maintain everything within a growing tree structure.  For this, 
we could have a “LearningNode extends Node” setup where the LearningNode holds 
metadata for learning (such as impurities).  The test-time model could be 
extracted from this training-time model, so that extra information (such as 
impurities) does not have to be kept after training.



 Simplify DecisionTree data structure for training
 -

 Key: SPARK-3160
 URL: https://issues.apache.org/jira/browse/SPARK-3160
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 Improvement: code clarity
 Currently, we maintain a tree structure, a flat array of nodes, and a 
 parentImpurities array.
 Proposed fix: Maintain everything within a growing tree structure.  For this, 
 we could have a “LearningNode extends Node” setup where the LearningNode 
 holds metadata for learning (such as impurities).  The test-time model could 
 be extracted from this training-time model, so that extra information (such 
 as impurities) does not have to be kept after training.
 This would let us eliminate the flat array of nodes, thus saving storage when 
 we do not grow a full tree.  It would also potentially make it easier to pass 
 subtrees to compute nodes for local training.
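
 A very small sketch of the "LearningNode extends Node" idea described above (the 
 fields shown are made up for illustration; the real Node carries more state):
 {code}
 // Test-time node: only what prediction needs.
 class Node(val id: Int,
            var leftNode: Option[Node] = None,
            var rightNode: Option[Node] = None) extends Serializable

 // Training-time node: adds learning-only metadata (e.g. impurity) that is
 // dropped when the final model is extracted from the growing tree.
 class LearningNode(id: Int, var impurity: Double) extends Node(id)
 {code}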



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-09-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123390#comment-14123390
 ] 

Andrew Ash commented on SPARK-3280:
---

[~joshrosen] do you have a theory for the cause of the dropoff between 2800 and 
3200 partitions in your chart?  My interpretation is that both shuffle 
implementations behave similarly in this scenario up to ~1600 after which the 
hash based starts falling behind, then there's another step difference at 3200 
where it hits a severe dropoff.  I'm interested in the right third of the chart.

A couple theories:
- more partitions = more stuff in memory concurrently = GC pressure.  
Sort-based can stream and do merge sort, but hash-based needs to build the hash 
all at once then spill it
- more partitions = more concurrent spills = disk thrashing while writing to 
lots of files concurrently, exacerbated if the test was on spinnies instead of 
SSDs.  Maybe the sort-based implementation merges spills while writing to disk, so 
it ends up writing fewer spill files concurrently.

Also, the chart is a little unclear: is the y-axis time in seconds?

 Made sort-based shuffle the default implementation
 --

 Key: SPARK-3280
 URL: https://issues.apache.org/jira/browse/SPARK-3280
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
 Attachments: hash-sort-comp.png


 sort-based shuffle has lower memory usage and seems to outperform hash-based 
 in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary character as struct field name

2014-09-05 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3421:
-

 Summary: StructField.toString should quote the name field to allow 
arbitrary character as struct field name
 Key: SPARK-3421
 URL: https://issues.apache.org/jira/browse/SPARK-3421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian


The original use case is something like this:
{code}
// JSON snippet with illegal characters in field names
val json =
  """{ "a(b)": { "c(d)": "hello" } }""" ::
  """{ "a(b)": { "c(d)": "world" } }""" ::
  Nil

val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json))
jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet")

java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: 
StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))),
 [1.37] failure: `,' expected but `(' found
{code}
The reason is that the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} as the 
struct field name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2714) DAGScheduler should log jobid when runJob finishes

2014-09-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2714:
--
Summary: DAGScheduler should log jobid when runJob finishes  (was: 
DAGScheduler logs jobid when runJob finishes)

 DAGScheduler should log jobid when runJob finishes
 --

 Key: SPARK-2714
 URL: https://issues.apache.org/jira/browse/SPARK-2714
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: YanTang Zhai
Priority: Minor

 DAGScheduler logs jobid when runJob finishes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2714) DAGScheduler should log jobid when runJob finishes

2014-09-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2714:
--
Description: When DAGScheduler concurrently runs multiple jobs, SparkContext only 
logs "Job finished" messages to the same log file, which doesn't tell which job is 
which. It's difficult to find which job has finished, or how much time it has taken, 
from multiple "Job finished: ..., took ... s" logs.  (was: 
DAGScheduler logs jobid when runJob finishes)

 DAGScheduler should log jobid when runJob finishes
 --

 Key: SPARK-2714
 URL: https://issues.apache.org/jira/browse/SPARK-2714
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: YanTang Zhai
Priority: Minor

 When DAGScheduler concurrently runs multiple jobs, SparkContext only logs 
 "Job finished" messages to the same log file, which doesn't tell which job is 
 which. It's difficult to find which job has finished, or how much time it has 
 taken, from multiple "Job finished: ..., took ... s" logs.
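 For example, a per-job completion line along these lines would disambiguate 
 concurrent jobs (illustrative sketch only, not the actual DAGScheduler code):
 {code}
 // Hypothetical sketch: include the job id and elapsed time in the completion log.
 val jobId = 42                                         // hypothetical job id
 val start = System.nanoTime()
 // ... run the job ...
 val elapsedSeconds = (System.nanoTime() - start) / 1e9
 println(f"Job $jobId finished: count at Example.scala:15, took $elapsedSeconds%.6f s")
 {code}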



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.

2014-09-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2491:
-
Affects Version/s: 1.1.0

 When an OOM is thrown,the executor does not stop properly.
 --

 Key: SPARK-2491
 URL: https://issues.apache.org/jira/browse/SPARK-2491
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Guoqiang Li

 The executor log:
 {code}
 #
 # java.lang.OutOfMemoryError: Java heap space
 # -XX:OnOutOfMemoryError=kill %p
 #   Executing /bin/sh -c kill 44942...
 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
 SIGTERM
 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
 in thread Thread[Connection manager future execution context-6,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
 at 
 org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
 at 
 org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
 at 
 org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close()
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 -
 14/07/15 10:38:30 INFO Executor: Running task ID 969
 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally
 14/07/15 10:38:30 INFO HadoopRDD: Input split: 
 hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537
 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969
 java.io.FileNotFoundException: 
 /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0
  (No such file or directory)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.init(FileOutputStream.java:221)
 at 
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
 at 
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
 at 
 

[jira] [Commented] (SPARK-2491) When an OOM is thrown,the executor does not stop properly.

2014-09-05 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123447#comment-14123447
 ] 

Andrew Or commented on SPARK-2491:
--

[~witgo] This doesn't seem to be specific to YARN. I'm changing the component 
to Spark Core. Let me know if there's a reason I should change it back.

 When an OOM is thrown,the executor does not stop properly.
 --

 Key: SPARK-2491
 URL: https://issues.apache.org/jira/browse/SPARK-2491
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Guoqiang Li

 The executor log:
 {code}
 #
 # java.lang.OutOfMemoryError: Java heap space
 # -XX:OnOutOfMemoryError=kill %p
 #   Executing /bin/sh -c kill 44942...
 14/07/15 10:38:29 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
 SIGTERM
 14/07/15 10:38:29 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
 in thread Thread[Connection manager future execution context-6,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:57)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
 at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
 at 
 org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
 at 
 org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
 at 
 org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:125)
 at 
 org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$sendRequest$1.applyOrElse(BlockFetcherIterator.scala:122)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117)
 at 
 scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 14/07/15 10:38:29 WARN HadoopRDD: Exception in RecordReader.close()
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
 at 
 org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
 at java.io.FilterInputStream.close(FilterInputStream.java:181)
 at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
 at 
 org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:243)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
 at 
 org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at 
 org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 -
 14/07/15 10:38:30 INFO Executor: Running task ID 969
 14/07/15 10:38:30 INFO BlockManager: Found block broadcast_0 locally
 14/07/15 10:38:30 INFO HadoopRDD: Input split: 
 hdfs://10dian72.domain.test:8020/input/lbs/recommend/toona/rating/20140712/part-7:0+68016537
 14/07/15 10:38:30 ERROR Executor: Exception in task ID 969
 java.io.FileNotFoundException: 
 /yarn/nm/usercache/spark/appcache/application_1404728465401_0070/spark-local-20140715103235-ffda/2e/merged_shuffle_4_85_0
  (No such file or directory)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.init(FileOutputStream.java:221)
 at 
 

[jira] [Updated] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-3174:
--
Attachment: SPARK-3174design.pdf

 Under YARN, add and remove executors based on load
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or
 Attachments: SPARK-3174design.pdf


 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.
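 As a rough illustration of these heuristics (thresholds and names are made up for 
 this sketch, not taken from the attached design doc):
 {code}
 // Illustrative heuristic only.
 case class ClusterSnapshot(
     pendingTasks: Int,
     runningTasks: Int,
     currentExecutors: Int,
     cachedBytes: Long,
     memoryPerExecutor: Long)

 def desiredExecutorDelta(s: ClusterSnapshot, tasksPerExecutor: Int = 4): Int = {
   // Executors needed to work off the queued tasks.
   val backlog = s.pendingTasks / tasksPerExecutor
   // Extra executors needed so cached RDDs fit in memory.
   val cacheShortfall =
     math.max(0L, s.cachedBytes - s.currentExecutors.toLong * s.memoryPerExecutor)
   val forCache = math.ceil(cacheShortfall.toDouble / s.memoryPerExecutor).toInt
   val idle = s.pendingTasks == 0 && s.runningTasks < s.currentExecutors && forCache == 0
   if (backlog > 0 || forCache > 0) backlog + forCache                  // request more
   else if (idle) -(s.currentExecutors - math.max(1, s.runningTasks))   // discard idle
   else 0
 }
 {code}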



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3174) Under YARN, add and remove executors based on load

2014-09-05 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123504#comment-14123504
 ] 

Sandy Ryza commented on SPARK-3174:
---

Posted a high-level design doc.

 Under YARN, add and remove executors based on load
 --

 Key: SPARK-3174
 URL: https://issues.apache.org/jira/browse/SPARK-3174
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.2
Reporter: Sandy Ryza
Assignee: Andrew Or
 Attachments: SPARK-3174design.pdf


 A common complaint with Spark in a multi-tenant environment is that 
 applications have a fixed allocation that doesn't grow and shrink with their 
 resource needs.  We're blocked on YARN-1197 for dynamically changing the 
 resources within executors, but we can still allocate and discard whole 
 executors.
 I think it would be useful to have some heuristics that
 * Request more executors when many pending tasks are building up
 * Request more executors when RDDs can't fit in memory
 * Discard executors when few tasks are running / pending and there's not much 
 in memory
 Bonus points: migrate blocks from executors we're about to discard to 
 executors with free space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

2014-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123538#comment-14123538
 ] 

Apache Spark commented on SPARK-3414:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2293

 Case insensitivity breaks when unresolved relation contains attributes with 
 uppercase letters in their names
 

 Key: SPARK-3414
 URL: https://issues.apache.org/jira/browse/SPARK-3414
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Critical

 Paste the following snippet to {{spark-shell}} (need Hive support) to 
 reproduce this issue:
 {code}
 import org.apache.spark.sql.hive.HiveContext
 val hiveContext = new HiveContext(sc)
 import hiveContext._
 case class LogEntry(filename: String, message: String)
 case class LogFile(name: String)
 sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
 sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")
 val srdd = sql(
   """
   SELECT name, message
   FROM rawLogs
   JOIN (
     SELECT name
     FROM logFiles
   ) files
   ON rawLogs.filename = files.name
   """)
 srdd.registerTempTable("boom")
 sql("select * from boom")
 {code}
 Exception thrown:
 {code}
 SchemaRDD[7] at RDD at SchemaRDD.scala:103
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
 attributes: *, tree:
 Project [*]
  LowerCaseSchema
   Subquery boom
Project ['name,'message]
 Join Inner, Some(('rawLogs.filename = name#2))
  LowerCaseSchema
   Subquery rawlogs
SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
 MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
  Subquery files
   Project [name#2]
LowerCaseSchema
 Subquery logfiles
  SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
 mapPartitions at basicOperators.scala:208)
 {code}
 Notice that {{rawLogs}} in the join operator is not lowercased.
 The reason is that, during the analysis phase, the 
 {{CaseInsensitiveAttributeReferences}} batch is only executed before the 
 {{Resolution}} batch. And when {{srdd}} is registered as the temporary table 
 {{boom}}, its original (unanalyzed) logical plan is stored in the catalog:
 {code}
 Join Inner, Some(('rawLogs.filename = 'files.name))
  UnresolvedRelation None, rawLogs, None
  Subquery files
   Project ['name]
UnresolvedRelation None, logFiles, None
 {code}
 Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) are 
 not lowercased yet.
 Then, when {{select * from boom}} is being analyzed, its input logical 
 plan is:
 {code}
 Project [*]
  UnresolvedRelation None, boom, None
 {code}
 Here the unresolved relation points to the unanalyzed logical plan of 
 {{srdd}} above, which is later resolved by the {{ResolveRelations}} rule but 
 never touched by {{CaseInsensitiveAttributeReferences}}, so 
 {{rawLogs.filename}} is not lowercased:
 {code}
 === Applying Rule 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
  Project [*]Project [*]
 ! UnresolvedRelation None, boom, NoneLowerCaseSchema
 ! Subquery boom
 !  Project ['name,'message]
 !   Join Inner, 
 Some(('rawLogs.filename = 'files.name))
 !LowerCaseSchema
 ! Subquery rawlogs
 !  SparkLogicalPlan (ExistingRdd 
 [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
 basicOperators.scala:208)
 !Subquery files
 ! Project ['name]
 !  LowerCaseSchema
 !   Subquery logfiles
 !SparkLogicalPlan 
 (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:208)
 {code}
 A reasonable fix for this could be to always register the analyzed logical plan 
 in the catalog when registering temporary tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary character as struct field name

2014-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123537#comment-14123537
 ] 

Apache Spark commented on SPARK-3421:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2291

 StructField.toString should quote the name field to allow arbitrary character 
 as struct field name
 --

 Key: SPARK-3421
 URL: https://issues.apache.org/jira/browse/SPARK-3421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian

 The original use case is something like this:
 {code}
 // JSON snippet with illegal characters in field names
 val json =
   """{ "a(b)": { "c(d)": "hello" } }""" ::
   """{ "a(b)": { "c(d)": "world" } }""" ::
   Nil
 val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json))
 jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet")
 java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: 
 StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))),
  [1.37] failure: `,' expected but `(' found
 {code}
 The reason is that the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} 
 as the struct field name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2537) Workaround Timezone specific Hive tests

2014-09-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123556#comment-14123556
 ] 

Cheng Lian commented on SPARK-2537:
---

PR [#1440|https://github.com/apache/spark/pull/1440] fixes this issue.

 Workaround Timezone specific Hive tests
 ---

 Key: SPARK-2537
 URL: https://issues.apache.org/jira/browse/SPARK-2537
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1, 1.1.0
Reporter: Cheng Lian
Priority: Minor

 Several Hive tests in {{HiveCompatibilitySuite}} are timezone sensitive:
 - {{timestamp_1}}
 - {{timestamp_2}}
 - {{timestamp_3}}
 - {{timestamp_udf}}
 Their answers differ between timezones. Caching golden answers 
 naively causes build failures in other timezones. Currently these tests are 
 blacklisted. A not-so-clever solution is to cache golden answers for all 
 timezones for these tests, then select the right version for the current 
 build according to the system timezone.
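 A sketch of that per-timezone golden-answer selection (file layout and naming here 
 are hypothetical, not the actual HiveCompatibilitySuite layout):
 {code}
 import java.util.TimeZone

 // Illustrative only: pick a golden answer file keyed by the build machine's timezone.
 def goldenAnswerPath(testName: String): String = {
   val tz = TimeZone.getDefault.getID.replace('/', '_')   // e.g. "America_Los_Angeles"
   s"src/test/resources/golden/$testName-$tz"
 }
 {code}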



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2537) Workaround Timezone specific Hive tests

2014-09-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-2537.
---
  Resolution: Fixed
   Fix Version/s: 1.1.0
Target Version/s: 1.1.0

 Workaround Timezone specific Hive tests
 ---

 Key: SPARK-2537
 URL: https://issues.apache.org/jira/browse/SPARK-2537
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.1, 1.1.0
Reporter: Cheng Lian
Priority: Minor
 Fix For: 1.1.0


 Several Hive tests in {{HiveCompatibilitySuite}} are timezone sensitive:
 - {{timestamp_1}}
 - {{timestamp_2}}
 - {{timestamp_3}}
 - {{timestamp_udf}}
 Their answers differ between timezones. Caching golden answers 
 naively causes build failures in other timezones. Currently these tests are 
 blacklisted. A not-so-clever solution is to cache golden answers for all 
 timezones for these tests, then select the right version for the current 
 build according to the system timezone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2099) Report TaskMetrics for running tasks

2014-09-05 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123584#comment-14123584
 ] 

Andrew Ash commented on SPARK-2099:
---

I just gave this a run-through and most of the metrics above are live-updated, 
but the GC one isn't.  I think it's because it's not included in 
updateAggregateMetrics here: 
https://github.com/apache/spark/pull/1056/files#diff-1f32bcb61f51133bd0959a4177a066a5R175

Should I open a new ticket to make GC time live-updated?  That's the metric I 
was most excited to see live-updating.

 Report TaskMetrics for running tasks
 

 Key: SPARK-2099
 URL: https://issues.apache.org/jira/browse/SPARK-2099
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Fix For: 1.1.0


 Spark currently collects a set of helpful task metrics, like shuffle bytes 
 written and GC time, and displays them on the app web UI.  These are only 
 collected and displayed for tasks that have completed.  This makes them 
 unsuited to the situation where they would be most useful - 
 determining what's going wrong in currently running tasks.
 Reporting metrics progress for running tasks would probably require adding an 
 executor-driver heartbeat that reports metrics for all tasks currently 
 running on the executor.
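 One possible shape for such a heartbeat message (names here are hypothetical, not 
 the implementation that actually landed):
 {code}
 // Hypothetical sketch of an executor -> driver heartbeat carrying per-task metrics.
 case class TaskMetricsSnapshot(
     shuffleBytesWritten: Long,
     jvmGcTimeMs: Long,
     inputBytesRead: Long)

 case class Heartbeat(
     executorId: String,
     // (taskId, latest metrics) for every task currently running on this executor
     runningTaskMetrics: Seq[(Long, TaskMetricsSnapshot)])
 {code}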



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2099) Report TaskMetrics for running tasks

2014-09-05 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123587#comment-14123587
 ] 

Sandy Ryza commented on SPARK-2099:
---

Yeah, unfortunately I haven't had the chance to add that one in yet.  A new 
ticket would be great.

 Report TaskMetrics for running tasks
 

 Key: SPARK-2099
 URL: https://issues.apache.org/jira/browse/SPARK-2099
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Fix For: 1.1.0


 Spark currently collects a set of helpful task metrics, like shuffle bytes 
 written and GC time, and displays them on the app web UI.  These are only 
 collected and displayed for tasks that have completed.  This makes them 
 unsuited to the situation where they would be most useful - 
 determining what's going wrong in currently running tasks.
 Reporting metrics progress for running tasks would probably require adding an 
 executor-driver heartbeat that reports metrics for all tasks currently 
 running on the executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3418) [MLlib] Additional BLAS and Local Sparse Matrix support

2014-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123634#comment-14123634
 ] 

Apache Spark commented on SPARK-3418:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/2294

 [MLlib] Additional BLAS and Local Sparse Matrix support
 ---

 Key: SPARK-3418
 URL: https://issues.apache.org/jira/browse/SPARK-3418
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Burak Yavuz

 Currently MLlib doesn't have Level-2 and Level-3 BLAS support. For 
 Multi-Model training, adding support for Level-3 BLAS functions is vital.
 In addition, as most real data is sparse, support for Local Sparse Matrices 
 will also be added, as supporting sparse matrices will save a lot of memory 
 and will lead to better performance. The ability to left-multiply a dense 
 matrix by a sparse matrix, i.e. `C := alpha * A * B + beta * C` where `A` 
 is a sparse matrix, will also be added. However, `B` and `C` will remain 
 dense matrices for now.
 I will post performance comparisons with other libraries that support sparse 
 matrices such as Breeze and Matrix-toolkits-JAVA (MTJ) in the comments.
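 For reference, a plain-Scala sketch of the `C := alpha * A * B + beta * C` update for 
 dense, column-major matrices (purely illustrative; the proposed MLlib code would call 
 into BLAS and support a sparse `A`):
 {code}
 // Naive dense GEMM sketch: C := alpha * A * B + beta * C, column-major storage.
 // A is m x k, B is k x n, C is m x n.
 def gemm(alpha: Double, a: Array[Double], b: Array[Double], beta: Double,
          c: Array[Double], m: Int, k: Int, n: Int): Unit = {
   var j = 0
   while (j < n) {
     var i = 0
     while (i < m) {
       var sum = 0.0
       var l = 0
       while (l < k) {
         sum += a(i + l * m) * b(l + j * k)   // A(i, l) * B(l, j)
         l += 1
       }
       c(i + j * m) = alpha * sum + beta * c(i + j * m)
       i += 1
     }
     j += 1
   }
 }
 {code}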



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3211) .take() is OOM-prone when there are empty partitions

2014-09-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3211:
-
Target Version/s: 1.1.1, 1.2.0

 .take() is OOM-prone when there are empty partitions
 

 Key: SPARK-3211
 URL: https://issues.apache.org/jira/browse/SPARK-3211
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andrew Ash

 Filed on dev@ on 22 August by [~pnepywoda]:
 {quote}
 On line 777
 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
 the logic for take() reads ALL partitions if the first one (or first k) are
 empty. This has actually led to OOMs when we had many partitions
 (thousands) and unfortunately the first one was empty.
 Wouldn't a better implementation strategy be
 numPartsToTry = partsScanned * 2
 instead of
 numPartsToTry = totalParts - 1
 (this doubling is similar to most memory allocation strategies)
 Thanks!
 - Paul
 {quote}
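 A sketch of the suggested geometric growth (variable names follow the quoted snippet; 
 this is not the exact patch):
 {code}
 // Illustrative: instead of jumping to all remaining partitions after the first
 // (possibly empty) scan, grow the number of partitions to try geometrically.
 var partsScanned = 0
 var buf = Vector.empty[Int]          // stand-in for the rows collected so far
 val totalParts = 1000
 val num = 10                         // how many elements take() wants
 while (buf.size < num && partsScanned < totalParts) {
   val numPartsToTry = if (partsScanned == 0) 1 else partsScanned * 2
   val partsToScan = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
   // buf ++= runJob(...) over partsToScan in the real implementation; elided here
   partsScanned += partsToScan.size
 }
 {code}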



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3416) Add matrix operations for large data set

2014-09-05 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123695#comment-14123695
 ] 

Yu Ishikawa commented on SPARK-3416:


We discuss about this issue on the thread.

http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-td8291.html

 Add matrix operations for large data set
 

 Key: SPARK-3416
 URL: https://issues.apache.org/jira/browse/SPARK-3416
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa

 I think matrix operations for large data sets would be helpful. There is a 
 method to multiply an RDD-based matrix and a local matrix. However, there is 
 no method to operate on an RDD-based matrix and another one.
 - multiplication
 - addition / subtraction
 - power
 - scalar multiply



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3082) yarn.Client.logClusterResourceDetails throws NPE if requested queue doesn't exist

2014-09-05 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved SPARK-3082.
---
   Resolution: Fixed
Fix Version/s: 1.1.0

 yarn.Client.logClusterResourceDetails throws NPE if requested queue doesn't 
 exist
 -

 Key: SPARK-3082
 URL: https://issues.apache.org/jira/browse/SPARK-3082
 Project: Spark
  Issue Type: Bug
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Minor
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3423) Implement BETWEEN support for regular SQL parser

2014-09-05 Thread William Benton (JIRA)
William Benton created SPARK-3423:
-

 Summary: Implement BETWEEN support for regular SQL parser
 Key: SPARK-3423
 URL: https://issues.apache.org/jira/browse/SPARK-3423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: William Benton
Priority: Minor


The HQL parser supports BETWEEN but the SQLParser currently does not.  It would 
be great if it did.
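For example, the following should work against a plain {{SQLContext}} once BETWEEN is 
supported (it already works through {{HiveContext}}); the table and column names are 
made up, and a spark-shell session with {{sc}} and {{sqlContext}} in scope is assumed:
{code}
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 15), Person("Bob", 42)))
sqlContext.createSchemaRDD(people).registerTempTable("people")
// Should parse and return only "Ann" once SQLParser understands BETWEEN.
sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
{code}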



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-09-05 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123749#comment-14123749
 ] 

Marcelo Vanzin commented on SPARK-3215:
---

I updated the prototype to include a Java API and not to use SparkConf in the 
API.

 Add remote interface for SparkContext
 -

 Key: SPARK-3215
 URL: https://issues.apache.org/jira/browse/SPARK-3215
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Marcelo Vanzin
  Labels: hive
 Attachments: RemoteSparkContext.pdf


 A quick description of the issue: as part of running Hive jobs on top of 
 Spark, it's desirable to have a SparkContext that is running in the 
 background and listening for job requests for a particular user session.
 Running multiple contexts in the same JVM is not a very good solution. Not 
 only does SparkContext currently have issues sharing the same JVM among multiple 
 instances, but that also turns the JVM running the contexts into a huge bottleneck 
 in the system.
 So I'm proposing a solution where we have a SparkContext that is running in a 
 separate process, and listening for requests from the client application via 
 some RPC interface (most probably Akka).
 I'll attach a document shortly with the current proposal. Let's use this bug 
 to discuss the proposal and any other suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3423) Implement BETWEEN support for regular SQL parser

2014-09-05 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123814#comment-14123814
 ] 

William Benton commented on SPARK-3423:
---

(PR is here:  https://github.com/apache/spark/pull/2295 ) 

 Implement BETWEEN support for regular SQL parser
 

 Key: SPARK-3423
 URL: https://issues.apache.org/jira/browse/SPARK-3423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: William Benton
Assignee: William Benton
Priority: Minor

 The HQL parser supports BETWEEN but the SQLParser currently does not.  It 
 would be great if it did.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2014-09-05 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123883#comment-14123883
 ] 

Kostas Sakellis commented on SPARK-1239:


[~pwendell] I'd like to take a crack at this since it is affecting one of our 
customers. 

 Don't fetch all map output statuses at each reducer during shuffles
 ---

 Key: SPARK-1239
 URL: https://issues.apache.org/jira/browse/SPARK-1239
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or
 Fix For: 1.1.0


 Instead we should modify the way we fetch map output statuses to take both a 
 mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3423) Implement BETWEEN support for regular SQL parser

2014-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124238#comment-14124238
 ] 

Apache Spark commented on SPARK-3423:
-

User 'willb' has created a pull request for this issue:
https://github.com/apache/spark/pull/2295

 Implement BETWEEN support for regular SQL parser
 

 Key: SPARK-3423
 URL: https://issues.apache.org/jira/browse/SPARK-3423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: William Benton
Assignee: William Benton
Priority: Minor

 The HQL parser supports BETWEEN but the SQLParser currently does not.  It 
 would be great if it did.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2334) Attribute Error calling PipelinedRDD.id() in pyspark

2014-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124239#comment-14124239
 ] 

Apache Spark commented on SPARK-2334:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2296

 Attribute Error calling PipelinedRDD.id() in pyspark
 

 Key: SPARK-2334
 URL: https://issues.apache.org/jira/browse/SPARK-2334
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0, 1.1.0
Reporter: Diana Carroll

 calling the id() function of a PipelinedRDD causes an error in PySpark.  
 (Works fine in Scala.)
 The second id() call here fails; the first works:
 {code}
 r1 = sc.parallelize([1,2,3])
 r1.id()
 r2=r1.map(lambda i: i+1)
 r2.id()
 {code}
 Error:
 {code}
 ---------------------------------------------------------------------------
 AttributeError                            Traceback (most recent call last)
 <ipython-input-31-a0cf66fcf645> in <module>()
 ----> 1 r2.id()
 /usr/lib/spark/python/pyspark/rdd.py in id(self)
     180         A unique ID for this RDD (within its SparkContext).
     181         """
 --> 182         return self._id
     183 
     184     def __repr__(self):
 AttributeError: 'PipelinedRDD' object has no attribute '_id'
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3211) .take() is OOM-prone when there are empty partitions

2014-09-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3211:
-
Assignee: Andrew Ash

 .take() is OOM-prone when there are empty partitions
 

 Key: SPARK-3211
 URL: https://issues.apache.org/jira/browse/SPARK-3211
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andrew Ash
Assignee: Andrew Ash
 Fix For: 1.1.1, 1.2.0


 Filed on dev@ on 22 August by [~pnepywoda]:
 {quote}
 On line 777
 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
 the logic for take() reads ALL partitions if the first one (or first k) are
 empty. This has actually led to OOMs when we had many partitions
 (thousands) and unfortunately the first one was empty.
 Wouldn't a better implementation strategy be
 numPartsToTry = partsScanned * 2
 instead of
 numPartsToTry = totalParts - 1
 (this doubling is similar to most memory allocation strategies)
 Thanks!
 - Paul
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3211) .take() is OOM-prone when there are empty partitions

2014-09-05 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-3211.
--
   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

 .take() is OOM-prone when there are empty partitions
 

 Key: SPARK-3211
 URL: https://issues.apache.org/jira/browse/SPARK-3211
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andrew Ash
Assignee: Andrew Ash
 Fix For: 1.1.1, 1.2.0


 Filed on dev@ on 22 August by [~pnepywoda]:
 {quote}
 On line 777
 https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
 the logic for take() reads ALL partitions if the first one (or first k) are
 empty. This has actually led to OOMs when we had many partitions
 (thousands) and unfortunately the first one was empty.
 Wouldn't a better implementation strategy be
 numPartsToTry = partsScanned * 2
 instead of
 numPartsToTry = totalParts - 1
 (this doubling is similar to most memory allocation strategies)
 Thanks!
 - Paul
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3424) KMeans Plus Plus is too slow

2014-09-05 Thread Derrick Burns (JIRA)
Derrick Burns created SPARK-3424:


 Summary: KMeans Plus Plus is too slow
 Key: SPARK-3424
 URL: https://issues.apache.org/jira/browse/SPARK-3424
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns


The KMeansPlusPlus algorithm is implemented in time O(m k^2), where m is the number 
of rounds of the KMeansParallel algorithm and k is the number of clusters.

This can be dramatically improved by maintaining the distance to the closest 
cluster center from round to round and then incrementally updating that value 
for each point. Since this incremental update takes O(1) time per point, it reduces 
the running time of KMeansPlusPlus to O(m k).  For large k, this is significant.
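A compact sketch of that O(m k) variant: keep each point's squared distance to its 
closest chosen center and refresh it against only the newly added center (illustrative 
code, not the MLlib implementation):
{code}
import scala.util.Random

def squaredDist(x: Array[Double], y: Array[Double]): Double =
  x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum

// k-means++ seeding that maintains, for every point, the squared distance to the
// closest center chosen so far, so each round costs O(#points), not O(#points * #centers).
def kMeansPlusPlus(points: Array[Array[Double]], k: Int, rng: Random): Array[Array[Double]] = {
  val centers = new Array[Array[Double]](k)
  centers(0) = points(rng.nextInt(points.length))
  val closest = points.map(p => squaredDist(p, centers(0)))
  for (c <- 1 until k) {
    // Sample the next center with probability proportional to closest(i).
    var target = rng.nextDouble() * closest.sum
    var i = 0
    while (i < points.length - 1 && target > closest(i)) { target -= closest(i); i += 1 }
    centers(c) = points(i)
    // O(1) per point: only compare against the newly added center.
    var j = 0
    while (j < points.length) {
      closest(j) = math.min(closest(j), squaredDist(points(j), centers(c)))
      j += 1
    }
  }
  centers
}
{code}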



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3411) Improve load-balancing of concurrently-submitted drivers across workers

2014-09-05 Thread WangTaoTheTonic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WangTaoTheTonic updated SPARK-3411:
---
Summary: Improve load-balancing of concurrently-submitted drivers across 
workers  (was: Optimize the schedule procedure in Master)

 Improve load-balancing of concurrently-submitted drivers across workers
 ---

 Key: SPARK-3411
 URL: https://issues.apache.org/jira/browse/SPARK-3411
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: WangTaoTheTonic
Priority: Minor

 If the waiting driver array is too big, the drivers in it will be dispatched 
 to the first worker we get (if it has enough resources), with or without 
 randomization.
 We should do randomization every time we dispatch a driver, in order to 
 better balance drivers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3411) Improve load-balancing of concurrently-submitted drivers across workers

2014-09-05 Thread WangTaoTheTonic (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WangTaoTheTonic updated SPARK-3411:
---
Description: 
If the waiting driver array is too big, the drivers in it will be dispatched to 
the first worker we get (if it has enough resources), with or without 
randomization.

We should do randomization every time we dispatch a driver, in order to better 
balance drivers.

Update (2014/9/6): Doing a shuffle is much slower, so we use round robin to avoid 
it.

  was:
If the waiting driver array is too big, the drivers in it will be dispatched to 
the first worker we get (if it has enough resources), with or without 
randomization.

We should do randomization every time we dispatch a driver, in order to better 
balance drivers.


 Improve load-balancing of concurrently-submitted drivers across workers
 ---

 Key: SPARK-3411
 URL: https://issues.apache.org/jira/browse/SPARK-3411
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: WangTaoTheTonic
Priority: Minor

 If the waiting driver array is too big, the drivers in it will be dispatched 
 to the first worker we get (if it has enough resources), with or without 
 randomization.
 We should do randomization every time we dispatch a driver, in order to 
 better balance drivers.
 Update (2014/9/6): Doing a shuffle is much slower, so we use round robin to 
 avoid it.
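 A sketch of the round-robin dispatch idea, with simplified types (the real Master 
 also tracks cores and memory per worker and per driver):
 {code}
 // Illustrative round-robin dispatch: walk the worker list starting just after the
 // worker that received the previous driver, instead of always starting at the head.
 case class WorkerInfo(id: String, freeCores: Int, freeMemoryMb: Int)
 case class DriverDesc(cores: Int, memoryMb: Int)

 var lastWorkerIndex = -1  // remembered across schedule() calls

 def pickWorker(workers: IndexedSeq[WorkerInfo], driver: DriverDesc): Option[WorkerInfo] = {
   if (workers.isEmpty) return None
   val n = workers.length
   (1 to n).iterator
     .map(offset => (lastWorkerIndex + offset) % n)
     .find { i =>
       val w = workers(i)
       w.freeCores >= driver.cores && w.freeMemoryMb >= driver.memoryMb
     }
     .map { i => lastWorkerIndex = i; workers(i) }
 }
 {code}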



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3361) Expand PEP 8 checks to include EC2 script and Python examples

2014-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14124293#comment-14124293
 ] 

Apache Spark commented on SPARK-3361:
-

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/2297

 Expand PEP 8 checks to include EC2 script and Python examples
 -

 Key: SPARK-3361
 URL: https://issues.apache.org/jira/browse/SPARK-3361
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Nicholas Chammas
Assignee: Nicholas Chammas
Priority: Minor
 Fix For: 1.1.0


 Via {{tox.ini}}, expand the PEP 8 checks to include the EC2 script and all 
 Python examples. That should cover everything.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org