date:20140903

Reynold Xin created SPARK-3367:
--

 Summary: Remove spark.shuffle.spill.compress (replace it with 
existing spark.shuffle.compress)
 Key: SPARK-3367
 URL: https://issues.apache.org/jira/browse/SPARK-3367
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


See the discussion at: 
https://github.com/apache/spark/pull/2178#issuecomment-54214760

We have both spark.shuffle.spill.compress and spark.shuffle.compress. It is 
very unlikely a user would want the two to be different. We should just remove 
the spill.compress option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-2304) Add example application to run tera sort


 [ 
https://issues.apache.org/jira/browse/SPARK-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2304.

Resolution: Won't Fix

spark-perf is a better long term home for this.

 Add example application to run tera sort
 

 Key: SPARK-2304
 URL: https://issues.apache.org/jira/browse/SPARK-2304
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin

 See http://sortbenchmark.org/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-2243) Support multiple SparkContexts in the same JVM


 [ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2243:
---
Target Version/s:   (was: 1.2.0)
Assignee: (was: Reynold Xin)

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at

[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM


[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119479#comment-14119479
 ] 

Reynold Xin commented on SPARK-2243:


Pushing this for later since Hive on Spark no longer depends on this ...

 Support multiple SparkContexts in the same JVM
 --

 Key: SPARK-2243
 URL: https://issues.apache.org/jira/browse/SPARK-2243
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 1.0.0, 1.1.0
Reporter: Miguel Angel Fernandez Diaz
Assignee: Reynold Xin

 We're developing a platform where we create several Spark contexts for 
 carrying out different calculations. Is there any restriction when using 
 several Spark contexts? We have two contexts, one for Spark calculations and 
 another one for Spark Streaming jobs. The next error arises when we first 
 execute a Spark calculation and, once the execution is finished, a Spark 
 Streaming job is launched:
 {code}
 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
   at 
 org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
   at 
 java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
 java.io.FileNotFoundException
 java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at

[jira] [Commented] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file

2014-09-03 Thread David Greco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119499#comment-14119499
 ] 

David Greco commented on SPARK-3350:


Here you are:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/09/03 09:33:10 INFO SecurityManager: Changing view acls to: dgreco,
14/09/03 09:33:10 INFO SecurityManager: Changing modify acls to: dgreco,
14/09/03 09:33:10 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(dgreco, ); users 
with modify permissions: Set(dgreco, )
14/09/03 09:33:11 INFO Slf4jLogger: Slf4jLogger started
14/09/03 09:33:11 INFO Remoting: Starting remoting
14/09/03 09:33:12 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@192.168.0.21:49652]
14/09/03 09:33:12 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkDriver@192.168.0.21:49652]
14/09/03 09:33:12 INFO Utils: Successfully started service 'sparkDriver' on 
port 49652.
14/09/03 09:33:12 INFO SparkEnv: Registering MapOutputTracker
14/09/03 09:33:12 INFO SparkEnv: Registering BlockManagerMaster
14/09/03 09:33:12 INFO DiskBlockManager: Created local directory at 
/var/folders/bw/7dmpv8qx3p75mcmctmchwl5mgn/T/spark-local-20140903093312-29d3
14/09/03 09:33:12 INFO Utils: Successfully started service 'Connection manager 
for block manager' on port 49653.
14/09/03 09:33:12 INFO ConnectionManager: Bound socket to port 49653 with id = 
ConnectionManagerId(192.168.0.21,49653)
14/09/03 09:33:12 INFO MemoryStore: MemoryStore started with capacity 983.1 MB
14/09/03 09:33:12 INFO BlockManagerMaster: Trying to register BlockManager
14/09/03 09:33:12 INFO BlockManagerMasterActor: Registering block manager 
192.168.0.21:49653 with 983.1 MB RAM
14/09/03 09:33:12 INFO BlockManagerMaster: Registered BlockManager
14/09/03 09:33:12 INFO HttpFileServer: HTTP File server directory is 
/var/folders/bw/7dmpv8qx3p75mcmctmchwl5mgn/T/spark-f16014c3-f795-4300-88af-7e161d6f7547
14/09/03 09:33:12 INFO HttpServer: Starting HTTP Server
14/09/03 09:33:12 INFO Utils: Successfully started service 'HTTP file server' 
on port 49654.
14/09/03 09:33:13 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
14/09/03 09:33:13 INFO SparkUI: Started SparkUI at http://192.168.0.21:4040
14/09/03 09:33:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/09/03 09:33:15 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkDriver@192.168.0.21:49652/user/HeartbeatReceiver
14/09/03 09:33:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
mapreduce.task.id
14/09/03 09:33:18 INFO deprecation: mapred.task.id is deprecated. Instead, use 
mapreduce.task.attempt.id
14/09/03 09:33:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, 
use mapreduce.task.ismap
14/09/03 09:33:18 INFO deprecation: mapred.task.partition is deprecated. 
Instead, use mapreduce.task.partition
14/09/03 09:33:18 INFO deprecation: mapred.job.id is deprecated. Instead, use 
mapreduce.job.id
14/09/03 09:33:18 INFO SparkContext: Starting job: saveAsHadoopFile at 
AvroWriteTestCase.scala:67
14/09/03 09:33:18 INFO DAGScheduler: Got job 0 (saveAsHadoopFile at 
AvroWriteTestCase.scala:67) with 1 output partitions (allowLocal=false)
14/09/03 09:33:18 INFO DAGScheduler: Final stage: Stage 0(saveAsHadoopFile at 
AvroWriteTestCase.scala:67)
14/09/03 09:33:18 INFO DAGScheduler: Parents of final stage: List()
14/09/03 09:33:18 INFO DAGScheduler: Missing parents: List()
14/09/03 09:33:18 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[4] at map at 
AvroWriteTestCase.scala:66), which has no missing parents
14/09/03 09:33:19 INFO MemoryStore: ensureFreeSpace(45032) called with 
curMem=0, maxMem=1030823608
14/09/03 09:33:19 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 44.0 KB, free 983.0 MB)
14/09/03 09:33:19 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
(MappedRDD[4] at map at AvroWriteTestCase.scala:66)
14/09/03 09:33:19 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/09/03 09:33:19 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
localhost, PROCESS_LOCAL, 3191 bytes)
14/09/03 09:33:19 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/09/03 09:33:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at 
org.apache.spark.sql.SchemaRDDLike$class.queryExecution(SchemaRDDLike.scala:52)
at 
org.apache.spark.sql.SchemaRDD.queryExecution$lzycompute(SchemaRDD.scala:103)
at org.apache.spark.sql.SchemaRDD.queryExecution(SchemaRDD.scala:103)
at org.apache.spark.sql.SchemaRDD.schema(SchemaRDD.scala:126)
at

[jira] [Created] (SPARK-3368) Spark cannot be used with Avro and Parquet

2014-09-03 Thread Graham Dennis (JIRA)

Graham Dennis created SPARK-3368:


 Summary: Spark cannot be used with Avro and Parquet
 Key: SPARK-3368
 URL: https://issues.apache.org/jira/browse/SPARK-3368
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Graham Dennis


Spark cannot currently (as of 1.0.2) use any Parquet write support classes that 
are not part of the spark assembly jar (at least when launched using 
`spark-submit`).  This prevents using Avro with Parquet.

See https://github.com/GrahamDennis/spark-avro-parquet for a test case to 
reproduce this issue.

The problem appears in the master logs as:

{noformat}
14/09/03 17:31:10 ERROR Executor: Exception in task ID 0
parquet.hadoop.BadConfigurationException: could not instanciate class 
parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class
at 
parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121)
at 
parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115)
... 11 more
{noformat}

The root cause of the problem is that the class loader that's used to find the 
Parquet write support class only searches the spark assembly jar and doesn't 
also search the application jar.  A solution would be to ensure that the 
application jar is always available on the executor classpath.  This is the 
same underlying issue as SPARK-2878, and SPARK-3166



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3368) Spark cannot be used with Avro and Parquet

2014-09-03 Thread Graham Dennis (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119522#comment-14119522
 ] 

Graham Dennis commented on SPARK-3368:
--

There are a couple of github repos that demonstrate using Avro  Parquet, see 
https://github.com/AndreSchumacher/avro-parquet-spark-example and 
https://github.com/massie/spark-parquet-example

In the first case, data is written out locally, and in the second, spark is 
launched via maven (not spark-submit) which puts both spark jars and 
application jars on the classpath at launch time.

 Spark cannot be used with Avro and Parquet
 --

 Key: SPARK-3368
 URL: https://issues.apache.org/jira/browse/SPARK-3368
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Graham Dennis

 Spark cannot currently (as of 1.0.2) use any Parquet write support classes 
 that are not part of the spark assembly jar (at least when launched using 
 `spark-submit`).  This prevents using Avro with Parquet.
 See https://github.com/GrahamDennis/spark-avro-parquet for a test case to 
 reproduce this issue.
 The problem appears in the master logs as:
 {noformat}
 14/09/03 17:31:10 ERROR Executor: Exception in task ID 0
 parquet.hadoop.BadConfigurationException: could not instanciate class 
 parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class
   at 
 parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121)
   at 
 parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:190)
   at 
 parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115)
   ... 11 more
 {noformat}
 The root cause of the problem is that the class loader that's used to find 
 the Parquet write support class only searches the spark assembly jar and 
 doesn't also search the application jar.  A solution would be to ensure that 
 the application jar is always available on the executor classpath.  This is 
 the same underlying issue as SPARK-2878, and SPARK-3166



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3368) Spark cannot be used with Avro and Parquet

2014-09-03 Thread Graham Dennis (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119524#comment-14119524
 ] 

Graham Dennis commented on SPARK-3368:
--

[~rxin]: I filed this issue to demonstate that the classpath issue behind 
SPARK-2878 and SPARK-3166 appears in other situations too.

 Spark cannot be used with Avro and Parquet
 --

 Key: SPARK-3368
 URL: https://issues.apache.org/jira/browse/SPARK-3368
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Graham Dennis

 Spark cannot currently (as of 1.0.2) use any Parquet write support classes 
 that are not part of the spark assembly jar (at least when launched using 
 `spark-submit`).  This prevents using Avro with Parquet.
 See https://github.com/GrahamDennis/spark-avro-parquet for a test case to 
 reproduce this issue.
 The problem appears in the master logs as:
 {noformat}
 14/09/03 17:31:10 ERROR Executor: Exception in task ID 0
 parquet.hadoop.BadConfigurationException: could not instanciate class 
 parquet.avro.AvroWriteSupport set in job conf at parquet.write.support.class
   at 
 parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:121)
   at 
 parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:302)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
   at 
 parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:714)
   at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:699)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ClassNotFoundException: parquet.avro.AvroWriteSupport
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:190)
   at 
 parquet.hadoop.ParquetOutputFormat.getWriteSupportClass(ParquetOutputFormat.java:115)
   ... 11 more
 {noformat}
 The root cause of the problem is that the class loader that's used to find 
 the Parquet write support class only searches the spark assembly jar and 
 doesn't also search the application jar.  A solution would be to ensure that 
 the application jar is always available on the executor classpath.  This is 
 the same underlying issue as SPARK-2878, and SPARK-3166



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3273) We should read the version information from the same place.

2014-09-03 Thread Guoqiang Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3273:
---
Summary: We should read the version information from the same place.  (was: 
The spark version in the welcome message of spark-shell is not correct)

 We should read the version information from the same place.
 ---

 Key: SPARK-3273
 URL: https://issues.apache.org/jira/browse/SPARK-3273
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Reporter: Guoqiang Li
Assignee: Guoqiang Li
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF


 [ 
https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3336:
---
Description: 
Running pyspark on a spark cluster with standalone master.
Cannot group by field on a UDF. But we can group by UDF in Scala.

For example:
q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo)  FROM bar GROUP BY MYUDF(foo)')
out = q.collect()

I got this exception:
{code}
Py4JJavaError: An error occurred while calling o183.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 56.0 
(TID 14038, ip-10-33-9-144.us-west-2.compute.internal): 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: pythonUDF#1278

org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)

org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)

org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)

org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)

org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)

org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)

org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)

org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)

org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
scala.collection.immutable.List.foreach(List.scala:318)
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
scala.collection.AbstractTraversable.map(Traversable.scala:105)

org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at

[jira] [Comment Edited] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF


[ 
https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119400#comment-14119400
 ] 

Reynold Xin edited comment on SPARK-3336 at 9/3/14 8:45 AM:


[~marmbrus], If we reverse the order of count() and MYUDF(), then it can run 
successfully:

{code}
SELECT MYUDF(foo), COUNT(1) FROM bar  GROUP BY MYUDF(foo);
{code}

If we put COUNT(1) before MYUDF(foo), it will also raise the same exception in 
Scala:

{code}
scala sqlContext.sql(SELECT count(1), MYUDF(name) from people GROUP BY 
MYUDF(name) ).collect()
14/09/02 22:45:50 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: name#0
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:178)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at

[jira] [Created] (SPARK-3369) Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator

2014-09-03 Thread Sean Owen (JIRA)

Sean Owen created SPARK-3369:


 Summary: Java mapPartitions Iterator-Iterable is inconsistent 
with Scala's Iterator-Iterator
 Key: SPARK-3369
 URL: https://issues.apache.org/jira/browse/SPARK-3369
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 1.0.2
Reporter: Sean Owen


{{mapPartitions}} in the Scala RDD API takes a function that transforms an 
{{Iterator}} to an {{Iterator}}: 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an 
{{Iterator}} but is requires to return an {{Iterable}}, which is a stronger 
condition and appears inconsistent. It's a problematic inconsistent though 
because this seems to require copying all of the input into memory in order to 
create an object that can be iterated many times, since the input does not 
afford this itself.

Similarity for other {{mapPartitions*}} methods and other 
{{*FlatMapFunctions}}s in Java.

(Is there a reason for this difference that I'm overlooking?)

If I'm right that this was inadvertent inconsistency, then the big issue here 
is that of course this is part of a public API. Workarounds I can think of:

Promise that Spark will only call {{iterator()}} once, so implementors can use 
a hacky {{IteratorIterable}} that returns the same {{Iterator}}.

Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
desired signature, and deprecate existing ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-03 Thread kay feng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119660#comment-14119660
 ] 

kay feng commented on SPARK-3336:
-

We tested on spark 1.1.0 rc2.
In spark-shell, it works:
./bin/spark-shell
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.parallelize(List({foo:bar}))
sqlContext.jsonRDD(rdd).registerAsTable(t1)
sqlContext.registerFunction(len, (s: String) = s.length)
sqlContext.sql(select len(foo), count(1) from t1 group by len(foo)).collect()

However, in pyspark, we still have error: TreeNodeException: Binding 
attribute...
./bin/pyspark
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
rdd = sc.parallelize([{foo:bar}])
sqlContext.jsonRDD(rdd).registerAsTable(t1)
sqlContext.registerFunction(len, lambda s: len(s))
sqlContext.sql(select len(foo), count(1) from t1 group by len(foo)).collect()

 [Spark SQL] In pyspark, cannot group by field on UDF
 

 Key: SPARK-3336
 URL: https://issues.apache.org/jira/browse/SPARK-3336
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng
Assignee: Davies Liu

 Running pyspark on a spark cluster with standalone master.
 Cannot group by field on a UDF. But we can group by UDF in Scala.
 For example:
 q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo)  FROM bar GROUP BY 
 MYUDF(foo)')
 out = q.collect()
 I got this exception:
 {code}
 Py4JJavaError: An error occurred while calling o183.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 
 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 
 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): 
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
 attribute, tree: pythonUDF#1278
 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 scala.collection.immutable.List.foreach(List.scala:318)
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)

[jira] [Created] (SPARK-3370) The simple test error

2014-09-03 Thread edmonds (JIRA)

edmonds created SPARK-3370:
--

 Summary: The simple test error
 Key: SPARK-3370
 URL: https://issues.apache.org/jira/browse/SPARK-3370
 Project: Spark
  Issue Type: Question
  Components: Examples, MLlib, PySpark
Affects Versions: 1.0.2
 Environment: Linux  3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 
01:58:01 UTC 2014 i686 i686 i686 GNU/Linux
Reporter: edmonds


I tried to run the ALS example, but failed both in als.py, and 
MovielenALS.scala, how can I fix that? the following is the report for running 
the als.py
Traceback (most recent call last):
  File ALS.py, line 61, in module
model = ALS.train(ratings, rank, numIterations)
  File spark-1.0.2-bin-hadoop2/python/pyspark/mllib/recommendation.py, line 
66, in train
ratingBytes._jrdd, rank, iterations, lambda_, blocks)
  File 
spark-1.0.2-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, 
line 537, in __call__
  File 
spark-1.0.2-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o24.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
23.0:0 failed 1 times, most recent failure: Exception failure in TID 62 on host 
localhost: java.lang.StackOverflowError
java.util.zip.Inflater.inflate(Inflater.java:259)
java.util.zip.InflaterInputStream.read(InflaterInputStream.java:152)
java.util.zip.GZIPInputStream.read(GZIPInputStream.java:116)

java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)

java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2718)

java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1979)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
scala.collection.immutable.$colon$colon.readObject(List.scala:366)
sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
scala.collection.immutable.$colon$colon.readObject(List.scala:362)
sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

[jira] [Created] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error

2014-09-03 Thread Pei-Lun Lee (JIRA)

Pei-Lun Lee created SPARK-3371:
--

 Summary: Spark SQL: Renaming a function expression with group by 
gives error
 Key: SPARK-3371
 URL: https://issues.apache.org/jira/browse/SPARK-3371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Pei-Lun Lee


{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.parallelize(List({foo:bar}))
sqlContext.jsonRDD(rdd).registerAsTable(t1)
sqlContext.registerFunction(len, (s: String) = s.length)
sqlContext.sql(select len(foo) as a, count(1) from t1 group by 
len(foo)).collect()
{code}

running above code in spark-shell gives the following error

{noformat}
14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: foo#0
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
{noformat}

remove as a in the query causes no error



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3372) Scalastyle fails due to UTF-8 character being contained in Gradient.scala


[ 
https://issues.apache.org/jira/browse/SPARK-3372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119706#comment-14119706
 ] 

Apache Spark commented on SPARK-3372:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2248

 Scalastyle fails due to UTF-8 character being contained in Gradient.scala
 -

 Key: SPARK-3372
 URL: https://issues.apache.org/jira/browse/SPARK-3372
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Windows8 / Java 7 / Maven
Reporter: Kousuke Saruta
Priority: Blocker

 Gradient.scala includes 2 UTF-8 hyphens.
 Caused by this, mvn package falied on Windows8 because it cannot pass 
 checkstyle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3372) Scalastyle fails due to UTF-8 character contained in Gradient.scala


 [ 
https://issues.apache.org/jira/browse/SPARK-3372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3372:
--
Summary: Scalastyle fails due to UTF-8 character contained in 
Gradient.scala  (was: Scalastyle fails due to UTF-8 character being contained 
in Gradient.scala)

 Scalastyle fails due to UTF-8 character contained in Gradient.scala
 ---

 Key: SPARK-3372
 URL: https://issues.apache.org/jira/browse/SPARK-3372
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Windows8 / Java 7 / Maven
Reporter: Kousuke Saruta
Priority: Blocker

 Gradient.scala includes 2 UTF-8 hyphens.
 Caused by this, mvn package falied on Windows8 because it cannot pass 
 checkstyle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3350) Strange anomaly trying to write a SchemaRDD into an Avro file

2014-09-03 Thread David Greco (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119742#comment-14119742
 ] 

David Greco commented on SPARK-3350:


I managed to make it working, below the passing code, I changed the function 
rowToAvroKey to return a function instead. Now it works. 
I think it's related to how spark serialises closures.
Just my .02
Hope it helps.

package com.eligotech.hnavigator.prototypes.spark

import org.apache.avro.{SchemaBuilder, Schema}
import org.apache.avro.SchemaBuilder.FieldAssembler
import org.apache.avro.generic.GenericData.Record
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroOutputFormat, AvroWrapper, AvroJob, AvroKey}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.BooleanType
import org.apache.spark.sql.DoubleType
import org.apache.spark.sql.StructType
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.types.FloatType
import org.apache.spark.sql.catalyst.types.IntegerType
import org.apache.spark.sql.catalyst.types.LongType
import org.apache.spark.sql.catalyst.types.StringType
import org.apache.spark.sql.catalyst.types.StructField
import org.apache.spark.{SparkConf, SparkContext}

object AvroWriteTestCase extends App {
  val conf = new SparkConf().setAppName(HarpoonSparkTest).setMaster(local)
  val sparkContext = new SparkContext(conf)
  val sqlContext = new SQLContext(sparkContext)

  val tpeople = sparkContext.parallelize((1 to 100).map(i = Person(sCIAO$i, 
i)))
  case class Person(name: String, age: Int)
  import sqlContext._
  tpeople.registerTempTable(people)

  val people = sqlContext.sql(select * from people)

  def rowToAvroKey(structType: StructType): Row = (AvroKey[GenericRecord], 
NullWritable) = (row: Row) ={
val schema = structTypeToAvroSchema(structType)
val record = new Record(schema)
var i = 0
structType.fields.foreach {
  f =
record.put(f.name, row.apply(i))
i += 1
}
(new AvroKey(record), NullWritable.get())
  }

  def structTypeToAvroSchema(structType: StructType): Schema = {
val fieldsAssembler: FieldAssembler[Schema] = 
SchemaBuilder.record(RECORD).fields()
structType.fields foreach {
  (field: StructField) = field.dataType match {
case IntegerType = 
fieldsAssembler.name(field.name).`type`().nullable().intType().noDefault()
case LongType = 
fieldsAssembler.name(field.name).`type`().nullable().longType().noDefault()
case StringType = 
fieldsAssembler.name(field.name).`type`().nullable().stringType().noDefault()
case FloatType = 
fieldsAssembler.name(field.name).`type`().nullable().floatType().noDefault()
case DoubleType = 
fieldsAssembler.name(field.name).`type`().nullable().doubleType().noDefault()
case BooleanType = 
fieldsAssembler.name(field.name).`type`().nullable().booleanType().noDefault()
case _ = throw new IllegalArgumentException(StructType  with 
unhandled type:  + field)
  }
}
fieldsAssembler.endRecord()
  }

  def saveAsAvroFile(schemaRDD: SchemaRDD, path: String): Unit = {
val jobConf = new JobConf(schemaRDD.sparkContext.hadoopConfiguration)
val schema = structTypeToAvroSchema(schemaRDD.schema)
AvroJob.setOutputSchema(jobConf, schema)
import org.apache.spark.SparkContext._
schemaRDD.map(rowToAvroKey(schemaRDD.schema)).
  saveAsHadoopFile(path,
classOf[AvroWrapper[GenericRecord]],
classOf[NullWritable],
classOf[AvroOutputFormat[GenericRecord]],
jobConf)
  }

  saveAsAvroFile(people, /tmp/people.avro)

}


 Strange anomaly trying to write a SchemaRDD into an Avro file
 -

 Key: SPARK-3350
 URL: https://issues.apache.org/jira/browse/SPARK-3350
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: jdk1.7, macosx
Reporter: David Greco
 Attachments: AvroWriteTestCase.scala


 I found a way to automatically save a SchemaRDD in Avro format, similarly to 
 what Spark does with parquet file.
 I attached a test case to this issue. The code fails with a NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-563) Run findBugs and IDEA inspections in the codebase

2014-09-03 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-563.
-
Resolution: Won't Fix

This appears to be obsolete/stale too.

 Run findBugs and IDEA inspections in the codebase
 -

 Key: SPARK-563
 URL: https://issues.apache.org/jira/browse/SPARK-563
 Project: Spark
  Issue Type: Improvement
Reporter: Ismael Juma

 I ran into a few instances of unused local variables and unnecessary usage of 
 the 'return' keyword (the recommended practice is to avoid 'return' if 
 possible) and thought it would be good to run findBugs and IDEA inspections 
 to clean-up the code.
 I am willing to do this, but first would like to know whether you agree that 
 this is a good idea and whether this is the right time to do it. These 
 changes tend to affect many source files and can cause issues if there is 
 major work ongoing in separate branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes


[ 
https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119854#comment-14119854
 ] 

Thomas Graves commented on SPARK-2718:
--

[~andrewor]  the pr for this is closed, can we close the jira also?

 YARN does not handle spark configs with quotes or backslashes
 -

 Key: SPARK-2718
 URL: https://issues.apache.org/jira/browse/SPARK-2718
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Andrew Or
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.1.0


 Say we have the following config:
 {code}
 spark.app.name spark shell with spaces and quotes  and \ backslashes \
 {code}
 This works in standalone mode but not in YARN mode. This is because 
 standalone mode uses Java's ProcessBuilder, which handles these cases nicely, 
 but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, 
 which does not. As a result, submitting an application to YARN with the given 
 config leads to the following exception:
 {code}
 line 0: unexpected EOF while looking for matching `'
 syntax error: unexpected end of file
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
   at org.apache.hadoop.util.Shell.run(Shell.java:418)
   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
   ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes


[ 
https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119854#comment-14119854
 ] 

Thomas Graves edited comment on SPARK-2718 at 9/3/14 1:37 PM:
--

[~andrewor]  the pr for this is closed, can we resolve the jira also?


was (Author: tgraves):
[~andrewor]  the pr for this is closed, can we close the jira also?

 YARN does not handle spark configs with quotes or backslashes
 -

 Key: SPARK-2718
 URL: https://issues.apache.org/jira/browse/SPARK-2718
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Andrew Or
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.1.0


 Say we have the following config:
 {code}
 spark.app.name spark shell with spaces and quotes  and \ backslashes \
 {code}
 This works in standalone mode but not in YARN mode. This is because 
 standalone mode uses Java's ProcessBuilder, which handles these cases nicely, 
 but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, 
 which does not. As a result, submitting an application to YARN with the given 
 config leads to the following exception:
 {code}
 line 0: unexpected EOF while looking for matching `'
 syntax error: unexpected end of file
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
   at org.apache.hadoop.util.Shell.run(Shell.java:418)
   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
   ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3373) trim some useless informations of VertexRDD in some cases

uncleGen created SPARK-3373:
---

 Summary: trim some useless informations of VertexRDD in some cases
 Key: SPARK-3373
 URL: https://issues.apache.org/jira/browse/SPARK-3373
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.2, 1.0.0
Reporter: uncleGen
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-3187) Refactor and cleanup Yarn allocator code


 [ 
https://issues.apache.org/jira/browse/SPARK-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3187.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

 Refactor and cleanup Yarn allocator code
 

 Key: SPARK-3187
 URL: https://issues.apache.org/jira/browse/SPARK-3187
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.2.0


 This is a follow-up to SPARK-2933, which dealt with the ApplicationMaster 
 code.
 There's a lot of logic in the container allocation code in alpha/stable that 
 could probably be merged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3375) spark on yarn alpha container allocation issues

Thomas Graves created SPARK-3375:


 Summary: spark on yarn alpha container allocation issues
 Key: SPARK-3375
 URL: https://issues.apache.org/jira/browse/SPARK-3375
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Priority: Blocker


It looks like if yarn doesn't get the containers immediately it stops asking 
for them and the yarn application hangs with never getting any executors.  This 
was introduced by https://github.com/apache/spark/pull/2169



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

uncleGen created SPARK-3376:
---

 Summary: Memory-based shuffle strategy to reduce overhead of disk 
I/O
 Key: SPARK-3376
 URL: https://issues.apache.org/jira/browse/SPARK-3376
 Project: Spark
  Issue Type: Planned Work
Reporter: uncleGen
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3377) codahale base Metrics data between applications can jumble up together

Kousuke Saruta created SPARK-3377:
-

 Summary: codahale base Metrics data between applications can 
jumble up together
 Key: SPARK-3377
 URL: https://issues.apache.org/jira/browse/SPARK-3377
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Critical


I'm using codahale base MetricsSystem of Spark with JMX or Graphite, and I saw 
following 2 problems.

(1) When applications which have same spark.app.name run on cluster at the same 
time, some metrics names jumble up together. e.g, 
SparkPi.DAGScheduler.stage.failedStages jumble.

(2) When 2+ executors run on the same machine, JVM metrics of each executors 
jumble.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O


 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: I think a memory-based shuffle can reduce some overhead of 
disk I/O. I just want to know is there any plan to do something about it. Or 
any suggestion about it. Base on the work (SPARK-2044), it is feasible to have 
several implementations of  shuffle.  (was: I think a memory-based shuffle can 
reduce some overhead of disk I/O. I just want to know is there any plan to do 
something about it. Or any suggestion about it. Base on the work (SPARK-2044), 
it is feasible to have several implementation of  shuffle.)

 Memory-based shuffle strategy to reduce overhead of disk I/O
 

 Key: SPARK-3376
 URL: https://issues.apache.org/jira/browse/SPARK-3376
 Project: Spark
  Issue Type: Planned Work
Reporter: uncleGen
Priority: Trivial

 I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
 want to know is there any plan to do something about it. Or any suggestion 
 about it. Base on the work (SPARK-2044), it is feasible to have several 
 implementations of  shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O


 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

uncleGen updated SPARK-3376:

Description: I think a memory-based shuffle can reduce some overhead of 
disk I/O. I just want to know is there any plan to do something about it. Or 
any suggestion about it. Base on the work (SPARK-2044), it is feasible to have 
several implementation of  shuffle.

 Memory-based shuffle strategy to reduce overhead of disk I/O
 

 Key: SPARK-3376
 URL: https://issues.apache.org/jira/browse/SPARK-3376
 Project: Spark
  Issue Type: Planned Work
Reporter: uncleGen
Priority: Trivial

 I think a memory-based shuffle can reduce some overhead of disk I/O. I just 
 want to know is there any plan to do something about it. Or any suggestion 
 about it. Base on the work (SPARK-2044), it is feasible to have several 
 implementation of  shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3377) codahale base Metrics data between applications can jumble up together


[ 
https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14119975#comment-14119975
 ] 

Apache Spark commented on SPARK-3377:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2250

 codahale base Metrics data between applications can jumble up together
 --

 Key: SPARK-3377
 URL: https://issues.apache.org/jira/browse/SPARK-3377
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Critical

 I'm using codahale base MetricsSystem of Spark with JMX or Graphite, and I 
 saw following 2 problems.
 (1) When applications which have same spark.app.name run on cluster at the 
 same time, some metrics names jumble up together. e.g, 
 SparkPi.DAGScheduler.stage.failedStages jumble.
 (2) When 2+ executors run on the same machine, JVM metrics of each executors 
 jumble.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3377) codahale base Metrics data between applications can jumble up together

[
https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Kousuke Saruta updated SPARK-3377:
--
Description:
I'm using codahale base MetricsSystem of Spark with JMX or Graphite, and I saw
following 2 problems.

(1) When applications which have same spark.app.name run on cluster at the same
time, some metrics names jumble up together. e.g,
SparkPi.DAGScheduler.stage.failedStages jumble.

(2) When 2+ executors run on the same machine, JVM metrics of each executors
jumble. e.g, We current implementation cannot distinguish metric jvm.memory
is for which executor.

was:
I'm using codahale base MetricsSystem of Spark with JMX or Graphite, and I saw
following 2 problems.

(1) When applications which have same spark.app.name run on cluster at the same
time, some metrics names jumble up together. e.g,
SparkPi.DAGScheduler.stage.failedStages jumble.

(2) When 2+ executors run on the same machine, JVM metrics of each executors
jumble.

codahale base Metrics data between applications can jumble up together
--

Key: SPARK-3377
URL: https://issues.apache.org/jira/browse/SPARK-3377
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Critical

I'm using codahale base MetricsSystem of Spark with JMX or Graphite, and I
saw following 2 problems.
(1) When applications which have same spark.app.name run on cluster at the
same time, some metrics names jumble up together. e.g,
SparkPi.DAGScheduler.stage.failedStages jumble.
(2) When 2+ executors run on the same machine, JVM metrics of each executors
jumble. e.g, We current implementation cannot distinguish metric jvm.memory
is for which executor.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3304) ApplicationMaster's Finish status is wrong when uncaught exception is thrown from ReporterThread


[ 
https://issues.apache.org/jira/browse/SPARK-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120030#comment-14120030
 ] 

Kousuke Saruta commented on SPARK-3304:
---

Actually, I didn't see exception causing Reporter thread dead but unexpected 
exception like OOM or any other exceptions due to communication error with RM 
during resource negotiation caused by temporary traffic jam can be thrown, I 
think.

 ApplicationMaster's Finish status is wrong when uncaught exception is thrown 
 from ReporterThread
 

 Key: SPARK-3304
 URL: https://issues.apache.org/jira/browse/SPARK-3304
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 It's rare case but even though uncaught exception is thrown from Reporter 
 thread, allocating containers, finish status is marked as SUCCEEDED.
 In addition, in YARN Cluster mode, if we don't notice Reporter thread is 
 dead, we waits long time until timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3378) Replace the word SparkSQL with right word Spark SQL

Kousuke Saruta created SPARK-3378:
-

 Summary: Replace the word SparkSQL with right word Spark SQL
 Key: SPARK-3378
 URL: https://issues.apache.org/jira/browse/SPARK-3378
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Trivial


In programming-guide.md, there are 2 SparkSQL. We should use Spark SQL 
instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3378) Replace the word SparkSQL with right word Spark SQL


[ 
https://issues.apache.org/jira/browse/SPARK-3378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120081#comment-14120081
 ] 

Apache Spark commented on SPARK-3378:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2251

 Replace the word SparkSQL with right word Spark SQL
 ---

 Key: SPARK-3378
 URL: https://issues.apache.org/jira/browse/SPARK-3378
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Trivial

 In programming-guide.md, there are 2 SparkSQL. We should use Spark SQL 
 instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-03 Thread Davies Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120141#comment-14120141
 ] 

Davies Liu commented on SPARK-3336:
---

[~kayfeng] The pyspark test case ran successfully in master but failed in 1.1 
branch.

After testing, This bug was fixed in https://github.com/apache/spark/pull/2155 
accidentally.

 [Spark SQL] In pyspark, cannot group by field on UDF
 

 Key: SPARK-3336
 URL: https://issues.apache.org/jira/browse/SPARK-3336
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng
Assignee: Davies Liu

 Running pyspark on a spark cluster with standalone master.
 Cannot group by field on a UDF. But we can group by UDF in Scala.
 For example:
 q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo)  FROM bar GROUP BY 
 MYUDF(foo)')
 out = q.collect()
 I got this exception:
 {code}
 Py4JJavaError: An error occurred while calling o183.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 
 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 
 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): 
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
 attribute, tree: pythonUDF#1278
 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 scala.collection.immutable.List.foreach(List.scala:318)
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

[jira] [Commented] (SPARK-3336) [Spark SQL] In pyspark, cannot group by field on UDF

2014-09-03 Thread Davies Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120145#comment-14120145
 ] 

Davies Liu commented on SPARK-3336:
---

[~marmbrus] Should we merge this patch into 1.1 branch ? we can hold it for 
1.1.1.

 [Spark SQL] In pyspark, cannot group by field on UDF
 

 Key: SPARK-3336
 URL: https://issues.apache.org/jira/browse/SPARK-3336
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.1.0
Reporter: kay feng
Assignee: Davies Liu

 Running pyspark on a spark cluster with standalone master.
 Cannot group by field on a UDF. But we can group by UDF in Scala.
 For example:
 q = sqlContext.sql('SELECT COUNT(*), MYUDF(foo)  FROM bar GROUP BY 
 MYUDF(foo)')
 out = q.collect()
 I got this exception:
 {code}
 Py4JJavaError: An error occurred while calling o183.collect.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 
 in stage 56.0 failed 4 times, most recent failure: Lost task 26.3 in stage 
 56.0 (TID 14038, ip-10-33-9-144.us-west-2.compute.internal): 
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
 attribute, tree: pythonUDF#1278
 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 scala.collection.AbstractIterator.to(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
 
 org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 scala.collection.immutable.List.foreach(List.scala:318)
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 
 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.init(Projection.scala:52)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.init(Aggregate.scala:176)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)
 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

[jira] [Created] (SPARK-3379) Implement 'POWER' for sql

Xinyun Huang created SPARK-3379:
---

 Summary: Implement 'POWER' for sql
 Key: SPARK-3379
 URL: https://issues.apache.org/jira/browse/SPARK-3379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0


Add support for the mathematical function POWER within spark sql. Spitted 
from SPARK-3176



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3176) Implement 'ABS' and 'LAST' for sql


 [ 
https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyun Huang updated SPARK-3176:

Summary: Implement 'ABS' and 'LAST' for sql  (was: Implement 'POWER', 'ABS 
and 'LAST' for sql)

 Implement 'ABS' and 'LAST' for sql
 --

 Key: SPARK-3176
 URL: https://issues.apache.org/jira/browse/SPARK-3176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 3h
  Remaining Estimate: 3h

 Add support for the mathematical function POWER and ABS and  the analytic 
 function last to return a subset of the rows satisfying a query within 
 spark sql.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3176) Implement 'ABS' and 'LAST' for sql


[ 
https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120162#comment-14120162
 ] 

Xinyun Huang commented on SPARK-3176:
-

Split POWER to SPARK-3379

 Implement 'ABS' and 'LAST' for sql
 --

 Key: SPARK-3176
 URL: https://issues.apache.org/jira/browse/SPARK-3176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 3h
  Remaining Estimate: 3h

 Add support for the mathematical function POWER and ABS and  the analytic 
 function last to return a subset of the rows satisfying a query within 
 spark sql.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3176) Implement 'ABS' and 'LAST' for sql


 [ 
https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyun Huang updated SPARK-3176:

Description: Add support for the mathematical function ABS and  the 
analytic function last to return a subset of the rows satisfying a query 
within spark sql.  (was: Add support for the mathematical function POWER and 
ABS and  the analytic function last to return a subset of the rows 
satisfying a query within spark sql.)

 Implement 'ABS' and 'LAST' for sql
 --

 Key: SPARK-3176
 URL: https://issues.apache.org/jira/browse/SPARK-3176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 3h
  Remaining Estimate: 3h

 Add support for the mathematical function ABS and  the analytic function 
 last to return a subset of the rows satisfying a query within spark sql.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3379) Implement 'POWER' for sql


[ 
https://issues.apache.org/jira/browse/SPARK-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120210#comment-14120210
 ] 

Apache Spark commented on SPARK-3379:
-

User 'xinyunh' has created a pull request for this issue:
https://github.com/apache/spark/pull/2252

 Implement 'POWER' for sql
 -

 Key: SPARK-3379
 URL: https://issues.apache.org/jira/browse/SPARK-3379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
 Environment: All
Reporter: Xinyun Huang
Priority: Minor
 Fix For: 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Add support for the mathematical function POWER within spark sql. Spitted 
 from SPARK-3176



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3366) Compute best splits distributively in decision tree

[
https://issues.apache.org/jira/browse/SPARK-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120218#comment-14120218
]

Joseph K. Bradley commented on SPARK-3366:
--

It is not really a bottleneck for tests I have done. E.g., on 2M examples,
3500 features, 15 workers EC2, about 95% of the time is spent during the
aggregation. (This is a rough estimate based on timing results from several
tests.)

In terms of scaling, most factors should not make the driver be a bottleneck:
* # examples: Increasing this will not increase the load on the driver.
* # features: Increasing this will increase the load on the driver (linearly in
# features), but it will also increase the amount of communication (also
linearly). I suspect communication will outweigh the cost of computation in
most cases.
* # bins/splits: Same as # features.
* tree size: Same as # features.

Are there factors I have missed?

I suspect we could save some time by doing a distributed best splits
computation, but I do not think it is high priority.

Compute best splits distributively in decision tree
---

Key: SPARK-3366
URL: https://issues.apache.org/jira/browse/SPARK-3366
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Xiangrui Meng

The current implementation computes all best splits locally on the driver,
which makes the driver a bottleneck for both communication and computation.
It would be nice if we can compute the best splits distributively.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-3327) Make broadcasted value mutable for caching useful information

[
https://issues.apache.org/jira/browse/SPARK-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Reynold Xin resolved SPARK-3327.

Resolution: Won't Fix

Make broadcasted value mutable for caching useful information
-

Key: SPARK-3327
URL: https://issues.apache.org/jira/browse/SPARK-3327
Project: Spark
Issue Type: New Feature
Reporter: Liang-Chi Hsieh

When implementing some algorithms, it is helpful that we can cache some
useful information for using later.
Specifically, we would like to performa operation A on each partition of
data. Some variables are updated. Then we want to run operation B on the
data too. B operation uses the variables updated by operation A.
One of the examples is the Liblinear on Spark from Dr. Lin. They discuss the
problem in Section IV.D of the paper Large-scale Logistic Regression and
Linear Support Vector Machines Using Spark.
Currently broadcasted variables can satisfy partial need for that. We can
broadcast variables to reduce communication costs. However, because
broadcasted variables can not be modified, it doesn't help solve the problem
and we maybe need to collect updated variables back to master and broadcast
them again before conducting next data operation.
I would like to add an interface to broadcasted variables to make them
mutable so later data operations can use them again.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3380) DecisionTree: overflow and precision in aggregation

Joseph K. Bradley created SPARK-3380:


 Summary: DecisionTree: overflow and precision in aggregation
 Key: SPARK-3380
 URL: https://issues.apache.org/jira/browse/SPARK-3380
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley


DecisionTree does not check for overflows or loss of precision while 
aggregating sufficient statistics (binAggregates).  It uses Double, which may 
be a problem for DecisionTree regression since the variance calculation could 
blow up.  At the least, it could check for overflow and renormalize as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3381) DecisionTree: eliminate bins for unordered features

Joseph K. Bradley created SPARK-3381:


 Summary: DecisionTree: eliminate bins for unordered features
 Key: SPARK-3381
 URL: https://issues.apache.org/jira/browse/SPARK-3381
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Trivial


Code simplification: DecisionTree currently allocates bins for unordered 
features (in findSplitsBins).  However, those bins are not needed; only the 
splits are required.  This change will require modifying findSplitsBins, as 
well as modifying a few other functions to use splits instead of bins.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3327) Make broadcasted value mutable for caching useful information

[
https://issues.apache.org/jira/browse/SPARK-3327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120223#comment-14120223
]

Reynold Xin commented on SPARK-3327:

Closing this as won't fix. See comments in the pull request for more
information: https://github.com/apache/spark/pull/2217

Make broadcasted value mutable for caching useful information
-

Key: SPARK-3327
URL: https://issues.apache.org/jira/browse/SPARK-3327
Project: Spark
Issue Type: New Feature
Reporter: Liang-Chi Hsieh

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3382) GradientDescent convergence tolerance

Joseph K. Bradley created SPARK-3382:


 Summary: GradientDescent convergence tolerance
 Key: SPARK-3382
 URL: https://issues.apache.org/jira/browse/SPARK-3382
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Minor


GradientDescent should support a convergence tolerance setting.  In general, 
for optimization, convergence tolerance should be preferred over a limit on the 
number of iterations since it is a somewhat data-adaptive or data-specific 
convergence criterion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-3309) Put all public API in all

2014-09-03 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3309.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2205
[https://github.com/apache/spark/pull/2205]

 Put all public API in __all__
 -

 Key: SPARK-3309
 URL: https://issues.apache.org/jira/browse/SPARK-3309
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.2.0


 __all__ could be used to define a collection of public interfaces in pyspark.
 Also put all the public interfaces in top module of pyspark 
 (pyspark.__init__.py).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3383) DecisionTree aggregate size could be smaller

Joseph K. Bradley created SPARK-3383:


 Summary: DecisionTree aggregate size could be smaller
 Key: SPARK-3383
 URL: https://issues.apache.org/jira/browse/SPARK-3383
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Minor


Storage and communication optimization:
DecisionTree aggregate statistics could store less data (described below).  The 
savings would be significant for datasets with many low-arity categorical 
features (binary features, or unordered categorical features).  Savings would 
be negligible for continuous features.

DecisionTree stores a vector sufficient statistics for each (node, feature, 
bin).  We could store 1 fewer bin per (node, feature): For a given (node, 
feature), if we store these vectors for all but the last bin, and also store 
the total statistics for each node, then we could compute the statistics for 
the last bin.  For binary and unordered categorical features, this would cut in 
half the number of bins to store and communicate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread RJ Nowling (JIRA)

RJ Nowling created SPARK-3384:
-

 Summary: Potential thread unsafe Breeze vector addition in KMeans
 Key: SPARK-3384
 URL: https://issues.apache.org/jira/browse/SPARK-3384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: RJ Nowling


In the KMeans clustering implementation, the Breeze vectors are accumulated 
using +=: 

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162

 This is potentially a thread unsafe operation.  (This is what I observed in 
local testing.)  I suggest changing the += to + -- a new object will be 
allocated but it will be thread safe since it won't write to an old location 
accessed by multiple threads.

Further testing is required to reproduce and verify.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread RJ Nowling (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

RJ Nowling updated SPARK-3384:
--
Description:
In the KMeans clustering implementation, the Breeze vectors are accumulated
using +=. For example,

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162

This is potentially a thread unsafe operation. (This is what I observed in
local testing.) I suggest changing the += to + -- a new object will be
allocated but it will be thread safe since it won't write to an old location
accessed by multiple threads.

Further testing is required to reproduce and verify.

was:
In the KMeans clustering implementation, the Breeze vectors are accumulated
using +=:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162

Further testing is required to reproduce and verify.

Potential thread unsafe Breeze vector addition in KMeans

Key: SPARK-3384
URL: https://issues.apache.org/jira/browse/SPARK-3384
Project: Spark
Issue Type: Bug
Components: MLlib
Reporter: RJ Nowling

In the KMeans clustering implementation, the Breeze vectors are accumulated
using +=. For example,
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162
This is potentially a thread unsafe operation. (This is what I observed in
local testing.) I suggest changing the += to + -- a new object will be
allocated but it will be thread safe since it won't write to an old location
accessed by multiple threads.
Further testing is required to reproduce and verify.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes


[ 
https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120289#comment-14120289
 ] 

Andrew Or commented on SPARK-2718:
--

Yes, thanks for the reminder.

 YARN does not handle spark configs with quotes or backslashes
 -

 Key: SPARK-2718
 URL: https://issues.apache.org/jira/browse/SPARK-2718
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Andrew Or
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.1.0


 Say we have the following config:
 {code}
 spark.app.name spark shell with spaces and quotes  and \ backslashes \
 {code}
 This works in standalone mode but not in YARN mode. This is because 
 standalone mode uses Java's ProcessBuilder, which handles these cases nicely, 
 but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, 
 which does not. As a result, submitting an application to YARN with the given 
 config leads to the following exception:
 {code}
 line 0: unexpected EOF while looking for matching `'
 syntax error: unexpected end of file
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
   at org.apache.hadoop.util.Shell.run(Shell.java:418)
   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
   ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes


 [ 
https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-2718.

Resolution: Fixed

 YARN does not handle spark configs with quotes or backslashes
 -

 Key: SPARK-2718
 URL: https://issues.apache.org/jira/browse/SPARK-2718
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Andrew Or
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.1.0


 Say we have the following config:
 {code}
 spark.app.name spark shell with spaces and quotes  and \ backslashes \
 {code}
 This works in standalone mode but not in YARN mode. This is because 
 standalone mode uses Java's ProcessBuilder, which handles these cases nicely, 
 but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, 
 which does not. As a result, submitting an application to YARN with the given 
 config leads to the following exception:
 {code}
 line 0: unexpected EOF while looking for matching `'
 syntax error: unexpected end of file
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
   at org.apache.hadoop.util.Shell.run(Shell.java:418)
   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
   ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Closed] (SPARK-3216) Spark-shell is broken for branch-1.0


 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3216.

Resolution: Fixed
  Assignee: Andrew Or

 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker

 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. This does not actually affect any released distributions, and so 
 I did not set the affected/fix/target versions. I marked this a blocker 
 because this is completely broken, but it is technically not blocking 
 anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread RJ Nowling (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120294#comment-14120294
 ] 

RJ Nowling commented on SPARK-3384:
---

Xiangrui Meng

I'll try to get a code example together in the next couple days.

Even if Spark itself is thread safe, I would re-iterate that it is easy to make 
the mistake of using += in the wrong place.  I suggest that we should frown 
upon that behavior, document it when we use it, and maybe even add checks for 
the presence of += with Breeze vectors in the tests so we can flag it.

 Potential thread unsafe Breeze vector addition in KMeans
 

 Key: SPARK-3384
 URL: https://issues.apache.org/jira/browse/SPARK-3384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: RJ Nowling

 In the KMeans clustering implementation, the Breeze vectors are accumulated 
 using +=.  For example,
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162
  This is potentially a thread unsafe operation.  (This is what I observed in 
 local testing.)  I suggest changing the += to + -- a new object will be 
 allocated but it will be thread safe since it won't write to an old location 
 accessed by multiple threads.
 Further testing is required to reproduce and verify.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120314#comment-14120314
 ] 

Sean Owen commented on SPARK-3384:
--

[~rnowling] I think it fairly important for speed in that section of code 
though. Using mutable data structures is not a problem if done correctly and 
for the right reason.

 Potential thread unsafe Breeze vector addition in KMeans
 

 Key: SPARK-3384
 URL: https://issues.apache.org/jira/browse/SPARK-3384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: RJ Nowling

 In the KMeans clustering implementation, the Breeze vectors are accumulated 
 using +=.  For example,
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162
  This is potentially a thread unsafe operation.  (This is what I observed in 
 local testing.)  I suggest changing the += to + -- a new object will be 
 allocated but it will be thread safe since it won't write to an old location 
 accessed by multiple threads.
 Further testing is required to reproduce and verify.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3385) Improve shuffle performance

Reynold Xin created SPARK-3385:
--

 Summary: Improve shuffle performance
 Key: SPARK-3385
 URL: https://issues.apache.org/jira/browse/SPARK-3385
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


Just a ticket to track various efforts related to improving shuffle in Spark.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2014-09-03 Thread Derrick Burns (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120325#comment-14120325
]

Derrick Burns commented on SPARK-3219:
--

Great!

You can find my work here:
https://github.com/derrickburns/generalized-kmeans-clustering.git.

I should warn you that I rewrote much of the original Spark clusterer
because the original is too tightly coupled to using the Euclidean norm and
does not
allow one to identify efficiently which points belong to which clusters. I
have tested this version extensively.

You will notice a package call com.rincaro.clusterer.metrics. Please take
a look at the two files EuOps.scala and FastEuclideansOps.scala. They
both implement the Euclidean norm. However, one is much faster than the
other by using the same algebraic transformations that the Spark version
uses. This demonstrates that
it is possible to be efficient while not being tightly coupled. One could
easily re-implement FastEuclideanOps using Breeze or Blas without effecting
the core Kmeans implementation.

Not included in this project is another distance function that that I have
implemented: the Kullback-Leibler distance function, a.k.a. relative
entropy. In my implementation, I also perform algebraic transformations to
expedite the computation, resulting in a distance computation that is even
faster than the fast euclidean norm.

Let me know if this is useful to you.

K-Means clusterer should support Bregman distance functions
---

Key: SPARK-3219
URL: https://issues.apache.org/jira/browse/SPARK-3219
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3386) Reuse serializer and serializer buffer in shuffle block iterator

Reynold Xin created SPARK-3386:
--

 Summary: Reuse serializer and serializer buffer in shuffle block 
iterator
 Key: SPARK-3386
 URL: https://issues.apache.org/jira/browse/SPARK-3386
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3216) Spark-shell is broken for branch-1.0


[ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120347#comment-14120347
 ] 

Patrick Wendell commented on SPARK-3216:


[~andrewor14] for this you should Mark it as a blocker for target version 1.0.3 
and also put fix version 1.0.3 when it is merged.

 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker

 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. This does not actually affect any released distributions, and so 
 I did not set the affected/fix/target versions. I marked this a blocker 
 because this is completely broken, but it is technically not blocking 
 anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0


 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3216:
---
Target Version/s: 1.0.3

 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.3


 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. This does not actually affect any released distributions, and so 
 I did not set the affected/fix/target versions. I marked this a blocker 
 because this is completely broken, but it is technically not blocking 
 anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Reopened] (SPARK-3216) Spark-shell is broken for branch-1.0


 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-3216:


 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.3


 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. This does not actually affect any released distributions, and so 
 I did not set the affected/fix/target versions. I marked this a blocker 
 because this is completely broken, but it is technically not blocking 
 anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0


 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3216:
---
Fix Version/s: 1.0.3

 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.3


 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. This does not actually affect any released distributions, and so 
 I did not set the affected/fix/target versions. I marked this a blocker 
 because this is completely broken, but it is technically not blocking 
 anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-3216) Spark-shell is broken for branch-1.0


 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3216.

Resolution: Fixed

 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.3


 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. This does not actually affect any released distributions, and so 
 I did not set the affected/fix/target versions. I marked this a blocker 
 because this is completely broken, but it is technically not blocking 
 anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances

2014-09-03 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120354#comment-14120354
 ] 

Josh Rosen commented on SPARK-3358:
---

Agreed.  Long term, I think it would be better to address the causes behind why 
we need to fork so many processes.

 PySpark worker fork()ing performance regression in m3.* / PVM instances
 ---

 Key: SPARK-3358
 URL: https://issues.apache.org/jira/browse/SPARK-3358
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
 Environment: m3.* instances on EC2
Reporter: Josh Rosen

 SPARK-2764 (and some followup commits) simplified PySpark's worker process 
 structure by removing an intermediate pool of processes forked by daemon.py.  
 Previously, daemon.py forked a fixed-size pool of processes that shared a 
 socket and handled worker launch requests from Java.  After my patch, this 
 intermediate pool was removed and launch requests are handled directly in 
 daemon.py.
 Unfortunately, this seems to have increased PySpark task launch latency when 
 running on m3* class instances in EC2.  Most of this difference can be 
 attributed to m3 instances' more expensive fork() system calls.  I tried the 
 following microbenchmark on m3.xlarge and r3.xlarge instances:
 {code}
 import os
 for x in range(1000):
   if os.fork() == 0:
 exit()
 {code}
 On the r3.xlarge instance:
 {code}
 real  0m0.761s
 user  0m0.008s
 sys   0m0.144s
 {code}
 And on m3.xlarge:
 {code}
 real0m1.699s
 user0m0.012s
 sys 0m1.008s
 {code}
 I think this is due to HVM vs PVM EC2 instances using different 
 virtualization technologies with different fork costs.
 It may be the case that this performance difference only appears in certain 
 microbenchmarks and is masked by other performance improvements in PySpark, 
 such as improvements to large group-bys.  I'm in the process of re-running 
 spark-perf benchmarks on m3 instances in order to confirm whether this 
 impacts more realistic jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3375) spark on yarn container allocation issues


 [ 
https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-3375:
-
Summary: spark on yarn container allocation issues  (was: spark on yarn 
alpha container allocation issues)

 spark on yarn container allocation issues
 -

 Key: SPARK-3375
 URL: https://issues.apache.org/jira/browse/SPARK-3375
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Priority: Blocker

 It looks like if yarn doesn't get the containers immediately it stops asking 
 for them and the yarn application hangs with never getting any executors.  
 This was introduced by https://github.com/apache/spark/pull/2169



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3375) spark on yarn alpha container allocation issues


[ 
https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120361#comment-14120361
 ] 

Thomas Graves commented on SPARK-3375:
--

Ok so the real issue here is that we are sending the number of containers as 0 
after we send the original one of X.  on the yarn side this clears out the 
original request.  I would expect this to be affect 2.x also. I think the 
original 0.23 code kept just sending max-running and now we are sending 
max-pending-running.

For a ping we should just send empty asks.

 spark on yarn alpha container allocation issues
 ---

 Key: SPARK-3375
 URL: https://issues.apache.org/jira/browse/SPARK-3375
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Priority: Blocker

 It looks like if yarn doesn't get the containers immediately it stops asking 
 for them and the yarn application hangs with never getting any executors.  
 This was introduced by https://github.com/apache/spark/pull/2169



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0


 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3216:
-
Fix Version/s: (was: 1.0.2)

 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.3


 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. This does not actually affect any released distributions, and so 
 I did not set the affected/fix/target versions. I marked this a blocker 
 because this is completely broken, but it is technically not blocking 
 anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3216) Spark-shell is broken for branch-1.0


 [ 
https://issues.apache.org/jira/browse/SPARK-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3216:
-
Fix Version/s: 1.0.2

 Spark-shell is broken for branch-1.0
 

 Key: SPARK-3216
 URL: https://issues.apache.org/jira/browse/SPARK-3216
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.3


 This fails when EC2 tries to clone the most recent version of Spark from 
 branch-1.0. This does not actually affect any released distributions, and so 
 I did not set the affected/fix/target versions. I marked this a blocker 
 because this is completely broken, but it is technically not blocking 
 anything.
 This was caused by https://github.com/apache/spark/pull/1831, which broke 
 spark-shell. The follow-up fix in https://github.com/apache/spark/pull/1825 
 was only merged into branch-1.1 and master, but not branch-1.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread Evan Sparks (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120432#comment-14120432
 ] 

Evan Sparks commented on SPARK-3384:


I agree with Sean. Avoiding the costly penalty of object allocation overhead is 
important to avoid here. As far as I can tell, we are using reduceByKey in the 
prescribed way (see: 
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E)
 mutating the left input. I don't believe that spark needs this mutation to be 
thread-safe, because it executes the combine sequentially on all workers, and 
then reduces sequentially on the master, but I could be wrong.

 Potential thread unsafe Breeze vector addition in KMeans
 

 Key: SPARK-3384
 URL: https://issues.apache.org/jira/browse/SPARK-3384
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: RJ Nowling

 In the KMeans clustering implementation, the Breeze vectors are accumulated 
 using +=.  For example,
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L162
  This is potentially a thread unsafe operation.  (This is what I observed in 
 local testing.)  I suggest changing the += to + -- a new object will be 
 allocated but it will be thread safe since it won't write to an old location 
 accessed by multiple threads.
 Further testing is required to reproduce and verify.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-3384) Potential thread unsafe Breeze vector addition in KMeans

2014-09-03 Thread Evan Sparks (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120432#comment-14120432
]

Evan Sparks edited comment on SPARK-3384 at 9/3/14 8:54 PM:

I agree with Sean. Avoiding the costly penalty of object allocation overhead is
important here. As far as I can tell, we are using reduceByKey in the
prescribed way (see:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E)
mutating the left input. I don't believe that spark needs this mutation to be
thread-safe, because it executes the combine sequentially on all workers, and
then reduces sequentially on the master, but I could be wrong.

was (Author: sparks):
I agree with Sean. Avoiding the costly penalty of object allocation overhead is
important to avoid here. As far as I can tell, we are using reduceByKey in the
prescribed way (see:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3cecd3c09a-50f3-4683-a639-daddc4101...@gmail.com%3E)
mutating the left input. I don't believe that spark needs this mutation to be
thread-safe, because it executes the combine sequentially on all workers, and
then reduces sequentially on the master, but I could be wrong.

Potential thread unsafe Breeze vector addition in KMeans

Key: SPARK-3384
URL: https://issues.apache.org/jira/browse/SPARK-3384
Project: Spark
Issue Type: Bug
Components: MLlib
Reporter: RJ Nowling

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3388) Expose cluster applicationId, use it to serve history information

2014-09-03 Thread Marcelo Vanzin (JIRA)

Marcelo Vanzin created SPARK-3388:
-

 Summary: Expose cluster applicationId, use it to serve history 
information
 Key: SPARK-3388
 URL: https://issues.apache.org/jira/browse/SPARK-3388
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Marcelo Vanzin


This is a follow up to SPARK-2150.

Currently Spark uses the application-generated log directory name as the URL 
for the history server. That's not very user-friendly since there's no way for 
the user to figure out that information otherwise.

A more user-friendly approach would be to use the cluster-defined application 
id, when present. That also has the advantage of providing that information as 
metadata for the Spark app, so that someone can easily correlate information 
kept in Spark with that kept by the cluster manager.

Open PR: https://github.com/apache/spark/pull/1218



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-2419) Misc updates to streaming programming guide


[ 
https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120463#comment-14120463
 ] 

Apache Spark commented on SPARK-2419:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/2254

 Misc updates to streaming programming guide
 ---

 Key: SPARK-2419
 URL: https://issues.apache.org/jira/browse/SPARK-2419
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical

 This JIRA collects together a number of small issues that should be added to 
 the streaming programming guide
 - Receivers consume an executor slot and highlight the fact the # cores  # 
 receivers is necessary
 - Classes of spark-streaming-XYZ cannot be access from Spark Shell
 - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar 
 and its dependencies to be packaged with application JAR
 - Ordering and parallelism of the output operations
 - Receiver's should be serializable
 - Add more information on how socketStream: input stream = iterator function.
 - New Flume and Kinesis stuff.
 - Twitter4j version
 - Design pattern: creating connections to external sinks
 - Design pattern: multiple input streams



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-3263) PR #720 broke GraphGenerator.logNormal


 [ 
https://issues.apache.org/jira/browse/SPARK-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-3263.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 2168
[https://github.com/apache/spark/pull/2168]

 PR #720 broke GraphGenerator.logNormal
 --

 Key: SPARK-3263
 URL: https://issues.apache.org/jira/browse/SPARK-3263
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: RJ Nowling
 Fix For: 1.3.0


 PR #720 made multiple changes to GraphGenerator.logNormalGraph including:
 * Replacing the call to functions for generating random vertices and edges 
 with in-line implementations with different equations
 * Hard-coding of RNG seeds so that method now generates the same graph for a 
 given number of vertices, edges, mu, and sigma -- user is not able to 
 override seed or specify that seed should be randomly generated.
 * Backwards-incompatible change to logNormalGraph signature with introduction 
 of new required parameter.
 * Failed to update scala docs and programming guide for API changes
 I also see that PR #720 added a Synthetic Benchmark in the examples.
 Based on reading the Pregel paper, I believe the in-line functions are 
 incorrect.  I proposed to:
 * Removing the in-line calls
 * Adding a seed for deterministic behavior (when desired)
 * Keeping the number of partitions parameter.
 * Updating the synthetic benchmark example



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3373) Filtering operations should optionally rebuild routing tables


 [ 
https://issues.apache.org/jira/browse/SPARK-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3373:
--
Summary: Filtering operations should optionally rebuild routing tables  
(was: trim some useless informations of VertexRDD in some cases)

 Filtering operations should optionally rebuild routing tables
 -

 Key: SPARK-3373
 URL: https://issues.apache.org/jira/browse/SPARK-3373
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.0, 1.0.2
Reporter: uncleGen
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3373) Filtering operations should optionally rebuild routing tables


 [ 
https://issues.apache.org/jira/browse/SPARK-3373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3373:
--
Description: Graph operations that filter the edges (subgraph, mask, 
groupEdges) currently reuse the existing routing table to avoid the shuffle 
which would be required to build a new one. However, this may be inefficient 
when the filtering is highly selective. Vertices will be sent to more 
partitions than necessary, and the extra routing information may take up 
excessive space.

 Filtering operations should optionally rebuild routing tables
 -

 Key: SPARK-3373
 URL: https://issues.apache.org/jira/browse/SPARK-3373
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.0, 1.0.2
Reporter: uncleGen
Priority: Minor

 Graph operations that filter the edges (subgraph, mask, groupEdges) currently 
 reuse the existing routing table to avoid the shuffle which would be required 
 to build a new one. However, this may be inefficient when the filtering is 
 highly selective. Vertices will be sent to more partitions than necessary, 
 and the extra routing information may take up excessive space.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3206) Error in PageRank values


[ 
https://issues.apache.org/jira/browse/SPARK-3206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120525#comment-14120525
 ] 

Ankur Dave commented on SPARK-3206:
---

I think the `run` method is incorrect due to a bug in Pregel. I wrote a 
standalone (non-Pregel) version of PageRank that provides the functionality of 
`run` without using Pregel: 
https://github.com/ankurdave/spark/blob/low-level-PageRank/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L274

I'd like to run that and check the results against these ones when I get a 
chance.

 Error in PageRank values
 

 Key: SPARK-3206
 URL: https://issues.apache.org/jira/browse/SPARK-3206
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
 Environment: UNIX with Hadoop
Reporter: Peter Fontana

 I have found a small example where the PageRank values using run and 
 runUntilConvergence differ quite a bit.
 I am running the Pagerank module on the following graph:
 Edge Table:
 || Node1 || Node2 ||
 |1 | 2 |
 |1 |  3|
 |3 |  2|
 |3 |  4|
 |5 |  3|
 |6 |  7|
 |7 |  8|
 |8 |  9|
 |9 |  7|
 Node Table (note the extra node):
 || NodeID  || NodeName  ||
 |a |  1|
 |b |  2|
 |c |  3|
 |d |  4|
 |e |  5|
 |f |  6|
 |g |  7|
 |h |  8|
 |i |  9|
 |j.longaddress.com |  10|
 with a default resetProb of 0.15.
 When I compute the pageRank with runUntilConvergence, running 
 {{val ranks = PageRank.runUntilConvergence(graph,0.0001).vertices}}
 I get the ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,1.3299054047985106)
 (9,1.2381240056453071)
 (8,1.2803346052504254)
 (10,0.15)
 (5,0.15)
 (2,0.358781244)
 However, when I run page Rank with the run() method, running  
 {{val ranksI = PageRank.run(graph,100).vertices}} 
 I get the page ranks
 (4,0.295031247)
 (1,0.15)
 (6,0.15)
 (3,0.341249994)
 (7,0.999387662847)
 (9,0.999256447741)
 (8,0.999256447741)
 (10,0.15)
 (5,0.15)
 (2,0.295031247)
 These are quite different, leading me to suspect that one of the PageRank 
 methods is incorrect. I have examined the source, but I do not know what the 
 correct fix is, or which set of values is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Reopened] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.


 [ 
https://issues.apache.org/jira/browse/SPARK-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave reopened SPARK-3123:
---

Reopening so I can change the status from Closed to Resolved.

 override the setName function to set EdgeRDD's name manually just as 
 VertexRDD does.
 --

 Key: SPARK-3123
 URL: https://issues.apache.org/jira/browse/SPARK-3123
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.0, 1.0.2
Reporter: uncleGen
Priority: Minor
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.


 [ 
https://issues.apache.org/jira/browse/SPARK-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-3123.
---
Resolution: Fixed

Issue resolved by pull request 2033
https://github.com/apache/spark/pull/2033

 override the setName function to set EdgeRDD's name manually just as 
 VertexRDD does.
 --

 Key: SPARK-3123
 URL: https://issues.apache.org/jira/browse/SPARK-3123
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.0, 1.0.2
Reporter: uncleGen
Priority: Minor
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Issue Comment Deleted] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.


 [ 
https://issues.apache.org/jira/browse/SPARK-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-3123:
--
Comment: was deleted

(was: Reopening so I can change the status from Closed to Resolved.)

 override the setName function to set EdgeRDD's name manually just as 
 VertexRDD does.
 --

 Key: SPARK-3123
 URL: https://issues.apache.org/jira/browse/SPARK-3123
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.0.0, 1.0.2
Reporter: uncleGen
Priority: Minor
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-2845) Add timestamp to BlockManager events


 [ 
https://issues.apache.org/jira/browse/SPARK-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-2845.

  Resolution: Fixed
Target Version/s: 1.2.0

 Add timestamp to BlockManager events
 

 Key: SPARK-2845
 URL: https://issues.apache.org/jira/browse/SPARK-2845
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor

 BlockManager events in SparkListener.scala do not have a timestamp; while not 
 necessary for rendering the Spark UI, the extra information might be 
 interesting for other tools that use the event data.
 Note that the same applies to lots of other events; it would be nice, at some 
 point, to have all of them have a timestamp assigned at the point of creation 
 of the event, so that network traffic / event queueing in the driver do not 
 affect that information. But right now, I'm more interested in those 
 particular events. :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-2845) Add timestamp to BlockManager events


[ 
https://issues.apache.org/jira/browse/SPARK-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120533#comment-14120533
 ] 

Andrew Or commented on SPARK-2845:
--

Fixed by https://github.com/apache/spark/pull/654

 Add timestamp to BlockManager events
 

 Key: SPARK-2845
 URL: https://issues.apache.org/jira/browse/SPARK-2845
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor

 BlockManager events in SparkListener.scala do not have a timestamp; while not 
 necessary for rendering the Spark UI, the extra information might be 
 interesting for other tools that use the event data.
 Note that the same applies to lots of other events; it would be nice, at some 
 point, to have all of them have a timestamp assigned at the point of creation 
 of the event, so that network traffic / event queueing in the driver do not 
 affect that information. But right now, I'm more interested in those 
 particular events. :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-2845) Add timestamp to BlockManager events


 [ 
https://issues.apache.org/jira/browse/SPARK-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2845:
-
Affects Version/s: 1.1.0

 Add timestamp to BlockManager events
 

 Key: SPARK-2845
 URL: https://issues.apache.org/jira/browse/SPARK-2845
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.2.0


 BlockManager events in SparkListener.scala do not have a timestamp; while not 
 necessary for rendering the Spark UI, the extra information might be 
 interesting for other tools that use the event data.
 Note that the same applies to lots of other events; it would be nice, at some 
 point, to have all of them have a timestamp assigned at the point of creation 
 of the event, so that network traffic / event queueing in the driver do not 
 affect that information. But right now, I'm more interested in those 
 particular events. :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-2845) Add timestamp to BlockManager events


 [ 
https://issues.apache.org/jira/browse/SPARK-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2845:
-
Fix Version/s: 1.2.0

 Add timestamp to BlockManager events
 

 Key: SPARK-2845
 URL: https://issues.apache.org/jira/browse/SPARK-2845
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
 Fix For: 1.2.0


 BlockManager events in SparkListener.scala do not have a timestamp; while not 
 necessary for rendering the Spark UI, the extra information might be 
 interesting for other tools that use the event data.
 Note that the same applies to lots of other events; it would be nice, at some 
 point, to have all of them have a timestamp assigned at the point of creation 
 of the event, so that network traffic / event queueing in the driver do not 
 affect that information. But right now, I'm more interested in those 
 particular events. :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-3388) Expose cluster applicationId, use it to serve history information


 [ 
https://issues.apache.org/jira/browse/SPARK-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-3388.
--
  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.2.0

Fixed by https://github.com/apache/spark/pull/1218

 Expose cluster applicationId, use it to serve history information
 -

 Key: SPARK-3388
 URL: https://issues.apache.org/jira/browse/SPARK-3388
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.2.0


 This is a follow up to SPARK-2150.
 Currently Spark uses the application-generated log directory name as the URL 
 for the history server. That's not very user-friendly since there's no way 
 for the user to figure out that information otherwise.
 A more user-friendly approach would be to use the cluster-defined application 
 id, when present. That also has the advantage of providing that information 
 as metadata for the Spark app, so that someone can easily correlate 
 information kept in Spark with that kept by the cluster manager.
 Open PR: https://github.com/apache/spark/pull/1218



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3388) Expose cluster applicationId, use it to serve history information


 [ 
https://issues.apache.org/jira/browse/SPARK-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3388:
-
Affects Version/s: 1.1.0

 Expose cluster applicationId, use it to serve history information
 -

 Key: SPARK-3388
 URL: https://issues.apache.org/jira/browse/SPARK-3388
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.2.0


 This is a follow up to SPARK-2150.
 Currently Spark uses the application-generated log directory name as the URL 
 for the history server. That's not very user-friendly since there's no way 
 for the user to figure out that information otherwise.
 A more user-friendly approach would be to use the cluster-defined application 
 id, when present. That also has the advantage of providing that information 
 as metadata for the Spark app, so that someone can easily correlate 
 information kept in Spark with that kept by the cluster manager.
 Open PR: https://github.com/apache/spark/pull/1218



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3388) Expose cluster applicationId, use it to serve history information


 [ 
https://issues.apache.org/jira/browse/SPARK-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3388:
-
Fix Version/s: (was: 1.1.0)

 Expose cluster applicationId, use it to serve history information
 -

 Key: SPARK-3388
 URL: https://issues.apache.org/jira/browse/SPARK-3388
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.2.0


 This is a follow up to SPARK-2150.
 Currently Spark uses the application-generated log directory name as the URL 
 for the history server. That's not very user-friendly since there's no way 
 for the user to figure out that information otherwise.
 A more user-friendly approach would be to use the cluster-defined application 
 id, when present. That also has the advantage of providing that information 
 as metadata for the Spark app, so that someone can easily correlate 
 information kept in Spark with that kept by the cluster manager.
 Open PR: https://github.com/apache/spark/pull/1218



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-3388) Expose cluster applicationId, use it to serve history information


 [ 
https://issues.apache.org/jira/browse/SPARK-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3388:
-
Fix Version/s: 1.1.0

 Expose cluster applicationId, use it to serve history information
 -

 Key: SPARK-3388
 URL: https://issues.apache.org/jira/browse/SPARK-3388
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
 Fix For: 1.2.0


 This is a follow up to SPARK-2150.
 Currently Spark uses the application-generated log directory name as the URL 
 for the history server. That's not very user-friendly since there's no way 
 for the user to figure out that information otherwise.
 A more user-friendly approach would be to use the cluster-defined application 
 id, when present. That also has the advantage of providing that information 
 as metadata for the Spark app, so that someone can easily correlate 
 information kept in Spark with that kept by the cluster manager.
 Open PR: https://github.com/apache/spark/pull/1218



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark

2014-09-03 Thread Uri Laserson (JIRA)

Uri Laserson created SPARK-3389:
---

 Summary: Add converter class to make reading Parquet files easy 
with PySpark
 Key: SPARK-3389
 URL: https://issues.apache.org/jira/browse/SPARK-3389
 Project: Spark
  Issue Type: Improvement
Reporter: Uri Laserson


If a user wants to read Parquet data from PySpark, they currently must use 
SparkContext.newAPIHadoopFile.  If they do not provide a valueConverter, they 
will get JSON string that must be parsed.  Here I add a Converter 
implementation based on the one in the AvroConverters.scala file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark

2014-09-03 Thread Uri Laserson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120638#comment-14120638
 ] 

Uri Laserson commented on SPARK-3389:
-

https://github.com/apache/spark/pull/2256

 Add converter class to make reading Parquet files easy with PySpark
 ---

 Key: SPARK-3389
 URL: https://issues.apache.org/jira/browse/SPARK-3389
 Project: Spark
  Issue Type: Improvement
Reporter: Uri Laserson

 If a user wants to read Parquet data from PySpark, they currently must use 
 SparkContext.newAPIHadoopFile.  If they do not provide a valueConverter, they 
 will get JSON string that must be parsed.  Here I add a Converter 
 implementation based on the one in the AvroConverters.scala file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3389) Add converter class to make reading Parquet files easy with PySpark