[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError

2014-04-04 Thread Idan Zalzberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959735#comment-13959735
 ] 

Idan Zalzberg commented on SPARK-1394:
--

This seems to be related to the way the handle_sigchld method in daemon.py 
works.
In order to kill zombie processes, the worker calls os.waitpid on SIGCHLD. 
However, since Popen also eventually tries to do that, you get a closed 
handle.

Since platform.py is part of the Python standard library, I would guess we 
should find a solution in pyspark (i.e. change the way handle_sigchld works, 
or maybe limit the processes it waits on)
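To make the interaction concrete, here is a minimal, self-contained sketch (not the actual daemon.py code; all names are illustrative) of how a SIGCHLD handler that calls os.waitpid(-1, ...) can reap a child that subprocess.Popen expects to reap itself:

```python
import errno
import os
import signal
import subprocess
import sys
import time

reaped = []

def handle_sigchld(signum, frame):
    # Reap any exited child. waitpid(-1, ...) claims *whichever* child
    # exited -- including children that subprocess.Popen expects to
    # reap itself later via proc.wait().
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError:   # ECHILD: no children left to reap
            break
        if pid == 0:      # children exist but none have exited yet
            break
        reaped.append(pid)

signal.signal(signal.SIGCHLD, handle_sigchld)

proc = subprocess.Popen([sys.executable, "-c", "pass"])
time.sleep(1.0)  # give the child time to exit and the handler time to fire

# The handler has already reaped the child, so reaping it again fails
# with ECHILD -- the "closed handle" symptom seen on the worker.
try:
    os.waitpid(proc.pid, 0)
    stolen = False
except OSError as e:
    stolen = (e.errno == errno.ECHILD)
```

Limiting the handler to the daemon's own worker pids, rather than -1, would avoid stealing exit statuses from unrelated Popen children.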

 calling system.platform on worker raises IOError
 

 Key: SPARK-1394
 URL: https://issues.apache.org/jira/browse/SPARK-1394
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0
 Environment: Tested on Ubuntu and Linux, local and remote master, 
 python 2.7.*
Reporter: Idan Zalzberg
  Labels: pyspark

 A simple program that calls system.platform() on the worker fails most of the 
 time (it works some times but very rarely).
 This is critical since many libraries call that method (e.g. boto).
 Here is the trace of the attempt to call that method:
 $ /usr/local/spark/bin/pyspark
 Python 2.7.3 (default, Feb 27 2014, 20:00:17)
 [GCC 4.6.3] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback 
 address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
 14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
 another address
 14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
 14/04/02 18:18:38 INFO Remoting: Starting remoting
 14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@10.33.102.46:36640]
 14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
 14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140402181839-919f
 14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 
 MB.
 14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id 
 = ConnectionManagerId(10.33.102.46,43357)
 14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
 14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering 
 block manager 10.33.102.46:43357 with 294.6 MB RAM
 14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at 
 http://10.33.102.46:51803
 14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
 14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
 14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
 14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at 
 http://10.33.102.46:4040
 14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 Welcome to
       ____              __
      / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /__ / .__/\_,_/_/ /_/\_\   version 0.9.0
       /_/
 Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
 Spark context available as sc.
  >>> import platform
  >>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
 14/04/02 18:19:17 INFO SparkContext: Starting job: collect at <stdin>:1
 14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at <stdin>:1) with 1 
 output partitions (allowLocal=false)
 14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at 
 <stdin>:1)
 14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
 14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
 14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at 
 collect at <stdin>:1), which has no missing parents
 14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 
 (PythonRDD[1] at collect at <stdin>:1)
 14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 
 12 ms
 14/04/02 18:19:17 INFO Executor: Running task ID 0
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
 

[jira] [Commented] (SPARK-1413) Parquet messes up stdout and stdin when used in Spark REPL

2014-04-04 Thread witgo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959784#comment-13959784
 ] 

witgo commented on SPARK-1413:
--

Try [the PR 325|https://github.com/apache/spark/pull/325]

 Parquet messes up stdout and stdin when used in Spark REPL
 --

 Key: SPARK-1413
 URL: https://issues.apache.org/jira/browse/SPARK-1413
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Matei Zaharia
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.0.0


 I have a simple Parquet file in "foos.parquet", but after I type this code, 
 it freezes the shell, to the point where I can't read or write stuff:
 scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._
 qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826
 import qc._
 scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar")
 The job itself completes successfully, and "bar" contains the right text, but 
 I can no longer see commands I type in, or further log output.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-04 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960100#comment-13960100
 ] 

Shivaram Venkataraman commented on SPARK-1391:
--

Thanks for the patch. I will try this out in the next couple of days and get 
back.

 BlockManager cannot transfer blocks larger than 2G in size
 --

 Key: SPARK-1391
 URL: https://issues.apache.org/jira/browse/SPARK-1391
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Shuffle
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman
Assignee: Min Zhou
 Attachments: SPARK-1391.diff


 If a task tries to remotely access a cached RDD block, I get an exception 
 when the block size is > 2G. The exception is pasted below.
 Memory capacities are huge these days (> 60G), and many workflows depend on 
 having large blocks in memory, so it would be good to fix this bug.
 I don't know if the same thing happens on shuffles if one transfer (from 
 mapper to reducer) is > 2G.
 {noformat}
 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer 
 message
 java.lang.ArrayIndexOutOfBoundsException
 at 
 it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
 at 
 it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
 at 
 it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
 at 
 java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
 at 
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
 at 
 org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
 at 
 org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
 at 
 org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
 at 
 org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
 at 
 org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
 at 
 org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
 at 
 org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
 at 
 org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
 at 
 org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at 
 org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at 
 org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
 at 
 org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
 at 
 org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
 at 
 org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
 at 
 org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 {noformat}





[jira] [Resolved] (SPARK-1383) Spark-SQL: ParquetRelation improvements

2014-04-04 Thread Andre Schumacher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Schumacher resolved SPARK-1383.
-

Resolution: Fixed

Fixed by 
https://github.com/apache/spark/commit/fbebaedf26286ee8a75065822a3af1148351f828

 Spark-SQL: ParquetRelation improvements
 ---

 Key: SPARK-1383
 URL: https://issues.apache.org/jira/browse/SPARK-1383
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andre Schumacher
Assignee: Andre Schumacher

 Improve Spark-SQL's ParquetRelation as follows:
 - A ParquetRelation should be backed by a directory instead of individual 
 files, which simplifies importing data from other sources
 - InsertIntoParquetTable operation should support switching between 
 overwriting and appending (at least in HiveQL)
 - tests should use the new API
 - Parquet logging should be forwarded to Log4J
 - It should be possible to enable compression (default compression for 
 Parquet files: GZIP, as in parquet-mr)
 - OverwriteCatalog should support dropping of tables





[jira] [Resolved] (SPARK-1133) Add a new small files input for MLlib, which will return an RDD[(fileName, content)]

2014-04-04 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1133.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

 Add a new small files input for MLlib, which will return an RDD[(fileName, 
 content)]
 

 Key: SPARK-1133
 URL: https://issues.apache.org/jira/browse/SPARK-1133
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 1.0.0
Reporter: Xusen Yin
Assignee: Xusen Yin
Priority: Minor
  Labels: IO, MLlib, hadoop
 Fix For: 1.0.0


 As I am moving forward to write an LDA (Latent Dirichlet Allocation) 
 implementation for Spark MLlib, I find that a small-files input API is useful, 
 so I wrote a smallTextFiles() to support it.
 smallTextFiles() digests a directory of text files, then returns an 
 RDD\[(String, String)\]: the former String is the file name, while the latter 
 is the contents of the text file.
 smallTextFiles() can be used for local disk I/O or HDFS I/O, just like 
 textFiles() in SparkContext. In the scenario of LDA, there are 2 common uses:
 1. smallTextFiles() is used to preprocess local disk files, i.e. combine 
 those files into a huge one, then transfer it onto HDFS for further 
 processing, such as LDA clustering.
 2. It is also used to transfer the raw directory of small files onto HDFS 
 (though this is not recommended, because it will cost too many namenode 
 entries), then cluster it directly with LDA.
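For illustration, a local-disk-only sketch of the (fileName, content) behavior described above, written with plain Python I/O rather than the actual Spark/HDFS implementation (the function name simply mirrors the proposal):

```python
import os

def small_text_files(directory):
    """Read every file in a directory into (fileName, content) pairs,
    mirroring the smallTextFiles() behavior described above.
    Local-disk version only; the real API would also handle HDFS."""
    pairs = []
    for name in sorted(os.listdir(directory)):  # sorted for a stable order
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "r") as f:
                pairs.append((name, f.read()))
    return pairs
```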





[jira] [Commented] (SPARK-1366) The sql function should be consistent between different types of SQLContext

2014-04-04 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960292#comment-13960292
 ] 

Michael Armbrust commented on SPARK-1366:
-

https://github.com/apache/spark/pull/319

 The sql function should be consistent between different types of SQLContext
 ---

 Key: SPARK-1366
 URL: https://issues.apache.org/jira/browse/SPARK-1366
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.0.0


 Right now calling `context.sql` will cause things to be parsed with different 
 parsers, which is kinda confusing. Instead HiveContext should have a 
 specialized `hiveql` method that uses the HiveQL parser.
 Also need to update the documentation.





[jira] [Assigned] (SPARK-1414) Python API for SparkContext.wholeTextFiles

2014-04-04 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia reassigned SPARK-1414:


Assignee: Matei Zaharia

 Python API for SparkContext.wholeTextFiles
 --

 Key: SPARK-1414
 URL: https://issues.apache.org/jira/browse/SPARK-1414
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Matei Zaharia
Assignee: Matei Zaharia
 Fix For: 1.0.0








[jira] [Created] (SPARK-1416) Add support for SequenceFiles in PySpark

2014-04-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1416:


 Summary: Add support for SequenceFiles in PySpark
 Key: SPARK-1416
 URL: https://issues.apache.org/jira/browse/SPARK-1416
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia


Just covering the basic Hadoop Writable types (e.g. primitives, arrays of 
primitives, text) should still let people store data more efficiently.





[jira] [Assigned] (SPARK-1056) Header comment in Executor incorrectly implies it's not used for YARN

2014-04-04 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-1056:
-

Assignee: Sandy Ryza  (was: Sandy Pérez González)

 Header comment in Executor incorrectly implies it's not used for YARN
 -

 Key: SPARK-1056
 URL: https://issues.apache.org/jira/browse/SPARK-1056
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Sandy Pérez González
Assignee: Sandy Ryza
Priority: Trivial
 Fix For: 1.0.0


 {code}
 /**
  * Spark executor used with Mesos and the standalone scheduler.
  */
 {code}





[jira] [Assigned] (SPARK-1033) Ask for cores in Yarn container requests

2014-04-04 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-1033:
-

Assignee: Sandy Ryza  (was: Sandy Pérez González)

 Ask for cores in Yarn container requests 
 -

 Key: SPARK-1033
 URL: https://issues.apache.org/jira/browse/SPARK-1033
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza
 Fix For: 1.0.0


 Yarn 2.2 has support for requesting cores in addition to memory.  Spark 
 against Yarn 2.2 should include cores in its resource requests in the same 
 way it includes memory.





[jira] [Assigned] (SPARK-1211) In ApplicationMaster, set spark.master system property to yarn-cluster

2014-04-04 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-1211:
-

Assignee: Sandy Ryza  (was: Sandy Pérez González)

 In ApplicationMaster, set spark.master system property to yarn-cluster
 

 Key: SPARK-1211
 URL: https://issues.apache.org/jira/browse/SPARK-1211
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza

 This would make it so that users don't need to pass it in to their SparkConf. 
  It won't break anything for apps that already pass it in.





[jira] [Assigned] (SPARK-1197) Rename yarn-standalone and fix up docs for running on YARN

2014-04-04 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-1197:
-

Assignee: Sandy Ryza  (was: Sandy Pérez González)

 Rename yarn-standalone and fix up docs for running on YARN
 --

 Key: SPARK-1197
 URL: https://issues.apache.org/jira/browse/SPARK-1197
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza
 Fix For: 1.0.0


 yarn-standalone is a confusing name because "standalone" there means something 
 different than it does for the Spark standalone cluster manager.  It 
 would also be nice to fix up some typos in the YARN docs and add a section on 
 how to view container logs.





[jira] [Assigned] (SPARK-1417) Spark on Yarn - spark UI link from resourcemanager is broken

2014-04-04 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-1417:


Assignee: Thomas Graves

 Spark on Yarn - spark UI link from resourcemanager is broken
 

 Key: SPARK-1417
 URL: https://issues.apache.org/jira/browse/SPARK-1417
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Blocker

 When running spark on yarn in yarn-cluster mode, spark registers a url with 
 the Yarn ResourceManager to point to the spark UI.  This link is now broken. 
 The link should be something like <resourcemanager>/proxy/<applicationId>, 
 but instead it's coming back as <resourcemanager>/<host of am>:<port> 





[jira] [Commented] (SPARK-1399) Reason for Stage Failure should be shown in UI

2014-04-04 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960790#comment-13960790
 ] 

Kay Ousterhout commented on SPARK-1399:
---

FYI this outstanding pull request changes this behavior: 
https://github.com/apache/spark/pull/309, so it probably doesn't make sense to 
work on this until that gets resolved.

 Reason for Stage Failure should be shown in UI
 --

 Key: SPARK-1399
 URL: https://issues.apache.org/jira/browse/SPARK-1399
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Kay Ousterhout
Assignee: Nan Zhu

 Right now, we don't show why a stage failed in the UI.  We have this 
 information, and it would be useful for users to see (e.g., to see that a 
 stage was killed because the job was cancelled).





[jira] [Created] (SPARK-1419) Apache parent POM to version 14

2014-04-04 Thread Mark Hamstra (JIRA)
Mark Hamstra created SPARK-1419:
---

 Summary: Apache parent POM to version 14
 Key: SPARK-1419
 URL: https://issues.apache.org/jira/browse/SPARK-1419
 Project: Spark
  Issue Type: Bug
  Components: Build, Deploy
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra


Latest version of the Apache parent POM includes several improvements and 
bugfixes, including to the release plugin: 
http://svn.apache.org/viewvc/maven/pom/tags/apache-14/pom.xml?r1=HEAD&r2=1434717&diff_format=h






[jira] [Resolved] (SPARK-1198) Allow pipes tasks to run in different sub-directories

2014-04-04 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1198.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

 Allow pipes tasks to run in different sub-directories
 -

 Key: SPARK-1198
 URL: https://issues.apache.org/jira/browse/SPARK-1198
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Thomas Graves
Assignee: Thomas Graves
 Fix For: 1.0.0


 Currently when a task runs, its working directory is the same as all the 
 other tasks running on that Worker.  If the tasks happen to output files to 
 that working directory with the same name, collisions happen.
 We should add an option to allow the tasks to run in separate sub-directories 
 to avoid those conflicts. 
 I should clarify that the specific concern is when running the pipes command.
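A sketch of the per-task sub-directory idea described above (illustrative names, not Spark's actual pipe implementation): each task runs its external command in its own working directory, so identically-named output files no longer collide.

```python
import os
import subprocess

def run_in_task_dir(task_id, command, base_dir):
    """Run an external command with its working directory set to a
    per-task sub-directory, so tasks that write identically-named
    files do not clobber each other."""
    task_dir = os.path.join(base_dir, "task_%05d" % task_id)
    os.makedirs(task_dir, exist_ok=True)
    # cwd isolates the command's relative-path output in task_dir
    return subprocess.run(command, cwd=task_dir)
```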





[jira] [Assigned] (SPARK-1415) Add a minSplits parameter to wholeTextFiles

2014-04-04 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin reassigned SPARK-1415:


Assignee: Xusen Yin

 Add a minSplits parameter to wholeTextFiles
 ---

 Key: SPARK-1415
 URL: https://issues.apache.org/jira/browse/SPARK-1415
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Xusen Yin
  Labels: Starter

 This probably requires adding one to newAPIHadoopFile too.





[jira] [Assigned] (SPARK-1216) Add a OneHotEncoder for handling categorical features

2014-04-04 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-1216:
-

Assignee: Sandy Ryza  (was: Sandy Pérez González)

 Add a OneHotEncoder for handling categorical features
 -

 Key: SPARK-1216
 URL: https://issues.apache.org/jira/browse/SPARK-1216
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza

 It would be nice to add something to MLlib to make it easy to do one-of-K 
 encoding of categorical features.
 Something like:
 http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
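For illustration, a minimal pure-Python sketch of one-of-K encoding (the function name and list-based shape are hypothetical; an MLlib version would presumably operate on RDDs of feature vectors):

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 indicator vector of length K,
    where K is the number of distinct categories."""
    categories = sorted(set(values))  # stable ordering of the K categories
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1  # exactly one "hot" position per value
        vectors.append(vec)
    return categories, vectors

cats, encoded = one_hot_encode(["red", "green", "red", "blue"])
```

Sorting the categories keeps the column order deterministic across runs, which matters if the encoding is computed on different partitions.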





[jira] [Commented] (SPARK-1415) Add a minSplits parameter to wholeTextFiles

2014-04-04 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960908#comment-13960908
 ] 

Xusen Yin commented on SPARK-1415:
--

Hi Matei, I just looked around in those Hadoop APIs. I find that the new Hadoop 
API deprecates minSplits; instead, it prefers minSplitSize and 
maxSplitSize to control the split. minSplits is negatively correlated with 
maxSplitSize, so I think we have 2 ways to fix the issue:

1. We just provide a new API with maxSplitSize, say, wholeTextFiles(path: 
String, maxSplitSize: Long);

2. We write a delegation to compute the maxSplitSize from minSplits (easy to 
write, taking the old Hadoop API as an example), and provide the API 
wholeTextFile(path: String, minSplits: Int);

I also think we could provide the two APIs simultaneously. What do you think?
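A sketch of the delegation in option 2, assuming the old Hadoop API's goal-size arithmetic (total size divided by the requested number of splits); the function name and exact rounding here are illustrative, not Spark's actual code:

```python
def max_split_size(total_size, min_splits):
    """Derive a maximum split size in bytes from a desired minimum
    number of splits: asking for more splits caps each split at
    fewer bytes, which is the negative correlation noted above."""
    return max(total_size // max(min_splits, 1), 1)

# 1 GB of input with at least 8 splits caps each split at 128 MB
example = max_split_size(1 << 30, 8)
```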

 Add a minSplits parameter to wholeTextFiles
 ---

 Key: SPARK-1415
 URL: https://issues.apache.org/jira/browse/SPARK-1415
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Xusen Yin
  Labels: Starter

 This probably requires adding one to newAPIHadoopFile too.





[jira] [Resolved] (SPARK-1419) Apache parent POM to version 14

2014-04-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1419.


   Resolution: Fixed
Fix Version/s: 1.0.0

 Apache parent POM to version 14
 ---

 Key: SPARK-1419
 URL: https://issues.apache.org/jira/browse/SPARK-1419
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Build, Deploy
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra
 Fix For: 1.0.0


 Latest version of the Apache parent POM includes several improvements and 
 bugfixes, including to the release plugin: 
 http://svn.apache.org/viewvc/maven/pom/tags/apache-14/pom.xml?r1=HEAD&r2=1434717&diff_format=h





[jira] [Commented] (SPARK-1402) 3 more compression algorithms for in-memory columnar storage

2014-04-04 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13960937#comment-13960937
 ] 

Cheng Lian commented on SPARK-1402:
---

Corresponding PR: https://github.com/apache/spark/pull/330

 3 more compression algorithms for in-memory columnar storage
 

 Key: SPARK-1402
 URL: https://issues.apache.org/jira/browse/SPARK-1402
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker
  Labels: compression
 Fix For: 1.0.0


 This is a followup of SPARK-1373: Compression for In-Memory Columnar storage
 3 more compression algorithms for in-memory columnar storage should be 
 implemented:
 * BooleanBitSet
 * IntDelta
 * LongDelta





[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7

2014-04-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-922:
--

Issue Type: Task  (was: Improvement)

 Update Spark AMI to Python 2.7
 --

 Key: SPARK-922
 URL: https://issues.apache.org/jira/browse/SPARK-922
 Project: Spark
  Issue Type: Task
  Components: EC2, PySpark
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Josh Rosen
Priority: Blocker
 Fix For: 1.0.0


 Many Python libraries only support Python 2.7+, so we should make Python 2.7 
 the default Python on the Spark AMIs.





[jira] [Resolved] (SPARK-1305) Support persisting RDD's directly to Tachyon

2014-04-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1305.


Resolution: Fixed

 Support persisting RDD's directly to Tachyon
 

 Key: SPARK-1305
 URL: https://issues.apache.org/jira/browse/SPARK-1305
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Reporter: Patrick Wendell
Assignee: Haoyuan Li
Priority: Blocker
 Fix For: 1.0.0


 This is already an ongoing pull request - in a nutshell we want to support 
 Tachyon as a storage level in Spark.


