[jira] [Commented] (SPARK-1394) calling system.platform on worker raises IOError
[ https://issues.apache.org/jira/browse/SPARK-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959735#comment-13959735 ]

Idan Zalzberg commented on SPARK-1394:
--------------------------------------

This seems to be related to the way the handle_sigchld method in daemon.py works. In order to reap zombie processes, the worker calls os.waitpid on SIGCHLD. However, since Popen also eventually tries to do that, you end up with a closed handle. Since platform.py is part of the Python standard library rather than Spark code, I would guess we should find a solution in PySpark (i.e. change the way handle_sigchld works, or maybe limit the processes it waits on).

calling system.platform on worker raises IOError
------------------------------------------------

Key: SPARK-1394
URL: https://issues.apache.org/jira/browse/SPARK-1394
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 0.9.0
Environment: Tested on Ubuntu and Linux, local and remote master, python 2.7.*
Reporter: Idan Zalzberg
Labels: pyspark

A simple program that calls platform.system() on the worker fails most of the time (it works sometimes, but very rarely). This is critical since many libraries call that method (e.g. boto). Here is the trace of an attempt to call that method:

{noformat}
$ /usr/local/spark/bin/pyspark
Python 2.7.3 (default, Feb 27 2014, 20:00:17)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
14/04/02 18:18:37 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/04/02 18:18:37 WARN Utils: Your hostname, qlika-dev resolves to a loopback address: 127.0.1.1; using 10.33.102.46 instead (on interface eth1)
14/04/02 18:18:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/04/02 18:18:38 INFO Slf4jLogger: Slf4jLogger started
14/04/02 18:18:38 INFO Remoting: Starting remoting
14/04/02 18:18:39 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@10.33.102.46:36640]
14/04/02 18:18:39 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@10.33.102.46:36640]
14/04/02 18:18:39 INFO SparkEnv: Registering BlockManagerMaster
14/04/02 18:18:39 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140402181839-919f
14/04/02 18:18:39 INFO MemoryStore: MemoryStore started with capacity 294.6 MB.
14/04/02 18:18:39 INFO ConnectionManager: Bound socket to port 43357 with id = ConnectionManagerId(10.33.102.46,43357)
14/04/02 18:18:39 INFO BlockManagerMaster: Trying to register BlockManager
14/04/02 18:18:39 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 10.33.102.46:43357 with 294.6 MB RAM
14/04/02 18:18:39 INFO BlockManagerMaster: Registered BlockManager
14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
14/04/02 18:18:39 INFO HttpBroadcast: Broadcast server started at http://10.33.102.46:51803
14/04/02 18:18:39 INFO SparkEnv: Registering MapOutputTracker
14/04/02 18:18:39 INFO HttpFileServer: HTTP File server directory is /tmp/spark-9b38acb0-7b01-4463-b0a6-602bfed05a2b
14/04/02 18:18:39 INFO HttpServer: Starting HTTP Server
14/04/02 18:18:40 INFO SparkUI: Started Spark Web UI at http://10.33.102.46:4040
14/04/02 18:18:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 0.9.0
      /_/

Using Python version 2.7.3 (default, Feb 27 2014 20:00:17)
Spark context available as sc.
>>> import platform
>>> sc.parallelize([1]).map(lambda x : platform.system()).collect()
14/04/02 18:19:17 INFO SparkContext: Starting job: collect at <stdin>:1
14/04/02 18:19:17 INFO DAGScheduler: Got job 0 (collect at <stdin>:1) with 1 output partitions (allowLocal=false)
14/04/02 18:19:17 INFO DAGScheduler: Final stage: Stage 0 (collect at <stdin>:1)
14/04/02 18:19:17 INFO DAGScheduler: Parents of final stage: List()
14/04/02 18:19:17 INFO DAGScheduler: Missing parents: List()
14/04/02 18:19:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at collect at <stdin>:1), which has no missing parents
14/04/02 18:19:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (PythonRDD[1] at collect at <stdin>:1)
14/04/02 18:19:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/04/02 18:19:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/04/02 18:19:17 INFO TaskSetManager: Serialized task 0.0:0 as 2152 bytes in 12 ms
14/04/02 18:19:17 INFO Executor: Running task ID 0
PySpark worker failed with exception:
Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/worker.py", line 77, in main
{noformat}
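The double-reap the comment describes can be shown without Spark: once one piece of code (standing in here for handle_sigchld in daemon.py) has called os.waitpid on a child, any later wait on the same pid — as Popen's internal cleanup does — fails with ECHILD. A minimal sketch, assuming a POSIX system; this is an illustration of the race's mechanism, not PySpark's actual code:

```python
import errno
import os

pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits immediately

# First reap succeeds -- this is the role handle_sigchld plays in daemon.py.
os.waitpid(pid, 0)

# Second reap fails -- this is what Popen's own wait hits after the signal
# handler has already collected the child.
try:
    os.waitpid(pid, 0)
except OSError as e:
    assert e.errno == errno.ECHILD
    print("second waitpid failed with ECHILD")
```

In the real worker the first reap happens asynchronously inside the SIGCHLD handler, which is why the failure is intermittent rather than deterministic.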
[jira] [Commented] (SPARK-1413) Parquet messes up stdout and stdin when used in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959784#comment-13959784 ]

witgo commented on SPARK-1413:
------------------------------

Try [the PR 325|https://github.com/apache/spark/pull/325]

Parquet messes up stdout and stdin when used in Spark REPL
----------------------------------------------------------

Key: SPARK-1413
URL: https://issues.apache.org/jira/browse/SPARK-1413
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Matei Zaharia
Assignee: Michael Armbrust
Priority: Critical
Fix For: 1.0.0

I have a simple Parquet file in foos.parquet, but after I type this code, it freezes the shell, to the point where I can't read or write stuff:

{code}
scala> val qc = new org.apache.spark.sql.SQLContext(sc); import qc._
qc: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c0c8826
import qc._

scala> qc.parquetFile("foos.parquet").saveAsTextFile("bar")
{code}

The job itself completes successfully, and bar contains the right text, but I can no longer see commands I type in, or further log output.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960100#comment-13960100 ]

Shivaram Venkataraman commented on SPARK-1391:
----------------------------------------------

Thanks for the patch. I will try this out in the next couple of days and get back.

BlockManager cannot transfer blocks larger than 2G in size
----------------------------------------------------------

Key: SPARK-1391
URL: https://issues.apache.org/jira/browse/SPARK-1391
Project: Spark
Issue Type: Bug
Components: Block Manager, Shuffle
Affects Versions: 1.0.0
Reporter: Shivaram Venkataraman
Assignee: Min Zhou
Attachments: SPARK-1391.diff

If a task tries to remotely access a cached RDD block, I get an exception when the block size is > 2G. The exception is pasted below. Memory capacities are huge these days (> 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug. I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is > 2G.

{noformat}
14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message
java.lang.ArrayIndexOutOfBoundsException
	at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
	at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
	at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
	at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
	at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
	at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
	at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
	at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
	at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
	at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
	at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
	at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
	at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
	at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
	at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
	at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
	at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
	at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
	at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
	at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
	at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
{noformat}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
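The stack trace bottoms out in FastByteArrayOutputStream, which, like every JVM byte array and ByteBuffer, is indexed by a signed 32-bit int, so a single block of 2 GiB or more cannot be addressed and sizes at the limit wrap negative. A small illustration of the wraparound (the as_int32 helper is illustrative, mimicking JVM int arithmetic, not Spark code):

```python
def as_int32(n):
    """Reduce n to a signed 32-bit value, mimicking JVM int arithmetic."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

TWO_GIB = 2 * 1024**3
print(as_int32(TWO_GIB - 1))  # 2147483647: the largest addressable offset
print(as_int32(TWO_GIB))      # -2147483648: a 2 GiB length wraps negative
```

A negative length used as an array index or capacity is exactly the kind of value that surfaces as an ArrayIndexOutOfBoundsException deep inside the buffering code.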
[jira] [Resolved] (SPARK-1383) Spark-SQL: ParquetRelation improvements
[ https://issues.apache.org/jira/browse/SPARK-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andre Schumacher resolved SPARK-1383.
-------------------------------------
Resolution: Fixed

Fixed by https://github.com/apache/spark/commit/fbebaedf26286ee8a75065822a3af1148351f828

Spark-SQL: ParquetRelation improvements
---------------------------------------

Key: SPARK-1383
URL: https://issues.apache.org/jira/browse/SPARK-1383
Project: Spark
Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andre Schumacher
Assignee: Andre Schumacher

Improve Spark-SQL's ParquetRelation as follows:
- Instead of files, a ParquetRelation should be backed by a directory, which simplifies importing data from other sources
- The InsertIntoParquetTable operation should support switching between overwriting and appending (at least in HiveQL)
- Tests should use the new API
- Parquet logging should be forwarded to Log4J
- It should be possible to enable compression (default compression for Parquet files: GZIP, as in parquet-mr)
- OverwriteCatalog should support dropping of tables

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Resolved] (SPARK-1133) Add a new small files input for MLlib, which will return an RDD[(fileName, content)]
[ https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia resolved SPARK-1133.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0

Add a new small files input for MLlib, which will return an RDD[(fileName, content)]
------------------------------------------------------------------------------------

Key: SPARK-1133
URL: https://issues.apache.org/jira/browse/SPARK-1133
Project: Spark
Issue Type: Improvement
Components: Input/Output
Affects Versions: 1.0.0
Reporter: Xusen Yin
Assignee: Xusen Yin
Priority: Minor
Labels: IO, MLLib, hadoop
Fix For: 1.0.0

As I am moving forward to write an LDA (Latent Dirichlet Allocation) implementation for Spark MLlib, I find that a small files input API is useful, so I wrote smallTextFiles() to support it. smallTextFiles() digests a directory of text files, then returns an RDD[(String, String)]: the former String is the file name, while the latter is the contents of the text file. smallTextFiles() can be used for local disk I/O or HDFS I/O, just like textFiles() in SparkContext. In the scenario of LDA, there are two common uses:
1. smallTextFiles() is used to preprocess local disk files, i.e. combine those files into a huge one, then transfer it onto HDFS for further processing, such as LDA clustering.
2. It is also used to transfer the raw directory of small files onto HDFS (though this is not recommended, because it will cost too many namenode entries), then cluster it directly with LDA.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
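The result shape the issue describes — one (fileName, contents) pair per small file — can be sketched locally without Spark. The helper below is purely illustrative (it is not the smallTextFiles() implementation, and it reads a local directory rather than HDFS), but it produces the same pairs the RDD[(String, String)] would hold:

```python
import os

def small_text_files(path):
    """Read every regular file under `path` into (file_name, contents)
    pairs, mirroring the RDD[(String, String)] shape described above."""
    pairs = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            with open(full) as f:
                pairs.append((name, f.read()))
    return pairs
```

For LDA preprocessing, each pair would then feed a (document id, document text) record.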
[jira] [Commented] (SPARK-1366) The sql function should be consistent between different types of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960292#comment-13960292 ]

Michael Armbrust commented on SPARK-1366:
-----------------------------------------

https://github.com/apache/spark/pull/319

The sql function should be consistent between different types of SQLContext
---------------------------------------------------------------------------

Key: SPARK-1366
URL: https://issues.apache.org/jira/browse/SPARK-1366
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
Fix For: 1.0.0

Right now calling `context.sql` will cause things to be parsed with different parsers, which is kinda confusing. Instead HiveContext should have a specialized `hiveql` method that uses the HiveQL parser. Also need to update the documentation.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Assigned] (SPARK-1414) Python API for SparkContext.wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia reassigned SPARK-1414:
------------------------------------
Assignee: Matei Zaharia

Python API for SparkContext.wholeTextFiles
------------------------------------------

Key: SPARK-1414
URL: https://issues.apache.org/jira/browse/SPARK-1414
Project: Spark
Issue Type: Bug
Components: PySpark
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Fix For: 1.0.0

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Created] (SPARK-1416) Add support for SequenceFiles in PySpark
Matei Zaharia created SPARK-1416:
------------------------------------

Summary: Add support for SequenceFiles in PySpark
Key: SPARK-1416
URL: https://issues.apache.org/jira/browse/SPARK-1416
Project: Spark
Issue Type: Improvement
Reporter: Matei Zaharia

Just covering the basic Hadoop Writable types (e.g. primitives, arrays of primitives, text) should still let people store data more efficiently.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Assigned] (SPARK-1056) Header comment in Executor incorrectly implies it's not used for YARN
[ https://issues.apache.org/jira/browse/SPARK-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza reassigned SPARK-1056:
---------------------------------
Assignee: Sandy Ryza (was: Sandy Pérez González)

Header comment in Executor incorrectly implies it's not used for YARN
---------------------------------------------------------------------

Key: SPARK-1056
URL: https://issues.apache.org/jira/browse/SPARK-1056
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Sandy Pérez González
Assignee: Sandy Ryza
Priority: Trivial
Fix For: 1.0.0

{code}
/**
 * Spark executor used with Mesos and the standalone scheduler.
 */
{code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Assigned] (SPARK-1033) Ask for cores in Yarn container requests
[ https://issues.apache.org/jira/browse/SPARK-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza reassigned SPARK-1033:
---------------------------------
Assignee: Sandy Ryza (was: Sandy Pérez González)

Ask for cores in Yarn container requests
----------------------------------------

Key: SPARK-1033
URL: https://issues.apache.org/jira/browse/SPARK-1033
Project: Spark
Issue Type: Improvement
Components: YARN
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza
Fix For: 1.0.0

Yarn 2.2 has support for requesting cores in addition to memory. Spark against Yarn 2.2 should include cores in its resource requests in the same way it includes memory.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Assigned] (SPARK-1211) In ApplicationMaster, set spark.master system property to yarn-cluster
[ https://issues.apache.org/jira/browse/SPARK-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza reassigned SPARK-1211:
---------------------------------
Assignee: Sandy Ryza (was: Sandy Pérez González)

In ApplicationMaster, set spark.master system property to yarn-cluster
----------------------------------------------------------------------

Key: SPARK-1211
URL: https://issues.apache.org/jira/browse/SPARK-1211
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza

This would make it so that users don't need to pass it in to their SparkConf. It won't break anything for apps that already pass it in.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Assigned] (SPARK-1197) Rename yarn-standalone and fix up docs for running on YARN
[ https://issues.apache.org/jira/browse/SPARK-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza reassigned SPARK-1197:
---------------------------------
Assignee: Sandy Ryza (was: Sandy Pérez González)

Rename yarn-standalone and fix up docs for running on YARN
----------------------------------------------------------

Key: SPARK-1197
URL: https://issues.apache.org/jira/browse/SPARK-1197
Project: Spark
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza
Fix For: 1.0.0

yarn-standalone is a confusing name because "standalone" here means something different from the Spark standalone cluster manager. It would also be nice to fix up some typos in the YARN docs and add a section on how to view container logs.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Assigned] (SPARK-1417) Spark on Yarn - spark UI link from resourcemanager is broken
[ https://issues.apache.org/jira/browse/SPARK-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned SPARK-1417:
------------------------------------
Assignee: Thomas Graves

Spark on Yarn - spark UI link from resourcemanager is broken
------------------------------------------------------------

Key: SPARK-1417
URL: https://issues.apache.org/jira/browse/SPARK-1417
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Blocker

When running Spark on YARN in yarn-cluster mode, Spark registers a URL with the YARN ResourceManager to point to the Spark UI. This link is now broken. The link should be something like <resourcemanager>/proxy/<applicationId>; instead it's coming back as <resourcemanager>/<host of am>:<port>.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (SPARK-1399) Reason for Stage Failure should be shown in UI
[ https://issues.apache.org/jira/browse/SPARK-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960790#comment-13960790 ]

Kay Ousterhout commented on SPARK-1399:
---------------------------------------

FYI, this outstanding pull request changes this behavior: https://github.com/apache/spark/pull/309, so it probably doesn't make sense to work on this until that gets resolved.

Reason for Stage Failure should be shown in UI
----------------------------------------------

Key: SPARK-1399
URL: https://issues.apache.org/jira/browse/SPARK-1399
Project: Spark
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Kay Ousterhout
Assignee: Nan Zhu

Right now, we don't show why a stage failed in the UI. We have this information, and it would be useful for users to see (e.g., to see that a stage was killed because the job was cancelled).

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Created] (SPARK-1419) Apache parent POM to version 14
Mark Hamstra created SPARK-1419:
-----------------------------------

Summary: Apache parent POM to version 14
Key: SPARK-1419
URL: https://issues.apache.org/jira/browse/SPARK-1419
Project: Spark
Issue Type: Bug
Components: Build, Deploy
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra

The latest version of the Apache parent POM includes several improvements and bugfixes, including to the release plugin: http://svn.apache.org/viewvc/maven/pom/tags/apache-14/pom.xml?r1=HEAD&r2=1434717&diff_format=h

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Resolved] (SPARK-1198) Allow pipes tasks to run in different sub-directories
[ https://issues.apache.org/jira/browse/SPARK-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia resolved SPARK-1198.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0

Allow pipes tasks to run in different sub-directories
-----------------------------------------------------

Key: SPARK-1198
URL: https://issues.apache.org/jira/browse/SPARK-1198
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 0.9.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Fix For: 1.0.0

Currently when a task runs, its working directory is the same as that of all the other tasks running on that Worker. If the tasks happen to output files with the same name to that working directory, collisions happen. We should add an option to allow the tasks to run in separate sub-directories to avoid those conflicts. I should clarify that the specific concern is when running the pipes command.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Assigned] (SPARK-1415) Add a minSplits parameter to wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xusen Yin reassigned SPARK-1415:
--------------------------------
Assignee: Xusen Yin

Add a minSplits parameter to wholeTextFiles
-------------------------------------------

Key: SPARK-1415
URL: https://issues.apache.org/jira/browse/SPARK-1415
Project: Spark
Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Xusen Yin
Labels: Starter

This probably requires adding one to newAPIHadoopFile too.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Assigned] (SPARK-1216) Add a OneHotEncoder for handling categorical features
[ https://issues.apache.org/jira/browse/SPARK-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza reassigned SPARK-1216:
---------------------------------
Assignee: Sandy Ryza (was: Sandy Pérez González)

Add a OneHotEncoder for handling categorical features
-----------------------------------------------------

Key: SPARK-1216
URL: https://issues.apache.org/jira/browse/SPARK-1216
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 0.9.0
Reporter: Sandy Pérez González
Assignee: Sandy Ryza

It would be nice to add something to MLlib to make it easy to do one-of-K encoding of categorical features. Something like: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

--
This message was sent by Atlassian JIRA
(v6.2#6252)
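One-of-K encoding itself is small enough to sketch: each distinct category gets a slot, and each value becomes a unit vector with a 1 in its category's slot. The function below is illustrative only — it mirrors the idea behind scikit-learn's OneHotEncoder, not any existing MLlib API:

```python
def one_hot(values):
    """One-of-K encode a column of categorical values.

    Returns (vectors, categories): one vector per input value, each with
    a 1.0 in the slot of that value's category and 0.0 elsewhere.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    width = len(categories)
    vectors = [[1.0 if index[v] == i else 0.0 for i in range(width)]
               for v in values]
    return vectors, categories
```

For example, one_hot(["cat", "dog", "cat"]) yields the vectors [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]] with categories ["cat", "dog"].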
[jira] [Commented] (SPARK-1415) Add a minSplits parameter to wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960908#comment-13960908 ]

Xusen Yin commented on SPARK-1415:
----------------------------------

Hi Matei, I just looked around in those Hadoop APIs. I find that the new Hadoop API deprecates minSplit; instead, it prefers minSplitSize and maxSplitSize to control the splits. minSplit is negatively correlated with maxSplitSize, so I think we have two ways to fix the issue:
1. We just provide a new API with maxSplitSize, say, wholeTextFiles(path: String, maxSplitSize: Long);
2. We write a delegation to compute the maxSplitSize from minSplit (easy to write, taking the old Hadoop API as an example), and provide the API wholeTextFile(path: String, minSplit: Int).
I also think we could provide the two APIs simultaneously. What do you think?

Add a minSplits parameter to wholeTextFiles
-------------------------------------------

Key: SPARK-1415
URL: https://issues.apache.org/jira/browse/SPARK-1415
Project: Spark
Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Xusen Yin
Labels: Starter

This probably requires adding one to newAPIHadoopFile too.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
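The delegation in option 2 amounts to one line of arithmetic: capping each split at totalSize / minSplit guarantees at least minSplit splits, which is the negative correlation the comment mentions. A hedged sketch of that computation (the function name and signature are illustrative, not the eventual Spark API):

```python
def max_split_size(total_size, min_splits):
    """Derive the maxSplitSize the new Hadoop API expects from the
    minSplits count the old API took: capping each split at
    total_size // min_splits yields at least min_splits splits.
    Guards against zero to keep the cap a positive byte count."""
    return max(1, total_size // max(1, min_splits))
```

For a 1000-byte input and minSplits = 4 this gives a 250-byte cap, so the input cannot be covered by fewer than four splits.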
[jira] [Resolved] (SPARK-1419) Apache parent POM to version 14
[ https://issues.apache.org/jira/browse/SPARK-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-1419.
------------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0

Apache parent POM to version 14
-------------------------------

Key: SPARK-1419
URL: https://issues.apache.org/jira/browse/SPARK-1419
Project: Spark
Issue Type: Dependency upgrade
Components: Build, Deploy
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra
Fix For: 1.0.0

The latest version of the Apache parent POM includes several improvements and bugfixes, including to the release plugin: http://svn.apache.org/viewvc/maven/pom/tags/apache-14/pom.xml?r1=HEAD&r2=1434717&diff_format=h

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (SPARK-1402) 3 more compression algorithms for in-memory columnar storage
[ https://issues.apache.org/jira/browse/SPARK-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960937#comment-13960937 ]

Cheng Lian commented on SPARK-1402:
-----------------------------------

Corresponding PR: https://github.com/apache/spark/pull/330

3 more compression algorithms for in-memory columnar storage
------------------------------------------------------------

Key: SPARK-1402
URL: https://issues.apache.org/jira/browse/SPARK-1402
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker
Labels: compression
Fix For: 1.0.0

This is a followup of SPARK-1373 (Compression for In-Memory Columnar storage). 3 more compression algorithms for in-memory columnar storage should be implemented:
* BooleanBitSet
* IntDelta
* LongDelta

--
This message was sent by Atlassian JIRA
(v6.2#6252)
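The two delta schemes share one idea, which is easy to sketch: store the first value plus successive differences, which stay small (and therefore compress well) for slowly varying columns such as timestamps or sequence numbers. The sketch below is illustrative only — the Scala implementations proposed for the columnar store differ in encoding details such as byte-level layout and overflow handling:

```python
def delta_encode(xs):
    """Store the first value followed by successive differences."""
    return xs[:1] + [b - a for a, b in zip(xs, xs[1:])]

def delta_decode(ds):
    """Invert delta_encode with a running sum."""
    out = []
    total = 0
    for i, d in enumerate(ds):
        total = d if i == 0 else total + d
        out.append(total)
    return out
```

For example, delta_encode([100, 101, 103, 102]) gives [100, 1, 2, -1]: three small deltas instead of three large absolute values. BooleanBitSet is a different trade: packing one boolean per bit rather than per byte.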
[jira] [Updated] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-922:
----------------------------------
Issue Type: Task (was: Improvement)

Update Spark AMI to Python 2.7
------------------------------

Key: SPARK-922
URL: https://issues.apache.org/jira/browse/SPARK-922
Project: Spark
Issue Type: Task
Components: EC2, PySpark
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Josh Rosen
Priority: Blocker
Fix For: 1.0.0

Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Resolved] (SPARK-1305) Support persisting RDD's directly to Tachyon
[ https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-1305.
------------------------------------
Resolution: Fixed

Support persisting RDD's directly to Tachyon
--------------------------------------------

Key: SPARK-1305
URL: https://issues.apache.org/jira/browse/SPARK-1305
Project: Spark
Issue Type: New Feature
Components: Block Manager
Reporter: Patrick Wendell
Assignee: Haoyuan Li
Priority: Blocker
Fix For: 1.0.0

This is already an ongoing pull request - in a nutshell we want to support Tachyon as a storage level in Spark.

--
This message was sent by Atlassian JIRA
(v6.2#6252)