[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336169#comment-14336169 ] Derrick Burns commented on SPARK-3261: -- One solution is to run KMeansParallel or KMeansRandom after each Lloyd's round to "replenish" empty clusters. I have implemented the former in https://github.com/derrickburns/generalized-kmeans-clustering. Performance is reasonable. Inspection reveals that the slow part of the KMeansParallel computation is computing the sum of the weights of the points in each cluster. However, this cost can be reduced by sampling the points and summing the contributions of only the sampled points, which makes the approach well suited to large data sets. > KMeans clusterer can return duplicate cluster centers > - > > Key: SPARK-3261 > URL: https://issues.apache.org/jira/browse/SPARK-3261 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns >Assignee: Derrick Burns > Labels: clustering > > This is a bad design choice. I think that it is preferable to produce no > duplicate cluster centers. So instead of forcing the number of clusters to be > K, return at most K clusters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
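The sampling idea can be sketched outside Spark. The following is a hypothetical NumPy sketch, not code from the generalized-kmeans-clustering repository; the function name, the uniform-sampling scheme, and the 1/sample_frac scaling are all assumptions made for illustration. It estimates each cluster's total point weight from a random sample and flags near-zero clusters as candidates for replenishment.

```python
import numpy as np

def estimate_cluster_weights(points, weights, centers, sample_frac=0.1, rng=None):
    """Estimate the total point weight per cluster from a random sample.

    Hypothetical sketch: scan only a sampled subset, assign each sampled
    point to its nearest center, and scale the sampled weight sums by
    1 / sample_frac to approximate the full-data sums.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(points)
    idx = rng.choice(n, size=max(1, int(n * sample_frac)), replace=False)
    sample, w = points[idx], weights[idx]
    # Squared Euclidean distance from each sampled point to each center.
    d = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)
    est = np.zeros(len(centers))
    np.add.at(est, nearest, w)   # sampled weight sum per center
    return est / sample_frac     # scale back up to the full data set

# Clusters whose estimated weight is near zero are the "empty" ones that
# a KMeansParallel / KMeansRandom pass would replenish.
```

The estimate touches only `sample_frac` of the data per round, which is the trade-off the comment describes for large data sets.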
[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336152#comment-14336152 ] Anselme Vignon commented on SPARK-5775: --- [~marmbrus][~lian cheng] Hi, I'm quite new to the process of debugging Spark, but the pull request I updated 5 days ago (referenced above) seems to solve this issue. Or did I miss something? Cheers (and thanks for the awesome work), Anselme > GenericRow cannot be cast to SpecificMutableRow when nested data and > partitioned table > -- > > Key: SPARK-5775 > URL: https://issues.apache.org/jira/browse/SPARK-5775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Ayoub Benali >Assignee: Cheng Lian >Priority: Blocker > Labels: hivecontext, nested, parquet, partition > > Using the "LOAD" sql command in Hive context to load parquet files into a > partitioned table causes exceptions during query time. > The bug requires the table to have a column of *type Array of struct* and to > be *partitioned*. > The example below shows how to reproduce the bug and you can see that if the > table is not partitioned the query works fine. 
> {noformat} > scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}""" > scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}""" > scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) > scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD) > scala> schemaRDD.printSchema > root > |-- data_array: array (nullable = true) > ||-- element: struct (containsNull = false) > |||-- field1: integer (nullable = true) > |||-- field2: integer (nullable = true) > scala> hiveContext.sql("create external table if not exists > partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) > Partitioned by (date STRING) STORED AS PARQUET Location > 'hdfs:///partitioned_table'") > scala> hiveContext.sql("create external table if not exists > none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) > STORED AS PARQUET Location 'hdfs:///none_partitioned_table'") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE > partitioned_table PARTITION(date='2015-02-12')") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE > none_partitioned_table") > scala> hiveContext.sql("select data.field1 from none_partitioned_table > LATERAL VIEW explode(data_array) nestedStuff AS data").collect > res23: Array[org.apache.spark.sql.Row] = Array([1], [3]) > scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL > VIEW explode(data_array) nestedStuff AS data").collect > 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from > partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data > 15/02/12 16:21:03 INFO ParseDriver: Parse Completed > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with > curMem=0, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in > memory (estimated size 254.6 KB, 
free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with > curMem=260661, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes > in memory (estimated size 27.9 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory > on *:51990 (size: 27.9 KB, free: 267.2 MB) > 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block > broadcast_18_piece0 > 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD > at ParquetTableOperations.scala:119 > 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side > Metadata Split Strategy > 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at > SparkPlan.scala:84 > 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at > SparkPlan.scala:84) with 3 output partitions (allowLocal=false) > 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at > SparkPlan.scala:84) > 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List() > 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List() > 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at > map at SparkPlan.scala:84), which has no missing parents > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with >
[jira] [Commented] (SPARK-5969) The pyspark.rdd.sortByKey always fills only two partitions when ascending=False.
[ https://issues.apache.org/jira/browse/SPARK-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336128#comment-14336128 ] Apache Spark commented on SPARK-5969: - User 'foxik' has created a pull request for this issue: https://github.com/apache/spark/pull/4761 > The pyspark.rdd.sortByKey always fills only two partitions when > ascending=False. > > > Key: SPARK-5969 > URL: https://issues.apache.org/jira/browse/SPARK-5969 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.1 > Environment: Linux, 64bit >Reporter: Milan Straka > > The pyspark.rdd.sortByKey always fills only two partitions when > ascending=False -- the first one and the last one. > Simple example sorting numbers 0..999 into 10 partitions in descending order: > {code} > sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=False, > numPartitions=10).glom().map(len).collect() > {code} > returns the following partition sizes: > {code} > [469, 0, 0, 0, 0, 0, 0, 0, 0, 531] > {code} > When ascending=True, all works as expected: > {code} > sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=True, > numPartitions=10).glom().map(len).collect() > Out: [116, 96, 100, 87, 132, 101, 101, 95, 87, 85] > {code} > The problem is caused by the following line 565 in rdd.py: > {code} > samples = sorted(samples, reverse=(not ascending), key=keyfunc) > {code} > That sorts the samples descending if ascending=False. 
Nevertheless samples > should always be in ascending order, because it is (after subsampling to > variable bounds) used in a bisect_left call: > {code} > def rangePartitioner(k): > p = bisect.bisect_left(bounds, keyfunc(k)) > if ascending: > return p > else: > return numPartitions - 1 - p > {code} > As you can see, rangePartitioner already handles the ascending=False by > itself, so the fix for the whole problem is really trivial: just change line > rdd.py:565 to > {{samples = sorted(samples, -reverse=(not ascending),- key=keyfunc)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
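The bounds/bisect interaction described above can be checked outside Spark. This standalone sketch mirrors the `rangePartitioner` logic from rdd.py with made-up bounds (the `make_range_partitioner` helper name is invented for the example):

```python
import bisect

def make_range_partitioner(bounds, num_partitions, ascending):
    # bounds must be sorted ascending for bisect_left to be meaningful;
    # the descending case is handled by mirroring the partition index,
    # exactly as rangePartitioner in rdd.py does.
    def partitioner(key):
        p = bisect.bisect_left(bounds, key)
        return p if ascending else num_partitions - 1 - p
    return partitioner

bounds = [100, 200, 300]   # made-up sample-derived partition bounds
desc = make_range_partitioner(bounds, 4, ascending=False)
# Keys spread across all four partitions, in reverse order:
# 50 -> 3, 150 -> 2, 250 -> 1, 350 -> 0.
```

With descending bounds (the buggy case), `bisect_left` operates on an unsorted list and most keys collapse into the first and last partitions, which is exactly the reported symptom.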
[jira] [Commented] (SPARK-5999) Remove duplicate Literal matching block
[ https://issues.apache.org/jira/browse/SPARK-5999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336120#comment-14336120 ] Apache Spark commented on SPARK-5999: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4760 > Remove duplicate Literal matching block > --- > > Key: SPARK-5999 > URL: https://issues.apache.org/jira/browse/SPARK-5999 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5999) Remove duplicate Literal matching block
Liang-Chi Hsieh created SPARK-5999: -- Summary: Remove duplicate Literal matching block Key: SPARK-5999 URL: https://issues.apache.org/jira/browse/SPARK-5999 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5970) Temporary directories are not removed (but their content is)
[ https://issues.apache.org/jira/browse/SPARK-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336113#comment-14336113 ] Apache Spark commented on SPARK-5970: - User 'foxik' has created a pull request for this issue: https://github.com/apache/spark/pull/4759 > Temporary directories are not removed (but their content is) > > > Key: SPARK-5970 > URL: https://issues.apache.org/jira/browse/SPARK-5970 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.1 > Environment: Linux, 64bit > spark-1.2.1-bin-hadoop2.4.tgz >Reporter: Milan Straka > > How to reproduce: > - extract spark-1.2.1-bin-hadoop2.4.tgz > - without any further configuration, run bin/pyspark > - run sc.stop() and close python shell > Expected results: > - no temporary directories are left in /tmp > Actual results: > - four empty temporary directories are created in /tmp, for example after > {{ls -d /tmp/spark*}}:{code} > /tmp/spark-1577b13d-4b9a-4e35-bac2-6e84e5605f53 > /tmp/spark-96084e69-77fd-42fb-ab10-e1fc74296fe3 > /tmp/spark-ab2ea237-d875-485e-b16c-5b0ac31bd753 > /tmp/spark-ddeb0363-4760-48a4-a189-81321898b146 > {code} > The issue is caused by changes in {{util/Utils.scala}}. Consider the > {{createDirectory}}: > {code} /** >* Create a directory inside the given parent directory. The directory is > guaranteed to be >* newly created, and is not marked for automatic deletion. >*/ > def createDirectory(root: String, namePrefix: String = "spark"): File = ... > {code} > The {{createDirectory}} is used in two places. 
The first is in > {{createTempDir}}, where it is marked for automatic deletion: > {code} > def createTempDir( > root: String = System.getProperty("java.io.tmpdir"), > namePrefix: String = "spark"): File = { > val dir = createDirectory(root, namePrefix) > registerShutdownDeleteDir(dir) > dir > } > {code} > Nevertheless, it is also used in {{getOrCreateLocalDirs}} where it is _not_ > marked for automatic deletion: > {code} > private[spark] def getOrCreateLocalRootDirs(conf: SparkConf): Array[String] > = { > if (isRunningInYarnContainer(conf)) { > // If we are in yarn mode, systems can have different disk layouts so > we must set it > // to what Yarn on this system said was available. Note this assumes > that Yarn has > // created the directories already, and that they are secured so that > only the > // user has access to them. > getYarnLocalDirs(conf).split(",") > } else { > // In non-Yarn mode (or for the driver in yarn-client mode), we cannot > trust the user > // configuration to point to a secure directory. So create a > subdirectory with restricted > // permissions under each listed directory. > Option(conf.getenv("SPARK_LOCAL_DIRS")) > .getOrElse(conf.get("spark.local.dir", > System.getProperty("java.io.tmpdir"))) > .split(",") > .flatMap { root => > try { > val rootDir = new File(root) > if (rootDir.exists || rootDir.mkdirs()) { > Some(createDirectory(root).getAbsolutePath()) > } else { > logError(s"Failed to create dir in $root. Ignoring this > directory.") > None > } > } catch { > case e: IOException => > logError(s"Failed to create local root dir in $root. 
Ignoring > this directory.") > None > } > } > .toArray > } > } > {code} > Therefore I think the > {code} > Some(createDirectory(root).getAbsolutePath()) > {code} > should be replaced by something like (I am not an experienced Scala > programmer): > {code} > val dir = createDirectory(root) > registerShutdownDeleteDir(dir) > Some(dir.getAbsolutePath()) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5970) Temporary directories are not removed (but their content is)
[ https://issues.apache.org/jira/browse/SPARK-5970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336070#comment-14336070 ] Milan Straka commented on SPARK-5970: - I found a _Contributing to Spark_ guide, will do so. > Temporary directories are not removed (but their content is) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336057#comment-14336057 ] Saurabh Santhosh commented on SPARK-1823: - What is the status of this issue? Is there any other ticket for this? > ExternalAppendOnlyMap can still OOM if one key is very large > > > Key: SPARK-1823 > URL: https://issues.apache.org/jira/browse/SPARK-1823 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Andrew Or > > If the values for one key do not collectively fit into memory, then the map > will still OOM when you merge the spilled contents back in. > This is a problem especially for PySpark, since we hash the keys (Python > objects) before a shuffle, and there are only so many integers out there in > the world, so there could potentially be many collisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
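The collision concern is easy to demonstrate in plain CPython, independent of PySpark: distinct keys can share a hash value, so a hash partitioner must merge all of their values on a single reducer.

```python
# In CPython, hash(-1) is reserved internally as an error code, so the
# distinct keys -1 and -2 share a hash value: two different objects,
# one hash bucket.
assert hash(-1) == hash(-2)

num_partitions = 8

def hash_partition(key):
    # Simplified stand-in for a hash partitioner's key -> partition map.
    return hash(key) % num_partitions

# Both keys are routed to the same partition, so their values land in
# the same ExternalAppendOnlyMap-style structure during the merge.
assert hash_partition(-1) == hash_partition(-2)
```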
[jira] [Commented] (SPARK-4705) Driver retries in cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336034#comment-14336034 ] Twinkle Sachdeva commented on SPARK-4705: - Hi [~vanzin] Working on it. Thanks, Twinkle > Driver retries in cluster mode always fail if event logging is enabled > -- > > Key: SPARK-4705 > URL: https://issues.apache.org/jira/browse/SPARK-4705 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin > Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, Updated UI - > II.png > > > yarn-cluster mode will retry to run the driver in certain failure modes. If > event logging is enabled, this will most probably fail, because: > {noformat} > Exception in thread "Driver" java.io.IOException: Log directory > hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 > already exists! > at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129) > at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) > at > org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:353) > {noformat} > The event log path should be "more unique". Or perhaps retries of the same app > should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
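The "more unique" path idea could look like this (a hypothetical sketch, not Spark's eventual fix; the function name and naming scheme are assumptions): embed the attempt number in the log directory name so a retried driver writes to a fresh directory instead of colliding with the first attempt's logs.

```python
def event_log_dir(base: str, app_id: str, attempt: int) -> str:
    # A retry gets a distinct directory, avoiding the
    # "Log directory ... already exists!" IOException on restart.
    return f"{base}/{app_id}_attempt_{attempt}"
```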
[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336027#comment-14336027 ] Mukesh Jha commented on SPARK-5837: --- My Hadoop version is Hadoop 2.5.0-cdh5.3.0 > HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode > -- > > Key: SPARK-5837 > URL: https://issues.apache.org/jira/browse/SPARK-5837 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.2.0, 1.2.1 >Reporter: Marco Capuccini > > Both Spark 1.2.0 and Spark 1.2.1 return this error when I try to access the > Spark UI if I run over yarn (version 2.4.0): > HTTP ERROR 500 > Problem accessing /proxy/application_1423564210894_0017/. Reason: > Connection refused > Caused by: > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:579) > at java.net.Socket.connect(Socket.java:528) > at java.net.Socket.<init>(Socket.java:425) > at java.net.Socket.<init>(Socket.java:280) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) > at > org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) > at > org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) > at > org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:346) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.proxyLink(WebAppProxyServlet.java:187) > at > org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet.doGet(WebAppProxyServlet.java:344) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) > at > org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:66) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:79) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1192) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > at > org.mortbay.je
[jira] [Commented] (SPARK-5837) HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-5837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336026#comment-14336026 ] Mukesh Jha commented on SPARK-5837: --- [~srowen] From the Driver logs [3] I can see that the SparkUI started on the expected port, and my YARN app tracking URL [1] points to that port, which is in turn redirected to the proxy URL [2], which gives me java.net.BindException: Cannot assign requested address. If there were a port conflict, starting the SparkUI would have failed, but that is not the case. [1] YARN: application_1424814313649_0006 spark-realtime-MessageStoreWriter SPARK ciuser root.ciuser RUNNING UNDEFINED 10% http://host21.cloud.com:44648 [2] Proxy URL: http://host28.cloud.com:8088/proxy/application_1424814313649_0006/ [3] LOGS: 15/02/25 04:25:02 INFO util.Utils: Successfully started service 'SparkUI' on port 44648. 15/02/25 04:25:02 INFO ui.SparkUI: Started SparkUI at http://host21.cloud.com:44648 15/02/25 04:25:02 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler 15/02/25 04:25:02 INFO netty.NettyBlockTransferService: Server created on 41518 > HTTP 500 if try to access Spark UI in yarn-cluster or yarn-client mode
[jira] [Resolved] (SPARK-5994) Python DataFrame documentation fixes
[ https://issues.apache.org/jira/browse/SPARK-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5994. - Resolution: Fixed Issue resolved by pull request 4756 [https://github.com/apache/spark/pull/4756] > Python DataFrame documentation fixes > > > Key: SPARK-5994 > URL: https://issues.apache.org/jira/browse/SPARK-5994 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.3.0 > > > select empty should NOT be the same as select(*). make sure selectExpr is > behaving the same. > join param documentation > link to source doesn't work in jekyll generated file > cross reference of columns (i.e. enabling linking) > show(): move df example before df.show() > move tests in SQLContext out of docstring otherwise doc is too long > Column.desc and .asc doesn't have any documentation > in documentation, sort functions.* (and add the functions defined in the dict > automatically) > if possible, add index for all functions and classes > GroupedData doesn't have any description in the table of contents for > pyspark.sql module -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5751) Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out
[ https://issues.apache.org/jira/browse/SPARK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335987#comment-14335987 ] Apache Spark commented on SPARK-5751: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4758 > Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes > times out > -- > > Key: SPARK-5751 > URL: https://issues.apache.org/jira/browse/SPARK-5751 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > Labels: flaky-test > Fix For: 1.3.0 > > > The "Test JDBC query execution" test case times out occasionally, all other > test cases are just fine. The failure output only contains service startup > command line without any log output. Guess somehow the test case misses the > log file path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5998) Make Spark Streaming checkpoint version compatible
Saisai Shao created SPARK-5998: -- Summary: Make Spark Streaming checkpoint version compatible Key: SPARK-5998 URL: https://issues.apache.org/jira/browse/SPARK-5998 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Currently Spark Streaming's checkpointing (serialization) mechanism is not version compatible: if the code is updated or otherwise changed, old checkpoint files can no longer be read. To support long-running streaming applications that can be upgraded in place, it is worthwhile to make the checkpointing mechanism version compatible. I'm creating this JIRA as a stub for the issue; I'm currently working on it, and a design doc will be posted later.
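The compatibility idea above can be pictured with a minimal plain-Python sketch (this is not Spark code, and all names such as `CURRENT_VERSION` and `migrate_v1` are hypothetical): prefix the serialized checkpoint payload with an explicit format version, and on read route old versions through a migration step instead of failing.

```python
# Illustrative sketch of version-compatible checkpointing (not Spark code).
import io
import pickle
import struct

CURRENT_VERSION = 2  # hypothetical current checkpoint format version

def write_checkpoint(state: dict) -> bytes:
    buf = io.BytesIO()
    # Fixed-width version header, then the serialized state.
    buf.write(struct.pack(">I", CURRENT_VERSION))
    buf.write(pickle.dumps(state))
    return buf.getvalue()

def migrate_v1(state: dict) -> dict:
    # Hypothetical migration: pretend v1 checkpoints lacked 'batch_interval'.
    state.setdefault("batch_interval", 1000)
    return state

def read_checkpoint(data: bytes) -> dict:
    version = struct.unpack(">I", data[:4])[0]
    state = pickle.loads(data[4:])
    if version == 1:
        return migrate_v1(state)  # upgrade old data instead of failing
    if version != CURRENT_VERSION:
        raise ValueError("unsupported checkpoint version: %d" % version)
    return state
```

The point is only the shape of the solution: readers never see an unversioned blob, so a code upgrade can ship a migration for each older version rather than invalidating existing checkpoints.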
[jira] [Resolved] (SPARK-5286) Fail to drop an invalid table when using the data source API
[ https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5286. - Resolution: Fixed Issue resolved by pull request 4755 [https://github.com/apache/spark/pull/4755] > Fail to drop an invalid table when using the data source API > > > Key: SPARK-5286 > URL: https://issues.apache.org/jira/browse/SPARK-5286 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > Fix For: 1.3.0 > > > Example > {code} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json.DefaultSource > OPTIONS ( > path 'it is not a path at all!' > ) > DROP TABLE jsonTable > {code} > We will get > {code} > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) > [info] at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at scala.Option.getOrElse(Option.scala:120) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) > [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) > [info] at > 
org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) > [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > 
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.
[jira] [Commented] (SPARK-5996) DataFrame.collect() doesn't recognize UDTs
[ https://issues.apache.org/jira/browse/SPARK-5996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335947#comment-14335947 ] Apache Spark commented on SPARK-5996: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/4757 > DataFrame.collect() doesn't recognize UDTs > -- > > Key: SPARK-5996 > URL: https://issues.apache.org/jira/browse/SPARK-5996 > Project: Spark > Issue Type: Bug > Components: MLlib, SQL >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Assignee: Michael Armbrust >Priority: Blocker > > {code} > import org.apache.spark.mllib.linalg._ > case class Test(data: Vector) > val df = sqlContext.createDataFrame(Seq(Test(Vectors.dense(1.0, 2.0 > df.collect()[0].getAs[Vector](0) > {code} > throws an exception. `collect()` returns `Row` instead of `Vector`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5775: Assignee: Cheng Lian (was: Michael Armbrust) > GenericRow cannot be cast to SpecificMutableRow when nested data and > partitioned table > -- > > Key: SPARK-5775 > URL: https://issues.apache.org/jira/browse/SPARK-5775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Ayoub Benali >Assignee: Cheng Lian >Priority: Blocker > Labels: hivecontext, nested, parquet, partition > > Using the "LOAD" sql command in Hive context to load parquet files into a > partitioned table causes exceptions during query time. > The bug requires the table to have a column of *type Array of struct* and to > be *partitioned*. > The example bellow shows how to reproduce the bug and you can see that if the > table is not partitioned the query works fine. > {noformat} > scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}""" > scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}""" > scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) > scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD) > scala> schemaRDD.printSchema > root > |-- data_array: array (nullable = true) > ||-- element: struct (containsNull = false) > |||-- field1: integer (nullable = true) > |||-- field2: integer (nullable = true) > scala> hiveContext.sql("create external table if not exists > partitioned_table(data_array ARRAY >) > Partitioned by (date STRING) STORED AS PARQUET Location > 'hdfs:///partitioned_table'") > scala> hiveContext.sql("create external table if not exists > none_partitioned_table(data_array ARRAY >) > STORED AS PARQUET Location 'hdfs:///none_partitioned_table'") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE > 
partitioned_table PARTITION(date='2015-02-12')") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE > none_partitioned_table") > scala> hiveContext.sql("select data.field1 from none_partitioned_table > LATERAL VIEW explode(data_array) nestedStuff AS data").collect > res23: Array[org.apache.spark.sql.Row] = Array([1], [3]) > scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL > VIEW explode(data_array) nestedStuff AS data").collect > 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from > partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data > 15/02/12 16:21:03 INFO ParseDriver: Parse Completed > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with > curMem=0, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in > memory (estimated size 254.6 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with > curMem=260661, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes > in memory (estimated size 27.9 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory > on *:51990 (size: 27.9 KB, free: 267.2 MB) > 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block > broadcast_18_piece0 > 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD > at ParquetTableOperations.scala:119 > 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side > Metadata Split Strategy > 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at > SparkPlan.scala:84 > 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at > SparkPlan.scala:84) with 3 output partitions (allowLocal=false) > 15/02/12 
16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at > SparkPlan.scala:84) > 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List() > 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List() > 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at > map at SparkPlan.scala:84), which has no missing parents > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with > curMem=289276, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in > memory (estimated size 7.5 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with > curMem=296908, maxMem=28
[jira] [Updated] (SPARK-5993) Published Kafka-assembly JAR was empty in 1.3.0-RC1
[ https://issues.apache.org/jira/browse/SPARK-5993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-5993: - Fix Version/s: 1.3.0 > Published Kafka-assembly JAR was empty in 1.3.0-RC1 > --- > > Key: SPARK-5993 > URL: https://issues.apache.org/jira/browse/SPARK-5993 > Project: Spark > Issue Type: Bug > Components: Build, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > Fix For: 1.3.0 > > > This is because the maven build generated two JARs: > 1. an empty JAR file (since kafka-assembly has no code of its own) > 2. an assembly JAR file containing everything, at a different location than 1 > The maven publishing plugin uploaded 1 and not 2. > If instead 2 is not configured to be generated at a different location, there is > only one JAR, containing everything, and that is what gets published.
[jira] [Updated] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5775: Priority: Blocker (was: Major) > GenericRow cannot be cast to SpecificMutableRow when nested data and > partitioned table > -- > > Key: SPARK-5775 > URL: https://issues.apache.org/jira/browse/SPARK-5775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Ayoub Benali >Assignee: Michael Armbrust >Priority: Blocker > Labels: hivecontext, nested, parquet, partition > > Using the "LOAD" sql command in Hive context to load parquet files into a > partitioned table causes exceptions during query time. > The bug requires the table to have a column of *type Array of struct* and to > be *partitioned*. > The example bellow shows how to reproduce the bug and you can see that if the > table is not partitioned the query works fine. > {noformat} > scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}""" > scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}""" > scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) > scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD) > scala> schemaRDD.printSchema > root > |-- data_array: array (nullable = true) > ||-- element: struct (containsNull = false) > |||-- field1: integer (nullable = true) > |||-- field2: integer (nullable = true) > scala> hiveContext.sql("create external table if not exists > partitioned_table(data_array ARRAY >) > Partitioned by (date STRING) STORED AS PARQUET Location > 'hdfs:///partitioned_table'") > scala> hiveContext.sql("create external table if not exists > none_partitioned_table(data_array ARRAY >) > STORED AS PARQUET Location 'hdfs:///none_partitioned_table'") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE > 
partitioned_table PARTITION(date='2015-02-12')") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE > none_partitioned_table") > scala> hiveContext.sql("select data.field1 from none_partitioned_table > LATERAL VIEW explode(data_array) nestedStuff AS data").collect > res23: Array[org.apache.spark.sql.Row] = Array([1], [3]) > scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL > VIEW explode(data_array) nestedStuff AS data").collect > 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from > partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data > 15/02/12 16:21:03 INFO ParseDriver: Parse Completed > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with > curMem=0, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in > memory (estimated size 254.6 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with > curMem=260661, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes > in memory (estimated size 27.9 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory > on *:51990 (size: 27.9 KB, free: 267.2 MB) > 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block > broadcast_18_piece0 > 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD > at ParquetTableOperations.scala:119 > 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side > Metadata Split Strategy > 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at > SparkPlan.scala:84 > 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at > SparkPlan.scala:84) with 3 output partitions (allowLocal=false) > 15/02/12 
16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at > SparkPlan.scala:84) > 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List() > 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List() > 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at > map at SparkPlan.scala:84), which has no missing parents > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with > curMem=289276, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in > memory (estimated size 7.5 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with > curMem=296908, maxMem=280248975
[jira] [Assigned] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-5775: --- Assignee: Michael Armbrust > GenericRow cannot be cast to SpecificMutableRow when nested data and > partitioned table > -- > > Key: SPARK-5775 > URL: https://issues.apache.org/jira/browse/SPARK-5775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Ayoub Benali >Assignee: Michael Armbrust > Labels: hivecontext, nested, parquet, partition > > Using the "LOAD" sql command in Hive context to load parquet files into a > partitioned table causes exceptions during query time. > The bug requires the table to have a column of *type Array of struct* and to > be *partitioned*. > The example bellow shows how to reproduce the bug and you can see that if the > table is not partitioned the query works fine. > {noformat} > scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}""" > scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}""" > scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) > scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD) > scala> schemaRDD.printSchema > root > |-- data_array: array (nullable = true) > ||-- element: struct (containsNull = false) > |||-- field1: integer (nullable = true) > |||-- field2: integer (nullable = true) > scala> hiveContext.sql("create external table if not exists > partitioned_table(data_array ARRAY >) > Partitioned by (date STRING) STORED AS PARQUET Location > 'hdfs:///partitioned_table'") > scala> hiveContext.sql("create external table if not exists > none_partitioned_table(data_array ARRAY >) > STORED AS PARQUET Location 'hdfs:///none_partitioned_table'") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE > partitioned_table 
PARTITION(date='2015-02-12')") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE > none_partitioned_table") > scala> hiveContext.sql("select data.field1 from none_partitioned_table > LATERAL VIEW explode(data_array) nestedStuff AS data").collect > res23: Array[org.apache.spark.sql.Row] = Array([1], [3]) > scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL > VIEW explode(data_array) nestedStuff AS data").collect > 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from > partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data > 15/02/12 16:21:03 INFO ParseDriver: Parse Completed > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with > curMem=0, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in > memory (estimated size 254.6 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with > curMem=260661, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes > in memory (estimated size 27.9 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory > on *:51990 (size: 27.9 KB, free: 267.2 MB) > 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block > broadcast_18_piece0 > 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD > at ParquetTableOperations.scala:119 > 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side > Metadata Split Strategy > 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at > SparkPlan.scala:84 > 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at > SparkPlan.scala:84) with 3 output partitions (allowLocal=false) > 15/02/12 16:21:03 INFO 
DAGScheduler: Final stage: Stage 13(collect at > SparkPlan.scala:84) > 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List() > 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List() > 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at > map at SparkPlan.scala:84), which has no missing parents > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with > curMem=289276, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in > memory (estimated size 7.5 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with > curMem=296908, maxMem=280248975 > 15/02/12 16:21:03 INFO Memo
[jira] [Updated] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5775: Target Version/s: 1.3.0 > GenericRow cannot be cast to SpecificMutableRow when nested data and > partitioned table > -- > > Key: SPARK-5775 > URL: https://issues.apache.org/jira/browse/SPARK-5775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Ayoub Benali >Assignee: Michael Armbrust > Labels: hivecontext, nested, parquet, partition > > Using the "LOAD" sql command in Hive context to load parquet files into a > partitioned table causes exceptions during query time. > The bug requires the table to have a column of *type Array of struct* and to > be *partitioned*. > The example bellow shows how to reproduce the bug and you can see that if the > table is not partitioned the query works fine. > {noformat} > scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}""" > scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}""" > scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil) > scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD) > scala> schemaRDD.printSchema > root > |-- data_array: array (nullable = true) > ||-- element: struct (containsNull = false) > |||-- field1: integer (nullable = true) > |||-- field2: integer (nullable = true) > scala> hiveContext.sql("create external table if not exists > partitioned_table(data_array ARRAY >) > Partitioned by (date STRING) STORED AS PARQUET Location > 'hdfs:///partitioned_table'") > scala> hiveContext.sql("create external table if not exists > none_partitioned_table(data_array ARRAY >) > STORED AS PARQUET Location 'hdfs:///none_partitioned_table'") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1") > scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE > partitioned_table 
PARTITION(date='2015-02-12')") > scala> hiveContext.sql("LOAD DATA INPATH > 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE > none_partitioned_table") > scala> hiveContext.sql("select data.field1 from none_partitioned_table > LATERAL VIEW explode(data_array) nestedStuff AS data").collect > res23: Array[org.apache.spark.sql.Row] = Array([1], [3]) > scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL > VIEW explode(data_array) nestedStuff AS data").collect > 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from > partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data > 15/02/12 16:21:03 INFO ParseDriver: Parse Completed > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with > curMem=0, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in > memory (estimated size 254.6 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with > curMem=260661, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes > in memory (estimated size 27.9 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory > on *:51990 (size: 27.9 KB, free: 267.2 MB) > 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block > broadcast_18_piece0 > 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD > at ParquetTableOperations.scala:119 > 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3 > 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side > Metadata Split Strategy > 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at > SparkPlan.scala:84 > 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at > SparkPlan.scala:84) with 3 output partitions (allowLocal=false) > 15/02/12 16:21:03 INFO 
DAGScheduler: Final stage: Stage 13(collect at > SparkPlan.scala:84) > 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List() > 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List() > 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at > map at SparkPlan.scala:84), which has no missing parents > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with > curMem=289276, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in > memory (estimated size 7.5 KB, free 267.0 MB) > 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(4230) called with > curMem=296908, maxMem=280248975 > 15/02/12 16:21:03 INFO MemoryStore: B
[jira] [Resolved] (SPARK-5985) sortBy -> orderBy in Python
[ https://issues.apache.org/jira/browse/SPARK-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5985. - Resolution: Fixed Issue resolved by pull request 4752 [https://github.com/apache/spark/pull/4752] > sortBy -> orderBy in Python > --- > > Key: SPARK-5985 > URL: https://issues.apache.org/jira/browse/SPARK-5985 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.0 > > > Seems like a mistake that we included sortBy but not orderBy in Python > DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335846#comment-14335846 ] Chris Fregly commented on SPARK-5960: - pushing this up to 1.3.1 > Allow AWS credentials to be passed to KinesisUtils.createStream() > - > > Key: SPARK-5960 > URL: https://issues.apache.org/jira/browse/SPARK-5960 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > While IAM roles are preferable, we're seeing a lot of cases where we need to > pass AWS credentials when creating the KinesisReceiver. > Notes: > * Make sure we don't log the credentials anywhere > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Fregly updated SPARK-5960: Target Version/s: 1.3.1 (was: 1.4.0) > Allow AWS credentials to be passed to KinesisUtils.createStream() > - > > Key: SPARK-5960 > URL: https://issues.apache.org/jira/browse/SPARK-5960 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > While IAM roles are preferable, we're seeing a lot of cases where we need to > pass AWS credentials when creating the KinesisReceiver. > Notes: > * Make sure we don't log the credentials anywhere > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
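The "make sure we don't log the credentials anywhere" note in SPARK-5960 can be illustrated with a small sketch (plain Python, not the actual Spark/Kinesis API; `MaskedCredentials` is a hypothetical name): wrap the credentials in an object whose string forms never expose the secret, so accidental logging of the receiver's state redacts it.

```python
# Illustrative sketch of credentials that are safe to log (hypothetical class).
class MaskedCredentials:
    def __init__(self, access_key_id: str, secret_access_key: str):
        self._access_key_id = access_key_id
        self._secret_access_key = secret_access_key

    @property
    def access_key_id(self) -> str:
        return self._access_key_id

    @property
    def secret_access_key(self) -> str:
        # Only explicit accessor calls reveal the secret.
        return self._secret_access_key

    def __repr__(self) -> str:
        # Anything that lands in a log line sees a redacted form.
        return "MaskedCredentials(access_key_id=%s****, secret_access_key=****)" % (
            self._access_key_id[:4],
        )

    __str__ = __repr__
```

A receiver holding such an object can still authenticate normally, while `str()`/`repr()` output in logs or error messages stays redacted.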
[jira] [Commented] (SPARK-5994) Python DataFrame documentation fixes
[ https://issues.apache.org/jira/browse/SPARK-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335845#comment-14335845 ] Apache Spark commented on SPARK-5994: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4756 > Python DataFrame documentation fixes > > > Key: SPARK-5994 > URL: https://issues.apache.org/jira/browse/SPARK-5994 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.3.0 > > > select empty should NOT be the same as select(*). make sure selectExpr is > behaving the same. > join param documentation > link to source doesn't work in jekyll generated file > cross reference of columns (i.e. enabling linking) > show(): move df example before df.show() > move tests in SQLContext out of docstring otherwise doc is too long > Column.desc and .asc doesn't have any documentation > in documentation, sort functions.* (and add the functions defined in the dict > automatically) > if possible, add index for all functions and classes > GroupedData doesn't have any description in the table of contents for > pyspark.sql module -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5997) Increase partition count without performing a shuffle
[ https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-5997: -- Description: When decreasing partition count with rdd.repartition() or rdd.coalesce(), the user has the ability to choose whether or not to perform a shuffle. However when increasing partition count there is no option of whether to perform a shuffle or not -- a shuffle always occurs. This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call that performs a repartition to a higher partition count without a shuffle. The motivating use case is to decrease the size of an individual partition enough that the .toLocalIterator has significantly reduced memory pressure on the driver, as it loads a partition at a time into the driver. was: When decreasing partition count with rdd.repartition() or rdd.coalesce(), the user has the ability to choose whether or not to perform a shuffle. However when increasing partition count there is no option of whether to perform a shuffle or not. This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call that performs a repartition to a higher partition count without a shuffle. The motivating use case is to decrease the size of an individual partition enough that the .toLocalIterator has significantly reduced memory pressure on the driver, as it loads a partition at a time into the driver. > Increase partition count without performing a shuffle > - > > Key: SPARK-5997 > URL: https://issues.apache.org/jira/browse/SPARK-5997 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Andrew Ash > > When decreasing partition count with rdd.repartition() or rdd.coalesce(), the > user has the ability to choose whether or not to perform a shuffle. However > when increasing partition count there is no option of whether to perform a > shuffle or not -- a shuffle always occurs. 
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call > that performs a repartition to a higher partition count without a shuffle. > The motivating use case is to decrease the size of an individual partition > enough that the .toLocalIterator has significantly reduced memory pressure on > the driver, as it loads a partition at a time into the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
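The proposed semantics can be sketched in a few lines. This is a minimal pure-Python model of the idea, not Spark's actual implementation: grow the partition count by splitting each partition locally, so no record ever crosses a partition boundary (i.e., no shuffle). The names `split_partition` and `repartition_no_shuffle` are hypothetical.

```python
def split_partition(partition, factor):
    """Deal one partition's records round-robin into `factor` sub-partitions."""
    subs = [[] for _ in range(factor)]
    for i, record in enumerate(partition):
        subs[i % factor].append(record)
    return subs

def repartition_no_shuffle(partitions, target_count):
    """Split every partition into ceil(target/current) pieces.

    Each resulting piece derives from exactly one parent partition, which is
    what lets .toLocalIterator pull smaller chunks into the driver without
    paying for a shuffle.
    """
    factor = -(-target_count // len(partitions))  # ceiling division
    result = []
    for partition in partitions:
        result.extend(split_partition(partition, factor))
    return result
```

Note that, unlike a shuffle-based repartition, this cannot rebalance skew: a sub-partition can only be as varied as the parent it was carved from.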
[jira] [Commented] (SPARK-5286) Fail to drop an invalid table when using the data source API
[ https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335835#comment-14335835 ] Apache Spark commented on SPARK-5286: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4755 > Fail to drop an invalid table when using the data source API > > > Key: SPARK-5286 > URL: https://issues.apache.org/jira/browse/SPARK-5286 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > Fix For: 1.3.0 > > > Example > {code} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json.DefaultSource > OPTIONS ( > path 'it is not a path at all!' > ) > DROP TABLE jsonTable > {code} > We will get > {code} > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) > [info] at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at scala.Option.getOrElse(Option.scala:120) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) > [info] at 
org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) > [info] at > org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) > [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(Fu
[jira] [Updated] (SPARK-5286) Fail to drop an invalid table when using the data source API
[ https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5286: Priority: Blocker (was: Critical) > Fail to drop an invalid table when using the data source API > > > Key: SPARK-5286 > URL: https://issues.apache.org/jira/browse/SPARK-5286 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Blocker > Fix For: 1.3.0 > > > Example > {code} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json.DefaultSource > OPTIONS ( > path 'it is not a path at all!' > ) > DROP TABLE jsonTable > {code} > We will get > {code} > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) > [info] at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at scala.Option.getOrElse(Option.scala:120) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) > [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) > [info] at > org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) > [info] at > 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) > [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at 
org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > [inf
[jira] [Created] (SPARK-5997) Increase partition count without performing a shuffle
Andrew Ash created SPARK-5997: - Summary: Increase partition count without performing a shuffle Key: SPARK-5997 URL: https://issues.apache.org/jira/browse/SPARK-5997 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Andrew Ash When decreasing partition count with rdd.repartition() or rdd.coalesce(), the user has the ability to choose whether or not to perform a shuffle. However when increasing partition count there is no option of whether to perform a shuffle or not. This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call that performs a repartition to a higher partition count without a shuffle. The motivating use case is to decrease the size of an individual partition enough that the .toLocalIterator has significantly reduced memory pressure on the driver, as it loads a partition at a time into the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-5286) Fail to drop an invalid table when using the data source API
[ https://issues.apache.org/jira/browse/SPARK-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-5286: - I am reopening this issue because we need to catch all Throwables instead of just Exceptions. > Fail to drop an invalid table when using the data source API > > > Key: SPARK-5286 > URL: https://issues.apache.org/jira/browse/SPARK-5286 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > Fix For: 1.3.0 > > > Example > {code} > CREATE TABLE jsonTable > USING org.apache.spark.sql.json.DefaultSource > OPTIONS ( > path 'it is not a path at all!' > ) > DROP TABLE jsonTable > {code} > We will get > {code} > [info] com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > file:/Users/yhuai/Projects/Spark/spark/sql/hive/it is not a path at all! > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4882) > [info] at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4898) > [info] at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:147) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:241) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137) > [info] at scala.Option.getOrElse(Option.scala:120) > [info] at > org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137) > [info] at > org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:241) > [info] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:332) > [info] at > 
org.apache.spark.sql.hive.execution.DropTable.run(commands.scala:57) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:53) > [info] at > org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:61) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:472) > [info] at > org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > [info] at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:108) > [info] at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:73) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply$mcV$sp(MetastoreDataSourcesSuite.scala:258) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite$$anonfun$9.apply(MetastoreDataSourcesSuite.scala:246) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > 
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > [info] at > org.apache.spark.sql.hive.MetastoreDataSourcesSuite.runTest(MetastoreDataSourcesSuite.scala:36) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.SuperEngine$$an
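The reason the issue was reopened (catch all Throwables, not just Exceptions) can be shown with a small analogy. In Scala, `catch { case e: Exception => }` misses Throwables outside the Exception hierarchy (such as Errors); the Python parallel below uses `BaseException` subclasses. `FatalLookupError` and the drop functions are illustrative stand-ins, not Spark's actual code.

```python
class FatalLookupError(BaseException):
    """Plays the role of a Throwable that is not an Exception."""

def drop_table_catching_exceptions(lookup_relation):
    try:
        lookup_relation()          # fails for a table with an invalid path
    except Exception:
        pass                       # too narrow: FatalLookupError escapes
    return "dropped"

def drop_table_catching_throwables(lookup_relation):
    try:
        lookup_relation()
    except BaseException:          # the reopened fix: catch everything
        pass
    return "dropped"
```

With the narrow handler, a failing lookup aborts the DROP TABLE; with the broad one, the invalid table can still be dropped.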
[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335827#comment-14335827 ] Michael Nazario commented on SPARK-5978: If I get the chance I'll look into it. I'm not planning to put any time into this right now since I no longer depend on the correctness of the example code. However, it would definitely be nice for Spark users to be able to run this example code on a hadoop 2 build like cdh4. > Spark examples cannot compile with Hadoop 2 > --- > > Key: SPARK-5978 > URL: https://issues.apache.org/jira/browse/SPARK-5978 > Project: Spark > Issue Type: Bug > Components: Examples, PySpark >Affects Versions: 1.2.0, 1.2.1 >Reporter: Michael Nazario > Labels: hadoop-version > > This is a regression from Spark 1.1.1. > The Spark Examples includes an example for an avro converter for PySpark. > When I was trying to debug a problem, I discovered that even though you can > build with Hadoop 2 for Spark 1.2.0, an hbase dependency depends on Hadoop 1 > somewhere else in the examples code. > An easy fix would be to separate the examples into hadoop specific versions. > Another way would be to fix the hbase dependencies so that they don't rely on > hadoop 1 specific code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5979) `--packages` should not exclude spark streaming assembly jars for kafka and flume
[ https://issues.apache.org/jira/browse/SPARK-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335823#comment-14335823 ] Apache Spark commented on SPARK-5979: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/4754 > `--packages` should not exclude spark streaming assembly jars for kafka and > flume > -- > > Key: SPARK-5979 > URL: https://issues.apache.org/jira/browse/SPARK-5979 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit >Affects Versions: 1.3.0 >Reporter: Burak Yavuz >Priority: Blocker > > Currently `--packages` has an exclude rule for all dependencies with the > groupId `org.apache.spark` assuming that these are packaged inside the > spark-assembly jar. This is not the case and more fine grained filtering is > required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
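The "more fine grained filtering" the issue asks for amounts to excluding only the artifacts actually packaged in the assembly, instead of the whole `org.apache.spark` groupId. A hedged sketch, assuming an illustrative artifact list (not the actual rule Spark adopted):

```python
# Artifacts genuinely inside the spark-assembly jar: safe to exclude.
ASSEMBLY_ARTIFACTS = {"spark-core", "spark-sql", "spark-streaming"}

def should_exclude(group_id, artifact_id):
    """Old rule: exclude everything under org.apache.spark.

    Refined rule: keep external modules such as spark-streaming-kafka and
    spark-streaming-flume, which are NOT in the assembly and must be pulled
    in by `--packages`.
    """
    return group_id == "org.apache.spark" and artifact_id in ASSEMBLY_ARTIFACTS
```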
[jira] [Closed] (SPARK-5816) Add huge backward compatibility warning in DriverWrapper
[ https://issues.apache.org/jira/browse/SPARK-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5816. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 (was: 1.3.0, 1.4.0) > Add huge backward compatibility warning in DriverWrapper > > > Key: SPARK-5816 > URL: https://issues.apache.org/jira/browse/SPARK-5816 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.3.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > Fix For: 1.3.0 > > > As of Spark 1.3, we provide backward and forward compatibility in standalone > cluster mode through the REST submission gateway. HOWEVER, it nevertheless > goes through the legacy o.a.s.deploy.DriverWrapper, and the semantics of the > command line arguments there must not change. For instance, this was broken > in commit 20a6013106b56a1a1cc3e8cda092330ffbe77cc3. > There is currently no warning against that in the class and so we should add > one before it's too late. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5996) DataFrame.collect() doesn't recognize UDTs
Xiangrui Meng created SPARK-5996: Summary: DataFrame.collect() doesn't recognize UDTs Key: SPARK-5996 URL: https://issues.apache.org/jira/browse/SPARK-5996 Project: Spark Issue Type: Bug Components: MLlib, SQL Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Michael Armbrust Priority: Blocker {code} import org.apache.spark.mllib.linalg._ case class Test(data: Vector) val df = sqlContext.createDataFrame(Seq(Test(Vectors.dense(1.0, 2.0)))) df.collect()[0].getAs[Vector](0) {code} throws an exception. `collect()` returns `Row` instead of `Vector`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
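The shape of the bug and its fix can be sketched in miniature: when a column's type is a UDT, `collect()` should run the UDT's `deserialize()` to turn the internal representation back into the user type; without that branch, callers see a raw `Row`. All names below are illustrative, not Spark's internals.

```python
class VectorUDT:
    """Toy user-defined type: a vector stored internally as a dict."""

    def serialize(self, vec):
        # user type -> internal (catalyst-like) representation
        return {"values": list(vec)}

    def deserialize(self, datum):
        # internal representation -> user type
        return tuple(datum["values"])

def collect_cell(datum, data_type):
    """What collect() should do per cell: honor the UDT if there is one."""
    if isinstance(data_type, VectorUDT):
        return data_type.deserialize(datum)
    # without the branch above, the caller would receive the raw datum
    return datum
```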
[jira] [Commented] (SPARK-5845) Time to cleanup spilled shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335757#comment-14335757 ] Kay Ousterhout commented on SPARK-5845: --- I'd go with (b) -- it's fine (and good, I think!) to include time to cleanup the partition writers, since this also involves interacting with the disk. Thanks! > Time to cleanup spilled shuffle files not included in shuffle write time > > > Key: SPARK-5845 > URL: https://issues.apache.org/jira/browse/SPARK-5845 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kay Ousterhout >Assignee: Ilya Ganelin >Priority: Minor > > When the disk is contended, I've observed cases when it takes as long as 7 > seconds to clean up all of the intermediate spill files for a shuffle (when > using the sort based shuffle, but bypassing merging because there are <=200 > shuffle partitions). This is even when the shuffle data is non-huge (152MB > written from one of the tasks where I observed this). This is effectively > part of the shuffle write time (because it's a necessary side effect of > writing data to disk) so should be added to the shuffle write time to > facilitate debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5995) Make ML Prediction Developer APIs public
Joseph K. Bradley created SPARK-5995: Summary: Make ML Prediction Developer APIs public Key: SPARK-5995 URL: https://issues.apache.org/jira/browse/SPARK-5995 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Previously, some Developer APIs were added to spark.ml for classification and regression to make it easier to add new algorithms and models: [SPARK-4789] There are ongoing discussions about the best design of the API. This JIRA is to continue that discussion and try to finalize those Developer APIs so that they can be made public. Please see [this design doc from SPARK-4789 | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] for details on the original API design. Some issues under debate: * Should there be strongly typed APIs for fit()? * Should the strongly typed API for transform() be public (vs. protected)? * What transformation methods should the API make developers implement for classification? (See details below.) * Should there be a way to transform a single Row (instead of only DataFrames)? More on "What transformation methods should the API make developers implement for classification?": * Goals: ** Optimize transform: Make it fast, and make it output only the desired columns. ** Easy development ** Support Classifier, Regressor, and ProbabilisticClassifier * (currently) Developers implement predictX methods for each output column X. They may override transform() to optimize speed. ** Pros: predictX is easy to understand. ** Cons: An optimized transform() is annoying to write. * Developers implement more basic transformation methods, such as features2raw, raw2pred, raw2prob. ** Pros: Abstract classes may implement optimized transform(). 
** Cons: Different types of predictors require different methods: *** Predictor and Regressor: features2pred *** Classifier: features2raw, raw2pred *** ProbabilisticClassifier: raw2prob * Developers implement a single predict() method which takes parameters for what columns to output (returning tuple or some type with None for missing values). Abstract classes take the outputs they want and put them into columns. ** Pros: Developers only write 1 method and can optimize it as much as they want. It could be more optimized than the previous 2 options; e.g., if LogisticRegressionModel only wants the prediction, then it never has to construct intermediate results such as the vector of raw predictions. ** Cons: predict() will have a different signature for different abstractions, based on the possible output columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
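The second option above (developers implement basic transformation methods; abstract classes compose them into `transform()`/`predict()`) can be sketched as follows. Class and method names mirror the JIRA text but are illustrative, not the actual spark.ml API.

```python
import math

class ProbabilisticClassifier:
    """Abstract class: composes the developer-supplied basic methods."""

    def features2raw(self, features):
        raise NotImplementedError   # developer implements

    def raw2prob(self, raw):
        raise NotImplementedError   # developer implements

    def raw2pred(self, raw):
        # default: index of the largest raw score
        return max(range(len(raw)), key=lambda i: raw[i])

    def predict(self, features):
        # the abstract class chains the pieces, so an optimized transform()
        # only has to be written once, here
        return self.raw2pred(self.features2raw(features))

class ToyLogisticRegression(ProbabilisticClassifier):
    def __init__(self, weights):
        self.weights = weights

    def features2raw(self, features):
        margin = sum(w * x for w, x in zip(self.weights, features))
        return [-margin, margin]

    def raw2prob(self, raw):
        p = 1.0 / (1.0 + math.exp(-raw[1]))
        return [1.0 - p, p]
```

This illustrates the stated con as well: a `Regressor` would supply `features2pred` instead, so the required method set differs per abstraction.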
[jira] [Commented] (SPARK-5994) Python DataFrame documentation fixes
[ https://issues.apache.org/jira/browse/SPARK-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335750#comment-14335750 ] Reynold Xin commented on SPARK-5994: Added one more: "GroupedData doesn't have any description in the table of contents for pyspark.sql module" > Python DataFrame documentation fixes > > > Key: SPARK-5994 > URL: https://issues.apache.org/jira/browse/SPARK-5994 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.3.0 > > > select empty should NOT be the same as select(*). make sure selectExpr is > behaving the same. > join param documentation > link to source doesn't work in jekyll generated file > cross reference of columns (i.e. enabling linking) > show(): move df example before df.show() > move tests in SQLContext out of docstring otherwise doc is too long > Column.desc and .asc doesn't have any documentation > in documentation, sort functions.* (and add the functions defined in the dict > automatically) > if possible, add index for all functions and classes > GroupedData doesn't have any description in the table of contents for > pyspark.sql module -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5994) Python DataFrame documentation fixes
[ https://issues.apache.org/jira/browse/SPARK-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5994: --- Description: select empty should NOT be the same as select(*). make sure selectExpr is behaving the same. join param documentation link to source doesn't work in jekyll generated file cross reference of columns (i.e. enabling linking) show(): move df example before df.show() move tests in SQLContext out of docstring otherwise doc is too long Column.desc and .asc doesn't have any documentation in documentation, sort functions.* (and add the functions defined in the dict automatically) if possible, add index for all functions and classes GroupedData doesn't have any description in the table of contents for pyspark.sql module was: select empty should NOT be the same as select(*). make sure selectExpr is behaving the same. join param documentation link to source doesn't work in jekyll generated file cross reference of columns (i.e. enabling linking) show(): move df example before df.show() move tests in SQLContext out of docstring otherwise doc is too long Column.desc and .asc doesn't have any documentation in documentation, sort functions.* (and add the functions defined in the dict automatically) if possible, add index for all functions and classes > Python DataFrame documentation fixes > > > Key: SPARK-5994 > URL: https://issues.apache.org/jira/browse/SPARK-5994 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.3.0 > > > select empty should NOT be the same as select(*). make sure selectExpr is > behaving the same. > join param documentation > link to source doesn't work in jekyll generated file > cross reference of columns (i.e. 
enabling linking) > show(): move df example before df.show() > move tests in SQLContext out of docstring otherwise doc is too long > Column.desc and .asc doesn't have any documentation > in documentation, sort functions.* (and add the functions defined in the dict > automatically) > if possible, add index for all functions and classes > GroupedData doesn't have any description in the table of contents for > pyspark.sql module -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5845) Time to cleanup spilled shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335751#comment-14335751 ] Ilya Ganelin commented on SPARK-5845: - My mistake - missed your comment about the spill files in the detailed description. Given that we're interested in cleaning up the spill files which appear to be cleaned up in ExternalSorter.stop() (please correct me if I'm wrong), I would like to either a) Pass the context to the stop() method - this is possible since the SortShuffleWriter has visibility of the TaskContext (which in turn stores the metrics we're interested in). b) (My preference since it won't break the existing interface) Surround sorter.stop() on line 91 of SortShuffleWriter.scala with a timer. The only downside to this second approach is that it will also include the cleanup of the partition writers. I'm not sure whether that should be included in this time computation. > Time to cleanup spilled shuffle files not included in shuffle write time > > > Key: SPARK-5845 > URL: https://issues.apache.org/jira/browse/SPARK-5845 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kay Ousterhout >Assignee: Ilya Ganelin >Priority: Minor > > When the disk is contended, I've observed cases when it takes as long as 7 > seconds to clean up all of the intermediate spill files for a shuffle (when > using the sort based shuffle, but bypassing merging because there are <=200 > shuffle partitions). This is even when the shuffle data is non-huge (152MB > written from one of the tasks where I observed this). This is effectively > part of the shuffle write time (because it's a necessary side effect of > writing data to disk) so should be added to the shuffle write time to > facilitate debugging. 
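Option (b) above boils down to wrapping the cleanup call in a timer and folding the elapsed time into the shuffle write metric. A hedged sketch; `sorter_stop` and the metrics dict are stand-ins for `SortShuffleWriter`'s `sorter.stop()` and the TaskContext's shuffle metrics:

```python
import time

def stop_with_timing(sorter_stop, metrics):
    """Time the spill-file cleanup and charge it to shuffle write time."""
    start = time.perf_counter_ns()
    try:
        sorter_stop()  # deletes the intermediate spill files
    finally:
        # includes partition-writer teardown, as discussed in the comments
        metrics["shuffle_write_time_ns"] += time.perf_counter_ns() - start
```

The `finally` keeps the metric honest even if cleanup itself throws, which matters precisely in the contended-disk cases the issue describes.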
[jira] [Created] (SPARK-5994) Python DataFrame documentation fixes
Reynold Xin created SPARK-5994: -- Summary: Python DataFrame documentation fixes Key: SPARK-5994 URL: https://issues.apache.org/jira/browse/SPARK-5994 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Davies Liu Priority: Blocker select empty should NOT be the same as select(*). make sure selectExpr is behaving the same. join param documentation link to source doesn't work in jekyll generated file cross reference of columns (i.e. enabling linking) show(): move df example before df.show() move tests in SQLContext out of docstring otherwise doc is too long Column.desc and .asc doesn't have any documentation in documentation, sort functions.* (and add the functions defined in the dict automatically) if possible, add index for all functions and classes -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3956) Python API for Distributed Matrix
[ https://issues.apache.org/jira/browse/SPARK-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3956: - Assignee: (was: Davies Liu) > Python API for Distributed Matrix > - > > Key: SPARK-3956 > URL: https://issues.apache.org/jira/browse/SPARK-3956 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Davies Liu > > Python API for distributed matrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5993) Published Kafka-assembly JAR was empty in 1.3.0-RC1
[ https://issues.apache.org/jira/browse/SPARK-5993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335736#comment-14335736 ] Apache Spark commented on SPARK-5993: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/4753 > Published Kafka-assembly JAR was empty in 1.3.0-RC1 > --- > > Key: SPARK-5993 > URL: https://issues.apache.org/jira/browse/SPARK-5993 > Project: Spark > Issue Type: Bug > Components: Build, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > This is because the maven build generated two JARs: > 1. an empty JAR file (since kafka-assembly has no code of its own) > 2. an assembly JAR file containing everything, generated in a different location from 1 > The maven publishing plugin uploaded 1 and not 2. > Instead, if 2 is not configured to be generated in a different location, there is > only one JAR containing everything, which gets published.
[jira] [Created] (SPARK-5993) Published Kafka-assembly JAR was empty in 1.3.0-RC1
Tathagata Das created SPARK-5993: Summary: Published Kafka-assembly JAR was empty in 1.3.0-RC1 Key: SPARK-5993 URL: https://issues.apache.org/jira/browse/SPARK-5993 Project: Spark Issue Type: Bug Components: Build, Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker This is because the maven build generated two JARs: 1. an empty JAR file (since kafka-assembly has no code of its own) 2. an assembly JAR file containing everything, generated in a different location from 1. The maven publishing plugin uploaded 1 and not 2. Instead, if 2 is not configured to be generated in a different location, there is only one JAR containing everything, which gets published.
[jira] [Updated] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5992: - Description: Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm. (was: Locality Sensitive Hashing (LSH) would be very useful for ) > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
Joseph K. Bradley created SPARK-5992: Summary: Locality Sensitive Hashing (LSH) for MLlib Key: SPARK-5992 URL: https://issues.apache.org/jira/browse/SPARK-5992 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Locality Sensitive Hashing (LSH) would be very useful for -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5751) Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes times out
[ https://issues.apache.org/jira/browse/SPARK-5751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-5751. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4720 [https://github.com/apache/spark/pull/4720] > Flaky test: o.a.s.sql.hive.thriftserver.HiveThriftServer2Suite sometimes > times out > -- > > Key: SPARK-5751 > URL: https://issues.apache.org/jira/browse/SPARK-5751 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > Labels: flaky-test > Fix For: 1.3.0 > > > The "Test JDBC query execution" test case times out occasionally, all other > test cases are just fine. The failure output only contains service startup > command line without any log output. Guess somehow the test case misses the > log file path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3956) Python API for Distributed Matrix
[ https://issues.apache.org/jira/browse/SPARK-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3956: - Target Version/s: 1.4.0 (was: 1.2.0) > Python API for Distributed Matrix > - > > Key: SPARK-3956 > URL: https://issues.apache.org/jira/browse/SPARK-3956 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Minor > > Python API for distributed matrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3956) Python API for Distributed Matrix
[ https://issues.apache.org/jira/browse/SPARK-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3956: - Component/s: MLlib > Python API for Distributed Matrix > - > > Key: SPARK-3956 > URL: https://issues.apache.org/jira/browse/SPARK-3956 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Minor > > Python API for distributed matrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3956) Python API for Distributed Matrix
[ https://issues.apache.org/jira/browse/SPARK-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3956: - Priority: Major (was: Minor) > Python API for Distributed Matrix > - > > Key: SPARK-3956 > URL: https://issues.apache.org/jira/browse/SPARK-3956 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Davies Liu >Assignee: Davies Liu > > Python API for distributed matrix -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema
[ https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3851: --- Fix Version/s: 1.3.0 > Support for reading parquet files with different but compatible schema > -- > > Key: SPARK-3851 > URL: https://issues.apache.org/jira/browse/SPARK-3851 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.3.0 > > > Right now it is required that all of the parquet files have the same schema. > It would be nice to support some safe subset of cases where the schemas of > files is different. For example: > - Adding and removing nullable columns. > - Widening types (a column that is of both Int and Long type) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5991) Python API for ML model import/export
Joseph K. Bradley created SPARK-5991: Summary: Python API for ML model import/export Key: SPARK-5991 URL: https://issues.apache.org/jira/browse/SPARK-5991 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Critical Many ML models support save/load in Scala and Java. The Python API needs this. It should mostly be a simple matter of calling the JVM methods for save/load, except for models which are stored in Python (e.g., linear models). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5990) Model import/export for IsotonicRegression
Joseph K. Bradley created SPARK-5990: Summary: Model import/export for IsotonicRegression Key: SPARK-5990 URL: https://issues.apache.org/jira/browse/SPARK-5990 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Add save/load for IsotonicRegressionModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5985) sortBy -> orderBy in Python
[ https://issues.apache.org/jira/browse/SPARK-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335693#comment-14335693 ] Apache Spark commented on SPARK-5985: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4752 > sortBy -> orderBy in Python > --- > > Key: SPARK-5985 > URL: https://issues.apache.org/jira/browse/SPARK-5985 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.0 > > > Seems like a mistake that we included sortBy but not orderBy in Python > DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5989) Model import/export for LDAModel
Joseph K. Bradley created SPARK-5989: Summary: Model import/export for LDAModel Key: SPARK-5989 URL: https://issues.apache.org/jira/browse/SPARK-5989 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Add save/load for LDAModel and its local and distributed variants. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5988) Model import/export for PowerIterationClusteringModel
Joseph K. Bradley created SPARK-5988: Summary: Model import/export for PowerIterationClusteringModel Key: SPARK-5988 URL: https://issues.apache.org/jira/browse/SPARK-5988 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Add save/load for PowerIterationClusteringModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5987) Model import/export for GaussianMixtureModel
Joseph K. Bradley created SPARK-5987: Summary: Model import/export for GaussianMixtureModel Key: SPARK-5987 URL: https://issues.apache.org/jira/browse/SPARK-5987 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Support save/load for GaussianMixtureModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5986) Model import/export for KMeansModel
Joseph K. Bradley created SPARK-5986: Summary: Model import/export for KMeansModel Key: SPARK-5986 URL: https://issues.apache.org/jira/browse/SPARK-5986 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Support save/load for KMeansModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5985) sortBy -> orderBy in Python
Reynold Xin created SPARK-5985: -- Summary: sortBy -> orderBy in Python Key: SPARK-5985 URL: https://issues.apache.org/jira/browse/SPARK-5985 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Seems like a mistake that we included sortBy but not orderBy in Python DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
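The fix for SPARK-5985 is essentially to expose the same sort operation under both names. A minimal Python sketch of that aliasing pattern - this DataFrame class is a hypothetical stand-in, not PySpark's:

```python
class DataFrame:
    """Hypothetical stand-in showing how sortBy/orderBy can share one implementation."""
    def __init__(self, rows):
        self.rows = rows

    def sort(self, key):
        # Return a new frame rather than mutating in place.
        return DataFrame(sorted(self.rows, key=key))

    # Expose the same method under both names so the Python API
    # matches the Scala DataFrame API.
    sortBy = sort
    orderBy = sort

df = DataFrame([3, 1, 2])
assert df.orderBy(lambda x: x).rows == [1, 2, 3]
assert df.sortBy(lambda x: x).rows == [1, 2, 3]
```

Because `sortBy` and `orderBy` are bound to the same function object, the two names can never drift apart in behavior.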
[jira] [Commented] (SPARK-5845) Time to cleanup spilled shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335676#comment-14335676 ] Kay Ousterhout commented on SPARK-5845: --- [~pwendell] that's exactly what I meant -- I just updated the JIRA description. > Time to cleanup spilled shuffle files not included in shuffle write time > > > Key: SPARK-5845 > URL: https://issues.apache.org/jira/browse/SPARK-5845 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kay Ousterhout >Assignee: Ilya Ganelin >Priority: Minor > > When the disk is contended, I've observed cases when it takes as long as 7 > seconds to clean up all of the intermediate spill files for a shuffle (when > using the sort based shuffle, but bypassing merging because there are <=200 > shuffle partitions). This is even when the shuffle data is non-huge (152MB > written from one of the tasks where I observed this). This is effectively > part of the shuffle write time (because it's a necessary side effect of > writing data to disk) so should be added to the shuffle write time to > facilitate debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5845) Time to cleanup spilled shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-5845: -- Summary: Time to cleanup spilled shuffle files not included in shuffle write time (was: Time to cleanup intermediate shuffle files not included in shuffle write time) > Time to cleanup spilled shuffle files not included in shuffle write time > > > Key: SPARK-5845 > URL: https://issues.apache.org/jira/browse/SPARK-5845 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kay Ousterhout >Assignee: Ilya Ganelin >Priority: Minor > > When the disk is contended, I've observed cases when it takes as long as 7 > seconds to clean up all of the intermediate spill files for a shuffle (when > using the sort based shuffle, but bypassing merging because there are <=200 > shuffle partitions). This is even when the shuffle data is non-huge (152MB > written from one of the tasks where I observed this). This is effectively > part of the shuffle write time (because it's a necessary side effect of > writing data to disk) so should be added to the shuffle write time to > facilitate debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5962) [MLLIB] Python support for Power Iteration Clustering
[ https://issues.apache.org/jira/browse/SPARK-5962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5962: - Assignee: Stephen Boesch > [MLLIB] Python support for Power Iteration Clustering > - > > Key: SPARK-5962 > URL: https://issues.apache.org/jira/browse/SPARK-5962 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Stephen Boesch >Assignee: Stephen Boesch > Labels: python > Original Estimate: 168h > Remaining Estimate: 168h > > Add python support for the Power Iteration Clustering feature. Here is a > fragment of the python API as we plan to implement it: > /** >* Java stub for Python mllib PowerIterationClustering.run() >*/ > def trainPowerIterationClusteringModel( > data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)], > k: Int, > maxIterations: Int, > runs: Int, > initializationMode: String, > seed: java.lang.Long): PowerIterationClusteringModel = { > val picAlg = new PowerIterationClustering() > .setK(k) > .setMaxIterations(maxIterations) > try { > picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK)) > } finally { > data.rdd.unpersist(blocking = false) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5963) [MLLIB] Python support for Power Iteration Clustering
[ https://issues.apache.org/jira/browse/SPARK-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-5963. Resolution: Duplicate Assignee: (was: Stephen Boesch) > [MLLIB] Python support for Power Iteration Clustering > - > > Key: SPARK-5963 > URL: https://issues.apache.org/jira/browse/SPARK-5963 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Stephen Boesch > Original Estimate: 168h > Remaining Estimate: 168h > > Add python support for the Power Iteration Clustering feature. Here is a > fragment of the python API as we plan to implement it: > /** >* Java stub for Python mllib PowerIterationClustering.run() >*/ > def trainPowerIterationClusteringModel( > data: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)], > k: Int, > maxIterations: Int, > runs: Int, > initializationMode: String, > seed: java.lang.Long): PowerIterationClusteringModel = { > val picAlg = new PowerIterationClustering() > .setK(k) > .setMaxIterations(maxIterations) > try { > picAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK)) > } finally { > data.rdd.unpersist(blocking = false) > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5904) DataFrame methods with varargs do not work in Java
[ https://issues.apache.org/jira/browse/SPARK-5904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335674#comment-14335674 ] Apache Spark commented on SPARK-5904: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4751 > DataFrame methods with varargs do not work in Java > -- > > Key: SPARK-5904 > URL: https://issues.apache.org/jira/browse/SPARK-5904 > Project: Spark > Issue Type: Sub-task > Components: Java API, SQL >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Reynold Xin >Priority: Blocker > Labels: DataFrame > Fix For: 1.3.0 > > > DataFrame methods with varargs fail when called from Java due to a bug in > Scala. > This can be produced by, e.g., modifying the end of the example > ml.JavaSimpleParamsExample in the master branch: > {code} > DataFrame results = model2.transform(test); > results.printSchema(); // works > results.collect(); // works > results.filter("label > 0.0").count(); // works > for (Row r: results.select("features", "label", "myProbability", > "prediction").collect()) { // fails on select > System.out.println("(" + r.get(0) + ", " + r.get(1) + ") -> prob=" + > r.get(2) > + ", prediction=" + r.get(3)); > } > {code} > I have also tried groupBy and found that failed too. 
> The error looks like this: > {code} > Exception in thread "main" java.lang.AbstractMethodError: > org.apache.spark.sql.DataFrameImpl.groupBy(Ljava/lang/String;[Ljava/lang/String;)Lorg/apache/spark/sql/GroupedData; > at > org.apache.spark.examples.ml.JavaSimpleParamsExample.main(JavaSimpleParamsExample.java:108) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} > The error appears to be from this Scala bug with using varargs in an abstract > method: > [https://issues.scala-lang.org/browse/SI-9013] > My current plan is to move the implementations of the methods with varargs > from DataFrameImpl to DataFrame. > However, this may cause issues with IncomputableColumn---feedback?? > Thanks to [~joshrosen] for figuring the bug and fix out! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335667#comment-14335667 ] Patrick Wendell commented on SPARK-5845: [~kayousterhout] did you mean the time required to delete spill files used during aggregation? For the shuffle files themselves, they are deleted asynchronously, as [~ilganeli] has mentioned. > Time to cleanup intermediate shuffle files not included in shuffle write time > - > > Key: SPARK-5845 > URL: https://issues.apache.org/jira/browse/SPARK-5845 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kay Ousterhout >Assignee: Ilya Ganelin >Priority: Minor > > When the disk is contended, I've observed cases when it takes as long as 7 > seconds to clean up all of the intermediate spill files for a shuffle (when > using the sort based shuffle, but bypassing merging because there are <=200 > shuffle partitions). This is even when the shuffle data is non-huge (152MB > written from one of the tasks where I observed this). This is effectively > part of the shuffle write time (because it's a necessary side effect of > writing data to disk) so should be added to the shuffle write time to > facilitate debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335646#comment-14335646 ] Ilya Ganelin edited comment on SPARK-5845 at 2/24/15 11:19 PM: --- If I understand correctly, the file cleanup happens in IndexShuffleBlockManager:::removeDataByMap(), which is called from either the SortShuffleManager or the SortShuffleWriter. The problem is that these classes do not have any knowledge of the currently collected metrics. Furthermore, the disk cleanup is, unless configured in the SparkConf, triggered asynchronously via the RemoveShuffle message so there doesn't appear to be a straightforward way to provide a set of metrics to be updated. Do you have any suggestions for getting around this? Please let me know, thank you. was (Author: ilganeli): If I understand correctly, the file cleanup happens in IndexShuffleBlockManager:::removeDataByMap(), which is called from either the SortShuffleManager or the SortShuffleWriter. The problem is that these classes do not have any knowledge of the currently collected metrics. Furthermore, the disk cleanup is, unless configured in the SparkConf, triggered asynchronously via the RemoveShuffle() message so there doesn't appear to be a straightforward way to provide a set of metrics to be updated. Do you have any suggestions for getting around this? Please let me know, thank you. > Time to cleanup intermediate shuffle files not included in shuffle write time > - > > Key: SPARK-5845 > URL: https://issues.apache.org/jira/browse/SPARK-5845 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kay Ousterhout >Assignee: Ilya Ganelin >Priority: Minor > > When the disk is contended, I've observed cases when it takes as long as 7 > seconds to clean up all of the intermediate spill files for a shuffle (when > using the sort based shuffle, but bypassing merging because there are <=200 > shuffle partitions). 
This is even when the shuffle data is non-huge (152MB > written from one of the tasks where I observed this). This is effectively > part of the shuffle write time (because it's a necessary side effect of > writing data to disk) so should be added to the shuffle write time to > facilitate debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5845) Time to cleanup intermediate shuffle files not included in shuffle write time
[ https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335646#comment-14335646 ] Ilya Ganelin commented on SPARK-5845: - If I understand correctly, the file cleanup happens in IndexShuffleBlockManager.removeDataByMap(), which is called from either the SortShuffleManager or the SortShuffleWriter. The problem is that these classes do not have any knowledge of the currently collected metrics. Furthermore, the disk cleanup is, unless configured otherwise in the SparkConf, triggered asynchronously via the RemoveShuffle() message, so there doesn't appear to be a straightforward way to provide a set of metrics to be updated. Do you have any suggestions for getting around this? Please let me know, thank you. > Time to cleanup intermediate shuffle files not included in shuffle write time > - > > Key: SPARK-5845 > URL: https://issues.apache.org/jira/browse/SPARK-5845 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kay Ousterhout >Assignee: Ilya Ganelin >Priority: Minor > > When the disk is contended, I've observed cases when it takes as long as 7 > seconds to clean up all of the intermediate spill files for a shuffle (when > using the sort based shuffle, but bypassing merging because there are <=200 > shuffle partitions). This is even when the shuffle data is non-huge (152MB > written from one of the tasks where I observed this). This is effectively > part of the shuffle write time (because it's a necessary side effect of > writing data to disk) so should be added to the shuffle write time to > facilitate debugging.
[jira] [Resolved] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-5436. -- Resolution: Fixed Fix Version/s: 1.4.0 Target Version/s: 1.4.0 > Validate GradientBoostedTrees during training > - > > Key: SPARK-5436 > URL: https://issues.apache.org/jira/browse/SPARK-5436 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > Fix For: 1.4.0 > > > For Gradient Boosting, it would be valuable to compute test error on a > separate validation set during training. That way, training could stop early > based on the test error (or some other metric specified by the user). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335637#comment-14335637 ] Brennon York commented on SPARK-4123: - Gents, need a bit of input on this one. Looks like, to get the expected output that [~pwendell] showed, I would need to run a {{mvn clean ... install}} before I could do any sub-project traversal ([SO reference|https://stackoverflow.com/questions/12223754/how-to-maven-sub-projects-with-inter-dependencies] with similar issue). I've seen this issue in [SPARK-3355|https://github.com/apache/spark/pull/4734] as well and, in that case, just avoided it. I've retested on my machine by running the above lines without installation, then with installation, then clearing org/apache/spark from my ~/.m2 directory and without again. As expected, it failed, worked, and then failed again. So... my question is how does Jenkins store state wrt the local repository? I'm wondering if there might not be another way to grab this information, but if we use Maven with the above command and we need to install spark into a local directory I can imagine build failures everywhere, esp. if we clear / update that directory. Thoughts? cc [~shaneknapp] for possible Jenkins advice > Show new dependencies added in pull requests > > > Key: SPARK-4123 > URL: https://issues.apache.org/jira/browse/SPARK-4123 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Priority: Critical > > We should inspect the classpath of Spark's assembly jar for every pull > request. This only takes a few seconds in Maven and it will help weed out > dependency changes from the master branch. Ideally we'd post any dependency > changes in the pull request message. 
> {code} > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath > $ git checkout apache/master > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath > $ diff my-classpath master-classpath > < chill-java-0.3.6.jar > < chill_2.10-0.3.6.jar > --- > > chill-java-0.5.0.jar > > chill_2.10-0.5.0.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5984) TimSort broken
Reynold Xin created SPARK-5984: -- Summary: TimSort broken Key: SPARK-5984 URL: https://issues.apache.org/jira/browse/SPARK-5984 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.2.0, 1.1.1, 1.1.0, 1.3.0 Reporter: Reynold Xin Assignee: Aaron Davidson Priority: Minor See http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/ Our TimSort is based on Android's TimSort, which is broken in some corner case. Marking it minor as this problem exists for almost all TimSort implementations out there, including Android, OpenJDK, Python, and it hasn't manifested itself in practice yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
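The bug in the linked article is a violation of TimSort's run-stack invariant: the original merge-collapse check inspected only the top three runs, which is not enough to keep the invariant true further down the stack. The published fix also checks one level deeper. A minimal Python sketch of the corrected collapse logic, operating on run lengths only (Spark's actual implementation is in Java and mirrors Android's):

```python
def merge_collapse(runs):
    """Merge runs until the TimSort run-stack invariant holds.

    `runs` is a list of run lengths with the top of the stack at the end.
    This follows the corrected check from the "TimSort is broken" analysis:
    besides comparing the top three runs, it also verifies the invariant
    one level deeper (runs[n-2] vs. runs[n-1] + runs[n]).
    """
    while len(runs) > 1:
        n = len(runs) - 2
        if (n > 0 and runs[n - 1] <= runs[n] + runs[n + 1]) or \
           (n > 1 and runs[n - 2] <= runs[n - 1] + runs[n]):
            # Merge the smaller neighbor of runs[n] first.
            if runs[n - 1] < runs[n + 1]:
                n -= 1
            runs[n:n + 2] = [runs[n] + runs[n + 1]]
        elif runs[n] <= runs[n + 1]:
            runs[n:n + 2] = [runs[n] + runs[n + 1]]
        else:
            break
```

Pushing run lengths one at a time and collapsing after each push (as TimSort does) leaves a stack where every run is longer than the sum of the two above it, which is what the broken versions failed to guarantee.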
[jira] [Resolved] (SPARK-5973) zip two rdd with AutoBatchedSerializer will fail
[ https://issues.apache.org/jira/browse/SPARK-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-5973. -- Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 > zip two rdd with AutoBatchedSerializer will fail > > > Key: SPARK-5973 > URL: https://issues.apache.org/jira/browse/SPARK-5973 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0, 1.2.1 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.3.0, 1.2.2 > > > zip two rdd with AutoBatchedSerializer will fail, this bug was introduced by > SPARK-4841 > {code} > >> a.zip(b).count() > 15/02/24 12:11:56 ERROR PythonRDD: Python worker exited unexpectedly (crashed) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main > process() > File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in > process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in > pipeline_func > return func(split, prev_func(split, iterator)) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in > pipeline_func > return func(split, prev_func(split, iterator)) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 270, in func > return f(iterator) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in > > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > File "/Users/davies/work/spark/python/pyspark/serializers.py", line 306, in > load_stream > " in pair: (%d, %d)" % (len(keys), len(vals))) > ValueError: Can not deserialize RDD with different number of items in pair: > (123, 64) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, 
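The `ValueError` above is a batch-alignment problem: `zip` pairs one deserialized batch from each stream, but `AutoBatchedSerializer` grows batch sizes dynamically, so the two sides can disagree (123 vs. 64 items). A hedged, Spark-free illustration of the mismatch and of the obvious remedy, rebatching both streams to a fixed size (this is not Spark's actual fix, just the alignment idea):

```python
def rebatch(batches, size):
    """Flatten variable-size batches and re-emit fixed-size batches."""
    buf = []
    for batch in batches:
        buf.extend(batch)
        while len(buf) >= size:
            yield buf[:size]
            buf = buf[size:]
    if buf:
        yield buf

# Two streams of the same 10 items, batched differently, as an
# auto-batching serializer might produce on each side of a zip:
left = [[0, 1, 2], [3, 4, 5, 6, 7, 8, 9]]
right = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# Pairing batch-by-batch fails (3 items vs. 5 items); after rebatching
# both sides to the same size, every pair of batches lines up.
pairs = list(zip(rebatch(left, 4), rebatch(right, 4)))
assert all(len(a) == len(b) for a, b in pairs)
```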
[jira] [Updated] (SPARK-5973) zip two rdd with AutoBatchedSerializer will fail
[ https://issues.apache.org/jira/browse/SPARK-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5973: - Assignee: Davies Liu > zip two rdd with AutoBatchedSerializer will fail > > > Key: SPARK-5973 > URL: https://issues.apache.org/jira/browse/SPARK-5973 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0, 1.2.1 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > zip two rdd with AutoBatchedSerializer will fail, this bug was introduced by > SPARK-4841 > {code} > >> a.zip(b).count() > 15/02/24 12:11:56 ERROR PythonRDD: Python worker exited unexpectedly (crashed) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main > process() > File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in > process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in > pipeline_func > return func(split, prev_func(split, iterator)) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in > pipeline_func > return func(split, prev_func(split, iterator)) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 270, in func > return f(iterator) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in > > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > File "/Users/davies/work/spark/python/pyspark/serializers.py", line 306, in > load_stream > " in pair: (%d, %d)" % (len(keys), len(vals))) > ValueError: Can not deserialize RDD with different number of items in pair: > (123, 64) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional 
[jira] [Commented] (SPARK-5974) Add save/load to examples in ML guide
[ https://issues.apache.org/jira/browse/SPARK-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335609#comment-14335609 ] Apache Spark commented on SPARK-5974: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/4750 > Add save/load to examples in ML guide > - > > Key: SPARK-5974 > URL: https://issues.apache.org/jira/browse/SPARK-5974 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > We should add save/load (model import/export) to the Scala and Java code > examples in the ML guide. This is not yet supported in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5980) Add GradientBoostedTrees Python examples to ML guide
[ https://issues.apache.org/jira/browse/SPARK-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335610#comment-14335610 ] Apache Spark commented on SPARK-5980: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/4750 > Add GradientBoostedTrees Python examples to ML guide > > > Key: SPARK-5980 > URL: https://issues.apache.org/jira/browse/SPARK-5980 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > GBT now has a Python API and should have examples in the ML guide -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335577#comment-14335577 ] Alex commented on SPARK-2344: - You're right of course - I did not mean to leave it there (only for now). Sorry that I did not answer you earlier; I have a test on Thursday and I am spending all my time studying. After Thursday I will take a look at your code. Alex > Add Fuzzy C-Means algorithm to MLlib > > > Key: SPARK-2344 > URL: https://issues.apache.org/jira/browse/SPARK-2344 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Alex >Priority: Minor > Labels: clustering > Original Estimate: 1m > Remaining Estimate: 1m > > I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib. > FCM is very similar to K-Means, which is already implemented; they > differ only in the degree of relationship each point has with each cluster > (in FCM the relationship is in a range of [0..1], whereas in K-Means it is 0/1). > As part of the implementation I would like to: > - create a base class for K-Means and FCM > - implement the relationship for each algorithm differently (in its class) > I'd like this to be assigned to me.
[jira] [Commented] (SPARK-5982) Remove Local Read Time
[ https://issues.apache.org/jira/browse/SPARK-5982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335576#comment-14335576 ] Apache Spark commented on SPARK-5982: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/4749 > Remove Local Read Time > -- > > Key: SPARK-5982 > URL: https://issues.apache.org/jira/browse/SPARK-5982 > Project: Spark > Issue Type: Bug >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Blocker > > LocalReadTime was added to TaskMetrics for the 1.3.0 release. However, this > time is actually only a small subset of the local read time, because local > shuffle files are memory mapped, so most of the read time occurs later, as > data is read from the memory mapped files and the data actually gets read > from disk. We should remove this before the 1.3.0 release, so we never > expose this incomplete info. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335573#comment-14335573 ] Nicholas Chammas commented on SPARK-5312: - It's something to consider I guess. Spark provides strong guarantees about API stability and the like. Making it easy for reviewers to catch changes to public classes is supposed to help with that. What we have is perhaps good for now, and perhaps the foreseeable future. So maybe we should resolve this issue for now and just keep it in mind in the future. cc [~pwendell] > Use sbt to detect new or changed public classes in PRs > -- > > Key: SPARK-5312 > URL: https://issues.apache.org/jira/browse/SPARK-5312 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > We currently use an [unwieldy grep/sed > contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] > to detect new public classes in PRs. > Apparently, sbt lets you get a list of public classes [much more > directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via > {{show compile:discoveredMainClasses}}. We should use that instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5983) Don't respond to HTTP TRACE in HTTP-based UIs
Sean Owen created SPARK-5983: Summary: Don't respond to HTTP TRACE in HTTP-based UIs Key: SPARK-5983 URL: https://issues.apache.org/jira/browse/SPARK-5983 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Sean Owen Priority: Minor This was flagged a while ago during a routine security scan: the HTTP-based Spark services respond to an HTTP TRACE command. This is basically an HTTP verb that has no practical use, and has a pretty theoretical chance of being an exploit vector. It is flagged as a security issue by one common tool, however. Spark's HTTP services are based on Jetty, which by default does not enable TRACE (like Tomcat). However, the services do reply to TRACE requests. I think it is because the use of Jetty is pretty 'raw' and does not enable much of the default additional configuration you might get by using Jetty as a standalone server. I know that it is at least possible to stop the reply to TRACE with a few extra lines of code, so I think it is worth shutting off TRACE requests. Although the security risk is quite theoretical, it should be easy to fix and bring the Spark services into line with the common default of HTTP servers today. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
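The "few extra lines" needed to refuse TRACE can be illustrated outside Jetty too. A hedged Python sketch of the same idea (this is not Spark's actual fix, which would go through Jetty configuration): answer normal requests, but reply 405 Method Not Allowed to TRACE instead of echoing the request back.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class NoTraceHandler(BaseHTTPRequestHandler):
    """Serves GET normally but refuses HTTP TRACE with 405."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_TRACE(self):
        # Refuse TRACE explicitly rather than echoing the request back
        # (echoing is the behavior security scanners flag).
        self.send_response(405)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass


server = HTTPServer(("127.0.0.1", 0), NoTraceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("TRACE", "/")
trace_status = conn.getresponse().status  # 405: TRACE is shut off
conn.close()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/")
resp = conn.getresponse()
get_status, get_body = resp.status, resp.read()  # normal traffic unaffected
conn.close()
server.shutdown()
```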
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335565#comment-14335565 ] Brennon York commented on SPARK-5312: - New thoughts... So I checked out the link which seems like it would def. do what we want it to do, but at the cost of much more complexity (unless I'm missing something). It looks like the easiest way to get the tool to work would be to add a separate scala program that would need to be executed (in the {{run-tests-jenkins}} script) producing the public classes. Pros? It should get us exactly the classes we want, then we could {{diff}} that against the master branch and be on our way. Cons? An entire other scala program (again, unless I'm missing a way to integrate it) that would need to be maintained / managed. Not sure how important this is given the amount of code added to maintain into the project. Thoughts? > Use sbt to detect new or changed public classes in PRs > -- > > Key: SPARK-5312 > URL: https://issues.apache.org/jira/browse/SPARK-5312 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > We currently use an [unwieldy grep/sed > contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] > to detect new public classes in PRs. > Apparently, sbt lets you get a list of public classes [much more > directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via > {{show compile:discoveredMainClasses}}. We should use that instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
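However the class listing is produced (sbt's `discoveredMainClasses` or the existing grep/sed contraption), the comparison step itself is just a set difference between the PR branch's list and master's. A small hedged sketch of that step, with made-up class names for illustration:

```python
def classify_changes(master_classes, pr_classes):
    """Return (added, removed) public classes between two listings."""
    master, pr = set(master_classes), set(pr_classes)
    return sorted(pr - master), sorted(master - pr)

# Hypothetical listings, one per branch:
added, removed = classify_changes(
    ["org.apache.spark.A", "org.apache.spark.B"],
    ["org.apache.spark.A", "org.apache.spark.C"],
)
# added   -> ["org.apache.spark.C"]
# removed -> ["org.apache.spark.B"]
```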
[jira] [Updated] (SPARK-5981) pyspark ML models should support predict/transform on vector within map
[ https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5981: - Description: Currently, most Python models only have limited support for single-vector prediction. E.g., one can call {code}model.predict(myFeatureVector){code} for a single instance, but that fails within a map for Python ML models and transformers which use JavaModelWrapper: {code} data.map(lambda features: model.predict(features)) {code} This fails because JavaModelWrapper.call uses the SparkContext (within the transformation). (It works for linear models, which do prediction within Python.) Supporting prediction within a map would require storing the model and doing prediction/transformation within Python. was: Many Python ML models and transformers use JavaModelWrapper to call methods in the JVM, such as predict() and transform(). It is common to write: {code} data.map(lambda features: model.predict(features)) {code} This fails because JavaModelWrapper.call uses the SparkContext (within the transformation). Note: It is possible to do a workaround using batch predict if models/transformers support it: {code} model.predict(data) {code} However, this is still a major problem. > pyspark ML models should support predict/transform on vector within map > --- > > Key: SPARK-5981 > URL: https://issues.apache.org/jira/browse/SPARK-5981 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Currently, most Python models only have limited support for single-vector > prediction. > E.g., one can call {code}model.predict(myFeatureVector){code} for a single > instance, but that fails within a map for Python ML models and transformers > which use JavaModelWrapper: > {code} > data.map(lambda features: model.predict(features)) > {code} > This fails because JavaModelWrapper.call uses the SparkContext (within the > transformation). 
(It works for linear models, which do prediction within > Python.) > Supporting prediction within a map would require storing the model and doing > prediction/transformation within Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
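The note that linear models work because they do prediction within Python suggests the shape of the fix: hold only plain data (weights, intercept) on the Python side so the model can be captured in a closure and evaluated inside a transformation without calling back into the JVM. A hedged, Spark-free sketch of that pattern (the class name is illustrative, not a PySpark API):

```python
class PyLinearModel:
    """Pure-Python stand-in for a linear classifier's predict().

    Because it holds only plain data, an instance serializes cleanly into
    a task closure and never needs the SparkContext at prediction time.
    """

    def __init__(self, weights, intercept):
        self.weights = list(weights)
        self.intercept = intercept

    def predict(self, features):
        margin = sum(w * x for w, x in zip(self.weights, features))
        return 1 if margin + self.intercept > 0 else 0


model = PyLinearModel(weights=[2.0, -1.0], intercept=-0.5)
data = [[1.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

# With a real RDD this would be data.map(model.predict); the plain
# built-in map() shows the same closure-capture pattern.
preds = list(map(model.predict, data))
```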
[jira] [Updated] (SPARK-5981) pyspark ML models fail during predict/transform on vector within map
[ https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5981: - Target Version/s: 1.4.0 (was: 1.3.0) > pyspark ML models fail during predict/transform on vector within map > > > Key: SPARK-5981 > URL: https://issues.apache.org/jira/browse/SPARK-5981 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Many Python ML models and transformers use JavaModelWrapper to call methods > in the JVM, such as predict() and transform(). It is common to write: > {code} > data.map(lambda features: model.predict(features)) > {code} > This fails because JavaModelWrapper.call uses the SparkContext (within the > transformation). > Note: It is possible to do a workaround using batch predict if > models/transformers support it: > {code} > model.predict(data) > {code} > However, this is still a major problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5981) pyspark ML models fail during predict/transform on vector within map
[ https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5981: - Issue Type: Improvement (was: Bug) > pyspark ML models fail during predict/transform on vector within map > > > Key: SPARK-5981 > URL: https://issues.apache.org/jira/browse/SPARK-5981 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Many Python ML models and transformers use JavaModelWrapper to call methods > in the JVM, such as predict() and transform(). It is common to write: > {code} > data.map(lambda features: model.predict(features)) > {code} > This fails because JavaModelWrapper.call uses the SparkContext (within the > transformation). > Note: It is possible to do a workaround using batch predict if > models/transformers support it: > {code} > model.predict(data) > {code} > However, this is still a major problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5981) pyspark ML models fail during predict/transform on vector within map
[ https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5981: - Priority: Major (was: Critical) > pyspark ML models fail during predict/transform on vector within map > > > Key: SPARK-5981 > URL: https://issues.apache.org/jira/browse/SPARK-5981 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Many Python ML models and transformers use JavaModelWrapper to call methods > in the JVM, such as predict() and transform(). It is common to write: > {code} > data.map(lambda features: model.predict(features)) > {code} > This fails because JavaModelWrapper.call uses the SparkContext (within the > transformation). > Note: It is possible to do a workaround using batch predict if > models/transformers support it: > {code} > model.predict(data) > {code} > However, this is still a major problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5981) pyspark ML models should support predict/transform on vector within map
[ https://issues.apache.org/jira/browse/SPARK-5981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5981: - Summary: pyspark ML models should support predict/transform on vector within map (was: pyspark ML models fail during predict/transform on vector within map) > pyspark ML models should support predict/transform on vector within map > --- > > Key: SPARK-5981 > URL: https://issues.apache.org/jira/browse/SPARK-5981 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Many Python ML models and transformers use JavaModelWrapper to call methods > in the JVM, such as predict() and transform(). It is common to write: > {code} > data.map(lambda features: model.predict(features)) > {code} > This fails because JavaModelWrapper.call uses the SparkContext (within the > transformation). > Note: It is possible to do a workaround using batch predict if > models/transformers support it: > {code} > model.predict(data) > {code} > However, this is still a major problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5982) Remove Local Read Time
Kay Ousterhout created SPARK-5982: - Summary: Remove Local Read Time Key: SPARK-5982 URL: https://issues.apache.org/jira/browse/SPARK-5982 Project: Spark Issue Type: Bug Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Blocker LocalReadTime was added to TaskMetrics for the 1.3.0 release. However, this time is actually only a small subset of the local read time, because local shuffle files are memory mapped, so most of the read time occurs later, as data is read from the memory mapped files and the data actually gets read from disk. We should remove this before the 1.3.0 release, so we never expose this incomplete info. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335518#comment-14335518 ] Corey J. Nolet edited comment on SPARK-5140 at 2/24/15 9:50 PM: I have a framework (similar to cascading) where users wire together their RDD transformations in reusable components and the framework wires them up together based on dependencies that they define. There is a notion of sinks- or data outputs and they are also encapsulated in reusable componentry. A sink depends on sources- The sinks are actually what get executed. I wanted to be able to execute them in parallel and just have spark be smart enough to figure out what needs to block (whenever there's a cached rdd) in order to make that possible. We're comparing the same data transformations to MapReduce that was written in JAQL and the single-threaded execution of the sinks is causing absolutely wretched run times. I'm not saying an internal requirement from my framework should necessaily be levied against the spark features- however, this change (seeming like it would only affect the driver side scheduling) would allow the spark context to be truly thread-safe. I've tried running jobs that over-utilize the resources in parallel until one of them fully caches and it really just slows things down and uses too many resources- I can't see why anyone submitting rdds in different threads that depend on cached RDDs wouldn't want those threads to block for the parents but maybe I'm not thinking abstract enough. For now, I'm going to propagate down my data source tree breadth first and no-op those sources that return CachedRDDs so that their children can be scheduled in different threads. was (Author: sonixbp): I have a framework (similar to cascading) where users wire together their RDD transformations in reusable components and the framework wires them up together based on dependencies that they define. 
There is a notion of sinks- or data outputs and they are also encapsulated in reusable componentry. A sink depends on sources- The sinks are actually what get executed. I wanted to be able to execute them in parallel and just have spark be smart enough to figure out what needs to block (whenever there's a cached rdd) in order to make that possible. We're comparing the same data transformations to MapReduce that was written in JAQL and the single-threaded execution of the sinks is causing absolutely wretched run times. I'm not saying an internal requirement from my framework should necessaily be levied against the spark features- however, this would change (seeming like it would only affect the driver side scheduling) would allow the spark context to be truly thread-safe. I've tried running jobs that over-utilize the resources in parallel until one of them fully caches and it really just slows things down and uses too many resources- I can't see why anyone submitting rdds in different threads that depend on cached RDDs wouldn't want those threads to block for the parents but maybe I'm not thinking abstract enough. For now, I'm going to propagate down my data source tree breadth first and no-op those sources that return CachedRDDs so that their children can be scheduled in different threads. > Two RDDs which are scheduled concurrently should be able to wait on parent in > all cases > --- > > Key: SPARK-5140 > URL: https://issues.apache.org/jira/browse/SPARK-5140 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Corey J. Nolet > Labels: features > > Not sure if this would change too much of the internals to be included in the > 1.2.1 but it would be very helpful if it could be. > This ticket is from a discussion between myself and [~ilikerps]. Here's the > result of some testing that [~ilikerps] did: > bq. 
I did some testing as well, and it turns out the "wait for other guy to > finish caching" logic is on a per-task basis, and it only works on tasks that > happen to be executing on the same machine. > bq. Once a partition is cached, we will schedule tasks that touch that > partition on that executor. The problem here, though, is that the cache is in > progress, and so the tasks are still scheduled randomly (or with whatever > locality the data source has), so tasks which end up on different machines > will not see that the cache is already in progress. > {code} > Here was my test, by the way: > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent._ > import scala.concurrent.duration._ > val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i > }).cache() > val futures = (0 until 4).map { _ => Future { rdd.count } } > Await.result(Future.sequence(futures), 120.second) > {code}
[jira] [Created] (SPARK-5981) pyspark ML models fail during predict/transform on vector within map
Joseph K. Bradley created SPARK-5981: Summary: pyspark ML models fail during predict/transform on vector within map Key: SPARK-5981 URL: https://issues.apache.org/jira/browse/SPARK-5981 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Critical Many Python ML models and transformers use JavaModelWrapper to call methods in the JVM, such as predict() and transform(). It is common to write: {code} data.map(lambda features: model.predict(features)) {code} This fails because JavaModelWrapper.call uses the SparkContext (within the transformation). Note: It is possible to do a workaround using batch predict if models/transformers support it: {code} model.predict(data) {code} However, this is still a major problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335518#comment-14335518 ] Corey J. Nolet commented on SPARK-5140: --- I have a framework (similar to cascading) where users wire together their RDD transformations in reusable components and the framework wires them up together based on dependencies that they define. There is a notion of sinks- or data outputs and they are also encapsulated in reusable componentry. A sink depends on sources- The sinks are actually what get executed. I wanted to be able to execute them in parallel and just have spark be smart enough to figure out what needs to block (whenever there's a cached rdd) in order to make that possible. We're comparing the same data transformations to MapReduce that was written in JAQL and the single-threaded execution of the sinks is causing absolutely wretched run times. I'm not saying an internal requirement from my framework should necessaily be levied against the spark features- however, this would change (seeming like it would only affect the driver side scheduling) would allow the spark context to be truly thread-safe. I've tried running jobs that over-utilize the resources in parallel until one of them fully caches and it really just slows things down and uses too many resources- I can't see why anyone submitting rdds in different threads that depend on cached RDDs wouldn't want those threads to block for the parents but maybe I'm not thinking abstract enough. For now, I'm going to propagate down my data source tree breadth first and no-op those sources that return CachedRDDs so that their children can be scheduled in different threads. > Two RDDs which are scheduled concurrently should be able to wait on parent in > all cases > --- > > Key: SPARK-5140 > URL: https://issues.apache.org/jira/browse/SPARK-5140 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Corey J. 
Nolet > Labels: features > > Not sure if this would change too much of the internals to be included in the > 1.2.1 but it would be very helpful if it could be. > This ticket is from a discussion between myself and [~ilikerps]. Here's the > result of some testing that [~ilikerps] did: > bq. I did some testing as well, and it turns out the "wait for other guy to > finish caching" logic is on a per-task basis, and it only works on tasks that > happen to be executing on the same machine. > bq. Once a partition is cached, we will schedule tasks that touch that > partition on that executor. The problem here, though, is that the cache is in > progress, and so the tasks are still scheduled randomly (or with whatever > locality the data source has), so tasks which end up on different machines > will not see that the cache is already in progress. > {code} > Here was my test, by the way: > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent._ > import scala.concurrent.duration._ > val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i > }).cache() > val futures = (0 until 4).map { _ => Future { rdd.count } } > Await.result(Future.sequence(futures), 120.second) > {code} > bq. Note that I run the future 4 times in parallel. I found that the first > run has all tasks take 10 seconds. The second has about 50% of its tasks take > 10 seconds, and the rest just wait for the first stage to finish. The last > two runs have no tasks that take 10 seconds; all wait for the first two > stages to finish. > What we want is the ability to fire off a job and have the DAG figure out > that two RDDs depend on the same parent so that when the children are > scheduled concurrently, the first one to start will activate the parent and > both will wait on the parent. When the parent is done, they will both be able > to finish their work concurrently. We are trying to use this pattern by > having the parent cache results. 
[jira] [Updated] (SPARK-5973) zip two rdd with AutoBatchedSerializer will fail
[ https://issues.apache.org/jira/browse/SPARK-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-5973: -- Description: zip two rdd with AutoBatchedSerializer will fail, this bug was introduced by SPARK-4841 {code} >>> a.zip(b).count() 15/02/24 12:11:56 ERROR PythonRDD: Python worker exited unexpectedly (crashed) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main process() File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in pipeline_func return func(split, prev_func(split, iterator)) File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in pipeline_func return func(split, prev_func(split, iterator)) File "/Users/davies/work/spark/python/pyspark/rdd.py", line 270, in func return f(iterator) File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <lambda> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <genexpr> return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File "/Users/davies/work/spark/python/pyspark/serializers.py", line 306, in load_stream " in pair: (%d, %d)" % (len(keys), len(vals))) ValueError: Can not deserialize RDD with different number of items in pair: (123, 64) {code} was: zip two rdd with AutoBatchedSerializer will fail, this bug was introduced by SPARK-4841 > zip two rdd with AutoBatchedSerializer will fail > > > Key: SPARK-5973 > URL: https://issues.apache.org/jira/browse/SPARK-5973 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0, 1.2.1 >Reporter: Davies Liu >Priority: Blocker > > zip two rdd with AutoBatchedSerializer will fail, this bug was introduced by > SPARK-4841 > {code} > >>> a.zip(b).count() > 15/02/24
12:11:56 ERROR PythonRDD: Python worker exited unexpectedly (crashed) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main > process() > File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in > process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in > pipeline_func > return func(split, prev_func(split, iterator)) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2249, in > pipeline_func > return func(split, prev_func(split, iterator)) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 270, in func > return f(iterator) > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <lambda> > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > File "/Users/davies/work/spark/python/pyspark/rdd.py", line 933, in <genexpr> > return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > File "/Users/davies/work/spark/python/pyspark/serializers.py", line 306, in > load_stream > " in pair: (%d, %d)" % (len(keys), len(vals))) > ValueError: Can not deserialize RDD with different number of items in pair: > (123, 64) > {code}
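The ValueError above comes from zipping two batched streams whose batch sizes disagree. A minimal sketch of the failure mode (a simplified model of PySpark's batched serializers, not the real classes — `batched` and `zip_batches` are illustrative names):

```python
def batched(items, batch_size):
    """Serialize a stream as fixed-size batches, like a batched serializer."""
    batch = []
    for x in items:
        batch.append(x)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def zip_batches(key_stream, val_stream):
    """Pair-deserializer-style zip: each pair of batches must line up."""
    for keys, vals in zip(key_stream, val_stream):
        if len(keys) != len(vals):
            raise ValueError(
                "Can not deserialize RDD with different number of items"
                " in pair: (%d, %d)" % (len(keys), len(vals)))
        for pair in zip(keys, vals):
            yield pair

a = range(200)
b = range(200)
# Same batch size on both sides: works.
ok = list(zip_batches(batched(a, 100), batched(b, 100)))
# An auto-batched serializer may pick different sizes per stream: fails.
try:
    list(zip_batches(batched(a, 123), batched(b, 64)))
except ValueError as e:
    print(e)  # ... different number of items in pair: (123, 64)
```

This is why the bug surfaced with AutoBatchedSerializer: it tunes the batch size dynamically per stream, so the two sides of the zip can no longer be assumed to batch identically.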
[jira] [Updated] (SPARK-5971) Add Mesos support to spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-5971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5971: Description: Right now, spark-ec2 can only launch Spark clusters that use the standalone manager. Adding support for Mesos would be useful mostly for automated performance testing of Spark on Mesos. was: Right now, spark-ec2 can only launching Spark clusters that use the standalone manager. Adding support to launch Spark-on-Mesos clusters would be useful mostly for automated performance testing of Spark on Mesos. > Add Mesos support to spark-ec2 > -- > > Key: SPARK-5971 > URL: https://issues.apache.org/jira/browse/SPARK-5971 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Nicholas Chammas >Priority: Minor > > Right now, spark-ec2 can only launch Spark clusters that use the standalone > manager. > Adding support for Mesos would be useful mostly for automated performance > testing of Spark on Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5952) Failure to lock metastore client in tableExists()
[ https://issues.apache.org/jira/browse/SPARK-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5952. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4746 [https://github.com/apache/spark/pull/4746] > Failure to lock metastore client in tableExists() > - > > Key: SPARK-5952 > URL: https://issues.apache.org/jira/browse/SPARK-5952 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 1.3.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5979) `--packages` should not exclude spark streaming assembly jars for kafka and flume
Burak Yavuz created SPARK-5979: -- Summary: `--packages` should not exclude spark streaming assembly jars for kafka and flume Key: SPARK-5979 URL: https://issues.apache.org/jira/browse/SPARK-5979 Project: Spark Issue Type: Bug Components: Deploy, Spark Submit Affects Versions: 1.3.0 Reporter: Burak Yavuz Priority: Blocker Currently `--packages` has an exclude rule for all dependencies with the groupId `org.apache.spark`, assuming that these are packaged inside the spark-assembly jar. This is not the case, and more fine-grained filtering is required.
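The finer-grained filter the ticket asks for might look like the following sketch (an assumption about the fix, not the actual spark-submit code; the artifact list is illustrative): exclude only the `org.apache.spark` artifacts that the assembly actually provides, so connector jars such as the Kafka and Flume streaming assemblies are still resolved.

```python
# Artifacts assumed to ship inside the spark-assembly jar (illustrative list).
ASSEMBLY_ARTIFACTS = {
    "spark-core", "spark-sql", "spark-streaming", "spark-mllib",
}

def should_exclude(group_id, artifact_id):
    """Exclude only org.apache.spark artifacts provided by the assembly;
    connector jars like spark-streaming-kafka must still be downloaded."""
    if group_id != "org.apache.spark":
        return False
    base = artifact_id.split("_")[0]   # drop the Scala version suffix
    return base in ASSEMBLY_ARTIFACTS

print(should_exclude("org.apache.spark", "spark-core_2.10"))            # True
print(should_exclude("org.apache.spark", "spark-streaming-kafka_2.10")) # False
```

The current behavior corresponds to `return group_id == "org.apache.spark"`, which is exactly what wrongly filters out the Kafka and Flume jars.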
[jira] [Created] (SPARK-5980) Add GradientBoostedTrees Python examples to ML guide
Joseph K. Bradley created SPARK-5980: Summary: Add GradientBoostedTrees Python examples to ML guide Key: SPARK-5980 URL: https://issues.apache.org/jira/browse/SPARK-5980 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor GBT now has a Python API and should have examples in the ML guide.
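For readers unfamiliar with what such examples would demonstrate, here is a toy, pure-Python version of the gradient boosting loop (fitting regression stumps to residuals) — a sketch of the algorithm behind GradientBoostedTrees, not MLlib's implementation or API:

```python
# Toy gradient boosting with regression stumps (illustrative only).
def fit_stump(xs, residuals):
    """Best single-split stump minimizing squared error on the residuals."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def fit_gbt(xs, ys, rounds=50, lr=0.5):
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]   # gradient of L2 loss
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_gbt(xs, ys)
```

The documentation examples the ticket asks for would show the equivalent loop via `pyspark.mllib.tree.GradientBoostedTrees` on an RDD of LabeledPoints.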
[jira] [Commented] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335482#comment-14335482 ] Xiangrui Meng commented on SPARK-2335: -- [~Rusty] For exact k-NN, the problem is complexity. So I would prefer starting from the approximate algorithm directly. Yu's suggestion looks good to me. Let's discuss approximate k-NN further in SPARK-2336. > k-Nearest Neighbor classification and regression for MLLib > -- > > Key: SPARK-2335 > URL: https://issues.apache.org/jira/browse/SPARK-2335 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Brian Gawalt >Priority: Minor > Labels: clustering, features > > The k-Nearest Neighbor model for classification and regression problems is a > simple and intuitive approach, offering a straightforward path to creating > non-linear decision/estimation contours. Its downsides -- high variance > (sensitivity to the known training data set) and computational intensity for > estimating new point labels -- both play to Spark's big data strengths: lots > of data mitigates the variance concern; lots of workers mitigate computational > latency. > We should include kNN models as options in MLLib.
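The complexity concern above is easy to see from the brute-force baseline: every prediction scans the whole training set. A minimal exact k-NN classifier (an illustrative sketch, not MLlib code):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs. Each query costs O(n * d)
    distance computations -- the complexity problem with exact k-NN."""
    nearest = sorted(train,
                     key=lambda p: sum((a - b) ** 2
                                       for a, b in zip(p[0], query)))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.9, 5.2), "b")]
print(knn_predict(train, (0.3, 0.3)))  # "a"
print(knn_predict(train, (5.0, 4.8)))  # "b"
```

With n training points, d features, and q queries, this is O(q * n * d) -- which is why the comment prefers going straight to an approximate algorithm at Spark scale.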
[jira] [Commented] (SPARK-2336) Approximate k-NN Models for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335476#comment-14335476 ] Xiangrui Meng commented on SPARK-2336: -- [~Rusty] Could you provide a summary of your plan? For example: 1. Which paper are you going to follow? Any modifications to the algorithm proposed in the paper? 2. What's the complexity of the algorithm and the expected scalability? 3. What is the trade-off between the approximation error and the cost? How do you want to expose it to users? > Approximate k-NN Models for MLLib > - > > Key: SPARK-2336 > URL: https://issues.apache.org/jira/browse/SPARK-2336 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Brian Gawalt >Priority: Minor > Labels: clustering, features > > After tackling the general k-Nearest Neighbor model as per > https://issues.apache.org/jira/browse/SPARK-2335 , there's an opportunity to > also offer approximate k-Nearest Neighbor. A promising approach would involve > building a kd-tree variant within each partition, a la > http://www.autonlab.org/autonweb/14714.html?branch=1&language=2 > This could offer a simple non-linear ML model that can label new data with > much lower latency than the plain-vanilla kNN versions.
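The per-partition idea reduces to "each partition returns its own local top-k candidates, and merging the candidates gives the global answer". A sketch of that skeleton (illustrative only -- plain lists of lists stand in for an RDD, and a brute-force local search stands in for the per-partition kd-tree the ticket proposes):

```python
import heapq

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def local_knn(points, query, k):
    # In the proposed design this would query a kd-tree built per partition.
    return heapq.nsmallest(k, points, key=lambda p: sq_dist(p, query))

def distributed_knn(partitions, query, k):
    # Each partition contributes at most k candidates; merging those
    # candidates yields the global k nearest neighbors.
    candidates = [p for part in partitions for p in local_knn(part, query, k)]
    return heapq.nsmallest(k, candidates, key=lambda p: sq_dist(p, query))

partitions = [[(0.0, 0.0), (9.0, 9.0)], [(1.0, 1.0), (8.0, 8.0)], [(0.5, 0.5)]]
print(distributed_knn(partitions, (0.0, 0.0), k=2))  # [(0.0, 0.0), (0.5, 0.5)]
```

With an exact local search the merge is exact; the approximation (and the error/cost trade-off asked about in the comment) comes from replacing `local_knn` with an approximate kd-tree lookup.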
[jira] [Commented] (SPARK-5973) zip two rdd with AutoBatchedSerializer will fail
[ https://issues.apache.org/jira/browse/SPARK-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335473#comment-14335473 ] Josh Rosen commented on SPARK-5973: --- Just for searchability's sake, what error message did this fail with before? > zip two rdd with AutoBatchedSerializer will fail > > > Key: SPARK-5973 > URL: https://issues.apache.org/jira/browse/SPARK-5973 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0, 1.2.1 >Reporter: Davies Liu >Priority: Blocker > > zip two rdd with AutoBatchedSerializer will fail, this bug was introduced by > SPARK-4841 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5976) Factors returned by ALS do not have partitioners associated.
[ https://issues.apache.org/jira/browse/SPARK-5976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335464#comment-14335464 ] Apache Spark commented on SPARK-5976: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4748 > Factors returned by ALS do not have partitioners associated. > > > Key: SPARK-5976 > URL: https://issues.apache.org/jira/browse/SPARK-5976 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > The model trained by ALS requires partitioning information to do quick lookups > of user/item factors when making recommendations for individual requests. In > the new implementation, we don't attach partitioners to the factors returned by > ALS, which would cause a performance regression.
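Why a partitioner matters for individual recommendation requests can be shown with a small sketch (illustrative only -- plain dicts stand in for a hash-partitioned factor RDD, not Spark's actual classes):

```python
NUM_PARTITIONS = 4

def partition_of(user_id):
    # The role a Partitioner plays: map a key to its partition.
    return hash(user_id) % NUM_PARTITIONS

# User factors laid out the way a hash partitioner would place them.
partitions = [dict() for _ in range(NUM_PARTITIONS)]
for user_id in range(100):
    partitions[partition_of(user_id)][user_id] = [user_id, 0.5]  # fake factor

def lookup(user_id):
    # Partitioner known: the lookup touches exactly one partition.
    return partitions[partition_of(user_id)].get(user_id)

def lookup_without_partitioner(user_id):
    # Partitioner lost: every partition must be scanned for the key.
    for part in partitions:
        if user_id in part:
            return part[user_id]

print(lookup(42) == lookup_without_partitioner(42))  # True
```

Both lookups return the same factor, but without the partitioner every request degenerates into a scan of all partitions -- the regression the ticket describes.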
[jira] [Commented] (SPARK-5978) Spark examples cannot compile with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-5978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335461#comment-14335461 ] Sean Owen commented on SPARK-5978: -- Pretty similar to https://issues.apache.org/jira/browse/SPARK-3039 ; I imagine a similar profile-based fix is in order. Care to take a shot at that? > Spark examples cannot compile with Hadoop 2 > --- > > Key: SPARK-5978 > URL: https://issues.apache.org/jira/browse/SPARK-5978 > Project: Spark > Issue Type: Bug > Components: Examples, PySpark >Affects Versions: 1.2.0, 1.2.1 >Reporter: Michael Nazario > Labels: hadoop-version > > This is a regression from Spark 1.1.1. > The Spark examples include an example Avro converter for PySpark. > When I was trying to debug a problem, I discovered that even though you can > build with Hadoop 2 for Spark 1.2.0, an HBase dependency depends on Hadoop 1 > somewhere else in the examples code. > An easy fix would be to separate the examples into Hadoop-specific versions. > Another way would be to fix the HBase dependencies so that they don't rely on > Hadoop 1-specific code.
[jira] [Created] (SPARK-5978) Spark examples cannot compile with Hadoop 2
Michael Nazario created SPARK-5978: -- Summary: Spark examples cannot compile with Hadoop 2 Key: SPARK-5978 URL: https://issues.apache.org/jira/browse/SPARK-5978 Project: Spark Issue Type: Bug Components: Examples, PySpark Affects Versions: 1.2.1, 1.2.0 Reporter: Michael Nazario This is a regression from Spark 1.1.1. The Spark examples include an example Avro converter for PySpark. When I was trying to debug a problem, I discovered that even though you can build with Hadoop 2 for Spark 1.2.0, an HBase dependency depends on Hadoop 1 somewhere else in the examples code. An easy fix would be to separate the examples into Hadoop-specific versions. Another way would be to fix the HBase dependencies so that they don't rely on Hadoop 1-specific code.
[jira] [Updated] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3665: - Target Version/s: 1.4.0 (was: 1.3.0) > Java API for GraphX > --- > > Key: SPARK-3665 > URL: https://issues.apache.org/jira/browse/SPARK-3665 > Project: Spark > Issue Type: Improvement > Components: GraphX, Java API >Affects Versions: 1.0.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > > The Java API will wrap the Scala API in a similar manner as JavaRDD. > Components will include: > # JavaGraph > #- removes optional param from persist, subgraph, mapReduceTriplets, > Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply > #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices > #- merges multiple parameter lists > #- incorporates GraphOps > # JavaVertexRDD > # JavaEdgeRDD
[jira] [Updated] (SPARK-3665) Java API for GraphX
[ https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3665: - Affects Version/s: 1.0.0 > Java API for GraphX > --- > > Key: SPARK-3665 > URL: https://issues.apache.org/jira/browse/SPARK-3665 > Project: Spark > Issue Type: Improvement > Components: GraphX, Java API >Affects Versions: 1.0.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > > The Java API will wrap the Scala API in a similar manner as JavaRDD. > Components will include: > # JavaGraph > #- removes optional param from persist, subgraph, mapReduceTriplets, > Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply > #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices > #- merges multiple parameter lists > #- incorporates GraphOps > # JavaVertexRDD > # JavaEdgeRDD