[jira] [Updated] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
[ https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Khaitman updated SPARK-5782:
---------------------------------
Priority: Blocker (was: Critical)

Python Worker / Pyspark Daemon Memory Issue
-------------------------------------------
Key: SPARK-5782
URL: https://issues.apache.org/jira/browse/SPARK-5782
Project: Spark
Issue Type: Bug
Components: PySpark, Shuffle
Affects Versions: 1.3.0, 1.2.1, 1.2.2
Environment: CentOS 7, Spark Standalone
Reporter: Mark Khaitman
Priority: Blocker

I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it.

It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers as the default 1 task - 1 core configuration in our environment. This can become problematic when you build up a tree of RDD joins, since the pyspark.daemon processes do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB python worker memory limit and a few gigs of executor memory.

A related issue is that the individual python workers are not supposed to exceed 512MB by much; beyond that they're supposed to spill to disk. Some of our python workers are somehow reaching 2GB each (which is then multiplied by the number of cores per executor and the number of joins occurring in some cases), causing the Out-of-Memory killer to step up to its unfortunate job! :(

I think with the _next_limit method in shuffle.py, if the current memory usage is close to the memory limit, the 1.05 multiplier can endlessly cause more memory to be consumed by a single python worker, since max(512, 511 * 1.05) would keep blowing up towards the latter of the two... Shouldn't the memory limit be the absolute cap in this case?

I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code!
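To make the suspected feedback loop concrete, here is a minimal, purely illustrative Scala model of a max(limit, used * 1.05) rule (the real _next_limit is Python code in pyspark/shuffle.py; the numbers below are made up):

{code}
object NextLimitModel {
  val memoryLimitMB = 512.0

  // Mirrors the described rule: once usage exceeds the configured limit,
  // max() returns used * 1.05, so the effective cap ratchets upward forever.
  def nextLimit(usedMB: Double): Double = math.max(memoryLimitMB, usedMB * 1.05)

  def main(args: Array[String]): Unit = {
    var used = 511.0
    for (_ <- 1 to 5) {
      val limit = nextLimit(used)
      println(f"used=$used%.1f MB -> next limit=$limit%.1f MB")
      used = limit // assume the worker grows to the new limit before the next check
    }
  }
}
{code}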
[jira] [Created] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.
Yin Huai created SPARK-6366:
----------------------------
Summary: In Python API, the default save mode for save and saveAsTable should be error instead of append.
Key: SPARK-6366
URL: https://issues.apache.org/jira/browse/SPARK-6366
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Yin Huai

If a user wants to append data, he/she should explicitly specify the save mode. Also, in Scala and Java, the default save mode is ErrorIfExists.
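For comparison, making the mode explicit through the Scala API looks roughly like this (a hedged sketch against the 1.3-era DataFrame.save overloads; df is assumed to be an existing DataFrame):

{code}
import org.apache.spark.sql.SaveMode

// Scala/Java default to ErrorIfExists; spelling the mode out keeps the
// behaviour identical regardless of which language API is used.
df.save("/tmp/events", "parquet", SaveMode.ErrorIfExists) // fail if the path exists
df.save("/tmp/events", "parquet", SaveMode.Append)        // append deliberately
{code}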
[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364016#comment-14364016 ]

Mingyu Kim commented on SPARK-4808:
-----------------------------------
[~andrewor14], should this now be closed with the fix version 1.4? What's the next step?

Spark fails to spill with small number of large objects
--------------------------------------------------------
Key: SPARK-4808
URL: https://issues.apache.org/jira/browse/SPARK-4808
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1
Reporter: Dennis Lawler

Spillable's maybeSpill does not allow a spill to occur until at least 1000 elements have been read, and then only evaluates spilling every 32nd element thereafter. When a small number of very large items is being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior were intended to reduce the impact of the estimateSize() call. That method has since been extracted into SizeTracker, which implements its own exponential backoff for size estimation, so now we are only avoiding using the resulting estimated size.
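A hedged sketch of the gating logic the description refers to (not the verbatim Spark source; names are illustrative):

{code}
// A spill is only considered after 1000 elements have been read, and then
// only on every 32nd element, so a handful of very large objects can blow
// past available memory between two checks.
def shouldSpill(currentMemory: Long, elementsRead: Long, memoryThreshold: Long): Boolean =
  elementsRead > 1000 &&             // skip the first 1000 elements entirely
    elementsRead % 32 == 0 &&        // then check only every 32nd element
    currentMemory >= memoryThreshold // finally compare estimated usage to the threshold
{code}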
[jira] [Updated] (SPARK-6228) Move SASL support into network/common module
[ https://issues.apache.org/jira/browse/SPARK-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6228:
-------------------------------
Summary: Move SASL support into network/common module (was: Provide SASL support in network/common module)

Move SASL support into network/common module
--------------------------------------------
Key: SPARK-6228
URL: https://issues.apache.org/jira/browse/SPARK-6228
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Minor
Fix For: 1.4.0

Currently, there's support for SASL in network/shuffle, but not in network/common. Moving the SASL code to network/common would enable other applications using that code to also support secure authentication and, later, encryption.
[jira] [Updated] (SPARK-6319) DISTINCT doesn't work for binary type
[ https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6319:
------------------------------------
Target Version/s: 1.4.0 (was: 1.3.1)

DISTINCT doesn't work for binary type
-------------------------------------
Key: SPARK-6319
URL: https://issues.apache.org/jira/browse/SPARK-6319
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1
Reporter: Cheng Lian

Spark shell session for reproduction:
{noformat}
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
...
CAST(c, BinaryType)
[B@43f13160
[B@5018b648
[B@3be22500
[B@476fc8a1
{noformat}

Spark SQL uses plain byte arrays to represent binary values. However, arrays are compared by reference rather than by value. On the other hand, the DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check for duplicated values. These two facts together cause the problem.
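The two JVM facts the description relies on can be seen in isolation (plain Scala, no Spark involved):

{code}
import scala.collection.mutable.HashSet

val a = "1".getBytes
val b = "1".getBytes
println(a == b)           // false: JVM arrays compare by reference, not contents
val seen = HashSet[Array[Byte]](a)
println(seen.contains(b)) // false: same bytes, different object, so DISTINCT keeps both
{code}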
[jira] [Commented] (SPARK-5310) Update SQL programming guide for 1.3
[ https://issues.apache.org/jira/browse/SPARK-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364206#comment-14364206 ]

Michael Armbrust commented on SPARK-5310:
-----------------------------------------
I think I'd rather just publish more examples of writing data sources. Most users will probably not need to know how to do this.

Update SQL programming guide for 1.3
------------------------------------
Key: SPARK-5310
URL: https://issues.apache.org/jira/browse/SPARK-5310
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Priority: Critical
Fix For: 1.3.0

We made quite a few changes. We should update the SQL programming guide to reflect them.
[jira] [Created] (SPARK-6367) Use the proper data type for those expressions that are hijacking existing data types.
Yin Huai created SPARK-6367:
----------------------------
Summary: Use the proper data type for those expressions that are hijacking existing data types.
Key: SPARK-6367
URL: https://issues.apache.org/jira/browse/SPARK-6367
Project: Spark
Issue Type: Task
Components: SQL
Reporter: Yin Huai

For the following expressions, the actual value type does not match the type of our internal representation:
ApproxCountDistinctPartition
NewSet
AddItemToSet
CombineSets
CollectHashSet

We should create UDTs for the data types of these expressions.
[jira] [Commented] (SPARK-6340) mllib.IDF for LabelPoints
[ https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364246#comment-14364246 ]

Kian Ho commented on SPARK-6340:
--------------------------------
Hi Joseph, I initially considered that as a solution; however, it was my understanding that you couldn't guarantee the same ordering between the instances pre- and post-transformation (since the transformations will be distributed across worker nodes). Is this correct? This question was also raised by a couple of users in that thread. Thanks.

mllib.IDF for LabelPoints
-------------------------
Key: SPARK-6340
URL: https://issues.apache.org/jira/browse/SPARK-6340
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.3.0
Environment: python 2.7.8, pyspark; OS: Linux Mint 17 Qiana (Cinnamon 64-bit)
Reporter: Kian Ho
Priority: Minor
Labels: feature

As per: http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-td19429.html#a19528

Having IDF.fit accept LabelPoints would be useful since, correct me if I'm wrong, there currently isn't a way of keeping track of which labels belong to which documents if one needs to apply a conventional tf-idf transformation on labelled text data.
[jira] [Commented] (SPARK-6304) Checkpointing doesn't retain driver port
[ https://issues.apache.org/jira/browse/SPARK-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364254#comment-14364254 ]

Saisai Shao commented on SPARK-6304:
------------------------------------
Hi [~msoutier], the reason to remove these two configurations, especially spark.driver.port, is that SparkContext itself randomly chooses a port and writes it into the configuration even when the user didn't set one. The next time the application is recovered, the previous spark.driver.port needs to be removed so that SparkContext can again choose a random port and set it in the SparkConf. That's why checkpointing removes these two configurations.

Checkpointing doesn't retain driver port
----------------------------------------
Key: SPARK-6304
URL: https://issues.apache.org/jira/browse/SPARK-6304
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.2.1
Reporter: Marius Soutier

In a check-pointed Streaming application running on a fixed driver port, the setting spark.driver.port is not loaded when recovering from a checkpoint. (The driver is then started on a random port.)
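A hedged sketch of the recovery behaviour being described (the method name is illustrative, not the actual Checkpoint code):

{code}
import org.apache.spark.SparkConf

// Drop the saved port/host so the recovered SparkContext picks fresh values
// instead of trying to reuse ones recorded in the checkpoint.
def confForRecovery(checkpointed: SparkConf): SparkConf = {
  val conf = checkpointed.clone()
  conf.remove("spark.driver.port")
  conf.remove("spark.driver.host")
  conf
}
{code}

This also explains the reported symptom: a user-supplied fixed port is removed along with the auto-generated one.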
[jira] [Updated] (SPARK-6146) Support more datatype in SqlParser
[ https://issues.apache.org/jira/browse/SPARK-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6146:
------------------------------------
Target Version/s: 1.4.0 (was: 1.3.0)

Support more datatype in SqlParser
----------------------------------
Key: SPARK-6146
URL: https://issues.apache.org/jira/browse/SPARK-6146
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Yin Huai
Priority: Critical

Right now, I cannot do
{code}
df.selectExpr("cast(a as bigint)")
{code}
because only the following data types are supported in SqlParser:
{code}
protected lazy val dataType: Parser[DataType] =
  ( STRING ^^^ StringType
  | TIMESTAMP ^^^ TimestampType
  | DOUBLE ^^^ DoubleType
  | fixedDecimalType
  | DECIMAL ^^^ DecimalType.Unlimited
  | DATE ^^^ DateType
  | INT ^^^ IntegerType
  )
{code}
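For illustration, the production could be extended in the same combinator style (a hedged sketch; keyword parsers for the new tokens are assumed to be defined alongside the existing ones):

{code}
protected lazy val dataType: Parser[DataType] =
  ( STRING ^^^ StringType
  | TIMESTAMP ^^^ TimestampType
  | DOUBLE ^^^ DoubleType
  | fixedDecimalType
  | DECIMAL ^^^ DecimalType.Unlimited
  | DATE ^^^ DateType
  | INT ^^^ IntegerType
  | BIGINT ^^^ LongType     // would make "cast(a as bigint)" parse
  | TINYINT ^^^ ByteType
  | SMALLINT ^^^ ShortType
  | BOOLEAN ^^^ BooleanType
  )
{code}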
[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-5463:
------------------------------------
Priority: Blocker (was: Critical)

Fix Parquet filter push-down
----------------------------
Key: SPARK-5463
URL: https://issues.apache.org/jira/browse/SPARK-5463
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.2.2
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker
[jira] [Updated] (SPARK-5821) JSONRelation should check if delete is successful for the overwrite operation.
[ https://issues.apache.org/jira/browse/SPARK-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-5821:
------------------------------------
Target Version/s: 1.3.1 (was: 1.3.0)

JSONRelation should check if delete is successful for the overwrite operation.
-------------------------------------------------------------------------------
Key: SPARK-5821
URL: https://issues.apache.org/jira/browse/SPARK-5821
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: Yanbo Liang

When you run a CTAS command such as
{code}
CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json.DefaultSource
OPTIONS ( path "/a/b/c/d" )
AS SELECT a, b FROM jt
{code}
you will run into a failure if you don't have write permission for the directory /a/b/c, whether d is a directory or a file.
[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down
[ https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-5463:
------------------------------------
Target Version/s: 1.4.0 (was: 1.3.0, 1.2.2)

Fix Parquet filter push-down
----------------------------
Key: SPARK-5463
URL: https://issues.apache.org/jira/browse/SPARK-5463
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.2.2
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical
[jira] [Resolved] (SPARK-5183) Document data source API
[ https://issues.apache.org/jira/browse/SPARK-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-5183.
-------------------------------------
Resolution: Fixed
Fix Version/s: 1.3.0

Document data source API
-------------------------
Key: SPARK-5183
URL: https://issues.apache.org/jira/browse/SPARK-5183
Project: Spark
Issue Type: Sub-task
Components: Documentation, SQL
Reporter: Yin Huai
Assignee: Michael Armbrust
Priority: Critical
Fix For: 1.3.0

We need to document the data types the caller needs to support.
[jira] [Reopened] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust reopened SPARK-6250:
-------------------------------------
Assignee: Yin Huai (was: Michael Armbrust)

Types are now reserved words in DDL parser.
-------------------------------------------
Key: SPARK-6250
URL: https://issues.apache.org/jira/browse/SPARK-6250
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364427#comment-14364427 ]

Michael Armbrust commented on SPARK-6250:
-----------------------------------------
Okay, thanks for explaining the problem!

Types are now reserved words in DDL parser.
-------------------------------------------
Key: SPARK-6250
URL: https://issues.apache.org/jira/browse/SPARK-6250
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker
[jira] [Created] (SPARK-6372) spark-submit --conf is not being propagated to child processes
Marcelo Vanzin created SPARK-6372:
----------------------------------
Summary: spark-submit --conf is not being propagated to child processes
Key: SPARK-6372
URL: https://issues.apache.org/jira/browse/SPARK-6372
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.4.0
Reporter: Marcelo Vanzin
Priority: Blocker

Thanks to [~irashid] for bringing this up. It seems that the new launcher library is incorrectly handling --conf and not passing it down to the child processes. The fix is simple; PR coming up.
[jira] [Created] (SPARK-6373) Add SSL/TLS for the Netty based BlockTransferService
Jeffrey Turpin created SPARK-6373:
----------------------------------
Summary: Add SSL/TLS for the Netty based BlockTransferService
Key: SPARK-6373
URL: https://issues.apache.org/jira/browse/SPARK-6373
Project: Spark
Issue Type: New Feature
Components: Block Manager, Shuffle
Affects Versions: 1.2.1
Reporter: Jeffrey Turpin
Priority: Minor

Add the ability to allow for secure communications (SSL/TLS) for the Netty based BlockTransferService and the ExternalShuffleClient. This ticket will hopefully start the conversation around potential designs... Below is a reference to a WIP prototype which implements this functionality. I have attempted to disrupt as little code as possible and tried to follow the current code structure (for the most part) in the areas I modified. I also studied how Hadoop achieves encrypted shuffle (http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html).

https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c
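For orientation, the general Netty mechanism involved looks like this (an illustrative sketch, not the linked prototype's actual code): TLS is enabled by placing an SslHandler at the front of the channel pipeline, before any transport handlers run.

{code}
import io.netty.channel.Channel
import io.netty.handler.ssl.SslHandler
import javax.net.ssl.SSLContext

// Wrap the channel's traffic in TLS; handlers added after the SslHandler
// see plaintext, while bytes on the wire are encrypted.
def addSsl(channel: Channel, sslContext: SSLContext, isClient: Boolean): Unit = {
  val engine = sslContext.createSSLEngine()
  engine.setUseClientMode(isClient)
  channel.pipeline().addFirst("ssl", new SslHandler(engine))
}
{code}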
[jira] [Commented] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
[ https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363894#comment-14363894 ]

Mark Khaitman commented on SPARK-5782:
--------------------------------------
I've upped this JIRA ticket to blocker since there's a serious memory leak / GC problem causing these python workers to sometimes reach almost 3GB each (with a 512MB default limit). I'm going to try to reproduce this using non-production data in the meantime.

Python Worker / Pyspark Daemon Memory Issue
-------------------------------------------
Key: SPARK-5782
URL: https://issues.apache.org/jira/browse/SPARK-5782
Project: Spark
Issue Type: Bug
Components: PySpark, Shuffle
Affects Versions: 1.3.0, 1.2.1, 1.2.2
Environment: CentOS 7, Spark Standalone
Reporter: Mark Khaitman
Priority: Blocker
[jira] [Resolved] (SPARK-6327) Run PySpark with python directly is broken
[ https://issues.apache.org/jira/browse/SPARK-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-6327.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5019
[https://github.com/apache/spark/pull/5019]

Run PySpark with python directly is broken
------------------------------------------
Key: SPARK-6327
URL: https://issues.apache.org/jira/browse/SPARK-6327
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical
Fix For: 1.4.0

It worked before, but is broken now:

{code}
davies@localhost:~/work/spark$ python r.py
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally (client) or
                              on one of the worker machines inside the cluster (cluster)
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.
  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 512M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME           User to impersonate when submitting the application.
  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version                   Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --executor-cores NUM        Number of cores per executor (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: default).
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
{code}
[jira] [Resolved] (SPARK-5310) Update SQL programming guide for 1.3
[ https://issues.apache.org/jira/browse/SPARK-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-5310.
-------------------------------------
Resolution: Fixed
Fix Version/s: 1.3.0

Update SQL programming guide for 1.3
------------------------------------
Key: SPARK-5310
URL: https://issues.apache.org/jira/browse/SPARK-5310
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Priority: Critical
Fix For: 1.3.0

We made quite a few changes. We should update the SQL programming guide to reflect them.
[jira] [Closed] (SPARK-6340) mllib.IDF for LabelPoints
[ https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley closed SPARK-6340.
------------------------------------
Resolution: Not a Problem

mllib.IDF for LabelPoints
-------------------------
Key: SPARK-6340
URL: https://issues.apache.org/jira/browse/SPARK-6340
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.3.0
Reporter: Kian Ho
Priority: Minor
Labels: feature
[jira] [Updated] (SPARK-6247) Certain self joins cannot be analyzed
[ https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6247:
------------------------------------
Priority: Critical (was: Major)

Certain self joins cannot be analyzed
-------------------------------------
Key: SPARK-6247
URL: https://issues.apache.org/jira/browse/SPARK-6247
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Yin Huai
Priority: Critical

When you try the following code
{code}
val df = (1 to 10)
  .map(i => (i, i.toDouble, i.toLong, i.toString, i.toString))
  .toDF("intCol", "doubleCol", "longCol", "stringCol1", "stringCol2")
df.registerTempTable("test")

sql("""
  |SELECT x.stringCol2, avg(y.intCol), sum(x.doubleCol)
  |FROM test x JOIN test y ON (x.stringCol1 = y.stringCol1)
  |GROUP BY x.stringCol2
  """.stripMargin).explain()
{code}
the following exception will be thrown.
{code}
[info] java.util.NoSuchElementException: next on empty iterator
[info] at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
[info] at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
[info] at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
[info] at scala.collection.IterableLike$class.head(IterableLike.scala:91)
[info] at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
[info] at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
[info] at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:247)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263)
[info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
[info] at scala.collection.Iterator$class.foreach(Iterator.scala:727)
[info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
[info] at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
[info] at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
[info] at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
[info] at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
[info] at scala.collection.AbstractIterator.to(Iterator.scala:1157)
[info] at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
[info] at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
[info] at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
[info] at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197)
[info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
[info] at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
[info] at scala.collection.immutable.List.foldLeft(List.scala:84)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
[info] at scala.collection.immutable.List.foreach(List.scala:318)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:1071)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:1071)
[info] at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069)
[info] at
{code}
[jira] [Updated] (SPARK-6231) Join on two tables (generated from same one) is broken
[ https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6231:
------------------------------------
Target Version/s: 1.3.1

Join on two tables (generated from same one) is broken
------------------------------------------------------
Key: SPARK-6231
URL: https://issues.apache.org/jira/browse/SPARK-6231
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Michael Armbrust
Priority: Critical
Labels: DataFrame

If the two columns used in joinExpr come from the same table, they have the same id, and then the joinExpr is analyzed in the wrong way.

{code}
val df = sqlContext.load(path, "parquet")
val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")

scala> rmJoin.explain
== Physical Plan ==
CartesianProduct
 Filter (cust_id#0 = cust_id#0)
  Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS txns#7L]
   Exchange (HashPartitioning [cust_id#0], 200)
    Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS partialSets#25]
     PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at newParquet.scala:542
 Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8]
  Exchange (HashPartitioning [cust_id#17], 200)
   Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS PartialSum#38]
    PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at newParquet.scala:542
{code}
[jira] [Commented] (SPARK-6146) Support more datatype in SqlParser
[ https://issues.apache.org/jira/browse/SPARK-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364272#comment-14364272 ]

Michael Armbrust commented on SPARK-6146:
-----------------------------------------
Now that we have our own DDL parser that doesn't live in hive, we should use one code path for this.

Support more datatype in SqlParser
----------------------------------
Key: SPARK-6146
URL: https://issues.apache.org/jira/browse/SPARK-6146
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Yin Huai
Priority: Critical

Right now, I cannot do
{code}
df.selectExpr("cast(a as bigint)")
{code}
because only the following data types are supported in SqlParser:
{code}
protected lazy val dataType: Parser[DataType] =
  ( STRING ^^^ StringType
  | TIMESTAMP ^^^ TimestampType
  | DOUBLE ^^^ DoubleType
  | fixedDecimalType
  | DECIMAL ^^^ DecimalType.Unlimited
  | DATE ^^^ DateType
  | INT ^^^ IntegerType
  )
{code}
[jira] [Updated] (SPARK-5881) RDD remains cached after the table gets overridden by CACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-5881:
------------------------------------
Target Version/s: 1.4.0 (was: 1.3.0)

RDD remains cached after the table gets overridden by CACHE TABLE
-----------------------------------------------------------------
Key: SPARK-5881
URL: https://issues.apache.org/jira/browse/SPARK-5881
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Yin Huai

{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("CACHE TABLE foo AS SELECT * FROM jt")
sqlContext.sql("CACHE TABLE foo AS SELECT a FROM jt")
{code}

After the second CACHE TABLE command, the RDD for the first table still remains in the cache.
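Until the overwrite path cleans up properly, a hedged workaround sketch (assuming 1.3-era UNCACHE TABLE behaviour; untested against this exact scenario) is to drop the stale entry explicitly before re-caching:

{code}
sqlContext.sql("UNCACHE TABLE foo")                   // evict the old cached RDD first
sqlContext.sql("CACHE TABLE foo AS SELECT a FROM jt") // then cache the new definition
{code}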
[jira] [Updated] (SPARK-5881) RDD remains cached after the table gets overridden by CACHE TABLE
[ https://issues.apache.org/jira/browse/SPARK-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-5881:
------------------------------------
Priority: Critical (was: Major)

RDD remains cached after the table gets overridden by CACHE TABLE
-----------------------------------------------------------------
Key: SPARK-5881
URL: https://issues.apache.org/jira/browse/SPARK-5881
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Yin Huai
Priority: Critical
[jira] [Commented] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364279#comment-14364279 ]

Tathagata Das commented on SPARK-5523:
--------------------------------------
As long as the hostname object is short-lived it's cool. That's the same strategy used for StorageLevel. So it is fine.

TaskMetrics and TaskInfo have innumerable copies of the hostname string
------------------------------------------------------------------------
Key: SPARK-5523
URL: https://issues.apache.org/jira/browse/SPARK-5523
Project: Spark
Issue Type: Bug
Components: Spark Core, Streaming
Reporter: Tathagata Das

TaskMetrics and TaskInfo objects have the hostname associated with the task. As these are created (directly or through deserialization of RPC messages), each of them has a separate String object for the hostname even though most of them contain the same string data. This results in thousands of string objects, increasing the memory requirements of the driver. This can be easily deduped when deserializing a TaskMetrics object, or when creating a TaskInfo object. This affects streaming particularly badly due to the rate of job/stage/task generation. For a solution, see how this dedup is done for StorageLevel: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226
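A hedged sketch of the StorageLevel-style dedup the description points to (illustrative names, not the actual Spark patch): keep a small cache keyed by the string value and swap in the canonical instance on deserialization.

{code}
import java.util.concurrent.ConcurrentHashMap

// Canonicalize hostname strings so thousands of deserialized TaskMetrics /
// TaskInfo objects share one String instance per distinct host.
object HostnameCache {
  private val cache = new ConcurrentHashMap[String, String]()

  def canonicalize(host: String): String = {
    val prior = cache.putIfAbsent(host, host)
    if (prior == null) host else prior
  }
}

// e.g. on deserialization: hostname = HostnameCache.canonicalize(hostname)
{code}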
[jira] [Commented] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD
[ https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364310#comment-14364310 ]

tanyinyan commented on SPARK-6348:
----------------------------------
Yes, I use a one-hot encoding before SVM, which is exactly what 'sparsed before SVM' means :)

Enable useFeatureScaling in SVMWithSGD
--------------------------------------
Key: SPARK-6348
URL: https://issues.apache.org/jira/browse/SPARK-6348
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Priority: Minor
Original Estimate: 2h
Remaining Estimate: 2h

Currently, useFeatureScaling is set to false by default in class GeneralizedLinearAlgorithm, and it is only enabled in LogisticRegressionWithLBFGS. The SVMWithSGD class is private; train methods are provided in the SVMWithSGD object. So there is no way to set useFeatureScaling when using SVM.

I am using SVM on this dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the first day's data (ignoring fields id/device_id/device_ip; all remaining fields are considered categorical variables and one-hot encoded before SVM) and predicting on the same data with the threshold cleared; the predictions are all negative. When I set useFeatureScaling to true, the predictions look normal (including both negative and positive results).
[jira] [Commented] (SPARK-6320) Adding new query plan strategy to SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364345#comment-14364345 ]

Michael Armbrust commented on SPARK-6320:
-----------------------------------------
Hmm, interesting. So far I had only considered this interface for planning leaves of the query plan. Can you tell me more about what you are trying to optimize?

Adding new query plan strategy to SQLContext
--------------------------------------------
Key: SPARK-6320
URL: https://issues.apache.org/jira/browse/SPARK-6320
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: Youssef Hatem
Priority: Minor

Hi,

I would like to add a new strategy to {{SQLContext}}. To do this I created a new class which extends {{Strategy}}. In my new class I need to call the {{planLater}} function. However, this method is defined in {{SparkPlanner}} (which itself inherits the method from {{QueryPlanner}}). To my knowledge, the only way to make {{planLater}} visible to my new strategy is to define my strategy inside another class that extends {{SparkPlanner}} and inherits {{planLater}}; by doing so I will have to extend {{SQLContext}} so that I can override the {{planner}} field with the new {{Planner}} class I created.

It seems that this is a design problem, because adding a new strategy seems to require extending {{SQLContext}} (unless I am doing it wrong and there is a better way to do it).

Thanks a lot,
Youssef
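As a point of reference, a strategy that only plans leaves can already be registered without subclassing SQLContext, via the experimental hook (a hedged sketch assuming the 1.3-era API; it does not solve the planLater visibility problem raised here):

{code}
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object MyLeafStrategy extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // plan only the operators you recognize; returning Nil defers to the
    // built-in strategies (no planLater() is available out here)
    case _ => Nil
  }
}

sqlContext.experimental.extraStrategies = Seq(MyLeafStrategy)
{code}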
[jira] [Commented] (SPARK-6349) Add probability estimates in SVMModel predict result
[ https://issues.apache.org/jira/browse/SPARK-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364361#comment-14364361 ]

tanyinyan commented on SPARK-6349:
----------------------------------
Yes, this doesn't solve the problem of picking a threshold. But a raw margin usually has no fixed boundary (as I tested above, the output margins were all negative), while a probability does. So it's more convenient to pick a good threshold, right?

Add probability estimates in SVMModel predict result
-----------------------------------------------------
Key: SPARK-6349
URL: https://issues.apache.org/jira/browse/SPARK-6349
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.2.1
Reporter: tanyinyan
Original Estimate: 168h
Remaining Estimate: 168h

In SVMModel, the predictPoint method outputs a raw margin (threshold not set) or a 1/0 label (threshold set). When SVM is used as a classifier, it's hard to find a good threshold, and the raw margin is hard to interpret.

When I am using SVM on this dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the first day's data (ignoring fields id/device_id/device_ip; all remaining fields are considered categorical variables and one-hot encoded before SVM) and predicting on the same data with the threshold cleared, the predictions are all negative. I have to set the threshold to -1 to get a reasonable confusion matrix.

So, I suggest providing probability estimates in SVMModel as in libSVM (Platt's binary SVM probabilistic output).
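For reference, Platt scaling maps a raw SVM margin f(x) to a probability via a fitted sigmoid. A minimal sketch (A and B are assumed to have been fit by maximum likelihood on held-out decision values, which is the part libSVM implements):

{code}
// Platt (1999): P(y = 1 | x) = 1 / (1 + exp(A * f(x) + B)),
// where A and B are fit on validation-set margins.
def plattProbability(margin: Double, a: Double, b: Double): Double =
  1.0 / (1.0 + math.exp(a * margin + b))

// With a probability in hand, the threshold becomes interpretable, e.g. 0.5:
def predictLabel(margin: Double, a: Double, b: Double): Double =
  if (plattProbability(margin, a, b) >= 0.5) 1.0 else 0.0
{code}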
[jira] [Commented] (SPARK-6371) Update version to 1.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-6371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364410#comment-14364410 ]

Apache Spark commented on SPARK-6371:
-------------------------------------
User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/5056

Update version to 1.4.0-SNAPSHOT
--------------------------------
Key: SPARK-6371
URL: https://issues.apache.org/jira/browse/SPARK-6371
Project: Spark
Issue Type: Task
Components: Build
Reporter: Marcelo Vanzin
Priority: Critical

See summary.
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364429#comment-14364429 ]

Nitay Joffe commented on SPARK-6250:
------------------------------------
Thanks [~marmbrus] and [~yhuai].

Types are now reserved words in DDL parser.
-------------------------------------------
Key: SPARK-6250
URL: https://issues.apache.org/jira/browse/SPARK-6250
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Michael Armbrust
Assignee: Yin Huai
Priority: Blocker
[jira] [Commented] (SPARK-6207) YARN secure cluster mode doesn't obtain a hive-metastore token
[ https://issues.apache.org/jira/browse/SPARK-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364441#comment-14364441 ]

Doug Balog commented on SPARK-6207:
-----------------------------------
We need to catch java.lang.UnsupportedOperationException and ignore it, or check whether delegation token mode is supported with the current configuration before trying to get a delegation token. See https://issues.apache.org/jira/browse/HIVE-4625

YARN secure cluster mode doesn't obtain a hive-metastore token
--------------------------------------------------------------
Key: SPARK-6207
URL: https://issues.apache.org/jira/browse/SPARK-6207
Project: Spark
Issue Type: Bug
Components: Spark Submit, SQL, YARN
Affects Versions: 1.2.0, 1.3.0, 1.2.1
Environment: YARN
Reporter: Doug Balog

When running a spark job on YARN in secure mode with --deploy-mode cluster, org.apache.spark.deploy.yarn.Client() does not obtain a delegation token to the hive-metastore. Therefore any attempts to talk to the hive-metastore fail with a "GSSException: No valid credentials provided..."
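The guard suggested above could look roughly like this (a hedged sketch; the token-fetching call is passed in as a thunk because the exact metastore client API is not shown in this thread):

{code}
// Skip metastores that don't support delegation tokens (see HIVE-4625)
// instead of failing the whole submission.
def obtainTokenSafely(getToken: () => String): Option[String] =
  try {
    Some(getToken())
  } catch {
    case _: UnsupportedOperationException => None // e.g. local/embedded metastore
  }
{code}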
[jira] [Created] (SPARK-6376) Relation are thrown away too early in dataframes
Michael Armbrust created SPARK-6376:
------------------------------------
Summary: Relation are thrown away too early in dataframes
Key: SPARK-6376
URL: https://issues.apache.org/jira/browse/SPARK-6376
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Michael Armbrust
Priority: Critical

Because we throw away aliases as we construct the query plan, you can't reference them later. For example, this query fails:
{code}
test("self join with aliases") {
  val df = Seq(1, 2, 3).map(i => (i, i.toString)).toDF("int", "str")
  checkAnswer(
    df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count(),
    Row(1, 1) :: Row(2, 1) :: Row(3, 1) :: Nil)
}
{code}
{code}
[info] org.apache.spark.sql.AnalysisException: Cannot resolve column name "x.str" among (int, str, int, str);
[info] at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
[info] at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
{code}
[jira] [Updated] (SPARK-6247) Certain self joins cannot be analyzed
[ https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6247:
------------------------------------
Target Version/s: 1.3.1 (was: 1.3.0)

Certain self joins cannot be analyzed
-------------------------------------
Key: SPARK-6247
URL: https://issues.apache.org/jira/browse/SPARK-6247
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Yin Huai
[jira] [Commented] (SPARK-6340) mllib.IDF for LabelPoints
[ https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364266#comment-14364266 ]

Joseph K. Bradley commented on SPARK-6340:
------------------------------------------
You should be able to reliably zip the RDDs back together. I just sent an update to that post, which I'll copy here:
{quote}
This was brought up again in https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one item which was asked about: the reliability of zipping RDDs. Basically, it should be reliable, and if it is not, then it should be reported as a bug. This general approach should work (with explicit types to make it clear):
{code}
val data: RDD[LabeledPoint] = ...
val labels: RDD[Double] = data.map(_.label)
val features1: RDD[Vector] = data.map(_.features)
val features2: RDD[Vector] = new HashingTF(numFeatures = 100).transform(features1)
val features3: RDD[Vector] = idfModel.transform(features2)
val finalData: RDD[LabeledPoint] =
  labels.zip(features3).map { case (label, features) => LabeledPoint(label, features) }
{code}
{quote}
Do report it if you run into problems with this! Thanks.

mllib.IDF for LabelPoints
-------------------------
Key: SPARK-6340
URL: https://issues.apache.org/jira/browse/SPARK-6340
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.3.0
Reporter: Kian Ho
Priority: Minor
Labels: feature
[jira] [Updated] (SPARK-6247) Certain self joins cannot be analyzed
[ https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6247:
------------------------------------
Priority: Blocker (was: Critical)

Certain self joins cannot be analyzed
-------------------------------------
Key: SPARK-6247
URL: https://issues.apache.org/jira/browse/SPARK-6247
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Yin Huai
Assignee: Michael Armbrust
Priority: Blocker
[jira] [Updated] (SPARK-6368) Build a specialized serializer for Exchange operator.
[ https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6368: Priority: Critical (was: Major) Build a specialized serializer for Exchange operator. -- Key: SPARK-6368 URL: https://issues.apache.org/jira/browse/SPARK-6368 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Priority: Critical Kryo is still pretty slow because it works on individual objects, which are relatively expensive to allocate. For the Exchange operator, because the schemas for the key and value are already defined, we can create a specialized serializer to handle those specific schemas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
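To make the idea concrete, here is a minimal sketch (all names hypothetical, not Spark API) of what a schema-specialized serializer buys: with a fixed (Int key, Long value) layout, fields are written directly, with none of the per-object class metadata a generic serializer such as Kryo has to emit.
{code}
import java.io.{DataInputStream, DataOutputStream}

object FixedSchemaRowSerializer {
  // Fixed layout: no class tags, no field names, just the raw field bytes.
  def write(out: DataOutputStream, key: Int, value: Long): Unit = {
    out.writeInt(key)
    out.writeLong(value)
  }

  def read(in: DataInputStream): (Int, Long) =
    (in.readInt(), in.readLong())
}
{code}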
[jira] [Updated] (SPARK-6367) Use the proper data type for those expressions that are hijacking existing data types.
[ https://issues.apache.org/jira/browse/SPARK-6367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6367: Assignee: Yin Huai Use the proper data type for those expressions that are hijacking existing data types. -- Key: SPARK-6367 URL: https://issues.apache.org/jira/browse/SPARK-6367 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai Assignee: Yin Huai For the following expressions, the actual value type does not match the type of our internal representation: ApproxCountDistinctPartition, NewSet, AddItemToSet, CombineSets, CollectHashSet. We should create UDTs for the data types of these expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6200) Support dialect in SQL
[ https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364297#comment-14364297 ] Michael Armbrust commented on SPARK-6200: - I'll add this seems to be mostly implemented already here: https://github.com/apache/spark/pull/4015 Support dialect in SQL -- Key: SPARK-6200 URL: https://issues.apache.org/jira/browse/SPARK-6200 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Created a new dialect manager that supports dialect commands and adding new dialects via SQL statements, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364350#comment-14364350 ] yuhao yang commented on SPARK-5563: --- Matthew Willson. Thanks for the attention and idea. Apart from Gensim, vowpal-wabbit also has a distributed implementation provided by Matthew D. Hoffman, which seems to be amazingly fast. I'll refer to those libraries as much as possible. And suggestions are always welcome. LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: yuhao yang Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking in to the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
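For reference, the heart of the online algorithm in Hoffman, Blei and Bach (2010) is a single blending step. The sketch below assumes a flat Array[Double] representation of the topic parameter lambda; it is not MLlib API, just the update rule from the paper:
{code}
// Each minibatch's estimated topic statistics lambdaHat are blended into the
// running parameter lambda with a decaying weight rho_t = (tau0 + t)^(-kappa),
// with kappa in (0.5, 1] so the updates satisfy the usual convergence conditions.
def onlineStep(lambda: Array[Double], lambdaHat: Array[Double], t: Int,
               tau0: Double = 1024.0, kappa: Double = 0.7): Unit = {
  val rho = math.pow(tau0 + t, -kappa)
  var i = 0
  while (i < lambda.length) {
    lambda(i) = (1 - rho) * lambda(i) + rho * lambdaHat(i)
    i += 1
  }
}
{code}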
Re: [jira] [Created] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples
What's the bug? Each element is sampled with probability 0.5. I think the expected size is 14 but not all samples would be that size. On Mar 17, 2015 12:12 AM, Marko Bonaci (JIRA) j...@apache.org wrote: Marko Bonaci created SPARK-6370: --- Summary: RDD sampling with replacement intermittently yields incorrect number of samples Key: SPARK-6370 URL: https://issues.apache.org/jira/browse/SPARK-6370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4 Reporter: Marko Bonaci Here's the repl output: {code:java} scala> uniqueIds.collect res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10) scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at <console>:27 scala> swr.count res17: Long = 16 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at <console>:27 scala> swr.count res18: Long = 8 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at <console>:27 scala> swr.count res19: Long = 18 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at <console>:27 scala> swr.count res20: Long = 15 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at <console>:27 scala> swr.count res21: Long = 11 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at <console>:27 scala> swr.count res22: Long = 10 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6250. - Resolution: Won't Fix Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364353#comment-14364353 ] Michael Armbrust commented on SPARK-6250: - We have confirmed that this does work if you escape the field names using backticks. Since this is pretty standard, I'm going to close this as Won't Fix. If there is some case where this is not possible, please reopen with details. Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
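For anyone else hitting this, an illustration of the escaping described above (the table and column names here are invented):
{code}
// A column named after a type keyword must be escaped with backticks:
sqlContext.sql("SELECT `timestamp`, `string` FROM events").collect()
{code}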
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364417#comment-14364417 ] Nitay Joffe commented on SPARK-6250: The error is always the same: https://gist.github.com/nitay/8ba0efd739cf2e22ad23 Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6200) Support dialect in SQL
[ https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364264#comment-14364264 ] Michael Armbrust commented on SPARK-6200: - Thank you for working on this. I would like to support pluggable dialects, but I'm not sure I fully agree with the implementation. In general it would be good to post a design on the JIRA and get agreement before doing too much implementation. At a high level, I wonder if something much simpler would be sufficient. I don't expect that users will spend a lot of time switching between dialects. Probably they will configure their preferred one in spark.defaults and never think about it again. Additionally, we will still need {{SET spark.sql.dialect=}} as this is public API, so why not just extend that? Basically I would propose we do the following. Add a simple interface {{Dialect}} that takes a {{String}} and returns a {{LogicalPlan}}, as you have done. For the built-in ones you just say {{SET spark.sql.dialect=sql}} or {{SET spark.sql.dialect=hiveql}}. For an external one you simply provide the fully qualified class name. It would also be good to be clear in the interface what the contract is for DDL. I would suggest that Spark SQL always parses its own DDL first and only defers to the dialect when the built-in DDL parser does not handle the given string. Support dialect in SQL -- Key: SPARK-6200 URL: https://issues.apache.org/jira/browse/SPARK-6200 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Created a new dialect manager that supports dialect commands and adding new dialects via SQL statements, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
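A minimal sketch of the interface being proposed (the trait and method names are assumptions, not committed API):
{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait Dialect {
  // Turn a SQL string into a logical plan; per the contract suggested above,
  // DDL would be tried by the built-in parser before reaching this method.
  def parse(sqlText: String): LogicalPlan
}

// Built-in dialects selected by name, external ones by class name:
//   SET spark.sql.dialect=sql
//   SET spark.sql.dialect=hiveql
//   SET spark.sql.dialect=com.example.MyDialect
{code}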
[jira] [Updated] (SPARK-6109) Unit tests fail when compiled against Hive 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-6109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6109: Target Version/s: 1.4.0 (was: 1.3.0) Unit tests fail when compiled against Hive 0.12.0 - Key: SPARK-6109 URL: https://issues.apache.org/jira/browse/SPARK-6109 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Currently, Jenkins doesn't run unit tests against Hive 0.12.0, and several Hive 0.13.1 specific test cases always fail against Hive 0.12.0. Need to blacklist them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6199) Support CTE
[ https://issues.apache.org/jira/browse/SPARK-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6199: Target Version/s: 1.4.0 (was: 1.3.0) Support CTE --- Key: SPARK-6199 URL: https://issues.apache.org/jira/browse/SPARK-6199 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Support CTE in SQLContext and HiveContext -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type
[ https://issues.apache.org/jira/browse/SPARK-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5911: Target Version/s: 1.4.0 (was: 1.3.0) Make Column.cast(to: String) support fixed precision and scale decimal type --- Key: SPARK-5911 URL: https://issues.apache.org/jira/browse/SPARK-5911 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
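For illustration, what the improvement asks for (the column name is invented, and the typed overload shown for contrast assumes the DecimalType(precision, scale) factory):
{code}
import org.apache.spark.sql.types.DecimalType

df("amount").cast("decimal(10,2)")     // string form requested by this JIRA
df("amount").cast(DecimalType(10, 2))  // typed form, assuming the (precision, scale) factory
{code}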
[jira] [Updated] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.
[ https://issues.apache.org/jira/browse/SPARK-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6366: Priority: Blocker (was: Major) In Python API, the default save mode for save and saveAsTable should be error instead of append. Key: SPARK-6366 URL: https://issues.apache.org/jira/browse/SPARK-6366 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker If a user wants to append data, he/she should explicitly specify the save mode. Also, in Scala and Java, the default save mode is ErrorIfExists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6247) Certain self joins cannot be analyzed
[ https://issues.apache.org/jira/browse/SPARK-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-6247: --- Assignee: Michael Armbrust Certain self joins cannot be analyzed - Key: SPARK-6247 URL: https://issues.apache.org/jira/browse/SPARK-6247 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Michael Armbrust Priority: Critical When you try the following code {code} val df = (1 to 10) .map(i => (i, i.toDouble, i.toLong, i.toString, i.toString)) .toDF("intCol", "doubleCol", "longCol", "stringCol1", "stringCol2") df.registerTempTable("test") sql( """ |SELECT x.stringCol2, avg(y.intCol), sum(x.doubleCol) |FROM test x JOIN test y ON (x.stringCol1 = y.stringCol1) |GROUP BY x.stringCol2 """.stripMargin).explain() {code} The following exception will be thrown. {code} [info] java.util.NoSuchElementException: next on empty iterator [info] at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) [info] at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) [info] at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64) [info] at scala.collection.IterableLike$class.head(IterableLike.scala:91) [info] at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47) [info] at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120) [info] at scala.collection.mutable.ArrayBuffer.head(ArrayBuffer.scala:47) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:247) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7.applyOrElse(Analyzer.scala:197) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) [info] at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:263) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info] at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) [info] at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) [info] at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) [info] at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) [info] at scala.collection.AbstractIterator.to(Iterator.scala:1157) [info] at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) [info] at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) [info] at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) [info] at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:292) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:247) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:197) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:196) [info] at
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) [info] at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) [info] at scala.collection.immutable.List.foldLeft(List.scala:84) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) [info] at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:1071) [info] at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:1071) [info] at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1069) [info] at
[jira] [Updated] (SPARK-6231) Join on two tables (generated from same one) is broken
[ https://issues.apache.org/jira/browse/SPARK-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6231: Priority: Blocker (was: Critical) Join on two tables (generated from same one) is broken -- Key: SPARK-6231 URL: https://issues.apache.org/jira/browse/SPARK-6231 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Michael Armbrust Priority: Blocker Labels: DataFrame If the two columns used in the joinExpr come from the same table, they have the same id, and then the joinExpr is resolved the wrong way. {code} val df = sqlContext.load(path, "parquet") val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns")) val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend")) val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner") scala> rmJoin.explain == Physical Plan == CartesianProduct Filter (cust_id#0 = cust_id#0) Aggregate false, [cust_id#0], [cust_id#0,CombineAndCount(partialSets#25) AS txns#7L] Exchange (HashPartitioning [cust_id#0], 200) Aggregate true, [cust_id#0], [cust_id#0,AddToHashSet(day_num#2L) AS partialSets#25] PhysicalRDD [cust_id#0,day_num#2L], MapPartitionsRDD[1] at map at newParquet.scala:542 Aggregate false, [cust_id#17], [cust_id#17,SUM(PartialSum#38) AS spend#8] Exchange (HashPartitioning [cust_id#17], 200) Aggregate true, [cust_id#17], [cust_id#17,SUM(extended_price#20) AS PartialSum#38] PhysicalRDD [cust_id#17,extended_price#20], MapPartitionsRDD[3] at map at newParquet.scala:542 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6366) In Python API, the default save mode for save and saveAsTable should be error instead of append.
[ https://issues.apache.org/jira/browse/SPARK-6366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6366: Assignee: Yin Huai In Python API, the default save mode for save and saveAsTable should be error instead of append. Key: SPARK-6366 URL: https://issues.apache.org/jira/browse/SPARK-6366 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai If a user wants to append data, he/she should explicitly specify the save mode. Also, in Scala and Java, the default save mode is ErrorIfExists. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
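Whatever the default ends up being, callers can avoid the ambiguity by passing the mode explicitly. A Scala sketch (the table name is invented, and the overload taking a SaveMode is assumed to match the 1.3 DataFrame API):
{code}
import org.apache.spark.sql.SaveMode

df.saveAsTable("target_table", SaveMode.Append)  // append only when asked for
df.saveAsTable("target_table")                   // default: error if the table exists
{code}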
[jira] [Commented] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD
[ https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364309#comment-14364309 ] tanyinyan commented on SPARK-6348: -- Yes, I use one-hot encoding before SVM, which is exactly what 'sparsed before SVM' means :) Enable useFeatureScaling in SVMWithSGD -- Key: SPARK-6348 URL: https://issues.apache.org/jira/browse/SPARK-6348 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: tanyinyan Priority: Minor Original Estimate: 2h Remaining Estimate: 2h Currently, useFeatureScaling is set to false by default in class GeneralizedLinearAlgorithm, and it is only enabled in LogisticRegressionWithLBFGS. The SVMWithSGD class is private; train methods are provided in the SVMWithSGD object, so there is no way to set useFeatureScaling when using SVM. I am using SVM on a dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the first day's data (ignoring the fields id/device_id/device_ip; all remaining fields are considered categorical variables and one-hot encoded, i.e. 'sparsed', before SVM) and predicting on the same data with the threshold cleared; the prediction results are all negative. When I set useFeatureScaling to true, the prediction results are normal (including both negative and positive results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
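Until useFeatureScaling is exposed, one possible workaround is to standardize the features manually before training. A sketch, where data is an assumed RDD[LabeledPoint] of the one-hot encoded rows:
{code}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// Scale each feature to unit standard deviation (no mean shift, to keep sparsity).
val scaler = new StandardScaler(withMean = false, withStd = true)
  .fit(data.map(_.features))
val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
val model = SVMWithSGD.train(scaled, 100)
{code}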
[jira] [Issue Comment Deleted] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD
[ https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tanyinyan updated SPARK-6348: - Comment: was deleted (was: Yes,I use a one-hot encoding before SVM , which is the 'sparsed before SVM ' exactly means :)) Enable useFeatureScaling in SVMWithSGD -- Key: SPARK-6348 URL: https://issues.apache.org/jira/browse/SPARK-6348 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: tanyinyan Priority: Minor Original Estimate: 2h Remaining Estimate: 2h Currently, useFeatureScaling is set to false by default in class GeneralizedLinearAlgorithm, and it is only enabled in LogisticRegressionWithLBFGS. The SVMWithSGD class is private; train methods are provided in the SVMWithSGD object, so there is no way to set useFeatureScaling when using SVM. I am using SVM on a dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the first day's data (ignoring the fields id/device_id/device_ip; all remaining fields are considered categorical variables and one-hot encoded, i.e. 'sparsed', before SVM) and predicting on the same data with the threshold cleared; the prediction results are all negative. When I set useFeatureScaling to true, the prediction results are normal (including both negative and positive results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364350#comment-14364350 ] yuhao yang edited comment on SPARK-5563 at 3/17/15 1:13 AM: Matthew Willson. Thanks for the attention and idea. Apart from Gensim, vowpal-wabbit also has a distributed implementation (C++) provided by Matthew D. Hoffman, which seems to be amazingly fast. I'll refer to those libraries as much as possible. And suggestions are always welcome. was (Author: yuhaoyan): Matthew Willson. Thanks for the attention and idea. Apart from Gensim, vowpal-wabbit also has a distributed implementation provided by Matthew D. Hoffman, which seems to be amazingly fast. I'll refer to those libraries as much as possible. And suggestions are always welcome. LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: yuhao yang Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking in to the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364414#comment-14364414 ] Nitay Joffe commented on SPARK-6250: Backticks don't work for me on existing data. For example, I work with a table that is structured like: ... foo: struct<timestamp:bigint,timezone:string> ... Selecting *, `foo`, `foo.timestamp`, foo.`timestamp` all don't work. What am I doing wrong? Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6375) Bad formatting in analysis errors
Michael Armbrust created SPARK-6375: --- Summary: Bad formatting in analysis errors Key: SPARK-6375 URL: https://issues.apache.org/jira/browse/SPARK-6375 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Critical {code} [info] org.apache.spark.sql.AnalysisException: Ambiguous references to str: (str#3,List()),(str#5,List()); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6377) Set the number of shuffle partitions automatically based on the size of input tables and the reduce-side operation.
Yin Huai created SPARK-6377: --- Summary: Set the number of shuffle partitions automatically based on the size of input tables and the reduce-side operation. Key: SPARK-6377 URL: https://issues.apache.org/jira/browse/SPARK-6377 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai It will be helpful to automatically set the number of shuffle partitions based on the size of input tables and the operation at the reduce side for an Exchange operator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
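One possible shape of such a heuristic, sketched; the names and the 64 MB target are assumptions, not a committed design:
{code}
// Pick a partition count so each reduce partition sees roughly a target
// number of bytes, given an estimate of the shuffle input size.
def chooseShufflePartitions(estimatedInputBytes: Long,
                            targetBytesPerPartition: Long = 64L << 20): Int =
  math.max(1, math.ceil(estimatedInputBytes.toDouble / targetBytesPerPartition).toInt)
{code}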
[jira] [Created] (SPARK-6369) InsertIntoHiveTable should use logic from SparkHadoopWriter
Michael Armbrust created SPARK-6369: --- Summary: InsertIntoHiveTable should use logic from SparkHadoopWriter Key: SPARK-6369 URL: https://issues.apache.org/jira/browse/SPARK-6369 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Critical Right now it is possible that we will corrupt the output if there is a race between competing speculative tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5451) And predicates are not properly pushed down
[ https://issues.apache.org/jira/browse/SPARK-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5451: Target Version/s: 1.4.0 (was: 1.3.0, 1.2.2) And predicates are not properly pushed down --- Key: SPARK-5451 URL: https://issues.apache.org/jira/browse/SPARK-5451 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Critical This issue is actually caused by PARQUET-173. The following {{spark-shell}} session can be used to reproduce this bug: {code} import org.apache.spark.sql.SQLContext val sqlContext = new SQLContext(sc) import sc._ import sqlContext._ case class KeyValue(key: Int, value: String) parallelize(1 to 1024 * 1024 * 20). flatMap(i => Seq.fill(10)(KeyValue(i, i.toString))). saveAsParquetFile("large.parquet") parquetFile("large.parquet").registerTempTable("large") hadoopConfiguration.set("parquet.task.side.metadata", "false") sql("SET spark.sql.parquet.filterPushdown=true") sql("SELECT value FROM large WHERE 1024 < value AND value < 2048").collect() {code} From the log we can find: {code} There were no row groups that could be dropped due to filter predicates {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6200) Support dialect in SQL
[ https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6200: Target Version/s: 1.4.0 (was: 1.3.0) Support dialect in SQL -- Key: SPARK-6200 URL: https://issues.apache.org/jira/browse/SPARK-6200 Project: Spark Issue Type: Improvement Components: SQL Reporter: haiyang Created a new dialect manager that supports dialect commands and adding new dialects via SQL statements, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5183) Document data source API
[ https://issues.apache.org/jira/browse/SPARK-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-5183: --- Assignee: Michael Armbrust Document data source API Key: SPARK-5183 URL: https://issues.apache.org/jira/browse/SPARK-5183 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Reporter: Yin Huai Assignee: Michael Armbrust Priority: Critical Fix For: 1.3.0 We need to document the data types the caller needs to support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6370) RDD sampling with replacement intermittently yields incorrect number of samples
[ https://issues.apache.org/jira/browse/SPARK-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364355#comment-14364355 ] Sean Owen commented on SPARK-6370: -- What's the bug? Each element is sampled with probability 0.5. I think the expected size is 14 but not all samples would be that size. RDD sampling with replacement intermittently yields incorrect number of samples --- Key: SPARK-6370 URL: https://issues.apache.org/jira/browse/SPARK-6370 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Environment: Ubuntu 14.04 64-bit, spark-1.3.0-bin-hadoop2.4 Reporter: Marko Bonaci Labels: PoissonSampler, sample, sampler Here's the repl output: {code:java} scala> uniqueIds.collect res10: Array[String] = Array(4, 8, 21, 80, 20, 98, 42, 15, 48, 36, 90, 46, 55, 16, 31, 71, 9, 50, 28, 61, 68, 85, 12, 94, 38, 77, 2, 11, 10) scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[22] at sample at <console>:27 scala> swr.count res17: Long = 16 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[23] at sample at <console>:27 scala> swr.count res18: Long = 8 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[24] at sample at <console>:27 scala> swr.count res19: Long = 18 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[25] at sample at <console>:27 scala> swr.count res20: Long = 15 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[26] at sample at <console>:27 scala> swr.count res21: Long = 11 scala> val swr = uniqueIds.sample(true, 0.5) swr: org.apache.spark.rdd.RDD[String] = PartitionwiseSampledRDD[27] at sample at <console>:27 scala> swr.count res22: Long = 10 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
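To see why the counts vary: sampling with replacement draws each element's multiplicity from (approximately) Poisson(0.5), so only the mean of the sample size is fixed; for the 29-element RDD in the report that mean is 29 * 0.5 = 14.5, and individual samples scatter around it. A quick shell check, using the reporter's uniqueIds:
{code}
// Average the sample size over many trials; it should hover around 14.5.
val counts = (1 to 1000).map(_ => uniqueIds.sample(true, 0.5).count())
println(counts.sum.toDouble / counts.size)
{code}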
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364367#comment-14364367 ] Nitay Joffe commented on SPARK-6250: Is it hard to fix this? It seems to me it would be a pretty straightforward thing to do. It's not always easy to change the underlying data model, for example when working with an existing Hive metastore that is not under your control. Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364373#comment-14364373 ] Michael Armbrust commented on SPARK-6250: - I'm not suggesting you change your data model, just that if you are going to name your columns after data types you escape them with backticks (as you must do when using any reserved word or non-standard characters). When interacting with the Hive metastore I would expect all required escaping to happen automatically. Please let me know if you have a counterexample. Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364375#comment-14364375 ] Nitay Joffe commented on SPARK-6250: How would I do a select * on existing Hive metastore tables which have types in their column names? Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6348) Enable useFeatureScaling in SVMWithSGD
[ https://issues.apache.org/jira/browse/SPARK-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364408#comment-14364408 ] Apache Spark commented on SPARK-6348: - User 'tanyinyan' has created a pull request for this issue: https://github.com/apache/spark/pull/5055 Enable useFeatureScaling in SVMWithSGD -- Key: SPARK-6348 URL: https://issues.apache.org/jira/browse/SPARK-6348 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: tanyinyan Priority: Minor Original Estimate: 2h Remaining Estimate: 2h Currently, useFeatureScaling is set to false by default in class GeneralizedLinearAlgorithm, and it is only enabled in LogisticRegressionWithLBFGS. The SVMWithSGD class is private; train methods are provided in the SVMWithSGD object, so there is no way to set useFeatureScaling when using SVM. I am using SVM on a dataset (https://www.kaggle.com/c/avazu-ctr-prediction/data), training on the first day's data (ignoring the fields id/device_id/device_ip; all remaining fields are considered categorical variables and one-hot encoded, i.e. 'sparsed', before SVM) and predicting on the same data with the threshold cleared; the prediction results are all negative. When I set useFeatureScaling to true, the prediction results are normal (including both negative and positive results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6371) Update version to 1.4.0-SNAPSHOT
Marcelo Vanzin created SPARK-6371: - Summary: Update version to 1.4.0-SNAPSHOT Key: SPARK-6371 URL: https://issues.apache.org/jira/browse/SPARK-6371 Project: Spark Issue Type: Task Components: Build Reporter: Marcelo Vanzin Priority: Critical See summary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6372) spark-submit --conf is not being propagated to child processes
[ https://issues.apache.org/jira/browse/SPARK-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1436#comment-1436 ] Apache Spark commented on SPARK-6372: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/5057 spark-submit --conf is not being propagated to child processes Key: SPARK-6372 URL: https://issues.apache.org/jira/browse/SPARK-6372 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Marcelo Vanzin Priority: Blocker Thanks to [~irashid] for bringing this up. It seems that the new launcher library is incorrectly handling --conf and not passing it down to the child processes. Fix is simple, PR coming up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm
[ https://issues.apache.org/jira/browse/SPARK-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364471#comment-14364471 ] Apache Spark commented on SPARK-6374: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/5058 Add getter for GeneralizedLinearAlgorithm - Key: SPARK-6374 URL: https://issues.apache.org/jira/browse/SPARK-6374 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Priority: Minor Original Estimate: 1h Remaining Estimate: 1h I find it's better to have getters for numFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get the values through debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm
yuhao yang created SPARK-6374: - Summary: Add getter for GeneralizedLinearAlgorithm Key: SPARK-6374 URL: https://issues.apache.org/jira/browse/SPARK-6374 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.1 Reporter: yuhao yang Priority: Minor I find it's better to have getters for numFeatures and addIntercept within GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get the values through debugging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
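A sketch of the getters being requested; the field names and types are assumptions about GeneralizedLinearAlgorithm's internals, not the final API:
{code}
// Read-only accessors so callers no longer need a debugger to see the values.
abstract class GeneralizedLinearAlgorithmSketch {
  protected var numFeatures: Int = -1
  protected var addIntercept: Boolean = false

  def getNumFeatures: Int = numFeatures
  def isAddIntercept: Boolean = addIntercept
}
{code}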
[jira] [Updated] (SPARK-6330) newParquetRelation gets incorrect FileSystem
[ https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6330: Target Version/s: 1.3.1 newParquetRelation gets incorrect FileSystem Key: SPARK-6330 URL: https://issues.apache.org/jira/browse/SPARK-6330 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Fix For: 1.4.0, 1.3.1 Here's a snippet from newParquet.scala: def refresh(): Unit = { val fs = FileSystem.get(sparkContext.hadoopConfiguration) // Support either reading a collection of raw Parquet part-files, or a collection of folders // containing Parquet files (e.g. partitioned Parquet table). val baseStatuses = paths.distinct.map { p => val qualified = fs.makeQualified(new Path(p)) if (!fs.exists(qualified) && maybeSchema.isDefined) { fs.mkdirs(qualified) prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration) } fs.getFileStatus(qualified) }.toArray If we are running this locally and path points to S3, fs would be incorrect. A fix is to construct fs for each file separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
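A sketch of the fix described in the report: resolve the FileSystem from each path instead of from the default configuration, so that e.g. an S3 path gets the right implementation even when the driver's default filesystem is local or HDFS. This mirrors the snippet above and is not necessarily the merged patch:
{code}
import org.apache.hadoop.fs.Path

val baseStatuses = paths.distinct.map { p =>
  val path = new Path(p)
  // Derive the FileSystem from the path's own scheme, not the default FS.
  val fs = path.getFileSystem(sparkContext.hadoopConfiguration)
  val qualified = fs.makeQualified(path)
  if (!fs.exists(qualified) && maybeSchema.isDefined) {
    fs.mkdirs(qualified)
    prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration)
  }
  fs.getFileStatus(qualified)
}.toArray
{code}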
[jira] [Comment Edited] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)
[ https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363767#comment-14363767 ] Manoj Kumar edited comment on SPARK-6192 at 3/17/15 3:16 AM: - [~mengxr] Google Summer of Code applications are open today. I have submitted my proposal here, http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/manojkumar/5654792596619264 It would be great if you could provide feedback, register as a mentor, and do the needful as described here (https://community.apache.org/mentee-ranking-process.html). Thanks! was (Author: mechcoder): [~mengxr] Google Summer of Code applications are open today. I have submitted my proposal here, http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/manojkumar/5654792596619264 It would be great if you could register as a mentor, and do the needful as described here (https://community.apache.org/mentee-ranking-process.html). Thanks! Enhance MLlib's Python API (GSoC 2015) -- Key: SPARK-6192 URL: https://issues.apache.org/jira/browse/SPARK-6192 Project: Spark Issue Type: Umbrella Components: ML, MLlib, PySpark Reporter: Xiangrui Meng Assignee: Manoj Kumar Labels: gsoc, gsoc2015, mentor This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme is to enhance MLlib's Python API, to make it on par with the Scala/Java API. The main tasks are: 1. For all models in MLlib, provide save/load method. This also includes save/load in Scala. 2. Python API for evaluation metrics. 3. Python API for streaming ML algorithms. 4. Python API for distributed linear algebra. 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use customized serialization, making MLLibPythonAPI hard to maintain. It would be nice to use the DataFrames for serialization. I'll link the JIRAs for each of the tasks. Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. The TODO list will be dynamic based on the backlog. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5068) When the path is not found in HDFS, we can't get the result
[ https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364490#comment-14364490 ] Apache Spark commented on SPARK-5068: - User 'lazyman500' has created a pull request for this issue: https://github.com/apache/spark/pull/5059 When the path is not found in HDFS, we can't get the result --- Key: SPARK-5068 URL: https://issues.apache.org/jira/browse/SPARK-5068 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn When a partition path is found in the metastore but not found in HDFS, it will cause some problems, as follows: {noformat} hive> show partitions partition_test; OK dt=1 dt=2 dt=3 dt=4 Time taken: 0.168 seconds, Fetched: 4 row(s) {noformat} {noformat} hive> dfs -ls /user/jeanlyn/warehouse/partition_test; Found 3 items drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4 {noformat} When I run the SQL {noformat} select * from partition_test limit 10 {noformat} in *hive*, I get no problem, but when I run it in *spark-sql* I get the error as follows: {noformat} Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.collect(RDD.scala:780) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at org.apache.spark.sql.hive.testpartition$.main(test.scala:23) at org.apache.spark.sql.hive.testpartition.main(test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} -- This message was sent by Atlassian
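A hedged sketch of one way to guard against this (not necessarily what the pull request above does): drop metastore partition paths that no longer exist on the file system before handing them to the input format.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Keep only partition paths that actually exist; the FileSystem is
// resolved per path, so hdfs://, s3n://, etc. all work.
def existingPaths(paths: Seq[String], conf: Configuration): Seq[String] =
  paths.filter { p =>
    val path = new Path(p)
    path.getFileSystem(conf).exists(path)
  }
{code}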
[jira] [Comment Edited] (SPARK-6340) mllib.IDF for LabeledPoints
[ https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364246#comment-14364246 ] Kian Ho edited comment on SPARK-6340 at 3/17/15 12:01 AM: -- Hi Joseph, I initially considered that as a solution; however, it was my understanding that you couldn't guarantee the same ordering between the instances pre- and post-transformation (since the transformations will be distributed across worker nodes). Hence, you may end up with features zipped with labels they weren't originally assigned to. Is this correct? This question was also mentioned by a couple of users in that thread. Thanks was (Author: kian.ho): Hi Joseph, I initially considered that as a solution, however it was my understanding that you couldn't guarantee the same ordering between the instances pre- and post- transformations (since the transformations will be distributed across worker nodes). Is this correct? This question was also mentioned by a couple of users in that thread. Thanks mllib.IDF for LabeledPoints - Key: SPARK-6340 URL: https://issues.apache.org/jira/browse/SPARK-6340 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Environment: python 2.7.8 pyspark OS: Linux Mint 17 Qiana (Cinnamon 64-bit) Reporter: Kian Ho Priority: Minor Labels: feature as per: http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-td19429.html#a19528 Having IDF.fit accept LabeledPoints would be useful since, correct me if I'm wrong, there currently isn't a way of keeping track of which labels belong to which documents if one needs to apply a conventional tf-idf transformation to labelled text data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
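For reference, a hedged sketch of the workaround under discussion, in Scala (assuming a SparkContext sc and a hypothetical "label<TAB>text" input layout): as long as only order-preserving, partition-preserving operations (map, transform) are applied, zip can safely re-attach labels to the transformed features.
{code}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical input: one document per line, "label<TAB>token token ..."
val labeledDocs = sc.textFile("labeled_docs.txt").map { line =>
  val Array(label, text) = line.split("\t", 2)
  (label.toDouble, text.split(" ").toSeq)
}

val tf = new HashingTF().transform(labeledDocs.map(_._2))
tf.cache() // IDF needs two passes: fit, then transform
val tfidf = new IDF().fit(tf).transform(tf)

// map/transform keep partitioning and per-partition order, so this zip stays aligned
val points = labeledDocs.map(_._1).zip(tfidf).map { case (label, v) => LabeledPoint(label, v) }
{code}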
[jira] [Commented] (SPARK-6340) mllib.IDF for LabeledPoints
[ https://issues.apache.org/jira/browse/SPARK-6340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364321#comment-14364321 ] Kian Ho commented on SPARK-6340: I appreciate the swift response! Happy to keep this issue closed. mllib.IDF for LabeledPoints - Key: SPARK-6340 URL: https://issues.apache.org/jira/browse/SPARK-6340 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Environment: python 2.7.8 pyspark OS: Linux Mint 17 Qiana (Cinnamon 64-bit) Reporter: Kian Ho Priority: Minor Labels: feature as per: http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-td19429.html#a19528 Having IDF.fit accept LabeledPoints would be useful since, correct me if I'm wrong, there currently isn't a way of keeping track of which labels belong to which documents if one needs to apply a conventional tf-idf transformation to labelled text data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6368) Build a specialized serializer for Exchange operator.
[ https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6368: Assignee: Yin Huai Build a specialized serializer for Exchange operator. -- Key: SPARK-6368 URL: https://issues.apache.org/jira/browse/SPARK-6368 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Kryo is still pretty slow because it works on individual objects, and object allocation is relatively expensive. For the Exchange operator, because the schemas of the key and value are already defined, we can create a specialized serializer that handles those specific schemas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
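A toy sketch of the idea (the Int key / Long value schema is an assumption for illustration, not the actual Exchange row format): once the schema is fixed, rows can be written as raw fields, with none of the per-object class tags or field metadata a general-purpose serializer emits.
{code}
import java.io.{DataInputStream, DataOutputStream}

// Schema known up front: key = Int, value = Long.
def writeRow(out: DataOutputStream, key: Int, value: Long): Unit = {
  out.writeInt(key)    // no type info written; the schema implies it
  out.writeLong(value)
}

def readRow(in: DataInputStream): (Int, Long) =
  (in.readInt(), in.readLong())
{code}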
[jira] [Commented] (SPARK-6293) SQLContext.implicits should provide automatic conversion for RDD[Row]
[ https://issues.apache.org/jira/browse/SPARK-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364340#comment-14364340 ] Chen Song commented on SPARK-6293: -- OK, I have created a pull request: https://github.com/apache/spark/pull/5040. I'm the user KAKA1992. SQLContext.implicits should provide automatic conversion for RDD[Row] - Key: SPARK-6293 URL: https://issues.apache.org/jira/browse/SPARK-6293 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley When a DataFrame is converted to an RDD[Row], it should be easier to convert it back to a DataFrame via toDF. E.g.: {code} val df: DataFrame = myRDD.toDF("col1", "col2") // This works for types like RDD[scala.Tuple2[...]] val splits = df.rdd.randomSplit(...) val split0: RDD[Row] = splits(0) val df0 = split0.toDF("col1", "col2") // This fails {code} The failure happens because SQLContext.implicits does not provide an automatic conversion for Rows. (It does handle Products, but Row does not implement Product.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
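Until such a conversion exists, a hedged workaround sketch using the snippet's own names: rebuild the DataFrame explicitly from the RDD[Row] plus the schema of the original DataFrame.
{code}
// randomSplit does not change the row structure, so the original schema still applies
val df0 = sqlContext.createDataFrame(split0, df.schema)
{code}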
[jira] [Commented] (SPARK-6250) Types are now reserved words in DDL parser.
[ https://issues.apache.org/jira/browse/SPARK-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14364382#comment-14364382 ] Yin Huai commented on SPARK-6250: - [~nitay] Have you tried backticks? Does it work? Types are now reserved words in DDL parser. --- Key: SPARK-6250 URL: https://issues.apache.org/jira/browse/SPARK-6250 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
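For reference, the kind of escaping being asked about (table and column names hypothetical; whether the DDL parser accepts backticks is exactly the open question above):
{code}
// A column whose name collides with a type keyword, quoted with backticks
sqlContext.sql("CREATE TEMPORARY TABLE t (`timestamp` string, value string) USING json OPTIONS (path 'events.json')")
{code}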
[jira] [Commented] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file
[ https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363232#comment-14363232 ] Sean Owen commented on SPARK-6355: -- Oh, I learned something then. Yeah that looks like the intended behavior and this should work. Maybe it is just not applied in standalone mode for the app jar. Spark standalone cluster does not support local:/ url for jar file -- Key: SPARK-6355 URL: https://issues.apache.org/jira/browse/SPARK-6355 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Reporter: Jesper Lundgren Submitting a new spark application to a standalone cluster with local:/path will result in an exception. Driver successfully submitted as driver-20150316171157-0004 ... waiting before polling master for driver state ... polling master for driver state State of driver-20150316171157-0004 is ERROR Exception from cluster was: java.io.IOException: No FileSystem for scheme: local java.io.IOException: No FileSystem for scheme: local at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file
[ https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363227#comment-14363227 ] Jesper Lundgren edited comment on SPARK-6355 at 3/16/15 2:14 PM: - [~srowen] Thank you for your reply. I use spark-submit --class class.Main local:/application.jar . https://spark.apache.org/docs/1.2.1/submitting-applications.html under Advanced Dependency Management mentions that local:/ can be used when a jar is pre-distributed instead of uploaded using the built-in file server. Maybe I am misunderstanding, but I believe it is meant to work for the main application jar as well as for the --jars config option. I am running a standalone cluster with ZooKeeper HA and have on occasion had problems with crashes on restart due to the Spark file server being unavailable to distribute the jar to the worker nodes (I can't reliably reproduce this yet). I intended to use local:/ as a fix, but it seems this option does not work in standalone clusters. was (Author: koudelka): [~srowen] spark-submit --class class.Main local:/application.jar . https://spark.apache.org/docs/1.2.1/submitting-applications.html under Advanced Dependency Management mentions local:/ can be used when a jar is pre-distributed instead of uploading using the built in file server. Maybe I am misunderstanding but I believe it is meant to work for the main application jar as well as for --jars config option. I am running standalone cluster with Zookeeper HA and have on occasion had problem crashing on restart due to the spark fileserver being unavailable to distribute the jar to the worker nodes (I can't reliably reproduce this yet). I intended to use local:/ as a fix but seems this option does not work in standalone cluster. Spark standalone cluster does not support local:/ url for jar file -- Key: SPARK-6355 URL: https://issues.apache.org/jira/browse/SPARK-6355 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Reporter: Jesper Lundgren Submitting a new spark application to a standalone cluster with local:/path will result in an exception. Driver successfully submitted as driver-20150316171157-0004 ... waiting before polling master for driver state ... polling master for driver state State of driver-20150316171157-0004 is ERROR Exception from cluster was: java.io.IOException: No FileSystem for scheme: local java.io.IOException: No FileSystem for scheme: local at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file
[ https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363242#comment-14363242 ] Jesper Lundgren edited comment on SPARK-6355 at 3/16/15 2:23 PM: - I haven't tried to see if --jars is working properly. There is (or at least used to be) a similar issue with the spark-submit --master option. The Spark standalone cluster documentation says that multiple master hosts can be provided as a comma-separated list, like spark://host1:port1,host2:port2, but it did not work (at least in 1.2.0). It does work, however, when setting the master URL within the driver code while starting the SparkContext. I am wondering if I can do a similar workaround for this issue. was (Author: koudelka): I haven't tried to see if --jars is working properly. There is (or at least used to be) an similar issue with spark-submit --master option. The spark standalone cluster documentation says that multiple master hosts can be provided as a comma separated list like spark://host1:port1,host2:port2 but it did not work (at least in 1.2.0 ). But it does work when setting the master url within the driver code starting the SparkContext. I am wondering if I can do a similar workaround for this issue. Spark standalone cluster does not support local:/ url for jar file -- Key: SPARK-6355 URL: https://issues.apache.org/jira/browse/SPARK-6355 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Reporter: Jesper Lundgren Submitting a new spark application to a standalone cluster with local:/path will result in an exception. Driver successfully submitted as driver-20150316171157-0004 ... waiting before polling master for driver state ... polling master for driver state State of driver-20150316171157-0004 is ERROR Exception from cluster was: java.io.IOException: No FileSystem for scheme: local java.io.IOException: No FileSystem for scheme: local at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
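The driver-side workaround described in that comment, sketched under the assumption of a two-master ZooKeeper HA setup (host names hypothetical):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Pass the comma-separated HA master list directly to the SparkContext
// instead of via spark-submit --master.
val conf = new SparkConf()
  .setMaster("spark://host1:7077,host2:7077")
  .setAppName("MyApp")
val sc = new SparkContext(conf)
{code}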
[jira] [Commented] (SPARK-6355) Spark standalone cluster does not support local:/ url for jar file
[ https://issues.apache.org/jira/browse/SPARK-6355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363227#comment-14363227 ] Jesper Lundgren commented on SPARK-6355: [~srowen] spark-submit --class class.Main local:/application.jar . https://spark.apache.org/docs/1.2.1/submitting-applications.html under Advanced Dependency Management mentions that local:/ can be used when a jar is pre-distributed instead of uploaded using the built-in file server. Maybe I am misunderstanding, but I believe it is meant to work for the main application jar as well as for the --jars config option. I am running a standalone cluster with ZooKeeper HA and have on occasion had problems with crashes on restart due to the Spark file server being unavailable to distribute the jar to the worker nodes (I can't reliably reproduce this yet). I intended to use local:/ as a fix, but it seems this option does not work in standalone clusters. Spark standalone cluster does not support local:/ url for jar file -- Key: SPARK-6355 URL: https://issues.apache.org/jira/browse/SPARK-6355 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Reporter: Jesper Lundgren Submitting a new spark application to a standalone cluster with local:/path will result in an exception. Driver successfully submitted as driver-20150316171157-0004 ... waiting before polling master for driver state ... polling master for driver state State of driver-20150316171157-0004 is ERROR Exception from cluster was: java.io.IOException: No FileSystem for scheme: local java.io.IOException: No FileSystem for scheme: local at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:141) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3278) Isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362986#comment-14362986 ] Martin Zapletal commented on SPARK-3278: Vladimir, just to update you on the progress. I was able to complete the isotonic regression with 100M records, but failed with an insufficient memory error with 150M records on my machine. You may be able to run with larger amounts of data on better machines. The pool adjacent violators algorithm can theoretically have linear time complexity, but although I have used the best algorithm I could find, I am not convinced it reaches this efficiency. I will work on providing evidence. The biggest issue with the current algorithm is, however, the parallelization approach. Its properties are unfortunately nowhere near linear scalability (a linear solution-time increase with a linear parallelism increase, or constant solution time with a linear parallelism increase and a linear problem-size increase). This was expected and is caused by the algorithm itself, for the following reasons: 1) The algorithm works in two steps. First the computation is distributed to all partitions, but then it is gathered and the algorithm is run again on the whole data set. This approach may leave most of the work for the last sequential step and thus gain very little compared to a purely sequential implementation, or even perform worse. That can happen in cases where the parallel isotonic regressions return a locally optimal solution that will however have to change for a global solution in the last step. Another performance drawback in comparison to sequential processing is the potential need to copy data to each process. 2) It requires the whole dataset to fit into one process' memory in the last step (or repeated disk access). I started looking into the issue and was able to design an iterative algorithm that addressed both the above issues and performed very close to linear scalability. It however still has correctness (rounding) issues and will require further research. Let me know if that helped. In the meantime I will continue working on benchmarks and performance quantification of the current algorithm as well as on research for potentially more efficient solutions. Isotonic regression --- Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add isotonic regression for score calibration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
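A schematic sketch of the two-step scheme described in point 1, assuming a sequential poolAdjacentViolators helper over (label, feature, weight) triples (the helper and the triple encoding are illustrative, not the actual MLlib code):
{code}
import org.apache.spark.rdd.RDD

def parallelPAV(
    input: RDD[(Double, Double, Double)], // (label, feature, weight)
    poolAdjacentViolators: Array[(Double, Double, Double)] => Array[(Double, Double, Double)]
  ): Array[(Double, Double, Double)] = {
  val partial = input
    .sortBy(_._2)                                                    // order by feature
    .mapPartitions(it => poolAdjacentViolators(it.toArray).iterator) // step 1: local PAV per partition
    .collect()                                                       // gather local solutions to the driver
  poolAdjacentViolators(partial)                                     // step 2: sequential pass over everything
}
{code}
Step 2 is what the comment identifies as the bottleneck: it is sequential and must hold all the gathered data in one process.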
[jira] [Commented] (SPARK-6299) ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL.
[ https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363239#comment-14363239 ] Apache Spark commented on SPARK-6299: - User 'swkimme' has created a pull request for this issue: https://github.com/apache/spark/pull/5046 ClassNotFoundException in standalone mode when running groupByKey with class defined in REPL. - Key: SPARK-6299 URL: https://issues.apache.org/jira/browse/SPARK-6299 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.0, 1.2.1 Reporter: Kevin (Sangwoo) Kim Anyone can reproduce this issue with the code below (it runs well in local mode but throws an exception on clusters; it also runs well in Spark 1.1.1): case class ClassA(value: String) val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2")))) rdd.groupByKey.collect org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at
[jira] [Comment Edited] (SPARK-3278) Isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-3278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362986#comment-14362986 ] Martin Zapletal edited comment on SPARK-3278 at 3/16/15 9:37 AM: - Vladimir, just to update you on the progress. I was able to complete the isotonic regression with 100M records (similar data to what you requested, using NASDAQ index prices), but failed with an insufficient memory error with 150M records on my machine. You may be able to run with larger amounts of data on better machines. The pool adjacent violators algorithm can theoretically have linear time complexity, but although I have used the best algorithm I could find, I am not convinced it reaches this efficiency. I will work on providing evidence. The biggest issue with the current algorithm is, however, the parallelization approach. Its properties are unfortunately nowhere near linear scalability (a linear solution-time increase with a linear parallelism increase, or constant solution time with a linear parallelism increase and a linear problem-size increase). This was expected and is caused by the algorithm itself, for the following reasons: 1) The algorithm works in two steps. First the computation is distributed to all partitions, but then it is gathered and the algorithm is run again on the whole data set. This approach may leave most of the work for the last sequential step and thus gain very little compared to a purely sequential implementation, or even perform worse. That can happen in cases where the parallel isotonic regressions return a locally optimal solution that will however have to change for a global solution in the last step. Another performance drawback in comparison to sequential processing is the potential need to copy data to each process. 2) It requires the whole dataset to fit into one process' memory in the last step (or repeated disk access). I started looking into the issue and was able to design an iterative algorithm that addressed both the above issues and performed very close to linear scalability. It however still has correctness (rounding) issues and will require further research. Let me know if that helped. In the meantime I will continue working on benchmarks and performance quantification of the current algorithm as well as on research for potentially more efficient solutions. was (Author: zapletal-martin): Vladimir, just to update you on the progress. I was able to complete the isotonic regression with 100M records, but failed with insufficient memory error with 150M records on my machine. You may be able to run with larger amounts of data on better machines. Pool adjacent violators algorithm can theoretically have linear time complexity, but although I have used the best algorithm I could find I am not convinced it reaches this efficiency. I will work on providing evidence. The biggest issue with the current algorithm is however with the parallelization approach. Its properties are unfortunately nowhere near linear scalability (linear solution time increase with linear parallelism increase or constant solution time with linear parallelism increase and linear problem size increase). This was expected and is caused by the algorithm itself for the following reasons 1) The algorithm works in two steps. First the computation is distributed to all partitions, but then gathered and the algorithm is run again on the whole data set.
This approach may leave most of work for the last sequential step and thus gaining very little compared to purely sequential implementation or even performing worse. That can happen in case where parallel isotonic regressions return a locally optimal solution that will however have to change for a global solution in the last step. Another performance drawback in comparison to sequential processing is the potential need to copy data to each process. 2) It requires the whole dataset to fit into one process’ memory in the last step (or repeated disk access). I started looking into the issue and was able to design an iterative algorithm that adressed both the above issues and performed very close to linear scalability. It however still has correctness (rounding) issues and will require further research. Let me know if that helped. In the meantime I will continue working on benchmarks and performance quantification of the current algorithm as well as on research for potentially more efficient solutions. Isotonic regression --- Key: SPARK-3278 URL: https://issues.apache.org/jira/browse/SPARK-3278 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add isotonic regression for score calibration. -- This message was sent by
[jira] [Resolved] (SPARK-6300) sc.addFile(path) does not support the relative path.
[ https://issues.apache.org/jira/browse/SPARK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6300. -- Resolution: Fixed Fix Version/s: 1.3.1 1.4.0 Issue resolved by pull request 4993 [https://github.com/apache/spark/pull/4993] sc.addFile(path) does not support the relative path. Key: SPARK-6300 URL: https://issues.apache.org/jira/browse/SPARK-6300 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.2.1 Reporter: DoingDone9 Assignee: DoingDone9 Priority: Critical Fix For: 1.4.0, 1.3.1 When I run a command like sc.addFile("../test.txt"), it does not work and throws an exception: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:../test.txt at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.<init>(Path.java:172) ... Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:../test.txt at java.net.URI.checkPath(URI.java:1804) at java.net.URI.<init>(URI.java:752) at org.apache.hadoop.fs.Path.initialize(Path.java:203) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
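For versions without the fix, a hedged workaround sketch: resolve the relative path to an absolute one on the driver before calling addFile.
{code}
import java.io.File

// Expand the relative path so the URI Hadoop's Path builds is absolute
sc.addFile(new File("../test.txt").getAbsolutePath)
{code}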
[jira] [Commented] (SPARK-6316) add a parameter for the SparkContext(conf).textFile() method, supporting multi-encoding HDFS files, e.g. gbk
[ https://issues.apache.org/jira/browse/SPARK-6316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14363161#comment-14363161 ] yunzhi.lyz commented on SPARK-6316: --- I had a try. Reading a file with a non-UTF-8 encoding, code example: sc.hadoopFile("/inputdir", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 5).map(pair => new String(pair._2.getBytes(), 0, pair._2.getLength(), "gbk")) Writing a file with a non-UTF-8 encoding, code example: file.map(x => (NullWritable.get(), new Text(String.valueOf(x).getBytes("gbk")))).saveAsHadoopFile[TextOutputFormat[NullWritable, Text]]("/output") RESOLVED: the question of sc.textFile and rdd.saveAsTextFile not supporting non-UTF-8 encodings. add a parameter for the SparkContext(conf).textFile() method, supporting multi-encoding HDFS files, e.g. gbk Key: SPARK-6316 URL: https://issues.apache.org/jira/browse/SPARK-6316 Project: Spark Issue Type: New Feature Environment: linux LANG=en_US.UTF-8 Reporter: yunzhi.lyz Add a parameter to the SparkContext(conf).textFile() method to support multi-encoding HDFS files, e.g. val file = new SparkContext(conf).textFile(args(0), 10, "gbk") Modify the code in org.apache.spark.SparkContext: + def defaultEncoding: String = "utf-8" -- def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = { hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions).map(pair => pair._2.toString).setName(path) } ++ def textFile(path: String, minPartitions: Int = defaultMinPartitions, encoding: String = defaultEncoding): RDD[String] = { hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions).map(pair => new String(pair._2.getBytes(), 0, pair._2.getLength(), encoding)).setName(path) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6356) Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext
Yadong Qi created SPARK-6356: Summary: Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext Key: SPARK-6356 URL: https://issues.apache.org/jira/browse/SPARK-6356 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yadong Qi Support for the expression below: ``` GROUP BY expression list WITH ROLLUP GROUP BY expression list WITH CUBE GROUP BY expression list GROUPING SETS (expression list2) ``` And ``` GROUP BY ROLLUP(expression list) GROUP BY CUBE(expression list) GROUP BY expression list GROUPING SETS(expression list2) ``` And ``` GROUPING (expression list) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6356) Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yadong Qi updated SPARK-6356: - Description: Support for the expression below: ``` GROUP BY expression list WITH ROLLUP GROUP BY expression list WITH CUBE GROUP BY expression list GROUPING SETS (expression list2) ``` And ``` GROUP BY ROLLUP(expression list) GROUP BY CUBE(expression list) GROUP BY expression list GROUPING SETS(expression list2) ``` And ``` GROUPING (expression list) ``` was: Support for the expression below: ` GROUP BY expression list WITH ROLLUP GROUP BY expression list WITH CUBE GROUP BY expression list GROUPING SETS (expression list2) ` And ``` GROUP BY ROLLUP(expression list) GROUP BY CUBE(expression list) GROUP BY expression list GROUPING SETS(expression list2) ``` And ``` GROUPING (expression list) ``` Support the ROLLUP/CUBE/GROUPING SETS/grouping() in SQLContext -- Key: SPARK-6356 URL: https://issues.apache.org/jira/browse/SPARK-6356 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yadong Qi Support for the expression below: ``` GROUP BY expression list WITH ROLLUP GROUP BY expression list WITH CUBE GROUP BY expression list GROUPING SETS (expression list2) ``` And ``` GROUP BY ROLLUP(expression list) GROUP BY CUBE(expression list) GROUP BY expression list GROUPING SETS(expression list2) ``` And ``` GROUPING (expression list) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
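For illustration, what the requested syntax would look like once SQLContext's parser supports it (table and columns hypothetical):
{code}
// ROLLUP adds subtotal rows for each prefix of the grouping columns
val rolledUp = sqlContext.sql(
  "SELECT dept, role, SUM(salary) FROM employees GROUP BY dept, role WITH ROLLUP")
{code}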
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362996#comment-14362996 ] Lior Chaga commented on SPARK-6305: --- This works by adding the log4j 2.x jars, together with the log4j-1.2-api bridge, to the classpath via SPARK_CLASSPATH. There is no need to change the Spark distribution. Closed the pull request. Add support for log4j 2.x to Spark -- Key: SPARK-6305 URL: https://issues.apache.org/jira/browse/SPARK-6305 Project: Spark Issue Type: New Feature Components: Build Reporter: Tal Sliwowicz log4j 2 requires replacing the slf4j binding and adding the log4j jars to the classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org