[jira] [Assigned] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely
[ https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5681: --- Assignee: (was: Apache Spark) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely --- Key: SPARK-5681 URL: https://issues.apache.org/jira/browse/SPARK-5681 Project: Spark Issue Type: Bug Components: Streaming Reporter: Liang-Chi Hsieh Sometimes the receiver is registered with the tracker only after ssc.stop() has been called, especially when stop() is called immediately after start(). In that case the receiver never gets the StopReceiver message from the tracker, so calling stop() in graceful mode gets stuck indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
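To illustrate the failure mode, here is a minimal sketch of the call pattern that can hang; the socket stream is only a stand-in for any receiver-based input, and local mode plus the port number are arbitrary choices for the example:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("graceful-stop-repro")
val ssc = new StreamingContext(conf, Seconds(1))

// Any receiver-based input stream will do for the repro.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY)
lines.print()

ssc.start()
// Stopping right after start(): the receiver may register with the
// ReceiverTracker only after stop() has begun, so it never sees the
// StopReceiver message and the graceful stop below can block indefinitely.
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}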
[jira] [Assigned] (SPARK-5681) Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely
[ https://issues.apache.org/jira/browse/SPARK-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5681: --- Assignee: Apache Spark Calling graceful stop() immediately after start() on StreamingContext should not get stuck indefinitely --- Key: SPARK-5681 URL: https://issues.apache.org/jira/browse/SPARK-5681 Project: Spark Issue Type: Bug Components: Streaming Reporter: Liang-Chi Hsieh Assignee: Apache Spark Sometimes the receiver is registered with the tracker only after ssc.stop() has been called, especially when stop() is called immediately after start(). In that case the receiver never gets the StopReceiver message from the tracker, so calling stop() in graceful mode gets stuck indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.
[ https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6694: --- Assignee: (was: Apache Spark) SparkSQL CLI must be able to specify an option --database on the command line. -- Key: SPARK-6694 URL: https://issues.apache.org/jira/browse/SPARK-6694 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Jin Adachi SparkSQL CLI has an option --database as follows. But, an option --database doesn't work properly. {code:} $ spark-sql --help : CLI options: : --database databasename Specify the database to use ``` {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.
[ https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394191#comment-14394191 ] Apache Spark commented on SPARK-6694: - User 'adachij2002' has created a pull request for this issue: https://github.com/apache/spark/pull/5345 SparkSQL CLI must be able to specify an option --database on the command line. -- Key: SPARK-6694 URL: https://issues.apache.org/jira/browse/SPARK-6694 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Jin Adachi SparkSQL CLI has an option --database as follows. But, an option --database doesn't work properly. {code:} $ spark-sql --help : CLI options: : --database databasename Specify the database to use ``` {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6428) Add to style checker public method must have explicit type defined
[ https://issues.apache.org/jira/browse/SPARK-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394121#comment-14394121 ] Apache Spark commented on SPARK-6428: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5342 Add to style checker public method must have explicit type defined Key: SPARK-6428 URL: https://issues.apache.org/jira/browse/SPARK-6428 Project: Spark Issue Type: Improvement Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Otherwise it is too easy to accidentally leak or define an incorrect return type in user facing APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
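As a small illustration of the kind of leak the style check would catch (the class and method here are hypothetical, not Spark code):
{code}
import scala.collection.mutable.ArrayBuffer

// Without an explicit return type, the inferred type exposes the concrete,
// mutable implementation in the public API.
class WordIndex {
  def lookup(word: String) = ArrayBuffer(1, 2, 3)   // inferred: ArrayBuffer[Int]
}

// With an explicit, general return type, the implementation can change freely
// without breaking source or binary compatibility for callers.
class WordIndexFixed {
  def lookup(word: String): Seq[Int] = ArrayBuffer(1, 2, 3)
}
{code}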
[jira] [Assigned] (SPARK-6428) Add to style checker public method must have explicit type defined
[ https://issues.apache.org/jira/browse/SPARK-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6428: --- Assignee: Apache Spark (was: Reynold Xin) Add to style checker public method must have explicit type defined Key: SPARK-6428 URL: https://issues.apache.org/jira/browse/SPARK-6428 Project: Spark Issue Type: Improvement Components: Build Reporter: Reynold Xin Assignee: Apache Spark Otherwise it is too easy to accidentally leak or define an incorrect return type in user facing APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-1095) Ensure all public methods return explicit types
[ https://issues.apache.org/jira/browse/SPARK-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-1095: Assignee: Reynold Xin (was: prashant) Ensure all public methods return explicit types --- Key: SPARK-1095 URL: https://issues.apache.org/jira/browse/SPARK-1095 Project: Spark Issue Type: Sub-task Reporter: Patrick Wendell Assignee: Reynold Xin Fix For: 1.0.0 This talk explains some of the challenges around typing for binary compatibility: http://www.slideshare.net/mircodotta/managing-binary-compatibility-in-scala For public methods we should always declare the type. We've had this as a guideline in the past but we need to make sure we obey it in all public interfaces. Also, we should return the most general type possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6692) Make it possible to kill AM in YARN cluster mode when the client is terminated
[ https://issues.apache.org/jira/browse/SPARK-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6692: --- Assignee: Cheolsoo Park Make it possible to kill AM in YARN cluster mode when the client is terminated -- Key: SPARK-6692 URL: https://issues.apache.org/jira/browse/SPARK-6692 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Cheolsoo Park Assignee: Cheolsoo Park Priority: Minor Labels: yarn I understand that the yarn-cluster mode is designed for fire-and-forget model; therefore, terminating the yarn client doesn't kill AM. However, it is very common that users submit Spark jobs via job scheduler (e.g. Apache Oozie) or remote job server (e.g. Netflix Genie) where it is expected that killing the yarn client will terminate AM. It is true that the yarn-client mode can be used in such cases. But then, the yarn client sometimes needs lots of heap memory for big jobs if it runs in the yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs because AM can be given arbitrary heap memory unlike the yarn client. So it would be very useful to make it possible to kill AM even in the yarn-cluster mode. In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon as they're accepted (but not yet running). Although they're eventually shutdown after AM timeout, it would be nice if AM could immediately get killed in such cases too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
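For callers who control the submitting process, one stop-gap outside Spark is a client-side shutdown hook that asks the ResourceManager to kill the application when the client JVM exits; this is only a sketch of that idea (it assumes the client already knows the ApplicationId, and it is not the in-Spark change this ticket asks for):
{code}
import org.apache.hadoop.yarn.api.records.ApplicationId
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

def killAppOnClientExit(appId: ApplicationId): Unit = {
  sys.addShutdownHook {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()
    try {
      // Ask YARN to kill the AM when this client process terminates.
      yarnClient.killApplication(appId)
    } finally {
      yarnClient.stop()
    }
  }
}
{code}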
[jira] [Commented] (SPARK-6693) add to string with max lines and width for matrix
[ https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394146#comment-14394146 ] Apache Spark commented on SPARK-6693: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/5344 add to string with max lines and width for matrix - Key: SPARK-6693 URL: https://issues.apache.org/jira/browse/SPARK-6693 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: yuhao yang Priority: Minor Original Estimate: 2h Remaining Estimate: 2h It's kind of annoying when debugging and found you cannot print out the matrix as you want. original toString of Matrix only print like following, 0.178101025969091830.5616906241468385... (100 total) 0.9692861997823815 0.015558159784155756 ... 0.8513015122819192 0.031523763918528847 ... 0.5396875653953941 0.3267864552779176... The def toString(maxLines : Int, maxWidth : Int) is useful when debuging, logging and saving matrix to files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6693) add to string with max lines and width for matrix
[ https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6693: --- Assignee: (was: Apache Spark) add to string with max lines and width for matrix - Key: SPARK-6693 URL: https://issues.apache.org/jira/browse/SPARK-6693 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: yuhao yang Priority: Minor Original Estimate: 2h Remaining Estimate: 2h It's kind of annoying when debugging and found you cannot print out the matrix as you want. original toString of Matrix only print like following, 0.178101025969091830.5616906241468385... (100 total) 0.9692861997823815 0.015558159784155756 ... 0.8513015122819192 0.031523763918528847 ... 0.5396875653953941 0.3267864552779176... The def toString(maxLines : Int, maxWidth : Int) is useful when debuging, logging and saving matrix to files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394147#comment-14394147 ] Florian Verhein commented on SPARK-6664: I guess the other thing is - we can union RDDs, so why not be able to 'undo' that? Split Ordered RDD into multiple RDDs by keys (boundaries or intervals) -- Key: SPARK-6664 URL: https://issues.apache.org/jira/browse/SPARK-6664 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Florian Verhein I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ml models. *Use case example* suppose you have pre-processed web logs for a few months, and now want to split it into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out-of-time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want to have multiple overlapping intervals. *Specification* 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith boundary. 2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing values within those intervals. *Implementation ideas / notes for 1* - The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD) - Find the partitions containing the boundary, and split them in two. - Construct the new RDDs from the original partitions (and any split ones) I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation. *Implementation ideas / notes for 2* This is very similar, except that we have to handle entire partitions (or parts of them) belonging to more than one output RDD, since they are no longer mutually exclusive. But since RDDs are immutable(??), the decorator idea should still work? Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
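Absent a dedicated primitive, a rough workaround for case 1 on a key-sorted pair RDD is to call filterByRange once per interval; this runs one filter per output RDD rather than the single-pass split proposed above, so it is only an illustrative sketch (Long keys and the helper name are assumptions):
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Split a key-sorted (e.g. sortByKey'd) pair RDD into one RDD per consecutive
// pair of boundaries. Note that filterByRange bounds are inclusive, so each
// boundary key appears in two adjacent outputs.
def splitByBoundaries[V: ClassTag](sorted: RDD[(Long, V)], boundaries: Seq[Long]): Seq[RDD[(Long, V)]] = {
  val edges = Long.MinValue +: boundaries :+ Long.MaxValue
  edges.sliding(2).map { case Seq(lower, upper) =>
    // filterByRange can skip whole partitions when the RDD is RangePartitioned.
    sorted.filterByRange(lower, upper)
  }.toSeq
}
{code}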
[jira] [Created] (SPARK-6693) add to string with max lines and width for matrix
yuhao yang created SPARK-6693: - Summary: add to string with max lines and width for matrix Key: SPARK-6693 URL: https://issues.apache.org/jira/browse/SPARK-6693 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: yuhao yang Priority: Minor It's kind of annoying when debugging to find you cannot print out the matrix the way you want. The original toString of Matrix only prints output like the following: 0.17810102596909183 0.5616906241468385 ... (100 total) 0.9692861997823815 0.015558159784155756 ... 0.8513015122819192 0.031523763918528847 ... 0.5396875653953941 0.3267864552779176 ... A def toString(maxLines: Int, maxWidth: Int) would be useful when debugging, logging and saving a matrix to files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
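A sketch of how the proposed overload might be used; the two-argument toString below is the method this ticket proposes, so its exact signature is an assumption, and the small matrix is just example data:
{code}
import org.apache.spark.mllib.linalg.Matrices

val m = Matrices.dense(2, 3, Array(0.17, 0.96, 0.56, 0.015, 0.85, 0.03))

// Existing behaviour: print whatever the default toString produces.
println(m.toString)

// Proposed: cap both the number of lines and the line width, e.g. for logging.
println(m.toString(10, 80))
{code}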
[jira] [Assigned] (SPARK-6693) add to string with max lines and width for matrix
[ https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6693: --- Assignee: Apache Spark add to string with max lines and width for matrix - Key: SPARK-6693 URL: https://issues.apache.org/jira/browse/SPARK-6693 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: yuhao yang Assignee: Apache Spark Priority: Minor Original Estimate: 2h Remaining Estimate: 2h It's kind of annoying when debugging and found you cannot print out the matrix as you want. original toString of Matrix only print like following, 0.178101025969091830.5616906241468385... (100 total) 0.9692861997823815 0.015558159784155756 ... 0.8513015122819192 0.031523763918528847 ... 0.5396875653953941 0.3267864552779176... The def toString(maxLines : Int, maxWidth : Int) is useful when debuging, logging and saving matrix to files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6693) add toString with max lines and width for matrix
[ https://issues.apache.org/jira/browse/SPARK-6693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-6693: -- Summary: add toString with max lines and width for matrix (was: add to string with max lines and width for matrix) add toString with max lines and width for matrix Key: SPARK-6693 URL: https://issues.apache.org/jira/browse/SPARK-6693 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: yuhao yang Priority: Minor Original Estimate: 2h Remaining Estimate: 2h It's kind of annoying when debugging and found you cannot print out the matrix as you want. original toString of Matrix only print like following, 0.178101025969091830.5616906241468385... (100 total) 0.9692861997823815 0.015558159784155756 ... 0.8513015122819192 0.031523763918528847 ... 0.5396875653953941 0.3267864552779176... The def toString(maxLines : Int, maxWidth : Int) is useful when debuging, logging and saving matrix to files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6211) Test Python Kafka API using Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6211: --- Assignee: Saisai Shao (was: Apache Spark) Test Python Kafka API using Python unit tests - Key: SPARK-6211 URL: https://issues.apache.org/jira/browse/SPARK-6211 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Saisai Shao Priority: Critical This is tricky in python because the KafkaStreamSuiteBase (which has the functionality of creating embedded kafka clusters) is in the test package, which is not in the python path. To fix that, we have two ways. 1. Add the test jar to the classpath in the python test. That one is somewhat trickier. 2. Bring that into the src package (maybe renamed as KafkaTestUtils), and then wrap it in python to use it from python. If (2) does not add any extra test dependencies to the main Kafka pom, then 2 should be simpler to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6211) Test Python Kafka API using Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6211: --- Assignee: Apache Spark (was: Saisai Shao) Test Python Kafka API using Python unit tests - Key: SPARK-6211 URL: https://issues.apache.org/jira/browse/SPARK-6211 Project: Spark Issue Type: Test Components: Streaming, Tests Reporter: Tathagata Das Assignee: Apache Spark Priority: Critical This is tricky in python because the KafkaStreamSuiteBase (which has the functionality of creating embedded kafka clusters) is in the test package, which is not in the python path. To fix that, we have to ways. 1. Add test jar to classpath in python test. Thats kind of trickier. 2. Bring that into the src package (maybe renamed as KafkaTestUtils), and then wrap that in python to use it from python. If (2) does not add any extra test dependencies to the main Kafka pom, then 2 should be simpler to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
Saisai Shao created SPARK-6691: -- Summary: Abstract and add a dynamic RateLimiter for Spark Streaming Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, especially for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. Given Spark Streaming's job generation and processing pattern, this delay accumulates and introduces unacceptable failures. Currently, Spark Streaming's receiver-based input streams have a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to choose an appropriate value before the application is running. # This configuration is fixed for the lifetime of the application, which means you need to assume the worst-case scenario to set a reasonable value. # Input streams like DirectKafkaInputStream need to maintain a separate solution to achieve the same functionality. # The lack of slow-start control easily traps the whole system in large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new RateLimiter interface, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. * Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. * Slow-start the rate to control network congestion when a job is started. * A pluggable framework to make maintenance and extension easier. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comment would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
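As a very rough illustration of the pluggable, feedback-driven shape described above; the trait, class and adjustment formula below are hypothetical and are not taken from the linked design doc:
{code}
// Hypothetical sketch of a pluggable rate limiter driven by batch feedback.
trait StreamRateLimiter {
  /** Maximum records the stream may ingest during the next batch. */
  def currentRate: Long
  /** Feedback from a finished batch so the limit can adapt. */
  def onBatchCompleted(processedRecords: Long, processingTimeMs: Long, batchIntervalMs: Long): Unit
}

class SimpleDynamicRateLimiter(initialRate: Long) extends StreamRateLimiter {
  @volatile private var rate = initialRate

  override def currentRate: Long = rate

  override def onBatchCompleted(processed: Long, timeMs: Long, intervalMs: Long): Unit = {
    // Move the allowed rate toward what the cluster actually sustained.
    val sustained = if (timeMs > 0) processed * intervalMs / timeMs else rate
    rate = math.max(1L, (rate + sustained) / 2)
  }
}
{code}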
[jira] [Assigned] (SPARK-6428) Add to style checker public method must have explicit type defined
[ https://issues.apache.org/jira/browse/SPARK-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6428: --- Assignee: Reynold Xin (was: Apache Spark) Add to style checker public method must have explicit type defined Key: SPARK-6428 URL: https://issues.apache.org/jira/browse/SPARK-6428 Project: Spark Issue Type: Improvement Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Otherwise it is too easy to accidentally leak or define an incorrect return type in user facing APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5523: --- Assignee: (was: Apache Spark) TaskMetrics and TaskInfo have innumerable copies of the hostname string --- Key: SPARK-5523 URL: https://issues.apache.org/jira/browse/SPARK-5523 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Tathagata Das TaskMetrics and TaskInfo objects have the hostname associated with the task. As these are created (directly or through deserialization of RPC messages), each of them holds a separate String object for the hostname even though most of them contain the same string data. This results in thousands of string objects, increasing the memory requirements of the driver. These strings can easily be deduplicated when deserializing a TaskMetrics object, or when creating a TaskInfo object. This affects streaming particularly badly due to the rate of job/stage/task generation. For a solution, see how this dedup is done for StorageLevel. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
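The StorageLevel-style dedup referenced above boils down to interning strings through a small concurrent cache; a minimal sketch of the idea follows (names are illustrative, not Spark's actual implementation):
{code}
import java.util.concurrent.ConcurrentHashMap

object HostnameCache {
  private val cache = new ConcurrentHashMap[String, String]()

  /** Return a canonical instance for `host`, so equal hostnames share one String object. */
  def dedup(host: String): String = {
    val previous = cache.putIfAbsent(host, host)
    if (previous == null) host else previous
  }
}

// e.g. when deserializing TaskMetrics or constructing TaskInfo:
//   val host = HostnameCache.dedup(rawHost)
{code}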
[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394141#comment-14394141 ] Florian Verhein commented on SPARK-6664: Thanks [~sowen]. I disagree :-) ...If you think there's non-stationarity you most certainly want to see how well a model trained in the past holds up in the future (possibly with more than one out of time sample if one is used for pruning, etc), and you can do this for temporal data by adjusting the way you do cross validation... actually, the exact method you describe is one common approach in time series data, e.g. see http://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection Doing this multiple times does exactly what is does for normal cross-validation - gives you a distribution of your error estimate, rather than a single value (a sample of it). So it's quite important. The size of the data isn't really relevant to this argument (also consider that I might like to employ larger datasets to remove the risk of overfitting a more complex but better fitting model, rather than to improve my error estimates). Note that this proposal doesn't define how the split RDDs are used (i.e. unioned) to create training sets and test sets. So the test set can be a single RDD, or multiple ones. It's entirely up to the user. Allowing overlapping partitions (i.e. part 2) is a little different, because you probably wouldn't union the resulting RDDs due to duplication. It would be more useful for as a primitive for bootstrapping the performance measures of streaming models or simulations (so, you're not resampling records, but resampling subsequences). Alternatively if you have big data but a class imbalance problem, you might need to resort to overlaps in the training sets to get multiple test sets with enough examples of your minority class. From what I understand MLUtils.kFold is standard randomised k-fold cross validation *but without shuffling* (from a cursory look at the code, It looks like ordering will always be maintained... which should probably be documented if it is the case because it can lead to bad things... and adds another argument for #6665). Either way, since elements of its splits are non-consecutive, it's not applicable for time series. Do you know how the performance of filterByRange would compare? It should be pretty performant if and only if the data is RangePartitioned right? Split Ordered RDD into multiple RDDs by keys (boundaries or intervals) -- Key: SPARK-6664 URL: https://issues.apache.org/jira/browse/SPARK-6664 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Florian Verhein I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ml models. *Use case example* suppose you have pre-processed web logs for a few months, and now want to split it into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out of time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want to have multiple overlaping intervals. *Specification* 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith boundary. 2. 
More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing values within those intervals. *Implementation ideas / notes for 1* - The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD) - Find the partitions containing the boundary, and split them in two. - Construct the new RDDs from the original partitions (and any split ones) I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' is masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation. *Implementation ideas / notes for 2*
[jira] [Created] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.
Jin Adachi created SPARK-6694: - Summary: SparkSQL CLI must be able to specify an option --database on the command line. Key: SPARK-6694 URL: https://issues.apache.org/jira/browse/SPARK-6694 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Jin Adachi SparkSQL CLI has an option --database as follows. But, the option --database doesn't work properly. {code} $ spark-sql --help : CLI options: : --database databasename Specify the database to use {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6692) Make it possible to kill AM in YARN cluster mode when the client is terminated
Cheolsoo Park created SPARK-6692: Summary: Make it possible to kill AM in YARN cluster mode when the client is terminated Key: SPARK-6692 URL: https://issues.apache.org/jira/browse/SPARK-6692 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Cheolsoo Park Priority: Minor I understand that the yarn-cluster mode is designed for fire-and-forget model; therefore, terminating the yarn client doesn't kill AM. However, it is very common that users submit Spark jobs via job scheduler (e.g. Apache Oozie) or remote job server (e.g. Netflix Genie) where it is expected that killing the yarn client will terminate AM. It is true that the yarn-client mode can be used in such cases. But then, the yarn client sometimes needs lots of heap memory for big jobs if it runs in the yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs because AM can be given arbitrary heap memory unlike the yarn client. So it would be very useful to make it possible to kill AM even in the yarn-cluster mode. In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon as they're accepted (but not yet running). Although they're eventually shutdown after AM timeout, it would be nice if AM could immediately get killed in such cases too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5523) TaskMetrics and TaskInfo have innumerable copies of the hostname string
[ https://issues.apache.org/jira/browse/SPARK-5523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5523: --- Assignee: Apache Spark TaskMetrics and TaskInfo have innumerable copies of the hostname string --- Key: SPARK-5523 URL: https://issues.apache.org/jira/browse/SPARK-5523 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Reporter: Tathagata Das Assignee: Apache Spark TaskMetrics and TaskInfo objects have the hostname associated with the task. As these are created (directly or through deserialization of RPC messages), each of them have a separate String object for the hostname even though most of them have the same string data in them. This results in thousands of string objects, increasing memory requirement of the driver. This can be easily deduped when deserializing a TaskMetrics object, or when creating a TaskInfo object. This affects streaming particularly bad due to the rate of job/stage/task generation. For solution, see how this dedup is done for StorageLevel. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L226 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.
[ https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394211#comment-14394211 ] Sean Owen commented on SPARK-6694: -- What problem do you encounter? You only showed the help message. SparkSQL CLI must be able to specify an option --database on the command line. -- Key: SPARK-6694 URL: https://issues.apache.org/jira/browse/SPARK-6694 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Jin Adachi SparkSQL CLI has an option --database as follows. But, an option --database doesn't work properly. {code:} $ spark-sql --help : CLI options: : --database databasename Specify the database to use ``` {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6560) PairRDDFunctions suppresses exceptions in writeFile
[ https://issues.apache.org/jira/browse/SPARK-6560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6560. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5223 [https://github.com/apache/spark/pull/5223] PairRDDFunctions suppresses exceptions in writeFile --- Key: SPARK-6560 URL: https://issues.apache.org/jira/browse/SPARK-6560 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Stephen Haberman Priority: Minor Fix For: 1.4.0 In PairRDDFunctions, saveAsHadoopDataset uses a try/finally to manage SparkHadoopWriter. Briefly: {code} try { ... writer.write(...) } finally { writer.close() } {code} However, if an exception happens in writer.write, and then writer.close is called, and an exception in writer.close happens, the original (real) exception from writer.write is suppressed. This makes debugging very painful, as the exception that is shown in the logs (from writer.close) is spurious, and the original, real exception has been lost and not logged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
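The usual remedy is to remember the exception thrown by the write block and prevent it from being masked by a failure in close(); a generic sketch of that pattern follows (similar in spirit to, but not necessarily identical with, the change made for this ticket):
{code}
// Run `block`, then `finallyBlock`; if both throw, rethrow the original
// exception instead of letting the cleanup failure mask it.
def tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T = {
  var originalThrowable: Throwable = null
  try {
    block
  } catch {
    case t: Throwable =>
      originalThrowable = t
      throw t
  } finally {
    try {
      finallyBlock
    } catch {
      case t: Throwable if originalThrowable != null =>
        // Log and swallow: the caller sees the real failure from `block`.
        println(s"Suppressed exception in finally block: $t")
    }
  }
}
{code}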
[jira] [Commented] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path
[ https://issues.apache.org/jira/browse/SPARK-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394240#comment-14394240 ] Masayoshi TSUZUKI commented on SPARK-6568: -- {code} bin\spark-shell.cmd --jars C:\Program Files\some\jar1.jar {code} {code} Exception in thread main java.net.URISyntaxException: Illegal character in path at index 10: C:/Program Files/some/jar1.jar at java.net.URI$Parser.fail(URI.java:2829) at java.net.URI$Parser.checkChars(URI.java:3002) at java.net.URI$Parser.parseHierarchical(URI.java:3086) at java.net.URI$Parser.parse(URI.java:3034) at java.net.URI.init(URI.java:595) at org.apache.spark.util.Utils$.resolveURI(Utils.scala:1721) at org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1745) at org.apache.spark.util.Utils$$anonfun$resolveURIs$1.apply(Utils.scala:1745) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.util.Utils$.resolveURIs(Utils.scala:1745) at org.apache.spark.deploy.SparkSubmitArguments.handle(SparkSubmitArguments.scala:367) at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:155) at org.apache.spark.deploy.SparkSubmitArguments.init(SparkSubmitArguments.scala:92) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:105) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 15/04/02 14:23:46 DEBUG Utils: Shutdown hook called {code} spark-shell.cmd --jars option does not accept the jar that has space in its path Key: SPARK-6568 URL: https://issues.apache.org/jira/browse/SPARK-6568 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Windows 8.1 Reporter: Masayoshi TSUZUKI spark-shell.cmd --jars option does not accept the jar that has space in its path. The path of jar sometimes containes space in Windows. {code} bin\spark-shell.cmd --jars C:\Program Files\some\jar1.jar {code} this gets {code} Exception in thread main java.net.URISyntaxException: Illegal character in path at index 10: C:/Program Files/some/jar1.jar {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
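The exception comes from constructing a java.net.URI directly from a path that contains a space; a small sketch of the distinction (the path is just the example from the report, and the commented-out line is the failing form):
{code}
import java.io.File
import java.net.URI

val path = """C:\Program Files\some\jar1.jar"""

// Throws URISyntaxException: a raw space is an illegal character in a URI.
// new URI(path.replace('\\', '/'))

// Going through File.toURI percent-encodes the space instead.
val uri: URI = new File(path).toURI   // file:/C:/Program%20Files/some/jar1.jar on Windows
println(uri)
{code}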
[jira] [Updated] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.
[ https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jin Adachi updated SPARK-6694: -- Description: SparkSQL CLI has an option --database as follows. But, the option --database is ignored. {code} $ spark-sql --help : CLI options: : --database databasename Specify the database to use {code} was: SparkSQL CLI has an option --database as follows. But, the option --database doesn't work properly. {code} $ spark-sql --help : CLI options: : --database databasename Specify the database to use {code} SparkSQL CLI must be able to specify an option --database on the command line. -- Key: SPARK-6694 URL: https://issues.apache.org/jira/browse/SPARK-6694 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.4.0 Reporter: Jin Adachi SparkSQL CLI has an option --database as follows. But, the option --database is ignored. {code} $ spark-sql --help : CLI options: : --database databasename Specify the database to use {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings
[ https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394262#comment-14394262 ] Sean Owen commented on SPARK-6569: -- [~c...@koeninger.org] what do you think about just removing this log or making it debug level? Kafka directInputStream logs what appear to be incorrect warnings - Key: SPARK-6569 URL: https://issues.apache.org/jira/browse/SPARK-6569 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Environment: Spark 1.3.0 Reporter: Platon Potapov Priority: Minor During what appears to be normal operation of streaming from a Kafka topic, the following log records are observed, logged periodically: {code} [Stage 391:== (3 + 0) / 4] 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping raw 0 {code} * the part.fromOffset placeholder is not correctly substituted with a value * does the condition really warrant a warning being logged? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
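The literal ${part.fromOffset} in the log output suggests a plain string literal where an interpolated one was intended; a minimal Scala illustration of that class of bug (not necessarily the exact KafkaRDD code):
{code}
val fromOffset = 42L

// Missing the `s` prefix: the placeholder is printed verbatim.
println("Beginning offset ${fromOffset} is the same as ending offset")

// With the interpolator the value is substituted.
println(s"Beginning offset ${fromOffset} is the same as ending offset")
{code}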
[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3468: -- Attachment: (was: stages.png) WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: tasks.png I sometimes trouble-shoot and analyse the cause of long time spending job. At the time, I find the stages which spends long time or fails, then I find the tasks which spends long time or fails, next I analyse the proportion of each phase in a task. Another case, I find executors which spends long time for running a task and analyse the details of a task. In such situation, I think it's helpful to visualize timeline view of stages / tasks / executors and visualize details of proportion of activity for each task. Now I'm developing prototypes like captures I attached. I'll integrate these viewer into WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3468: -- Attachment: (was: taskDetails.png) WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: tasks.png I sometimes trouble-shoot and analyse the cause of long time spending job. At the time, I find the stages which spends long time or fails, then I find the tasks which spends long time or fails, next I analyse the proportion of each phase in a task. Another case, I find executors which spends long time for running a task and analyse the details of a task. In such situation, I think it's helpful to visualize timeline view of stages / tasks / executors and visualize details of proportion of activity for each task. Now I'm developing prototypes like captures I attached. I'll integrate these viewer into WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3468: -- Attachment: (was: stage-timeline.png) WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: tasks.png I sometimes trouble-shoot and analyse the cause of long time spending job. At the time, I find the stages which spends long time or fails, then I find the tasks which spends long time or fails, next I analyse the proportion of each phase in a task. Another case, I find executors which spends long time for running a task and analyse the details of a task. In such situation, I think it's helpful to visualize timeline view of stages / tasks / executors and visualize details of proportion of activity for each task. Now I'm developing prototypes like captures I attached. I'll integrate these viewer into WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3468: -- Attachment: (was: executors.png) WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: tasks.png I sometimes trouble-shoot and analyse the cause of long time spending job. At the time, I find the stages which spends long time or fails, then I find the tasks which spends long time or fails, next I analyse the proportion of each phase in a task. Another case, I find executors which spends long time for running a task and analyse the details of a task. In such situation, I think it's helpful to visualize timeline view of stages / tasks / executors and visualize details of proportion of activity for each task. Now I'm developing prototypes like captures I attached. I'll integrate these viewer into WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3468: -- Attachment: (was: tasks.png) WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta I sometimes trouble-shoot and analyse the cause of long time spending job. At the time, I find the stages which spends long time or fails, then I find the tasks which spends long time or fails, next I analyse the proportion of each phase in a task. Another case, I find executors which spends long time for running a task and analyse the details of a task. In such situation, I think it's helpful to visualize timeline view of stages / tasks / executors and visualize details of proportion of activity for each task. Now I'm developing prototypes like captures I attached. I'll integrate these viewer into WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3468: -- Comment: was deleted (was: Sorry for pending this ticket for a long time. I've re considered how and what to visualize. One of my ideas, Timeline based visualization for each task at a stage, is taking a shape and almost be implemented. !stage-timeline.png! This feature is integrated into existing WebUI, zoomable and scrollable. Now the code of this feature is a little bit messy but I'll cleanup and show the code soon.) WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: tasks.png I sometimes trouble-shoot and analyse the cause of long time spending job. At the time, I find the stages which spends long time or fails, then I find the tasks which spends long time or fails, next I analyse the proportion of each phase in a task. Another case, I find executors which spends long time for running a task and analyse the details of a task. In such situation, I think it's helpful to visualize timeline view of stages / tasks / executors and visualize details of proportion of activity for each task. Now I'm developing prototypes like captures I attached. I'll integrate these viewer into WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3468) WebUI Timeline-View feature
[ https://issues.apache.org/jira/browse/SPARK-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3468: -- Attachment: TaskAssignmentTimelineView.png JobTimelineView.png ApplicationTimeliView.png I've attached the screen shots of this feature. WebUI Timeline-View feature --- Key: SPARK-3468 URL: https://issues.apache.org/jira/browse/SPARK-3468 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 1.4.0 Reporter: Kousuke Saruta Attachments: ApplicationTimeliView.png, JobTimelineView.png, TaskAssignmentTimelineView.png I sometimes trouble-shoot and analyse the cause of long time spending job. At the time, I find the stages which spends long time or fails, then I find the tasks which spends long time or fails, next I analyse the proportion of each phase in a task. Another case, I find executors which spends long time for running a task and analyse the details of a task. In such situation, I think it's helpful to visualize timeline view of stages / tasks / executors and visualize details of proportion of activity for each task. Now I'm developing prototypes like captures I attached. I'll integrate these viewer into WebUI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6489) Optimize lateral view with explode to not read unnecessary columns
[ https://issues.apache.org/jira/browse/SPARK-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6489: --- Assignee: (was: Apache Spark) Optimize lateral view with explode to not read unnecessary columns -- Key: SPARK-6489 URL: https://issues.apache.org/jira/browse/SPARK-6489 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Konstantin Shaposhnikov Labels: starter Currently a query with lateral view explode(...) results in an execution plan that reads all columns of the underlying RDD. E.g. given *ppl* table is a DF created from the Person case class: {code} case class Person(val name: String, val age: Int, val data: Array[Int]) {code} the following SQL: {code} select name, sum(d) from ppl lateral view explode(data) d as d group by name {code} executes as follows: {noformat} == Physical Plan == Aggregate false, [name#0], [name#0,SUM(PartialSum#38L) AS _c1#18L] Exchange (HashPartitioning [name#0], 200) Aggregate true, [name#0], [name#0,SUM(CAST(d#21, LongType)) AS PartialSum#38L] Project [name#0,d#21] Generate explode(data#2), true, false InMemoryColumnarTableScan [name#0,age#1,data#2], [], (InMemoryRelation [name#0,age#1,data#2], true, 1, StorageLevel(true, true, false, true, 1), (PhysicalRDD [name#0,age#1,data#2], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35), Some(ppl)) {noformat} Note that the *age* column is not needed to produce the output but it is still read from the underlying RDD. A sample program to demonstrate the issue: {code} case class Person(val name: String, val age: Int, val data: Array[Int]) object ExplodeDemo extends App { val ppl = Array( Person("A", 20, Array(10, 12, 19)), Person("B", 25, Array(7, 8, 4)), Person("C", 19, Array(12, 4, 232))) val conf = new SparkConf().setMaster("local[2]").setAppName("sql") val sc = new SparkContext(conf) val sqlCtx = new HiveContext(sc) import sqlCtx.implicits._ val df = sc.makeRDD(ppl).toDF df.registerTempTable("ppl") sqlCtx.cacheTable("ppl") // cache the table, otherwise ExistingRDD will be used, which does not support column pruning val s = sqlCtx.sql("select name, sum(d) from ppl lateral view explode(data) d as d group by name") s.explain(true) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
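Until the optimizer prunes columns through Generate, one manual workaround is to project only the needed columns before the lateral view; a sketch against the sample program above (same sqlCtx and ppl table), which may or may not yield the desired pruning depending on how the plan is built:
{code}
// Restrict the scan to name and data by hand, then explode.
val pruned = sqlCtx.sql(
  """select name, sum(d)
    |from (select name, data from ppl) t
    |lateral view explode(data) d as d
    |group by name""".stripMargin)
pruned.explain(true)
{code}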
[jira] [Updated] (SPARK-6695) Add an external iterator: a hadoop-like output collector
[ https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-6695: Description: In practical use, we usually need to create a big iterator, which means too big in `memory usage` or too long in `array size`. On the one hand, it leads to too much memory consumption. On the other hand, one `Array` may not hold all the elements, as java array indices are of type 'int' (4 bytes or 32 bits). So, IMHO, we may provide a `collector`, which has a buffer (100MB or some other size) and could spill data to disk. The use case may look like: {code:borderStyle=solid} rdd.mapPartitions { it => ... val collector = new ExternalCollector() collector.collect(a) ... collector.iterator } {code} I have done some related work, and I need your opinions, thanks! was: In practical use, we usually need to create a big iterator, which means too big in `memory usage` or too long in `array size`. On the one hand, it leads to too much memory consumption. On the other hand, one `Array` may not hold all the elements, as java array indices are of type 'int' (4 bytes or 32 bits). So, IMHO, we may provide a `collector`, which has a buffer (100MB or some other size) and could spill data to disk. The use case may look like: ``` rdd.mapPartitions { it => ... val collector = new ExternalCollector() collector.collect(a) ... collector.iterator } ``` I have done some related work, and I need your opinions, thanks! Add an external iterator: a hadoop-like output collector Key: SPARK-6695 URL: https://issues.apache.org/jira/browse/SPARK-6695 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: uncleGen In practical use, we usually need to create a big iterator, which means too big in `memory usage` or too long in `array size`. On the one hand, it leads to too much memory consumption. On the other hand, one `Array` may not hold all the elements, as java array indices are of type 'int' (4 bytes or 32 bits). So, IMHO, we may provide a `collector`, which has a buffer (100MB or some other size) and could spill data to disk. The use case may look like: {code:borderStyle=solid} rdd.mapPartitions { it => ... val collector = new ExternalCollector() collector.collect(a) ... collector.iterator } {code} I have done some related work, and I need your opinions, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
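For concreteness, here is one hypothetical shape such a collector could take; the class below does not exist in Spark and only buffers in memory, standing in for the proposed spill-to-disk version:
{code}
import scala.collection.mutable.ArrayBuffer

// Hypothetical, in-memory stand-in for the proposed ExternalCollector.
// The real proposal would spill to disk once the buffer exceeds a threshold.
class ExternalCollector[T](bufferSize: Int = 1024) {
  private val buffer = new ArrayBuffer[T]()

  def collect(elem: T): Unit = {
    buffer += elem
    // A disk-backed version would spill here when buffer.size reaches bufferSize.
  }

  def iterator: Iterator[T] = buffer.iterator
}
{code}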
[jira] [Assigned] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path
[ https://issues.apache.org/jira/browse/SPARK-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6568: --- Assignee: (was: Apache Spark) spark-shell.cmd --jars option does not accept the jar that has space in its path Key: SPARK-6568 URL: https://issues.apache.org/jira/browse/SPARK-6568 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Windows 8.1 Reporter: Masayoshi TSUZUKI spark-shell.cmd --jars option does not accept the jar that has space in its path. The path of jar sometimes containes space in Windows. {code} bin\spark-shell.cmd --jars C:\Program Files\some\jar1.jar {code} this gets {code} Exception in thread main java.net.URISyntaxException: Illegal character in path at index 10: C:/Program Files/some/jar1.jar {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6568) spark-shell.cmd --jars option does not accept the jar that has space in its path
[ https://issues.apache.org/jira/browse/SPARK-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6568: --- Assignee: Apache Spark spark-shell.cmd --jars option does not accept the jar that has space in its path Key: SPARK-6568 URL: https://issues.apache.org/jira/browse/SPARK-6568 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Windows 8.1 Reporter: Masayoshi TSUZUKI Assignee: Apache Spark spark-shell.cmd --jars option does not accept a jar that has a space in its path. The path of a jar sometimes contains spaces on Windows. {code} bin\spark-shell.cmd --jars C:\Program Files\some\jar1.jar {code} this gets: {code} Exception in thread "main" java.net.URISyntaxException: Illegal character in path at index 10: C:/Program Files/some/jar1.jar {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead
[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394247#comment-14394247 ] Apache Spark commented on SPARK-6239: - User 'kretes' has created a pull request for this issue: https://github.com/apache/spark/pull/5246 Spark MLlib fpm#FPGrowth minSupport should use long instead --- Key: SPARK-6239 URL: https://issues.apache.org/jira/browse/SPARK-6239 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Littlestar Priority: Minor Spark MLlib fpm#FPGrowth minSupport should use long instead == val minCount = math.ceil(minSupport * count).toLong because: 1. [count] the number of records in the dataset is not known before reading. 2. [minSupport] is a double, with limited precision. from mahout#FPGrowthDriver.java: addOption("minSupport", "s", "(Optional) The minimum number of times a co-occurrence must be present." + " Default Value: 3", "3"); I just want to set minCount=2 for testing. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
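For reference, the fraction-based API the report refers to looks like the following in the current MLlib; because minSupport is a relative fraction, expressing an absolute threshold such as minCount = 2 requires knowing the transaction count up front. The dataset name and count below are illustrative assumptions.
{code}
// Sketch against the existing org.apache.spark.mllib.fpm.FPGrowth API.
// `transactions` (an RDD[Array[String]]) is an assumed input.
import org.apache.spark.mllib.fpm.FPGrowth

val numTransactions = transactions.count()      // must be computed first
val fpg = new FPGrowth()
  .setMinSupport(2.0 / numTransactions)         // roundabout way of saying minCount = 2
  .setNumPartitions(10)
val model = fpg.run(transactions)
// internally: val minCount = math.ceil(minSupport * count).toLong
{code}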
[jira] [Commented] (SPARK-6687) In the hadoop 0.23 profile, hadoop pulls in an older version of netty which conflicts with akka's netty
[ https://issues.apache.org/jira/browse/SPARK-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394265#comment-14394265 ] Sean Owen commented on SPARK-6687: -- Does this cause any problem? I expect a lot of things to be different. This is also a very old version of Hadoop. In the hadoop 0.23 profile, hadoop pulls in an older version of netty which conflicts with akka's netty Key: SPARK-6687 URL: https://issues.apache.org/jira/browse/SPARK-6687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Sai Nishanth Parepally excerpt from mvn -Dverbose dependency:tree of spark-core, note the org.jboss.netty:netty dependency: [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:0.23.10:compile [INFO] | | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:0.23.10:compile [INFO] | | | | +- (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | +- (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | +- org.apache.hadoop:hadoop-yarn-server-common:jar:0.23.10:compile [INFO] | | | | | +- (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | | +- (org.apache.zookeeper:zookeeper:jar:3.4.5:compile - version managed from 3.4.2; omitted for duplicate) [INFO] | | | | | +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | | +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | | +- (org.jboss.netty:netty:jar:3.2.4.Final:compile - omitted for duplicate) [INFO] | | | | | +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate) [INFO] | | | | | +- (commons-io:commons-io:jar:2.1:compile - omitted for duplicate) [INFO] | | | | | +- (com.google.inject:guice:jar:3.0:compile - omitted for duplicate) [INFO] | | | | | +- (com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.8:compile - omitted for duplicate) [INFO] | | | | | +- (com.sun.jersey:jersey-server:jar:1.8:compile - omitted for duplicate) [INFO] | | | | | \- (com.sun.jersey.contribs:jersey-guice:jar:1.8:compile - omitted for duplicate) [INFO] | | | | +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate) [INFO] | | | | +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | +- (org.apache.hadoop:hadoop-hdfs:jar:1.23.10:compile - omitted for duplicate) [INFO] | | | | \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - omitted for duplicate) [INFO] | | | +- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:0.23.10:compile [INFO] | | | | +- (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate) [INFO] | | | | +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | | +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | | \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - omitted for duplicate) [INFO] | | | +- 
(com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate) [INFO] | | | +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version managed from 1.6.1; omitted for duplicate) [INFO] | | | +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - omitted for duplicate) [INFO] | | | \- org.jboss.netty:netty:jar:3.2.4.Final:compile -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
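If the older org.jboss.netty artifact does cause a runtime clash in an application build, one common sbt-side workaround (an assumption about a user's own build, not a change to Spark's hadoop-0.23 profile) is to exclude it from the Hadoop dependency:
{code}
// build.sbt sketch: keep akka's netty and drop the 3.2.x one pulled in via Hadoop 0.23.
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "0.23.10" % "provided" exclude("org.jboss.netty", "netty")
{code}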
[jira] [Commented] (SPARK-6681) JAVA_HOME error with upgrade to Spark 1.3.0
[ https://issues.apache.org/jira/browse/SPARK-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394290#comment-14394290 ] Sean Owen commented on SPARK-6681: -- That literal doesn't occur in Spark. That looks like how YARN writes placeholders to be expanded locally (see {{ApplicationConstants}}). My guess is that you don't have {{JAVA_HOME}} exposed to the local YARN workers, or, somehow you have some YARN version mismatch, maybe caused by bundling YARN with your app. YARN stuff changed in general and might have uncovered a problem; at this point I doubt it's a Spark issue as otherwise YARN wouldn't really work at all. JAVA_HOME error with upgrade to Spark 1.3.0 --- Key: SPARK-6681 URL: https://issues.apache.org/jira/browse/SPARK-6681 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.0 Environment: Client is Mac OS X version 10.10.2, cluster is running HDP 2.1 stack. Reporter: Ken Williams I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I changed my `build.sbt` like so: {code} -libraryDependencies += org.apache.spark %% spark-core % 1.2.1 % provided +libraryDependencies += org.apache.spark %% spark-core % 1.3.0 % provided {code} then make an `assembly` jar, and submit it: {code} HADOOP_CONF_DIR=/etc/hadoop/conf \ spark-submit \ --driver-class-path=/etc/hbase/conf \ --conf spark.hadoop.validateOutputSpecs=false \ --conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ --deploy-mode=cluster \ --master=yarn \ --class=TestObject \ --num-executors=54 \ target/scala-2.11/myapp-assembly-1.2.jar {code} The job fails to submit, with the following exception in the terminal: {code} 15/03/19 10:30:07 INFO yarn.Client: 15/03/19 10:20:03 INFO yarn.Client: client token: N/A diagnostics: Application application_1420225286501_4698 failed 2 times due to AM Container for appattempt_1420225286501_4698_02 exited with exitCode: 127 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {code} Finally, I go and check the YARN app master’s web interface (since the job is there, I know it at least made it that far), and the only logs it shows are these: {code} Log Type: stderr Log Length: 61 /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory Log Type: stdout Log Length: 0 {code} I’m not sure how to interpret that - is {{ {{JAVA_HOME}} }} a literal (including the brackets) that’s somehow making it into a script? Is this coming from the worker nodes or the driver? Anything I can do to experiment troubleshoot? 
I do have {{JAVA_HOME}} set in the hadoop config files on all the nodes of the cluster: {code} % grep JAVA_HOME /etc/hadoop/conf/*.sh /etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31 /etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31 {code} Has this behavior changed in 1.3.0 since 1.2.1? Using 1.2.1 and making no other changes, the job completes fine. (Note: I originally posted this on the Spark mailing list and also on Stack Overflow, I'll update both places if/when I find a solution.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-6691: --- Issue Type: Improvement (was: New Feature) Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, especially for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay time. With Spark Streaming’s job generation and processing pattern, this delay will accumulate and introduce unacceptable exceptions. Currently in Spark Streaming’s receiver based input stream, there’s a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to set an appropriate value before the application is running. # This configuration is fixed through the lifetime of the application, which means you need to consider the worst-case scenario to set a reasonable value. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped in large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. * Offer a uniform solution not only for receiver based input streams, but also for direct streams like DirectKafkaInputStream and new ones. * Slow-start the rate to control network congestion when the job is started. * A pluggable framework to make extensions easier to maintain. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comment would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
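As a rough illustration of the direction proposed above, a pluggable, dynamically adjusted limiter might look like the sketch below. The names and shape are purely hypothetical; the real interface is in the linked design doc and working branch.
{code}
// Hypothetical sketch only, not Spark's RateLimiter API.
trait StreamRateLimiter extends Serializable {
  /** Max records per second the stream may currently ingest. */
  def currentRate: Long
  /** Feed back the result of the last finished batch to adapt the limit. */
  def updateRate(processedRecords: Long, processingTimeMs: Long): Unit
}

class DynamicRateLimiter(initialRate: Long) extends StreamRateLimiter {
  @volatile private var rate = initialRate
  override def currentRate: Long = rate
  override def updateRate(processedRecords: Long, processingTimeMs: Long): Unit = {
    val observed = processedRecords * 1000.0 / math.max(processingTimeMs, 1)
    // Move the limit toward what the cluster actually sustained (a crude slow start).
    rate = math.max(1L, ((rate + observed) / 2).toLong)
  }
}
{code}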
[jira] [Commented] (SPARK-6692) Make it possible to kill AM in YARN cluster mode when the client is terminated
[ https://issues.apache.org/jira/browse/SPARK-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394135#comment-14394135 ] Apache Spark commented on SPARK-6692: - User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/5343 Make it possible to kill AM in YARN cluster mode when the client is terminated -- Key: SPARK-6692 URL: https://issues.apache.org/jira/browse/SPARK-6692 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Cheolsoo Park Priority: Minor Labels: yarn I understand that the yarn-cluster mode is designed for fire-and-forget model; therefore, terminating the yarn client doesn't kill AM. However, it is very common that users submit Spark jobs via job scheduler (e.g. Apache Oozie) or remote job server (e.g. Netflix Genie) where it is expected that killing the yarn client will terminate AM. It is true that the yarn-client mode can be used in such cases. But then, the yarn client sometimes needs lots of heap memory for big jobs if it runs in the yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs because AM can be given arbitrary heap memory unlike the yarn client. So it would be very useful to make it possible to kill AM even in the yarn-cluster mode. In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon as they're accepted (but not yet running). Although they're eventually shutdown after AM timeout, it would be nice if AM could immediately get killed in such cases too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6692) Make it possible to kill AM in YARN cluster mode when the client is terminated
[ https://issues.apache.org/jira/browse/SPARK-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6692: --- Assignee: (was: Apache Spark) Make it possible to kill AM in YARN cluster mode when the client is terminated -- Key: SPARK-6692 URL: https://issues.apache.org/jira/browse/SPARK-6692 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.3.0 Reporter: Cheolsoo Park Priority: Minor Labels: yarn I understand that the yarn-cluster mode is designed for fire-and-forget model; therefore, terminating the yarn client doesn't kill AM. However, it is very common that users submit Spark jobs via job scheduler (e.g. Apache Oozie) or remote job server (e.g. Netflix Genie) where it is expected that killing the yarn client will terminate AM. It is true that the yarn-client mode can be used in such cases. But then, the yarn client sometimes needs lots of heap memory for big jobs if it runs in the yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs because AM can be given arbitrary heap memory unlike the yarn client. So it would be very useful to make it possible to kill AM even in the yarn-cluster mode. In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon as they're accepted (but not yet running). Although they're eventually shutdown after AM timeout, it would be nice if AM could immediately get killed in such cases too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2489) Unsupported parquet datatype optional fixed_len_byte_array
[ https://issues.apache.org/jira/browse/SPARK-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394326#comment-14394326 ] Ishaaq Chandy commented on SPARK-2489: -- I see [~joesu]'s pull request got closed without being merged in. Does this mean that there is currently no solution/workaround to this issue? Unsupported parquet datatype optional fixed_len_byte_array -- Key: SPARK-2489 URL: https://issues.apache.org/jira/browse/SPARK-2489 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Pei-Lun Lee tested against commit 9fe693b5 {noformat} scala sqlContext.parquetFile(/tmp/foo) java.lang.RuntimeException: Unsupported parquet datatype optional fixed_len_byte_array(4) b at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:58) at org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:109) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:282) at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:279) {noformat} example avro schema {noformat} protocol Test { fixed Bytes4(4); record Foo { union {null, Bytes4} b; } } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6689) MiniYarnCLuster still test failed with hadoop-2.2
[ https://issues.apache.org/jira/browse/SPARK-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6689: - Priority: Minor (was: Major) I imagine this is a problem because you are building with SBT, and it can't fully parse the Maven build. The fix depends on some Maven profiles which may not fully affect the SBT build in the same way. I'm not 100% sure, but I know there is some difference. The build also fails for me with your build command, but succeeds with Maven. Since Maven is the build of reference I am not sure if this is such a big deal except to developers who have to work specifically with Hadoop 2.2 and want to use SBT. It'd be great if you can figure out a fix but it's not affecting the main build. MiniYarnCLuster still test failed with hadoop-2.2 - Key: SPARK-6689 URL: https://issues.apache.org/jira/browse/SPARK-6689 Project: Spark Issue Type: Test Components: Tests, YARN Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor when running unit test *YarnClusterSuite* with *hadoop-2.2*, exception will throw because *Timed out waiting for RM to come up*. Some previously related discussion can be traced in [spark-3710|https://issues.apache.org/jira/browse/SPARK-3710] ([PR2682|https://github.com/apache/spark/pull/2682]) and [spark-2778|https://issues.apache.org/jira/browse/SPARK-2778] ([PR2605|https://github.com/apache/spark/pull/2605]). With command *build/sbt -Pyarn -Phadoop-2.2 test-only org.apache.spark.deploy.yarn.YarnClusterSuite*, will get following exceptions: {noformat} [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** (15 seconds, 799 milliseconds) [info] java.lang.IllegalStateException: Timed out waiting for RM to come up. [info] at org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:114) [info] at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) [info] at org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:44) [info] at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) [info] at org.apache.spark.deploy.yarn.YarnClusterSuite.run(YarnClusterSuite.scala:44) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:294) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:284) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [info] at java.lang.Thread.run(Thread.java:745) {noformat} And without *-Phadoop-2.2* or replace it with *-Dhadoop.version* (e.g. 
build/sbt -Pyarn test-only org.apache.spark.deploy.yarn.YarnClusterSuite) more info will come out: {noformat} Exception in thread Thread-7 java.lang.NoClassDefFoundError: org/mortbay/jetty/servlet/Context at org.apache.hadoop.yarn.webapp.WebApps.$for(WebApps.java:309) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:602) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:655) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper$2.run(MiniYARNCluster.java:219) Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.servlet.Context at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) [info] Resolving org.apache.hadoop#hadoop-yarn-server-common;2.2.0 ... Exception in thread Thread-18 java.lang.NoClassDefFoundError: org/mortbay/jetty/servlet/Context at org.apache.hadoop.yarn.webapp.WebApps.$for(WebApps.java:309) at org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer.serviceStart(WebServer.java:62) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at
[jira] [Created] (SPARK-6697) PeriodicGraphCheckpointer does not clear edges.
Guoqiang Li created SPARK-6697: -- Summary: PeriodicGraphCheckpointer does not clear edges. Key: SPARK-6697 URL: https://issues.apache.org/jira/browse/SPARK-6697 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.3.0 Reporter: Guoqiang Li When I run this [branch(lrGraphxSGD)| https://github.com/witgo/spark/tree/lrGraphxSGD], PeriodicGraphCheckpointer only clears the vertices. {code} def run(iterations: Int): Unit = { for (iter <- 1 to iterations) { logInfo(s"Start train (Iteration $iter/$iterations)") val margin = forward() margin.setName(s"margin-$iter").persist(storageLevel) println(s"train (Iteration $iter/$iterations) cost : ${error(margin)}") var gradient = backward(margin) gradient = updateDeltaSum(gradient, iter) dataSet = updateWeight(gradient, iter) dataSet.vertices.setName(s"vertices-$iter") dataSet.edges.setName(s"edges-$iter") dataSet.persist(storageLevel) graphCheckpointer.updateGraph(dataSet) margin.unpersist(blocking = false) gradient.unpersist(blocking = false) logInfo(s"End train (Iteration $iter/$iterations)") innerIter += 1 } graphCheckpointer.deleteAllCheckpoints() } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
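The gist of the report above is that only one of the graph's two member RDDs is being released. A minimal sketch of what clearing a cached graph copy should cover (an assumption about the fix, not the existing PeriodicGraphCheckpointer code):
{code}
import org.apache.spark.graphx.Graph

// Both member RDDs of a cached Graph need to be released when an old copy is dropped.
def unpersistGraph(g: Graph[_, _]): Unit = {
  g.vertices.unpersist(blocking = false)
  g.edges.unpersist(blocking = false)   // the part reported missing here
}
{code}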
[jira] [Commented] (SPARK-6664) Split Ordered RDD into multiple RDDs by keys (boundaries or intervals)
[ https://issues.apache.org/jira/browse/SPARK-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394231#comment-14394231 ] Sean Owen commented on SPARK-6664: -- Yes _k_ estimates is better than 1; this is both more expensive and less important when the data size is large. But, yes it has value and I can see the argument that it's more important if the response is actually time-dependent. I wasn't suggesting that {{MLUtils.kFold}} implements this, but that it was a related piece of code. If ordering matters and the input has an ordering that biases the result, then yes you would randomly permute the partition or RDD. This isn't true for every algorithm but for some. Same thing here really, I think you can order, bucket by range, and union in the straightforward way and it will be as performant as anything I can think of. You have to write some code, but it's flexible. The question is how much it's worth adding another method versus how often this is used. I can see this being useful for time series. I suppose that if it turns out there's a much fast-er way to do this but it's complex, and it is used, then it does need to be wrapped up in a utility method. Split Ordered RDD into multiple RDDs by keys (boundaries or intervals) -- Key: SPARK-6664 URL: https://issues.apache.org/jira/browse/SPARK-6664 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Florian Verhein I can't find this functionality (if I missed something, apologies!), but it would be very useful for evaluating ml models. *Use case example* suppose you have pre-processed web logs for a few months, and now want to split it into a training set (where you train a model to predict some aspect of site accesses, perhaps per user) and an out of time test set (where you evaluate how well your model performs in the future). This example has just a single split, but in general you could want more for cross validation. You may also want to have multiple overlaping intervals. *Specification* 1. Given an Ordered RDD and an ordered sequence of n boundaries (i.e. keys), return n+1 RDDs such that values in the ith RDD are within the (i-1)th and ith boundary. 2. More complex alternative (but similar under the hood): provide a sequence of possibly overlapping intervals (ordered by the start key of the interval), and return the RDDs containing values within those intervals. *Implementation ideas / notes for 1* - The ordered RDDs are likely RangePartitioned (or there should be a simple way to find ranges from partitions in an ordered RDD) - Find the partitions containing the boundary, and split them in two. - Construct the new RDDs from the original partitions (and any split ones) I suspect this could be done by launching only a few jobs to split the partitions containing the boundaries. Alternatively, it might be possible to decorate these partitions and use them in more than one RDD. I.e. let one of these partitions (for boundary i) be p. Apply two decorators p' and p'', where p' is masks out values above the ith boundary, and p'' masks out values below the ith boundary. Any operations on these partitions apply only to values not masked out. Then assign p' to the ith output RDD and p'' to the (i+1)th output RDD. If I understand Spark correctly, this should not require any jobs. Not sure whether it's worth trying this optimisation. 
*Implementation ideas / notes for 2* This is very similar, except that we have to handle entire (or parts) of partitions belonging to more than one output RDD, since they are no longer mutually exclusive. But since RDDs are immutable(??), the decorator idea should still work? Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
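As a baseline for comparison, the straightforward "order, bucket by range, and union" approach discussed above can be written with plain filters; the proposal is essentially to do better than this by exploiting the RangePartitioner so each output RDD touches only the relevant partitions. A sketch, assuming a key-sorted RDD[(K, V)] as input:
{code}
import org.apache.spark.rdd.RDD

// Naive baseline: one filter per interval; every output RDD still scans all partitions.
def splitByBoundaries[K, V](rdd: RDD[(K, V)], boundaries: Seq[K])
                           (implicit ord: Ordering[K]): Seq[RDD[(K, V)]] = {
  val lowers = None +: boundaries.map(Option(_))   // intervals: (-inf, b1), [b1, b2), ..., [bn, +inf)
  val uppers = boundaries.map(Option(_)) :+ None
  lowers.zip(uppers).map { case (lo, hi) =>
    rdd.filter { case (k, _) =>
      lo.forall(l => ord.gteq(k, l)) && hi.forall(h => ord.lt(k, h))
    }
  }
}
{code}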
[jira] [Created] (SPARK-6695) Add an external iterator: a hadoop-like output collector
uncleGen created SPARK-6695: --- Summary: Add an external iterator: a hadoop-like output collector Key: SPARK-6695 URL: https://issues.apache.org/jira/browse/SPARK-6695 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: uncleGen In practical use, we usually need to create a big iterator, which means it is too big in `memory usage` or too long in `array size`. On the one hand, it leads to too much memory consumption. On the other hand, one `Array` may not hold all the elements, as Java array indices are of type 'int' (4 bytes or 32 bits). So, IMHO, we may provide a `collector`, which has a buffer (100 MB or some other size) and can spill data to disk. The use case may look like: ``` rdd.mapPartitions { it => ... val collector = new ExteranalCollector() collector.collect(a) ... collector.iterator } ``` I have done some related work, and I need your opinions, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6695) Add an external iterator: a hadoop-like output collector
[ https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-6695: Description: In practical use, we usually need to create a big iterator, which means it is too big in `memory usage` or too long in `array size`. On the one hand, it leads to too much memory consumption. On the other hand, one `Array` may not hold all the elements, as Java array indices are of type 'int' (4 bytes or 32 bits). So, IMHO, we may provide a `collector`, which has a buffer (100 MB or some other size) and can spill data to disk. The use case may look like: ``` rdd.mapPartitions { it => ... val collector = new ExternalCollector() collector.collect(a) ... collector.iterator } ``` I have done some related work, and I need your opinions, thanks! was: In practical use, we usually need to create a big iterator, which means it is too big in `memory usage` or too long in `array size`. On the one hand, it leads to too much memory consumption. On the other hand, one `Array` may not hold all the elements, as Java array indices are of type 'int' (4 bytes or 32 bits). So, IMHO, we may provide a `collector`, which has a buffer (100 MB or some other size) and can spill data to disk. The use case may look like: ``` rdd.mapPartitions { it => ... val collector = new ExteranalCollector() collector.collect(a) ... collector.iterator } ``` I have done some related work, and I need your opinions, thanks! Add an external iterator: a hadoop-like output collector Key: SPARK-6695 URL: https://issues.apache.org/jira/browse/SPARK-6695 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: uncleGen In practical use, we usually need to create a big iterator, which means it is too big in `memory usage` or too long in `array size`. On the one hand, it leads to too much memory consumption. On the other hand, one `Array` may not hold all the elements, as Java array indices are of type 'int' (4 bytes or 32 bits). So, IMHO, we may provide a `collector`, which has a buffer (100 MB or some other size) and can spill data to disk. The use case may look like: ``` rdd.mapPartitions { it => ... val collector = new ExternalCollector() collector.collect(a) ... collector.iterator } ``` I have done some related work, and I need your opinions, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6695) Add an external iterator: a hadoop-like output collector
[ https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394236#comment-14394236 ] Sean Owen commented on SPARK-6695: -- I am not sure what the use case is here. You already have an iterator; why does spilling it to disk then re-iterating over it help? Add an external iterator: a hadoop-like output collector Key: SPARK-6695 URL: https://issues.apache.org/jira/browse/SPARK-6695 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: uncleGen In practical use, we usually need to create a big iterator, which means it is too big in `memory usage` or too long in `array size`. On the one hand, it leads to too much memory consumption. On the other hand, one `Array` may not hold all the elements, as Java array indices are of type 'int' (4 bytes or 32 bits). So, IMHO, we may provide a `collector`, which has a buffer (100 MB or some other size) and can spill data to disk. The use case may look like: {code: borderStyle=solid} rdd.mapPartitions { it => ... val collector = new ExternalCollector() collector.collect(a) ... collector.iterator } {code} I have done some related work, and I need your opinions, thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
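To make the proposal above concrete, here is a purely hypothetical sketch of what such a collector could look like: buffer elements in memory, spill to a temporary file past a threshold, and expose everything as an iterator again. The class name matches the description, but the serialization format, threshold, and overall API are illustrative assumptions, not an existing or planned Spark API.
{code}
import java.io._
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch only; single-use (call iterator once, after all collects).
class ExternalCollector[T](spillThreshold: Int = 100000) {
  private val buffer = new ArrayBuffer[T]
  private val spillFile = File.createTempFile("external-collector", ".bin")
  spillFile.deleteOnExit()
  private val out = new ObjectOutputStream(
    new BufferedOutputStream(new FileOutputStream(spillFile)))
  private var spilledCount = 0L

  def collect(elem: T): Unit = {
    buffer += elem
    if (buffer.size >= spillThreshold) spill()
  }

  private def spill(): Unit = {
    buffer.foreach(e => out.writeObject(e))
    spilledCount += buffer.size
    buffer.clear()
  }

  def iterator: Iterator[T] = {
    out.flush(); out.close()
    val in = new ObjectInputStream(
      new BufferedInputStream(new FileInputStream(spillFile)))
    val fromDisk = Iterator.fill(spilledCount.toInt)(in.readObject().asInstanceOf[T])
    fromDisk ++ buffer.iterator
  }
}
{code}
Inside a mapPartitions closure this would be used exactly as in the description: collect into it, then return collector.iterator.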
[jira] [Commented] (SPARK-6694) SparkSQL CLI must be able to specify an option --database on the command line.
[ https://issues.apache.org/jira/browse/SPARK-6694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394251#comment-14394251 ] Jin Adachi commented on SPARK-6694: --- SparkSQL CLI doesn't work option --database, and that forced database to default. For example, It is said that I have a database 'test_db' and a table 't_user'. I caught the error as follows. {code:} $ spark-sql --database test_db -e 'select * from t_user order by id' : 15/04/03 13:26:30 ERROR metadata.Hive: NoSuchObjectException(message:default.t_user table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy9.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:180) at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:252) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161) at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:252) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:175) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:187) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:182) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:186) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:236) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:192) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at
[jira] [Commented] (SPARK-6665) Randomly Shuffle an RDD
[ https://issues.apache.org/jira/browse/SPARK-6665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394291#comment-14394291 ] Florian Verhein commented on SPARK-6665: Fair enough. I'll have to implement it because I need it so may as well report back when I've had the chance to (perhaps there's a better place for it - e.g. not in the core API). Randomly Shuffle an RDD Key: SPARK-6665 URL: https://issues.apache.org/jira/browse/SPARK-6665 Project: Spark Issue Type: New Feature Components: Spark Shell Reporter: Florian Verhein Priority: Minor *Use case* RDD created in a way that has some ordering, but you need to shuffle it because the ordering would cause problems downstream. E.g. - will be used to train a ML algorithm that makes stochastic assumptions (like SGD) - used as input for cross validation. e.g. after the shuffle, you could just grab partitions (or part files if saved to hdfs) as folds Related question in mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/random-shuffle-streaming-RDDs-td17965.html *Possible implementation* As mentioned by [~sowen] in the above thread, could sort by( a good hash of( the element (or key if it's paired) and a random salt)). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
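The implementation idea mentioned in the description above (sort by a good hash of the element plus a random salt) is only a few lines with existing RDD APIs. A sketch, with the helper name and the byteswap-style mixing function as illustrative choices:
{code}
import scala.reflect.ClassTag
import scala.util.hashing.byteswap32
import org.apache.spark.rdd.RDD

// Deterministic for a fixed salt, "random" across salts; hash collisions only
// co-locate a few elements and do not meaningfully bias the resulting order.
def randomlyShuffle[T: ClassTag](rdd: RDD[T], salt: Int): RDD[T] =
  rdd.sortBy(x => byteswap32(x.hashCode ^ salt))
{code}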
[jira] [Commented] (SPARK-6638) optimize StringType in SQL
[ https://issues.apache.org/jira/browse/SPARK-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394614#comment-14394614 ] Apache Spark commented on SPARK-6638: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5350 optimize StringType in SQL -- Key: SPARK-6638 URL: https://issues.apache.org/jira/browse/SPARK-6638 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu java.lang.String is encoded in UTF-16, it's not efficient for IO. We could change to use Array[Byte] of UTF-8 internally for better performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
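A quick illustration of the motivation (plain JVM behaviour, not Spark SQL code): for ASCII-heavy data, the UTF-8 bytes are roughly half the size of the char payload a java.lang.String carries.
{code}
import java.nio.charset.StandardCharsets

val s = "SELECT name FROM people WHERE age > 21"
val utf8 = s.getBytes(StandardCharsets.UTF_8)
// Each String char is a 2-byte UTF-16 code unit; ASCII needs 1 byte in UTF-8.
println(s"UTF-16 payload: ${s.length * 2} bytes, UTF-8: ${utf8.length} bytes")
{code}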
[jira] [Commented] (SPARK-6330) newParquetRelation gets incorrect FileSystem
[ https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394839#comment-14394839 ] Apache Spark commented on SPARK-6330: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5353 newParquetRelation gets incorrect FileSystem Key: SPARK-6330 URL: https://issues.apache.org/jira/browse/SPARK-6330 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Fix For: 1.3.1, 1.4.0 Here's a snippet from newParquet.scala: def refresh(): Unit = { val fs = FileSystem.get(sparkContext.hadoopConfiguration) // Support either reading a collection of raw Parquet part-files, or a collection of folders // containing Parquet files (e.g. partitioned Parquet table). val baseStatuses = paths.distinct.map { p => val qualified = fs.makeQualified(new Path(p)) if (!fs.exists(qualified) && maybeSchema.isDefined) { fs.mkdirs(qualified) prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration) } fs.getFileStatus(qualified) }.toArray If we are running this locally and path points to S3, fs would be incorrect. A fix is to construct fs for each file separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
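The direction of the fix described above ("construct fs for each file separately") can be sketched with the standard Hadoop API, reusing the names from the snippet; this illustrates the idea and is not the merged patch.
{code}
import org.apache.hadoop.fs.Path

// Resolve the FileSystem from each path's own scheme (s3n://, hdfs://, file:/ ...)
// instead of from the default FileSystem of the Hadoop configuration.
val hadoopConf = sparkContext.hadoopConfiguration
val baseStatuses = paths.distinct.map { p =>
  val path = new Path(p)
  val fs = path.getFileSystem(hadoopConf)   // per-path FileSystem
  fs.getFileStatus(fs.makeQualified(path))
}.toArray
{code}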
[jira] [Commented] (SPARK-6697) PeriodicGraphCheckpointer does not clear edges.
[ https://issues.apache.org/jira/browse/SPARK-6697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394785#comment-14394785 ] Joseph K. Bradley commented on SPARK-6697: -- Thanks for pointing this out. I don't recall this happening in LDA, but I think that's because LDA's edges do not change. We may need to add an option for this to PeriodicGraphCheckpointer in order to make it generally usable beyond LDA. PeriodicGraphCheckpointer does not clear edges. - Key: SPARK-6697 URL: https://issues.apache.org/jira/browse/SPARK-6697 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.3.0 Reporter: Guoqiang Li Attachments: QQ20150403-1.png When I run this [branch(lrGraphxSGD)| https://github.com/witgo/spark/tree/lrGraphxSGD], PeriodicGraphCheckpointer only clears the vertices. {code} def run(iterations: Int): Unit = { for (iter <- 1 to iterations) { logInfo(s"Start train (Iteration $iter/$iterations)") val margin = forward() margin.setName(s"margin-$iter").persist(storageLevel) println(s"train (Iteration $iter/$iterations) cost : ${error(margin)}") var gradient = backward(margin) gradient = updateDeltaSum(gradient, iter) dataSet = updateWeight(gradient, iter) dataSet.vertices.setName(s"vertices-$iter") dataSet.edges.setName(s"edges-$iter") dataSet.persist(storageLevel) graphCheckpointer.updateGraph(dataSet) margin.unpersist(blocking = false) gradient.unpersist(blocking = false) logInfo(s"End train (Iteration $iter/$iterations)") innerIter += 1 } graphCheckpointer.deleteAllCheckpoints() } // Updater for L1 regularized problems private def updateWeight(delta: VertexRDD[Double], iter: Int): Graph[VD, ED] = { val thisIterStepSize = if (useAdaGrad) stepSize else stepSize / sqrt(iter) val thisIterL1StepSize = stepSize / sqrt(iter) val newVertices = dataSet.vertices.leftJoin(delta) { (_, attr, gradient) => gradient match { case Some(gard) => { var weight = attr weight -= thisIterStepSize * gard if (regParam > 0.0 && weight != 0.0) { val shrinkageVal = regParam * thisIterL1StepSize weight = signum(weight) * max(0.0, abs(weight) - shrinkageVal) } assert(!weight.isNaN) weight } case None => attr } } GraphImpl(newVertices, dataSet.edges) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6615) Add missing methods to Word2Vec's Python API
[ https://issues.apache.org/jira/browse/SPARK-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6615: - Assignee: Kai Sasaki Add missing methods to Word2Vec's Python API Key: SPARK-6615 URL: https://issues.apache.org/jira/browse/SPARK-6615 Project: Spark Issue Type: Task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor Labels: MLLib,, Python Fix For: 1.4.0 This is the sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap missing method for {{Word2Vec}} and {{Word2VecModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6615) Python API for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6615. -- Resolution: Fixed Issue resolved by pull request 5296 [https://github.com/apache/spark/pull/5296] Python API for Word2Vec --- Key: SPARK-6615 URL: https://issues.apache.org/jira/browse/SPARK-6615 Project: Spark Issue Type: Task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Labels: MLLib,, Python Fix For: 1.4.0 This is the sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap missing method for {{Word2Vec}} and {{Word2VecModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6615) Add missing methods to Word2Vec's Python API
[ https://issues.apache.org/jira/browse/SPARK-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6615: - Summary: Add missing methods to Word2Vec's Python API (was: Python API for Word2Vec) Add missing methods to Word2Vec's Python API Key: SPARK-6615 URL: https://issues.apache.org/jira/browse/SPARK-6615 Project: Spark Issue Type: Task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Labels: MLLib,, Python Fix For: 1.4.0 This is the sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap missing method for {{Word2Vec}} and {{Word2VecModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
[ https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Bieniosek updated SPARK-6698: - Attachment: SPARK-6698.patch Attaching proposed patch to copy StorageLevel from input RDD RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK -- Key: SPARK-6698 URL: https://issues.apache.org/jira/browse/SPARK-6698 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Michael Bieniosek Attachments: SPARK-6698.patch In RandomForest.scala the feature input is persisted with StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging rate is set at 100%. This forces the RDD to be stored unserialized, which causes major JVM GC headaches if the RDD is sizable. Something similar happens in NodeIdCache.scala though I believe in this case the RDD is smaller. A simple fix would be to use the same StorageLevel as the input RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
[ https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Bieniosek updated SPARK-6698: - Attachment: (was: SPARK-6698.patch) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK -- Key: SPARK-6698 URL: https://issues.apache.org/jira/browse/SPARK-6698 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Michael Bieniosek Priority: Minor In RandomForest.scala the feature input is persisted with StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging rate is set at 100%. This forces the RDD to be stored unserialized, which causes major JVM GC headaches if the RDD is sizable. Something similar happens in NodeIdCache.scala though I believe in this case the RDD is smaller. A simple fix would be to use the same StorageLevel as the input RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
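The "use the same StorageLevel as the input RDD" suggestion could look roughly like the following inside the bagging step. This is an assumption about the shape of the change, not the actual patch; input and baggedInput follow the names used in RandomForest.scala.
{code}
import org.apache.spark.storage.StorageLevel

// Reuse the caller's persistence choice; only fall back to the old hard-coded
// level when the input RDD is not persisted at all.
val baggingLevel =
  if (input.getStorageLevel == StorageLevel.NONE) StorageLevel.MEMORY_AND_DISK
  else input.getStorageLevel
baggedInput.persist(baggingLevel)
{code}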
[jira] [Updated] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6682: - Description: In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] was: In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). 
* Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6330) newParquetRelation gets incorrect FileSystem
[ https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6330. - Resolution: Fixed newParquetRelation gets incorrect FileSystem Key: SPARK-6330 URL: https://issues.apache.org/jira/browse/SPARK-6330 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Priority: Blocker Fix For: 1.3.1, 1.4.0 Here's a snippet from newParquet.scala: def refresh(): Unit = { val fs = FileSystem.get(sparkContext.hadoopConfiguration) // Support either reading a collection of raw Parquet part-files, or a collection of folders // containing Parquet files (e.g. partitioned Parquet table). val baseStatuses = paths.distinct.map { p => val qualified = fs.makeQualified(new Path(p)) if (!fs.exists(qualified) && maybeSchema.isDefined) { fs.mkdirs(qualified) prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration) } fs.getFileStatus(qualified) }.toArray If we are running this locally and path points to S3, fs would be incorrect. A fix is to construct fs for each file separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6330) newParquetRelation gets incorrect FileSystem
[ https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394864#comment-14394864 ] Yin Huai commented on SPARK-6330: - Please ignore my comment. newParquetRelation gets incorrect FileSystem Key: SPARK-6330 URL: https://issues.apache.org/jira/browse/SPARK-6330 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Priority: Blocker Fix For: 1.3.1, 1.4.0 Here's a snippet from newParquet.scala: def refresh(): Unit = { val fs = FileSystem.get(sparkContext.hadoopConfiguration) // Support either reading a collection of raw Parquet part-files, or a collection of folders // containing Parquet files (e.g. partitioned Parquet table). val baseStatuses = paths.distinct.map { p => val qualified = fs.makeQualified(new Path(p)) if (!fs.exists(qualified) && maybeSchema.isDefined) { fs.mkdirs(qualified) prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration) } fs.getFileStatus(qualified) }.toArray If we are running this locally and path points to S3, fs would be incorrect. A fix is to construct fs for each file separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394776#comment-14394776 ] Joseph K. Bradley commented on SPARK-6682: -- Note: We could keep 2 APIs for Scala/Java, but this is not a great solution for 2 reasons: * 2 APIs means more code to maintain, and they are confusing to users figuring out which API to use and whether the APIs are the same. * The static train() methods are not workable for some algorithms with > 10 parameters (because of Scala style constraints). Also, once we add SparkR, we will not be able to keep uniform APIs everywhere since R has such different syntax. We can make a best effort, but I feel we should tailor it to the particular language when it makes sense. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394843#comment-14394843 ] Yu Ishikawa commented on SPARK-6682: Hi [~josephkb], Thank you for your proposal. That sounds good. But I think how to call the python train () method should be the same way as Scala/Java builder method for users. It would be nice if there is any mechanism to keep a builder method consistent between Scala/Java and Python automatically. However, if that is very difficult or impossible, I totally agree with your proposal. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6330) newParquetRelation gets incorrect FileSystem
[ https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reopened SPARK-6330: - I am reopening the issue since for s3n, {{fs.makeQualified(qualifiedPath)}} does not work. It will throw a very confusing error message. {code} java.lang.IllegalArgumentException: Wrong FS: s3n://ID:KEY@bucket/path, expected: s3n://ID:KEY@bucket. {code} When I put a relative path, it is fine. Also, if I use qualifiedPath.makeQualified(fs), it is fine. newParquetRelation gets incorrect FileSystem Key: SPARK-6330 URL: https://issues.apache.org/jira/browse/SPARK-6330 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Fix For: 1.3.1, 1.4.0 Here's a snippet from newParquet.scala: def refresh(): Unit = { val fs = FileSystem.get(sparkContext.hadoopConfiguration) // Support either reading a collection of raw Parquet part-files, or a collection of folders // containing Parquet files (e.g. partitioned Parquet table). val baseStatuses = paths.distinct.map { p => val qualified = fs.makeQualified(new Path(p)) if (!fs.exists(qualified) && maybeSchema.isDefined) { fs.mkdirs(qualified) prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration) } fs.getFileStatus(qualified) }.toArray If we are running this locally and path points to S3, fs would be incorrect. A fix is to construct fs for each file separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6492. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5277 [https://github.com/apache/spark/pull/5277] SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies --- Key: SPARK-6492 URL: https://issues.apache.org/jira/browse/SPARK-6492 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.4.0 Reporter: Josh Rosen Priority: Critical Fix For: 1.4.0 A deadlock can occur when DAGScheduler death causes a SparkContext to be shut down while user code is concurrently racing to stop the SparkContext in a finally block. For example: {code} try { sc = new SparkContext("local", "test") // start running a job that causes the DAGSchedulerEventProcessor to crash someRDD.doStuff() } finally { sc.stop() // stop the sparkcontext once the failure in DAGScheduler causes the above job to fail with an exception } {code} This leads to a deadlock. The event processor thread tries to lock on the {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because the thread that holds that lock is waiting for the event processor thread to join: {code} dag-scheduler-event-loop daemon prio=5 tid=0x7ffa69456000 nid=0x9403 waiting for monitor entry [0x0001223ad000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) - waiting to lock 0x0007f5037b08 (a java.lang.Object) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) {code} {code} pool-1-thread-1-ScalaTest-running-SparkContextSuite prio=5 tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007f4b28000 (a org.apache.spark.util.EventLoop$$anon$1) at java.lang.Thread.join(Thread.java:1281) - locked 0x0007f4b28000 (a org.apache.spark.util.EventLoop$$anon$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) - locked 0x0007f5037b08 (a java.lang.Object) [...] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
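The deadlock above is the classic "join while holding a lock the joined thread needs" cycle. The self-contained sketch below (illustrative names only, unrelated to Spark's actual classes) reproduces the same shape, and hangs by design when run: the main thread holds a monitor and joins a worker, while the worker is blocked trying to enter that same monitor. It is only meant to show why the two stack traces above block each other.
{code}
object JoinUnderLockDeadlock {
  private val lock = new Object

  def main(args: Array[String]): Unit = {
    val worker = new Thread(new Runnable {
      // Plays the role of the event-loop thread calling back into stop():
      // it needs `lock` before it can finish.
      override def run(): Unit = lock.synchronized { println("worker finished") }
    })
    lock.synchronized {      // plays the role of SPARK_CONTEXT_CONSTRUCTOR_LOCK
      worker.start()
      Thread.sleep(100)      // give the worker time to block on `lock`
      worker.join()          // deadlock: the worker can never let us leave join()
    }
  }
}
{code}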
[jira] [Updated] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6492: - Assignee: Ilya Ganelin SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies --- Key: SPARK-6492 URL: https://issues.apache.org/jira/browse/SPARK-6492 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0, 1.4.0 Reporter: Josh Rosen Assignee: Ilya Ganelin Priority: Critical Fix For: 1.4.0 A deadlock can occur when DAGScheduler death causes a SparkContext to be shut down while user code is concurrently racing to stop the SparkContext in a finally block. For example: {code} try { sc = new SparkContext("local", "test") // start running a job that causes the DAGSchedulerEventProcessor to crash someRDD.doStuff() } finally { sc.stop() // stop the sparkcontext once the failure in DAGScheduler causes the above job to fail with an exception } {code} This leads to a deadlock. The event processor thread tries to lock on the {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because the thread that holds that lock is waiting for the event processor thread to join: {code} dag-scheduler-event-loop daemon prio=5 tid=0x7ffa69456000 nid=0x9403 waiting for monitor entry [0x0001223ad000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) - waiting to lock 0x0007f5037b08 (a java.lang.Object) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) {code} {code} pool-1-thread-1-ScalaTest-running-SparkContextSuite prio=5 tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on 0x0007f4b28000 (a org.apache.spark.util.EventLoop$$anon$1) at java.lang.Thread.join(Thread.java:1281) - locked 0x0007f4b28000 (a org.apache.spark.util.EventLoop$$anon$1) at java.lang.Thread.join(Thread.java:1355) at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) - locked 0x0007f5037b08 (a java.lang.Object) [...] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4205) Timestamp and Date objects with comparison operators
[ https://issues.apache.org/jira/browse/SPARK-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4205: --- Assignee: Apache Spark Timestamp and Date objects with comparison operators Key: SPARK-4205 URL: https://issues.apache.org/jira/browse/SPARK-4205 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Marc Culler Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4205) Timestamp and Date objects with comparison operators
[ https://issues.apache.org/jira/browse/SPARK-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4205: --- Assignee: (was: Apache Spark) Timestamp and Date objects with comparison operators Key: SPARK-4205 URL: https://issues.apache.org/jira/browse/SPARK-4205 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Marc Culler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
Michael Bieniosek created SPARK-6698: Summary: RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK Key: SPARK-6698 URL: https://issues.apache.org/jira/browse/SPARK-6698 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Michael Bieniosek In RandomForest.scala the feature input is persisted with StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging rate is set at 100%. This forces the RDD to be stored unserialized, which causes major JVM GC headaches if the RDD is sizable. Something similar happens in NodeIdCache.scala though I believe in this case the RDD is smaller. A simple fix would be to use the same StorageLevel as the input RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
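A minimal sketch of the suggested change ("use the same StorageLevel as the input RDD"), with an illustrative helper name that is not part of the actual code: it reuses the caller's storage level and falls back to MEMORY_AND_DISK only when the input is not persisted at all.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Persist a derived RDD (e.g. the bagged input) at the level the caller chose
// for the original input, instead of a hardcoded StorageLevel.MEMORY_AND_DISK.
def persistLikeInput[T](derived: RDD[T], input: RDD[_]): RDD[T] = {
  val level =
    if (input.getStorageLevel == StorageLevel.NONE) StorageLevel.MEMORY_AND_DISK
    else input.getStorageLevel
  derived.persist(level)
}
{code}
This would let users who cache their training data serialized (e.g. MEMORY_AND_DISK_SER) keep that choice through the bagging phase.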
[jira] [Resolved] (SPARK-5203) union with different decimal type report error
[ https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-5203. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4004 [https://github.com/apache/spark/pull/4004] union with different decimal type report error -- Key: SPARK-5203 URL: https://issues.apache.org/jira/browse/SPARK-5203 Project: Spark Issue Type: Bug Components: SQL Reporter: guowei Fix For: 1.4.0 Test case like this: {code:sql} create table test (a decimal(10,1)); select a from test union all select a*2 from test; {code} Exception thown: {noformat} 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union all select a*2 from test] org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: 'Project [*] 'Subquery _u1 'Union Project [a#1] MetastoreRelation default, test, None Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), DecimalType())), DecimalType(21,1)) AS _c0#0] MetastoreRelation default, test, None at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369) at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58) at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
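Until the fix is picked up, an explicit cast that gives both branches of the union the same decimal type may serve as a workaround. This is a hedged sketch, not part of the fix; it assumes a HiveContext named hiveContext and the test table from the report.
{code}
// Both sides now carry decimal(21,1), so the analyzer has nothing left to widen.
val df = hiveContext.sql(
  """SELECT CAST(a   AS decimal(21,1)) FROM test
    |UNION ALL
    |SELECT CAST(a*2 AS decimal(21,1)) FROM test""".stripMargin)
{code}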
[jira] [Updated] (SPARK-6330) newParquetRelation gets incorrect FileSystem
[ https://issues.apache.org/jira/browse/SPARK-6330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6330: Priority: Blocker (was: Major) newParquetRelation gets incorrect FileSystem Key: SPARK-6330 URL: https://issues.apache.org/jira/browse/SPARK-6330 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Volodymyr Lyubinets Assignee: Volodymyr Lyubinets Priority: Blocker Fix For: 1.3.1, 1.4.0 Here's a snippet from newParquet.scala: def refresh(): Unit = { val fs = FileSystem.get(sparkContext.hadoopConfiguration) // Support either reading a collection of raw Parquet part-files, or a collection of folders // containing Parquet files (e.g. partitioned Parquet table). val baseStatuses = paths.distinct.map { p => val qualified = fs.makeQualified(new Path(p)) if (!fs.exists(qualified) && maybeSchema.isDefined) { fs.mkdirs(qualified) prepareMetadata(qualified, maybeSchema.get, sparkContext.hadoopConfiguration) } fs.getFileStatus(qualified) }.toArray If we are running this locally and path points to S3, fs would be incorrect. A fix is to construct fs for each file separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4258) NPE with new Parquet Filters
[ https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394670#comment-14394670 ] Yash Datta commented on SPARK-4258: --- [~yhuai] No it does not. I fixed this in parquet master. Waiting for parquet to release the next version. Current version is 1.6.0rc3 (being used in spark) NPE with new Parquet Filters Key: SPARK-4258 URL: https://issues.apache.org/jira/browse/SPARK-4258 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 {code} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): java.lang.NullPointerException: parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206) parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$Or.accept(Operators.java:302) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201) parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47) parquet.filter2.predicate.Operators$And.accept(Operators.java:290) parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46) parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22) parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108) parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28) parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158) parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138) {code} This occurs when reading parquet data encoded with the older version of the library for TPC-DS query 34. Will work on coming up with a smaller reproduction -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6640) Executor may connect to HeartbeartReceiver before it's setup in the driver side
[ https://issues.apache.org/jira/browse/SPARK-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-6640. Resolution: Fixed Fix Version/s: 1.4.0 Target Version/s: 1.4.0 Executor may connect to HeartbeartReceiver before it's setup in the driver side --- Key: SPARK-6640 URL: https://issues.apache.org/jira/browse/SPARK-6640 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fix For: 1.4.0 Here is the current code about starting LocalBackend and creating HeartbeatReceiver: {code} // Create and start the scheduler private[spark] var (schedulerBackend, taskScheduler) = SparkContext.createTaskScheduler(this, master) private val heartbeatReceiver = env.actorSystem.actorOf( Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver") {code} When creating LocalBackend, it will start `LocalActor`. `LocalActor` will create Executor, and Executor's constructor will retrieve `HeartbeatReceiver`. So we should make sure this line: {code} private val heartbeatReceiver = env.actorSystem.actorOf( Props(new HeartbeatReceiver(this, taskScheduler)), "HeartbeatReceiver") {code} happens before creating LocalActor. However, the current code cannot guarantee that. Sometimes, creating Executor will crash. The issue was reported by sparkdi shopaddr1...@dubna.us in http://apache-spark-user-list.1001560.n3.nabble.com/Actor-not-found-td22265.html#a22324 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
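One way to remove the race described above is to register the HeartbeatReceiver actor before the scheduler is created and hand it the TaskScheduler reference afterwards. The sketch below is illustrative only and meant to be read in the context of the quoted SparkContext snippet; the single-argument constructor and the TaskSchedulerIsSet message are assumptions, not necessarily the actual patch.
{code}
// Register the receiver first, so any Executor started by LocalBackend can find it.
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(this)), "HeartbeatReceiver")

// Only now create the scheduler (which, in local mode, starts LocalActor/Executor).
private[spark] var (schedulerBackend, taskScheduler) =
  SparkContext.createTaskScheduler(this, master)

// Late-bind the scheduler into the receiver once it exists (hypothetical message).
heartbeatReceiver ! TaskSchedulerIsSet
{code}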
[jira] [Updated] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
[ https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6698: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) (Open a PR; changes aren't managed by patches here.) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK -- Key: SPARK-6698 URL: https://issues.apache.org/jira/browse/SPARK-6698 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Michael Bieniosek Priority: Minor Attachments: SPARK-6698.patch In RandomForest.scala the feature input is persisted with StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging rate is set at 100%. This forces the RDD to be stored unserialized, which causes major JVM GC headaches if the RDD is sizable. Something similar happens in NodeIdCache.scala though I believe in this case the RDD is smaller. A simple fix would be to use the same StorageLevel as the input RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
[ https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6698: --- Assignee: (was: Apache Spark) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK -- Key: SPARK-6698 URL: https://issues.apache.org/jira/browse/SPARK-6698 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Michael Bieniosek Priority: Minor Attachments: SPARK-6698.patch In RandomForest.scala the feature input is persisted with StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging rate is set at 100%. This forces the RDD to be stored unserialized, which causes major JVM GC headaches if the RDD is sizable. Something similar happens in NodeIdCache.scala though I believe in this case the RDD is smaller. A simple fix would be to use the same StorageLevel as the input RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
[ https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6698: --- Assignee: Apache Spark RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK -- Key: SPARK-6698 URL: https://issues.apache.org/jira/browse/SPARK-6698 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Michael Bieniosek Assignee: Apache Spark Priority: Minor Attachments: SPARK-6698.patch In RandomForest.scala the feature input is persisted with StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging rate is set at 100%. This forces the RDD to be stored unserialized, which causes major JVM GC headaches if the RDD is sizable. Something similar happens in NodeIdCache.scala though I believe in this case the RDD is smaller. A simple fix would be to use the same StorageLevel as the input RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6698) RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
[ https://issues.apache.org/jira/browse/SPARK-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394681#comment-14394681 ] Apache Spark commented on SPARK-6698: - User 'bien' has created a pull request for this issue: https://github.com/apache/spark/pull/5351 RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK -- Key: SPARK-6698 URL: https://issues.apache.org/jira/browse/SPARK-6698 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Michael Bieniosek Priority: Minor Attachments: SPARK-6698.patch In RandomForest.scala the feature input is persisted with StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging rate is set at 100%. This forces the RDD to be stored unserialized, which causes major JVM GC headaches if the RDD is sizable. Something similar happens in NodeIdCache.scala though I believe in this case the RDD is smaller. A simple fix would be to use the same StorageLevel as the input RDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6688) EventLoggingListener should always operate on resolved URIs
[ https://issues.apache.org/jira/browse/SPARK-6688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-6688. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Assignee: Marcelo Vanzin EventLoggingListener should always operate on resolved URIs --- Key: SPARK-6688 URL: https://issues.apache.org/jira/browse/SPARK-6688 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Minor Fix For: 1.3.1, 1.4.0 A small bug was introduced in 1.3.0, where a check in EventLoggingListener.scala is performed on the non-resolved log path. This means that if fs.defaultFS is not the local filesystem, and the user is trying to store logs in the local filesystem by providing a path with no file: protocol, things will fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
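A rough sketch of the behaviour the fix wants, assuming illustrative names: resolve the configured log directory to a full URI before any filesystem checks, so a bare local path keeps meaning file:// even when fs.defaultFS points at HDFS. Utils.resolveURI is Spark's existing helper for adding a missing scheme; the helper function itself is hypothetical.
{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.util.Utils

def resolvedLogPath(logBaseDir: String, hadoopConf: Configuration): Path = {
  val uri: URI = Utils.resolveURI(logBaseDir)  // e.g. "/tmp/logs" -> "file:/tmp/logs"
  val fs = FileSystem.get(uri, hadoopConf)     // filesystem of the resolved URI
  fs.makeQualified(new Path(uri))              // run existence/permission checks on this
}
{code}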
[jira] [Resolved] (SPARK-6647) Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter
[ https://issues.apache.org/jira/browse/SPARK-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6647. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5309 [https://github.com/apache/spark/pull/5309] Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter --- Key: SPARK-6647 URL: https://issues.apache.org/jira/browse/SPARK-6647 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Fix For: 1.4.0 Now trait {{StringComparison}} is a {{BinaryExpression}}. In fact, it should be a {{BinaryPredicate}}. By making {{StringComparison}} a {{BinaryPredicate}}, we can throw an error when an {{expressions.Predicate}} can't be translated to a data source {{Filter}} in function {{selectFilters}}. Without this modification, because we wrap a {{Filter}} outside the scanned results in {{pruneFilterProjectRaw}}, we can't detect that something is wrong when translating predicates to filters in {{selectFilters}}. The unit test of SPARK-6625 demonstrates this problem. In that PR, even though {{expressions.Contains}} is not properly translated to {{sources.StringContains}}, the filtering is still performed by the {{Filter}} and so the test passes. Of course, with this modification, every {{expressions.Predicate}} class needs a corresponding data source {{Filter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6683) Handling feature scaling properly for GLMs
[ https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6683: - Description: GeneralizedLinearAlgorithm can scale features. This has 2 effects: * improves optimization behavior (essentially always improves behavior in practice) * changes the optimal solution (often for the better in terms of standardizing feature importance) Current problems: * Inefficient implementation: We make a rescaled copy of the data. * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.) * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option. This is a proposal discussed with [~mengxr] for an ideal solution. This solution will require some breaking API changes, but I'd argue these are necessary for the long-term since it's the best API we have thought of. Proposal: * Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here. * API: ** Hide featureScaling from API. (breaking change) ** Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior) ** Externally, users who want to rescale feature (to change the solution) should do that scaling as a preprocessing step. Details on implementation: * GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data. * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in here. was: GeneralizedLinearAlgorithm can scale features. This has 2 effects: * improves optimization behavior (essentially always improves behavior in practice) * changes the optimal solution (often for the better in terms of standardizing feature importance) Current problems: * Inefficient implementation: We make a rescaled copy of the data. * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.) * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option. This is a proposal discussed with [~mengxr] for an ideal solution. This solution will require some breaking API changes, but I'll argue these are necessary for the long-term. Proposal: * Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here. * API: ** Hide featureScaling from API. (breaking change) ** Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior) ** Externally, users who want to rescale feature (to change the solution) should do that scaling as a preprocessing step. Details on implementation: * GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data. * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in here. 
Handling feature scaling properly for GLMs -- Key: SPARK-6683 URL: https://issues.apache.org/jira/browse/SPARK-6683 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley GeneralizedLinearAlgorithm can scale features. This has 2 effects: * improves optimization behavior (essentially always improves behavior in practice) * changes the optimal solution (often for the better in terms of standardizing feature importance) Current problems: * Inefficient implementation: We make a rescaled copy of the data. * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.) * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option. This is a proposal discussed with [~mengxr] for an ideal solution. This solution will require some breaking API changes, but I'd argue these are necessary for the long-term since it's the best API we have thought of. Proposal: * Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here. * API: **
[jira] [Commented] (SPARK-6683) Handling feature scaling properly for GLMs
[ https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395122#comment-14395122 ] Joseph K. Bradley commented on SPARK-6683: -- Great, it sounds like we're in agreement about the API and algorithm behavior. W.r.t. implementation, I haven't thought through it too carefully. I would have thought squared error would be the easiest loss to handle since (I believe) it would reduce to scaling stepSize for each feature (applied to the loss gradient, not the regularization gradient). I'm not sure about the others... Handling feature scaling properly for GLMs -- Key: SPARK-6683 URL: https://issues.apache.org/jira/browse/SPARK-6683 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley GeneralizedLinearAlgorithm can scale features. This has 2 effects: * improves optimization behavior (essentially always improves behavior in practice) * changes the optimal solution (often for the better in terms of standardizing feature importance) Current problems: * Inefficient implementation: We make a rescaled copy of the data. * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.) * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option. This is a proposal discussed with [~mengxr] for an ideal solution. This solution will require some breaking API changes, but I'd argue these are necessary for the long-term since it's the best API we have thought of. Proposal: * Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here. * API: ** Hide featureScaling from API. (breaking change) ** Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior) ** Externally, users who want to rescale feature (to change the solution) should do that scaling as a preprocessing step. Details on implementation: * GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data. * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
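For the "do the scaling as a preprocessing step" part of the proposal, MLlib's existing StandardScaler already covers the user-facing side. A minimal sketch (illustrative variable and function names) of standardizing features explicitly before calling any GLM trainer:
{code}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def standardizeFeatures(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  // Learn per-feature standard deviations; withMean = false keeps sparse vectors sparse.
  val scalerModel = new StandardScaler(withMean = false, withStd = true)
    .fit(data.map(_.features))
  data.map(lp => LabeledPoint(lp.label, scalerModel.transform(lp.features)))
}
{code}
This keeps the rescaling, and therefore the change of the optimum, an explicit user decision, which matches the API direction described above.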
[jira] [Assigned] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6700: --- Assignee: Lianhui Wang (was: Apache Spark) flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Lianhui Wang Priority: Critical org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at
[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395127#comment-14395127 ] Apache Spark commented on SPARK-6700: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5356 flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Lianhui Wang Priority: Critical org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44) at
[jira] [Assigned] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6700: --- Assignee: Apache Spark (was: Lianhui Wang) flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Apache Spark Priority: Critical org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at
[jira] [Updated] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-6700: -- Labels: test yarn (was: ) flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Lianhui Wang Priority: Critical Labels: test, yarn org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14394998#comment-14394998 ] Joseph K. Bradley commented on SPARK-6682: -- I don't know of an automatic mechanism. It might be possible to do code generation, but that's a bit hacky and might be more trouble than it is worth. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
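A minimal sketch of how the deprecation step could look on the Scala side, using a hypothetical NaiveBayesExample class rather than the real MLlib signatures: the old static train() stays for binary compatibility but is marked @deprecated and simply delegates to the builder.
{code}
// Hypothetical sketch only -- class and method names do not mirror MLlib's real API.
class NaiveBayesExample private (private var lambda: Double) {
  def this() = this(1.0) // default parameter enforced by the builder

  // Builder-style setter returning this for chaining.
  def setLambda(value: Double): this.type = {
    lambda = value
    this
  }

  def run(data: Seq[(Double, Array[Double])]): String =
    s"model trained with lambda=$lambda on ${data.size} examples"
}

object NaiveBayesExample {
  // Kept only for API stability; steers users towards the builder pattern.
  @deprecated("Use new NaiveBayesExample().setLambda(...).run(...) instead", "1.4.0")
  def train(data: Seq[(Double, Array[Double])], lambda: Double): String =
    new NaiveBayesExample().setLambda(lambda).run(data)
}
{code}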
[jira] [Assigned] (SPARK-6577) SparseMatrix should be supported in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6577: --- Assignee: Apache Spark SparseMatrix should be supported in PySpark --- Key: SPARK-6577 URL: https://issues.apache.org/jira/browse/SPARK-6577 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6577) SparseMatrix should be supported in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6577: --- Assignee: (was: Apache Spark) SparseMatrix should be supported in PySpark --- Key: SPARK-6577 URL: https://issues.apache.org/jira/browse/SPARK-6577 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395099#comment-14395099 ] Apache Spark commented on SPARK-6577: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/5355 SparseMatrix should be supported in PySpark --- Key: SPARK-6577 URL: https://issues.apache.org/jira/browse/SPARK-6577 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Manoj Kumar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6700) flaky test: run Python application in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-6700. Resolution: Fixed flaky test: run Python application in yarn-cluster mode Key: SPARK-6700 URL: https://issues.apache.org/jira/browse/SPARK-6700 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Lianhui Wang Priority: Critical Labels: test, yarn org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode Failing for the past 1 build (Since Failed#2025 ) Took 12 sec. Error Message {code} Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 Stacktrace sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties, --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$scalatest$BeforeAndAfterAll$$super$run(YarnClusterSuite.scala:44) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) at
[jira] [Commented] (SPARK-6683) Handling feature scaling properly for GLMs
[ https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395088#comment-14395088 ] DB Tsai commented on SPARK-6683: I have this implemented in our lab, including handling the intercept without adding a bias column to the training dataset, which improves performance a lot without doing extra caching. In logistic regression, the objective function is a sum of logP terms, which is invariant under this transformation; this implies that instead of rescaling x, we can get the same result by rescaling the gradient. As a result, this can be done right before optimization. However, in linear regression the objective value changes under the transformation as well, so I need to handle it differently. As a result, it will be challenging to come up with one framework that works for all the different types of generalized linear models. I would like to have them implemented separately in the new SparkML codebase instead of sharing the same GLM base class. What do you think? Handling feature scaling properly for GLMs -- Key: SPARK-6683 URL: https://issues.apache.org/jira/browse/SPARK-6683 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley GeneralizedLinearAlgorithm can scale features. This has 2 effects: * improves optimization behavior (essentially always improves behavior in practice) * changes the optimal solution (often for the better in terms of standardizing feature importance) Current problems: * Inefficient implementation: We make a rescaled copy of the data. * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.) * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option. This is a proposal discussed with [~mengxr] for an ideal solution. This solution will require some breaking API changes, but I'd argue these are necessary for the long term since it's the best API we have thought of. Proposal: * Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here. * API: ** Hide featureScaling from API. (breaking change) ** Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior) ** Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step. Details on implementation: * GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data. * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
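A rough sketch of the per-feature scaling idea for plain gradient descent, assuming the per-feature standard deviations are precomputed (this is an illustration, not the actual GeneralizedLinearAlgorithm code): a step taken on standardized features x_j / s_j, mapped back to the original space, is the same as dividing the j-th gradient component by s_j squared, so no rescaled copy of the data is needed.
{code}
// Hypothetical sketch: emulate a gradient step on standardized features
// without copying the data. For objectives that depend on the features only
// through w.x (e.g. logistic loss), stepping in the standardized space and
// mapping back gives  w_j -= stepSize * grad_j / (s_j * s_j).
def scaledStep(
    weights: Array[Double],
    gradient: Array[Double],
    featureStd: Array[Double],
    stepSize: Double): Array[Double] = {
  require(weights.length == gradient.length && gradient.length == featureStd.length)
  Array.tabulate(weights.length) { j =>
    val s = if (featureStd(j) != 0.0) featureStd(j) else 1.0  // guard constant features
    weights(j) - stepSize * gradient(j) / (s * s)
  }
}
{code}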
[jira] [Created] (SPARK-6703) Provide a way to discover existing SparkContext's
Patrick Wendell created SPARK-6703: -- Summary: Provide a way to discover existing SparkContext's Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to have a way to write an application where you can get or create a SparkContext and have some standard type of synchronization point that application authors can access. The simplest, most surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
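A minimal sketch of the singleton idea, assuming a hypothetical SharedSparkContext holder (the object name, the setter, and the absence of any SparkConf compatibility check are illustrative choices, not the eventual SparkContext.getOrCreate implementation):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of an optional shared-context holder. An outer
// framework (JobServer, notebook server, ...) could call setActive() once,
// and downstream applications call getOrCreate() instead of new SparkContext.
object SharedSparkContext {
  @volatile private var active: Option[SparkContext] = None

  def setActive(sc: SparkContext): Unit = synchronized { active = Some(sc) }

  def getOrCreate(conf: SparkConf = new SparkConf()): SparkContext = synchronized {
    active.getOrElse {
      val sc = new SparkContext(conf)  // only one SparkContext per JVM is supported today
      active = Some(sc)
      sc
    }
  }
}
{code}
A downstream application would then call SharedSparkContext.getOrCreate(new SparkConf().setAppName("myApp")) and receive either the externally provided context or a freshly created one.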
[jira] [Updated] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5992: - Shepherd: Xiangrui Meng Locality Sensitive Hashing (LSH) for MLlib -- Key: SPARK-5992 URL: https://issues.apache.org/jira/browse/SPARK-5992 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
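For concreteness, one of the simplest candidate algorithms is sign-random-projection (random hyperplane) LSH for cosine similarity. A hedged sketch of the hashing step, written independently of any proposed MLlib API (the class name and layout are illustrative assumptions): a vector is hashed to a bit signature whose collisions correlate with cosine similarity.
{code}
import scala.util.Random

// Hypothetical sketch of random-hyperplane LSH for cosine similarity.
// Each bit of the signature is the sign of a dot product with a random
// Gaussian hyperplane; similar vectors agree on most bits with high probability.
class RandomHyperplaneLSH(numBits: Int, dim: Int, seed: Long = 42L) {
  require(numBits > 0 && numBits <= 31, "signature is packed into an Int here")
  private val rng = new Random(seed)
  private val planes: Array[Array[Double]] = Array.fill(numBits, dim)(rng.nextGaussian())

  def hash(v: Array[Double]): Int = {
    require(v.length == dim)
    var signature = 0
    var i = 0
    while (i < numBits) {
      var dot = 0.0
      var j = 0
      while (j < dim) { dot += planes(i)(j) * v(j); j += 1 }
      if (dot >= 0.0) signature |= (1 << i)
      i += 1
    }
    signature
  }
}
{code}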
[jira] [Created] (SPARK-6701) Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application
Andrew Or created SPARK-6701: Summary: Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application Key: SPARK-6701 URL: https://issues.apache.org/jira/browse/SPARK-6701 Project: Spark Issue Type: Bug Components: Tests, YARN Affects Versions: 1.3.0 Reporter: Andrew Or Priority: Critical Observed in Master and 1.3, both in SBT and in Maven (with YARN). {code} Error Message Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties, --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) exited with code 1 sbt.ForkMain$ForkError: Process List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit, --master, yarn-cluster, --num-executors, 1, --properties-file, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/spark968020731409047027.properties, --py-files, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test2.py, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/test.py, /tmp/spark-ea49597c-2a95-4d8c-a9ea-23861a02c9bd/result961582960984674264.tmp) exited with code 1 at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122) at org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6683) Handling feature scaling properly for GLMs
[ https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395145#comment-14395145 ] Joseph K. Bradley commented on SPARK-6683: -- If you're referring to what I was saying about needing to rescale both step size and regularization for least squares, I agree. Handling feature scaling properly for GLMs -- Key: SPARK-6683 URL: https://issues.apache.org/jira/browse/SPARK-6683 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley GeneralizedLinearAlgorithm can scale features. This has 2 effects: * improves optimization behavior (essentially always improves behavior in practice) * changes the optimal solution (often for the better in terms of standardizing feature importance) Current problems: * Inefficient implementation: We make a rescaled copy of the data. * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries. (Note: Feature scaling could be handled without changing the solution.) * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option. This is a proposal discussed with [~mengxr] for an ideal solution. This solution will require some breaking API changes, but I'd argue these are necessary for the long term since it's the best API we have thought of. Proposal: * Implementation: Change to avoid making a rescaled copy of the data (described below). No API issues here. * API: ** Hide featureScaling from API. (breaking change) ** Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior) ** Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step. Details on implementation: * GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above). This would require storing a vector of length numFeatures, rather than making a full copy of the data. * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
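To make the least-squares point concrete, a hedged derivation in generic notation (nothing below is taken from the codebase): standardizing the features leaves the data term unchanged but not the L2 penalty, so recovering the original-space solution requires adjusting the regularization (and, for plain gradient descent, the per-feature step sizes) by the squared feature scales.
{code}
\min_w \; \|Xw - y\|^2 + \lambda \|w\|^2,
\qquad x'_j = x_j / s_j, \quad w'_j = s_j w_j
\;\Rightarrow\; X'w' = Xw \ \text{(data term unchanged)}, \quad\text{but}\quad
\lambda \|w'\|^2 = \lambda \sum_j s_j^2 w_j^2 \neq \lambda \|w\|^2,
\qquad\text{so matching the original problem needs per-feature penalties } \lambda_j = \lambda / s_j^2
\ \text{(and per-feature step sizes scaled by } 1/s_j^2 \text{ for gradient descent).}
{code}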
[jira] [Commented] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows
[ https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14395001#comment-14395001 ] Alexander Ulanov commented on SPARK-6673: - Probably a similar issue: I am trying to execute unit tests in MLlib with LocalClusterSparkContext on Windows 7. I am getting a bunch of errors in the log saying: Cannot find any assembly build directories. If I do set SPARK_SCALA_VERSION=2.10, then I get: No assemblies found in 'C:\dev\spark\mllib\.\assembly\target\scala-2.10' spark-shell.cmd can't start even when spark was built in Windows Key: SPARK-6673 URL: https://issues.apache.org/jira/browse/SPARK-6673 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.3.0 Reporter: Masayoshi TSUZUKI Assignee: Masayoshi TSUZUKI Priority: Blocker spark-shell.cmd can't start. {code} bin\spark-shell.cmd --master local {code} will get {code} Failed to find Spark assembly JAR. You need to build Spark before running this program. {code} even when we have built Spark. This is because of the lack of the environment variable {{SPARK_SCALA_VERSION}}, which is used in {{spark-class2.cmd}}. In the Linux scripts, this value is set to {{2.10}} or {{2.11}} by default in {{load-spark-env.sh}}, but there is no equivalent script on Windows. As a workaround, executing {code} set SPARK_SCALA_VERSION=2.10 {code} before running spark-shell.cmd lets it start successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org