GitHub user avinashkolla opened a pull request: https://github.com/apache/spark/pull/15118
Branch 2.0

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15118.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15118

----

commit f46a074510e47206de9d3b3ac6902af321923ce8
Author: Sylvain Zimmer <sylv...@sylvainzimmer.com>
Date: 2016-07-28T16:51:45Z

[SPARK-16740][SQL] Fix Long overflow in LongToUnsafeRowMap

Avoid overflow of the Long type causing a NegativeArraySizeException a few lines later. Unit tests for HashedRelationSuite still pass. I can confirm the python script I included in https://issues.apache.org/jira/browse/SPARK-16740 works fine with this patch. Unfortunately I don't have the knowledge/time to write a Scala test case for HashedRelationSuite right now. As the patch is pretty obvious I hope it can be included without this. Thanks!

Author: Sylvain Zimmer <sylv...@sylvainzimmer.com>

Closes #14373 from sylvinus/master.

(cherry picked from commit 1178d61ede816bf1c8d5bb3dbb3b965c9b944407)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit fb09a693d6f58d71ec042224b8ea66b972c1adc2
Author: Sameer Agarwal <samee...@cs.berkeley.edu>
Date: 2016-07-28T20:04:19Z

[SPARK-16764][SQL] Recommend disabling vectorized parquet reader on OutOfMemoryError

## What changes were proposed in this pull request?

We currently don't bound or manage the data array size used by column vectors in the vectorized reader (they're just bounded by Integer.MAX_VALUE), which may lead to OOMs while reading data.
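The Long overflow in SPARK-16740 above can be illustrated with a small sketch. This is not Spark code: `to_int64` and the key range are hypothetical stand-ins for the map's min/max keys, simulating Java's 64-bit signed arithmetic (Python ints never overflow on their own).

```python
# Simulate Java Long wraparound to show the SPARK-16740 failure mode.
def to_int64(x):
    x &= (1 << 64) - 1                          # keep low 64 bits
    return x - (1 << 64) if x >= (1 << 63) else x

min_key, max_key = -(1 << 62), 1 << 62
array_size = to_int64(max_key - min_key + 1)    # wraps to a negative Long
assert array_size < 0  # new long[array_size] throws NegativeArraySizeException
```

Checking the range before allocating, as the patch does, avoids the negative size ever reaching the array allocation.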
As a short-term fix, this patch intercepts the OutOfMemoryError exception and suggests that the user disable the vectorized parquet reader.

## How was this patch tested?

Existing tests.

Author: Sameer Agarwal <samee...@cs.berkeley.edu>

Closes #14387 from sameeragarwal/oom.

(cherry picked from commit 3fd39b87bda77f3c3a4622d854f23d4234683571)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit 5cd79c396f98660e12b02c0151a084b4d1599b6b
Author: Nicholas Chammas <nicholas.cham...@gmail.com>
Date: 2016-07-28T21:57:15Z

[SPARK-16772] Correct API doc references to PySpark classes + formatting fixes

## What's Been Changed

The PR corrects several broken or missing class references in the Python API docs. It also corrects formatting problems. For example, you can see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.registerFunction) how Sphinx is not picking up the reference to `DataType`. That's because the reference is relative to the current module, whereas `DataType` is in a different module. You can also see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame) how the formatting for byte, tinyint, and so on is italic instead of monospace. That's because in ReST single backticks just make things italic, unlike in Markdown.

## Testing

I tested this PR by [building the Python docs](https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html) and reviewing the results locally in my browser. I confirmed that the broken or missing class references were resolved, and that the formatting was corrected.

Author: Nicholas Chammas <nicholas.cham...@gmail.com>

Closes #14393 from nchammas/python-docstring-fixes.
(cherry picked from commit 274f3b9ec86e4109c7678eef60f990d41dc3899f)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit ed03d0a690c9a7920a21c858df7f42f9a41f28d7
Author: Wesley Tang <tangming...@mininglamp.com>
Date: 2016-07-29T11:26:05Z

[SPARK-16664][SQL] Fix persist call on Data frames with more than 200…

## What changes were proposed in this pull request?

f12f11e578169b47e3f8b18b299948c0670ba585 introduced this bug: `foreach` was missed where `map` was handled.

## How was this patch tested?

Test added.

Author: Wesley Tang <tangming...@mininglamp.com>

Closes #14324 from breakdawn/master.

(cherry picked from commit d1d5069aa3744d46abd3889abab5f15e9067382a)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit efad4aa1468867b36cffb1e8c91f9731c48eca81
Author: Yanbo Liang <yblia...@gmail.com>
Date: 2016-07-29T11:40:20Z

[SPARK-16750][ML] Fix GaussianMixture training failed due to feature column type mistake

## What changes were proposed in this pull request?

ML `GaussianMixture` training failed due to a feature column type mistake. The feature column type should be `ml.linalg.VectorUDT`, but it got `mllib.linalg.VectorUDT` by mistake. See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug. Why did the unit tests not catch this error? Because some estimators/transformers missed calling `transformSchema(dataset.schema)` first during `fit` or `transform`. I will also add this call to all estimators/transformers that missed it in this PR.

## How was this patch tested?

No new tests; should pass existing ones.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #14378 from yanboliang/spark-16750.

(cherry picked from commit 0557a45452f6e73877e5ec972110825ce8f3fbc5)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 268bf144004952385e4573a11d981b3440f31f5d
Author: Adam Roberts <arobe...@uk.ibm.com>
Date: 2016-07-29T11:43:01Z

[SPARK-16751] Upgrade derby to 10.12.1.1

The version of derby was upgraded based on important security info at VersionEye. Test scope was added so we don't include it in our final package anyway. NB: I think this should be backported to all previous releases as it is a security problem: https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1. The CVE number is CVE-2015-1832. I also suggest we add a SECURITY tag for JIRAs.

Existing tests were run with the change, making sure that we see no new failures. I checked that derby 10.12.x, and not derby 10.11.x, is downloaded to our ~/.m2 folder. I then used dev/make-distribution.sh and checked the dist/jars folder for Spark 2.0: no derby jar is present. I don't know if this would also remove it from the assembly jar in our 1.x branches.

Author: Adam Roberts <arobe...@uk.ibm.com>

Closes #14379 from a-roberts/patch-4.

(cherry picked from commit 04a2c072d94874f3f7ae9dd94c026e8826a75ccd)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit a32531a72cc2b6a0ff95a7a73b256ffc5d9eff60
Author: Sun Dapeng <s...@apache.org>
Date: 2016-07-29T13:01:23Z

[SPARK-16761][DOC][ML] Fix doc link in docs/ml-guide.md

## What changes were proposed in this pull request?

Fix the link at http://spark.apache.org/docs/latest/ml-guide.html.

## How was this patch tested?

None.

Author: Sun Dapeng <s...@apache.org>

Closes #14386 from sundapeng/doclink.

(cherry picked from commit 2c15323ad026da64caa68787c5d103a8595f63a0)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 7d87fc9649b141a1888b89363a8e311690d0fb56
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date: 2016-07-30T02:59:35Z

[SPARK-16748][SQL] SparkExceptions during planning should not be wrapped in TreeNodeException

## What changes were proposed in this pull request?
We do not want SparkExceptions from job failures in the planning phase to create a TreeNodeException; hence, do not wrap SparkException in TreeNodeException.

## How was this patch tested?

New unit test.

Author: Tathagata Das <tathagata.das1...@gmail.com>

Closes #14395 from tdas/SPARK-16748.

(cherry picked from commit bbc247548ac6faeca15afc05c266cee37ef13416)
Signed-off-by: Yin Huai <yh...@databricks.com>

commit 26da5a7fc37ac961e7b4d8f423e8e58aefb5c2bc
Author: Bryan Cutler <cutl...@gmail.com>
Date: 2016-07-30T15:08:33Z

[SPARK-16800][EXAMPLES][ML] Fix Java examples that fail to run due to exception

## What changes were proposed in this pull request?

Some Java examples are using mllib.linalg.Vectors instead of ml.linalg.Vectors, which causes an exception when run. There are also some Java examples that incorrectly specify data types in the schema, also causing an exception.

## How was this patch tested?

Ran the corrected examples locally.

Author: Bryan Cutler <cutl...@gmail.com>

Closes #14405 from BryanCutler/java-examples-ml.Vectors-fix-SPARK-16800.

(cherry picked from commit a6290e51e402e8434d6207d553db1f551e714fde)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 75dd78130d29154a3147490c57bce6883c992469
Author: Reynold Xin <r...@databricks.com>
Date: 2016-07-31T06:05:03Z

[SPARK-16812] Open up SparkILoop.getAddedJars

## What changes were proposed in this pull request?

This patch makes SparkILoop.getAddedJars a public developer API. It is a useful function for getting the list of jars added.

## How was this patch tested?

N/A - this is a simple visibility change.

Author: Reynold Xin <r...@databricks.com>

Closes #14417 from rxin/SPARK-16812.

(cherry picked from commit 7c27d075c39ebaf3e762284e2536fe7be0e3da87)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit d357ca3023c84e472927380bed65b1cee33c4e03
Author: Reynold Xin <r...@databricks.com>
Date: 2016-07-31T08:31:06Z

[SPARK-16813][SQL] Remove private[sql] and private[spark] from catalyst package

The catalyst package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime. This patch removes all private[sql] and private[spark] visibility modifiers in org.apache.spark.sql.catalyst.

N/A - just visibility changes.

Author: Reynold Xin <r...@databricks.com>

Closes #14418 from rxin/SPARK-16813.

(cherry picked from commit 064d91ff7342002414d3274694a8e2e37f154986)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit c651ff53adefd0c74c84500e929ed37f8ad668d2
Author: Reynold Xin <r...@databricks.com>
Date: 2016-08-01T01:21:06Z

[SPARK-16805][SQL] Log timezone when query result does not match

## What changes were proposed in this pull request?

It is useful to log the timezone when a query result does not match, especially on build machines that have a different timezone from AMPLab Jenkins.

## How was this patch tested?

This is a test-only change.

Author: Reynold Xin <r...@databricks.com>

Closes #14413 from rxin/SPARK-16805.

(cherry picked from commit 579fbcf3bd9717003025caecc0c0b85bcff7ac7f)
Signed-off-by: Yin Huai <yh...@databricks.com>

commit 4bdc558989ef4a9490ca42e7330c10136151134b
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-08-01T13:55:31Z

[SPARK-16778][SQL][TRIVIAL] Fix deprecation warning with SQLContext

## What changes were proposed in this pull request?

Change to the non-deprecated constructor for SQLContext.

## How was this patch tested?

Existing tests.

Author: Holden Karau <hol...@us.ibm.com>

Closes #14406 from holdenk/SPARK-16778-fix-use-of-deprecated-SQLContext-constructor.
(cherry picked from commit 1e9b59b73bdb8aacf5a85e0eed29efc6485a3bc3)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit b49091e10100dfbefeabdd2dfe0b64cdf613a052
Author: hyukjinkwon <gurwls...@gmail.com>
Date: 2016-08-01T13:56:52Z

[SPARK-16776][STREAMING] Replace deprecated API in KafkaTestUtils for 0.10.0.

## What changes were proposed in this pull request?

This PR replaces the old Kafka APIs with 0.10.0 ones in `KafkaTestUtils`. The changes include:

- `Producer` to `KafkaProducer`
- Change configurations to equivalent ones. (I referred [here](http://kafka.apache.org/documentation.html#producerconfigs) for 0.10.0 and [here](http://kafka.apache.org/082/documentation.html#producerconfigs) for the old 0.8.2.)

This PR will remove the build warnings below:

```scala
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:71: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]   private var producer: Producer[String, String] = _
[WARNING]                         ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:181: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]     producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]                    ^
[WARNING] .../spark/streaming/kafka010/KafkaTestUtils.scala:181: class ProducerConfig in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerConfig instead.
[WARNING]     producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]                                                 ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:182: class KeyedMessage in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerRecord instead.
[WARNING]     producer.send(messages.map { new KeyedMessage[String, String](topic, _ ) }: _*)
[WARNING]                                      ^
[WARNING] four warnings found
[WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
[WARNING] 1 warning
```

## How was this patch tested?

Existing tests that use `KafkaTestUtils` should cover this.

Author: hyukjinkwon <gurwls...@gmail.com>

Closes #14416 from HyukjinKwon/SPARK-16776.

(cherry picked from commit f93ad4fe7c9728c8dd67a8095de3d39fad21d03f)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 1523bf69a0ef87f36b0b3995ce2d7a33aaff6046
Author: eyal farago <eyal farago>
Date: 2016-08-01T14:43:32Z

[SPARK-16791][SQL] cast struct with timestamp field fails

## What changes were proposed in this pull request?

A failing test case + fix for SPARK-16791 (https://issues.apache.org/jira/browse/SPARK-16791).

## How was this patch tested?

Added a failing test case to CastSuite, then fixed the Cast code and reran the entire CastSuite.

Author: eyal farago <eyal farago>
Author: Eyal Farago <eyal.far...@actimize.com>

Closes #14400 from eyalfa/SPARK-16791_cast_struct_with_timestamp_field_fails.

(cherry picked from commit 338a98d65c8efe0c41f39a8dddeab7040dcda125)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 4e73cb8ebdb0dcb1be4dce562bac9214e9905b8e
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-08-01T20:57:05Z

[SPARK-16774][SQL] Fix use of deprecated timestamp constructor & improve timezone handling

## What changes were proposed in this pull request?
Removes the deprecated timestamp constructor and incidentally fixes a use which was using the system timezone rather than the one specified when working near DST. This change also causes the roundtrip tests to fail since it now actually uses all the timezones near DST boundaries, where it didn't before. Note: this is only a partial solution; longer term we should follow up with https://issues.apache.org/jira/browse/SPARK-16788 to avoid this problem & simplify our timezone handling code.

## How was this patch tested?

New tests for two timezones were added, so even if the user's timezone happens to coincide with one, the other tests should still fail. Important note: this (temporarily) disables the round-trip tests until we can fix the issue more thoroughly.

Author: Holden Karau <hol...@us.ibm.com>

Closes #14398 from holdenk/SPARK-16774-fix-use-of-deprecated-timestamp-constructor.

(cherry picked from commit ab1e761f9691b41385e2ed2202c5a671c63c963d)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 1813bbd9bf7cb9afd29e1385f0dc52e8fcc4f132
Author: Shixiong Zhu <shixi...@databricks.com>
Date: 2016-08-01T21:41:22Z

[SPARK-15869][STREAMING] Fix a potential NPE in StreamingJobProgressListener.getBatchUIData

## What changes were proposed in this pull request?

Moved `asScala` into a `map` to avoid an NPE.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixi...@databricks.com>

Closes #14443 from zsxwing/SPARK-15869.

(cherry picked from commit 03d46aafe561b03e25f4e25cf01e631c18dd827c)
Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 5fbf5f93ee5aa4d1aca0fa0c8fb769a085dd7b93
Author: Eric Liang <e...@databricks.com>
Date: 2016-08-02T02:46:20Z

[SPARK-16818] Exchange reuse incorrectly reuses scans over different sets of partitions

https://github.com/apache/spark/pull/14425 rebased for branch-2.0

Author: Eric Liang <e...@databricks.com>

Closes #14427 from ericl/spark-16818-br-2.
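The SPARK-15869 fix above ("moved `asScala` into a `map`") is an instance of a general null-guard pattern: convert a container only when the lookup actually returned something. A minimal Python sketch, with hypothetical names standing in for the listener's internals:

```python
# Null-guard sketch: the real fix wraps a nullable Java lookup result before
# converting it, so a missing batch no longer triggers a NullPointerException.
def get_batch_ui_data(batch_time_to_output_ops, batch_time):
    raw = batch_time_to_output_ops.get(batch_time)  # may be None (Java: null)
    return None if raw is None else list(raw)       # convert only when present

assert get_batch_ui_data({1000: (1, 2)}, 1000) == [1, 2]
assert get_batch_ui_data({}, 1000) is None          # absent batch, no crash
```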
commit 9d9956e8f8abd41a603fde2347384428b7ec715c
Author: Cheng Lian <l...@databricks.com>
Date: 2016-08-02T07:02:40Z

[SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings

## What changes were proposed in this pull request?

This PR makes various minor updates to the examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (the JDBC example in Scala/Java/Python) are also fixed.

## How was this patch tested?

Manually tested.

Author: Cheng Lian <l...@databricks.com>

Closes #14368 from liancheng/revise-examples.

(cherry picked from commit 10e1c0e638774f5d746771b6dd251de2480f94eb)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit c5516ab60da860320693bbc245818cb6d8a282c8
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-08-02T14:28:46Z

[SPARK-16558][EXAMPLES][MLLIB] examples/mllib/LDAExample should use MLVector instead of MLlib Vector

## What changes were proposed in this pull request?

mllib.LDAExample uses the ML pipeline and the MLlib LDA algorithm. The former transforms the original data into MLVector format, while the latter uses the MLlib Vector format.

## How was this patch tested?

Tested manually.

Author: Xusen Yin <yinxu...@gmail.com>

Closes #14212 from yinxusen/SPARK-16558.

(cherry picked from commit dd8514fa2059a695143073f852b1abee50e522bd)
Signed-off-by: Yanbo Liang <yblia...@gmail.com>

commit fc18e259a311c0f1dffe47edef0e42182afca8e9
Author: Maciej Brynski <maciej.bryn...@adpilot.pl>
Date: 2016-08-02T15:07:08Z

[SPARK-15541] Casting ConcurrentHashMap to ConcurrentMap (master branch)

## What changes were proposed in this pull request?

Casting ConcurrentHashMap to ConcurrentMap allows code compiled with Java 8 to run on Java 7.

## How was this patch tested?

Compilation. Existing automated tests.

Author: Maciej Brynski <maciej.bryn...@adpilot.pl>

Closes #14459 from maver1ck/spark-15541-master.
(cherry picked from commit 511dede1118f20a7756f614acb6fc88af52c9de9)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 22f0899bc78e1f2021084c6397a4c05ad6317bae
Author: Tom Magrino <tmagr...@fb.com>
Date: 2016-08-02T16:16:44Z

[SPARK-16837][SQL] TimeWindow incorrectly drops slideDuration in constructors

## What changes were proposed in this pull request?

Fix of incorrect arguments (dropping slideDuration and using windowDuration instead) in constructors for TimeWindow. The JIRA this addresses is here: https://issues.apache.org/jira/browse/SPARK-16837

## How was this patch tested?

Added a test to TimeWindowSuite to check that the results of the TimeWindow object's apply and the TimeWindow class constructor are equivalent.

Author: Tom Magrino <tmagr...@fb.com>

Closes #14441 from tmagrino/windowing-fix.

(cherry picked from commit 1dab63d8d3c59a3d6b4ee8e777810c44849e58b8)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit ef7927e8e77558f9a18eacc8491b0c28231e2769
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date: 2016-08-02T17:08:18Z

[SPARK-16062] [SPARK-15989] [SQL] Fix two bugs of Python-only UDTs

## What changes were proposed in this pull request?

There are two related bugs with Python-only UDTs. Because the test case for the second one needs the first fix too, I put them into one PR. If that is not appropriate, please let me know.

### First bug: when MapObjects works on Python-only UDTs

`RowEncoder` will use `PythonUserDefinedType.sqlType` for its deserializer expression. If the sql type is `ArrayType`, we will have `MapObjects` working on it. But `MapObjects` doesn't consider `PythonUserDefinedType` as its input data type.
It causes an error like:

```python
import pyspark.sql.group
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql.types import *

schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema)
df.show()
```

```
File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
: java.lang.RuntimeException: Error while decoding: scala.MatchError: org.apache.spark.sql.types.PythonUserDefinedTypef4ceede8 (of class org.apache.spark.sql.types.PythonUserDefinedType)
...
```

### Second bug: when a Python-only UDT is the element type of ArrayType

```python
import pyspark.sql.group
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql.types import *

schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
df = spark.createDataFrame([(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)], schema=schema)
df.show()
```

## How was this patch tested?

PySpark's sql tests.

Author: Liang-Chi Hsieh <sim...@tw.ibm.com>

Closes #13778 from viirya/fix-pyudt.

(cherry picked from commit 146001a9ffefc7aaedd3d888d68c7a9b80bca545)
Signed-off-by: Davies Liu <davies....@gmail.com>

commit a937c9ee44e0766194fc8ca4bce2338453112a53
Author: Herman van Hovell <hvanhov...@databricks.com>
Date: 2016-08-02T17:09:47Z

[SPARK-16836][SQL] Add support for CURRENT_DATE/CURRENT_TIMESTAMP literals

## What changes were proposed in this pull request?

In Spark 1.6 (with Hive support) we could use the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions as literals (without adding parentheses), for example:

```SQL
select
/* Spark 1.6: */ current_date,
/* Spark 1.6 & Spark 2.0: */ current_date()
```

This was accidentally dropped in Spark 2.0. This PR reinstates this functionality.

## How was this patch tested?

Added a case to ExpressionParserSuite.
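The idea behind SPARK-16836 above is that certain keywords parse as zero-argument function calls even when written without parentheses. A hypothetical mini-resolver (not Spark's parser) sketches this:

```python
# Hypothetical sketch: bare keyword literals resolve to zero-arg functions.
import datetime

KEYWORD_FUNCS = {
    "current_date": datetime.date.today,
    "current_timestamp": datetime.datetime.now,
}

def resolve(expr):
    name = expr.lower().rstrip("()")   # accept both bare and call forms
    if name not in KEYWORD_FUNCS:
        raise ValueError(f"unknown expression: {expr}")
    return KEYWORD_FUNCS[name]()

# the bare keyword and the explicit call resolve to the same value
assert resolve("CURRENT_DATE") == resolve("current_date()")
```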
Author: Herman van Hovell <hvanhov...@databricks.com>

Closes #14442 from hvanhovell/SPARK-16836.

(cherry picked from commit 2330f3ecbbd89c7eaab9cc0d06726aa743b16334)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit f190bb83beaafb65c8e6290e9ecaa61ac51e04bb
Author: petermaxlee <petermax...@gmail.com>
Date: 2016-08-02T11:32:35Z

[SPARK-16850][SQL] Improve type checking error message for greatest/least

The greatest/least functions do not have the most friendly error message for data types. This patch improves the error message to not show the Seq type, and to use more human-readable data types.

Before:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST (ArrayBuffer(DecimalType(2,1), StringType)).; line 1 pos 7
```

After:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST(decimal(2,1), string).; line 1 pos 7
```

Manually verified the output and also added unit tests to ConditionalExpressionSuite.

Author: petermaxlee <petermax...@gmail.com>

Closes #14453 from petermaxlee/SPARK-16850.

(cherry picked from commit a1ff72e1cce6f22249ccc4905e8cef30075beb2f)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit 063a507fce862d14061b0c0464b7a51a0afde066
Author: Josh Rosen <joshro...@databricks.com>
Date: 2016-08-02T19:02:11Z

[SPARK-16787] SparkContext.addFile() should not throw if called twice with the same file

## What changes were proposed in this pull request?

The behavior of `SparkContext.addFile()` changed slightly with the introduction of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default / only file server in Spark 2.0.0.
Prior to 2.0, calling `SparkContext.addFile()` with files that have the same name and identical contents would succeed. This behavior was never explicitly documented, but Spark has behaved this way since very early 1.x versions. In 2.0 (or 1.6 with the Netty file server enabled), the second `addFile()` call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration.

This problem also affects `addJar()` in a more subtle way: the `fileServer.addJar()` call will also fail with an exception, but that exception is logged and ignored; I believe that the problematic exception-catching path was mistakenly copied from some old code which was only relevant to very old versions of Spark and YARN mode.

I believe that this change of behavior was unintentional, so this patch weakens the `require` check so that adding the same filename at the same path will succeed. At file download time, Spark tasks will fail with exceptions if an executor already has a local copy of a file and that file's contents do not match the contents of the file being downloaded / added. As a result, it's important that we prevent files with the same name and different contents from being served, because allowing that can effectively brick an executor by preventing it from successfully launching any new tasks. Before this patch's change, this was prevented by forbidding `addFile()` from being called twice on files with the same name. Because Spark does not defensively copy local files that are passed to `addFile`, it is vulnerable to files' contents changing, so I think it's okay to rely on an implicit assumption that these files are intended to be immutable (since if they _are_ mutable then this can lead to either explicit task failures or implicit incorrectness (in case new executors silently get newer copies of the file while old executors continue to use an older version)).

To guard against this, I have decided to only update the file addition timestamps on the first call to `addFile()`; duplicate calls will succeed but will not update the timestamp. This behavior is fine as long as we assume files are immutable, which seems reasonable given the behaviors described above. As part of this change, I also improved the thread-safety of the `addedJars` and `addedFiles` maps; this is important because these maps may be concurrently read by a task-launching thread and written by a driver thread in case the user's driver code is multi-threaded.

## How was this patch tested?

I added regression tests in `SparkContextSuite`.

Author: Josh Rosen <joshro...@databricks.com>

Closes #14396 from JoshRosen/SPARK-16787.

(cherry picked from commit e9fc0b6a8b4ce62cab56d18581f588c67b811f5b)
Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit d9d3504b91ebf4fc9f06551f1b23fada0f4e1b0e
Author: =^_^= <maxmo...@gmail.com>
Date: 2016-08-03T11:18:28Z

[SPARK-16831][PYTHON] Fixed bug in CrossValidator.avgMetrics

## What changes were proposed in this pull request?

avgMetrics was summed, not averaged, across folds.

Author: =^_^= <maxmo...@gmail.com>

Closes #14456 from pkch/pkch-patch-1.

(cherry picked from commit 639df046a250873c26446a037cb832ab28cb5272)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 969313bb20a6695dee0959cabab7e5265f8de311
Author: Artur Sukhenko <artur.sukhe...@gmail.com>
Date: 2016-08-02T23:13:12Z

[SPARK-16796][WEB UI] Visible passwords on Spark environment page

## What changes were proposed in this pull request?

Mask spark.ssl.keyPassword, spark.ssl.keyStorePassword, and spark.ssl.trustStorePassword on the Web UI environment page. (Changes their values to ***** on the env. page.)

## How was this patch tested?

I've built Spark, run the Spark shell, and checked that these values have been masked with *****. Also ran tests: ./dev/run-tests

[info] ScalaTest
[info] Run completed in 1 hour, 9 minutes, 5 seconds.
[info] Total number of tests run: 2166
[info] Suites: completed 65, aborted 0
[info] Tests: succeeded 2166, failed 0, canceled 0, ignored 590, pending 0
[info] All tests passed.

![mask](https://cloud.githubusercontent.com/assets/15244468/17262154/7641e132-55e2-11e6-8a6c-30ead77c7372.png)

Author: Artur Sukhenko <artur.sukhe...@gmail.com>

Closes #14409 from Devian-ua/maskpass.

(cherry picked from commit 3861273771c2631e88e1f37a498c644ad45ac1c0)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 2daab33c4ed2cdbe1025252ae10bf19320df9d25
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2016-08-03T18:15:09Z

[SPARK-16714][SPARK-16735][SPARK-16646] array, map, greatest, least's type coercion should handle decimal type

## What changes were proposed in this pull request?

Here is a table about the behaviours of `array`/`map` and `greatest`/`least` in Hive, MySQL and Postgres:

| |Hive|MySQL|Postgres|
|---|---|---|---|
|`array`/`map`|can find a wider type with decimal type arguments, and will truncate the wider decimal type if necessary|can find a wider type with decimal type arguments, no truncation problem|can find a wider type with decimal type arguments, no truncation problem|
|`greatest`/`least`|can find a wider type with decimal type arguments, and truncate if necessary, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|

I think these behaviours make sense and Spark SQL should follow them. This PR fixes `array` and `map` by using `findWiderCommonType` to get the wider type. It fixes `greatest` and `least` by adding a `findWiderTypeWithoutStringPromotion`, which provides similar semantics to `findWiderCommonType`, but without string promotion.

## How was this patch tested?
New tests in `TypeCoercionSuite`.

Author: Wenchen Fan <wenc...@databricks.com>
Author: Yin Huai <yh...@databricks.com>

Closes #14439 from cloud-fan/bug.

(cherry picked from commit b55f34370f695de355b72c1518b5f2a45c324af0)
Signed-off-by: Yin Huai <yh...@databricks.com>

commit b44da5b4e2a62023b127fdd8b81c6ae95d2cdbc7
Author: Kevin McHale <ke...@premise.com>
Date: 2016-08-03T20:15:13Z

[SPARK-14204][SQL] register driverClass rather than user-specified class

This is a pull request that was originally merged against branch-1.6 as #12000, now being merged into master as well. srowen zzcclp JoshRosen

This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an IllegalStateException. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204

My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user. This patch was tested manually: I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.

Author: Kevin McHale <ke...@premise.com>

Closes #14420 from mchalek/mchalek-jdbc_driver_registration.

(cherry picked from commit 685b08e2611b69f8db60a00c0c94aecd315e2a3e)
Signed-off-by: Sean Owen <so...@cloudera.com>

----
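The decimal-widening idea behind `findWiderTypeWithoutStringPromotion` in the type-coercion fix above can be sketched briefly. This is an assumed simplification, not Spark's actual implementation: a decimal type is (precision, scale), the wider type takes the larger integer-digit count and the larger scale, and a string operand is rejected rather than promoted.

```python
# Hypothetical sketch of decimal widening without string promotion.
# integer digits = precision - scale; the wider type must hold both.
def wider_decimal(t1, t2):
    (p1, s1), (p2, s2) = t1, t2
    scale = max(s1, s2)
    integral = max(p1 - s1, p2 - s2)
    return (integral + scale, scale)

def wider_without_string_promotion(t1, t2):
    if t1 == "string" or t2 == "string":
        return None  # greatest/least: no string promotion, so a type mismatch
    return wider_decimal(t1, t2)

# decimal(2,1) and decimal(10,2) widen to decimal(10,2)
assert wider_decimal((2, 1), (10, 2)) == (10, 2)
# a string operand yields no common type
assert wider_without_string_promotion("string", (2, 1)) is None
```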