GitHub user avinashkolla opened a pull request: https://github.com/apache/spark/pull/15118
Branch 2.0

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15118.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15118

----

commit f46a074510e47206de9d3b3ac6902af321923ce8
Author: Sylvain Zimmer <sylv...@sylvainzimmer.com>
Date: 2016-07-28T16:51:45Z

[SPARK-16740][SQL] Fix Long overflow in LongToUnsafeRowMap

Avoid overflow of the Long type causing a NegativeArraySizeException a few lines later. Unit tests for HashedRelationSuite still pass. I can confirm the python script I included in https://issues.apache.org/jira/browse/SPARK-16740 works fine with this patch. Unfortunately I don't have the knowledge/time to write a Scala test case for HashedRelationSuite right now. As the patch is pretty obvious I hope it can be included without this. Thanks!

Author: Sylvain Zimmer <sylv...@sylvainzimmer.com>

Closes #14373 from sylvinus/master.

(cherry picked from commit 1178d61ede816bf1c8d5bb3dbb3b965c9b944407)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit fb09a693d6f58d71ec042224b8ea66b972c1adc2
Author: Sameer Agarwal <samee...@cs.berkeley.edu>
Date: 2016-07-28T20:04:19Z

[SPARK-16764][SQL] Recommend disabling vectorized parquet reader on OutOfMemoryError

## What changes were proposed in this pull request?

We currently don't bound or manage the data array size used by column vectors in the vectorized reader (they're just bounded by Integer.MAX_VALUE), which may lead to OOMs while reading data.
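The Long overflow in SPARK-16740 above can be illustrated with a small sketch. This is not Spark code: `to_int64` and the key range are hypothetical stand-ins for the map's min/max keys, simulating Java's 64-bit signed arithmetic (Python ints never overflow on their own).

```python
# Simulate Java Long wraparound to show the SPARK-16740 failure mode.
def to_int64(x):
    x &= (1 << 64) - 1                          # keep low 64 bits
    return x - (1 << 64) if x >= (1 << 63) else x

min_key, max_key = -(1 << 62), 1 << 62
array_size = to_int64(max_key - min_key + 1)    # wraps to a negative Long
assert array_size < 0  # new long[array_size] throws NegativeArraySizeException
```

Checking the range before allocating, as the patch does, avoids the negative size ever reaching the array allocation.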
As a short-term fix, this patch intercepts the OutOfMemoryError exception and suggests that the user disable the vectorized parquet reader.

## How was this patch tested?

Existing tests.

Author: Sameer Agarwal <samee...@cs.berkeley.edu>

Closes #14387 from sameeragarwal/oom.

(cherry picked from commit 3fd39b87bda77f3c3a4622d854f23d4234683571)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit 5cd79c396f98660e12b02c0151a084b4d1599b6b
Author: Nicholas Chammas <nicholas.cham...@gmail.com>
Date: 2016-07-28T21:57:15Z

[SPARK-16772] Correct API doc references to PySpark classes + formatting fixes

## What's Been Changed

The PR corrects several broken or missing class references in the Python API docs. It also corrects formatting problems. For example, you can see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.registerFunction) how Sphinx is not picking up the reference to `DataType`. That's because the reference is relative to the current module, whereas `DataType` is in a different module. You can also see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame) how the formatting for byte, tinyint, and so on is italic instead of monospace. That's because in ReST single backticks just make things italic, unlike in Markdown.

## Testing

I tested this PR by [building the Python docs](https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html) and reviewing the results locally in my browser. I confirmed that the broken or missing class references were resolved, and that the formatting was corrected.

Author: Nicholas Chammas <nicholas.cham...@gmail.com>

Closes #14393 from nchammas/python-docstring-fixes.
(cherry picked from commit 274f3b9ec86e4109c7678eef60f990d41dc3899f)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit ed03d0a690c9a7920a21c858df7f42f9a41f28d7
Author: Wesley Tang <tangming...@mininglamp.com>
Date: 2016-07-29T11:26:05Z

[SPARK-16664][SQL] Fix persist call on Data frames with more than 200…

## What changes were proposed in this pull request?

f12f11e578169b47e3f8b18b299948c0670ba585 introduced this bug: `foreach` was missed where `map` was handled.

## How was this patch tested?

Test added.

Author: Wesley Tang <tangming...@mininglamp.com>

Closes #14324 from breakdawn/master.

(cherry picked from commit d1d5069aa3744d46abd3889abab5f15e9067382a)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit efad4aa1468867b36cffb1e8c91f9731c48eca81
Author: Yanbo Liang <yblia...@gmail.com>
Date: 2016-07-29T11:40:20Z

[SPARK-16750][ML] Fix GaussianMixture training failed due to feature column type mistake

## What changes were proposed in this pull request?

ML `GaussianMixture` training failed due to a feature column type mistake. The feature column type should be `ml.linalg.VectorUDT`, but it got `mllib.linalg.VectorUDT` by mistake. See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug. Why did the unit tests not catch this error? Because some estimators/transformers missed calling `transformSchema(dataset.schema)` first during `fit` or `transform`. I will also add this call to all estimators/transformers that missed it in this PR.

## How was this patch tested?

No new tests; should pass existing ones.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #14378 from yanboliang/spark-16750.

(cherry picked from commit 0557a45452f6e73877e5ec972110825ce8f3fbc5)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 268bf144004952385e4573a11d981b3440f31f5d
Author: Adam Roberts <arobe...@uk.ibm.com>
Date: 2016-07-29T11:43:01Z

[SPARK-16751] Upgrade derby to 10.12.1.1

The version of derby was upgraded based on important security info at VersionEye. Test scope was added so we don't include it in our final package anyway. NB: I think this should be backported to all previous releases as it is a security problem: https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1. The CVE number is CVE-2015-1832. I also suggest we add a SECURITY tag for JIRAs.

Existing tests were run with the change, making sure that we see no new failures. I checked that derby 10.12.x, and not derby 10.11.x, is downloaded to our ~/.m2 folder. I then used dev/make-distribution.sh and checked the dist/jars folder for Spark 2.0: no derby jar is present. I don't know if this would also remove it from the assembly jar in our 1.x branches.

Author: Adam Roberts <arobe...@uk.ibm.com>

Closes #14379 from a-roberts/patch-4.

(cherry picked from commit 04a2c072d94874f3f7ae9dd94c026e8826a75ccd)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit a32531a72cc2b6a0ff95a7a73b256ffc5d9eff60
Author: Sun Dapeng <s...@apache.org>
Date: 2016-07-29T13:01:23Z

[SPARK-16761][DOC][ML] Fix doc link in docs/ml-guide.md

## What changes were proposed in this pull request?

Fix the link at http://spark.apache.org/docs/latest/ml-guide.html.

## How was this patch tested?

None.

Author: Sun Dapeng <s...@apache.org>

Closes #14386 from sundapeng/doclink.

(cherry picked from commit 2c15323ad026da64caa68787c5d103a8595f63a0)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 7d87fc9649b141a1888b89363a8e311690d0fb56
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date: 2016-07-30T02:59:35Z

[SPARK-16748][SQL] SparkExceptions during planning should not be wrapped in TreeNodeException

## What changes were proposed in this pull request?
We do not want SparkExceptions from job failures in the planning phase to create a TreeNodeException; hence, do not wrap SparkException in TreeNodeException.

## How was this patch tested?

New unit test.

Author: Tathagata Das <tathagata.das1...@gmail.com>

Closes #14395 from tdas/SPARK-16748.

(cherry picked from commit bbc247548ac6faeca15afc05c266cee37ef13416)
Signed-off-by: Yin Huai <yh...@databricks.com>

commit 26da5a7fc37ac961e7b4d8f423e8e58aefb5c2bc
Author: Bryan Cutler <cutl...@gmail.com>
Date: 2016-07-30T15:08:33Z

[SPARK-16800][EXAMPLES][ML] Fix Java examples that fail to run due to exception

## What changes were proposed in this pull request?

Some Java examples are using mllib.linalg.Vectors instead of ml.linalg.Vectors, which causes an exception when run. There are also some Java examples that incorrectly specify data types in the schema, also causing an exception.

## How was this patch tested?

Ran the corrected examples locally.

Author: Bryan Cutler <cutl...@gmail.com>

Closes #14405 from BryanCutler/java-examples-ml.Vectors-fix-SPARK-16800.

(cherry picked from commit a6290e51e402e8434d6207d553db1f551e714fde)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 75dd78130d29154a3147490c57bce6883c992469
Author: Reynold Xin <r...@databricks.com>
Date: 2016-07-31T06:05:03Z

[SPARK-16812] Open up SparkILoop.getAddedJars

## What changes were proposed in this pull request?

This patch makes SparkILoop.getAddedJars a public developer API. It is a useful function for getting the list of jars added.

## How was this patch tested?

N/A - this is a simple visibility change.

Author: Reynold Xin <r...@databricks.com>

Closes #14417 from rxin/SPARK-16812.

(cherry picked from commit 7c27d075c39ebaf3e762284e2536fe7be0e3da87)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit d357ca3023c84e472927380bed65b1cee33c4e03
Author: Reynold Xin <r...@databricks.com>
Date: 2016-07-31T08:31:06Z

[SPARK-16813][SQL] Remove private[sql] and private[spark] from catalyst package

The catalyst package is meant to be internal, and as a result it does not make sense to mark things as private[sql] or private[spark]. It simply makes debugging harder when Spark developers need to inspect the plans at runtime. This patch removes all private[sql] and private[spark] visibility modifiers in org.apache.spark.sql.catalyst.

N/A - just visibility changes.

Author: Reynold Xin <r...@databricks.com>

Closes #14418 from rxin/SPARK-16813.

(cherry picked from commit 064d91ff7342002414d3274694a8e2e37f154986)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit c651ff53adefd0c74c84500e929ed37f8ad668d2
Author: Reynold Xin <r...@databricks.com>
Date: 2016-08-01T01:21:06Z

[SPARK-16805][SQL] Log timezone when query result does not match

## What changes were proposed in this pull request?

It is useful to log the timezone when a query result does not match, especially on build machines that have a different timezone from AMPLab Jenkins.

## How was this patch tested?

This is a test-only change.

Author: Reynold Xin <r...@databricks.com>

Closes #14413 from rxin/SPARK-16805.

(cherry picked from commit 579fbcf3bd9717003025caecc0c0b85bcff7ac7f)
Signed-off-by: Yin Huai <yh...@databricks.com>

commit 4bdc558989ef4a9490ca42e7330c10136151134b
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-08-01T13:55:31Z

[SPARK-16778][SQL][TRIVIAL] Fix deprecation warning with SQLContext

## What changes were proposed in this pull request?

Change to the non-deprecated constructor for SQLContext.

## How was this patch tested?

Existing tests.

Author: Holden Karau <hol...@us.ibm.com>

Closes #14406 from holdenk/SPARK-16778-fix-use-of-deprecated-SQLContext-constructor.
(cherry picked from commit 1e9b59b73bdb8aacf5a85e0eed29efc6485a3bc3)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit b49091e10100dfbefeabdd2dfe0b64cdf613a052
Author: hyukjinkwon <gurwls...@gmail.com>
Date: 2016-08-01T13:56:52Z

[SPARK-16776][STREAMING] Replace deprecated API in KafkaTestUtils for 0.10.0.

## What changes were proposed in this pull request?

This PR replaces the old Kafka APIs with 0.10.0 ones in `KafkaTestUtils`. The changes include:

- `Producer` to `KafkaProducer`
- Change configurations to equivalent ones. (I referred [here](http://kafka.apache.org/documentation.html#producerconfigs) for 0.10.0 and [here](http://kafka.apache.org/082/documentation.html#producerconfigs) for the old 0.8.2.)

This PR will remove the build warnings below:

```scala
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:71: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]   private var producer: Producer[String, String] = _
[WARNING]                         ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:181: class Producer in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.KafkaProducer instead.
[WARNING]     producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]                    ^
[WARNING] .../spark/streaming/kafka010/KafkaTestUtils.scala:181: class ProducerConfig in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerConfig instead.
[WARNING]     producer = new Producer[String, String](new ProducerConfig(producerConfiguration))
[WARNING]                                                 ^
[WARNING] .../spark/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaTestUtils.scala:182: class KeyedMessage in package producer is deprecated: This class has been deprecated and will be removed in a future release. Please use org.apache.kafka.clients.producer.ProducerRecord instead.
[WARNING]     producer.send(messages.map { new KeyedMessage[String, String](topic, _ ) }: _*)
[WARNING]                                      ^
[WARNING] four warnings found
[WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
[WARNING] 1 warning
```

## How was this patch tested?

Existing tests that use `KafkaTestUtils` should cover this.

Author: hyukjinkwon <gurwls...@gmail.com>

Closes #14416 from HyukjinKwon/SPARK-16776.

(cherry picked from commit f93ad4fe7c9728c8dd67a8095de3d39fad21d03f)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 1523bf69a0ef87f36b0b3995ce2d7a33aaff6046
Author: eyal farago <eyal farago>
Date: 2016-08-01T14:43:32Z

[SPARK-16791][SQL] cast struct with timestamp field fails

## What changes were proposed in this pull request?

A failing test case + fix for SPARK-16791 (https://issues.apache.org/jira/browse/SPARK-16791).

## How was this patch tested?

Added a failing test case to CastSuite, then fixed the Cast code and reran the entire CastSuite.

Author: eyal farago <eyal farago>
Author: Eyal Farago <eyal.far...@actimize.com>

Closes #14400 from eyalfa/SPARK-16791_cast_struct_with_timestamp_field_fails.

(cherry picked from commit 338a98d65c8efe0c41f39a8dddeab7040dcda125)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 4e73cb8ebdb0dcb1be4dce562bac9214e9905b8e
Author: Holden Karau <hol...@us.ibm.com>
Date: 2016-08-01T20:57:05Z

[SPARK-16774][SQL] Fix use of deprecated timestamp constructor & improve timezone handling

## What changes were proposed in this pull request?
Removes the deprecated timestamp constructor and incidentally fixes a use which was using the system timezone rather than the one specified when working near DST. This change also causes the roundtrip tests to fail since it now actually uses all the timezones near DST boundaries, where it didn't before. Note: this is only a partial solution; longer term we should follow up with https://issues.apache.org/jira/browse/SPARK-16788 to avoid this problem & simplify our timezone handling code.

## How was this patch tested?

New tests for two timezones were added, so even if the user's timezone happens to coincide with one, the other tests should still fail. Important note: this (temporarily) disables the round-trip tests until we can fix the issue more thoroughly.

Author: Holden Karau <hol...@us.ibm.com>

Closes #14398 from holdenk/SPARK-16774-fix-use-of-deprecated-timestamp-constructor.

(cherry picked from commit ab1e761f9691b41385e2ed2202c5a671c63c963d)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 1813bbd9bf7cb9afd29e1385f0dc52e8fcc4f132
Author: Shixiong Zhu <shixi...@databricks.com>
Date: 2016-08-01T21:41:22Z

[SPARK-15869][STREAMING] Fix a potential NPE in StreamingJobProgressListener.getBatchUIData

## What changes were proposed in this pull request?

Moved `asScala` into a `map` to avoid an NPE.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixi...@databricks.com>

Closes #14443 from zsxwing/SPARK-15869.

(cherry picked from commit 03d46aafe561b03e25f4e25cf01e631c18dd827c)
Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 5fbf5f93ee5aa4d1aca0fa0c8fb769a085dd7b93
Author: Eric Liang <e...@databricks.com>
Date: 2016-08-02T02:46:20Z

[SPARK-16818] Exchange reuse incorrectly reuses scans over different sets of partitions

https://github.com/apache/spark/pull/14425 rebased for branch-2.0

Author: Eric Liang <e...@databricks.com>

Closes #14427 from ericl/spark-16818-br-2.
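The SPARK-15869 fix above ("moved `asScala` into a `map`") is an instance of a general null-guard pattern: convert a container only when the lookup actually returned something. A minimal Python sketch, with hypothetical names standing in for the listener's internals:

```python
# Null-guard sketch: the real fix wraps a nullable Java lookup result before
# converting it, so a missing batch no longer triggers a NullPointerException.
def get_batch_ui_data(batch_time_to_output_ops, batch_time):
    raw = batch_time_to_output_ops.get(batch_time)  # may be None (Java: null)
    return None if raw is None else list(raw)       # convert only when present

assert get_batch_ui_data({1000: (1, 2)}, 1000) == [1, 2]
assert get_batch_ui_data({}, 1000) is None          # absent batch, no crash
```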
commit 9d9956e8f8abd41a603fde2347384428b7ec715c
Author: Cheng Lian <l...@databricks.com>
Date: 2016-08-02T07:02:40Z

[SPARK-16734][EXAMPLES][SQL] Revise examples of all language bindings

## What changes were proposed in this pull request?

This PR makes various minor updates to the examples of all language bindings to make sure they are consistent with each other. Some typos and missing parts (the JDBC example in Scala/Java/Python) are also fixed.

## How was this patch tested?

Manually tested.

Author: Cheng Lian <l...@databricks.com>

Closes #14368 from liancheng/revise-examples.

(cherry picked from commit 10e1c0e638774f5d746771b6dd251de2480f94eb)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit c5516ab60da860320693bbc245818cb6d8a282c8
Author: Xusen Yin <yinxu...@gmail.com>
Date: 2016-08-02T14:28:46Z

[SPARK-16558][EXAMPLES][MLLIB] examples/mllib/LDAExample should use MLVector instead of MLlib Vector

## What changes were proposed in this pull request?

mllib.LDAExample uses the ML pipeline and the MLlib LDA algorithm. The former transforms the original data into MLVector format, while the latter uses the MLlib Vector format.

## How was this patch tested?

Tested manually.

Author: Xusen Yin <yinxu...@gmail.com>

Closes #14212 from yinxusen/SPARK-16558.

(cherry picked from commit dd8514fa2059a695143073f852b1abee50e522bd)
Signed-off-by: Yanbo Liang <yblia...@gmail.com>

commit fc18e259a311c0f1dffe47edef0e42182afca8e9
Author: Maciej Brynski <maciej.bryn...@adpilot.pl>
Date: 2016-08-02T15:07:08Z

[SPARK-15541] Casting ConcurrentHashMap to ConcurrentMap (master branch)

## What changes were proposed in this pull request?

Casting ConcurrentHashMap to ConcurrentMap allows code compiled with Java 8 to run on Java 7.

## How was this patch tested?

Compilation. Existing automated tests.

Author: Maciej Brynski <maciej.bryn...@adpilot.pl>

Closes #14459 from maver1ck/spark-15541-master.
(cherry picked from commit 511dede1118f20a7756f614acb6fc88af52c9de9)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 22f0899bc78e1f2021084c6397a4c05ad6317bae
Author: Tom Magrino <tmagr...@fb.com>
Date: 2016-08-02T16:16:44Z

[SPARK-16837][SQL] TimeWindow incorrectly drops slideDuration in constructors

## What changes were proposed in this pull request?

Fix of incorrect arguments (dropping slideDuration and using windowDuration instead) in constructors for TimeWindow. The JIRA this addresses is here: https://issues.apache.org/jira/browse/SPARK-16837

## How was this patch tested?

Added a test to TimeWindowSuite to check that the results of the TimeWindow object's apply and the TimeWindow class constructor are equivalent.

Author: Tom Magrino <tmagr...@fb.com>

Closes #14441 from tmagrino/windowing-fix.

(cherry picked from commit 1dab63d8d3c59a3d6b4ee8e777810c44849e58b8)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit ef7927e8e77558f9a18eacc8491b0c28231e2769
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date: 2016-08-02T17:08:18Z

[SPARK-16062] [SPARK-15989] [SQL] Fix two bugs of Python-only UDTs

## What changes were proposed in this pull request?

There are two related bugs with Python-only UDTs. Because the test case for the second one needs the first fix too, I put them into one PR. If that is not appropriate, please let me know.

### First bug: when MapObjects works on Python-only UDTs

`RowEncoder` will use `PythonUserDefinedType.sqlType` for its deserializer expression. If the sql type is `ArrayType`, we will have `MapObjects` working on it. But `MapObjects` doesn't consider `PythonUserDefinedType` as its input data type.
It causes an error like:

```python
import pyspark.sql.group
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql.types import *

schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema)
df.show()
```

```
File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
: java.lang.RuntimeException: Error while decoding: scala.MatchError: org.apache.spark.sql.types.PythonUserDefinedTypef4ceede8 (of class org.apache.spark.sql.types.PythonUserDefinedType)
...
```

### Second bug: when a Python-only UDT is the element type of ArrayType

```python
import pyspark.sql.group
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql.types import *

schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
df = spark.createDataFrame([(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)], schema=schema)
df.show()
```

## How was this patch tested?

PySpark's sql tests.

Author: Liang-Chi Hsieh <sim...@tw.ibm.com>

Closes #13778 from viirya/fix-pyudt.

(cherry picked from commit 146001a9ffefc7aaedd3d888d68c7a9b80bca545)
Signed-off-by: Davies Liu <davies....@gmail.com>

commit a937c9ee44e0766194fc8ca4bce2338453112a53
Author: Herman van Hovell <hvanhov...@databricks.com>
Date: 2016-08-02T17:09:47Z

[SPARK-16836][SQL] Add support for CURRENT_DATE/CURRENT_TIMESTAMP literals

## What changes were proposed in this pull request?

In Spark 1.6 (with Hive support) we could use the `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions as literals (without adding parentheses), for example:

```SQL
select
/* Spark 1.6: */ current_date,
/* Spark 1.6 & Spark 2.0: */ current_date()
```

This was accidentally dropped in Spark 2.0. This PR reinstates this functionality.

## How was this patch tested?

Added a case to ExpressionParserSuite.
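The idea behind SPARK-16836 above is that certain keywords parse as zero-argument function calls even when written without parentheses. A hypothetical mini-resolver (not Spark's parser) sketches this:

```python
# Hypothetical sketch: bare keyword literals resolve to zero-arg functions.
import datetime

KEYWORD_FUNCS = {
    "current_date": datetime.date.today,
    "current_timestamp": datetime.datetime.now,
}

def resolve(expr):
    name = expr.lower().rstrip("()")   # accept both bare and call forms
    if name not in KEYWORD_FUNCS:
        raise ValueError(f"unknown expression: {expr}")
    return KEYWORD_FUNCS[name]()

# the bare keyword and the explicit call resolve to the same value
assert resolve("CURRENT_DATE") == resolve("current_date()")
```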
Author: Herman van Hovell <hvanhov...@databricks.com>

Closes #14442 from hvanhovell/SPARK-16836.

(cherry picked from commit 2330f3ecbbd89c7eaab9cc0d06726aa743b16334)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit f190bb83beaafb65c8e6290e9ecaa61ac51e04bb
Author: petermaxlee <petermax...@gmail.com>
Date: 2016-08-02T11:32:35Z

[SPARK-16850][SQL] Improve type checking error message for greatest/least

The greatest/least functions do not have the most friendly error message for data types. This patch improves the error message to not show the Seq type, and to use more human-readable data types.

Before:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST (ArrayBuffer(DecimalType(2,1), StringType)).; line 1 pos 7
```

After:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(CAST(1.0 AS DECIMAL(2,1)), "1.0")' due to data type mismatch: The expressions should all have the same type, got GREATEST(decimal(2,1), string).; line 1 pos 7
```

Manually verified the output and also added unit tests to ConditionalExpressionSuite.

Author: petermaxlee <petermax...@gmail.com>

Closes #14453 from petermaxlee/SPARK-16850.

(cherry picked from commit a1ff72e1cce6f22249ccc4905e8cef30075beb2f)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit 063a507fce862d14061b0c0464b7a51a0afde066
Author: Josh Rosen <joshro...@databricks.com>
Date: 2016-08-02T19:02:11Z

[SPARK-16787] SparkContext.addFile() should not throw if called twice with the same file

## What changes were proposed in this pull request?

The behavior of `SparkContext.addFile()` changed slightly with the introduction of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default / only file server in Spark 2.0.0.
Prior to 2.0, calling `SparkContext.addFile()` with files that have the same name and identical contents would succeed. This behavior was never explicitly documented, but Spark has behaved this way since very early 1.x versions. In 2.0 (or 1.6 with the Netty file server enabled), the second `addFile()` call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration.

This problem also affects `addJar()` in a more subtle way: the `fileServer.addJar()` call will also fail with an exception, but that exception is logged and ignored; I believe that the problematic exception-catching path was mistakenly copied from some old code which was only relevant to very old versions of Spark and YARN mode.

I believe that this change of behavior was unintentional, so this patch weakens the `require` check so that adding the same filename at the same path will succeed. At file download time, Spark tasks will fail with exceptions if an executor already has a local copy of a file and that file's contents do not match the contents of the file being downloaded / added. As a result, it's important that we prevent files with the same name and different contents from being served, because allowing that can effectively brick an executor by preventing it from successfully launching any new tasks. Before this patch's change, this was prevented by forbidding `addFile()` from being called twice on files with the same name. Because Spark does not defensively copy local files that are passed to `addFile`, it is vulnerable to files' contents changing, so I think it's okay to rely on an implicit assumption that these files are intended to be immutable (since if they _are_ mutable then this can lead to either explicit task failures or implicit incorrectness (in case new executors silently get newer copies of the file while old executors continue to use an older version)).

To guard against this, I have decided to only update the file addition timestamps on the first call to `addFile()`; duplicate calls will succeed but will not update the timestamp. This behavior is fine as long as we assume files are immutable, which seems reasonable given the behaviors described above. As part of this change, I also improved the thread-safety of the `addedJars` and `addedFiles` maps; this is important because these maps may be concurrently read by a task-launching thread and written by a driver thread in case the user's driver code is multi-threaded.

## How was this patch tested?

I added regression tests in `SparkContextSuite`.

Author: Josh Rosen <joshro...@databricks.com>

Closes #14396 from JoshRosen/SPARK-16787.

(cherry picked from commit e9fc0b6a8b4ce62cab56d18581f588c67b811f5b)
Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit d9d3504b91ebf4fc9f06551f1b23fada0f4e1b0e
Author: =^_^= <maxmo...@gmail.com>
Date: 2016-08-03T11:18:28Z

[SPARK-16831][PYTHON] Fixed bug in CrossValidator.avgMetrics

## What changes were proposed in this pull request?

avgMetrics was summed, not averaged, across folds.

Author: =^_^= <maxmo...@gmail.com>

Closes #14456 from pkch/pkch-patch-1.

(cherry picked from commit 639df046a250873c26446a037cb832ab28cb5272)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 969313bb20a6695dee0959cabab7e5265f8de311
Author: Artur Sukhenko <artur.sukhe...@gmail.com>
Date: 2016-08-02T23:13:12Z

[SPARK-16796][WEB UI] Visible passwords on Spark environment page

## What changes were proposed in this pull request?

Mask spark.ssl.keyPassword, spark.ssl.keyStorePassword, and spark.ssl.trustStorePassword on the Web UI environment page. (Changes their values to ***** on the env. page.)

## How was this patch tested?

I've built Spark, run the Spark shell, and checked that these values have been masked with *****. Also ran tests: ./dev/run-tests

[info] ScalaTest
[info] Run completed in 1 hour, 9 minutes, 5 seconds.
[info] Total number of tests run: 2166
[info] Suites: completed 65, aborted 0
[info] Tests: succeeded 2166, failed 0, canceled 0, ignored 590, pending 0
[info] All tests passed.

![mask](https://cloud.githubusercontent.com/assets/15244468/17262154/7641e132-55e2-11e6-8a6c-30ead77c7372.png)

Author: Artur Sukhenko <artur.sukhe...@gmail.com>

Closes #14409 from Devian-ua/maskpass.

(cherry picked from commit 3861273771c2631e88e1f37a498c644ad45ac1c0)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 2daab33c4ed2cdbe1025252ae10bf19320df9d25
Author: Wenchen Fan <wenc...@databricks.com>
Date: 2016-08-03T18:15:09Z

[SPARK-16714][SPARK-16735][SPARK-16646] array, map, greatest, least's type coercion should handle decimal type

## What changes were proposed in this pull request?

Here is a table about the behaviours of `array`/`map` and `greatest`/`least` in Hive, MySQL and Postgres:

| |Hive|MySQL|Postgres|
|---|---|---|---|
|`array`/`map`|can find a wider type with decimal type arguments, and will truncate the wider decimal type if necessary|can find a wider type with decimal type arguments, no truncation problem|can find a wider type with decimal type arguments, no truncation problem|
|`greatest`/`least`|can find a wider type with decimal type arguments, and truncate if necessary, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|

I think these behaviours make sense and Spark SQL should follow them. This PR fixes `array` and `map` by using `findWiderCommonType` to get the wider type. It fixes `greatest` and `least` by adding a `findWiderTypeWithoutStringPromotion`, which provides similar semantics to `findWiderCommonType`, but without string promotion.

## How was this patch tested?
New tests in `TypeCoercionSuite`.

Author: Wenchen Fan <wenc...@databricks.com>
Author: Yin Huai <yh...@databricks.com>

Closes #14439 from cloud-fan/bug.

(cherry picked from commit b55f34370f695de355b72c1518b5f2a45c324af0)
Signed-off-by: Yin Huai <yh...@databricks.com>

commit b44da5b4e2a62023b127fdd8b81c6ae95d2cdbc7
Author: Kevin McHale <ke...@premise.com>
Date: 2016-08-03T20:15:13Z

[SPARK-14204][SQL] register driverClass rather than user-specified class

This is a pull request that was originally merged against branch-1.6 as #12000, now being merged into master as well. srowen zzcclp JoshRosen

This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an IllegalStateException. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204

My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user. This patch was tested manually: I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.

Author: Kevin McHale <ke...@premise.com>

Closes #14420 from mchalek/mchalek-jdbc_driver_registration.

(cherry picked from commit 685b08e2611b69f8db60a00c0c94aecd315e2a3e)
Signed-off-by: Sean Owen <so...@cloudera.com>

----
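The decimal-widening idea behind `findWiderTypeWithoutStringPromotion` in the type-coercion fix above can be sketched briefly. This is an assumed simplification, not Spark's actual implementation: a decimal type is (precision, scale), the wider type takes the larger integer-digit count and the larger scale, and a string operand is rejected rather than promoted.

```python
# Hypothetical sketch of decimal widening without string promotion.
# integer digits = precision - scale; the wider type must hold both.
def wider_decimal(t1, t2):
    (p1, s1), (p2, s2) = t1, t2
    scale = max(s1, s2)
    integral = max(p1 - s1, p2 - s2)
    return (integral + scale, scale)

def wider_without_string_promotion(t1, t2):
    if t1 == "string" or t2 == "string":
        return None  # greatest/least: no string promotion, so a type mismatch
    return wider_decimal(t1, t2)

# decimal(2,1) and decimal(10,2) widen to decimal(10,2)
assert wider_decimal((2, 1), (10, 2)) == (10, 2)
# a string operand yields no common type
assert wider_without_string_promotion("string", (2, 1)) is None
```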