GitHub user tangchun opened a pull request:

    https://github.com/apache/spark/pull/17732

    Branch 2.0

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17732.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17732
    
----
commit 0cdd7370a61618d042417ee387a3c32ee5c924e6
Author: Bjarne Fruergaard <bwahlgr...@gmail.com>
Date:   2016-09-29T22:39:57Z

    [SPARK-17721][MLLIB][ML] Fix for multiplying transposed SparseMatrix with 
SparseVector
    
    ## What changes were proposed in this pull request?
    
    * changes the implementation of gemv with transposed SparseMatrix and 
SparseVector both in mllib-local and mllib (identical)
    * adds a test that was failing before this change, but succeeds with these 
changes.
    
    The problem in the previous implementation was that it only increments `i`, the pointer enumerating the non-zero columns of a row in the SparseMatrix, when the row index of the vector matches the column index of the SparseMatrix. When a row of the SparseMatrix has non-zero values at column indices lower than the corresponding non-zero row indices of the SparseVector, the non-zero entries of the SparseVector are exhausted without ever matching the column index at position `i`, and the remaining column indices i+1,...,indEnd-1 are never visited. The test cases in this PR illustrate this issue.
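    Below is a minimal, illustrative sketch of the corrected two-pointer traversal (not the actual Spark implementation; all names are made up). It dot-products the non-zeros of one matrix column, i.e. one row of the transposed matrix, against a SparseVector, advancing whichever pointer currently points at the smaller index.
    ```scala
    // Illustrative sketch only: merge two sorted index arrays, accumulating products
    // where the indices coincide. The old code failed to advance `i` when the matrix
    // index was smaller than the vector index.
    def sparseDot(
        matIndices: Array[Int], matValues: Array[Double],
        vecIndices: Array[Int], vecValues: Array[Double]): Double = {
      var i = 0      // pointer into the matrix column's non-zeros
      var k = 0      // pointer into the vector's non-zeros
      var sum = 0.0
      while (i < matIndices.length && k < vecIndices.length) {
        if (matIndices(i) == vecIndices(k)) {
          sum += matValues(i) * vecValues(k); i += 1; k += 1
        } else if (matIndices(i) < vecIndices(k)) {
          i += 1   // the step the previous implementation skipped
        } else {
          k += 1
        }
      }
      sum
    }
    ```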
    
    ## How was this patch tested?
    
    I have run the specific `gemv` tests in both mllib-local and mllib. I am 
currently still running `./dev/run-tests`.
    
    ## ___
    As per instructions, I hereby state that this is my original work and that 
I license the work to the project (Apache Spark) under the project's open 
source license.
    
    Mentioning dbtsai, viirya and brkyvz whom I can see have worked/authored on 
these parts before.
    
    Author: Bjarne Fruergaard <bwahlgr...@gmail.com>
    
    Closes #15296 from bwahlgreen/bugfix-spark-17721.
    
    (cherry picked from commit 29396e7d1483d027960b9a1bed47008775c4253e)
    Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit a99ea4c9e0e2f91e4b524987788f0acee88e564d
Author: Bryan Cutler <cutl...@gmail.com>
Date:   2016-09-29T23:31:30Z

    Updated the following PR with minor changes to allow cherry-pick to 
branch-2.0
    
    [SPARK-17697][ML] Fixed bug in summary calculations that pattern match 
against label without casting
    
    When calling LogisticRegression.evaluate or GeneralizedLinearRegression.evaluate on a Dataset whose label column is not of double type, the summary calculations pattern match against a double and throw a MatchError. This fix casts the label column to DoubleType to ensure there is no MatchError.
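    A minimal sketch of the approach (illustrative, not the exact patch): cast the label column up front so every downstream pattern match sees a double.
    ```scala
    // Hedged sketch: the helper name is illustrative.
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DoubleType

    def withDoubleLabel(dataset: DataFrame, labelCol: String): DataFrame =
      dataset.withColumn(labelCol, col(labelCol).cast(DoubleType))
    ```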
    
    Added unit tests that call evaluate with datasets whose label column has other numeric types.
    
    Author: Bryan Cutler <cutl...@gmail.com>
    
    Closes #15288 from BryanCutler/binaryLOR-numericCheck-SPARK-17697.
    
    (cherry picked from commit 2f739567080d804a942cfcca0e22f91ab7cbea36)
    Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit 744aac8e6ff04d7a3f1e8ccad335605ac8fe2f29
Author: Dongjoon Hyun <dongj...@apache.org>
Date:   2016-10-01T05:05:59Z

    [MINOR][DOC] Add an up-to-date description for default serialization during 
shuffling
    
    ## What changes were proposed in this pull request?
    
    This PR aims to make the doc up-to-date. The documentation is generally correct, but since https://issues.apache.org/jira/browse/SPARK-13926, Spark chooses Kryo as the default serialization library when shuffling simple types, arrays of simple types, or strings.
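    For reference only, a hedged sketch of how a user can still select Kryo explicitly via the documented configuration key (this snippet is an illustration, not part of the doc change):
    ```scala
    import org.apache.spark.SparkConf

    // Explicitly select Kryo for all serialization; the automatic choice described
    // above applies only to shuffles of simple types, their arrays, and strings.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    ```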
    
    ## How was this patch tested?
    
    This is a documentation update.
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #15315 from dongjoon-hyun/SPARK-DOC-SERIALIZER.
    
    (cherry picked from commit 15e9bbb49e00b3982c428d39776725d0dea2cdfa)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit b57e2acb134d94dafc81686da875c5dd3ea35c74
Author: Jagadeesan <a...@us.ibm.com>
Date:   2016-10-03T09:46:38Z

    [SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,…
    
    ## What changes were proposed in this pull request?
    
    To build R docs (which are built when R tests are run), users need to 
install pandoc and rmarkdown. This was done for Jenkins in 
~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~
    
    … pandoc]
    
    Author: Jagadeesan <a...@us.ibm.com>
    
    Closes #15309 from jagadeesanas2/SPARK-17736.
    
    (cherry picked from commit a27033c0bbaae8f31db9b91693947ed71738ed11)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 613863b116b6cbc9ac83845c68a2d11b3b02f7cb
Author: zero323 <zero...@users.noreply.github.com>
Date:   2016-10-04T00:57:54Z

    [SPARK-17587][PYTHON][MLLIB] SparseVector __getitem__ should follow 
__getitem__ contract
    
    ## What changes were proposed in this pull request?
    
    Replaces `ValueError` with `IndexError` when the index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior.
    
    Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparseMatrix` in `ml` / `mllib`.
    
    ## How was this patch tested?
    
    PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the 
problem has been resolved.
    
    Author: zero323 <zero...@users.noreply.github.com>
    
    Closes #15144 from zero323/SPARK-17587.
    
    (cherry picked from commit d8399b600cef706c22d381b01fab19c610db439a)
    Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit 5843932021cc8bbe0277943c6c480cfeae1b29e2
Author: Herman van Hovell <hvanhov...@databricks.com>
Date:   2016-10-04T02:32:59Z

    [SPARK-17753][SQL] Allow a complex expression as the input a value based 
case statement
    
    ## What changes were proposed in this pull request?
    We currently only allow relatively simple expressions as the input for a 
value based case statement. Expressions like `case (a > 1) or (b = 2) when true 
then 1 when false then 0 end` currently fail. This PR adds support for such 
expressions.
    
    ## How was this patch tested?
    Added a test to the ExpressionParserSuite.
    
    Author: Herman van Hovell <hvanhov...@databricks.com>
    
    Closes #15322 from hvanhovell/SPARK-17753.
    
    (cherry picked from commit 2bbecdec2023143fd144e4242ff70822e0823986)
    Signed-off-by: Herman van Hovell <hvanhov...@databricks.com>

commit 7429199e5b34d5594e3fcedb57eda789d16e26f3
Author: Dongjoon Hyun <dongj...@apache.org>
Date:   2016-10-04T04:28:16Z

    [SPARK-17112][SQL] "select null" via JDBC triggers IllegalArgumentException 
in Thriftserver
    
    ## What changes were proposed in this pull request?
    
    Currently, Spark Thrift Server raises `IllegalArgumentException` for 
queries whose column types are `NullType`, e.g., `SELECT null` or `SELECT 
if(true,null,null)`. This PR fixes that by returning `void` like Hive 1.2.
    
    **Before**
    ```sql
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
    Connecting to jdbc:hive2://localhost:10000
    Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 1.2.1.spark2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Error: java.lang.IllegalArgumentException: Unrecognized type name: null 
(state=,code=0)
    Closing: 0: jdbc:hive2://localhost:10000
    
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
    Connecting to jdbc:hive2://localhost:10000
    Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 1.2.1.spark2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Error: java.lang.IllegalArgumentException: Unrecognized type name: null 
(state=,code=0)
    Closing: 0: jdbc:hive2://localhost:10000
    ```
    
    **After**
    ```sql
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
    Connecting to jdbc:hive2://localhost:10000
    Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 1.2.1.spark2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    +-------+--+
    | NULL  |
    +-------+--+
    | NULL  |
    +-------+--+
    1 row selected (3.242 seconds)
    Beeline version 1.2.1.spark2 by Apache Hive
    Closing: 0: jdbc:hive2://localhost:10000
    
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
    Connecting to jdbc:hive2://localhost:10000
    Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
    Driver: Hive JDBC (version 1.2.1.spark2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    +-------------------------+--+
    | (IF(true, NULL, NULL))  |
    +-------------------------+--+
    | NULL                    |
    +-------------------------+--+
    1 row selected (0.201 seconds)
    Beeline version 1.2.1.spark2 by Apache Hive
    Closing: 0: jdbc:hive2://localhost:10000
    ```
    
    ## How was this patch tested?
    
    * Pass the Jenkins test with a new testsuite.
    * Also, Manually, after starting Spark Thrift Server, run the following 
command.
    ```sql
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
    $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
    ```
    
    **Hive 1.2**
    ```sql
    hive> create table null_table as select null;
    hive> desc null_table;
    OK
    _c0                     void
    ```
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #15325 from dongjoon-hyun/SPARK-17112.
    
    (cherry picked from commit c571cfb2d0e1e224107fc3f0c672730cae9804cb)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit 3dbe8097facb854195729da7bd577f6c14eb2b2a
Author: ding <ding@localhost.localdomain>
Date:   2016-10-04T07:00:10Z

    [SPARK-17559][MLLIB] persist edges if their storage level is none in PeriodicGraphCheckpointer
    
    ## What changes were proposed in this pull request?
    When PeriodicGraphCheckpointer is used to persist a graph, the edges are sometimes not persisted, because the graph is currently persisted only when the vertices' storage level is none. There is, however, a chance that the vertices' storage level is not none while the edges' storage level is. For example, in a graph created by an outerJoinVertices operation the vertices are automatically cached while the edges are not, so the edges will never be persisted by PeriodicGraphCheckpointer. We need to check the edges' storage level separately and persist the edges when it is none, as sketched below.
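    A minimal sketch of the check described above (illustrative, not the checkpointer's actual code): persist vertices and edges independently, each only when its own storage level is still `NONE`.
    ```scala
    import org.apache.spark.graphx.Graph
    import org.apache.spark.storage.StorageLevel

    // Hedged sketch: persist each part of the graph only if it is not already persisted.
    def persistGraph[VD, ED](graph: Graph[VD, ED]): Unit = {
      if (graph.vertices.getStorageLevel == StorageLevel.NONE) graph.vertices.persist()
      if (graph.edges.getStorageLevel == StorageLevel.NONE) graph.edges.persist()
    }
    ```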
    
    ## How was this patch tested?
     manual tests
    
    Author: ding <ding@localhost.localdomain>
    
    Closes #15124 from dding3/spark-persisitEdge.
    
    (cherry picked from commit 126baa8d32bc0e7bf8b43f9efa84f2728f02347d)
    Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

commit 50f6be7598547fed5190a920fd3cebb4bc908524
Author: Felix Cheung <felixcheun...@hotmail.com>
Date:   2016-10-04T16:22:26Z

    [SPARKR][DOC] minor formatting and output cleanup for R vignettes
    
    Clean up output, format table, truncate long example output, hide warnings
    
    (new - Left; existing - Right)
    
![image](https://cloud.githubusercontent.com/assets/8969467/19064018/5dcde4d0-89bc-11e6-857b-052df3f52a4e.png)
    
    
![image](https://cloud.githubusercontent.com/assets/8969467/19064034/6db09956-89bc-11e6-8e43-232d5c3fe5e6.png)
    
    
![image](https://cloud.githubusercontent.com/assets/8969467/19064058/88f09590-89bc-11e6-9993-61639e29dfdd.png)
    
    
![image](https://cloud.githubusercontent.com/assets/8969467/19064066/95ccbf64-89bc-11e6-877f-45af03ddcadc.png)
    
    
![image](https://cloud.githubusercontent.com/assets/8969467/19064082/a8445404-89bc-11e6-8532-26d8bc9b206f.png)
    
    Run create-doc.sh manually
    
    Author: Felix Cheung <felixcheun...@hotmail.com>
    
    Closes #15340 from felixcheung/vignettes.
    
    (cherry picked from commit 068c198e956346b90968a4d74edb7bc820c4be28)
    Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit a9165bb1b704483ad16331945b0968cbb1a97139
Author: Marcelo Vanzin <van...@cloudera.com>
Date:   2016-10-04T16:38:44Z

    [SPARK-17549][SQL] Only collect table size stat in driver for cached 
relation.
    
    This reverts commit 9ac68dbc5720026ea92acc61d295ca64d0d3d132. Turns out
    the original fix was correct.
    
    Original change description:
    The existing code caches all stats for all columns for each partition
    in the driver; for a large relation, this causes extreme memory usage,
    which leads to gc hell and application failures.
    
    It seems that only the size in bytes of the data is actually used in the
    driver, so instead just collect that. In executors, the full stats are
    still kept, but that's not a big problem; we expect the data to be distributed
    and thus not to incur too much memory pressure in each individual executor.
    
    There are also potential improvements on the executor side, since the data
    being stored currently is very wasteful (e.g. storing boxed types vs.
    primitive types for stats). But that's a separate issue.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #15304 from vanzin/SPARK-17549.2.
    
    (cherry picked from commit 8d969a2125d915da1506c17833aa98da614a257f)
    Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

commit a4f7df423e1e0aa512dfc496bc9de13831eae3f3
Author: Ergin Seyfe <ese...@fb.com>
Date:   2016-10-04T19:39:01Z

    [SPARK-17773][BRANCH-2.0][Input/Output] Add VoidObjectInspector
    
    This is the PR for branch-2.0: PR https://github.com/apache/spark/pull/15337
    
    Added VoidObjectInspector to the list of PrimitiveObjectInspectors
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    Executing the following query was failing:
    select SOME_UDAF*(a.arr)
    from (
    select Array(null) as arr from dim_one_row
    ) a
    
    After the fix, I am getting the correct output:
    res0: Array[org.apache.spark.sql.Row] = Array([null])
    
    Author: Ergin Seyfe <ese...@fb.com>
    
    Closes #15337 from seyfe/add_void_object_inspector.
    
    Author: Ergin Seyfe <ese...@fb.com>
    
    Closes #15345 from seyfe/add_void_object_inspector_2.0.

commit b8df2e53c38a30f51c710543c81279a59a9ab4fc
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-10-05T21:54:55Z

    [SPARK-17778][TESTS] Mock SparkContext to reduce memory usage of 
BlockManagerSuite
    
    ## What changes were proposed in this pull request?
    
    Mock SparkContext to reduce memory usage of BlockManagerSuite
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15350 from zsxwing/SPARK-17778.
    
    (cherry picked from commit 221b418b1c9db7b04c600b6300d18b034a4f444e)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit 3b6463a794a754d630d69398f009c055664dd905
Author: Herman van Hovell <hvanhov...@databricks.com>
Date:   2016-10-05T23:05:30Z

    [SPARK-17758][SQL] Last returns wrong result in case of empty partition
    
    ## What changes were proposed in this pull request?
    The result of the `Last` function can be wrong when the last partition 
processed is empty. It can return `null` instead of the expected value. For 
example, this can happen when we process partitions in the following order:
    ```
    - Partition 1 [Row1, Row2]
    - Partition 2 [Row3]
    - Partition 3 []
    ```
    In this case the `Last` function will currently return a null, instead of 
the value of `Row3`.
    
    This PR fixes this by adding a `valueSet` flag to the `Last` function.
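    A simplified sketch of the idea behind the flag (illustrative, not the catalyst implementation): the buffer records whether it has ever seen a value, so merging in a buffer produced by an empty partition cannot overwrite the last real value with null.
    ```scala
    // Illustrative only; names are made up.
    case class LastBuffer(value: Any = null, valueSet: Boolean = false) {
      def update(input: Any): LastBuffer = LastBuffer(input, valueSet = true)
      def merge(right: LastBuffer): LastBuffer =
        if (right.valueSet) right else this   // a buffer from an empty partition is ignored
    }
    ```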
    
    ## How was this patch tested?
    We only had end-to-end tests for `DeclarativeAggregateFunction`s. I have added an evaluator for these functions so we can test them in catalyst. I have also added a `LastTestSuite` to test the `Last` aggregate function.
    
    Author: Herman van Hovell <hvanhov...@databricks.com>
    
    Closes #15348 from hvanhovell/SPARK-17758.
    
    (cherry picked from commit 5fd54b994e2078dbf0794932b4e0ffa9a9eda0c3)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit 1c2dff1eeeb045f3f5c3c1423ba07371b03965d7
Author: Michael Armbrust <mich...@databricks.com>
Date:   2016-10-05T23:48:43Z

    [SPARK-17643] Remove comparable requirement from Offset (backport for 
branch-2.0)
    
    ## What changes were proposed in this pull request?
    
    Backport 
https://github.com/apache/spark/commit/988c71457354b0a443471f501cef544a85b1a76a 
to branch-2.0
    
    ## How was this patch tested?
    
    Jenkins
    
    Author: Michael Armbrust <mich...@databricks.com>
    
    Closes #15362 from zsxwing/SPARK-17643-2.0.

commit 225372adfb843afcbf9928db3989f2f8393ae6d8
Author: Reynold Xin <r...@databricks.com>
Date:   2016-10-06T17:33:45Z

    [SPARK-17798][SQL] Remove redundant Experimental annotations in 
sql.streaming
    
    ## What changes were proposed in this pull request?
    I was looking through API annotations to catch mislabeled APIs, and 
realized DataStreamReader and DataStreamWriter classes are already annotated as 
Experimental, and as a result there is no need to annotate each method within 
them.
    
    ## How was this patch tested?
    N/A
    
    Author: Reynold Xin <r...@databricks.com>
    
    Closes #15373 from rxin/SPARK-17798.
    
    (cherry picked from commit 79accf45ace5549caa0cbab02f94fc87bedb5587)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit a2bf09588ed98ef33028fcf4d72c15f06af2e9ad
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-10-06T19:51:12Z

    [SPARK-17780][SQL] Report Throwable to user in StreamExecution
    
    ## What changes were proposed in this pull request?
    
    When an incompatible source is used for Structured Streaming, it may throw a NoClassDefFoundError. It's better to just catch Throwable and report it to the user, since the streaming thread is dying anyway.
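    A hedged sketch of the pattern (names are illustrative): the streaming thread catches `Throwable` rather than only non-fatal exceptions, so errors such as `NoClassDefFoundError` are reported before the thread dies.
    ```scala
    // Illustrative sketch: `reportError` stands in for whatever the caller uses to
    // surface the failure to the user.
    def runStreamingThread(body: => Unit)(reportError: Throwable => Unit): Unit = {
      try {
        body
      } catch {
        case t: Throwable =>
          reportError(t)   // the thread is terminating anyway; make the cause visible
      }
    }
    ```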
    
    ## How was this patch tested?
    
    `test("NoClassDefFoundError from an incompatible source")`
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15352 from zsxwing/SPARK-17780.
    
    (cherry picked from commit 9a48e60e6319d85f2c3be3a3c608dab135e18a73)
    Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit e355ca8e828629455228b6a346d64638ab639cfa
Author: Christian Kadner <ckad...@us.ibm.com>
Date:   2016-10-06T21:28:49Z

    [SPARK-17803][TESTS] Upgrade docker-client dependency
    
    [SPARK-17803: Docker integration tests don't run with "Docker for 
Mac"](https://issues.apache.org/jira/browse/SPARK-17803)
    
    ## What changes were proposed in this pull request?
    
    This PR upgrades the 
[docker-client](https://mvnrepository.com/artifact/com.spotify/docker-client) 
dependency from 
[3.6.6](https://mvnrepository.com/artifact/com.spotify/docker-client/3.6.6) to 
[5.0.2](https://mvnrepository.com/artifact/com.spotify/docker-client/5.0.2) to 
enable _Docker for Mac_ users to run the `docker-integration-tests` out of the 
box.
    
    The very latest docker-client version is 
[6.0.0](https://mvnrepository.com/artifact/com.spotify/docker-client/6.0.0) but 
that has one additional dependency and no usage yet.
    
    ## How was this patch tested?
    
    The code change was tested on Mac OS X Yosemite with both _Docker Toolbox_ 
as well as _Docker for Mac_ and on Linux Ubuntu 14.04.
    
    ```
    $ build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
-Phive-thriftserver -DskipTests clean package
    
    $ build/mvn -Pdocker-integration-tests -Pscala-2.11 -pl 
:spark-docker-integration-tests_2.11 clean compile test
    ```
    
    Author: Christian Kadner <ckad...@us.ibm.com>
    
    Closes #15378 from ckadner/SPARK-17803_Docker_for_Mac.
    
    (cherry picked from commit 49d11d49983fbe270f4df4fb1e34b5fbe854c5ec)
    Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit b1a9c41e8c41c90dd15ee6f635356dd1a5bbf395
Author: Dongjoon Hyun <dongj...@apache.org>
Date:   2016-10-06T23:09:45Z

    [SPARK-17750][SQL][BACKPORT-2.0] Fix CREATE VIEW with INTERVAL arithmetic
    
    ## What changes were proposed in this pull request?
    
    Currently, Spark raises a `RuntimeException` when creating a view that applies INTERVAL arithmetic to a timestamp, like the following. The root cause is that the arithmetic expression `TimeAdd` was transformed into a `timeadd` function in the VIEW definition. This PR fixes the SQL definition of the `TimeAdd` and `TimeSub` expressions.
    
    ```scala
    scala> sql("CREATE TABLE dates (ts TIMESTAMP)")
    
    scala> sql("CREATE VIEW view1 AS SELECT ts + INTERVAL 1 DAY FROM dates")
    java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
    ```
    
    ## How was this patch tested?
    
    Pass Jenkins with a new testcase.
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #15383 from dongjoon-hyun/SPARK-17750-BACK.

commit 594a2cf6f7c74c54127b8c3947aadbe0052b404c
Author: sethah <seth.hendrickso...@gmail.com>
Date:   2016-10-07T04:10:17Z

    [SPARK-17792][ML] L-BFGS solver for linear regression does not accept 
general numeric label column types
    
    ## What changes were proposed in this pull request?
    
    Before, we computed `instances` in LinearRegression in two spots, even 
though they did the same thing. One of them did not cast the label column to 
`DoubleType`. This patch consolidates the computation and always casts the 
label column to `DoubleType`.
    
    ## How was this patch tested?
    
    Added a unit test to check all solvers. This test failed before this patch.
    
    Author: sethah <seth.hendrickso...@gmail.com>
    
    Closes #15364 from sethah/linreg_numeric_type.
    
    (cherry picked from commit 3713bb199142c5e06e2e527c99650f02f41f47b1)
    Signed-off-by: Yanbo Liang <yblia...@gmail.com>

commit 380b099fcfe6f70b978300ea208faf630855471a
Author: Dongjoon Hyun <dongj...@apache.org>
Date:   2016-10-07T05:27:20Z

    [SPARK-17612][SQL][BRANCH-2.0] Support `DESCRIBE table PARTITION` SQL syntax
    
    ## What changes were proposed in this pull request?
    
    This is a backport of SPARK-17612. This implements the `DESCRIBE table PARTITION` SQL syntax again. It was supported until Spark 1.6.2, but has been dropped since 2.0.0.
    
    **Spark 1.6.2**
    ```scala
    scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
(c STRING, d STRING)")
    res1: org.apache.spark.sql.DataFrame = [result: string]
    
    scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
    res2: org.apache.spark.sql.DataFrame = [result: string]
    
    scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
    +----------------------------------------------------------------+
    |result                                                          |
    +----------------------------------------------------------------+
    |a                      string                                   |
    |b                      int                                      |
    |c                      string                                   |
    |d                      string                                   |
    |                                                                |
    |# Partition Information                                         |
    |# col_name             data_type               comment          |
    |                                                                |
    |c                      string                                   |
    |d                      string                                   |
    +----------------------------------------------------------------+
    ```
    
    **Spark 2.0**
    - **Before**
    ```scala
    scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
(c STRING, d STRING)")
    res0: org.apache.spark.sql.DataFrame = []
    
    scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
    res1: org.apache.spark.sql.DataFrame = []
    
    scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
    org.apache.spark.sql.catalyst.parser.ParseException:
    Unsupported SQL statement
    ```
    
    - **After**
    ```scala
    scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
(c STRING, d STRING)")
    res0: org.apache.spark.sql.DataFrame = []
    
    scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
    res1: org.apache.spark.sql.DataFrame = []
    
    scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
    +-----------------------+---------+-------+
    |col_name               |data_type|comment|
    +-----------------------+---------+-------+
    |a                      |string   |null   |
    |b                      |int      |null   |
    |c                      |string   |null   |
    |d                      |string   |null   |
    |# Partition Information|         |       |
    |# col_name             |data_type|comment|
    |c                      |string   |null   |
    |d                      |string   |null   |
    +-----------------------+---------+-------+
    
    scala> sql("DESC EXTENDED partitioned_table PARTITION (c='Us', 
d=1)").show(100,false)
    
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
    |col_name                                                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        
|data_type|comment|
    
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
    |a                                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        |string 
  |null   |
    |b                                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        |int    
  |null   |
    |c                                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        |string 
  |null   |
    |d                                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        |string 
  |null   |
    |# Partition Information                                                    
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        |       
  |       |
    |# col_name                                                                 
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        
|data_type|comment|
    |c                                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        |string 
  |null   |
    |d                                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        |string 
  |null   |
    |                                                                           
                                                                                
                                                                                
                                                                                
                                                                                
                                                                        |       
  |       |
    |Detailed Partition Information CatalogPartition(
            Partition Values: [Us, 1]
            Storage(Location: 
file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1,
 InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: 
[serialization.format=1])
            Partition Parameters:{transient_lastDdlTime=1475001066})|         | 
      |
    
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
    
    scala> sql("DESC FORMATTED partitioned_table PARTITION (c='Us', 
d=1)").show(100,false)
    
+--------------------------------+---------------------------------------------------------------------------------------+-------+
    |col_name                        |data_type                                 
                                             |comment|
    
+--------------------------------+---------------------------------------------------------------------------------------+-------+
    |a                               |string                                    
                                             |null   |
    |b                               |int                                       
                                             |null   |
    |c                               |string                                    
                                             |null   |
    |d                               |string                                    
                                             |null   |
    |# Partition Information         |                                          
                                             |       |
    |# col_name                      |data_type                                 
                                             |comment|
    |c                               |string                                    
                                             |null   |
    |d                               |string                                    
                                             |null   |
    |                                |                                          
                                             |       |
    |# Detailed Partition Information|                                          
                                             |       |
    |Partition Value:                |[Us, 1]                                   
                                             |       |
    |Database:                       |default                                   
                                             |       |
    |Table:                          |partitioned_table                         
                                             |       |
    |Location:                       
|file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1|
       |
    |Partition Parameters:           |                                          
                                             |       |
    |  transient_lastDdlTime         |1475001066                                
                                             |       |
    |                                |                                          
                                             |       |
    |# Storage Information           |                                          
                                             |       |
    |SerDe Library:                  
|org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                             
        |       |
    |InputFormat:                    |org.apache.hadoop.mapred.TextInputFormat  
                                             |       |
    |OutputFormat:                   
|org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                     
        |       |
    |Compressed:                     |No                                        
                                             |       |
    |Storage Desc Parameters:        |                                          
                                             |       |
    |  serialization.format          |1                                         
                                             |       |
    
+--------------------------------+---------------------------------------------------------------------------------------+-------+
    ```
    
    ## How was this patch tested?
    
    Pass the Jenkins tests with a new testcase.
    
    Author: Dongjoon Hyun <dongj...@apache.org>
    
    Closes #15351 from dongjoon-hyun/SPARK-17612-BACK.

commit 3487b020354988a91181f23b1c6711bfcdb4c529
Author: Bryan Cutler <cutl...@gmail.com>
Date:   2016-10-07T07:27:55Z

    [SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of 
paths
    
    ## What changes were proposed in this pull request?
    If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`. This change checks if the param `paths` is a basestring and then converts it to a list, so that the same variable `paths` can be used in both cases.
    
    ## How was this patch tested?
    Added unit test for reading list of files
    
    Author: Bryan Cutler <cutl...@gmail.com>
    
    Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805.
    
    (cherry picked from commit bcaa799cb01289f73e9f48526e94653a07628983)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit 9f2eb27a425385836dba5aad61babfb1db738a73
Author: Sean Owen <so...@cloudera.com>
Date:   2016-10-07T17:31:41Z

    [SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished
    
    This expands calls to Jetty's simple `ServerConnector` constructor to 
explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It 
should otherwise result in exactly the same configuration, because the other 
args are copied from the constructor that is currently called.
    
    (I'm not sure we should change the Hive Thriftserver impl, but I did 
anyway.)
    
    This also adds `sc.stop()` to the quick start guide example.
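    A hedged sketch of the Jetty 9 call (assuming the surrounding `Server` and connection-factory setup already exists; the scheduler name and defaults below are assumptions): pass an explicit `ScheduledExecutorScheduler` built with daemon threads instead of relying on the connector's internally created scheduler.
    ```scala
    import org.eclipse.jetty.server.{HttpConnectionFactory, Server, ServerConnector}
    import org.eclipse.jetty.util.thread.ScheduledExecutorScheduler

    // Illustrative sketch only; thread-pool name and defaults are assumptions.
    def daemonConnector(server: Server): ServerConnector =
      new ServerConnector(
        server,
        null,                                                  // executor: reuse the server's
        new ScheduledExecutorScheduler("ui-scheduler", true),  // daemon scheduler threads
        null,                                                  // default ByteBufferPool
        -1, -1,                                                // default acceptor/selector counts
        new HttpConnectionFactory())
    ```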
    
    Existing tests; _pending_ at least manual verification of the fix.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15381 from srowen/SPARK-17707.
    
    (cherry picked from commit cff560755244dd4ccb998e0c56e81d2620cd4cff)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit f460a199e8fc78ce879b79844c6c9e340b574439
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-10-07T18:32:39Z

    [SPARK-17346][SQL][TEST-MAVEN] Add Kafka source for Structured Streaming 
(branch 2.0)
    
    ## What changes were proposed in this pull request?
    
    Backport 
https://github.com/apache/spark/commit/9293734d35eb3d6e4fd4ebb86f54dd5d3a35e6db 
and 
https://github.com/apache/spark/commit/b678e465afa417780b54db0fbbaa311621311f15 
into branch 2.0.
    
    The only difference is the Spark version in pom file.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15367 from zsxwing/kafka-source-branch-2.0.

commit a84d8ef375f853c5841d458a593e41b457b9e6ff
Author: Herman van Hovell <hvanhov...@databricks.com>
Date:   2016-10-07T10:46:39Z

    [SPARK-17782][STREAMING][BUILD] Add Kafka 0.10 project to build modules
    
    ## What changes were proposed in this pull request?
    This PR adds the Kafka 0.10 subproject to the build infrastructure. This makes sure the Kafka 0.10 tests are only triggered when the subproject or one of its dependencies changes.
    
    Author: Herman van Hovell <hvanhov...@databricks.com>
    
    Closes #15355 from hvanhovell/SPARK-17782.

commit 6d056c168c45d2decf5ffbb96d59623d52ed8490
Author: Davies Liu <dav...@databricks.com>
Date:   2016-10-07T22:03:47Z

    [SPARK-17806] [SQL] fix bug in join key rewritten in HashJoin
    
    ## What changes were proposed in this pull request?
    
    In HashJoin, we try to rewrite the join key as a Long to improve the performance of finding a match. The rewriting part is not well tested and has a bug that could cause wrong results when there are at least three integral columns in the join key and the total length of the key exceeds 8 bytes.
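    A conceptual sketch of the optimization being fixed (illustrative, not Spark's code): several narrow integral key columns are packed into a single Long so a match is found by comparing one primitive; the packing is only sound while the combined bit width fits in 64 bits, and mishandling that boundary is the kind of bug described above.
    ```scala
    // Illustrative sketch: each key part is (value, width in bits).
    def packKey(parts: Seq[(Long, Int)]): Option[Long] = {
      if (parts.map(_._2).sum > 64) {
        None  // key does not fit in one Long; fall back to generic join keys
      } else {
        Some(parts.foldLeft(0L) { case (acc, (v, bits)) =>
          val mask = if (bits == 64) -1L else (1L << bits) - 1
          (acc << bits) | (v & mask)
        })
      }
    }
    ```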
    
    ## How was this patch tested?
    
    Added unit tests covering the rewriting with different numbers of columns and different data types. Manually tested the reported case and confirmed that this PR fixes the bug.
    
    Author: Davies Liu <dav...@databricks.com>
    
    Closes #15390 from davies/rewrite_key.
    
    (cherry picked from commit 94b24b84a666517e31e9c9d693f92d9bbfd7f9ad)
    Signed-off-by: Davies Liu <davies....@gmail.com>

commit d27df35795fac0fd167e51d5ba08092a17eedfc2
Author: jiangxingbo <jiangxb1...@gmail.com>
Date:   2016-10-10T04:52:46Z

    [SPARK-17832][SQL] TableIdentifier.quotedString creates un-parseable names 
when name contains a backtick
    
    ## What changes were proposed in this pull request?
    
    The `quotedString` method in `TableIdentifier` and `FunctionIdentifier` 
produce an illegal (un-parseable) name when the name contains a backtick. For 
example:
    ```
    import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
    import org.apache.spark.sql.catalyst.TableIdentifier
    import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
    val complexName = TableIdentifier("`weird`table`name", Some("`d`b`1"))
    parseTableIdentifier(complexName.unquotedString) // Does not work
    parseTableIdentifier(complexName.quotedString) // Does not work
    parseExpression(complexName.unquotedString) // Does not work
    parseExpression(complexName.quotedString) // Does not work
    ```
    We should handle the backtick properly to make `quotedString` parseable.
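    A minimal sketch of the escaping rule (the helper name is illustrative): wrap each part in backticks and double any backtick inside the name, which is a form the parser can round-trip.
    ```scala
    // Illustrative sketch of backtick escaping for identifiers.
    def quoteIdentifier(name: String): String =
      s"`${name.replace("`", "``")}`"

    // e.g. quoteIdentifier("weird`table`name") == "`weird``table``name`"
    ```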
    
    ## How was this patch tested?
    Add new testcases in `TableIdentifierParserSuite` and 
`ExpressionParserSuite`.
    
    Author: jiangxingbo <jiangxb1...@gmail.com>
    
    Closes #15403 from jiangxb1987/backtick.
    
    (cherry picked from commit 26fbca480604ba258f97b9590cfd6dda1ecd31db)
    Signed-off-by: Herman van Hovell <hvanhov...@databricks.com>

commit d719e9a080a909a6a56db938750d553668743f8f
Author: Dhruve Ashar <dhruveas...@gmail.com>
Date:   2016-10-10T15:55:57Z

    [SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing
    
    ## What changes were proposed in this pull request?
    Currently the number of partition files is limited to 10000 (%05d format). If there are more than 10000 part files, the logic breaks while recreating the RDD because it sorts the file names as strings. More details can be found in the JIRA description [here](https://issues.apache.org/jira/browse/SPARK-17417).
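    A hedged sketch of the idea (the file-name pattern is illustrative): recover the numeric partition index from each part file name and sort numerically rather than lexicographically.
    ```scala
    // Illustrative sketch: numeric sort of checkpoint part files.
    val partFilePattern = """part-(\d+)""".r

    def sortPartFiles(names: Seq[String]): Seq[String] =
      names.sortBy {
        case partFilePattern(index) => index.toInt
        case _                      => Int.MaxValue  // non-part files last
      }
    ```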
    
    ## How was this patch tested?
    I tested this patch by checkpointing an RDD, manually renaming the part files to the old format, and then trying to access the RDD; it was successfully recreated from the old format. Also verified loading a sample parquet file, saving it in multiple formats - CSV, JSON, Text, Parquet, ORC - and reading them successfully back from the saved files. I couldn't launch the unit test from my local box, so will wait for the Jenkins output.
    
    Author: Dhruve Ashar <dhruveas...@gmail.com>
    
    Closes #15370 from dhruve/bug/SPARK-17417.
    
    (cherry picked from commit 4bafacaa5f50a3e986c14a38bc8df9bae303f3a0)
    Signed-off-by: Tom Graves <tgra...@yahoo-inc.com>

commit ff9f5bbf1795d9f5b14838099dcc1bb4ac8a9b5b
Author: Davies Liu <dav...@databricks.com>
Date:   2016-10-11T02:14:01Z

    [SPARK-17738][TEST] Fix flaky test in ColumnTypeSuite
    
    ## What changes were proposed in this pull request?
    
    The default buffer size is not big enough for randomly generated MapType.
    
    ## How was this patch tested?
    
    Ran the tests 100 times; they never failed (they failed 8 times before the patch).
    
    Author: Davies Liu <dav...@databricks.com>
    
    Closes #15395 from davies/flaky_map.
    
    (cherry picked from commit d5ec4a3e014494a3d991a6350caffbc3b17be0fd)
    Signed-off-by: Shixiong Zhu <shixi...@databricks.com>

commit a6b5e1dccf0be0e709d6d4113cdacb0cecce39fd
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-10-11T17:53:07Z

    [SPARK-17346][SQL][TESTS] Fix the flaky topic deletion in 
KafkaSourceStressSuite
    
    ## What changes were proposed in this pull request?
    
    A follow-up PR for SPARK-17346 to fix the flaky `org.apache.spark.sql.kafka010.KafkaSourceStressSuite`.
    
    Test log: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1855/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/_It_is_not_a_test_/
    
    Looks like deleting the Kafka internal topic `__consumer_offsets` is flaky. This PR simply ignores internal topics.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15384 from zsxwing/SPARK-17346-flaky-test.
    
    (cherry picked from commit 75b9e351413dca0930e8545e6283874db09d8482)
    Signed-off-by: Tathagata Das <tathagata.das1...@gmail.com>

commit 5ec3e6680a091883369c002ae599d6b03f38c863
Author: Ergin Seyfe <ese...@fb.com>
Date:   2016-10-11T19:51:08Z

    [SPARK-17816][CORE][BRANCH-2.0] Fix ConcurrentModificationException issue 
in BlockStatusesAccumulator
    
    ## What changes were proposed in this pull request?
    Replaced `BlockStatusesAccumulator` with `CollectionAccumulator`, which is thread-safe, and did a few more cleanups.
    
    ## How was this patch tested?
    Tested in master branch and cherry-picked.
    
    Author: Ergin Seyfe <ese...@fb.com>
    
    Closes #15425 from seyfe/race_cond_jsonprotocal_branch-2.0.

----

