GitHub user ming616 opened a pull request:

    https://github.com/apache/spark/pull/16206

    Branch 2.0

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16206
    
----
commit fffcec90b65047c3031c2b96679401f8fbef6337
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-09-14T20:33:51Z

    [SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's values readable thread-safely
    
    ## What changes were proposed in this pull request?
    
    Make the values of CollectionAccumulator and SetAccumulator readable in a thread-safe way, fixing the ConcurrentModificationException reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-17463).
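
    A minimal sketch of the idea (illustrative names, not the actual Spark code): keep the backing list synchronized and have `value` return a snapshot copy, so readers never iterate a list that a task thread is still mutating.

    ```scala
    import java.util.{ArrayList, Collections, List => JList}

    class CollectionAccumulatorSketch[T] {
      // synchronizedList guards individual add() calls from concurrent writers
      private val _list: JList[T] = Collections.synchronizedList(new ArrayList[T]())

      def add(v: T): Unit = _list.add(v)

      // Reads return a snapshot; iterating the live list would still need
      // external synchronization, which is exactly what callers tend to forget.
      def value: JList[T] = _list.synchronized {
        new ArrayList[T](_list)
      }
    }
    ```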
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15063 from zsxwing/SPARK-17463.
    
    (cherry picked from commit e33bfaed3b160fbc617c878067af17477a0044f5)
    Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit bb2bdb44032d2e71832b3e0e771590fb2225e4f3
Author: Xing SHI <shi-...@indetail.co.jp>
Date:   2016-09-14T20:46:46Z

    [SPARK-17465][SPARK CORE] Inappropriate memory management in 
`org.apache.spark.storage.MemoryStore` may lead to memory leak
    
    The check `if (memoryMap(taskAttemptId) == 0) memoryMap.remove(taskAttemptId)` in the methods `releaseUnrollMemoryForThisTask` and `releasePendingUnrollMemoryForThisTask` should be performed after the memory-release operation, whether `memoryToRelease` is greater than 0 or not.
    
    If a task's memory has already been set to 0 when `releaseUnrollMemoryForThisTask` or `releasePendingUnrollMemoryForThisTask` is called, the key corresponding to that task will never be removed from the hash map.
    
    See the details in 
[SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465).
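
    A hypothetical sketch of the corrected ordering (simplified signatures, not the real MemoryStore code): the "remove empty entry" cleanup runs after the release, even when nothing was actually released.

    ```scala
    import scala.collection.mutable

    def releaseUnrollMemorySketch(
        memoryMap: mutable.Map[Long, Long],
        taskAttemptId: Long,
        memoryToRelease: Long): Unit = memoryMap.synchronized {
      if (memoryToRelease > 0) {
        memoryMap(taskAttemptId) -= memoryToRelease
      }
      // Before the fix this cleanup only ran when memoryToRelease > 0, so an
      // entry that was already 0 stayed in the map forever and leaked.
      if (memoryMap.getOrElse(taskAttemptId, -1L) == 0) {
        memoryMap.remove(taskAttemptId)
      }
    }
    ```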
    
    Author: Xing SHI <shi-...@indetail.co.jp>
    
    Closes #15022 from saturday-shi/SPARK-17465.

commit 5c2bc8360019fb08e2e62e50bb261f7ce19b231e
Author: codlife <1004910...@qq.com>
Date:   2016-09-15T08:38:13Z

    [SPARK-17521] Error when I use sparkContext.makeRDD(Seq())
    
    ## What changes were proposed in this pull request?
    
     When I use `sc.makeRDD` as below:
    ```
    val data3 = sc.makeRDD(Seq())
    println(data3.partitions.length)
    ```
    I get an error:
    Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required
    
    We can fix this bug by modifying the last line to take `seq.size` into account:
    ```
      def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
        assertNotStopped()
        val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
        new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, defaultParallelism), indexToPrefs)
      }
    ```
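
    A hedged usage check (assuming the fix above and `defaultParallelism >= 1`): an empty `Seq` should now produce an RDD with `defaultParallelism` partitions instead of throwing.

    ```scala
    val empty = sc.makeRDD(Seq.empty[(String, Seq[String])])
    // Expected after the fix: no IllegalArgumentException, and
    assert(empty.partitions.length == sc.defaultParallelism)
    ```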
    
    ## How was this patch tested?
    
     manual tests
    
    
    Author: codlife <1004910...@qq.com>
    Author: codlife <wangjianfe...@otcaix.iscas.ac.cn>
    
    Closes #15077 from codlife/master.
    
    (cherry picked from commit 647ee05e5815bde361662a9286ac602c44b4d4e6)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit a09c258c9a97e701fa7650cc0651e3c6a7a1cab9
Author: junyangq <qianjuny...@gmail.com>
Date:   2016-09-15T17:00:36Z

    [SPARK-17317][SPARKR] Add SparkR vignette to branch 2.0
    
    ## What changes were proposed in this pull request?
    
    This PR adds the SparkR vignette to branch 2.0, which serves as a friendly guide to the functionality provided by SparkR.
    
    ## How was this patch tested?
    
    R unit test.
    
    Author: junyangq <qianjuny...@gmail.com>
    Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
    Author: Junyang Qian <junya...@databricks.com>
    
    Closes #15100 from junyangq/SPARKR-vignette-2.0.

commit e77a437d292ecda66163a895427d62e4f72e2a25
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-15T18:22:58Z

    [SPARK-17547] Ensure temp shuffle data file is cleaned up after error
    
    SPARK-8029 (#9610) modified shuffle writers to first stage their data to a 
temporary file in the same directory as the final destination file and then to 
atomically rename this temporary file at the end of the write job. However, 
this change introduced the potential for the temporary output file to be leaked 
if an exception occurs during the write because the shuffle writers' existing 
error cleanup code doesn't handle deletion of the temp file.
    
    This patch avoids this potential cause of disk-space leaks by adding 
`finally` blocks to ensure that temp files are always deleted if they haven't 
been renamed.
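
    A minimal sketch of the pattern (helper names are illustrative, not the actual writer code): stage the output in a temp file and delete it in a `finally` block if the atomic rename never happened.

    ```scala
    import java.io.{File, IOException}

    def writeThenRenameSketch(output: File)(write: File => Unit): Unit = {
      val tmp = File.createTempFile("shuffle", ".tmp", output.getParentFile)
      try {
        write(tmp)
        if (!tmp.renameTo(output)) {
          throw new IOException(s"fail to rename $tmp to $output")
        }
      } finally {
        // If the write or the rename failed, the temp file is still there;
        // deleting it here is what prevents the disk-space leak.
        if (tmp.exists()) {
          tmp.delete()
        }
      }
    }
    ```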
    
    Author: Josh Rosen <joshro...@databricks.com>
    
    Closes #15104 from JoshRosen/cleanup-tmp-data-file-in-shuffle-writer.
    
    (cherry picked from commit 5b8f7377d54f83b93ef2bfc2a01ca65fae6d3032)
    Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit 62ab536588e19293a84004f547ebc316346b869e
Author: Herman van Hovell <hvanhov...@databricks.com>
Date:   2016-09-15T18:24:15Z

    [SPARK-17114][SQL] Fix aggregates grouped by literals with empty input
    
    ## What changes were proposed in this pull request?
    This PR fixes an issue with aggregates that have an empty input and use literals as their grouping keys. These aggregates are currently interpreted as aggregates **without** grouping keys, which triggers the ungrouped code path (which always returns a single row).
    
    This PR fixes the `RemoveLiteralFromGroupExpressions` optimizer rule, which would change the semantics of the Aggregate by eliminating all literal grouping keys.
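
    A hedged illustration of the semantic difference on an empty input (using `spark` and `org.apache.spark.sql.functions`): grouping by a literal should produce zero rows, while a truly ungrouped aggregate produces exactly one row.

    ```scala
    import org.apache.spark.sql.functions._

    val empty = spark.range(0).selectExpr("id AS a")
    empty.groupBy(lit(1)).count().show()  // expected: 0 rows
    empty.agg(count("*")).show()          // expected: 1 row with count = 0
    ```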
    
    ## How was this patch tested?
    Added tests to `SQLQueryTestSuite`.
    
    Author: Herman van Hovell <hvanhov...@databricks.com>
    
    Closes #15101 from hvanhovell/SPARK-17114-3.
    
    (cherry picked from commit d403562eb4b5b1d804909861d3e8b75d8f6323b9)
    Signed-off-by: Herman van Hovell <hvanhov...@databricks.com>

commit abb89c42e760357e2d7eae4be344379c7f0d17f3
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-12T20:09:33Z

    [SPARK-17483] Refactoring in BlockManager status reporting and block removal
    
    This patch makes three minor refactorings to the BlockManager:
    
    - Move the `if (info.tellMaster)` check out of `reportBlockStatus`; this 
fixes an issue where a debug logging message would incorrectly claim to have 
reported a block status to the master even though no message had been sent (in 
case `info.tellMaster == false`). This also makes it easier to write code which 
unconditionally sends block statuses to the master (which is necessary in 
another patch of mine).
    - Split  `removeBlock()` into two methods, the existing method and an 
internal `removeBlockInternal()` method which is designed to be called by 
internal code that already holds a write lock on the block. This is also needed 
by a followup patch.
    - Instead of calling `getCurrentBlockStatus()` in `removeBlock()`, just 
pass `BlockStatus.empty`; the block status should always be empty following 
complete removal of a block.
    
    These changes were originally authored as part of a bug fix patch which is 
targeted at branch-2.0 and master; I've split them out here into their own 
separate PR in order to make them easier to review and so that the 
behavior-changing parts of my other patch can be isolated to their own PR.
    
    Author: Josh Rosen <joshro...@databricks.com>
    
    Closes #15036 from 
JoshRosen/cache-failure-race-conditions-refactorings-only.
    
    (cherry picked from commit 3d40896f410590c0be044b3fa7e5d32115fac05e)
    Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit 0169c2edc35ee918b2972f2f4d4e112ccbdcb0c1
Author: Sean Zhong <seanzh...@databricks.com>
Date:   2016-09-15T18:53:48Z

    [SPARK-17364][SQL] Antlr lexer wrongly treats fully qualified identifier as a decimal number token when parsing SQL string
    
    ## What changes were proposed in this pull request?
    
    The Antlr lexer we use to tokenize a SQL string may wrongly tokenize a 
fully qualified identifier as a decimal number token. For example, table 
identifier `default.123_table` is wrongly tokenized as
    ```
    default // Matches lexer rule IDENTIFIER
    .123 // Matches lexer rule DECIMAL_VALUE
    _TABLE // Matches lexer rule IDENTIFIER
    ```
    
    The correct tokenization for `default.123_table` should be:
    ```
    default // Matches lexer rule IDENTIFIER,
    . // Matches a single dot
    123_TABLE // Matches lexer rule IDENTIFIER
    ```
    
    This PR fixes the Antlr grammar so that it can tokenize fully qualified identifiers correctly:
    1. Fully qualified table names can be parsed correctly, for example `select * from database.123_suffix`.
    2. Fully qualified column names can be parsed correctly, for example `select a.123_suffix from a`.
    
    ### Before change
    
    #### Case 1: Failed to parse fully qualified column name
    
    ```
    scala> spark.sql("select a.123_column from a").show
    org.apache.spark.sql.catalyst.parser.ParseException:
    extraneous input '.123' expecting {<EOF>,
    ...
    , IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 8)
    == SQL ==
    select a.123_column from a
    --------^^^
    ```
    
    #### Case 2: Failed to parse fully qualified table name
    ```
    scala> spark.sql("select * from default.123_table")
    org.apache.spark.sql.catalyst.parser.ParseException:
    extraneous input '.123' expecting {<EOF>,
    ...
    IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 21)
    
    == SQL ==
    select * from default.123_table
    ---------------------^^^
    ```
    
    ### After Change
    
    #### Case 1: fully qualified column name, no ParseException thrown
    ```
    scala> spark.sql("select a.123_column from a").show
    ```
    
    #### Case 2: fully qualified table name, no ParseException thrown
    ```
    scala> spark.sql("select * from default.123_table")
    ```
    
    ## How was this patch tested?
    
    Unit test.
    
    Author: Sean Zhong <seanzh...@databricks.com>
    
    Closes #15006 from clockfly/SPARK-17364.
    
    (cherry picked from commit a6b8182006d0c3dda67c06861067ca78383ecf1b)
    Signed-off-by: Herman van Hovell <hvanhov...@databricks.com>

commit 9c23f4408d337f4af31ebfbcc78767df67d36aed
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-15T18:54:17Z

    [SPARK-17484] Prevent invalid block locations from being reported after 
put() exceptions
    
    ## What changes were proposed in this pull request?
    
    If a BlockManager `put()` call failed after the BlockManagerMaster was notified of a block's availability, then incomplete cleanup logic in a `finally` block would never send a second block status message to inform the master of the block's unavailability. This, in turn, leads to fetch failures and used to be capable of causing complete job failures before #15037 was fixed.
    
    This patch addresses this issue via multiple small changes:
    
    - The `finally` block now calls `removeBlockInternal` when cleaning up from 
a failed `put()`; in addition to removing the `BlockInfo` entry (which was 
_all_ that the old cleanup logic did), this code (redundantly) tries to remove 
the block from the memory and disk stores (as an added layer of defense against 
bugs lower down in the stack) and optionally notifies the master of block 
removal (which now happens during exception-triggered cleanup).
    - When a BlockManager receives a request for a block that it does not have 
it will now notify the master to update its block locations. This ensures that 
bad metadata pointing to non-existent blocks will eventually be fixed. Note 
that I could have implemented this logic in the block manager client (rather 
than in the remote server), but that would introduce the problem of 
distinguishing between transient and permanent failures; on the server, 
however, we know definitively that the block isn't present.
    - Catch `NonFatal` instead of `Exception` to avoid swallowing 
`InterruptedException`s thrown from synchronous block replication calls.
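
    A minimal sketch of the last point (illustrative names, not the BlockManager code): `NonFatal` deliberately does not match `InterruptedException`, so interrupts propagate instead of being swallowed by the cleanup handler.

    ```scala
    import scala.util.control.NonFatal

    def replicateWithCleanup(replicate: () => Unit, cleanup: () => Unit): Unit = {
      try {
        replicate()
      } catch {
        case NonFatal(e) =>
          cleanup()  // only ordinary failures trigger cleanup
          throw e    // InterruptedException and fatal errors bypass this case
      }
    }
    ```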
    
    This patch depends upon the refactorings in #15036, so that other patch 
will also have to be backported when backporting this fix.
    
    For more background on this issue, including example logs from a real 
production failure, see 
[SPARK-17484](https://issues.apache.org/jira/browse/SPARK-17484).
    
    ## How was this patch tested?
    
    Two new regression tests in BlockManagerSuite.
    
    Author: Josh Rosen <joshro...@databricks.com>
    
    Closes #15085 from JoshRosen/SPARK-17484.
    
    (cherry picked from commit 1202075c95eabba0ffebc170077df798f271a139)
    Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit 5ad4395e1b41d5ec74785c0aef5c2f656f9db9da
Author: Reynold Xin <r...@databricks.com>
Date:   2016-09-16T18:24:26Z

    [SPARK-17558] Bump Hadoop 2.7 version from 2.7.2 to 2.7.3
    
    ## What changes were proposed in this pull request?
    This patch bumps the Hadoop version in hadoop-2.7 profile from 2.7.2 to 
2.7.3, which was recently released and contained a number of bug fixes.
    
    ## How was this patch tested?
    The change should be covered by existing tests.
    
    Author: Reynold Xin <r...@databricks.com>
    
    Closes #15115 from rxin/SPARK-17558.
    
    (cherry picked from commit dca771bec6edb1cd8fc75861d364e0ba9dccf7c3)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit 3fce1255ad41a04e92720795ce2b162ec305cf0a
Author: Marcelo Vanzin <van...@cloudera.com>
Date:   2016-09-16T21:02:56Z

    [SPARK-17549][SQL] Only collect table size stat in driver for cached 
relation.
    
    The existing code caches all stats for all columns for each partition
    in the driver; for a large relation, this causes extreme memory usage,
    which leads to gc hell and application failures.
    
    It seems that only the size in bytes of the data is actually used in the
    driver, so instead just collect that. In executors, the full stats are
    still kept, but that's not a big problem; we expect the data to be distributed
    and thus not to incur too much memory pressure in each individual executor.
    
    There are also potential improvements on the executor side, since the data
    being stored currently is very wasteful (e.g. storing boxed types vs.
    primitive types for stats). But that's a separate issue.
    
    On a mildly related change, I'm also adding code to catch exceptions in the
    code generator since Janino was breaking with the test data I tried this
    patch on.
    
    Tested with unit tests and by doing a count on a very wide table (20k columns)
    with many partitions.
    
    Author: Marcelo Vanzin <van...@cloudera.com>
    
    Closes #15112 from vanzin/SPARK-17549.
    
    (cherry picked from commit 39e2bad6a866d27c3ca594d15e574a1da3ee84cc)
    Signed-off-by: Yin Huai <yh...@databricks.com>

commit 9ff158b81224c106d50e087c0d284b0c86c95879
Author: Daniel Darabos <darabos.dan...@gmail.com>
Date:   2016-09-17T11:28:42Z

    Correct fetchsize property name in docs
    
    ## What changes were proposed in this pull request?
    
    Replace `fetchSize` with `fetchsize` in the docs.
    
    ## How was this patch tested?
    
    I manually tested `fetchSize` and `fetchsize`. The latter has an effect. 
See also 
[`JdbcUtils.scala#L38`](https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L38)
 for the definition of the property.
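
    A hedged usage example (placeholder URL and table name): the lowercase `fetchsize` key is the one the JDBC data source reads in 2.0; `fetchSize` is silently ignored.

    ```scala
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost/test")  // placeholder connection URL
      .option("dbtable", "public.events")                 // placeholder table name
      .option("fetchsize", "1000")                        // lowercase key takes effect
      .load()
    ```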
    
    Author: Daniel Darabos <darabos.dan...@gmail.com>
    
    Closes #14975 from darabos/patch-3.
    
    (cherry picked from commit 69cb0496974737347e2650cda436b39bbd51e581)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 3ca0dc00786df1d529d55e297aaf23e1e1e07999
Author: Xin Ren <iamsh...@126.com>
Date:   2016-09-17T11:30:25Z

    [SPARK-17567][DOCS] Use valid url to Spark RDD paper
    
    https://issues.apache.org/jira/browse/SPARK-17567
    
    ## What changes were proposed in this pull request?
    
    Documentation 
(http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) 
contains broken link to Spark paper 
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf).
    
    I found it elsewhere 
(https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) and 
I hope it is the same one. It should be uploaded to and linked from some Apache 
controlled storage, so it won't break again.
    
    ## How was this patch tested?
    
    Tested manually on local laptop.
    
    Author: Xin Ren <iamsh...@126.com>
    
    Closes #15121 from keypointt/SPARK-17567.
    
    (cherry picked from commit f15d41be3ce7569736ccbf2ffe1bec265865f55d)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit c9bd67e94d9d9d2e1f2cb1e5c4bb71a69b1e1d4e
Author: Sean Owen <so...@cloudera.com>
Date:   2016-09-16T20:43:05Z

    [SPARK-17561][DOCS] DataFrameWriter documentation formatting problems
    
    Fix `<ul> / <li>` problems in SQL scaladoc.
    
    Scaladoc build and manual verification of generated HTML.
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15117 from srowen/SPARK-17561.
    
    (cherry picked from commit b9323fc9381a09af510f542fd5c86473e029caf6)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit eb2675de92b865852d7aa3ef25a20e6cff940299
Author: William Benton <wi...@redhat.com>
Date:   2016-09-17T11:49:58Z

    [SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously 
rejects the best match when invoked with a vector
    
    ## What changes were proposed in this pull request?
    
    This pull request changes the behavior of `Word2VecModel.findSynonyms` so 
that it will not spuriously reject the best match when invoked with a vector 
that does not correspond to a word in the model's vocabulary.  Instead of 
blindly discarding the best match, the changed implementation discards a match 
that corresponds to the query word (in cases where `findSynonyms` is invoked 
with a word) or that has an identical angle to the query vector.
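
    A hedged usage sketch (the model and vector dimension are assumed): querying synonyms with a raw vector; after this change the single best match is no longer discarded outright.

    ```scala
    import org.apache.spark.mllib.linalg.Vectors

    // `model` is assumed to be a trained org.apache.spark.mllib.feature.Word2VecModel
    val queryVector = Vectors.dense(0.1, 0.2, 0.3)  // assumes a 3-dimensional model
    model.findSynonyms(queryVector, 5).foreach { case (word, similarity) =>
      println(s"$word -> $similarity")
    }
    ```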
    
    ## How was this patch tested?
    
    I added a test to `Word2VecSuite` to ensure that the word with the most 
similar vector from a supplied vector would not be spuriously rejected.
    
    Author: William Benton <wi...@redhat.com>
    
    Closes #15105 from willb/fix/findSynonyms.
    
    (cherry picked from commit 25cbbe6ca334140204e7035ab8b9d304da9b8a8a)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit ec2b736566b69a1549791f3d86b55cb0249a757d
Author: sandy <phal...@gmail.com>
Date:   2016-09-17T15:25:03Z

    [SPARK-17575][DOCS] Remove extra table tags in configuration document
    
    ## What changes were proposed in this pull request?
    
    Remove extra table tags in configurations document.
    
    ## How was this patch tested?
    
    Run all test cases and generate document.
    
    Before, with the extra tags, it looks like below:
    
![config-wrong1](https://cloud.githubusercontent.com/assets/8075390/18608239/c602bb60-7d01-11e6-875e-f38558997dd3.png)
    
    
![config-wrong2](https://cloud.githubusercontent.com/assets/8075390/18608241/cf3b672c-7d01-11e6-935e-1e73f9e6e578.png)
    
    After removing the tags, it looks like below:
    
    
![config](https://cloud.githubusercontent.com/assets/8075390/18608245/e156eb8e-7d01-11e6-98aa-3be68d4d1961.png)
    
    
![config2](https://cloud.githubusercontent.com/assets/8075390/18608247/e84eecd4-7d01-11e6-9738-a3f7ff8fe834.png)
    
    Author: sandy <phal...@gmail.com>
    
    Closes #15130 from phalodi/SPARK-17575.
    
    (cherry picked from commit bbe0b1d623741decce98827130cc67eb1fff1240)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit a3bba372abce926351335d0a2936b70988f19b23
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-09-17T15:52:30Z

    [SPARK-17480][SQL][FOLLOWUP] Fix more instances which call List.length/size, which is O(n)
    
    This PR fixes all the remaining instances of the issue that was fixed in the previous PR.
    
    To make sure, I manually debugged and also checked the Scala source. 
`length` in 
[LinearSeqOptimized.scala#L49-L57](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/LinearSeqOptimized.scala#L49-L57)
 is O(n). Also, `size` calls `length` via 
[SeqLike.scala#L106](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L106).
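
    A hedged illustration of the pattern being fixed (toy functions, not the patched code): calling `.length` on a `List` inside a `while` condition re-traverses the list on every iteration, so the loop becomes quadratic; converting once to an `Array` (or hoisting the length) keeps it linear.

    ```scala
    def sumSlow(xs: List[Int]): Int = {
      var i = 0
      var total = 0
      while (i < xs.length) {  // O(n) length check on every iteration
        total += xs(i)         // xs(i) is also O(i) for a List
        i += 1
      }
      total
    }

    def sumFast(xs: List[Int]): Int = {
      val arr = xs.toArray     // one O(n) conversion up front
      var i = 0
      var total = 0
      while (i < arr.length) { // O(1) length and O(1) indexing
        total += arr(i)
        i += 1
      }
      total
    }
    ```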
    
    For debugging, I have created these as below:
    
    ```scala
    ArrayBuffer(1, 2, 3)
    Array(1, 2, 3)
    List(1, 2, 3)
    Seq(1, 2, 3)
    ```
    
    and then called `size` and `length` for each to debug.
    
    I ran the following bash commands on Mac:
    
    ```bash
    find . -name "*.scala" -type f -exec grep -il "while (.*\\.length)" {} \; | grep "src/main"
    find . -name "*.scala" -type f -exec grep -il "while (.*\\.size)" {} \; | grep "src/main"
    ```
    
    and then checked each.
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #15093 from HyukjinKwon/SPARK-17480-followup.
    
    (cherry picked from commit 86c2d393a56bf1e5114bc5a781253c0460efb8af)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit bec077069af0b3bc22092a0552baf855dfb344ad
Author: Josh Rosen <joshro...@databricks.com>
Date:   2016-09-17T18:46:15Z

    [SPARK-17491] Close serialization stream to fix wrong answer bug in 
putIteratorAsBytes()
    
    ## What changes were proposed in this pull request?
    
    `MemoryStore.putIteratorAsBytes()` may silently lose values when used with 
`KryoSerializer` because it does not properly close the serialization stream 
before attempting to deserialize the already-serialized values, which may cause 
values buffered in Kryo's internal buffers to not be read.
    
    This is the root cause behind a user-reported "wrong answer" bug in PySpark 
caching reported by bennoleslie on the Spark user mailing list in a thread 
titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's 
automatic use of KryoSerializer for "safe" types (such as byte arrays, 
primitives, etc.) this misuse of serializers manifested itself as silent data 
corruption rather than a StreamCorrupted error (which you might get from 
JavaSerializer).
    
    The minimal fix, implemented here, is to close the serialization stream 
before attempting to deserialize written values. In addition, this patch adds 
several additional assertions / precondition checks to prevent misuse of 
`PartiallySerializedBlock` and `ChunkedByteBufferOutputStream`.
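
    A minimal sketch of the failure mode (standalone helper, not the MemoryStore code): a serialization stream such as Kryo's buffers records internally, so the stream must be closed before the written bytes are read back.

    ```scala
    import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
    import scala.reflect.ClassTag
    import org.apache.spark.serializer.SerializerInstance

    def roundTripSketch[T: ClassTag](values: Iterator[T], ser: SerializerInstance): Iterator[Any] = {
      val bytes = new ByteArrayOutputStream()
      val out = ser.serializeStream(bytes)
      values.foreach(out.writeObject(_))
      out.close()  // without this, records buffered by Kryo may never reach `bytes`
      ser.deserializeStream(new ByteArrayInputStream(bytes.toByteArray)).asIterator
    }
    ```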
    
    ## How was this patch tested?
    
    The original bug was masked by an invalid assert in the memory store test 
cases: the old assert compared two results record-by-record with `zip` but 
didn't first check that the lengths of the two collections were equal, causing 
missing records to go unnoticed. The updated test case reproduced this bug.
    
    In addition, I added a new `PartiallySerializedBlockSuite` to unit test 
that component.
    
    Author: Josh Rosen <joshro...@databricks.com>
    
    Closes #15043 from 
JoshRosen/partially-serialized-block-values-iterator-bugfix.
    
    (cherry picked from commit 8faa5217b44e8d52eab7eb2d53d0652abaaf43cd)
    Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit 0cfc0469b40450aee5d909641b4296b3a978b2d6
Author: Tathagata Das <tathagata.das1...@gmail.com>
Date:   2016-09-17T21:18:40Z

    Revert "[SPARK-17480][SQL][FOLLOWUP] Fix more instances which calls 
List.length/size which is O(n)"
    
    This reverts commit a3bba372abce926351335d0a2936b70988f19b23.

commit 5fd354b2d628130a74c9d01adc7ab6bef65fbd9a
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-09-17T15:52:30Z

    [SPARK-17480][SQL][FOLLOWUP] Fix more instances which call List.length/size, which is O(n)
    
    This PR fixes all the remaining instances of the issue that was fixed in the previous PR.
    
    To make sure, I manually debugged and also checked the Scala source. 
`length` in 
[LinearSeqOptimized.scala#L49-L57](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/LinearSeqOptimized.scala#L49-L57)
 is O(n). Also, `size` calls `length` via 
[SeqLike.scala#L106](https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/SeqLike.scala#L106).
    
    For debugging, I have created these as below:
    
    ```scala
    ArrayBuffer(1, 2, 3)
    Array(1, 2, 3)
    List(1, 2, 3)
    Seq(1, 2, 3)
    ```
    
    and then called `size` and `length` for each to debug.
    
    I ran the following bash commands on Mac:
    
    ```bash
    find . -name "*.scala" -type f -exec grep -il "while (.*\\.length)" {} \; | grep "src/main"
    find . -name "*.scala" -type f -exec grep -il "while (.*\\.size)" {} \; | grep "src/main"
    ```
    
    and then checked each.
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #15093 from HyukjinKwon/SPARK-17480-followup.
    
    (cherry picked from commit 86c2d393a56bf1e5114bc5a781253c0460efb8af)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit cf728b0f2dc7c1e9f62a8984122d3bf91e6ba439
Author: Wenchen Fan <wenc...@databricks.com>
Date:   2016-09-18T13:15:35Z

    [SPARK-17541][SQL] fix some DDL bugs about table management when same-name 
temp view exists
    
    In `SessionCatalog`, we have several operations (`tableExists`, `dropTable`, `lookupRelation`, etc.) that handle both temp views and metastore tables/views. This brings some bugs to DDL commands that want to handle temp views only or metastore tables/views only. These bugs are:
    
    1. `CREATE TABLE USING` will fail if a same-name temp view exists
    2. `Catalog.dropTempView` will un-cache and drop a metastore table if a same-name table exists
    3. `saveAsTable` will fail or have unexpected behaviour if a same-name temp 
view exists.
    
    These bug fixes are pulled out from https://github.com/apache/spark/pull/14962 and target both the master and 2.0 branches
    
    new regression tests
    
    Author: Wenchen Fan <wenc...@databricks.com>
    
    Closes #15099 from cloud-fan/fix-view.
    
    (cherry picked from commit 3fe630d314cf50d69868b7707ac8d8d2027080b8)
    Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 5619f095bfac76009758b4f4a4f8c9e319eeb5b1
Author: Sean Owen <so...@cloudera.com>
Date:   2016-09-18T15:22:31Z

    [SPARK-17546][DEPLOY] start-* scripts should use hostname -f
    
    ## What changes were proposed in this pull request?
    
    Call `hostname -f` to get fully qualified host name
    
    ## How was this patch tested?
    
    Jenkins tests of course, but also verified output of command on OS X and 
Linux
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15129 from srowen/SPARK-17546.
    
    (cherry picked from commit 342c0e65bec4b9a715017089ab6ea127f3c46540)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 6c67d86f2f0a24764146827ec5c42969194cb11d
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2016-09-18T18:18:49Z

    [SPARK-17586][BUILD] Do not call static member via instance reference
    
    ## What changes were proposed in this pull request?
    
    This PR fixes a warning message as below:
    
    ```
    [WARNING] .../UnsafeInMemorySorter.java:284: warning: [static] static 
method should be qualified by type name, TaskMemoryManager, instead of by an 
expression
    [WARNING]       currentPageNumber = 
memoryManager.decodePageNumber(recordPointer)
    ```
    
    by referencing the static member via the class name rather than an instance reference.
    
    ## How was this patch tested?
    
    Existing tests should cover this - Jenkins tests.
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #15141 from HyukjinKwon/SPARK-17586.
    
    (cherry picked from commit 7151011b38a841d9d4bc2e453b9a7cfe42f74f8f)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 151f808a181333daa6300c7d5d7c49c3cec3307c
Author: Liwei Lin <lwl...@gmail.com>
Date:   2016-09-18T18:25:58Z

    [SPARK-16462][SPARK-16460][SPARK-15144][SQL] Make CSV cast null values 
properly
    
    ## Problem
    
    CSV in Spark 2.0.0:
    - does not read null values back correctly for certain data types such as `Boolean`, `TimestampType`, `DateType` -- this is a regression compared to 1.6;
    - does not read empty values (specified by `options.nullValue`) as `null`s 
for `StringType` -- this is compatible with 1.6 but leads to problems like 
SPARK-16903.
    
    ## What changes were proposed in this pull request?
    
    This patch makes changes to read all empty values back as `null`s.
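
    A hedged usage example (schema and path are placeholders): with this change, fields matching `nullValue` come back as `null` for types such as `BooleanType` and `TimestampType`, and for `StringType` as well.

    ```scala
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("flag", BooleanType),
      StructField("ts", TimestampType),
      StructField("name", StringType)))

    val df = spark.read
      .option("header", "true")
      .option("nullValue", "NA")   // assumed marker for missing values
      .schema(schema)
      .csv("/path/to/data.csv")    // placeholder path
    ```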
    
    ## How was this patch tested?
    
    New test cases.
    
    Author: Liwei Lin <lwl...@gmail.com>
    
    Closes #14118 from lw-lin/csv-cast-null.
    
    (cherry picked from commit 1dbb725dbef30bf7633584ce8efdb573f2d92bca)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit 27ce39cf207eba46502ed11fcbfd51bed3e68f2b
Author: petermaxlee <petermax...@gmail.com>
Date:   2016-09-18T22:22:01Z

    [SPARK-17571][SQL] AssertOnQuery.condition should always return Boolean 
value
    
    ## What changes were proposed in this pull request?
    AssertOnQuery has two apply constructors: one that accepts a closure that returns a boolean, and another that accepts a closure that returns Unit. This is actually very confusing because developers could mistakenly think that AssertOnQuery always requires a boolean return type and verifies the return result, when in fact the value of the last statement is ignored in one of the constructors.
    
    This pull request makes the two constructors consistent and always requires a boolean value. Overall it makes the test suites more robust against developer errors.
    
    As an evidence for the confusing behavior, this change also identified a 
bug with an existing test case due to file system time granularity. This pull 
request fixes that test case as well.
    
    ## How was this patch tested?
    This is a test only change.
    
    Author: petermaxlee <petermax...@gmail.com>
    
    Closes #15127 from petermaxlee/SPARK-17571.
    
    (cherry picked from commit 8f0c35a4d0dd458719627be5f524792bf244d70a)
    Signed-off-by: Reynold Xin <r...@databricks.com>

commit ac060397c109158e84a2b57355c8dee5ab24ab7b
Author: Sean Owen <so...@cloudera.com>
Date:   2016-09-19T08:38:25Z

    [SPARK-17297][DOCS] Clarify window/slide duration as absolute time, not 
relative to a calendar
    
    ## What changes were proposed in this pull request?
    
    Clarify that slide and window duration are absolute, and not relative to a 
calendar.
    
    ## How was this patch tested?
    
    Doc build (no functional change)
    
    Author: Sean Owen <so...@cloudera.com>
    
    Closes #15142 from srowen/SPARK-17297.
    
    (cherry picked from commit d720a4019460b6c284d0473249303c349df60a1f)
    Signed-off-by: Sean Owen <so...@cloudera.com>

commit c4660d607fbeacc9bdbe2bb1293e4401d19a4bd5
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-09-19T17:21:33Z

    [SPARK-17589][TEST][2.0] Fix test case `create external table` in 
MetastoreDataSourcesSuite
    
    ### What changes were proposed in this pull request?
    This PR is to fix a test failure on the branch 2.0 builds:
    
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.0-test-maven-hadoop-2.7/711/
    ```
    Error Message
    
    "Table `default`.`createdJsonTable` already exists.;" did not contain 
"Table default.createdJsonTable already exists." We should complain that 
createdJsonTable already exists
    ```
    
    ### How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsm...@gmail.com>
    
    Closes #15145 from gatorsmile/fixTestCase.

commit f56035ba6c86fe93a45fd437f98f812431df0069
Author: sureshthalamati <suresh.thalam...@gmail.com>
Date:   2016-09-19T16:56:16Z

    [SPARK-17473][SQL] fixing docker integration tests error due to different 
versions of jars.
    
    ## What changes were proposed in this pull request?
    Docker tests are using an older version of Jersey jars (1.19), which was used in older releases of Spark. In the 2.0 release Spark was upgraded to the 2.x version of Jersey. After the upgrade, docker tests fail with AbstractMethodError. Now that Spark is on the 2.x Jersey version, using shaded docker jars may not be required any more. Removed the exclusions/overrides of Jersey-related classes from the pom file, and changed docker-client to use the regular jar instead of the shaded one.
    
    ## How was this patch tested?
    
    Tested  using existing  docker-integration-tests
    
    Author: sureshthalamati <suresh.thalam...@gmail.com>
    
    Closes #15114 from sureshthalamati/docker_testfix-spark-17473.
    
    (cherry picked from commit cdea1d1343d02f0077e1f3c92ca46d04a3d30414)
    Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit d6191a0671effe32f5c07397679c17a62e1cdaff
Author: Shixiong Zhu <shixi...@databricks.com>
Date:   2016-09-19T18:00:42Z

    [SPARK-17438][WEBUI] Show Application.executorLimit in the application page
    
    ## What changes were proposed in this pull request?
    
    This PR adds `Application.executorLimit` to the application page
    
    ## How was this patch tested?
    
    Checked the UI manually.
    
    Screenshots:
    
    1. Dynamic allocation is disabled
    
    <img width="484" alt="screen shot 2016-09-07 at 4 21 49 pm" 
src="https://cloud.githubusercontent.com/assets/1000778/18332029/210056ea-7518-11e6-9f52-76d96046c1c0.png";>
    
    2. Dynamic allocation is enabled.
    
    <img width="466" alt="screen shot 2016-09-07 at 4 25 30 pm" 
src="https://cloud.githubusercontent.com/assets/1000778/18332034/2c07700a-7518-11e6-8fce-aebe25014902.png";>
    
    Author: Shixiong Zhu <shixi...@databricks.com>
    
    Closes #15001 from zsxwing/fix-core-info.
    
    (cherry picked from commit 80d6655921bea9b1bb27c1d95c2b46654e7a8cca)
    Signed-off-by: Andrew Or <and...@databricks.com>

commit fef3ec151a67348ce05fbbec95b74a0a4fe1aa4b
Author: Davies Liu <dav...@databricks.com>
Date:   2016-09-19T18:49:03Z

    [SPARK-16439] [SQL] bring back the separator in SQL UI
    
    ## What changes were proposed in this pull request?
    
    Currently, the SQL metrics look like `number of rows: 111111111111`, and it's very hard to read how large the number is. A separator was added by #12425 but removed by #14142 because it rendered oddly in some locales (for example, pl_PL). This PR adds the separator back, but always uses "," as the separator, since the SQL UI is all in English.
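
    A hedged illustration of the locale issue (plain Java formatting, not the actual UI code): the grouping separator differs by locale, which is why the metric values are now formatted with an explicit "," rather than the JVM default locale's choice.

    ```scala
    import java.text.NumberFormat
    import java.util.Locale

    val n = 111111111111L
    println(NumberFormat.getIntegerInstance(Locale.US).format(n))               // 111,111,111,111
    println(NumberFormat.getIntegerInstance(new Locale("pl", "PL")).format(n))  // groups with spaces
    ```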
    
    ## How was this patch tested?
    
    Existing tests.
    
![metrics](https://cloud.githubusercontent.com/assets/40902/14573908/21ad2f00-030d-11e6-9e2c-c544f30039ea.png)
    
    Author: Davies Liu <dav...@databricks.com>
    
    Closes #15106 from davies/metric_sep.
    
    (cherry picked from commit e0632062635c37cbc77df7ebd2a1846655193e12)
    Signed-off-by: Davies Liu <davies....@gmail.com>

----

