[GitHub] spark pull request #19423: Branch 2.2

2017-10-03 Thread engineeyao
GitHub user engineeyao opened a pull request:

https://github.com/apache/spark/pull/19423

Branch 2.2

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19423.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19423


commit 9cbf39f1c74f16483865cd93d6ffc3c521e878a7
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2017-05-25T12:15:15Z

[SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth.

## What changes were proposed in this pull request?
Follow-up for #17218; some minor fixes for PySpark ```FPGrowth```.

## How was this patch tested?
Existing UT.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #18089 from yanboliang/spark-19281.

(cherry picked from commit 913a6bfe4b0eb6b80a03b858ab4b2767194103de)
Signed-off-by: Yanbo Liang <yblia...@gmail.com>

commit e01f1f222bcb7c469b1e1595e9338ed478d99894
Author: Yan Facai (颜发才) <facai@gmail.com>
Date:   2017-05-25T13:40:39Z

[SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark 
FPGrowth.

## What changes were proposed in this pull request?

Expose numPartitions (expert) param of PySpark FPGrowth.

## How was this patch tested?

+ [x] Pass all unit tests.

Author: Yan Facai (颜发才) <facai@gmail.com>

Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition.

(cherry picked from commit 139da116f130ed21481d3e9bdee5df4b8d7760ac)
Signed-off-by: Yanbo Liang <yblia...@gmail.com>
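
Together, these two commits shape the PySpark API roughly as in the sketch below; the toy transactions and parameter values are illustrative assumptions, not taken from the PRs:

```
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpgrowth-demo").getOrCreate()

# Toy transactions; the column name "items" matches FPGrowth's default itemsCol.
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b"]), (2, ["a", "c"])],
    ["id", "items"],
)

# numPartitions is the expert param exposed by SPARK-20768; when left unset,
# FP-Growth uses the input dataset's partitioning.
fp = FPGrowth(minSupport=0.5, minConfidence=0.6, numPartitions=4)
model = fp.fit(df)

model.freqItemsets.show()
model.associationRules.show()
```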

commit 022a4957d8dc8d6049e0a8c9191fcfd1bd95a4a4
Author: Lior Regev <liore...@gmail.com>
Date:   2017-05-25T16:08:19Z

[SPARK-20741][SPARK SUBMIT] Added cleanup of JARs archive generated by 
SparkSubmit

## What changes were proposed in this pull request?

Deleted generated JARs archive after distribution to HDFS

## How was this patch tested?

Please review http://spark.apache.org/contributing.html before opening a 
pull request.

Author: Lior Regev <liore...@gmail.com>

Closes #17986 from liorregev/master.

(cherry picked from commit 7306d556903c832984c7f34f1e8fe738a4b2343c)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 5ae1c652147aba9c5087335b0c6916a1035090b2
Author: hyukjinkwon <gurwls...@gmail.com>
Date:   2017-05-25T16:10:30Z

[SPARK-19707][SPARK-18922][TESTS][SQL][CORE] Fix test failures/the invalid 
path check for sc.addJar on Windows

## What changes were proposed in this pull request?

This PR proposes two things:

- A follow up for SPARK-19707 (Improving the invalid path check for 
sc.addJar on Windows as well).

```
org.apache.spark.SparkContextSuite:
 - add jar with invalid path *** FAILED *** (32 milliseconds)
   2 was not equal to 1 (SparkContextSuite.scala:309)
   ...
```

- Fix path vs URI related test failures on Windows.

```
org.apache.spark.storage.LocalDirsSuite:
 - SPARK_LOCAL_DIRS override also affects driver *** FAILED *** (0 
milliseconds)
   new java.io.File("/NONEXISTENT_PATH").exists() was true 
(LocalDirsSuite.scala:50)
   ...

 - Utils.getLocalDir() throws an exception if any temporary directory 
cannot be retrieved *** FAILED *** (15 milliseconds)
   Expected exception java.io.IOException to be thrown, but no exception 
was thrown. (LocalDirsSuite.scala:64)
   ...
```

```
org.apache.spark.sql.hive.HiveSchemaInferenceSuite:
 - orc: schema should be inferred and saved when INFER_AND_SAVE is 
specified *** FAILED *** (203 milliseconds)
   java.net.URISyntaxException: Illegal character in opaque part at index 
2: C:\projects\spark\target\tmp\spark-dae61ab3-a851-4dd3-bf4e-be97c501f254
   ...

 - parquet: schema should be inferred and saved when INFER_AND_SAVE is 
specified *** FAILED *** (203 milliseconds)
   java.net.URISyntaxException: Illegal character in opaque part at index 
2: C:\projects\spark\target\tmp\spark-fa3aff89-a66e-4376-9a37-2a9b87596939
   ...

 - orc: schema should be inferred but not stored when INFER_ONLY is 
specified *** FAILED *** (141 milliseconds)
   

[GitHub] spark pull request #19187: Branch 2.1

2017-09-11 Thread engineeyao
GitHub user engineeyao opened a pull request:

https://github.com/apache/spark/pull/19187

Branch 2.1

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19187.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19187


commit 62fab5beee147c90d8b7f8092b4ee76ba611ee8e
Author: uncleGen <husty...@gmail.com>
Date:   2017-02-07T05:03:20Z

[SPARK-19407][SS] defaultFS is used via FileSystem.get instead of getting the 
FileSystem from the URI scheme

## What changes were proposed in this pull request?

```
Caused by: java.lang.IllegalArgumentException: Wrong FS: 
s3a://**/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata, 
expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at 
org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:100)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
at 
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
```

This is easy to replicate on a Spark standalone cluster: set the checkpoint 
location to a URI with any scheme other than "file://" without overriding the 
default filesystem in the config.

Workaround: pass --conf spark.hadoop.fs.defaultFS=s3a://somebucket, or set it 
in sparkConf or spark-defaults.conf (see the sketch after this commit).

## How was this patch tested?

Existing unit tests.

Author: uncleGen <husty...@gmail.com>

Closes #16815 from uncleGen/SPARK-19407.

(cherry picked from commit 7a0a630e0f699017c7d0214923cd4aa0227e62ff)
Signed-off-by: Shixiong Zhu <shixi...@databricks.com>
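
For reference, the workaround above can be applied at session construction time. A minimal sketch, assuming a hypothetical s3a://somebucket (which requires the hadoop-aws bindings) and the built-in rate source; none of this is code from the PR:

```
from pyspark.sql import SparkSession

# Workaround from SPARK-19407: force fs.defaultFS so checkpoint metadata on
# s3a:// is read with the right FileSystem. Bucket and paths are placeholders.
spark = (
    SparkSession.builder
    .appName("checkpoint-scheme-demo")
    .config("spark.hadoop.fs.defaultFS", "s3a://somebucket")
    .getOrCreate()
)

# A checkpoint location whose scheme differs from the default FS used to
# trigger the "Wrong FS" error shown in the stack trace above.
query = (
    spark.readStream.format("rate").load()
    .writeStream.format("console")
    .option("checkpointLocation", "s3a://somebucket/checkpoint/demo")
    .start()
)
query.awaitTermination(10)
query.stop()
```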

commit dd1abef138581f30ab7a8dfacb616fe7dd64b421
Author: Aseem Bansal <anshban...@users.noreply.github.com>
Date:   2017-02-07T11:44:14Z

[SPARK-19444][ML][DOCUMENTATION] Fix imports not being present in 
documentation

## What changes were proposed in this pull request?

SPARK-19444 imports not being present in documentation

## How was this patch tested?

Manual

## Disclaimer

Contribution is original work and I license the work to the project under 
the project’s open source license

Author: Aseem Bansal <anshban...@users.noreply.github.com>

Closes #16789 from anshbansal/patch-1.

(cherry picked from commit aee2bd2c7ee97a58f0adec82ec52e5625b39e804)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit e642a07d57798f98b25ba08ed7ae3abe0f597941
Author: Tyson Condie <tcon...@gmail.com>
Date:   2017-02-07T22:31:23Z

[SPARK-18682][SS] Batch Source for Kafka

Today, you can start a stream that reads from Kafka. However, given Kafka's 
configurable retention period, sometimes you might just want to read all of 
the data that is available now, so we should add a version that works with 
spark.read as well.

The options should be the same as for the streaming Kafka source, with the 
following differences:

- startingOffsets should default to earliest, and should not allow latest 
(which would always be empty).
- endingOffsets should also be allowed and should default to latest; the same 
assign JSON format as for startingOffsets should also be accepted.

It would be really good if things like .limit(n) were enough to prevent all 
the data from being read (this might just work).

KafkaRelationSuite was added for testing batch queries via KafkaUtils.

Author: Tyson Condie <tcon...@gmail.com>
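
A batch read against this source looks roughly like the sketch below; this is a minimal, assumed example (placeholder broker and topic names, and it requires the spark-sql-kafka package on the classpath), not code from the PR:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-batch-demo").getOrCreate()

# Batch (non-streaming) read: per the description above, startingOffsets
# defaults to "earliest" and endingOffsets to "latest", so omitting both
# reads everything currently retained in the topic.
df = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("subscribe", "topic1")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load()
)

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()
```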