[GitHub] spark issue #15788: [SPARK-18291][SparkR][ML] SparkR glm predict should outp...

2016-11-21 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/15788
  
@jkbradley Thanks for your comments. I fully understand and have carefully 
considered your concerns. I totally agree that we should make ```spark.glm``` match 
R's ```glm``` as much as possible. However, I found that the prediction output of 
```spark.glm``` for the binomial family is meaningless if we only provide the 
probability value.

Let's look at how R ```glm``` works for the binomial family (with string 
labels):
```R
data <- iris[iris$Species %in% c("versicolor", "virginica"), ]
model <- glm(Species ~ ., data = data, family = binomial(link = "logit"))
predict(model, data, type = "response")
f <- factor(data$Species)
prediction <- round(predict(model, data, type = "response"))
factor(prediction, labels = levels(f))
```
The label is a string, and R ```glm``` automatically encodes it into a factor 
when training. The default prediction output is on the link scale (log-odds), but 
we can convert it into a probability by specifying ```type = "response"```; 
rounding that probability then yields predictions in the form ```0``` and ```1```.
Moreover, the native R ```factor``` function exposes the mapping between the string 
values and the converted numeric values. For example, ```versicolor``` maps to 
```0``` and ```virginica``` maps to ```1```, so users can convert the numeric 
prediction values into string values and vice versa.

In Spark, we use ```StringIndexer``` to encode the label and 
```IndexToString``` to convert it back. Both are wrapped inside ```RFormula```, and 
SparkR users cannot access them directly, which means they cannot obtain the map 
between string labels and numeric labels. I think getting the original label as the 
prediction value is one of the most important use cases for SparkR users, so I 
proposed this change.
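For reference, a minimal Scala sketch of the label round-trip that ```RFormula``` performs internally, using the public ```StringIndexer```/```IndexToString``` APIs (```df``` and the column names here are illustrative):
```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// Learn the string-to-index mapping, e.g. versicolor -> 0.0, virginica -> 1.0.
val indexerModel = new StringIndexer()
  .setInputCol("species")
  .setOutputCol("label")
  .fit(df)

// ... train a classifier that emits a numeric "prediction" column ...

// Map the numeric predictions back to the original string labels.
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedSpecies")
  .setLabels(indexerModel.labels)
```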
Another option is to expose ```spark.factor``` as a function for SparkR 
users, so they could use it like ```factor``` in native R. However, the internals 
of ```RFormula``` (which the implementation of ```spark.factor``` would closely 
depend on) are private to SparkR users, and we are not ready to make them public 
(we are still making improvements to ```RFormula```; see 
[SPARK-15540](https://issues.apache.org/jira/browse/SPARK-15540)), so I gave up on 
this option.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15960: [SPARK-18521] Add `NoRedundantStringInterpolator` Scala ...

2016-11-21 Thread weiqingy
Github user weiqingy commented on the issue:

https://github.com/apache/spark/pull/15960
  
Hi @rxin @srowen, thanks for the prompt feedback and suggestions. Yes, I 
understand and agree with your concerns. My motivation for creating this PR is 
that redundant string interpolators are unnecessary and misleading, so cleaning 
them up makes the code clearer and compilation a little faster.
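For illustration, a minimal example of the redundancy (variable names are mine):
```scala
// The `s` prefix does nothing when the literal contains no `$` placeholder;
// the plain literal produces the same value without the interpolation machinery.
val redundant = s"no placeholders here"
val plain = "no placeholders here"
```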

How about carrying out this PR in a long-term, step-by-step manner, module by 
module? (The `NoRedundantStringInterpolator` Scala style rule would be added in 
the final step.)

Spark 2.1 (~ Nov 2016): Mesos (2 files), Yarn (4 files), External (4 files), GraphX (5 files)
Spark 2.2 (~ Mar 2017): Hive-Thrift-Server (4 files), Streaming (8 files), Examples (18 files)
Spark 2.3 (~ Jul 2017): Hive (28 files), Catalyst (19 files)
Spark 2.4 (~ Nov 2017): SQL Core (47 files)
Spark 2.5 (~ 2018): Core (54 files), MLlib (69 files)

If this is OK with you, I can create PRs based on the schedule above (and yes, 
I will need to fix the Spark unit test failures). If not, I can just close this PR 
and SPARK-18521. 😊😊 Thanks.





[GitHub] spark issue #15968: [SPARK-18533] Raise correct error upon specification of ...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15968
  
**[Test build #68983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68983/consoleFull)** for PR 15968 at commit [`dbdbd1e`](https://github.com/apache/spark/commit/dbdbd1ed2a678edc34abe6f4a6808574ff6ab038).





[GitHub] spark pull request #15968: [SPARK-18533] Raise correct error upon specificat...

2016-11-21 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/15968#discussion_r89054853
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala ---
@@ -249,4 +249,13 @@ class CreateTableAsSelectSuite
   }
 }
   }
+
+  test("specifying the column list for CTAS") {
+    withTable("t") {
+      val e = intercept[AnalysisException] {
--- End diff --

@cloud-fan Thanks!! I have made the change.





[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15951
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15951
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68979/
Test PASSed.





[GitHub] spark issue #15952: [SPARK-18514][DOCS] Fix the markdown for `Note:`/`NOTE:`...

2016-11-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15952
  
Oh, I see. I didn't notice. Thank you both!





[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15951
  
**[Test build #68979 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68979/consoleFull)** for PR 15951 at commit [`08566e7`](https://github.com/apache/spark/commit/08566e72d04f7f2334d676211caca8bc0ae99290).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89053783
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala ---
@@ -100,99 +100,30 @@ object AnalyzeColumnCommand extends Logging {
       exprOption.getOrElse(throw new AnalysisException(s"Invalid column name: $col."))
     }).toSeq
 
+    // Make sure the column types are supported for stats gathering.
+    attributesToAnalyze.foreach { attr =>
+      if (!ColumnStat.supportsType(attr.dataType)) {
+        throw new AnalysisException(
+          s"Data type ${attr.dataType.simpleString} for column ${attr.name} is not supported " +
--- End diff --

Yes. The command can be run in the context of a pipeline with other commands. 
Say a job is writing to `n` tables and then populating stats for each of them. 
If one of the stats-population queries fails, the user would have to dig through 
the logs to find out which table caused the problem. If the exception message 
carries ample information, one can simply look at the failure message of the 
container where the driver ran, without needing to scroll through logs.





[GitHub] spark issue #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regression sho...

2016-11-21 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/15910
  
I am traveling now. I will address the comments ASAP. Thanks!





[GitHub] spark pull request #15717: [SPARK-17910][SQL] Allow users to update the comm...

2016-11-21 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15717#discussion_r89053215
  
--- Diff: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ---
@@ -93,6 +93,8 @@ statement
         SET TBLPROPERTIES tablePropertyList                            #setTableProperties
     | ALTER (TABLE | VIEW) tableIdentifier
         UNSET TBLPROPERTIES (IF EXISTS)? tablePropertyList             #unsetTableProperties
+    | ALTER TABLE tableIdentifier
+        CHANGE COLUMN? expandColTypeList                               #changeColumns
--- End diff --

I think we are following the Hive DDL syntax with the following definition:
```
colType
: identifier dataType (COMMENT STRING)?
;

expandColTypeList
: expandColType (',' expandColType)*
;

expandColType
: identifier colType colPosition?
;
```
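For illustration, a statement this rule is intended to accept might look like the following (table and column names are hypothetical, and the exact accepted form is defined by the grammar above):
```scala
// old-name, then new-name plus type, optional comment, optional position:
spark.sql("ALTER TABLE t CHANGE COLUMN a a INT COMMENT 'updated comment' AFTER b")
```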





[GitHub] spark issue #15930: [SPARK-18501][ML][SparkR] Fix spark.glm errors when fitt...

2016-11-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15930
  
LGTM





[GitHub] spark issue #15976: [SPARK-18403][SQL] Fix unsafe data false sharing issue i...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15976
  
**[Test build #68982 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68982/consoleFull)** for PR 15976 at commit [`4b88eed`](https://github.com/apache/spark/commit/4b88eed0f5f2cb34d217d833dbe085f9d0277a94).





[GitHub] spark pull request #15976: [SPARK-18403][SQL] Fix unsafe data false sharing ...

2016-11-21 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/15976

[SPARK-18403][SQL] Fix unsafe data false sharing issue in ObjectHashAggregateExec

## What changes were proposed in this pull request?

This PR fixes a random OOM issue that occurred while running 
`ObjectHashAggregateSuite`.

This issue can be steadily reproduced under the following conditions:

1. The aggregation must be evaluated using `ObjectHashAggregateExec`;
2. There must be an input column whose data type involves `ArrayType` (an 
input column of `MapType` may even cause SIGSEGV);
3. Sort-based aggregation fallback must be triggered during evaluation.

The root cause is that when falling back to sort-based aggregation, we must 
sort the already-evaluated partial aggregation buffers living in the hash map and 
feed them to the sort-based aggregator through an external sorter. However, the 
underlying mutable byte buffer of the `UnsafeRow`s produced by the external 
sorter's iterator is reused and may get overwritten as the iterator steps forward. 
After the last entry is consumed, the byte buffer points to a block of 
uninitialized memory filled with the byte `5a`. Therefore, while reading an 
`UnsafeArrayData` out of the `UnsafeRow`, `5a5a5a5a` is treated as the array size, 
which triggers a memory allocation for a ridiculously large array and immediately 
blows up the JVM with an OOM.

To fix this issue, we only need to add `.copy()` accordingly; the 
self-contained sketch below illustrates the pitfall.
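A minimal sketch of the reuse pitfall in isolation (these are illustrative classes, not Spark's actual sorter):
```scala
// An iterator that reuses one mutable buffer across next() calls, like the
// external sorter's UnsafeRow iterator does.
class ReusingIterator(data: Seq[Array[Byte]]) extends Iterator[Array[Byte]] {
  private val buf = new Array[Byte](4)
  private val underlying = data.iterator
  def hasNext: Boolean = underlying.hasNext
  def next(): Array[Byte] = {
    System.arraycopy(underlying.next(), 0, buf, 0, 4)
    buf  // every call returns the same mutable buffer
  }
}

val data = Seq(Array[Byte](1, 1, 1, 1), Array[Byte](2, 2, 2, 2))
// Broken: both retained elements alias the same buffer, so the first row
// is silently overwritten by the second.
val aliased = new ReusingIterator(data).toArray
// Fixed: copy each row before retaining it, analogous to adding .copy().
val copied = new ReusingIterator(data).map(_.clone()).toArray
```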

## How was this patch tested?

New regression test case added in `ObjectHashAggregateSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark investigate-oom

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15976.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15976


commit 4b88eed0f5f2cb34d217d833dbe085f9d0277a94
Author: Cheng Lian 
Date:   2016-11-21T23:41:23Z

Fix unsafe data false sharing issue in ObjectHashAggregateExec







[GitHub] spark issue #15952: [SPARK-18514][DOCS] Fix the markdown for `Note:`/`NOTE:`...

2016-11-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15952
  
LGTM





[GitHub] spark issue #15952: [SPARK-18514][DOCS] Fix the markdown for `Note:`/`NOTE:`...

2016-11-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15952
  
> context.R, RDD.R and pairRDD.R

Mostly, but not all: there are a few public and documented APIs in context.R, 
for instance.
Basically, if it has a `@noRd` tag, it will not be documented.







[GitHub] spark issue #15888: [SPARK-18444][SPARKR] SparkR running in yarn-cluster mod...

2016-11-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15888
  
LGTM





[GitHub] spark issue #15966: [SPARK-18413][SQL][FOLLOW-UP] Use `numPartitions` instea...

2016-11-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15966
  
I found a way to verify the coalesce logic of JDBC writing. See my PR: 
https://github.com/apache/spark/pull/15975, which added `numPartitions` to 
`JDBCRelation`.

With minor code changes, you can see the adjusted `numPartitions` in the 
output of `EXPLAIN`:
```Scala
sql("INSERT INTO TABLE PEOPLE1 SELECT * FROM PEOPLE").explain(true)
```





[GitHub] spark issue #15969: [SPARK-18530][SS][KAFKA]Change Kafka timestamp column ty...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15969
  
**[Test build #68980 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68980/consoleFull)** for PR 15969 at commit [`e977808`](https://github.com/apache/spark/commit/e977808fa61c7e8c6b8ad84bb704b8c607b96236).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15969: [SPARK-18530][SS][KAFKA]Change Kafka timestamp column ty...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15969
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68980/
Test PASSed.





[GitHub] spark issue #15969: [SPARK-18530][SS][KAFKA]Change Kafka timestamp column ty...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15969
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89048248
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
@@ -58,60 +61,127 @@ case class Statistics(
   }
 }
 
+
 /**
- * Statistics for a column.
+ * Statistics collected for a column.
+ *
+ * 1. Supported data types are defined in `ColumnStat.supportsType`.
+ * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
+ * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
+ * TimestampType we store java.sql.Timestamp.
+ * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
+ *
+ * @param ndv number of distinct values
--- End diff --

It's from Hive - but I agree spelling it out fully would be better.






[GitHub] spark issue #15975: Fix Concurrent Table Fetching Using DataFrameReader JDBC...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15975
  
**[Test build #68981 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68981/consoleFull)** for PR 15975 at commit [`bcc86c0`](https://github.com/apache/spark/commit/bcc86c0395ddc24cb629f46af9f985bdff0387a6).





[GitHub] spark pull request #15975: Fix Concurrent Table Fetching Using DataFrameRead...

2016-11-21 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/15975

Fix Concurrent Table Fetching Using DataFrameReader JDBC APIs

### What changes were proposed in this pull request?
The following two `DataFrameReader` JDBC APIs ignore the user-specified 
degree-of-parallelism parameters.

```Scala
  def jdbc(
  url: String,
  table: String,
  columnName: String,
  lowerBound: Long,
  upperBound: Long,
  numPartitions: Int,
  connectionProperties: Properties): DataFrame
```

```Scala
  def jdbc(
  url: String,
  table: String,
  predicates: Array[String],
  connectionProperties: Properties): DataFrame
```

This PR fixes these issues. To verify the behavior is correct, we improve the 
plan output of the `EXPLAIN` command by adding `numPartitions` to the 
`JDBCRelation` node.

Before the fix:
```
== Physical Plan ==
*Scan JDBCRelation(TEST.PEOPLE) [NAME#1896,THEID#1897] ReadSchema: struct
```

After the fix:
```
== Physical Plan ==
*Scan JDBCRelation(TEST.PEOPLE) [numPartitions=3] [NAME#1896,THEID#1897] ReadSchema: struct
```
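As an illustration of how the first overload would be exercised after the fix (the connection details here are hypothetical):
```scala
// Read TEST.PEOPLE partitioned on THEID into 3 partitions; the EXPLAIN
// output should now show JDBCRelation(...) [numPartitions=3].
val df = spark.read.jdbc(
  "jdbc:h2:mem:testdb", "TEST.PEOPLE", "THEID",
  lowerBound = 1L, upperBound = 100L, numPartitions = 3,
  connectionProperties = new java.util.Properties())
df.explain(true)
```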
### How was this patch tested?
Added verification logic to all the test cases for JDBC concurrent fetching.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark jdbc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15975.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15975


commit bcc86c0395ddc24cb629f46af9f985bdff0387a6
Author: gatorsmile 
Date:   2016-11-22T05:49:42Z

fix.







[GitHub] spark issue #15969: [SPARK-18530][SS][KAFKA]Change Kafka timestamp column ty...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15969
  
**[Test build #68980 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68980/consoleFull)** for PR 15969 at commit [`e977808`](https://github.com/apache/spark/commit/e977808fa61c7e8c6b8ad84bb704b8c607b96236).





[GitHub] spark pull request #15935: [SPARK-18188] add checksum for blocks of broadcas...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15935#discussion_r89046108
  
--- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -85,10 +86,21 @@ private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long)
   /** Total number of blocks this broadcast variable contains. */
   private val numBlocks: Int = writeBlocks(obj)
 
+  /** The checksum for all the blocks. */
+  private var checksums: Array[Int] = _
+
   override protected def getValue() = {
     _value
   }
 
+  private def caclChecksum(block: ByteBuffer): Int = {
--- End diff --

`calcChecksum` ?
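(For context, one plausible shape for such a block-checksum helper, assuming `java.util.zip.Adler32`; the PR's actual implementation may differ.)
```scala
import java.nio.ByteBuffer
import java.util.zip.Adler32

// Checksums a block without disturbing the caller's buffer position.
def calcChecksum(block: ByteBuffer): Int = {
  val adler = new Adler32()
  val duplicated = block.duplicate()  // independent position/limit, shared content
  val bytes = new Array[Byte](duplicated.remaining())
  duplicated.get(bytes)
  adler.update(bytes)
  adler.getValue.toInt
}
```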





[GitHub] spark issue #15946: [SPARK-18513][Structured Streaming] Record and recover w...

2016-11-21 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/15946
  
Ah, I'll keep an eye on that! @zsxwing thanks for the notification.





[GitHub] spark issue #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false work w...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15820
  
**[Test build #3431 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3431/consoleFull)** for PR 15820 at commit [`1b8d56e`](https://github.com/apache/spark/commit/1b8d56e8e87fc4e3fd48b44ae3ded0d6255d2967).





[GitHub] spark pull request #15870: [SPARK-18425][Structured Streaming][Tests] Test `...

2016-11-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15870





[GitHub] spark issue #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false work w...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15820
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15974: [SPARK-18537] [Web UI]Add a REST api to spark streaming

2016-11-21 Thread ChorPangChan
Github user ChorPangChan commented on the issue:

https://github.com/apache/spark/pull/15974
  
Hi @tdas,
per our conversation on the spark-dev channel, I have finished the implementation.
Could you please take a look?





[GitHub] spark issue #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false work w...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15820
  
**[Test build #68978 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68978/consoleFull)** for PR 15820 at commit [`1b8d56e`](https://github.com/apache/spark/commit/1b8d56e8e87fc4e3fd48b44ae3ded0d6255d2967).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false work w...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15820
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68978/
Test FAILed.





[GitHub] spark issue #15870: [SPARK-18425][Structured Streaming][Tests] Test `Compact...

2016-11-21 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15870
  
LGTM. Thanks! Merging to master and 2.1.





[GitHub] spark issue #15946: [SPARK-18513][Structured Streaming] Record and recover w...

2016-11-21 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15946
  
@lw-lin thanks for doing this. However, #15949 also includes the fix for 
this issue.





[GitHub] spark issue #15974: [SPARK-18537] [Web UI]Add a REST api to spark streaming

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15974
  
Can one of the admins verify this patch?





[GitHub] spark pull request #15974: [SPARK-18537] Add a REST api to spark streaming

2016-11-21 Thread ChorPangChan
GitHub user ChorPangChan opened a pull request:

https://github.com/apache/spark/pull/15974

[SPARK-18537] Add a REST API to Spark Streaming

## What changes were proposed in this pull request?

1. Implement a package (org.apache.spark.streaming.status.api.v1) that serves 
the same purpose as org.apache.spark.status.api.v1.
1. Register the API path through StreamingPage.
1. Retrieve the streaming information through StreamingJobProgressListener.

This API should cover exactly the same information as the web interface. The 
implementation is based on the current REST implementation of spark-core and 
will be available for running applications only.

https://issues.apache.org/jira/browse/SPARK-18537

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ChorPangChan/spark stream-api-dev

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15974.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15974


commit 4ac2fd0767345e1ac31a15d726a1d5542c885d43
Author: Chan Chor Pang 
Date:   2016-10-26T05:36:44Z

compile ok, try to test

commit 328e77d4a1e248f5fb2dee707113da21a04fa302
Author: Chan Chor Pang 
Date:   2016-10-26T07:39:40Z

add path /streamingapi

commit 0ce5054a35b0f34ee499d7b1e6c12d310450cff0
Author: Chan Chor Pang 
Date:   2016-10-28T05:42:40Z

need attach to some where

commit 2de6b4adcf24e5c40956e1b69b8f7dc7fe796fdc
Author: Chan Chor Pang 
Date:   2016-11-08T06:05:24Z

remove unuse file

commit f1301b03abf26ce7ea338176c66bbb72c16d86d7
Author: Chan Chor Pang 
Date:   2016-11-02T02:10:03Z

no writer yet

commit 25147cdc4e7bfa0e428d4b7a3d23ace3a3d8c95a
Author: Chan Chor Pang 
Date:   2016-11-02T05:13:48Z

not work, may be the data need to be in Iterator form

commit fd5afced470b9c14c348f30996ddd20081798794
Author: Chan Chor Pang 
Date:   2016-11-02T06:18:17Z

package name didnt change in the copy process

commit dbb317fb0f394e6afa11ce9f162e11858f1d601e
Author: Chan Chor Pang 
Date:   2016-11-07T04:35:43Z

try to get the real info

commit 9f488bc8028b7fcff33937399782988bf0383a16
Author: saturday_s 
Date:   2016-11-14T09:51:02Z

Refactor to fit scalastyle.

commit 77f1d189fe74c02e7a31611c66b90c23bab8f2a5
Author: saturday_s 
Date:   2016-11-16T04:40:24Z

Try to get startTime.

commit bc16ed8af007753f96151db5ad8bd3329788700c
Author: saturday_s 
Date:   2016-11-16T04:53:41Z

Change api path prefix.

commit 6174419bfff201623dfa7525f518586c9271e2aa
Author: saturday_s 
Date:   2016-11-16T09:13:59Z

Implement statistics api.

commit dea1e62607e35dee579fdb1fc050c631f41291ad
Author: saturday_s 
Date:   2016-11-17T02:59:58Z

Implement receivers api.

commit db4e78fa02f40aff61b994e92dda07b92f58af43
Author: saturday_s 
Date:   2016-11-17T04:46:13Z

Fix last-error-info format.

commit 08fb3c8623e2289bd6689ed5a8652184a7022350
Author: saturday_s 
Date:   2016-11-17T05:08:30Z

Implement one-receiver api.

commit 083d2477eeb8a0823fc3f676b2d3ebb5b6629fcb
Author: saturday_s 
Date:   2016-11-17T05:21:39Z

Fix access level issue of `ErrorWrapper`.

commit ecccd94356ada9dcff220dfe6b7275247a6f40e2
Author: saturday_s 
Date:   2016-11-18T01:30:30Z

Synchronize to listener when getting info from it.

commit 1fe8c24096d6a8b84e6c5c870cffbaa401ddba40
Author: saturday_s 
Date:   2016-11-18T05:30:15Z

Implement batch(es) api.

commit 9758774c450dcaf8802d09a0f03251e78bdfd73b
Author: saturday_s 
Date:   2016-11-18T06:55:42Z

Remove details of outputOpsInfo from batchInfo.

commit 53e36293e1a4a881f574018511ea37c068a4afc4
Author: saturday_s 
Date:   2016-11-18T08:35:55Z

Implement outputOpsInfo api.

commit 402299de15966927358770901420f5311aa5d92b
Author: saturday_s 
Date:   2016-11-18T09:37:04Z

Try another approach to get outputOpsInfo.

commit 7511416b03497088a3dc98b865ff5a9c4a889d6a
Author: saturday_s 
Date:   2016-11-21T02:03:55Z

Try another more approach to get outputOpsInfo.

commit d8d847483f9c12cc83c9546936f2f64a676502d0
Author: saturday_s 
Date:   2016-11-21T02:41:25Z

Continue trying to get outputOpsInfo(jobIds).

commit 

[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15951
  
**[Test build #68979 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68979/consoleFull)** for PR 15951 at commit [`08566e7`](https://github.com/apache/spark/commit/08566e72d04f7f2334d676211caca8bc0ae99290).





[GitHub] spark issue #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false work w...

2016-11-21 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15820
  
@tdas I made some changes to stabilize the stress test and ran it for 20 
minutes without errors. I also confirmed that the warning logs did appear in the 
unit test logs.





[GitHub] spark issue #15820: [SPARK-18373][SS][Kafka]Make failOnDataLoss=false work w...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15820
  
**[Test build #68978 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68978/consoleFull)** for PR 15820 at commit [`1b8d56e`](https://github.com/apache/spark/commit/1b8d56e8e87fc4e3fd48b44ae3ded0d6255d2967).





[GitHub] spark issue #15966: [SPARK-18413][SQL][FOLLOW-UP] Use `numPartitions` instea...

2016-11-21 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15966
  
Thank you for the review, @cloud-fan.

With the same parameter name `numPartitions` for read and write, we will use 
the same parallelism by default, which is easy to use.

To use different parallelisms for read and write, we can define different view 
names. In the following example, t1 uses `numPartitions=1` and t2 uses 
`numPartitions=2`. The example is for writing, but the situation is the same for 
reads.

```scala
sql("CREATE OR REPLACE TEMPORARY VIEW data USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 'data', user 'root', password '')")
sql("CREATE OR REPLACE TEMPORARY VIEW t1 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', numPartitions '1')")
sql("CREATE OR REPLACE TEMPORARY VIEW t2 USING org.apache.spark.sql.jdbc OPTIONS (url 'jdbc:mysql://localhost:3306/t', dbtable 't1', user 'root', password '', numPartitions '2')")
sql("INSERT OVERWRITE TABLE t1 SELECT a FROM data GROUP BY a")
sql("INSERT OVERWRITE TABLE t2 SELECT a FROM data GROUP BY a")
```





[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15951
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68976/
Test FAILed.





[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15951
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15951
  
**[Test build #68976 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68976/consoleFull)** for PR 15951 at commit [`6f741b6`](https://github.com/apache/spark/commit/6f741b617573613b62d6fe2f0c7eca1cbf66f660).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89034281
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
@@ -58,60 +61,127 @@ case class Statistics(
   }
 }
 
+
 /**
- * Statistics for a column.
+ * Statistics collected for a column.
+ *
+ * 1. Supported data types are defined in `ColumnStat.supportsType`.
+ * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
+ * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
+ * TimestampType we store java.sql.Timestamp.
+ * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
+ *
+ * @param ndv number of distinct values
+ * @param min minimum value
+ * @param max maximum value
+ * @param numNulls number of nulls
+ * @param avgLen average length of the values. For fixed-length types, this should be a constant.
+ * @param maxLen maximum length of the values. For fixed-length types, this should be a constant.
  */
-case class ColumnStat(statRow: InternalRow) {
+// TODO: decide if we want to use bigint to represent ndv and numNulls.
+case class ColumnStat(
+ndv: Long,
+min: Any,
+max: Any,
+numNulls: Long,
+avgLen: Long,
+maxLen: Long) {
 
-  def forNumeric[T <: AtomicType](dataType: T): NumericColumnStat[T] = {
-NumericColumnStat(statRow, dataType)
-  }
-  def forString: StringColumnStat = StringColumnStat(statRow)
-  def forBinary: BinaryColumnStat = BinaryColumnStat(statRow)
-  def forBoolean: BooleanColumnStat = BooleanColumnStat(statRow)
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the field (e.g. "ndv" or "min"), and the value is the string
+   * representation for the value. The deserialization side is defined in [[ColumnStat.fromMap]].
+   *
+   * As part of the protocol, the returned map always contains a key called "version".
+   */
+  def toMap: Map[String, String] = Map(
+"version" -> "1",
+"ndv" -> ndv.toString,
+"min" -> min.toString,
+"max" -> max.toString,
+"numNulls" -> numNulls.toString,
+"avgLen" -> avgLen.toString,
+"maxLen" -> maxLen.toString
+  )
+}
+
+
+object ColumnStat extends Logging {
 
-  override def toString: String = {
-// use Base64 for encoding
-Base64.encodeBase64String(statRow.asInstanceOf[UnsafeRow].getBytes)
+  /** Returns true iff we support gathering column statistics on columns of the given type. */
+  def supportsType(dataType: DataType): Boolean = dataType match {
+case _: NumericType | TimestampType | DateType | BooleanType => true
+case StringType | BinaryType => true
+case _ => false
   }
-}
 
-object ColumnStat {
-  def apply(numFields: Int, str: String): ColumnStat = {
-// use Base64 for decoding
-val bytes = Base64.decodeBase64(str)
-val unsafeRow = new UnsafeRow(numFields)
-unsafeRow.pointTo(bytes, bytes.length)
-ColumnStat(unsafeRow)
+  /**
+   * Creates a [[ColumnStat]] object from the given map. This is used to deserialize column stats
+   * from some external storage. The serialization side is defined in [[ColumnStat.toMap]].
+   */
+  def fromMap(dataType: DataType, map: Map[String, String]): Option[ColumnStat] = {
+val str2val: (String => Any) = dataType match {
+  case _: IntegralType => _.toLong
+  case _: DecimalType => Decimal(_)
+  case DoubleType | FloatType => _.toDouble
+  case BooleanType => _.toBoolean
+  case _ => identity
+}
+
+try {
+  Some(ColumnStat(
+ndv = map("ndv").toLong,
+min = str2val(map.get("min").orNull),
+max = str2val(map.get("max").orNull),
+numNulls = map("numNulls").toLong,
+avgLen = map.getOrElse("avgLen", "1").toLong,
+maxLen = map.getOrElse("maxLen", "1").toLong
+  ))
+} catch {
+  case NonFatal(e) =>
+logWarning("Failed to parse column statistics", e)
+None
+}
   }
-}
 
-case class NumericColumnStat[T <: AtomicType](statRow: InternalRow, dataType: T) {
-  // The indices here must be consistent with `ColumnStatStruct.numericColumnStat`.
-  val numNulls: Long = statRow.getLong(0)
-  val max: T#InternalType = statRow.get(1, dataType).asInstanceOf[T#InternalType]
-  val min: T#InternalType = statRow.get(2, dataType).asInstanceOf[T#InternalType]
-  val ndv: Long = statRow.getLong(3)
-}

[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89039895
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
@@ -58,60 +61,127 @@ case class Statistics(
   }
 }
 
+
 /**
- * Statistics for a column.
+ * Statistics collected for a column.
+ *
+ * 1. Supported data types are defined in `ColumnStat.supportsType`.
+ * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
+ * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
+ * TimestampType we store java.sql.Timestamp.
+ * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
+ *
+ * @param ndv number of distinct values
+ * @param min minimum value
+ * @param max maximum value
+ * @param numNulls number of nulls
+ * @param avgLen average length of the values. For fixed-length types, this should be a constant.
+ * @param maxLen maximum length of the values. For fixed-length types, this should be a constant.
  */
-case class ColumnStat(statRow: InternalRow) {
+// TODO: decide if we want to use bigint to represent ndv and numNulls.
+case class ColumnStat(
--- End diff --

Can you add some basic sanity checks? E.g.:
- `max >= min`
- `maxLen >= avgLen`
- `if (ndv == 1) then min == max`

Floats / decimals might behave badly, but it's good to check before anyone 
consumes these stats. A rough sketch of such checks follows below.
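(Illustrative only: one possible shape for these checks inside `ColumnStat`; comparing `min`/`max` needs a type-specific ordering, which is elided here.)
```scala
// Hypothetical sanity checks, not part of the PR:
require(ndv >= 0, s"ndv must be non-negative, got $ndv")
require(numNulls >= 0, s"numNulls must be non-negative, got $numNulls")
require(maxLen >= avgLen, s"maxLen ($maxLen) must be >= avgLen ($avgLen)")
if (ndv == 1) require(min == max, "a single distinct value implies min == max")
```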





[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89030028
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
@@ -58,60 +61,127 @@ case class Statistics(
   }
 }
 
+
 /**
- * Statistics for a column.
+ * Statistics collected for a column.
+ *
+ * 1. Supported data types are defined in `ColumnStat.supportsType`.
+ * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
+ * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
+ * TimestampType we store java.sql.Timestamp.
+ * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
+ *
+ * @param ndv number of distinct values
--- End diff --

nit: `ndv` sounds weird. Why not use `numDistinctVals`?





[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89033107
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
@@ -58,60 +61,127 @@ case class Statistics(
   }
 }
 
+
 /**
- * Statistics for a column.
+ * Statistics collected for a column.
+ *
+ * 1. Supported data types are defined in `ColumnStat.supportsType`.
+ * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
+ * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
+ * TimestampType we store java.sql.Timestamp.
+ * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
+ *
+ * @param ndv number of distinct values
+ * @param min minimum value
+ * @param max maximum value
+ * @param numNulls number of nulls
+ * @param avgLen average length of the values. For fixed-length types, this should be a constant.
+ * @param maxLen maximum length of the values. For fixed-length types, this should be a constant.
 */
-case class ColumnStat(statRow: InternalRow) {
+// TODO: decide if we want to use bigint to represent ndv and numNulls.
+case class ColumnStat(
+    ndv: Long,
+    min: Any,
+    max: Any,
+    numNulls: Long,
+    avgLen: Long,
+    maxLen: Long) {
 
-  def forNumeric[T <: AtomicType](dataType: T): NumericColumnStat[T] = {
-    NumericColumnStat(statRow, dataType)
-  }
-  def forString: StringColumnStat = StringColumnStat(statRow)
-  def forBinary: BinaryColumnStat = BinaryColumnStat(statRow)
-  def forBoolean: BooleanColumnStat = BooleanColumnStat(statRow)
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the field (e.g. "ndv" or "min"), and the value is the string
+   * representation for the value. The deserialization side is defined in [[ColumnStat.fromMap]].
+   *
+   * As part of the protocol, the returned map always contains a key called "version".
+   */
+  def toMap: Map[String, String] = Map(
+    "version" -> "1",
+    "ndv" -> ndv.toString,
+    "min" -> min.toString,
+    "max" -> max.toString,
+    "numNulls" -> numNulls.toString,
+    "avgLen" -> avgLen.toString,
+    "maxLen" -> maxLen.toString
+  )
+}
+
+
+object ColumnStat extends Logging {
 
-  override def toString: String = {
-    // use Base64 for encoding
-    Base64.encodeBase64String(statRow.asInstanceOf[UnsafeRow].getBytes)
+  /** Returns true iff we support gathering column statistics on a column of the given type. */
+  def supportsType(dataType: DataType): Boolean = dataType match {
+    case _: NumericType | TimestampType | DateType | BooleanType => true
+    case StringType | BinaryType => true
+    case _ => false
   }
-}
 
-object ColumnStat {
-  def apply(numFields: Int, str: String): ColumnStat = {
-    // use Base64 for decoding
-    val bytes = Base64.decodeBase64(str)
-    val unsafeRow = new UnsafeRow(numFields)
-    unsafeRow.pointTo(bytes, bytes.length)
-    ColumnStat(unsafeRow)
+  /**
+   * Creates a [[ColumnStat]] object from the given map. This is used to deserialize column stats
+   * from some external storage. The serialization side is defined in [[ColumnStat.toMap]].
+   */
+  def fromMap(dataType: DataType, map: Map[String, String]): Option[ColumnStat] = {
+    val str2val: (String => Any) = dataType match {
+      case _: IntegralType => _.toLong
+      case _: DecimalType => Decimal(_)
+      case DoubleType | FloatType => _.toDouble
+      case BooleanType => _.toBoolean
+      case _ => identity
+    }
+
+    try {
+      Some(ColumnStat(
+        ndv = map("ndv").toLong,
+        min = str2val(map.get("min").orNull),
+        max = str2val(map.get("max").orNull),
+        numNulls = map("numNulls").toLong,
+        avgLen = map.getOrElse("avgLen", "1").toLong,
+        maxLen = map.getOrElse("maxLen", "1").toLong
+      ))
+    } catch {
+      case NonFatal(e) =>
+        logWarning("Failed to parse column statistics", e)
--- End diff --

Is it possible to log the name of the table, the column name, `dataType`, and `map`?




[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89034659
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala ---
@@ -100,99 +100,30 @@ object AnalyzeColumnCommand extends Logging {
       exprOption.getOrElse(throw new AnalysisException(s"Invalid column name: $col."))
     }).toSeq
 
+    // Make sure the column types are supported for stats gathering.
+    attributesToAnalyze.foreach { attr =>
+      if (!ColumnStat.supportsType(attr.dataType)) {
+        throw new AnalysisException(
+          s"Data type ${attr.dataType.simpleString} for column ${attr.name} is not supported " +
--- End diff --

nit: please include the table name in the message.





[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89038988
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
@@ -58,60 +61,127 @@ case class Statistics(
   }
 }
 
+
 /**
- * Statistics for a column.
+ * Statistics collected for a column.
+ *
+ * 1. Supported data types are defined in `ColumnStat.supportsType`.
+ * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
+ * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
+ * TimestampType we store java.sql.Timestamp.
+ * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
+ *
+ * @param ndv number of distinct values
+ * @param min minimum value
+ * @param max maximum value
+ * @param numNulls number of nulls
+ * @param avgLen average length of the values. For fixed-length types, this should be a constant.
+ * @param maxLen maximum length of the values. For fixed-length types, this should be a constant.
 */
-case class ColumnStat(statRow: InternalRow) {
+// TODO: decide if we want to use bigint to represent ndv and numNulls.
+case class ColumnStat(
+    ndv: Long,
+    min: Any,
+    max: Any,
+    numNulls: Long,
+    avgLen: Long,
+    maxLen: Long) {
 
-  def forNumeric[T <: AtomicType](dataType: T): NumericColumnStat[T] = {
-    NumericColumnStat(statRow, dataType)
-  }
-  def forString: StringColumnStat = StringColumnStat(statRow)
-  def forBinary: BinaryColumnStat = BinaryColumnStat(statRow)
-  def forBoolean: BooleanColumnStat = BooleanColumnStat(statRow)
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the field (e.g. "ndv" or "min"), and the value is the string
+   * representation for the value. The deserialization side is defined in [[ColumnStat.fromMap]].
+   *
+   * As part of the protocol, the returned map always contains a key called "version".
+   */
+  def toMap: Map[String, String] = Map(
+    "version" -> "1",
+    "ndv" -> ndv.toString,
+    "min" -> min.toString,
+    "max" -> max.toString,
+    "numNulls" -> numNulls.toString,
+    "avgLen" -> avgLen.toString,
+    "maxLen" -> maxLen.toString
+  )
+}
+
+
+object ColumnStat extends Logging {
 
-  override def toString: String = {
-    // use Base64 for encoding
-    Base64.encodeBase64String(statRow.asInstanceOf[UnsafeRow].getBytes)
+  /** Returns true iff we support gathering column statistics on a column of the given type. */
+  def supportsType(dataType: DataType): Boolean = dataType match {
+    case _: NumericType | TimestampType | DateType | BooleanType => true
+    case StringType | BinaryType => true
+    case _ => false
   }
-}
 
-object ColumnStat {
-  def apply(numFields: Int, str: String): ColumnStat = {
-    // use Base64 for decoding
-    val bytes = Base64.decodeBase64(str)
-    val unsafeRow = new UnsafeRow(numFields)
-    unsafeRow.pointTo(bytes, bytes.length)
-    ColumnStat(unsafeRow)
+  /**
+   * Creates a [[ColumnStat]] object from the given map. This is used to deserialize column stats
+   * from some external storage. The serialization side is defined in [[ColumnStat.toMap]].
+   */
+  def fromMap(dataType: DataType, map: Map[String, String]): Option[ColumnStat] = {
+    val str2val: (String => Any) = dataType match {
+      case _: IntegralType => _.toLong
+      case _: DecimalType => Decimal(_)
+      case DoubleType | FloatType => _.toDouble
+      case BooleanType => _.toBoolean
+      case _ => identity
+    }
+
+    try {
+      Some(ColumnStat(
+        ndv = map("ndv").toLong,
+        min = str2val(map.get("min").orNull),
+        max = str2val(map.get("max").orNull),
+        numNulls = map("numNulls").toLong,
+        avgLen = map.getOrElse("avgLen", "1").toLong,
+        maxLen = map.getOrElse("maxLen", "1").toLong
+      ))
+    } catch {
+      case NonFatal(e) =>
+        logWarning("Failed to parse column statistics", e)
+        None
+    }
   }
-}
 
-case class NumericColumnStat[T <: AtomicType](statRow: InternalRow, dataType: T) {
-  // The indices here must be consistent with `ColumnStatStruct.numericColumnStat`.
-  val numNulls: Long = statRow.getLong(0)
-  val max: T#InternalType = statRow.get(1, dataType).asInstanceOf[T#InternalType]
-  val min: T#InternalType = statRow.get(2, dataType).asInstanceOf[T#InternalType]
-  val ndv: Long = statRow.getLong(3)
-}

[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89035720
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
@@ -58,60 +61,127 @@ case class Statistics(
   }
 }
 
+
 /**
- * Statistics for a column.
+ * Statistics collected for a column.
+ *
+ * 1. Supported data types are defined in `ColumnStat.supportsType`.
+ * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
+ * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
+ * TimestampType we store java.sql.Timestamp.
+ * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
+ *
+ * @param ndv number of distinct values
+ * @param min minimum value
+ * @param max maximum value
+ * @param numNulls number of nulls
+ * @param avgLen average length of the values. For fixed-length types, this should be a constant.
+ * @param maxLen maximum length of the values. For fixed-length types, this should be a constant.
 */
-case class ColumnStat(statRow: InternalRow) {
+// TODO: decide if we want to use bigint to represent ndv and numNulls.
+case class ColumnStat(
+    ndv: Long,
+    min: Any,
+    max: Any,
+    numNulls: Long,
+    avgLen: Long,
+    maxLen: Long) {
 
-  def forNumeric[T <: AtomicType](dataType: T): NumericColumnStat[T] = {
-    NumericColumnStat(statRow, dataType)
-  }
-  def forString: StringColumnStat = StringColumnStat(statRow)
-  def forBinary: BinaryColumnStat = BinaryColumnStat(statRow)
-  def forBoolean: BooleanColumnStat = BooleanColumnStat(statRow)
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the field (e.g. "ndv" or "min"), and the value is the string
+   * representation for the value. The deserialization side is defined in [[ColumnStat.fromMap]].
+   *
+   * As part of the protocol, the returned map always contains a key called "version".
+   */
+  def toMap: Map[String, String] = Map(
--- End diff --

Should we also have a flag to indicate whether a column stat is valid or not? A few example cases where it would help:

- some bug in the code led to all stats generated using version XYZ being incorrect, and we want clients not to trust the stats in that case
- for some reason we read a bad stat from the metastore (e.g. min > max)
- storing `min` and `max` for string types can be risky, since you are at the mercy of user data. In the past this has bitten me when a column value was a very large string. The step where stats are generated needs to guard against such cases and set the "stat-is-invalid" flag.

All these can be handled by returning some special value (`None`) to the client, but then you lose the information that some stats were there.
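
A hedged sketch of the flag idea; `isValid` and the length cap are made-up names for illustration, not anything in the PR:

```scala
// Sketch: keep the stat object but mark it untrusted, instead of returning None.
case class FlaggedStringStat(ndv: Long, min: String, max: String, isValid: Boolean)

object FlaggedStringStat {
  val MaxStringStatLen = 1024 // hypothetical guard against very large string values

  def apply(ndv: Long, min: String, max: String): FlaggedStringStat = {
    val safe = min.length <= MaxStringStatLen &&
      max.length <= MaxStringStatLen &&
      min <= max
    FlaggedStringStat(ndv, min, max, isValid = safe)
  }
}
```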





[GitHub] spark pull request #15959: [SPARK-18522][SQL] Explicit contract for column s...

2016-11-21 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15959#discussion_r89039488
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala ---
@@ -58,60 +61,127 @@ case class Statistics(
   }
 }
 
+
 /**
- * Statistics for a column.
+ * Statistics collected for a column.
+ *
+ * 1. Supported data types are defined in `ColumnStat.supportsType`.
+ * 2. The JVM data type stored in min/max is the external data type (used in Row) for the
+ * corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
+ * TimestampType we store java.sql.Timestamp.
+ * 3. For integral types, they are all upcasted to longs, i.e. shorts are stored as longs.
+ *
+ * @param ndv number of distinct values
+ * @param min minimum value
+ * @param max maximum value
+ * @param numNulls number of nulls
+ * @param avgLen average length of the values. For fixed-length types, this should be a constant.
+ * @param maxLen maximum length of the values. For fixed-length types, this should be a constant.
 */
-case class ColumnStat(statRow: InternalRow) {
+// TODO: decide if we want to use bigint to represent ndv and numNulls.
+case class ColumnStat(
+    ndv: Long,
+    min: Any,
+    max: Any,
+    numNulls: Long,
+    avgLen: Long,
+    maxLen: Long) {
 
-  def forNumeric[T <: AtomicType](dataType: T): NumericColumnStat[T] = {
-    NumericColumnStat(statRow, dataType)
-  }
-  def forString: StringColumnStat = StringColumnStat(statRow)
-  def forBinary: BinaryColumnStat = BinaryColumnStat(statRow)
-  def forBoolean: BooleanColumnStat = BooleanColumnStat(statRow)
+  /**
+   * Returns a map from string to string that can be used to serialize the column stats.
+   * The key is the name of the field (e.g. "ndv" or "min"), and the value is the string
+   * representation for the value. The deserialization side is defined in [[ColumnStat.fromMap]].
+   *
+   * As part of the protocol, the returned map always contains a key called "version".
+   */
+  def toMap: Map[String, String] = Map(
+    "version" -> "1",
+    "ndv" -> ndv.toString,
+    "min" -> min.toString,
+    "max" -> max.toString,
+    "numNulls" -> numNulls.toString,
+    "avgLen" -> avgLen.toString,
+    "maxLen" -> maxLen.toString
+  )
+}
+
+
+object ColumnStat extends Logging {
 
-  override def toString: String = {
-    // use Base64 for encoding
-    Base64.encodeBase64String(statRow.asInstanceOf[UnsafeRow].getBytes)
+  /** Returns true iff we support gathering column statistics on a column of the given type. */
+  def supportsType(dataType: DataType): Boolean = dataType match {
+    case _: NumericType | TimestampType | DateType | BooleanType => true
+    case StringType | BinaryType => true
+    case _ => false
   }
-}
 
-object ColumnStat {
-  def apply(numFields: Int, str: String): ColumnStat = {
-    // use Base64 for decoding
-    val bytes = Base64.decodeBase64(str)
-    val unsafeRow = new UnsafeRow(numFields)
-    unsafeRow.pointTo(bytes, bytes.length)
-    ColumnStat(unsafeRow)
+  /**
+   * Creates a [[ColumnStat]] object from the given map. This is used to deserialize column stats
+   * from some external storage. The serialization side is defined in [[ColumnStat.toMap]].
+   */
+  def fromMap(dataType: DataType, map: Map[String, String]): Option[ColumnStat] = {
+    val str2val: (String => Any) = dataType match {
+      case _: IntegralType => _.toLong
+      case _: DecimalType => Decimal(_)
+      case DoubleType | FloatType => _.toDouble
+      case BooleanType => _.toBoolean
+      case _ => identity
+    }
+
+    try {
+      Some(ColumnStat(
--- End diff --

Curious: why is `version` not retained here? If due to some Spark bug we realise that stats generated in some version `x` are bad, we would want the client (CBO code) not to consume them.
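
For illustration, the deserialization side could pin the version before parsing; this is a sketch of the idea, not the PR's code (the `"1"` constant matches the `toMap` protocol quoted above, and `badVersions` is a hypothetical blacklist):

```scala
// Sketch: reject stats whose "version" is unknown or known-bad, so the CBO
// never consumes them.
val badVersions: Set[String] = Set.empty

def parseIfVersionOk(map: Map[String, String]): Option[Map[String, String]] =
  map.get("version") match {
    case Some("1") if !badVersions("1") => Some(map) // safe to parse further
    case _ => None // unknown or blacklisted version
  }
```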





[GitHub] spark issue #15955: [SPARK-18434][ML][FOLLOWUP] Add checking for setSolver i...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15955
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15955: [SPARK-18434][ML][FOLLOWUP] Add checking for setSolver i...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15955
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68977/
Test PASSed.





[GitHub] spark issue #15955: [SPARK-18434][ML][FOLLOWUP] Add checking for setSolver i...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15955
  
**[Test build #68977 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68977/consoleFull)**
 for PR 15955 at commit 
[`1fe2f92`](https://github.com/apache/spark/commit/1fe2f925aa7dbe614f65a76eb61ebe13fe67dd6a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15954: [WIP][SPARK-18516][SQL] Split state and progress in stre...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15954
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15954: [WIP][SPARK-18516][SQL] Split state and progress in stre...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15954
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68972/
Test PASSed.





[GitHub] spark issue #15954: [WIP][SPARK-18516][SQL] Split state and progress in stre...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15954
  
**[Test build #68972 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68972/consoleFull)**
 for PR 15954 at commit 
[`59a9139`](https://github.com/apache/spark/commit/59a91393d869b312819f3b25b25028912fe0b79a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15870: [SPARK-18425][Structured Streaming][Tests] Test `Compact...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15870
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68973/
Test PASSed.





[GitHub] spark issue #15870: [SPARK-18425][Structured Streaming][Tests] Test `Compact...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15870
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15870: [SPARK-18425][Structured Streaming][Tests] Test `Compact...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15870
  
**[Test build #68973 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68973/consoleFull)**
 for PR 15870 at commit 
[`9d193d0`](https://github.com/apache/spark/commit/9d193d0a6f159b5ac6b878c7be6c5579ad257909).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15936: [SPARK-18504][SQL] Scalar subquery with extra group by c...

2016-11-21 Thread nsyca
Github user nsyca commented on the issue:

https://github.com/apache/spark/pull/15936
  
The latest push (3181a08) of the PR has the following changes from its previous version:

1. Consolidate the existing checking from Analyzer.scala into CheckAnalysis.scala
2. Make use of the `ExpressionSet` class to compare the canonical forms of expressions between two `ExpressionSet`s
3. Revise the test case to be more concise (using only one table) and, more importantly, exercise the `ExpressionSet` code in the comparison of the columns in the GROUP BY clause and the columns in the WHERE clause at line 143 of CheckAnalysis.scala
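
As a small illustration of why `ExpressionSet` helps here: it compares canonicalized forms, so syntactically different but semantically equal expressions match. A sketch against Catalyst's internal test DSL (internal API, subject to change between versions):

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.ExpressionSet

val a = 'a.int
val b = 'b.int
// 'a + 'b and 'b + 'a canonicalize to the same expression, so membership
// checks treat them as equal even though they are different trees.
val set = ExpressionSet(Seq(a + b))
assert(set.contains(b + a))
```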






[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15971
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68970/
Test PASSed.





[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15971
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15971
  
**[Test build #68970 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68970/consoleFull)**
 for PR 15971 at commit 
[`5dd3630`](https://github.com/apache/spark/commit/5dd3630d6a937ba8634054940543509d9186e68e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15969: [SPARK-18530][SS][KAFKA]Change Kafka timestamp column ty...

2016-11-21 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/15969
  
I wonder whether it makes sense to actually test with a query that has a window and `withWatermark` on the timestamp column, because that is what we want to enable.
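
A sketch of such a query, reusing the `kafka` DataFrame from the test quoted below (window size and watermark delay are arbitrary choices for the example):

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// Event-time aggregation on the Kafka-provided timestamp column:
// count records per 1-minute window, tolerating 10 seconds of late data.
val windowedCounts = kafka
  .selectExpr("CAST(value AS STRING)", "timestamp")
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "1 minute"))
  .count()
```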





[GitHub] spark pull request #15969: [SPARK-18530][SS][KAFKA]Change Kafka timestamp co...

2016-11-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15969#discussion_r89037554
  
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala ---
@@ -413,6 +414,43 @@ class KafkaSourceSuite extends KafkaSourceTest {
     )
   }
 
+  test("Kafka column types") {
+    val now = System.currentTimeMillis()
+    val topic = newTopic()
+    testUtils.createTopic(topic, partitions = 1)
+    testUtils.sendMessages(topic, Array(1).map(_.toString))
+
+    val reader = spark
+      .readStream
+      .format("kafka")
+      .option("kafka.bootstrap.servers", testUtils.brokerAddress)
+      .option("kafka.metadata.max.age.ms", "1")
+      .option("startingOffsets", "earliest")
+      .option("subscribe", topic)
+
+    val kafka = reader.load()
--- End diff --

nit: why have a separate var for `reader`?





[GitHub] spark issue #15949: [SPARK-18339] [SQL] Don't push down current_timestamp fo...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15949
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68971/
Test PASSed.





[GitHub] spark issue #15949: [SPARK-18339] [SQL] Don't push down current_timestamp fo...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15949
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15949: [SPARK-18339] [SQL] Don't push down current_timestamp fo...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15949
  
**[Test build #68971 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68971/consoleFull)**
 for PR 15949 at commit 
[`c1b3e60`](https://github.com/apache/spark/commit/c1b3e601d9feaa38a2fe6eef4d20dab5181e5eda).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class CurrentBatchTimestamp(timestamp: SQLTimestamp) extends 
LeafExpression`





[GitHub] spark issue #15946: [SPARK-18513][Structured Streaming] Record and recover w...

2016-11-21 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/15946
  
@marmbrus @zsxwing could you take a look? Thanks!





[GitHub] spark pull request #15954: [WIP][SPARK-18516][SQL] Split state and progress ...

2016-11-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r89035938
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.scala ---
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.{util => ju}
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+
+import org.json4s._
+import org.json4s.JsonAST.JValue
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Holds statistics about state that is being stored for a given streaming query.
+ */
+@Experimental
+class StateOperator private[sql](
+    val numEntries: Long,
--- End diff --

Those sound good.





[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

2016-11-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15951
  
```
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
```
I don't think this is a valid use case: `DataFrameReader` can't specify partition columns, so we will always infer partitions.

I think the real problem is `HadoopFsRelation.schema`:
```
val schema: StructType = {
  val dataSchemaColumnNames = dataSchema.map(_.name.toLowerCase).toSet
  StructType(dataSchema ++ partitionSchema.filterNot { column =>
    dataSchemaColumnNames.contains(column.name.toLowerCase)
  })
}
```
It silently drops the partition schema if the partition column names are duplicated in the data schema.

I think the best solution is to add `partitionBy` to `DataFrameReader` so that we can really skip inferring partitions. But this may be too late for 2.1, so we should define a better semantic for the current "broken" API.

> Once we find what the partition columns are, we try to find them in the 
user specified schema and use the dataType provided there, or fall back to the 
smallest common data type.

This LGTM
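
A rough sketch of that semantic (prefer the user-specified type for an inferred partition column); the helper name is made up for illustration, not the PR's code:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Illustrative only: if the user schema mentions the inferred partition
// column, keep the user's data type; otherwise fall back to the inferred one.
def resolvePartitionField(inferred: StructField, userSchema: Option[StructType]): StructField =
  userSchema.flatMap(_.find(_.name.equalsIgnoreCase(inferred.name)))
    .map(user => inferred.copy(dataType = user.dataType))
    .getOrElse(inferred)
```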





[GitHub] spark issue #15955: [SPARK-18434][ML][FOLLOWUP] Add checking for setSolver i...

2016-11-21 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/15955
  
@yanboliang I updated this PR according to your comments. Now its `solver` looks like `MLPC`'s.





[GitHub] spark issue #15955: [SPARK-18434][ML][FOLLOWUP] Add checking for setSolver i...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15955
  
**[Test build #68977 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68977/consoleFull)**
 for PR 15955 at commit 
[`1fe2f92`](https://github.com/apache/spark/commit/1fe2f925aa7dbe614f65a76eb61ebe13fe67dd6a).





[GitHub] spark pull request #15954: [WIP][SPARK-18516][SQL] Split state and progress ...

2016-11-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r89034915
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala ---
@@ -64,23 +66,26 @@ trait StreamingQuery {
 
   /**
* Returns the current status of the query.
+   *
* @since 2.0.2
*/
   def status: StreamingQueryStatus
 
   /**
-   * Returns current status of all the sources.
-   * @since 2.0.0
+   * Returns an array of the most recent [[StreamingQueryProgress]] updates for this query.
+   * The number of records retained for each stream is configured by
+   * `spark.sql.streaming.numProgressRecords`.
+   *
+   *  @since 2.1.0
*/
-  @deprecated("use status.sourceStatuses", "2.0.2")
-  def sourceStatuses: Array[SourceStatus]
+  def recentProgress: Array[StreamingQueryProgress]
--- End diff --

Well, in the name StreamingQueryProgress we have effectively defined that "progress" means data from one trigger (i.e. singular). So I think it's better to be `progresses`.





[GitHub] spark pull request #15954: [WIP][SPARK-18516][SQL] Split state and progress ...

2016-11-21 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r89034797
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.scala ---
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.{util => ju}
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+
+import org.json4s._
+import org.json4s.JsonAST.JValue
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Holds statistics about state that is being stored for a given streaming query.
+ */
+@Experimental
+class StateOperator private[sql](
+    val numEntries: Long,
--- End diff --

The same question then applies to numUpdated as well. How about numRowsTotal and numRowsUpdated? Then in the future we could add sizeBytesTotal, sizeBytesUpdated, etc.






[GitHub] spark issue #15972: [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, Deve...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15972
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68974/
Test PASSed.





[GitHub] spark issue #15972: [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, Deve...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15972
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15972: [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, Deve...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15972
  
**[Test build #68974 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68974/consoleFull)**
 for PR 15972 at commit 
[`d2674fb`](https://github.com/apache/spark/commit/d2674fbec826e48a1f6068a37c20af311b3f1274).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait LDAOptimizer `





[GitHub] spark issue #14995: [Test Only][SPARK-6235][CORE]Address various 2G limits

2016-11-21 Thread witgo
Github user witgo commented on the issue:

https://github.com/apache/spark/pull/14995
  
This PR is test-only; it is used to:
1. verify the code through CI
2. verify the effectiveness of the solution

It includes two underlying API changes:

1. Replace ByteBuffer with ChunkedByteBuffer.
2. Replace ByteBuf with InputStream.

There should not be much debate about 1 (the master branch has already made some of the relevant changes), but @rxin has a different idea about 2. We should reach a consensus on these two underlying changes, and then take the next step.


@srowen What do you think of the above two changes?
@opme Have you done more testing on large-scale data shuffle?
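
For context on change 1: a single `ByteBuffer` is indexed by `Int` and therefore capped at 2GB, which is the root of the 2G limits; a chunked wrapper addresses data by a `Long` offset across bounded chunks. A toy sketch of the idea only — Spark's actual `ChunkedByteBuffer` has a different, richer API:

```scala
import java.nio.ByteBuffer

// Toy illustration of the 2G workaround: expose a Long-addressable view
// over a sequence of Int-bounded chunks.
class SimpleChunkedBuffer(chunks: Array[ByteBuffer]) {
  def size: Long = chunks.map(_.limit().toLong).sum // can exceed 2GB

  def getByte(offset: Long): Byte = {
    var remaining = offset
    for (chunk <- chunks) {
      if (remaining < chunk.limit()) return chunk.get(remaining.toInt)
      remaining -= chunk.limit()
    }
    throw new IndexOutOfBoundsException(s"offset $offset >= size $size")
  }
}
```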






[GitHub] spark issue #15837: [SPARK-18395][SQL] Evaluate common subexpression like la...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15837
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68967/
Test PASSed.





[GitHub] spark issue #15837: [SPARK-18395][SQL] Evaluate common subexpression like la...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15837
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15837: [SPARK-18395][SQL] Evaluate common subexpression like la...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15837
  
**[Test build #68967 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68967/consoleFull)**
 for PR 15837 at commit 
[`f95100b`](https://github.com/apache/spark/commit/f95100bf850a24c6956b2330bec2937c0f3386f5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class AesCipher `
  * `public class AesConfigMessage implements Encodable `
  * `public class ByteArrayReadableChannel implements ReadableByteChannel `
  * `public final class JavaStructuredKafkaWordCount `
  * `abstract class PerPartitionConfig extends Serializable `
  * `class ClusteringSummary(JavaWrapper):`
  * `class GaussianMixtureSummary(ClusteringSummary):`
  * `class BisectingKMeansSummary(ClusteringSummary):`
  * `trait CollectionGenerator extends Generator `
  * `case class Stack(children: Seq[Expression]) extends Generator `
  * `abstract class ExplodeBase extends UnaryExpression with 
CollectionGenerator with Serializable `
  * `case class Explode(child: Expression) extends ExplodeBase `
  * `case class PosExplode(child: Expression) extends ExplodeBase `
  * `case class Inline(child: Expression) extends UnaryExpression with 
CollectionGenerator `
  * `trait InvokeLike extends Expression with NonSQLExpression `
  * `case class EventTimeWatermark(`
  * `class CaseInsensitiveMap(map: Map[String, String]) extends Map[String, 
String]`
  * `final class ParquetLogRedirector implements Serializable `
  * `sealed trait ViewType `
  * `  case class OutputSpec(`
  * `class MaxLong(protected var currentValue: Long = 0)`
  * `case class EventTimeWatermarkExec(`
  * `class FileStreamOptions(parameters: CaseInsensitiveMap) extends 
Logging `
  * `case class OffsetSeq(offsets: Seq[Option[Offset]], metadata: 
Option[String] = None) `
  * `sealed trait StoreUpdate `
  * `case class ValueRemoved(key: UnsafeRow, value: UnsafeRow) extends 
StoreUpdate`
  * `case class SparkListenerDriverAccumUpdates(`





[GitHub] spark pull request #15954: [WIP][SPARK-18516][SQL] Split state and progress ...

2016-11-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r89032237
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQuery.scala ---
@@ -64,23 +66,26 @@ trait StreamingQuery {
 
   /**
* Returns the current status of the query.
+   *
* @since 2.0.2
*/
   def status: StreamingQueryStatus
 
   /**
-   * Returns current status of all the sources.
-   * @since 2.0.0
+   * Returns an array of the most recent [[StreamingQueryProgress]] updates for this query.
+   * The number of records retained for each stream is configured by
+   * `spark.sql.streaming.numProgressRecords`.
+   *
+   *  @since 2.1.0
*/
-  @deprecated("use status.sourceStatuses", "2.0.2")
-  def sourceStatuses: Array[SourceStatus]
+  def recentProgress: Array[StreamingQueryProgress]
--- End diff --

Hmmm, yeah, maybe. It's not clear to me that `progress` is inherently singular, and `progresses` is kind of a mouthful. It is maybe nice for `Array`s to always be plural, though.





[GitHub] spark pull request #15954: [WIP][SPARK-18516][SQL] Split state and progress ...

2016-11-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r89032104
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/SourceProgress.scala ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import scala.util.control.NonFatal
+
+import org.json4s._
+import org.json4s.JsonAST.JValue
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Reports metrics on data being read from a given streaming source.
+ *
+ * @param description Description of the source.
+ * @param startOffset The starting offset for data being read.
+ * @param endOffset The ending offset for data being read.
+ * @param numRecords The number of records read from this source.
--- End diff --

I think if we update the docs as you suggest above this will be clear.





[GitHub] spark pull request #15954: [WIP][SPARK-18516][SQL] Split state and progress ...

2016-11-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r89032039
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.scala ---
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import java.{util => ju}
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+
+import org.json4s._
+import org.json4s.JsonAST.JValue
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Holds statistics about state that is being stored for a given streaming query.
+ */
+@Experimental
+class StateOperator private[sql](
+    val numEntries: Long,
--- End diff --

+1 to docs. I think `numTotal` is less clear. Total of what? It is a count of the entries that the state store is holding.
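
To make that reading concrete, a sketch under the assumption that `StreamingQueryProgress` carries a `stateOperators: Array[StateOperator]` matching the class in this diff (`numEntries` is the field shown above; `query` is the hypothetical query from earlier in the thread):

```scala
// Total entries currently held across all state stores for the last trigger.
val totalStateEntries: Long = query.recentProgress
  .lastOption
  .map(_.stateOperators.map(_.numEntries).sum)
  .getOrElse(0L)
println(s"entries held in state stores: $totalStateEntries")
```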




[GitHub] spark pull request #15954: [WIP][SPARK-18516][SQL] Split state and progress ...

2016-11-21 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/15954#discussion_r89031940
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/streaming/SourceProgress.scala ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import scala.util.control.NonFatal
+
+import org.json4s._
+import org.json4s.JsonAST.JValue
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * :: Experimental ::
+ * Reports metrics on data being read from a given streaming source.
--- End diff --

Sure, we can copy the docs from the main class: `Each event relates to 
processing done for a single trigger of the streaming query.`
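
"One event per trigger" is also how the listener API consumes these objects; a sketch, assuming the 2.1-era event names `QueryStartedEvent`/`QueryProgressEvent`/`QueryTerminatedEvent` and that the progress event carries the renamed `StreamingQueryProgress` in a `progress` field:

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Fires onQueryProgress once per completed trigger of each streaming query.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(event.progress)  // assumed field holding the StreamingQueryProgress
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```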




[GitHub] spark pull request #15973: Initial Kubernetes cluster manager implementation...

2016-11-21 Thread mccheah
Github user mccheah closed the pull request at:

https://github.com/apache/spark/pull/15973




[GitHub] spark issue #15973: Initial Kubernetes cluster manager implementation.

2016-11-21 Thread mccheah
Github user mccheah commented on the issue:

https://github.com/apache/spark/pull/15973
  
Sorry, wrong base for the fork here.




[GitHub] spark pull request #15973: Initial Kubernetes cluster manager implementation...

2016-11-21 Thread mccheah
GitHub user mccheah opened a pull request:

https://github.com/apache/spark/pull/15973

Initial Kubernetes cluster manager implementation.

Includes the following initial feature set:
- Cluster mode with only Scala/Java jobs
- Spark-submit support
- Dynamic allocation

Does not include, most notably:
- Client mode support
- Proper testing on both the unit and integration level; integration
tests are flaky

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/foxish/spark k8s-support-alternate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15973.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15973


commit 1b0e728fac566eb0bf4db14cb137d601c76d67f1
Author: mcheah 
Date:   2016-11-22T02:01:24Z

Initial Kubernetes cluster manager implementation.

Includes the following initial feature set:
- Cluster mode with only Scala/Java jobs
- Spark-submit support
- Dynamic allocation

Does not include, most notably:
- Client mode support
- Proper testing on both the unit and integration level; integration
tests are flaky






[GitHub] spark pull request #15968: [SPARK-18533] Raise correct error upon specificat...

2016-11-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15968#discussion_r89031720
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/CreateTableAsSelectSuite.scala ---
@@ -249,4 +249,13 @@ class CreateTableAsSelectSuite
   }
 }
   }
+
+  test("specifying the column list for CTAS") {
+    withTable("t") {
+      val e = intercept[AnalysisException] {
--- End diff --

`ParserException`?
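
For clarity, what the suggestion would look like in the test, assuming the intended class is the parser's `org.apache.spark.sql.catalyst.parser.ParseException` and using the suite's existing helpers (`withTable`, `sql`, `intercept`); the SQL text and the asserted substring are illustrative:

```scala
import org.apache.spark.sql.catalyst.parser.ParseException

test("specifying the column list for CTAS") {
  withTable("t") {
    val e = intercept[ParseException] {
      sql("CREATE TABLE t (a INT, b INT) USING parquet AS SELECT 1, 2")
    }
    // Illustrative assertion; the exact message comes from the parser change.
    assert(e.getMessage.contains("CTAS"))
  }
}
```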




[GitHub] spark issue #15968: [SPARK-18533] Raise correct error upon specification of ...

2016-11-21 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15968
  
LGTM




[GitHub] spark pull request #15968: [SPARK-18533] Raise correct error upon specificat...

2016-11-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15968#discussion_r89031666
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -1052,7 +1077,8 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
  "CTAS statement."
    operationNotAllowed(errorMessage, ctx)
  }
-// Just use whatever is projected in the select statement as our schema
--- End diff --

I think this is an outdated comment.
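
For readers following along, the semantics the deleted comment used to describe, as a sketch (table names are illustrative; the exact error text depends on this PR):

```scala
// CTAS derives its schema from the SELECT, so no column list is needed:
spark.sql("CREATE TABLE t1 USING parquet AS SELECT 1 AS a, 2 AS b")  // schema: a INT, b INT

// Specifying one alongside AS SELECT is what this PR rejects with a clear error:
spark.sql("CREATE TABLE t2 (a INT) USING parquet AS SELECT 1")       // raises the parse error
```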




[GitHub] spark issue #15951: [SPARK-18510] Fix data corruption from inferred partitio...

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15951
  
**[Test build #68976 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68976/consoleFull)** for PR 15951 at commit [`6f741b6`](https://github.com/apache/spark/commit/6f741b617573613b62d6fe2f0c7eca1cbf66f660).




[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13909
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68962/
Test PASSed.




[GitHub] spark issue #15970: [SPARK-18134][SQL] Comparable MapTypes [POC]

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15970
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68968/
Test FAILed.




[GitHub] spark issue #15970: [SPARK-18134][SQL] Comparable MapTypes [POC]

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15970
  
Merged build finished. Test FAILed.




[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-11-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13909
  
Merged build finished. Test PASSed.




[GitHub] spark issue #15970: [SPARK-18134][SQL] Comparable MapTypes [POC]

2016-11-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15970
  
**[Test build #68968 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68968/consoleFull)** for PR 15970 at commit [`edec2d8`](https://github.com/apache/spark/commit/edec2d8c1133271a04fc26842dd5101be24b041f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class SortMap(child: Expression) extends UnaryExpression with ExpectsInputTypes`



