[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

2017-08-20 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18958#discussion_r134148287
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 ---
@@ -970,30 +458,14 @@ public final int appendStruct(boolean isNull) {
   protected boolean anyNullsSet;
 
   /**
-   * True if this column's values are fixed. This means the column values never change, even
-   * across resets.
-   */
-  protected boolean isConstant;
-
-  /**
-   * Default size of each array length value. This grows as necessary.
-   */
-  protected static final int DEFAULT_ARRAY_LENGTH = 4;
-
-  /**
-   * Current write cursor (row index) when appending data.
-   */
-  protected int elementsAppended;
-
-  /**
    * If this is a nested type (array or struct), the column for the child data.
    */
   protected ColumnVector[] childColumns;
--- End diff --

We need this field for `ArrowColumnVector` to store its child columns, too.
Do you want to make the method `getChildColumn(int ordinal)` abstract and
move the field down to the concrete classes so they manage it themselves?





[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...

2017-08-20 Thread DonnyZone
Github user DonnyZone commented on the issue:

https://github.com/apache/spark/pull/18986
  
@gatorsmile For this issue, I think the behavior of the PromoteStrings rule is reasonable, but there is a problem in the underlying converter, UTF8String.

As described in PR-15880 (https://github.com/apache/spark/pull/15880):
> It's more reasonable to follow postgres, i.e. cast string to the type of the other side, but return null if the string is not castable to keep hive compatibility.

However, the underlying UTF8String still returns true for cases that are not castable. With the code below, we get res=true and wrapper.value=0. Consequently, it produces the wrong answer in SPARK-21774.
```
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.types.UTF8String.IntWrapper

val x = UTF8String.fromString("0.1")
val wrapper = new IntWrapper
val res = x.toInt(wrapper)  // res == true and wrapper.value == 0, although "0.1" is not an int
```
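
For contrast, a minimal strict-parse sketch in plain Scala (a hypothetical check, not the Spark code path) rejects the same input outright, which is the behavior the PostgreSQL-style cast semantics quoted above call for:

```
import scala.util.Try

// Hypothetical helper: succeed only if the whole trimmed string is an integer.
def strictToInt(s: String): Option[Int] = Try(s.trim.toInt).toOption

assert(strictToInt("0.1").isEmpty)     // not castable to int -> None
assert(strictToInt("10").contains(10)) // castable -> Some(10)
```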





[GitHub] spark pull request #18968: [SPARK-21759][SQL] In.checkInputDataTypes should ...

2017-08-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18968#discussion_r134147697
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -138,46 +138,56 @@ case class Not(child: Expression)
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
   require(list != null, "list should not be null")
+
+  lazy val valExprs = value match {
+    case cns: CreateNamedStruct => cns.valExprs
+    case expr => Seq(expr)
+  }
+
+  override lazy val resolved: Boolean = {
+    lazy val checkForInSubquery = list match {
--- End diff --

why `lazy` here?





[GitHub] spark issue #18984: [SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in...

2017-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18984
  
Here is the context I got:

https://github.com/apache/spark/pull/18702 broke the documentation build in https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/.

```
...
Moving back into docs dir.
Moving to SQL directory and building docs.
Missing mkdocs in your path, skipping SQL documentation generation.
Moving back into docs dir.
Making directory api/sql
cp -r ../sql/site/. api/sql
jekyll 2.5.3 | Error:  unknown file type: ../sql/site/.
Deleting credential directory /home/jenkins/workspace/spark-master-docs/spark-utils/new-release-scripts/jenkins/jenkins-credentials-scUXuITy
Build step 'Execute shell' marked build as failure
[WS-CLEANUP] Deleting project workspace...[WS-CLEANUP] done
Finished: FAILURE
```

It had been failing this way for roughly 20 days.

That PR added the SQL documentation build, but it depends on the `mkdocs`
package. I completely forgot that we actually build the docs on Jenkins.

To fix this, in this PR I manually added an install command to install
`mkdocs` if it is missing from the path; however, the installation itself
failed with the error message:

```
...
Installing collected packages: singledispatch, certifi, backports-abc, tornado, livereload, click, Markdown, mkdocs
Exception:
Traceback (most recent call last):
  File "/home/anaconda/lib/python2.7/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/home/anaconda/lib/python2.7/site-packages/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/home/anaconda/lib/python2.7/site-packages/pip/req/req_set.py", line 784, in install
    **kwargs
  File "/home/anaconda/lib/python2.7/site-packages/pip/req/req_install.py", line 851, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/home/anaconda/lib/python2.7/site-packages/pip/req/req_install.py", line 1064, in move_wheel_files
    isolated=self.isolated,
  File "/home/anaconda/lib/python2.7/site-packages/pip/wheel.py", line 345, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/home/anaconda/lib/python2.7/site-packages/pip/wheel.py", line 323, in clobber
    shutil.copyfile(srcfile, destfile)
  File "/home/anaconda/lib/python2.7/shutil.py", line 83, in copyfile
    with open(dst, 'wb') as fdst:
IOError: [Errno 13] Permission denied: '/home/anaconda/lib/python2.7/site-packages/singledispatch_helpers.pyc'
...
```

It has been failing this second way for about a day.





[GitHub] spark issue #18968: [SPARK-21759][SQL] In.checkInputDataTypes should not wro...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18968
  
**[Test build #80920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80920/testReport)** for PR 18968 at commit [`0e15cde`](https://github.com/apache/spark/commit/0e15cdeb5897b767de91bcfb38e7d15034bbd994).





[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

2017-08-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18958#discussion_r134147099
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java
 ---
@@ -307,64 +293,70 @@ public void update(int ordinal, Object value) {
 
     @Override
     public void setNullAt(int ordinal) {
--- End diff --

one question: do the rows returned by `ColumnarBatch.rowIterator` have to
be mutable?





[GitHub] spark pull request #18968: [SPARK-21759][SQL] In.checkInputDataTypes should ...

2017-08-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18968#discussion_r134147067
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -138,46 +138,63 @@ case class Not(child: Expression)
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
   require(list != null, "list should not be null")
+
+  lazy val valExprs = value match {
+    case cns: CreateNamedStruct => cns.valExprs
+    case expr => Seq(expr)
+  }
+
+  override lazy val resolved: Boolean = {
+    lazy val checkForInSubquery = list match {
+      case (l @ ListQuery(sub, children, _)) :: Nil =>
+        // SPARK-21759:
+        // TODO: Update this check if we combine the optimizer rules for subquery rewriting.
+        //
+        // In `CheckAnalysis`, we already check that the size of the subquery plan output
+        // matches the size of the value expressions. However, we can add extra correlated
+        // predicate references onto the top of the subquery plan when pulling up correlated
+        // predicates. Thus, we add an extra check here to make sure we don't mess up the
+        // query plan.
+
+        // Try to find out if any extra subquery output doesn't appear in the subquery condition.
+        val extraOutputAllInCondition = sub.output.drop(valExprs.length).find { attr =>
+          l.children.forall { c =>
+            !c.references.contains(attr)
+          }
+        }.isEmpty
+
+        if (sub.output.length >= valExprs.length && extraOutputAllInCondition) {
+          true
+        } else {
+          false
+        }
--- End diff --

Looks good. Will update soon.





[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

2017-08-20 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18958#discussion_r134146811
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 ---
@@ -970,30 +458,14 @@ public final int appendStruct(boolean isNull) {
   protected boolean anyNullsSet;
 
   /**
-   * True if this column's values are fixed. This means the column values never change, even
-   * across resets.
-   */
-  protected boolean isConstant;
-
-  /**
-   * Default size of each array length value. This grows as necessary.
-   */
-  protected static final int DEFAULT_ARRAY_LENGTH = 4;
-
-  /**
-   * Current write cursor (row index) when appending data.
-   */
-  protected int elementsAppended;
-
-  /**
    * If this is a nested type (array or struct), the column for the child data.
    */
   protected ColumnVector[] childColumns;
--- End diff --

can we move this to `WritableColumnVector`? I think `ColumnVector` only
needs `ColumnVector getChildColumn(int ordinal)`, and `WritableColumnVector` can
override it as `WritableColumnVector getChildColumn(int ordinal)`.
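
For illustration, a sketch of that split in Scala (hypothetical stand-ins for the Java classes, not the actual patch): the abstract base declares only the accessor, and the writable subclass owns the field and narrows the return type, so callers holding the subclass need no cast.

```
// Hypothetical shapes mirroring the proposed hierarchy.
abstract class ColumnVector {
  def getChildColumn(ordinal: Int): ColumnVector
}

abstract class WritableColumnVector(children: Array[WritableColumnVector])
    extends ColumnVector {
  // Covariant return type: a caller holding a WritableColumnVector gets one back.
  override def getChildColumn(ordinal: Int): WritableColumnVector = children(ordinal)
}
```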





[GitHub] spark pull request #18968: [SPARK-21759][SQL] In.checkInputDataTypes should ...

2017-08-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18968#discussion_r134146523
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -138,46 +138,63 @@ case class Not(child: Expression)
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
   require(list != null, "list should not be null")
+
+  lazy val valExprs = value match {
+    case cns: CreateNamedStruct => cns.valExprs
+    case expr => Seq(expr)
+  }
+
+  override lazy val resolved: Boolean = {
+    lazy val checkForInSubquery = list match {
+      case (l @ ListQuery(sub, children, _)) :: Nil =>
+        // SPARK-21759:
+        // TODO: Update this check if we combine the optimizer rules for subquery rewriting.
+        //
+        // In `CheckAnalysis`, we already check that the size of the subquery plan output
+        // matches the size of the value expressions. However, we can add extra correlated
+        // predicate references onto the top of the subquery plan when pulling up correlated
+        // predicates. Thus, we add an extra check here to make sure we don't mess up the
+        // query plan.
+
+        // Try to find out if any extra subquery output doesn't appear in the subquery condition.
+        val extraOutputAllInCondition = sub.output.drop(valExprs.length).find { attr =>
+          l.children.forall { c =>
+            !c.references.contains(attr)
+          }
+        }.isEmpty
+
+        if (sub.output.length >= valExprs.length && extraOutputAllInCondition) {
+          true
+        } else {
+          false
+        }
--- End diff --

Line 159 - Line 169 can be simplified to

```Scala
val isAllExtraOutputInCondition = sub.output.drop(valExprs.length).forall { attr =>
  children.exists(_.references.contains(attr))
}
sub.output.length >= valExprs.length && isAllExtraOutputInCondition
```





[GitHub] spark issue #18984: [SPARK-21773][BUILD][DOCS] Installs mkdocs if missing in...

2017-08-20 Thread shaneknapp
Github user shaneknapp commented on the issue:

https://github.com/apache/spark/pull/18984
  
how long has this been failing in this way?  i'll take a closer look
tomorrow afternoon.

On Sun, Aug 20, 2017 at 4:31 AM, Hyukjin Kwon wrote:

> It looks still failed during mkdocs installation -
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3588/consoleFull
>
> Missing mkdocs in your path, trying to install mkdocs for SQL documentation generation.
> Collecting mkdocs
>   Downloading mkdocs-0.16.3-py2.py3-none-any.whl (1.2MB)
> Collecting livereload>=2.5.1 (from mkdocs)
>   Downloading livereload-2.5.1-py2-none-any.whl
> Requirement already satisfied: Jinja2>=2.7.1 in /home/anaconda/lib/python2.7/site-packages (from mkdocs)
> Collecting click>=3.3 (from mkdocs)
>   Downloading click-6.7-py2.py3-none-any.whl (71kB)
> Collecting Markdown>=2.3.1 (from mkdocs)
>   Downloading Markdown-2.6.9.tar.gz (271kB)
> Requirement already satisfied: PyYAML>=3.10 in /home/anaconda/lib/python2.7/site-packages (from mkdocs)
> Collecting tornado>=4.1 (from mkdocs)
>   Downloading tornado-4.5.1.tar.gz (483kB)
> Requirement already satisfied: six in /home/anaconda/lib/python2.7/site-packages (from livereload>=2.5.1->mkdocs)
> Collecting singledispatch (from tornado>=4.1->mkdocs)
>   Downloading singledispatch-3.4.0.3-py2.py3-none-any.whl
> Collecting certifi (from tornado>=4.1->mkdocs)
>   Downloading certifi-2017.7.27.1-py2.py3-none-any.whl (349kB)
> Collecting backports_abc>=0.4 (from tornado>=4.1->mkdocs)
>   Downloading backports_abc-0.5-py2.py3-none-any.whl
> Building wheels for collected packages: Markdown, tornado
>   Running setup.py bdist_wheel for Markdown: started
>   Running setup.py bdist_wheel for Markdown: finished with status 'done'
>   Stored in directory: /home/jenkins/.cache/pip/wheels/bf/46/10/c93e17ae86ae3b3a919c7b39dad3b5ccf09aeb066419e5c1e5
>   Running setup.py bdist_wheel for tornado: started
>   Running setup.py bdist_wheel for tornado: finished with status 'done'
>   Stored in directory: /home/jenkins/.cache/pip/wheels/84/83/cd/6a04602633457269d161344755e6766d24307189b7a67ff4b7
> Successfully built Markdown tornado
> Installing collected packages: singledispatch, certifi, backports-abc, tornado, livereload, click, Markdown, mkdocs
> Exception:
> Traceback (most recent call last):
>   File "/home/anaconda/lib/python2.7/site-packages/pip/basecommand.py", line 215, in main
>     status = self.run(options, args)
>   File "/home/anaconda/lib/python2.7/site-packages/pip/commands/install.py", line 342, in run
>     prefix=options.prefix_path,
>   File "/home/anaconda/lib/python2.7/site-packages/pip/req/req_set.py", line 784, in install
>     **kwargs
>   File "/home/anaconda/lib/python2.7/site-packages/pip/req/req_install.py", line 851, in install
>     self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
>   File "/home/anaconda/lib/python2.7/site-packages/pip/req/req_install.py", line 1064, in move_wheel_files
>     isolated=self.isolated,
>   File "/home/anaconda/lib/python2.7/site-packages/pip/wheel.py", line 345, in move_wheel_files
>     clobber(source, lib_dir, True)
>   File "/home/anaconda/lib/python2.7/site-packages/pip/wheel.py", line 323, in clobber
>     shutil.copyfile(srcfile, destfile)
>   File "/home/anaconda/lib/python2.7/shutil.py", line 83, in copyfile
>     with open(dst, 'wb') as fdst:
> IOError: [Errno 13] Permission denied: '/home/anaconda/lib/python2.7/site-packages/singledispatch_helpers.pyc'
> ...
>
> but I believe the fix itself is still okay and no need to revert.
>
> Hi @shaneknapp, I *guess* we could simply install this by sudo pip install mkdocs without a close look. Maybe, would you please be able to check this and install the mkdocs package for the SQL documentation build?




[GitHub] spark pull request #17951: [SPARK-20711][ML] Fix incorrect min/max for NaN v...

2017-08-20 Thread zhengruifeng
Github user zhengruifeng closed the pull request at:

https://github.com/apache/spark/pull/17951





[GitHub] spark pull request #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarch...

2017-08-20 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/18958#discussion_r134145753
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java
 ---
@@ -307,64 +293,69 @@ public void update(int ordinal, Object value) {
 
     @Override
     public void setNullAt(int ordinal) {
-      assert (!columns[ordinal].isConstant);
-      columns[ordinal].putNull(rowId);
+      getColumnAsMutable(ordinal).putNull(rowId);
    }
 
     @Override
     public void setBoolean(int ordinal, boolean value) {
-      assert (!columns[ordinal].isConstant);
-      columns[ordinal].putNotNull(rowId);
-      columns[ordinal].putBoolean(rowId, value);
+      MutableColumnVector column = getColumnAsMutable(ordinal);
+      column.putNotNull(rowId);
+      column.putBoolean(rowId, value);
    }
 
     @Override
     public void setByte(int ordinal, byte value) {
-      assert (!columns[ordinal].isConstant);
-      columns[ordinal].putNotNull(rowId);
-      columns[ordinal].putByte(rowId, value);
+      MutableColumnVector column = getColumnAsMutable(ordinal);
--- End diff --

In my understanding, the cast still occurs at runtime. The cast operation may
consist of a compare and a branch.
I am thinking about how we can reduce the cost of these operations.
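
One way to shave that cost, sketched here with hypothetical stand-ins for the vectorized classes: pay the checkcast once per column at construction time, so the hot per-row setters do a plain array load instead of a compare-and-branch on every call.

```
abstract class ColumnVector
abstract class MutableColumnVector extends ColumnVector {
  def putNotNull(rowId: Int): Unit
  def putByte(rowId: Int, value: Byte): Unit
}

final class MutableRow(columns: Array[ColumnVector], rowId: Int) {
  // One cast per column here, instead of one per setter call.
  private val mutableColumns: Array[MutableColumnVector] =
    columns.map(_.asInstanceOf[MutableColumnVector])

  def setByte(ordinal: Int, value: Byte): Unit = {
    val c = mutableColumns(ordinal) // plain array load, no per-call checkcast
    c.putNotNull(rowId)
    c.putByte(rowId, value)
  }
}
```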





[GitHub] spark issue #19002: [SPARK-21790][TESTS][FOLLOW-UP] Add filter pushdown veri...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19002
  
**[Test build #80919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80919/testReport)** for PR 19002 at commit [`b24aedf`](https://github.com/apache/spark/commit/b24aedf8859283d1520d5eae195d21722972591a).





[GitHub] spark pull request #18994: [SPARK-21784][SQL] Adds support for defining info...

2017-08-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18994#discussion_r134145171
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -1214,6 +1246,11 @@ object HiveExternalCatalog {
 
   val CREATED_SPARK_VERSION = SPARK_SQL_PREFIX + "create.version"
 
+  val TABLE_CONSTRAINT_PREFIX = SPARK_SQL_PREFIX + "constraint."
+  val TABLE_CONSTRAINT_PRIMARY_KEY = SPARK_SQL_PREFIX + TABLE_CONSTRAINT_PREFIX + "pk"
+  val TABLE_NUM_FK_CONSTRAINTS = SPARK_SQL_PREFIX + "numFkConstraints"
+  val TABLE_CONSTRAINT_FOREIGNKEY_PREFIX = SPARK_SQL_PREFIX + TABLE_CONSTRAINT_PREFIX + "fk."
--- End diff --

`SPARK_SQL_PREFIX` is duplicated in `TABLE_CONSTRAINT_PRIMARY_KEY` and
`TABLE_CONSTRAINT_FOREIGNKEY_PREFIX`, because `TABLE_CONSTRAINT_PREFIX` already includes it.

E.g., `TABLE_CONSTRAINT_PRIMARY_KEY` expands to `SPARK_SQL_PREFIX` + `SPARK_SQL_PREFIX` + `"constraint."` + `"pk"`.





[GitHub] spark pull request #19002: [SPARK-21790][TESTS][FOLLOW-UP] Add filter pushdo...

2017-08-20 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/19002#discussion_r134144956
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -39,7 +39,6 @@ import org.apache.spark.sql.catalyst.plans.PlanTest
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
 import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.execution.FilterExec
-import org.apache.spark.sql.internal.SQLConf
--- End diff --

OK, i'll revert it.





[GitHub] spark pull request #18994: [SPARK-21784][SQL] Adds support for defining info...

2017-08-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18994#discussion_r134144063
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/TableConstraints.scala
 ---
@@ -0,0 +1,323 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.catalog
+
+import java.util.UUID
+
+import org.json4s._
+import org.json4s.JsonAST.JValue
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.Resolver
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.util.SchemaUtils
+
+/**
+ * A container class to hold all the constraints defined on a table. The scope of the
+ * constraint names is at the table level.
+ */
+case class TableConstraints(
+    primaryKey: Option[PrimaryKey] = None,
+    foreignKeys: Seq[ForeignKey] = Seq.empty) {
+
+  /**
+   * Adds the given constraint to the existing table constraints, after verifying the
+   * constraint name is not a duplicate.
+   */
+  def addConstraint(constraint: TableConstraint, resolver: Resolver): TableConstraints = {
+    if ((primaryKey.exists(pk => resolver(pk.constraintName, constraint.constraintName))
+        || foreignKeys.exists(fk => resolver(fk.constraintName, constraint.constraintName)))) {
+      throw new AnalysisException(
+        s"Failed to add constraint, duplicate constraint name '${constraint.constraintName}'")
+    }
+    constraint match {
+      case pk: PrimaryKey =>
+        if (primaryKey.nonEmpty) {
+          throw new AnalysisException(
+            s"Primary key '${primaryKey.get.constraintName}' already exists.")
+        }
+        this.copy(primaryKey = Option(pk))
+      case fk: ForeignKey => this.copy(foreignKeys = foreignKeys :+ fk)
+    }
+  }
+}
+
+object TableConstraints {
+  /**
+   * Returns a [[TableConstraints]] containing [[PrimaryKey]] or [[ForeignKey]]
+   */
+  def apply(tableConstraint: TableConstraint): TableConstraints = {
+    tableConstraint match {
+      case pk: PrimaryKey => TableConstraints(primaryKey = Option(pk))
+      case fk: ForeignKey => TableConstraints(foreignKeys = Seq(fk))
+    }
+  }
+
+  /**
+   * Converts constraints represented in Json strings to [[TableConstraints]].
+   */
+  def fromJson(pkJson: Option[String], fksJson: Seq[String]): TableConstraints = {
+    val pk = pkJson.map(pk => PrimaryKey.fromJson(parse(pk)))
+    val fks = fksJson.map(fk => ForeignKey.fromJson(parse(fk)))
+    TableConstraints(pk, fks)
+  }
+}
+
+/**
+ * Common type representing a table constraint.
+ */
+sealed trait TableConstraint {
+  val constraintName : String
+  val keyColumnNames : Seq[String]
+}
+
+object TableConstraint {
+  private[TableConstraint] val curId = new java.util.concurrent.atomic.AtomicLong(0L)
+  private[TableConstraint] val jvmId = UUID.randomUUID()
+
+  /**
+   * Generates a unique constraint name to use when adding table constraints,
+   * if the user does not specify a name. The `curId` field is unique within a given JVM,
+   * while the `jvmId` is used to uniquely identify JVMs.
+   */
+  def generateConstraintName(constraintType: String = "constraint"): String = {
+    s"${constraintType}_${jvmId}_${curId.getAndIncrement()}"
+  }
+
+  def parseColumn(json: JValue): String = json match {
+    case JString(name) => name
+    case _ => json.toString
+  }
+
+  object JSortedObject {
+    def unapplySeq(value: JValue): Option[List[(String, JValue)]] = value match {
+      case JObject(seq) => Some(seq.toList.sortBy(_._1))
+      case _ => None
+    }
+  }
+
+  /**
+   * Returns [[StructField]] 

[GitHub] spark issue #18958: [SPARK-21745][SQL] Refactor ColumnVector hierarchy to ma...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18958
  
**[Test build #80918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80918/testReport)** for PR 18958 at commit [`4d94655`](https://github.com/apache/spark/commit/4d94655b4695b16a2600e80824507e322af8ab00).





[GitHub] spark issue #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Py...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18999
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80912/
Test PASSed.





[GitHub] spark issue #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Py...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18999
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80911/
Test PASSed.





[GitHub] spark issue #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Py...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18999
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Py...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18999
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Py...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18999
  
**[Test build #80911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80911/testReport)** for PR 18999 at commit [`24525bc`](https://github.com/apache/spark/commit/24525bc75c886ead4c88a2b6d899c6f9a3947420).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Py...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18999
  
**[Test build #80912 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80912/testReport)** for PR 18999 at commit [`f2608ab`](https://github.com/apache/spark/commit/f2608ab0ca1e64ce97d65bffb62a07935e4b3db8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17951: [SPARK-20711][ML] Fix incorrect min/max for NaN value in...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17951
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80914/
Test FAILed.





[GitHub] spark issue #17951: [SPARK-20711][ML] Fix incorrect min/max for NaN value in...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17951
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17951: [SPARK-20711][ML] Fix incorrect min/max for NaN value in...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17951
  
**[Test build #80914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80914/testReport)** for PR 17951 at commit [`44d117a`](https://github.com/apache/spark/commit/44d117ad4ccad698eb331e3d9ac535bd9a438af0).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18270: [SPARK-21055][SQL] replace grouping__id with grouping_id...

2017-08-20 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/18270
  
@cenyuhai 
Are you still working on this? Could you please fix the test?





[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...

2017-08-20 Thread janewangfb
Github user janewangfb commented on the issue:

https://github.com/apache/spark/pull/18975
  
I still need to implement the data source table portion.





[GitHub] spark issue #18975: [SPARK-4131] Support "Writing data into the filesystem f...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18975
  
**[Test build #80917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80917/testReport)** for PR 18975 at commit [`068662a`](https://github.com/apache/spark/commit/068662a5abaaa693529320bb855b7a3323915bf8).





[GitHub] spark pull request #18975: [SPARK-4131] Support "Writing data into the files...

2017-08-20 Thread janewangfb
Github user janewangfb closed the pull request at:

https://github.com/apache/spark/pull/18975





[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18029
  
**[Test build #80913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80913/testReport)** for PR 18029 at commit [`eb7ad56`](https://github.com/apache/spark/commit/eb7ad56b598af5e537e5fa1808dc93b692a14f6f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `sealed trait InitialPosition `
  * `case class AtTimestamp(timestamp: Date) extends InitialPosition `





[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18029
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80913/
Test PASSed.





[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18029
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #19002: [SPARK-21790][TESTS][FOLLOW-UP] Add filter pushdo...

2017-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19002#discussion_r134140695
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -255,6 +256,18 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo
     val df = dfRead.filter(dfRead.col("date_type").lt(dt))
       .filter(dfRead.col("timestamp_type").lt(ts))
 
+    val parentPlan = df.queryExecution.executedPlan
+    assert(parentPlan.isInstanceOf[WholeStageCodegenExec])
+    val node = parentPlan.asInstanceOf[WholeStageCodegenExec]
+    val metadata = node.child.asInstanceOf[RowDataSourceScanExec].metadata
+    // The "PushedFilters" part should be exist in Dataframe's
--- End diff --

little nit: `should be exist` -> `should exist`





[GitHub] spark pull request #19002: [SPARK-21790][TESTS][FOLLOW-UP] Add filter pushdo...

2017-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19002#discussion_r134140826
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -255,6 +256,18 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo
     val df = dfRead.filter(dfRead.col("date_type").lt(dt))
       .filter(dfRead.col("timestamp_type").lt(ts))
 
+    val parentPlan = df.queryExecution.executedPlan
+    assert(parentPlan.isInstanceOf[WholeStageCodegenExec])
+    val node = parentPlan.asInstanceOf[WholeStageCodegenExec]
+    val metadata = node.child.asInstanceOf[RowDataSourceScanExec].metadata
+    // The "PushedFilters" part should be exist in Dataframe's
+    // physical plan and the existence of right literals in
+    // "PushedFilters" is used to prove that the predicates
+    // pushing down have been effective.
+    assert(metadata.get("PushedFilters").ne(None))
--- End diff --

nit: Could we use `isDefined` instead of `ne(None)`?
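
Both forms pass for a defined `Option`, but `isDefined` states the intent directly instead of leaning on reference inequality (toy values, for illustration):

```
val pushed: Option[String] = Some("PushedFilters: [...]")
assert(pushed.isDefined)  // idiomatic emptiness check
assert(pushed.ne(None))   // reference inequality; works, but reads as identity
```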





[GitHub] spark issue #19008: [SPARK-21756][SQL]Add JSON option to allow unquoted cont...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19008
  
**[Test build #80916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80916/testReport)** for PR 19008 at commit [`0ddcec3`](https://github.com/apache/spark/commit/0ddcec303fd51be9e5e81f1c74bb23569ef58576).





[GitHub] spark pull request #18985: [SPARK-21772] Fix staging parent directory for In...

2017-08-20 Thread liupc
Github user liupc closed the pull request at:

https://github.com/apache/spark/pull/18985





[GitHub] spark issue #18968: [SPARK-21759][SQL] In.checkInputDataTypes should not wro...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18968
  
**[Test build #80915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80915/testReport)** for PR 18968 at commit [`dd48a9d`](https://github.com/apache/spark/commit/dd48a9d9476c7ba775df2a1764b6311d778891b9).





[GitHub] spark issue #19009: [MINOR][CORE]remove scala 's' function

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19009
  
Can one of the admins verify this patch?





[GitHub] spark issue #17951: [SPARK-20711][ML] Fix incorrect min/max for identical Na...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17951
  
**[Test build #80914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80914/testReport)** for PR 17951 at commit [`44d117a`](https://github.com/apache/spark/commit/44d117ad4ccad698eb331e3d9ac535bd9a438af0).





[GitHub] spark pull request #19009: [MINOR][CORE]remove scala 's' function

2017-08-20 Thread heary-cao
GitHub user heary-cao opened a pull request:

https://github.com/apache/spark/pull/19009

[MINOR][CORE]remove scala 's' function


## What changes were proposed in this pull request?

Remove the Scala `s` string-interpolator prefix from output strings that do not take the value of any variable.
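
An illustrative before/after (not an actual hunk from this PR): the `s` interpolator is a no-op when the string contains no `${...}` substitution.

```
val count = 3
val useful = s"Finished $count tasks" // interpolation does real work; keep `s`
val wasted = s"Shutting down"         // nothing to substitute; `s` is dead weight
val plain  = "Shutting down"          // equivalent plain literal
assert(wasted == plain)
```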

## How was this patch tested?

existing test cases.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/heary-cao/spark scals_s

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19009.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19009


commit 3752839cc6e92c3760bd118094be46ac6e41a788
Author: caoxuewen 
Date:   2017-08-21T03:28:18Z

remove scala 's' when output information without taking the value of the variable







[GitHub] spark issue #18985: [SPARK-21772] Fix staging parent directory for InsertInt...

2017-08-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18985
  
@liupc If you have no more questions about this, can you close this PR?
Thank you.





[GitHub] spark issue #18985: [SPARK-21772] Fix staging parent directory for InsertInt...

2017-08-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18985
  
OK. Thanks @liupc.





[GitHub] spark issue #18985: [SPARK-21772] Fix staging parent directory for InsertInt...

2017-08-20 Thread liupc
Github user liupc commented on the issue:

https://github.com/apache/spark/pull/18985
  
Sorry, I think SPARK-18675 has already solved this problem:
https://issues.apache.org/jira/browse/SPARK-18675

My environment is Hive 0.13 with Spark 2.1.0; two things combined to cause this problem.
First, the issue was not fixed until Spark 2.1.1.
Second, we didn't configure spark.sql.hive.metastore.version (the default is 1.2.1), which caused the problem!
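
For anyone hitting the same mismatch, a minimal sketch of pinning the metastore client version when building the session (the version string and the `maven` jars setting are illustrative; check the supported range for your release):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metastore-version-example")
  .enableHiveSupport()
  // Match the client version to the actual metastore (illustrative value).
  .config("spark.sql.hive.metastore.version", "0.13.1")
  .config("spark.sql.hive.metastore.jars", "maven")
  .getOrCreate()
```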






[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-08-20 Thread yssharma
Github user yssharma commented on the issue:

https://github.com/apache/spark/pull/18029
  
Will wait for @brkyvz , @HyukjinKwon for final ☑️ 





[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18029
  
**[Test build #80913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80913/testReport)** for PR 18029 at commit [`eb7ad56`](https://github.com/apache/spark/commit/eb7ad56b598af5e537e5fa1808dc93b692a14f6f).





[GitHub] spark pull request #18029: [SPARK-20168] [DStream] Add changes to use kinesi...

2017-08-20 Thread yssharma
Github user yssharma commented on a diff in the pull request:

https://github.com/apache/spark/pull/18029#discussion_r134138994
  
--- Diff: 
external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/InitialPosition.scala
 ---
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.streaming.kinesis
+
+import java.util.Date
+
+import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
+
+/**
+ * Trait for Kinesis's InitialPositionInStream.
+ * This will be overridden by more specific types.
+ */
+sealed trait InitialPosition {
+  val initialPositionInStream: InitialPositionInStream
+}
+
+/**
+ * Case object for Kinesis's InitialPositionInStream.LATEST.
+ */
+case object Latest extends InitialPosition {
+  val instance: InitialPosition = this
+  override val initialPositionInStream: InitialPositionInStream
+= InitialPositionInStream.LATEST
+}
+
+/**
+ * Case object for Kinesis's InitialPositionInStream.TRIM_HORIZON.
+ */
+case object TrimHorizon extends InitialPosition {
+  val instance: InitialPosition = this
+  override val initialPositionInStream: InitialPositionInStream
+= InitialPositionInStream.TRIM_HORIZON
+}
+
+/**
+ * Case object for Kinesis's InitialPositionInStream.AT_TIMESTAMP.
+ */
+case class AtTimestamp(timestamp: Date) extends InitialPosition {
+  val instance: InitialPosition = this
+  override val initialPositionInStream: InitialPositionInStream
+= InitialPositionInStream.AT_TIMESTAMP
+}
+
+/**
+ * Companion object for InitialPosition that returns
+ * appropriate version of InitialPositionInStream.
+ */
+object InitialPosition {
+
+  /**
+   * An instance of Latest with InitialPositionInStream.LATEST.
+   * @return [[Latest]]
+   */
+  val latest : InitialPosition = Latest
+
+  /**
+   * An instance of TrimHorizon with InitialPositionInStream.TRIM_HORIZON.
+   * @return [[TrimHorizon]]
+   */
+  val trimHorizon : InitialPosition = TrimHorizon
+
+  /**
+   * Returns instance of AtTimestamp with InitialPositionInStream.AT_TIMESTAMP.
+   * @return [[AtTimestamp]]
+   */
+  def atTimestamp(timestamp: Date) : InitialPosition = AtTimestamp(timestamp)
+
+  /**
+   * Returns instance of [[InitialPosition]] based on the passed [[InitialPositionInStream]].
+   * This method is used in KinesisUtils for translating the InitialPositionInStream
+   * to InitialPosition. This function would be removed when we deprecate the KinesisUtils.
+   *
+   * @return [[InitialPosition]]
+   */
+  def kinesisInitialPositionInStream(
+initialPositionInStream: InitialPositionInStream) : InitialPosition = {
--- End diff --

Added all other review comments. The indentation was making it look weird, 
so I skipped the indentation.
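
For reviewers following along, a minimal usage sketch of the companion object 
in this diff (assuming the KCL dependency is on the classpath; the timestamp 
value is arbitrary):

```scala
import java.util.Date
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// Each case object/class resolves to the matching KCL enum value.
val latest: InitialPosition = InitialPosition.latest
assert(latest.initialPositionInStream == InitialPositionInStream.LATEST)

val at: InitialPosition = InitialPosition.atTimestamp(new Date())
assert(at.initialPositionInStream == InitialPositionInStream.AT_TIMESTAMP)
```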





[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-08-20 Thread yssharma
Github user yssharma commented on the issue:

https://github.com/apache/spark/pull/18029
  
Added review suggestions @budde !





[GitHub] spark issue #18810: [SPARK-21603][SQL]The wholestage codegen will be much sl...

2017-08-20 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/18810
  
Btw, as for merged prs, I'm monitoring TPCDS perf. 
[here](https://docs.google.com/spreadsheets/d/1V8xoKR9ElU-rOXMH84gb5BbLEw0XAPTJY8c8aZeIqus/edit#gid=445143188).
 Also, I previously wrote a script to run TPCDS on pending prs: 
https://github.com/maropu/spark-tpcds-datagen#helper-scripts-for-benchmarks.





[GitHub] spark issue #18906: [SPARK-21692][PYSPARK][SQL] Add nullability support to P...

2017-08-20 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/18906
  
@ptkool Thank you for working on this!
I'd like to ask what your use-case is. Users have historically been 
confused about what nullable means, and we don't think we should give them yet 
another avenue to get it wrong.





[GitHub] spark issue #18810: [SPARK-21603][SQL]The wholestage codegen will be much sl...

2017-08-20 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/18810
  
yea, I'll do that.





[GitHub] spark pull request #18968: [SPARK-21759][SQL] In.checkInputDataTypes should ...

2017-08-20 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18968#discussion_r134136888
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -138,46 +138,80 @@ case class Not(child: Expression)
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
   require(list != null, "list should not be null")
+
+  lazy val valExprs = value match {
+case cns: CreateNamedStruct => cns.valExprs
+case expr => Seq(expr)
+  }
+
+  override lazy val resolved: Boolean = {
+lazy val checkForInSubquery = list match {
+  case (l @ ListQuery(sub, children, _)) :: Nil =>
+// SPARK-21759:
+// It is possible that the subquery plan has more output than value expressions, because
+// the condition expressions in `ListQuery` might use part of the subquery plan's output.
--- End diff --

@dilipbiswal Yeah, you are right. Normally we don't trigger it. It is just 
here in case we ever mess up the query plan.





[GitHub] spark issue #18810: [SPARK-21603][SQL]The wholestage codegen will be much sl...

2017-08-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18810
  
@maropu Interesting. Would you like to benchmark with #18931 too? It is my 
attempt to handle overly long code-gen functions without disabling codegen.





[GitHub] spark pull request #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions ...

2017-08-20 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/18966#discussion_r134134190
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -582,6 +582,15 @@ object SQLConf {
 .intConf
 .createWithDefault(2667)
 
+  val CODEGEN_MAX_CHARS_PER_FUNCTION = 
buildConf("spark.sql.codegen.maxCharactersPerFunction")
--- End diff --

If by "the length of source code" we mean the number of characters rather than 
the number of lines, you are right.
This is because we check the sum of 
[`String.length`](https://github.com/apache/spark/pull/18966/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR785), 
which counts every character, including newlines. Here are two examples.
```abc``` -> 3
```
ab
c
```
-> 4
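
A quick plain-Scala check of that counting (nothing Spark-specific; the values 
come from the two examples above):

```scala
val oneLine = "abc"     // 1 line, length 3
val twoLines = "ab\nc"  // 2 lines, but length 4: '\n' is itself a character
assert(oneLine.length == 3)
assert(twoLines.length == 4)
assert(twoLines.split("\n").length == 2)
```

So the threshold counts characters, and each newline contributes one character 
to the sum.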





[GitHub] spark issue #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Py...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18999
  
**[Test build #80912 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80912/testReport)**
 for PR 18999 at commit 
[`f2608ab`](https://github.com/apache/spark/commit/f2608ab0ca1e64ce97d65bffb62a07935e4b3db8).





[GitHub] spark issue #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Py...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18999
  
**[Test build #80911 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80911/testReport)**
 for PR 18999 at commit 
[`24525bc`](https://github.com/apache/spark/commit/24525bc75c886ead4c88a2b6d899c6f9a3947420).





[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-20 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18748
  
Thanks @MLnick. I have double-checked my test.
Since there is no recommendForUserSubset, my previous test used MLlib's 
MatrixFactorizationModel::predict(RDD[(Int, Int)]), which predicts the ratings of 
many users for many products. The performance of that function is low compared 
with recommendForAll.
This PR calls recommendForAll with a subset of the users, so I agree with your 
test results. Thanks. 
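
For reference, a sketch of the two MLlib calls being compared (these are real 
MLlib APIs; the wiring around them is assumed):

```scala
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Pointwise scoring of explicit (user, product) pairs vs. batch top-k
// recommendation for every user.
def compare(model: MatrixFactorizationModel, pairs: RDD[(Int, Int)]) = {
  val pointwise: RDD[Rating] = model.predict(pairs)    // one rating per pair
  val batchTopK = model.recommendProductsForUsers(10)  // top-10 for all users
  (pointwise, batchTopK)
}
```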





[GitHub] spark issue #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Py...

2017-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18999
  
https://github.com/apache/spark/pull/18999#discussion_r134131441 looks 
hidden. I addressed the other comment.





[GitHub] spark issue #18866: [SPARK-21649][SQL] Support writing data into hive bucket...

2017-08-20 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/18866
  
@cloud-fan 
Would you give some advice on this? That way I can know whether I'm heading in 
the right direction, and I can keep working on it :)





[GitHub] spark pull request #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample AP...

2017-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18999#discussion_r134131441
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -659,19 +659,77 @@ def distinct(self):
 return DataFrame(self._jdf.distinct(), self.sql_ctx)
 
 @since(1.3)
-def sample(self, withReplacement, fraction, seed=None):
+def sample(self, withReplacement=None, fraction=None, seed=None):
 """Returns a sampled subset of this :class:`DataFrame`.
 
+:param withReplacement: Sample with replacement or not (default 
False).
+:param fraction: Fraction of rows to generate, range [0.0, 1.0].
+:param seed: Seed for sampling (default a random seed).
+
 .. note:: This is not guaranteed to provide exactly the fraction 
specified of the total
 count of the given :class:`DataFrame`.
 
->>> df.sample(False, 0.5, 42).count()
-2
-"""
-assert fraction >= 0.0, "Negative fraction value: %s" % fraction
--- End diff --

Hm.. wouldn't it be better to avoid duplicating the requirement check? It looks 
like I would have to replicate:


https://github.com/apache/spark/blob/5ad1796b9fd6bce31bbc1cdc2f607115d2dd0e7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L714-L722

on the Python side. I have been trying to avoid that when the JVM error 
message already makes sense to Python users (as opposed to cases that expose 
non-Pythonic error messages, for example, Java types like `java.lang.Long`), 
although I understand it is better to throw an exception up front before going 
to the JVM.





[GitHub] spark issue #18576: [SPARK-21351][SQL] Update nullability based on children'...

2017-08-20 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/18576
  
ping





[GitHub] spark issue #18810: [SPARK-21603][SQL]The wholestage codegen will be much sl...

2017-08-20 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/18810
  
(copied from JIRA, just in case) Just for your information, I checked 
the TPCDS performance changes before/after pr #18810; the pr affected 
Q17/Q66 only (that is, they have overly long codegen'd functions). The changes 
are as follows (just running TPCDSQueryBenchmark):
Q17: 
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q17.sql
Q66: 
https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q66.sql
 
```
Q17 w/o this pr, 3224.0  --> q17 w/ this pr, 2627.0 (perf. improvement)
Q66 w/o this pr, 1712.0  --> q66 w/ this pr, 3032.0 (perf. regression)
```

It seems these queries have gen'd funcs of 2800~2900 lines, so if we set 
spark.sql.codegen.maxLinesPerFunction to 2900, we could keep the previous 
performance w/o pr18810.
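
To keep the old behavior on such queries, a hedged one-liner (assuming a build 
that includes the `spark.sql.codegen.maxLinesPerFunction` knob from pr18810):

```scala
// Gen'd functions of up to 2900 lines stay whole-stage-codegen'd.
spark.conf.set("spark.sql.codegen.maxLinesPerFunction", "2900")
```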





[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...

2017-08-20 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/18986
  
Yea, since this topic is important for some users, I think we'd better move the 
doc into `./docs/` (I feel novices don't tend to check the code documentation).





[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-08-20 Thread yssharma
Github user yssharma commented on the issue:

https://github.com/apache/spark/pull/18029
  
Will update and post once any other requests come in. Thanks.





[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...

2017-08-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19003#discussion_r134129199
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -763,6 +763,47 @@ class VersionsSuite extends SparkFunSuite with Logging 
{
   }
 }
 
+test(s"$version: read avro file containing decimal") {
+  val url = 
Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
+  val location = new File(url.getFile)
+
+  val tableName = "tab1"
+  val avroSchema =
+"""{
+  |  "name": "test_record",
+  |  "type": "record",
+  |  "fields": [ {
+  |"name": "f0",
+  |"type": [
+  |  "null",
+  |  {
+  |"precision": 38,
+  |"scale": 2,
+  |"type": "bytes",
+  |"logicalType": "decimal"
+  |  }
+  |]
+  |  } ]
+  |}
+""".stripMargin
+  withTable(tableName) {
+versionSpark.sql(
+  s"""
+ |CREATE TABLE $tableName
+ |ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
+ |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true')
+ |STORED AS
+ |  INPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
+ |  OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
+ |LOCATION '$location'
+ |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema')
--- End diff --

For an example like this, which requires users to set `TBLPROPERTIES`, it sounds 
like we are unable to use the `CREATE TABLE USING` command. cc @cloud-fan 





[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

2017-08-20 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18849
  
If `ALTER TABLE` breaks Hive compatibility, the value of this 
flag becomes misleading. Currently, the naming of this flag is pretty general, and 
I expect this flag could be used in other places in the future (besides 
`ALTER TABLE ADD COLUMN`). Introducing a flag is simple, but maintaining the 
flag needs more work. That is why we do not want to introduce extra new 
flags if they are not required.

If we want to introduce such a flag, we also need to ensure its value is 
always accurate. That means we need to follow [what we are doing in the CREATE 
TABLE code 
path](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L364-L374):
 when the Hive metastore complains about a change, we should also set the flag to `false`. 








[GitHub] spark pull request #19008: [SPARK-21756][SQL]Add JSON option to allow unquot...

2017-08-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19008#discussion_r134128223
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala
 ---
@@ -72,6 +72,21 @@ class JsonParsingOptionsSuite extends QueryTest with 
SharedSQLContext {
 assert(df.first().getString(0) == "Reynold Xin")
   }
 
+  test("allowUnquotedControlChars off") {
+val str = """{"name" : " + "a\tb"}"""
--- End diff --

This is corrupted, right? It is different from the string used in the 
following test case.





[GitHub] spark pull request #18966: [SPARK-21751][SQL] CodeGeneraor.splitExpressions ...

2017-08-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18966#discussion_r134128067
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -582,6 +582,15 @@ object SQLConf {
 .intConf
 .createWithDefault(2667)
 
+  val CODEGEN_MAX_CHARS_PER_FUNCTION = 
buildConf("spark.sql.codegen.maxCharactersPerFunction")
--- End diff --

Based on my understanding, this is not the number of characters? This is 
the length of the source code, right? 





[GitHub] spark pull request #18968: [SPARK-21759][SQL] In.checkInputDataTypes should ...

2017-08-20 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/18968#discussion_r134126635
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -138,46 +138,80 @@ case class Not(child: Expression)
 case class In(value: Expression, list: Seq[Expression]) extends Predicate {
 
   require(list != null, "list should not be null")
+
+  lazy val valExprs = value match {
+case cns: CreateNamedStruct => cns.valExprs
+case expr => Seq(expr)
+  }
+
+  override lazy val resolved: Boolean = {
+lazy val checkForInSubquery = list match {
+  case (l @ ListQuery(sub, children, _)) :: Nil =>
+// SPARK-21759:
+// It is possible that the subquery plan has more output than value expressions, because
+// the condition expressions in `ListQuery` might use part of the subquery plan's output.
--- End diff --

So we are adding another criterion for considering an in-subquery expression 
resolved. The new criterion is: 
1) Any additional output attributes that may have been added to the 
subquery plan by the optimizer should have a reference in the originating 
in-subquery expression's children (the children reflect the pulled-up 
correlated predicates).

Just for my understanding, there is no way to trigger this condition from 
our regular code path, right? This is just to guard against any potential 
incorrect rewrites by the optimizer in the future?
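
A stripped-down sketch of that criterion (a hypothetical helper, simplified to 
plain attribute names rather than Spark's attribute references):

```scala
// The subquery may output more columns than there are value expressions,
// but only if every extra column is referenced by the pulled-up correlated
// predicates (the ListQuery's children).
def extraOutputIsResolvable(
    subqueryOutput: Seq[String],
    numValueExprs: Int,
    childReferences: Set[String]): Boolean = {
  val extra = subqueryOutput.drop(numValueExprs)
  extra.forall(childReferences.contains)
}
```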





[GitHub] spark issue #19008: [SPARK-21756][SQL]Add JSON option to allow unquoted cont...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19008
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80910/
Test PASSed.





[GitHub] spark issue #19008: [SPARK-21756][SQL]Add JSON option to allow unquoted cont...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19008
  
Merged build finished. Test PASSed.





[GitHub] spark issue #19008: [SPARK-21756][SQL]Add JSON option to allow unquoted cont...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19008
  
**[Test build #80910 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80910/testReport)**
 for PR 19008 at commit 
[`6f00957`](https://github.com/apache/spark/commit/6f009579687e11f34b26bb2f21883377b88f5b35).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18849: [SPARK-21617][SQL] Store correct table metadata when alt...

2017-08-20 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/18849
  
If the flag is set to true, then whenever an "alter table" command is 
executed, it will follow the "Hive compatible" path, which lets the Hive 
metastore decide whether the change is valid or not. So, to the best of Spark's 
knowledge, compatibility is maintained because Hive did not complain about it. 
No other table metadata (e.g. storage info) is changed by that command.





[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19001
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80908/
Test PASSed.





[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19001
  
Merged build finished. Test PASSed.





[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19001
  
**[Test build #80908 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80908/testReport)**
 for PR 19001 at commit 
[`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  throw new IOException(\"Cannot find class \" + 
inputFormatClassName, e);`
  * `  throw new IOException(\"Unable to find the InputFormat class \" 
+ inputFormatClassName, e);`





[GitHub] spark pull request #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample AP...

2017-08-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18999#discussion_r134123916
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -659,19 +659,77 @@ def distinct(self):
 return DataFrame(self._jdf.distinct(), self.sql_ctx)
 
 @since(1.3)
-def sample(self, withReplacement, fraction, seed=None):
+def sample(self, withReplacement=None, fraction=None, seed=None):
 """Returns a sampled subset of this :class:`DataFrame`.
 
+:param withReplacement: Sample with replacement or not (default 
False).
+:param fraction: Fraction of rows to generate, range [0.0, 1.0].
+:param seed: Seed for sampling (default a random seed).
+
 .. note:: This is not guaranteed to provide exactly the fraction 
specified of the total
 count of the given :class:`DataFrame`.
 
->>> df.sample(False, 0.5, 42).count()
-2
-"""
-assert fraction >= 0.0, "Negative fraction value: %s" % fraction
-seed = seed if seed is not None else random.randint(0, sys.maxsize)
-rdd = self._jdf.sample(withReplacement, fraction, long(seed))
-return DataFrame(rdd, self.sql_ctx)
+.. note:: `fraction` is required and `withReplacement` and `seed` are optional.
+
+>>> df = spark.range(10)
+>>> df.sample(0.5, 3).count()
+4
+>>> df.sample(fraction=0.5, seed=3).count()
+4
+>>> df.sample(withReplacement=True, fraction=0.5, seed=3).count()
+1
+>>> df.sample(1.0).count()
+10
+>>> df.sample(fraction=1.0).count()
+10
+>>> df.sample(False, fraction=1.0).count()
+10
+>>> df.sample("a").count()
+Traceback (most recent call last):
+...
+TypeError:...
+>>> df.sample(seed="abc").count()
+Traceback (most recent call last):
+...
+TypeError:...
--- End diff --

that makes sense! doc tests are examples users can follow





[GitHub] spark pull request #18281: [SPARK-21027][ML][PYTHON] Added tunable paralleli...

2017-08-20 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18281#discussion_r134123840
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -17,29 +17,34 @@
 
 package org.apache.spark.ml.classification
 
-import java.util.{List => JList}
 import java.util.UUID
 
-import scala.collection.JavaConverters._
+import scala.concurrent.Future
+import scala.concurrent.duration.Duration
 import scala.language.existentials
 
 import org.apache.hadoop.fs.Path
 import org.json4s.{DefaultFormats, JObject, _}
 import org.json4s.JsonDSL._
 import org.json4s.jackson.JsonMethods._
 
-import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Since
 import org.apache.spark.ml._
 import org.apache.spark.ml.attribute._
 import org.apache.spark.ml.linalg.Vector
 import org.apache.spark.ml.param.{Param, ParamMap, ParamPair, Params}
+<<< HEAD
--- End diff --

some sort of merge problem? this shouldn't be in a commit





[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-08-20 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/18732
  
cool - this is a bit understated but potentially huge (to me anyway)





[GitHub] spark pull request #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample AP...

2017-08-20 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18999#discussion_r134123764
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -659,19 +659,77 @@ def distinct(self):
 return DataFrame(self._jdf.distinct(), self.sql_ctx)
 
 @since(1.3)
-def sample(self, withReplacement, fraction, seed=None):
+def sample(self, withReplacement=None, fraction=None, seed=None):
 """Returns a sampled subset of this :class:`DataFrame`.
 
+:param withReplacement: Sample with replacement or not (default 
False).
+:param fraction: Fraction of rows to generate, range [0.0, 1.0].
+:param seed: Seed for sampling (default a random seed).
+
 .. note:: This is not guaranteed to provide exactly the fraction 
specified of the total
 count of the given :class:`DataFrame`.
 
->>> df.sample(False, 0.5, 42).count()
-2
-"""
-assert fraction >= 0.0, "Negative fraction value: %s" % fraction
-seed = seed if seed is not None else random.randint(0, sys.maxsize)
-rdd = self._jdf.sample(withReplacement, fraction, long(seed))
-return DataFrame(rdd, self.sql_ctx)
+.. note:: `fraction` is required and `withReplacement` and `seed` are optional.
+
+>>> df = spark.range(10)
+>>> df.sample(0.5, 3).count()
+4
+>>> df.sample(fraction=0.5, seed=3).count()
+4
+>>> df.sample(withReplacement=True, fraction=0.5, seed=3).count()
+1
+>>> df.sample(1.0).count()
+10
+>>> df.sample(fraction=1.0).count()
+10
+>>> df.sample(False, fraction=1.0).count()
+10
+>>> df.sample("a").count()
+Traceback (most recent call last):
+...
+TypeError:...
+>>> df.sample(seed="abc").count()
+Traceback (most recent call last):
+...
+TypeError:...
--- End diff --

Maybe we shouldn't do the error cases here in the doctest, but move them to a 
unit test instead?
Also, these cases aren't really that meaningfully different to me as a user:
```
>>> df.sample(0.5, 3).count()
4
>>> df.sample(fraction=0.5, seed=3).count()
4
>>> df.sample(1.0).count()
10
>>> df.sample(fraction=1.0).count()
10
>>> df.sample(False, fraction=1.0).count()
10
```





[GitHub] spark issue #18953: [SPARK-20682][SQL] Update ORC data source based on Apach...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18953
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18953: [SPARK-20682][SQL] Update ORC data source based on Apach...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18953
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80909/
Test PASSed.





[GitHub] spark pull request #18999: [SPARK-21779][PYTHON] Simpler DataFrame.sample AP...

2017-08-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18999#discussion_r134123358
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -659,19 +659,77 @@ def distinct(self):
 return DataFrame(self._jdf.distinct(), self.sql_ctx)
 
 @since(1.3)
-def sample(self, withReplacement, fraction, seed=None):
+def sample(self, withReplacement=None, fraction=None, seed=None):
 """Returns a sampled subset of this :class:`DataFrame`.
 
+:param withReplacement: Sample with replacement or not (default 
False).
+:param fraction: Fraction of rows to generate, range [0.0, 1.0].
+:param seed: Seed for sampling (default a random seed).
+
 .. note:: This is not guaranteed to provide exactly the fraction 
specified of the total
 count of the given :class:`DataFrame`.
 
->>> df.sample(False, 0.5, 42).count()
-2
-"""
-assert fraction >= 0.0, "Negative fraction value: %s" % fraction
--- End diff --

I'd do the check in Python, so the error message is clearer.





[GitHub] spark issue #18953: [SPARK-20682][SQL] Update ORC data source based on Apach...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18953
  
**[Test build #80909 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80909/testReport)**
 for PR 18953 at commit 
[`63cf876`](https://github.com/apache/spark/commit/63cf87688ae1b47e6adcad4d9ff1784ac321eb12).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by defaul...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19007
  
Merged build finished. Test PASSed.





[GitHub] spark issue #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by defaul...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19007
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80907/
Test PASSed.





[GitHub] spark issue #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by defaul...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19007
  
**[Test build #80907 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80907/testReport)**
 for PR 19007 at commit 
[`2c46907`](https://github.com/apache/spark/commit/2c469074b6ba734aac1384cf42746733baacfd3f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by defaul...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19007
  
Merged build finished. Test PASSed.





[GitHub] spark issue #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by defaul...

2017-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19007
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80906/
Test PASSed.





[GitHub] spark issue #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by defaul...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19007
  
**[Test build #80906 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80906/testReport)**
 for PR 19007 at commit 
[`8756a54`](https://github.com/apache/spark/commit/8756a541f2bdb3c37211d17452f55a930955416d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #19008: [SPARK-21756][SQL]Add JSON option to allow unquoted cont...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19008
  
**[Test build #80910 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80910/testReport)**
 for PR 19008 at commit 
[`6f00957`](https://github.com/apache/spark/commit/6f009579687e11f34b26bb2f21883377b88f5b35).





[GitHub] spark pull request #19008: [SPARK-21756][SQL]Add JSON option to allow unquot...

2017-08-20 Thread vinodkc
GitHub user vinodkc opened a pull request:

https://github.com/apache/spark/pull/19008

[SPARK-21756][SQL]Add JSON option to allow unquoted control characters

## What changes were proposed in this pull request?

This patch adds an allowUnquotedControlChars option to the JSON data source to 
allow JSON strings to contain unquoted control characters (ASCII characters 
with a value less than 32, including tab and line feed characters).

## How was this patch tested?
Added new test cases.
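
For context, a minimal read-side sketch of the proposed option (the option name 
is as introduced by this PR; the path is a placeholder):

```scala
// Accept raw control characters (e.g. a literal tab) inside JSON strings,
// which strict JSON parsing would otherwise reject.
val df = spark.read
  .option("allowUnquotedControlChars", "true")
  .json("/path/to/data.json")
```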



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vinodkc/spark br_fix_SPARK-21756

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19008.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19008


commit 6f009579687e11f34b26bb2f21883377b88f5b35
Author: vinodkc 
Date:   2017-08-20T17:39:27Z

Add JSON option to allow unquoted control characters







[GitHub] spark pull request #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by...

2017-08-20 Thread vinodkc
Github user vinodkc closed the pull request at:

https://github.com/apache/spark/pull/19007





[GitHub] spark issue #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by defaul...

2017-08-20 Thread vinodkc
Github user vinodkc commented on the issue:

https://github.com/apache/spark/pull/19007
  
OK, I'm closing my PR.
Nowadays, Spark JIRA is not showing PR status. That is why I missed your 
PR.





[GitHub] spark issue #18953: [SPARK-20682][SQL] Update ORC data source based on Apach...

2017-08-20 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18953
  
Hi, @cloud-fan . I added `SparkOrcNewRecordReader.java` back to reduce the 
patch size.





[GitHub] spark issue #19007: [SPARK-21783][SQL]Turn on ORC filter push-down by defaul...

2017-08-20 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19007
  
Ur, I made the PR two days ago already, #18991 





[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19001
  
**[Test build #80908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80908/testReport)** for PR 19001 at commit [`02d8711`](https://github.com/apache/spark/commit/02d87119f60db4db3e141b2f72365b09b45d9647).





[GitHub] spark issue #18953: [SPARK-20682][SQL] Update ORC data source based on Apach...

2017-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18953
  
**[Test build #80909 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80909/testReport)** for PR 18953 at commit [`63cf876`](https://github.com/apache/spark/commit/63cf87688ae1b47e6adcad4d9ff1784ac321eb12).





[GitHub] spark issue #19001: [SPARK-19256][SQL] Hive bucketing support

2017-08-20 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19001
  
Retest this please.





[GitHub] spark pull request #19001: [SPARK-19256][SQL] Hive bucketing support

2017-08-20 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19001#discussion_r134120888
  
--- Diff: 
sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/BucketizedSparkRecordReader.java
 ---
@@ -0,0 +1,147 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io;
+
+import org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil;
--- End diff --

I see. Thanks.




