[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18620

I don't understand why `sorted` is slower than `sortBy` - `sortBy` uses `sorted` in its implementation:
```scala
def sortBy[B](f: A => B)(implicit ord: Ordering[B]): Repr = sorted(ord on f)
```
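For reference, a standalone sketch (not from the PR; the values are illustrative) showing that `sortBy` is just `sorted` with a derived `Ordering`, so the delegation by itself should not make one path slower than the other:

```scala
object SortByVsSorted {
  def main(args: Array[String]): Unit = {
    val xs = Seq(3, 1, 2)
    // sortBy(f) expands to sorted(ord on f), so these two calls are equivalent:
    val viaSortBy = xs.sortBy(x => -x)
    val viaSorted = xs.sorted(Ordering.Int.on((x: Int) => -x))
    assert(viaSortBy == viaSorted) // both yield Seq(3, 2, 1)
  }
}
```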
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18655 Merged build finished. Test FAILed.
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18655 Jenkins, retest this please.
[GitHub] spark issue #18637: [SPARK-15526][ML][FOLLOWUP][test-maven] Make JPMML provi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18637 **[Test build #79701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79701/testReport)** for PR 18637 at commit [`9c3ab05`](https://github.com/apache/spark/commit/9c3ab057ad1ff89ab726ea86774692ef22151b49).
[GitHub] spark pull request #18635: [SPARK-21415] Triage scapegoat warnings, part 1
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/18635
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127903537

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
+          // We can't push down non-deterministic conditions.
+          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
--- End diff --

> The other preceding join conditions before the equi-join condition could also impact it. It could be skipped if the preceding join condition is false, right?

No. We evaluate the joining keys first to find matching/not-matching rows, and then evaluate the other join conditions.
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18620 That would make sense. There must be something else going on. Overall, I don't think it is compelling enough evidence to make the `poll` change. (Though as mentioned it's not a huge deal so if others want to do it, no objection)
[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18513 **[Test build #79699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79699/testReport)** for PR 18513 at commit [`990b816`](https://github.com/apache/spark/commit/990b816428f8e5b94c08749650be05a3f52d07db).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18632 ok to test
[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18513 Merged build finished. Test PASSed.
[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18513 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79699/ Test PASSed.
[GitHub] spark pull request #18669: tfidf-new edit
Github user chlyzzo closed the pull request at: https://github.com/apache/spark/pull/18669
[GitHub] spark pull request #18659: [SPARK-21404][PYSPARK][WIP] Simple Python Vectori...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18659#discussion_r127913117

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala ---
@@ -132,6 +135,61 @@ private[sql] object ArrowConverters {
     }
   }
+  private[sql] def fromPayloadIterator(iter: Iterator[ArrowPayload]): Iterator[InternalRow] = {
+    new Iterator[InternalRow] {
+      private val _allocator = new RootAllocator(Long.MaxValue)
+      private var _reader: ArrowFileReader = _
+      private var _root: VectorSchemaRoot = _
+      private var _index = 0
+
+      loadNextBatch()
+
+      override def hasNext: Boolean = _root != null && _index < _root.getRowCount
+
+      override def next(): InternalRow = {
+        val fields = _root.getFieldVectors.asScala
+
+        val genericRowData = fields.map { field =>
+          field.getAccessor.getObject(_index)
+        }.toArray[Any]
--- End diff --

How about using `SpecificInternalRow`? I think it could eliminate some boxing/unboxing. The following is a snippet of this usage:
```scala
val fieldTypes = fields.map { field =>
  field match {
    case _: NullableIntVector => IntegerType
    case _: NullableFloat8Vector => DoubleType
    ...
  }
}
val row = new SpecificInternalRow(fieldTypes)
fields.zipWithIndex.foreach { case (field, i) =>
  field match {
    case v: NullableIntVector => row.setInt(i, v.getAccessor.get(_index))
    case v: NullableFloat8Vector => row.setDouble(i, v.getAccessor.get(_index))
    ...
  }
}
```
[GitHub] spark issue #18665: [SPARK-21446] [SQL] Fix setAutoCommit never executed
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18665 **[Test build #3844 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3844/testReport)** for PR 18665 at commit [`9ba431a`](https://github.com/apache/spark/commit/9ba431a838a16a8371b3d3f6ef028158576f85d2).
[GitHub] spark pull request #15471: [SPARK-17919] Make timeout to RBackend configurab...
Github user QCTW commented on a diff in the pull request: https://github.com/apache/spark/pull/15471#discussion_r127771891

--- Diff: R/pkg/R/backend.R ---
@@ -108,13 +108,27 @@ invokeJava <- function(isStatic, objId, methodName, ...) {
   conn <- get(".sparkRCon", .sparkREnv)
   writeBin(requestMessage, conn)
-  # TODO: check the status code to output error information
   returnStatus <- readInt(conn)
+  handleErrors(returnStatus, conn)
+
+  # Backend will send -1 as keep alive value to prevent various connection timeouts
+  # on very long running jobs. See spark.r.heartBeatInterval
+  while (returnStatus == 1) {
--- End diff --

Shouldn't the returnStatus check have a retry limit to avoid an infinite loop? I hit an infinite loop when it is called by Toree's sparkr_runner.R, with the error message "Failed to connect JVM: Error in socketConnection(host = hostname, port = port, server = FALSE, : argument "timeout" is missing, with no default".
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127897413

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

> One row in side A could match multiple rows in side B. The join conditions could be also evaluated multiple times for the same row in side A, right? Then, if we push it down to the side A, it could also break the number of rand calls, right?

No. Joining keys are evaluated once on each side, and then we simply match the evaluated results.
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18656 No. I meant if there's a CodegenFallback expression, wholestage codegen will not be enabled.
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/18654 retest this please
[GitHub] spark issue #18633: [SPARK-21411][YARN] Lazily create FS within kerberized U...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/18633 LGTM
[GitHub] spark issue #18669: tfidf-new edit
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18669 Can one of the admins verify this patch?
[GitHub] spark issue #18635: [SPARK-21415] Triage scapegoat warnings, part 1
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18635 Merged to master
[GitHub] spark pull request #18637: [SPARK-15526][ML][FOLLOWUP][test-maven] Make JPMM...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/18637#discussion_r127903284

--- Diff: mllib/pom.xml ---
@@ -139,8 +133,38 @@
+ target/scala-${scala.binary.version}/classes
  target/scala-${scala.binary.version}/test-classes
+ org.apache.maven.plugins
+ maven-dependency-plugin
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127918499

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

Most RDBMS systems allow non-deterministic join conditions. To support this correctly in Spark, we need to check how the other systems behave. Once we decide on the rule, we can't break it, so we have to be very careful designing the initial version. At the current stage, I do not think we have the bandwidth to make it perfect. If you want to continue the PR, could you just check how Hive works? Adding an extra flag for Hive users can simplify their migration task. By default, turn it off.
[GitHub] spark pull request #18639: [SPARK-21408][core] Better default number of RPC ...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/18639#discussion_r127898848

--- Diff: core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala ---
@@ -33,7 +33,7 @@ import org.apache.spark.util.ThreadUtils
 /**
  * A message dispatcher, responsible for routing RPC messages to the appropriate endpoint(s).
  */
-private[netty] class Dispatcher(nettyEnv: NettyRpcEnv) extends Logging {
+private[netty] class Dispatcher(nettyEnv: NettyRpcEnv, numUsableCores: Int) extends Logging {
--- End diff --

Should we document the behavior when `numUsableCores` is set to 0 in the comment above?
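One common way to document and implement that convention, sketched here as an assumption rather than taken from the patch, is to treat a non-positive value as "fall back to the JVM's available processors":

```scala
// Hedged sketch: assumes numUsableCores <= 0 means "use all processors visible to the JVM".
private def numDispatcherThreads(numUsableCores: Int): Int = {
  val available =
    if (numUsableCores > 0) numUsableCores
    else Runtime.getRuntime.availableProcessors()
  math.max(1, available)
}
```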
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127901508

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

Could you check the behavior of DB2 and Oracle? This is about the semantics, not just performance; we need to check what the correct behavior is. BTW, `EnsureRequirements` could also add an extra `Sort` below the join. Our implementation was never designed with this support in mind, and many factors could break the assumption.
[GitHub] spark pull request #18667: Fix the simpleString used in error messages
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/18667#discussion_r127903964

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/LongType.scala ---
@@ -43,7 +43,7 @@ class LongType private() extends IntegralType {
   */
  override def defaultSize: Int = 8
- override def simpleString: String = "bigint"
+ override def simpleString: String = "long"
--- End diff --

I don't think so. bigint is the SQL type for an 8-byte integer, right?
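A quick spark-shell check of that mapping, using Spark's public `org.apache.spark.sql.types` constants:

```scala
import org.apache.spark.sql.types.{IntegerType, LongType}

// Catalyst's Long-backed type reports the SQL name for an 8-byte integer.
println(LongType.simpleString)    // "bigint"
println(IntegerType.simpleString) // "int"
```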
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 I am ok to close this. Thanks @MLnick
[GitHub] spark pull request #18632: [SPARK-21412][SQL] Reset BufferHolder while initi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18632#discussion_r127904688

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java ---
@@ -51,6 +51,7 @@ public UnsafeRowWriter(BufferHolder holder, int numFields) {
     this.nullBitsSize = UnsafeRow.calculateBitSetWidthInBytes(numFields);
     this.fixedSize = nullBitsSize + 8 * numFields;
     this.startingOffset = holder.cursor;
+    holder.reset();
--- End diff --

I don't think we guarantee that every call to this `UnsafeRowWriter` constructor happens at the start of a new incoming record. It is possible that we pass a `BufferHolder` into this constructor that has already been written with some data, and we want to continue writing from the current cursor.
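To illustrate the concern, a hedged sketch (the class names are Spark's codegen helpers, but the usage pattern here is illustrative, not code from this PR): an outer row writer and an inner struct writer can share one `BufferHolder`, so an unconditional reset in the `UnsafeRowWriter` constructor would clobber bytes the outer writer has already written.

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeRowWriter}

val row = new UnsafeRow(2)
val holder = new BufferHolder(row)

val outer = new UnsafeRowWriter(holder, 2) // top-level row: resetting here would be fine
outer.write(0, 42L)

// A nested struct writer reuses the same holder and continues from the current cursor;
// calling holder.reset() in this constructor would wipe what `outer` has written.
val inner = new UnsafeRowWriter(holder, 1)
inner.write(0, 1.5d)
```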
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18620 My benchmarks locally said poll() is a little faster on moderately large collections, like 100 elements in the queue. I'm really neutral. If it affords a little help, that's great. It's a natural method for a queue to have and no extra implementation cost.
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Thanks @srowen, my test also said pq.poll is a little faster in some cases. One possible benefit here is that if we provide pq.poll, users' first choice may be pq.poll rather than pq.toArray.sorted, which can cause a performance regression, as I encountered in https://github.com/apache/spark/pull/18624
[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18513 **[Test build #79699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79699/testReport)** for PR 18513 at commit [`990b816`](https://github.com/apache/spark/commit/990b816428f8e5b94c08749650be05a3f52d07db).
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127903005

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
+          // We can't push down non-deterministic conditions.
+          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
--- End diff --

I do not think we can have an easy solution that ensures it always works as you expect. `EnsureRequirements` is just one of the rules that could break it. The other preceding join conditions before the equi-join condition could also impact it. It could be skipped if the preceding join condition is false, right?
[GitHub] spark pull request #18667: Fix the simpleString used in error messages
Github user fxbonnet commented on a diff in the pull request: https://github.com/apache/spark/pull/18667#discussion_r127905854

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/LongType.scala ---
@@ -43,7 +43,7 @@ class LongType private() extends IntegralType {
   */
  override def defaultSize: Int = 8
- override def simpleString: String = "bigint"
+ override def simpleString: String = "long"
--- End diff --

When you try to read a CSV and map it to a case class with a Long, you get a message like this one:

__EXCEPTION__: org.apache.spark.sql.AnalysisException: Cannot up cast linked_docs.`MR_NUMBER_OF_DOCS_UPLOADED` from string to bigint as it may truncate
The type path of the target object is:
- field (class: "scala.Long", name: "MR_NUMBER_OF_DOCS_UPLOADED")

Getting a message that talks about bigint while you are trying to cast a String to a Long looks confusing to me. I thought this was a typo.
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127907280

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
+          // We can't push down non-deterministic conditions.
+          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
--- End diff --

We do not support non-deterministic join conditions. Thus, the current execution order in our join implementation might not behave correctly. If we really need to support this, we have to check what the right behavior is in traditional DB systems.
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127909294

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

I just did a simple test on Oracle. It looks like it allows the following query:

    SELECT * FROM test1 JOIN test2
    ON test1.a + FLOOR(DBMS_RANDOM.VALUE()) = test2.b + FLOOR(DBMS_RANDOM.VALUE());

Furthermore, it also doesn't disallow non-deterministic functions in join conditions other than the joining keys.
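For comparison, a hedged sketch of the analogous query in Spark (`test1` and `test2` are assumed temp views with integer columns `a` and `b`); as of this discussion, Spark rejects such a query during analysis because the join condition is non-deterministic, which is the behavior this PR revisits:

```scala
// Illustrative only; Spark currently rejects this at analysis time because nondeterministic
// expressions are only allowed in Project, Filter, Aggregate or Window operators.
spark.sql(
  """
    |SELECT *
    |FROM test1 JOIN test2
    |  ON test1.a + FLOOR(RAND()) = test2.b + FLOOR(RAND())
  """.stripMargin)
```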
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18632 **[Test build #79702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79702/testReport)** for PR 18632 at commit [`a098540`](https://github.com/apache/spark/commit/a0985404363f2975bf673e37306d0bd1c700a4d0).
[GitHub] spark issue #18468: [SPARK-20873][SQL] Creat CachedBatchColumnVector to abst...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18468 **[Test build #79703 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79703/testReport)** for PR 18468 at commit [`0aa1b78`](https://github.com/apache/spark/commit/0aa1b785a0ed0038cc6a30dbb9334a0ce98992d5).
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79695/ Test FAILed.
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 Yeah, CodegenFallback just provides a fallback mode. However, in this case, SortMergeJoinExec passes an incomplete row as input to a Hive UDF that implements CodegenFallback.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Merged build finished. Test FAILed.
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18655 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79696/ Test FAILed.
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18654 **[Test build #79695 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79695/testReport)** for PR 18654 at commit [`d118d68`](https://github.com/apache/spark/commit/d118d685374242599a12d6536675ba7aeae4bfb7).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/12646 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79697/ Test FAILed.
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127896217

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
+          // We can't push down non-deterministic conditions.
+          case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _)
--- End diff --

The whole thing does not make sense to me at all. Here, I think we are just trying to behave consistently with Hive, although this looks like a bug to me. We really should check how Hive works before supporting it.
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Merged build finished. Test FAILed.
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12646 **[Test build #79697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79697/testReport)** for PR 12646 at commit [`9bb80ea`](https://github.com/apache/spark/commit/9bb80eaf8e0b4339850d8c48e221c8ad1e477552).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18655 **[Test build #79696 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79696/testReport)** for PR 18655 at commit [`8ffedda`](https://github.com/apache/spark/commit/8ffedda9f05d379d700aef95dca049a751374f87).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 I am also very confused about this. You can change https://github.com/apache/spark/pull/18624 to use sorted and test.
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127897096

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

How do `rand(a)` and `rand(b)` share the same state? They are different expression instances.
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18655 **[Test build #79698 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79698/testReport)** for PR 18655 at commit [`8ffedda`](https://github.com/apache/spark/commit/8ffedda9f05d379d700aef95dca049a751374f87).
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127898565

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1912,6 +1913,26 @@ class Analyzer(
         nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
       }.copy(child = newChild)
+      case j: Join if j.condition.isDefined && !j.condition.get.deterministic =>
+        j match {
+          // We can push down non-deterministic joining keys.
--- End diff --

From this line of discussion, it seems to me you are still treating joining keys and other join conditions together. However, pushing down non-deterministic joining keys doesn't actually change the join results, as I said above. I am not sure why it doesn't make sense.
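A hedged DataFrame sketch of that idea (illustrative names, not the PR's code): materialize the non-deterministic key once per row in a projection below the join, then join on the materialized column; each side's key values are computed exactly once per row, so the match result is unchanged.

```scala
import org.apache.spark.sql.functions._

// Conceptually what pulling the non-deterministic joining key below the join means:
val left  = spark.range(100).withColumn("k", floor(rand(1) * 10)) // key computed once per row
val right = spark.range(10).withColumnRenamed("id", "b")
val joined = left.join(right, left("k") === right("b"))           // join on the materialized key
```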
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18655 @BryanCutler I'd like to share the motivation for refactoring `ArrowConverters` and `ColumnWriter`. For `ColumnWriter`, at first I'd like to support complex types like `ArrayType` and `StructType`, so I refactored it based on your `ColumnWriter` implementation. I then renamed and moved the package so that we can also use it for pandas UDFs, as @cloud-fan mentioned. As you might have seen before, I'll introduce `ArrowColumnVector` as a reader for Arrow vectors as well. For `ArrowConverters`, I thought we could skip the intermediate `ArrowRecordBatch` creation in `ArrowConverters.toPayloadIterator()`. What do you think about that? Thanks!
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 My micro benchmark (a standalone program that only tests pq.toArray.sorted, pq.toArray.sortBy, and pq.poll) did not find a significant performance difference. Only in the Spark job is there a big difference. Confused.
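For what it's worth, a minimal standalone sketch of such a micro benchmark, using `java.util.PriorityQueue` as a stand-in for Spark's `BoundedPriorityQueue` (which is backed by it); the sizes and timing helper are illustrative:

```scala
import java.util.PriorityQueue
import scala.collection.JavaConverters._
import scala.util.Random

def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.3f ms")
  result
}

val values = Seq.fill(100)(Random.nextInt()).map(Int.box)

// Drain via repeated poll(): the heap hands elements back already in priority order.
val pollQueue = new PriorityQueue[Integer](values.asJava)
val byPoll = time("poll")(Iterator.continually(pollQueue.poll()).takeWhile(_ != null).toList)

// Copy out and sort: same result, but pays for a copy plus a full comparison sort.
val sortQueue = new PriorityQueue[Integer](values.asJava)
val bySorted = time("toArray.sorted")(sortQueue.asScala.toList.sortBy(_.intValue()))

assert(byPoll == bySorted)
```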
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18654 **[Test build #79700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79700/testReport)** for PR 18654 at commit [`d118d68`](https://github.com/apache/spark/commit/d118d685374242599a12d6536675ba7aeae4bfb7).
[GitHub] spark pull request #18669: tfidf-new edit
GitHub user chlyzzo opened a pull request: https://github.com/apache/spark/pull/18669

tfidf-new edit

## What changes were proposed in this pull request?

I added a TfIdf.scala that can compute the TF-IDF vector of documents. I have a use case computing document similarity, so I used Spark MLlib; the code is as follows:
~~~scala
val hashingTF = new HashingTF()
val tf = hashingTF.transform(dataSeg)
val idfIgnore = new IDF().fit(tf)
val tfidfIgnore = idfIgnore.transform(tf)
val data = docIds.zip(tfidfIgnore) // RDD[(String, Vector)]
~~~
On a small dataset it produces a result but takes a long time; on a big dataset (25 documents) it does not work, and the job does not produce a result within 1 hour. The Spark configuration is:
~~~bash
--driver-memory 8G
--conf spark.yarn.executor.memoryOverhead=6144
--conf spark.akka.frameSize=300
num-executors=20 executor-cores=5 executor-memory=10g
~~~
So I wrote the TF-IDF method myself and tested it on the dataset (25 documents); it produces the result.

## How was this patch tested?

I wrote TfIdf.scala; it computes document TF-IDF values and converts them to vectors, after which you can use cosine similarity.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18669.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18669

commit 7cb566abc27d41d5816dee16c6ecb749da2adf46
Author: Yuming Wang
Date: 2017-05-05T10:31:59Z
[SPARK-19660][SQL] Replace the deprecated property name fs.default.name to fs.defaultFS that newly introduced
## What changes were proposed in this pull request? Replace the deprecated property name `fs.default.name` to `fs.defaultFS` that newly introduced. ## How was this patch tested? Existing tests Author: Yuming Wang Closes #17856 from wangyum/SPARK-19660. (cherry picked from commit 37cdf077cd3f436f777562df311e3827b0727ce7) Signed-off-by: Sean Owen

commit dbb54a7b39568cc9e8046a86113b98c3c69b7d11
Author: jyu00
Date: 2017-05-05T10:36:51Z
[SPARK-20546][DEPLOY] spark-class gets syntax error in posix mode
## What changes were proposed in this pull request? Updated spark-class to turn off posix mode so the process substitution doesn't cause a syntax error. ## How was this patch tested? Existing unit tests, manual spark-shell testing with posix mode on Author: jyu00 Closes #17852 from jyu00/master. (cherry picked from commit 5773ab121d5d7cbefeef17ff4ac6f8af36cc1251) Signed-off-by: Sean Owen

commit 1fa3c86a740e072957a2104dbd02ca3c158c508d
Author: Jarrett Meyer
Date: 2017-05-05T15:30:42Z
[SPARK-20613] Remove excess quotes in Windows executable
## What changes were proposed in this pull request? Quotes are already added to the RUNNER variable on line 54. There is no need to put quotes on line 67. If you do, you will get an error when launching Spark. '""C:\Program' is not recognized as an internal or external command, operable program or batch file. ## How was this patch tested? Tested manually on Windows 10. Author: Jarrett Meyer Closes #17861 from jarrettmeyer/fix-windows-cmd. (cherry picked from commit b9ad2d1916af5091c8585d06ccad8219e437e2bc) Signed-off-by: Felix Cheung

commit f71aea6a0be6eda24623d8563d971687ecd04caf
Author: Yucai
Date: 2017-05-05T16:51:57Z
[SPARK-20381][SQL] Add SQL metrics of numOutputRows for ObjectHashAggregateExec
## What changes were proposed in this pull request? ObjectHashAggregateExec is missing numOutputRows, add this metrics for it. ## How was this patch tested? Added unit tests for the new metrics. Author: Yucai Closes #17678 from yucai/objectAgg_numOutputRows. (cherry picked from commit 41439fd52dd263b9f7d92e608f027f193f461777) Signed-off-by: Xiao Li

commit 24fffacad709c553e0f24ae12a8cca3ab980af3c
Author: Shixiong Zhu
Date: 2017-05-05T18:08:26Z
[SPARK-20603][SS][TEST] Set default number of topic partitions to 1 to reduce the load
## What changes were proposed in this pull request? I checked the logs of https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.2-test-maven-hadoop-2.7/47/ and found it took several seconds to create Kafka internal
[GitHub] spark issue #18669: tfidf-new edit
Github user srowen commented on the issue: https://github.com/apache/spark/pull/18669 @chlyzzo close this
[GitHub] spark pull request #18632: [SPARK-21412][SQL] Reset BufferHolder while initi...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/18632#discussion_r127907518 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java --- @@ -51,6 +51,7 @@ public UnsafeRowWriter(BufferHolder holder, int numFields) { this.nullBitsSize = UnsafeRow.calculateBitSetWidthInBytes(numFields); this.fixedSize = nullBitsSize + 8 * numFields; this.startingOffset = holder.cursor; +holder.reset(); --- End diff -- What do you mean by 'writer is for inner struct'? @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18632: [SPARK-21412][SQL] Reset BufferHolder while initi...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/18632#discussion_r127908258 --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java --- @@ -51,6 +51,7 @@ public UnsafeRowWriter(BufferHolder holder, int numFields) { this.nullBitsSize = UnsafeRow.calculateBitSetWidthInBytes(numFields); this.fixedSize = nullBitsSize + 8 * numFields; this.startingOffset = holder.cursor; +holder.reset(); --- End diff -- @cloud-fan @viirya For your worries, maybe we can move the `holder.reset()` to `BufferHolder`'s constructor. Then the holder will be reset only once, and it's also OK to continue writing from a buffer's current cursor. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18669: tfidf-new edit
Github user chlyzzo commented on the issue: https://github.com/apache/spark/pull/18669 closed, - Original message - From: Sean Owen To: apache/spark Cc: chlyzzo, Mention Subject: Re: [apache/spark] tfidf-new edit (#18669) Date: 2017-07-18 15:41 @chlyzzo close this You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18624 Hi @srowen @MLnick @jkbradley @mengxr @yanboliang Is this change acceptable? If it is, I will update the ALS ML code following this method, and also update the test suite, which is currently too simple to detect ALS errors. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127891910 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- For different join types, I think the joining keys are used to find matching/non-matching rows. Currently I can't think of a case where we can't push down non-deterministic joining keys. Maybe you can show an example? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18656 Will CodegenFallback be used in whole-stage codegen? I think it's not supported. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127894313 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- IIUC, for joining keys, it actually satisfies what you said: they are evaluated in the same order and the same number of times as when we don't push them down. I can't think of an example where that doesn't hold, so may I ask if you have one? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/18654#discussion_r127888746 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala --- @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.test.SharedSQLContext + +class FileFormatWriterSuite extends QueryTest with SharedSQLContext { + + test("empty file should be skipped while write to file") { +withTempPath { dir => --- End diff -- More clear :) No need to create source files in real. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties from s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18668 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127893543 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- The major point here is that pushing down a non-deterministic join condition is safe only when the results are exactly the same before and after the push-down. After we push it down, it will basically be evaluated for each row of that side. Will it be evaluated in the same order and the same number of times if we do not push it down? We can find many different scenarios that break this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12646 **[Test build #79697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79697/testReport)** for PR 12646 at commit [`9bb80ea`](https://github.com/apache/spark/commit/9bb80eaf8e0b4339850d8c48e221c8ad1e477552). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127892847 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- What is the join key? Any definition? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18668: [SPARK-21451][SQL]get `spark.hadoop.*` properties...
GitHub user yaooqinn opened a pull request: https://github.com/apache/spark/pull/18668 [SPARK-21451][SQL]get `spark.hadoop.*` properties from sysProps to hiveconf ## What changes were proposed in this pull request? get `spark.hadoop.*` properties from sysProps to hiveconf ## How was this patch tested? UT You can merge this pull request into a Git repository by running: $ git pull https://github.com/yaooqinn/spark SPARK-21451 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18668.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18668 commit 89d9b86616196fde5d0b3a08fb284e6af6afe588 Author: Kent Yao Date: 2017-07-18T06:41:24Z HiveConf in SparkSQLCLIDriver doesn't respect spark.hadoop.some.hive.variables --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
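A rough sketch of the behavior this PR appears to describe, for illustration only: system properties prefixed with `spark.hadoop.` are copied into the Hive/Hadoop configuration with the prefix stripped. The property names and the plain `Map` stand-in below are assumptions; the PR's actual wiring into `HiveConf` may differ:

```scala
object SparkHadoopProps {
  // Keep only spark.hadoop.* entries and strip the prefix, yielding Hadoop/Hive keys.
  def hadoopPropsFromSysProps(sysProps: Map[String, String]): Map[String, String] =
    sysProps.collect {
      case (k, v) if k.startsWith("spark.hadoop.") => k.stripPrefix("spark.hadoop.") -> v
    }

  def main(args: Array[String]): Unit = {
    val props = Map(
      "spark.hadoop.hive.exec.dynamic.partition" -> "true", // example property, not from the PR
      "spark.master" -> "local[*]"
    )
    println(hadoopPropsFromSysProps(props))
    // Map(hive.exec.dynamic.partition -> true)
  }
}
```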
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127895586 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- However, `rand(a)` and `rand(b)` could share the same state inside of `rand`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user DonnyZone commented on the issue: https://github.com/apache/spark/pull/18656 Hi, @cloud-fan, @vanzin , could you help to take a look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127893995 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. + // We can't push down non-deterministic conditions. + case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _) --- End diff -- Supporting only equi-join does not sound reasonable here. The join condition can be any predicate. How about adding a SQLConf flag for controlling it? We could simply push it down no matter whether its semantics stay the same or not, to make it consistent with Hive. By default, the flag would be turned off. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127895248 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. + // We can't push down non-deterministic conditions. + case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _) --- End diff -- Joining keys can only be equi-join. It is exactly the use case discussed in the dev mailing list, and it's actually useful for those use cases. A general non-deterministic join condition pushdown doesn't make a lot of sense. Predicates like `rand(1) > 0 && rand(11) < 0` can be a serious concern: the join results can be different before and after pushdown. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127895399 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- `rand(a)` and `rand(b)` belong to individual tables, so they are evaluated individually on different tables. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127895419 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- One row on side A could match multiple rows on side B, so the join condition could also be evaluated multiple times for the same row on side A, right? Then, if we push it down to side A, it could also change the number of `rand` calls, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
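To make the evaluation-count concern above concrete, here is a toy, pure-Scala nested-loop "join" (not Spark code; all names are illustrative). A non-deterministic key expression evaluated inside the join condition runs once per compared pair, while the pushed-down version materializes it once per left row, so the two plans invoke the expression a different number of times:

```scala
import scala.util.Random

object NonDeterministicKeyPushdown {
  def main(args: Array[String]): Unit = {
    val left  = Seq(1, 2, 3)
    val right = Seq(1, 1, 1, 2)   // one left row can match several right rows
    var evalsInCondition = 0
    var evalsPushedDown  = 0

    // Key expression evaluated inside the join condition: once per compared pair.
    val rng1 = new Random(42)
    val matches1 =
      for (a <- left; b <- right; if { evalsInCondition += 1; a + rng1.nextInt(2) == b }) yield (a, b)

    // "Pushed down": the key is materialized once per left row before joining.
    val rng2 = new Random(42)
    val leftKeyed = left.map { a => evalsPushedDown += 1; (a, a + rng2.nextInt(2)) }
    val matches2 = for ((a, k) <- leftKeyed; b <- right; if k == b) yield (a, b)

    println(s"evaluations in condition: $evalsInCondition -> ${matches1.size} matches") // 12 evaluations
    println(s"evaluations when pushed:  $evalsPushedDown -> ${matches2.size} matches")  // 3 evaluations
  }
}
```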
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18654 **[Test build #79695 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79695/testReport)** for PR 18654 at commit [`d118d68`](https://github.com/apache/spark/commit/d118d685374242599a12d6536675ba7aeae4bfb7). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Hi @MLnick , @srowen . My tests show that pq.poll is not significantly faster than pq.toArray.sortBy, but it is significantly faster than pq.toArray.sorted. It seems that not every pq.toArray.sorted (such as the one used in topByKey) can be replaced by pq.toArray.sortBy, so replacing pq.toArray.sorted with pq.poll is still a benefit. You can compare the performance of pq.sorted, pq.sortBy, and pq.poll using: https://github.com/apache/spark/pull/18624 The performance of pq.toArray.sortBy is about the same as pq.poll, roughly a 20% improvement over pq.toArray.sorted. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
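For readers who want to see the three approaches side by side, here is a standalone sketch. Spark's `BoundedPriorityQueue` is internal, so `scala.collection.mutable.PriorityQueue` stands in for it here (an assumption); the sketch only shows the shapes of the calls being compared, not their relative performance:

```scala
import scala.collection.mutable.PriorityQueue

object PqDrainSketch {
  def main(args: Array[String]): Unit = {
    type Rec = (Int, Double)                                   // (id, score)
    val byScore: Ordering[Rec] = Ordering.by[Rec, Double](_._2)

    // A top-K queue usually keeps the smallest score at the head; the reversed
    // ordering makes this PriorityQueue behave like that min-heap.
    val pq = PriorityQueue((1, 0.3), (2, 0.9), (3, 0.1))(byScore.reverse)

    // 1) toArray.sorted: needs an Ordering for the whole element type.
    val viaSorted = pq.clone().toArray.sorted(byScore)

    // 2) toArray.sortBy: sorts by an extracted key, here the score.
    val viaSortBy = pq.clone().toArray.sortBy(_._2)

    // 3) poll-style drain: dequeue repeatedly; elements come out in ascending score order.
    val viaDrain = Array.fill(pq.size)(pq.dequeue())

    println(viaSorted.toList)
    println(viaSortBy.toList)
    println(viaDrain.toList)
  }
}
```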
[GitHub] spark issue #18655: [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18655 **[Test build #79696 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79696/testReport)** for PR 18655 at commit [`8ffedda`](https://github.com/apache/spark/commit/8ffedda9f05d379d700aef95dca049a751374f87). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127893174 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- We use `ExtractEquiJoinKeys` to extract joining keys. You can check it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18555: [SPARK-21353][CORE]add checkValue in spark.internal.conf...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/18555 @gatorsmile Could you please review this code again? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127894772 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- Even for equi-join, what about `rand(a) = rand(b)`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18468: [SPARK-20873][SQL] Creat CachedBatchColumnVector ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18468#discussion_r127962023 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/CachedBatchColumnVector.java --- @@ -0,0 +1,421 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.vectorized; + +import java.nio.ByteBuffer; + +import org.apache.spark.memory.MemoryMode; +import org.apache.spark.sql.execution.columnar.*; +import org.apache.spark.sql.types.*; +import org.apache.spark.unsafe.types.UTF8String; + +/** + * A column backed by an in memory JVM array. + */ +public final class CachedBatchColumnVector extends ColumnVector implements java.io.Serializable { + + // keep compressed data + private byte[] buffer; + + // whether a row is already extracted or not. If extractTo() is called, set true + // e.g. when isNullAt() and getInt() ara called, extractTo() must be called only once + private boolean[] calledExtractTo; + + // accessor for a column + private transient ColumnAccessor columnAccessor; + + // a row where the compressed data is extracted + private transient ColumnVector columnVector; + + // an accessor uses only row 0 in columnVector + private final int ROWID = 0; + + + public CachedBatchColumnVector(byte[] buffer, int numRows, DataType type) { +super(numRows, DataTypes.NullType, MemoryMode.ON_HEAP); +initialize(buffer, type); +reserveInternal(numRows); +reset(); + } + + @Override + public long valuesNativeAddress() { +throw new RuntimeException("Cannot get native address for on heap column"); + } + @Override + public long nullsNativeAddress() { +throw new RuntimeException("Cannot get native address for on heap column"); + } + + @Override + public void close() { + } + + private void setColumnAccessor() { +ByteBuffer byteBuffer = ByteBuffer.wrap(buffer); +columnAccessor = ColumnAccessor$.MODULE$.apply(type, byteBuffer); +calledExtractTo = new boolean[capacity]; + } + + // call extractTo() before getting actual data + private void prepareAccess(int rowId) { +if (!calledExtractTo[rowId]) { + assert (columnAccessor.hasNext()); + columnAccessor.extractTo(columnVector, ROWID); + calledExtractTo[rowId] = true; +} + } + + // + // APIs dealing with nulls + // + + @Override + public void putNotNull(int rowId) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNull(int rowId) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNulls(int rowId, int count) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNotNulls(int rowId, int count) { +throw new UnsupportedOperationException(); + } + + @Override + public boolean isNullAt(int rowId) { +prepareAccess(rowId); +return columnVector.isNullAt(ROWID); + } + 
+ // + // APIs dealing with Booleans + // + + @Override + public void putBoolean(int rowId, boolean value) { +throw new UnsupportedOperationException(); + } + + @Override + public void putBooleans(int rowId, int count, boolean value) { +throw new UnsupportedOperationException(); + } + + @Override + public boolean getBoolean(int rowId) { --- End diff -- We do not support reading values in a random order. This is because implementation of `CompressionScheme` (e.g. `IntDelta`) supports only sequential access. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working,
[GitHub] spark issue #18656: [SPARK-21441]Incorrect Codegen in SortMergeJoinExec resu...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/18656 I think the check for `SortMergeJoinExec` in `insertInputAdapter` should be corrected to:
  private def insertInputAdapter(plan: SparkPlan): SparkPlan = plan match {
    case p if !supportCodegen(p) =>
      // collapse them recursively
      InputAdapter(insertWholeStageCodegen(p))
    case j @ SortMergeJoinExec(_, _, _, _, left, right) =>
      // The children of SortMergeJoin should do codegen separately.
      j.copy(left = InputAdapter(insertWholeStageCodegen(left)),
        right = InputAdapter(insertWholeStageCodegen(right)))
    case p =>
      p.withNewChildren(p.children.map(insertInputAdapter))
  }
Can you try it? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127965749 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. --- End diff -- Sure. I agreed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18652: [WIP] Pull non-deterministic joining keys from Jo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18652#discussion_r127965550 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -1912,6 +1913,26 @@ class Analyzer( nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e) }.copy(child = newChild) + case j: Join if j.condition.isDefined && !j.condition.get.deterministic => +j match { + // We can push down non-deterministic joining keys. + // We can't push down non-deterministic conditions. + case ExtractEquiJoinKeys(_, leftKeys, rightKeys, conditions, _, _) --- End diff -- cc @cloud-fan and @hvanhovell if you have more insights that can be shared with us about this part. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18468: [SPARK-20873][SQL] Creat CachedBatchColumnVector ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18468#discussion_r127969028 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/CachedBatchColumnVector.java --- @@ -0,0 +1,421 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.vectorized; + +import java.nio.ByteBuffer; + +import org.apache.spark.memory.MemoryMode; +import org.apache.spark.sql.execution.columnar.*; +import org.apache.spark.sql.types.*; +import org.apache.spark.unsafe.types.UTF8String; + +/** + * A column backed by an in memory JVM array. + */ +public final class CachedBatchColumnVector extends ColumnVector implements java.io.Serializable { + + // keep compressed data + private byte[] buffer; + + // whether a row is already extracted or not. If extractTo() is called, set true + // e.g. when isNullAt() and getInt() ara called, extractTo() must be called only once + private boolean[] calledExtractTo; + + // accessor for a column + private transient ColumnAccessor columnAccessor; + + // a row where the compressed data is extracted + private transient ColumnVector columnVector; + + // an accessor uses only row 0 in columnVector + private final int ROWID = 0; + + + public CachedBatchColumnVector(byte[] buffer, int numRows, DataType type) { +super(numRows, DataTypes.NullType, MemoryMode.ON_HEAP); +initialize(buffer, type); +reserveInternal(numRows); +reset(); + } + + @Override + public long valuesNativeAddress() { +throw new RuntimeException("Cannot get native address for on heap column"); + } + @Override + public long nullsNativeAddress() { +throw new RuntimeException("Cannot get native address for on heap column"); + } + + @Override + public void close() { + } + + private void setColumnAccessor() { +ByteBuffer byteBuffer = ByteBuffer.wrap(buffer); +columnAccessor = ColumnAccessor$.MODULE$.apply(type, byteBuffer); +calledExtractTo = new boolean[capacity]; + } + + // call extractTo() before getting actual data + private void prepareAccess(int rowId) { +if (!calledExtractTo[rowId]) { + assert (columnAccessor.hasNext()); + columnAccessor.extractTo(columnVector, ROWID); + calledExtractTo[rowId] = true; +} + } + + // + // APIs dealing with nulls + // + + @Override + public void putNotNull(int rowId) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNull(int rowId) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNulls(int rowId, int count) { +throw new UnsupportedOperationException(); + } + + @Override + public void putNotNulls(int rowId, int count) { +throw new UnsupportedOperationException(); + } + + @Override + public boolean isNullAt(int rowId) { +prepareAccess(rowId); +return columnVector.isNullAt(ROWID); + } + 
+ // + // APIs dealing with Booleans + // + + @Override + public void putBoolean(int rowId, boolean value) { +throw new UnsupportedOperationException(); + } + + @Override + public void putBooleans(int rowId, int count, boolean value) { +throw new UnsupportedOperationException(); + } + + @Override + public boolean getBoolean(int rowId) { --- End diff -- I see. I will add code to track access order for each getter. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79704/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18641: [SPARK-21413][SQL] Fix 64KB JVM bytecode limit pr...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18641#discussion_r127982825 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala --- @@ -273,12 +274,26 @@ case class CaseWhenCodegen( val cases = branches.map { case (condExpr, valueExpr) => val cond = condExpr.genCode(ctx) val res = valueExpr.genCode(ctx) + val (condFunc, condIsNull, condValue, resFunc, resIsNull, resValue ) = +if ((cond.code.length + res.code.length) > 1024 && --- End diff -- Ah, got it. You mean that we have to split super deeply-nested if-then-else statements into multiple methods, too. I will work for that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
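As an illustration of the splitting idea being discussed, and not the actual Spark codegen API, the toy generator below emits each branch body as its own helper method and keeps only small calls inside the nested conditional, which is the general way to keep any one generated method under the 64KB bytecode limit; all names here are hypothetical:

```scala
object SplitNestedBranches {
  // branches: (conditionCode, valueCode) pairs; returns (helper method declarations, dispatch expression).
  def genCaseWhen(branches: Seq[(String, String)], elseValue: String): (String, String) = {
    // One small helper method per branch body, so no single method grows too large.
    val helpers = branches.zipWithIndex.map { case ((_, value), i) =>
      s"private double branch_$i() { return $value; }"
    }
    // The nested conditional now only contains cheap calls to the helpers.
    val dispatch = branches.zipWithIndex.foldRight(elseValue) { (branchWithIdx, rest) =>
      val ((cond, _), i) = branchWithIdx
      s"($cond ? branch_$i() : $rest)"
    }
    (helpers.mkString("\n"), dispatch)
  }

  def main(args: Array[String]): Unit = {
    val (decls, expr) = genCaseWhen(Seq("x > 0" -> "x * 2.0", "x < -10" -> "0.0"), "x")
    println(decls)
    println(s"double result = $expr;")
  }
}
```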
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18654 **[Test build #79704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79704/testReport)** for PR 18654 at commit [`d118d68`](https://github.com/apache/spark/commit/d118d685374242599a12d6536675ba7aeae4bfb7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18670: [SPARK-21455][CORE]RpcFailure should be call on RpcRespo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18670 ok to test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18670: [SPARK-21455][CORE]RpcFailure should be call on RpcRespo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18670 **[Test build #79706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79706/testReport)** for PR 18670 at commit [`962b605`](https://github.com/apache/spark/commit/962b6059bcc9f5b54a4e01351993982ef7bab9f1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18670: [SPARK-21455][CORE]RpcFailure should be call on R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18670#discussion_r127988207 --- Diff: core/src/test/scala/org/apache/spark/rpc/RpcEnvSuite.scala --- @@ -624,7 +624,9 @@ abstract class RpcEnvSuite extends SparkFunSuite with BeforeAndAfterAll { val e = intercept[SparkException] { ThreadUtils.awaitResult(f, 1 seconds) } - assert(e.getCause.isInstanceOf[NotSerializableException]) + assert(e.getCause.isInstanceOf[RuntimeException]) --- End diff -- why the exception type changed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18654 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79700/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18305: [SPARK-20988][ML] Logistic regression uses aggreg...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/18305#discussion_r127934107 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -598,8 +598,23 @@ class LogisticRegression @Since("1.2.0") ( val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) val bcFeaturesStd = instances.context.broadcast(featuresStd) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), - $(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial, +val getAggregatorFunc = new LogisticAggregator(bcFeaturesStd, numClasses, $(fitIntercept), + multinomial = isMultinomial)(_) +val getFeaturesStd = (j: Int) => if (j >= 0 && j < numCoefficientSets * numFeatures) { + featuresStd(j / numCoefficientSets) +} else { + 0.0 +} + +val regularization = if (regParamL2 != 0.0) { + val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets --- End diff -- The intercepts are appended to the coefficient vectors, so the `idx` for intercept will be `>= numFeatures * numCoefficientSets`. Hence this function ignores intercept reg. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
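A small standalone sketch of how the quoted snippet lays out the flattened coefficients (this layout is my reading of the code above, an assumption rather than a verified description of the ML internals): there are `numCoefficientSets` entries per feature, and the intercepts, when fit, occupy the trailing indices `>= numFeatures * numCoefficientSets`, which `shouldApply` filters out of regularization:

```scala
object CoefficientLayout {
  def main(args: Array[String]): Unit = {
    val numFeatures = 3
    val numCoefficientSets = 2           // number of classes for a multinomial model
    val fitIntercept = true
    val totalLen = numFeatures * numCoefficientSets + (if (fitIntercept) numCoefficientSets else 0)

    // Mirrors the predicates in the quoted diff.
    val shouldApply = (idx: Int) => idx >= 0 && idx < numFeatures * numCoefficientSets
    val featuresStd = Array(0.5, 1.0, 2.0)
    val getFeaturesStd = (j: Int) =>
      if (j >= 0 && j < numCoefficientSets * numFeatures) featuresStd(j / numCoefficientSets) else 0.0

    (0 until totalLen).foreach { idx =>
      val kind = if (shouldApply(idx)) s"feature ${idx / numCoefficientSets}" else "intercept"
      println(s"idx=$idx -> $kind, std=${getFeaturesStd(idx)}")
    }
    // Indices 0..5 map to features and are regularized; indices 6 and 7 are intercepts and are skipped.
  }
}
```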
[GitHub] spark issue #18305: [SPARK-20988][ML] Logistic regression uses aggregator hi...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/18305 @sethah IMO we should back out the test-related bc var explicit destroy code as it complicates things. I hear that this _may_ help catch bugs... but frankly I'm not convinced. Because the code setup & path in the source may not be quite the same as in the tests (almost never I'd say), I don't believe you will necessarily catch bugs such as the one mentioned by Yanbo. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18632 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79702/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18632 **[Test build #79702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79702/testReport)** for PR 18632 at commit [`a098540`](https://github.com/apache/spark/commit/a0985404363f2975bf673e37306d0bd1c700a4d0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18632: [SPARK-21412][SQL] Reset BufferHolder while initialize a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18632 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org