[GitHub] spark issue #14026: [SPARK-13569][STREAMING][KAFKA] pattern based topic subs...

2016-07-01 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/14026
  
@tdas @zsxwing This should be the last ConsumerStrategy implementation needed for basic parity with what the Kafka consumer offers; anything else should probably be handled by user subclasses.

If the KAFKA-3370 workaround isn't clear: the basic issue is that you have to poll in order to get partition assignments before setting a position, but if you poll with auto.offset.reset set to none, it will throw an exception because you don't have a position yet :)
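A minimal sketch of that workaround pattern, assuming the plain Kafka 0.10 consumer API (the helper name and structure are illustrative, not the PR's actual code):

```scala
import org.apache.kafka.clients.consumer.{Consumer, NoOffsetForPartitionException}
import org.apache.kafka.common.TopicPartition

// Hypothetical helper: force partition assignment, then set positions explicitly.
def assignThenSeek[K, V](consumer: Consumer[K, V], offsets: Map[TopicPartition, Long]): Unit = {
  try {
    // poll() is what triggers partition assignment...
    consumer.poll(0)
  } catch {
    // ...but with auto.offset.reset=none it throws, because no position exists yet.
    case _: NoOffsetForPartitionException =>
  }
  // Partitions are now assigned, so positions can be set without polling again.
  offsets.foreach { case (tp, off) => consumer.seek(tp, off) }
}
```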





[GitHub] spark issue #14026: [SPARK-13569][STREAMING][KAFKA] pattern based topic subs...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14026
  
**[Test build #61649 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61649/consoleFull)**
 for PR 14026 at commit 
[`796045f`](https://github.com/apache/spark/commit/796045ff3ad53afeb56cdddb69c4770090f7c168).





[GitHub] spark pull request #14026: [SPARK-13569][STREAMING][KAFKA] pattern based top...

2016-07-01 Thread koeninger
GitHub user koeninger opened a pull request:

https://github.com/apache/spark/pull/14026

[SPARK-13569][STREAMING][KAFKA] pattern based topic subscription

## What changes were proposed in this pull request?
Allow Kafka topic subscriptions based on a regex pattern.

## How was this patch tested?
Unit tests, manual tests
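For context, a minimal sketch of what pattern-based subscription looks like from the user side with the Kafka 0.10 direct stream (an existing `streamingContext` is assumed; parameter values are illustrative):

```scala
import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")

// Subscribe to every topic matching the regex instead of a fixed topic list.
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("topic-.*"), kafkaParams))
```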

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/koeninger/spark-1 SPARK-13569

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14026.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14026


commit 796045ff3ad53afeb56cdddb69c4770090f7c168
Author: cody koeninger 
Date:   2016-07-02T05:17:29Z

[SPARK-13569][STREAMING][KAFKA] pattern based topic subscription







[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372870
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala ---
@@ -0,0 +1,298 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeArrayWriter}
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark [[UnsafeArrayDataBenchmark]] for UnsafeArrayData
+ * To run this:
+ *  build/sbt "sql/test-only *benchmark.UnsafeArrayDataBenchmark"
+ *
+ * Benchmarks in this file are skipped in normal builds.
+ */
+class UnsafeArrayDataBenchmark extends BenchmarkBase {
+
+  new SparkConf()
+.setMaster("local[1]")
+.setAppName("microbenchmark")
+.set("spark.driver.memory", "3g")
+
+  def calculateHeaderPortionInBytes(count: Int) : Int = {
+// Use this assignment for SPARK-15962
+// val size = 4 + 4 * count
+val size = UnsafeArrayData.calculateHeaderPortionInBytes(count)
+size
+  }
+
+  def readUnsafeArray(iters: Int): Unit = {
+val count = 1024 * 1024 * 16
+
+val intUnsafeArray = new UnsafeArrayData
+var intResult: Int = 0
+val intSize = calculateHeaderPortionInBytes(count) + 4 * count
+val intBuffer = new Array[Byte](intSize)
+Platform.putInt(intBuffer, Platform.BYTE_ARRAY_OFFSET, count)
+intUnsafeArray.pointTo(intBuffer, Platform.BYTE_ARRAY_OFFSET, intSize)
+val readIntArray = { i: Int =>
+  var n = 0
+  while (n < iters) {
+val len = intUnsafeArray.numElements
+var sum = 0.toInt
+var i = 0
+while (i < len) {
+  sum += intUnsafeArray.getInt(i)
+  i += 1
+}
+intResult = sum
+n += 1
+  }
+}
+
+val doubleUnsafeArray = new UnsafeArrayData
+var doubleResult: Double = 0
+val doubleSize = calculateHeaderPortionInBytes(count) + 8 * count
+val doubleBuffer = new Array[Byte](doubleSize)
+Platform.putInt(doubleBuffer, Platform.BYTE_ARRAY_OFFSET, count)
+doubleUnsafeArray.pointTo(doubleBuffer, Platform.BYTE_ARRAY_OFFSET, doubleSize)
+val readDoubleArray = { i: Int =>
+  var n = 0
+  while (n < iters) {
+val len = doubleUnsafeArray.numElements
+var sum = 0.toDouble
+var i = 0
+while (i < len) {
+  sum += doubleUnsafeArray.getDouble(i)
+  i += 1
+}
+doubleResult = sum
+n += 1
+  }
+}
+
+val benchmark = new Benchmark("Read UnsafeArrayData", count * iters)
+benchmark.addCase("Int")(readIntArray)
+benchmark.addCase("Double")(readDoubleArray)
+benchmark.run
+/*
+Without SPARK-15962
+OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.0.4-301.fc22.x86_64
+Intel Xeon E3-12xx v2 (Ivy Bridge)
+Read UnsafeArrayData:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+-----------------------------------------------------------------------------------------------
+Int                                           370 /  471        454.0           2.2       1.0X
+Double                                        351 /  466        477.5           2.1       1.1X
+*/
+/*
+With SPARK-15962
--- End diff --

only put the current result here, i.e. with this PR, and put the result 
without this PR in a PR comment.

A benchmark is used to show the performance of the current code, not the 
improvement from some patch; otherwise it will be very hard to keep the 
results up to date.

[GitHub] spark issue #14025: [WIP][DOC] update out-of-date code snippets using SQLCon...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14025
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14025: [WIP][DOC] update out-of-date code snippets using SQLCon...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14025
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61648/
Test PASSed.





[GitHub] spark issue #14025: [WIP][DOC] update out-of-date code snippets using SQLCon...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14025
  
**[Test build #61648 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61648/consoleFull)**
 for PR 14025 at commit 
[`72aa185`](https://github.com/apache/spark/commit/72aa185246ee42a3a597de258734cbf6312d1979).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372787
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala ---
@@ -0,0 +1,298 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeArrayWriter}
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark [[UnsafeArrayDataBenchmark]] for UnsafeArrayData
+ * To run this:
+ *  build/sbt "sql/test-only *benchmark.UnsafeArrayDataBenchmark"
+ *
+ * Benchmarks in this file are skipped in normal builds.
+ */
+class UnsafeArrayDataBenchmark extends BenchmarkBase {
+
+  new SparkConf()
+.setMaster("local[1]")
+.setAppName("microbenchmark")
+.set("spark.driver.memory", "3g")
+
+  def calculateHeaderPortionInBytes(count: Int) : Int = {
+// Use this assignment for SPARK-15962
+// val size = 4 + 4 * count
+val size = UnsafeArrayData.calculateHeaderPortionInBytes(count)
+size
+  }
+
+  def readUnsafeArray(iters: Int): Unit = {
--- End diff --

I think we need to benchmark 4 cases:

1. normal write: generate a random int array and use encoder to turn it 
into array data, e.g.
```
val array: Array[Int] = ...
val encoder = ExpressionEncoder[Array[Int]].resolveAndBind()
encoder.toRow(array) // benchmark it.
```
2. from primitive array: benchmark the `UnsafeArrayData.fromPrimitiveArray`
3. normal read: generate random array, turn it into array data by encoder, 
then benchmark the element reading, e.g. `getInt`
4. to primitive array: benchmark the `UnsafeArrayData.toIntArray`, etc.
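For reference, a minimal sketch of wiring those four cases into the existing `Benchmark` harness (the count and case names are illustrative, not the PR's final code):

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
import org.apache.spark.util.Benchmark

val count = 1024 * 1024
val intArray = Array.fill(count)(scala.util.Random.nextInt())
val encoder = ExpressionEncoder[Array[Int]].resolveAndBind()
val arrayData = encoder.toRow(intArray).getArray(0)

val benchmark = new Benchmark("UnsafeArrayData", count)
// 1. normal write: encode a random primitive array into unsafe format
benchmark.addCase("write via encoder") { _ => encoder.toRow(intArray) }
// 2. from primitive array
benchmark.addCase("fromPrimitiveArray") { _ => UnsafeArrayData.fromPrimitiveArray(intArray) }
// 3. normal read: element-wise getInt over already-encoded data
benchmark.addCase("read via getInt") { _ =>
  var i = 0
  var sum = 0L
  while (i < arrayData.numElements()) { sum += arrayData.getInt(i); i += 1 }
}
// 4. to primitive array
benchmark.addCase("toIntArray") { _ => arrayData.toIntArray() }
benchmark.run()
```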





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372697
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/UnsafeArrayDataBenchmark.scala ---
@@ -0,0 +1,298 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.catalyst.expressions.{UnsafeArrayData, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeArrayWriter}
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark [[UnsafeArrayDataBenchmark]] for UnsafeArrayData
+ * To run this:
+ *  build/sbt "sql/test-only *benchmark.UnsafeArrayDataBenchmark"
+ *
+ * Benchmarks in this file are skipped in normal builds.
+ */
+class UnsafeArrayDataBenchmark extends BenchmarkBase {
+
+  new SparkConf()
+.setMaster("local[1]")
+.setAppName("microbenchmark")
+.set("spark.driver.memory", "3g")
--- End diff --

does this really work? Creating a SparkConf and just leaving it there?





[GitHub] spark issue #14025: [WIP][DOC] update out-of-date code snippets using SQLCon...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14025
  
**[Test build #61648 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61648/consoleFull)**
 for PR 14025 at commit 
[`72aa185`](https://github.com/apache/spark/commit/72aa185246ee42a3a597de258734cbf6312d1979).





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372615
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala ---
@@ -192,26 +192,30 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]
 val fixedElementSize = et match {
   case t: DecimalType if t.precision <= Decimal.MAX_LONG_DIGITS => 8
   case _ if ctx.isPrimitiveType(jt) => et.defaultSize
-  case _ => 0
+  case _ => 8
--- End diff --

please add a comment about why it's 8, e.g. we store the offset and length 
for the element, etc.





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372604
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala ---
@@ -192,26 +192,30 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafeProjection]
 val fixedElementSize = et match {
--- End diff --

We should update the name as its meaning has changed. How about 
`elementOrOffsetSize`?





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372533
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java ---
@@ -126,11 +187,11 @@ public void write(int ordinal, Decimal input, int precision, int scale) {
 // Write the bytes to the variable length portion.
 Platform.copyMemory(
   bytes, Platform.BYTE_ARRAY_OFFSET, holder.buffer, holder.cursor, bytes.length);
-setOffset(ordinal);
+write(ordinal, ((long)(holder.cursor - startingOffset) << 32) | ((long) bytes.length));
--- End diff --

We should abstract it into a method, like `UnsafeRowWriter.setOffsetAndSize`.
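For illustration, a Scala sketch of the packing such a helper performs, mirroring `UnsafeRowWriter.setOffsetAndSize` (the standalone function here is illustrative, not the actual Java code):

```scala
// Pack a 32-bit offset (relative to the start of this array's region) into
// the high bits and a 32-bit size into the low bits of one long.
def offsetAndSize(cursor: Int, startingOffset: Int, size: Int): Long =
  ((cursor - startingOffset).toLong << 32) | (size.toLong & 0xFFFFFFFFL)
```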





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372522
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java ---
@@ -33,91 +38,147 @@
   // The offset of the global buffer where we start to write this array.
   private int startingOffset;
 
+  // The number of elements in this array
+  private int numElements;
+
+  private int headerInBytes;
+
+  private void assertIndexIsValid(int index) {
+assert index >= 0 : "index (" + index + ") should >= 0";
+assert index < numElements : "index (" + index + ") should < " + numElements;
+  }
+
   public void initialize(BufferHolder holder, int numElements, int fixedElementSize) {
-// We need 4 bytes to store numElements and 4 bytes each element to store offset.
-final int fixedSize = 4 + 4 * numElements;
+this.numElements = numElements;
+this.headerInBytes = calculateHeaderPortionInBytes(numElements);
 
 this.holder = holder;
 this.startingOffset = holder.cursor;
 
-holder.grow(fixedSize);
-Platform.putInt(holder.buffer, holder.cursor, numElements);
-holder.cursor += fixedSize;
+// Grows the global buffer ahead for header and fixed size data.
+holder.grow(headerInBytes + fixedElementSize * numElements);
+
+// Initialize information in header
+Platform.putInt(holder.buffer, startingOffset, numElements);
+Arrays.fill(holder.buffer, startingOffset + 4 - Platform.BYTE_ARRAY_OFFSET,
+  startingOffset + headerInBytes - Platform.BYTE_ARRAY_OFFSET, (byte)0);
 
-// Grows the global buffer ahead for fixed size data.
-holder.grow(fixedElementSize * numElements);
+holder.cursor += (headerInBytes + fixedElementSize * numElements);
   }
 
-  private long getElementOffset(int ordinal) {
-return startingOffset + 4 + 4 * ordinal;
+  private long getElementOffset(int ordinal, int scale) {
+return startingOffset + headerInBytes + ordinal * scale;
+  }
+
+  public void setOffsetAndSize(int ordinal, long currentCursor, long size) {
+final long relativeOffset = currentCursor - startingOffset;
+final long offsetAndSize = (relativeOffset << 32) | size;
+
+write(ordinal, offsetAndSize);
   }
 
   public void setNullAt(int ordinal) {
-final int relativeOffset = holder.cursor - startingOffset;
-// Writes negative offset value to represent null element.
-Platform.putInt(holder.buffer, getElementOffset(ordinal), -relativeOffset);
+throw new UnsupportedOperationException("setNullAt() is not supported");
+  }
+
+  private void setNullBit(int ordinal) {
+assertIndexIsValid(ordinal);
+BitSetMethods.set(holder.buffer, startingOffset + 4, ordinal);
+  }
+
+  public void setNullBoolean(int ordinal) {
+setNullBit(ordinal);
+// put zero into the corresponding field when set null
+Platform.putBoolean(holder.buffer, getElementOffset(ordinal, 1), false);
+  }
+
+  public void setNullByte(int ordinal) {
+setNullBit(ordinal);
+// put zero into the corresponding field when set null
+Platform.putByte(holder.buffer, getElementOffset(ordinal, 1), (byte)0);
+  }
+
+  public void setNullShort(int ordinal) {
+setNullBit(ordinal);
+// put zero into the corresponding field when set null
+Platform.putShort(holder.buffer, getElementOffset(ordinal, 2), (short)0);
+  }
+
+  public void setNullInt(int ordinal) {
+setNullBit(ordinal);
+// put zero into the corresponding field when set null
+Platform.putInt(holder.buffer, getElementOffset(ordinal, 4), (int)0);
+  }
+
+  public void setNullLong(int ordinal) {
+setNullBit(ordinal);
+// put zero into the corresponding field when set null
+Platform.putLong(holder.buffer, getElementOffset(ordinal, 8), (long)0);
+  }
+
+  public void setNullFloat(int ordinal) {
+setNullBit(ordinal);
+// put zero into the corresponding field when set null
+Platform.putFloat(holder.buffer, getElementOffset(ordinal, 4), (float)0);
+  }
+
+  public void setNullDouble(int ordinal) {
+setNullBit(ordinal);
+// put zero into the corresponding field when set null
+Platform.putDouble(holder.buffer, getElementOffset(ordinal, 8), (double)0);
   }
 
-  public void setOffset(int ordinal) {
-final int relativeOffset = holder.cursor - startingOffset;
-Platform.putInt(holder.buffer, getElementOffset(ordinal), relativeOffset);
+  public void setNull(int ordinal) {
--- End diff --

why do we have this method?



[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372511
  
--- Diff: 
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java
 ---
@@ -33,91 +38,147 @@
   // The offset of the global buffer where we start to write this array.
   private int startingOffset;
 
+  // The number of elements in this array
+  private int numElements;
+
+  private int headerInBytes;
+
+  private void assertIndexIsValid(int index) {
+assert index >= 0 : "index (" + index + ") should >= 0";
+assert index < numElements : "index (" + index + ") should < " + 
numElements;
+  }
+
   public void initialize(BufferHolder holder, int numElements, int 
fixedElementSize) {
-// We need 4 bytes to store numElements and 4 bytes each element to 
store offset.
--- End diff --

We still need to comment about the `numElements` stuff, or it will confuse 
other reader when they see `startingOffset + 4` later.





[GitHub] spark issue #14008: [SPARK-16281][SQL] Implement parse_url SQL function

2016-07-01 Thread janplus
Github user janplus commented on the issue:

https://github.com/apache/spark/pull/14008
  
@dongjoon-hyun and @cloud-fan Thank you for the review.
I have added a new commit which does the following things:

1. Cache the URL and the key pattern, using an approach similar to `XPathBoolean`
2. Fix code style problems
3. Put the constants into an object
4. Add exceptional cases for invalid keys
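For reference, a usage sketch of the function under review, with outputs as given in the `ExpressionDescription` examples quoted later in this thread (`spark` is an assumed SparkSession):

```scala
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST')").show()
// spark.apache.org
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query')").show()
// 1
```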





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372493
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java ---
@@ -33,91 +38,147 @@
   // The offset of the global buffer where we start to write this array.
   private int startingOffset;
 
+  // The number of elements in this array
+  private int numElements;
+
+  private int headerInBytes;
+
+  private void assertIndexIsValid(int index) {
+assert index >= 0 : "index (" + index + ") should >= 0";
+assert index < numElements : "index (" + index + ") should < " + numElements;
+  }
+
   public void initialize(BufferHolder holder, int numElements, int fixedElementSize) {
-// We need 4 bytes to store numElements and 4 bytes each element to store offset.
-final int fixedSize = 4 + 4 * numElements;
+this.numElements = numElements;
+this.headerInBytes = calculateHeaderPortionInBytes(numElements);
 
 this.holder = holder;
 this.startingOffset = holder.cursor;
 
-holder.grow(fixedSize);
-Platform.putInt(holder.buffer, holder.cursor, numElements);
-holder.cursor += fixedSize;
+// Grows the global buffer ahead for header and fixed size data.
+holder.grow(headerInBytes + fixedElementSize * numElements);
+
+// Initialize information in header
+Platform.putInt(holder.buffer, startingOffset, numElements);
+Arrays.fill(holder.buffer, startingOffset + 4 - Platform.BYTE_ARRAY_OFFSET,
+  startingOffset + headerInBytes - Platform.BYTE_ARRAY_OFFSET, (byte)0);
 
-// Grows the global buffer ahead for fixed size data.
-holder.grow(fixedElementSize * numElements);
+holder.cursor += (headerInBytes + fixedElementSize * numElements);
   }
 
-  private long getElementOffset(int ordinal) {
-return startingOffset + 4 + 4 * ordinal;
+  private long getElementOffset(int ordinal, int scale) {
+return startingOffset + headerInBytes + ordinal * scale;
+  }
+
+  public void setOffsetAndSize(int ordinal, long currentCursor, long size) {
+final long relativeOffset = currentCursor - startingOffset;
+final long offsetAndSize = (relativeOffset << 32) | size;
+
+write(ordinal, offsetAndSize);
   }
 
   public void setNullAt(int ordinal) {
-final int relativeOffset = holder.cursor - startingOffset;
-// Writes negative offset value to represent null element.
-Platform.putInt(holder.buffer, getElementOffset(ordinal), -relativeOffset);
+throw new UnsupportedOperationException("setNullAt() is not supported");
--- End diff --

why not just remove it? We are not implementing an interface here.





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372472
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java ---
@@ -33,91 +38,147 @@
   // The offset of the global buffer where we start to write this array.
   private int startingOffset;
 
+  // The number of elements in this array
+  private int numElements;
+
+  private int headerInBytes;
+
+  private void assertIndexIsValid(int index) {
+assert index >= 0 : "index (" + index + ") should >= 0";
+assert index < numElements : "index (" + index + ") should < " + numElements;
+  }
+
   public void initialize(BufferHolder holder, int numElements, int fixedElementSize) {
-// We need 4 bytes to store numElements and 4 bytes each element to store offset.
-final int fixedSize = 4 + 4 * numElements;
+this.numElements = numElements;
+this.headerInBytes = calculateHeaderPortionInBytes(numElements);
 
 this.holder = holder;
 this.startingOffset = holder.cursor;
 
-holder.grow(fixedSize);
-Platform.putInt(holder.buffer, holder.cursor, numElements);
-holder.cursor += fixedSize;
+// Grows the global buffer ahead for header and fixed size data.
+holder.grow(headerInBytes + fixedElementSize * numElements);
+
+// Initialize information in header
+Platform.putInt(holder.buffer, startingOffset, numElements);
+Arrays.fill(holder.buffer, startingOffset + 4 - Platform.BYTE_ARRAY_OFFSET,
+  startingOffset + headerInBytes - Platform.BYTE_ARRAY_OFFSET, (byte)0);
 
-// Grows the global buffer ahead for fixed size data.
-holder.grow(fixedElementSize * numElements);
+holder.cursor += (headerInBytes + fixedElementSize * numElements);
   }
 
-  private long getElementOffset(int ordinal) {
-return startingOffset + 4 + 4 * ordinal;
+  private long getElementOffset(int ordinal, int scale) {
--- End diff --

`scale` -> `elementSize`





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69372467
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java ---
@@ -33,91 +38,147 @@
   // The offset of the global buffer where we start to write this array.
   private int startingOffset;
 
+  // The number of elements in this array
+  private int numElements;
+
+  private int headerInBytes;
+
+  private void assertIndexIsValid(int index) {
+assert index >= 0 : "index (" + index + ") should >= 0";
+assert index < numElements : "index (" + index + ") should < " + numElements;
+  }
+
   public void initialize(BufferHolder holder, int numElements, int fixedElementSize) {
-// We need 4 bytes to store numElements and 4 bytes each element to store offset.
-final int fixedSize = 4 + 4 * numElements;
+this.numElements = numElements;
+this.headerInBytes = calculateHeaderPortionInBytes(numElements);
 
 this.holder = holder;
 this.startingOffset = holder.cursor;
 
-holder.grow(fixedSize);
-Platform.putInt(holder.buffer, holder.cursor, numElements);
-holder.cursor += fixedSize;
+// Grows the global buffer ahead for header and fixed size data.
+holder.grow(headerInBytes + fixedElementSize * numElements);
+
+// Initialize information in header
+Platform.putInt(holder.buffer, startingOffset, numElements);
+Arrays.fill(holder.buffer, startingOffset + 4 - Platform.BYTE_ARRAY_OFFSET,
--- End diff --

is it faster than `UnsafeRowWriter.zeroOutNullBytes`?





[GitHub] spark issue #13976: [SPARK-16288][SQL] Implement inline table generating fun...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/13976
  
Thank you, @cloud-fan .
`elementSchema` is simplified and the testcase name is updated.
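For reference, a usage sketch of the inline generator this PR implements, with output following the `ExpressionDescription` example in the diff (`spark` is an assumed SparkSession):

```scala
spark.sql("SELECT inline(array(struct(1, 'a'), struct(2, 'b')))").show()
// +----+----+
// |col1|col2|
// +----+----+
// |   1|   a|
// |   2|   b|
// +----+----+
```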





[GitHub] spark issue #13976: [SPARK-16288][SQL] Implement inline table generating fun...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13976
  
**[Test build #61647 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61647/consoleFull)**
 for PR 13976 at commit 
[`ebd9cfa`](https://github.com/apache/spark/commit/ebd9cfa1dc4ed2c9ed28653ddb2c1a214ca19add).





[GitHub] spark pull request #13976: [SPARK-16288][SQL] Implement inline table generat...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13976#discussion_r69372303
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala ---
@@ -89,4 +89,32 @@ class GeneratorFunctionSuite extends QueryTest with SharedSQLContext {
   exploded.join(exploded, exploded("i") === exploded("i")).agg(count("*")),
   Row(3) :: Nil)
   }
+
+  test("inline raises exception on empty array") {
--- End diff --

Yep. That's more clear.





[GitHub] spark pull request #13976: [SPARK-16288][SQL] Implement inline table generat...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13976#discussion_r69372281
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala ---
@@ -195,3 +195,42 @@ case class Explode(child: Expression) extends ExplodeBase(child, position = false)
   extended = "> SELECT _FUNC_(array(10,20));\n  0\t10\n  1\t20")
 // scalastyle:on line.size.limit
 case class PosExplode(child: Expression) extends ExplodeBase(child, position = true)
+
+/**
+ * Explodes an array of structs into a table.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a) - Explodes an array of structs into a table.",
+  extended = "> SELECT _FUNC_(array(struct(1, 'a'), struct(2, 'b')));\n [1,a]\n[2,b]")
+case class Inline(child: Expression) extends UnaryExpression with Generator with CodegenFallback {
+
+  override def children: Seq[Expression] = child :: Nil
+
+  override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
+case ArrayType(et, _) if et.isInstanceOf[StructType] =>
+  TypeCheckResult.TypeCheckSuccess
+case _ =>
+  TypeCheckResult.TypeCheckFailure(
+s"input to function inline should be array of struct type, not 
${child.dataType}")
+  }
+
+  override def elementSchema: StructType = child.dataType match {
+case ArrayType(et : StructType, _) =>
+  StructType(et.fields.zipWithIndex.map {
--- End diff --

Oh my god, I was too naive here.
Thank you!





[GitHub] spark issue #14025: [WIP][DOC] update out-of-date code snippets using SQLCon...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14025
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61646/
Test PASSed.





[GitHub] spark issue #14025: [WIP][DOC] update out-of-date code snippets using SQLCon...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14025
  
**[Test build #61646 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61646/consoleFull)**
 for PR 14025 at commit 
[`02c4807`](https://github.com/apache/spark/commit/02c4807176b768be56774539faaadcab3944c8f7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14025: [WIP][DOC] update out-of-date code snippets using SQLCon...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14025
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14025: [WIP][DOC] update out-of-date code snippets using SQLCon...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14025
  
**[Test build #61646 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61646/consoleFull)**
 for PR 14025 at commit 
[`02c4807`](https://github.com/apache/spark/commit/02c4807176b768be56774539faaadcab3944c8f7).





[GitHub] spark pull request #14025: [WIP][DOC] update out-of-date code snippets using...

2016-07-01 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/14025

[WIP][DOC] update out-of-date code snippets using SQLContext in all 
documents.

## What changes were proposed in this pull request?

I searched the whole docs directory for SQLContext and updated the following places:

- docs/configuration.md, sparkR code snippets.
- docs/streaming-programming-guide.md, several example code.

## How was this patch tested?

N/A
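As an illustration of the kind of snippet update involved, a typical before/after (the exact snippets touched are in the diff, not reproduced here):

```scala
// Before (Spark 1.x entry point):
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// val df = sqlContext.read.json("examples/src/main/resources/people.json")

// After (Spark 2.0 entry point):
val spark = org.apache.spark.sql.SparkSession.builder().appName("example").getOrCreate()
val df = spark.read.json("examples/src/main/resources/people.json")
```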

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark WIP_SQLContext_update

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14025.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14025


commit 02c4807176b768be56774539faaadcab3944c8f7
Author: WeichenXu 
Date:   2016-07-02T10:54:33Z

WIP_SQLContext_update







[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371935
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
+Key specifies which query to extract.
+Examples:
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 'spark.apache.org'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 'query=1'
+  > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.size > 3 || children.size < 2) {
+  TypeCheckResult.TypeCheckFailure("parse_url function requires two or three arguments")
+} else {
+  super[ImplicitCastInputTypes].checkInputDataTypes()
+}
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+if (url == null || partToExtract == null) {
+  null
+} else {
+  if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+try {
+  lastUrl = new URL(url.toString)
+  lastUrlStr = url.asInstanceOf[UTF8String].clone()
+} catch {
+  case NonFatal(_) => return null
--- End diff --

Yes, it only throws `MalformedURLException`. But Dongjoon Hyun suggested that I use `NonFatal(_)` here, as quoted:

> We can use the following like Hive uses Exception. SecurityException occurs possibly as a RuntimeException.
> `case NonFatal(_) =>`





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69371928
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java ---
@@ -19,9 +19,14 @@
 
 import org.apache.spark.sql.types.Decimal;
 import org.apache.spark.unsafe.Platform;
+import org.apache.spark.unsafe.bitset.BitSetMethods;
 import org.apache.spark.unsafe.types.CalendarInterval;
 import org.apache.spark.unsafe.types.UTF8String;
 
+import java.util.Arrays;
--- End diff --

import order is wrong here.





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69371924
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -341,63 +324,113 @@ public UnsafeArrayData copy() {
 return arrayCopy;
   }
 
-  public static UnsafeArrayData fromPrimitiveArray(int[] arr) {
-if (arr.length > (Integer.MAX_VALUE - 4) / 8) {
-  throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
-"it's too big.");
-}
+  @Override
+  public boolean[] toBooleanArray() {
+int size = numElements();
+boolean[] values = new boolean[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size);
+return values;
+  }
+
+  @Override
+  public byte[] toByteArray() {
+int size = numElements();
+byte[] values = new byte[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size);
+return values;
+  }
+
+  @Override
+  public short[] toShortArray() {
+int size = numElements();
+short[] values = new short[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.SHORT_ARRAY_OFFSET, size * 2);
+return values;
+  }
 
-final int offsetRegionSize = 4 * arr.length;
-final int valueRegionSize = 4 * arr.length;
-final int totalSize = 4 + offsetRegionSize + valueRegionSize;
-final byte[] data = new byte[totalSize];
+  @Override
+  public int[] toIntArray() {
+int size = numElements();
+int[] values = new int[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.INT_ARRAY_OFFSET, size * 4);
+return values;
+  }
 
-Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length);
+  @Override
+  public long[] toLongArray() {
+int size = numElements();
+long[] values = new long[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.LONG_ARRAY_OFFSET, size * 8);
+return values;
+  }
 
-int offsetPosition = Platform.BYTE_ARRAY_OFFSET + 4;
-int valueOffset = 4 + offsetRegionSize;
-for (int i = 0; i < arr.length; i++) {
-  Platform.putInt(data, offsetPosition, valueOffset);
-  offsetPosition += 4;
-  valueOffset += 4;
+  @Override
+  public float[] toFloatArray() {
+int size = numElements();
+float[] values = new float[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.FLOAT_ARRAY_OFFSET, size * 4);
+return values;
+  }
+
+  @Override
+  public double[] toDoubleArray() {
+int size = numElements();
+double[] values = new double[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.DOUBLE_ARRAY_OFFSET, size * 8);
+return values;
+  }
+
+  private static UnsafeArrayData fromPrimitiveArray(Object arr, int length, final int elementSize) {
+final int headerSize = calculateHeaderPortionInBytes(length);
+if (length > (Integer.MAX_VALUE - headerSize) / elementSize) {
+  throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
+"it's too big.");
 }
 
+final int valueRegionSize = elementSize * length;
+final byte[] data = new byte[valueRegionSize + headerSize];
--- End diff --

is it better to use `long[]` here? Then we can store more elements.
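The capacity argument, sketched: Java array length is capped at roughly `Integer.MAX_VALUE` elements, so the element width of the backing array bounds the total bytes it can hold.

```scala
// Approximate upper bounds on backing storage (assuming the limit is exactly
// Integer.MAX_VALUE elements; real JVMs reserve a few slots below that).
val maxBytesWithByteArray: Long = Int.MaxValue.toLong      // ~2 GB
val maxBytesWithLongArray: Long = Int.MaxValue.toLong * 8L // ~16 GB
```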





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69371914
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -341,63 +324,113 @@ public UnsafeArrayData copy() {
 return arrayCopy;
   }
 
-  public static UnsafeArrayData fromPrimitiveArray(int[] arr) {
-if (arr.length > (Integer.MAX_VALUE - 4) / 8) {
-  throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
-"it's too big.");
-}
+  @Override
+  public boolean[] toBooleanArray() {
+int size = numElements();
+boolean[] values = new boolean[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size);
+return values;
+  }
+
+  @Override
+  public byte[] toByteArray() {
+int size = numElements();
+byte[] values = new byte[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size);
+return values;
+  }
+
+  @Override
+  public short[] toShortArray() {
+int size = numElements();
+short[] values = new short[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.SHORT_ARRAY_OFFSET, size * 2);
+return values;
+  }
 
-final int offsetRegionSize = 4 * arr.length;
-final int valueRegionSize = 4 * arr.length;
-final int totalSize = 4 + offsetRegionSize + valueRegionSize;
-final byte[] data = new byte[totalSize];
+  @Override
+  public int[] toIntArray() {
+int size = numElements();
+int[] values = new int[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.INT_ARRAY_OFFSET, size * 4);
+return values;
+  }
 
-Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length);
+  @Override
+  public long[] toLongArray() {
+int size = numElements();
+long[] values = new long[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.LONG_ARRAY_OFFSET, size * 8);
+return values;
+  }
 
-int offsetPosition = Platform.BYTE_ARRAY_OFFSET + 4;
-int valueOffset = 4 + offsetRegionSize;
-for (int i = 0; i < arr.length; i++) {
-  Platform.putInt(data, offsetPosition, valueOffset);
-  offsetPosition += 4;
-  valueOffset += 4;
+  @Override
+  public float[] toFloatArray() {
+int size = numElements();
+float[] values = new float[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.FLOAT_ARRAY_OFFSET, size * 4);
+return values;
+  }
+
+  @Override
+  public double[] toDoubleArray() {
+int size = numElements();
+double[] values = new double[size];
+Platform.copyMemory(
+  baseObject, baseOffset + headerInBytes, values, Platform.DOUBLE_ARRAY_OFFSET, size * 8);
+return values;
+  }
+
+  private static UnsafeArrayData fromPrimitiveArray(Object arr, int length, final int elementSize) {
+final int headerSize = calculateHeaderPortionInBytes(length);
+if (length > (Integer.MAX_VALUE - headerSize) / elementSize) {
+  throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
+"it's too big.");
 }
 
+final int valueRegionSize = elementSize * length;
+final byte[] data = new byte[valueRegionSize + headerSize];
+
+Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, length);
 Platform.copyMemory(arr, Platform.INT_ARRAY_OFFSET, data,
--- End diff --

I think we also need to pass in the array offset.





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69371901
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -341,63 +324,113 @@ public UnsafeArrayData copy() {
     return arrayCopy;
   }
 
-  public static UnsafeArrayData fromPrimitiveArray(int[] arr) {
-    if (arr.length > (Integer.MAX_VALUE - 4) / 8) {
-      throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
-        "it's too big.");
-    }
+  @Override
+  public boolean[] toBooleanArray() {
+    int size = numElements();
+    boolean[] values = new boolean[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size);
+    return values;
+  }
+
+  @Override
+  public byte[] toByteArray() {
+    int size = numElements();
+    byte[] values = new byte[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.BYTE_ARRAY_OFFSET, size);
+    return values;
+  }
+
+  @Override
+  public short[] toShortArray() {
+    int size = numElements();
+    short[] values = new short[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.SHORT_ARRAY_OFFSET, size * 2);
+    return values;
+  }
 
-    final int offsetRegionSize = 4 * arr.length;
-    final int valueRegionSize = 4 * arr.length;
-    final int totalSize = 4 + offsetRegionSize + valueRegionSize;
-    final byte[] data = new byte[totalSize];
+  @Override
+  public int[] toIntArray() {
+    int size = numElements();
+    int[] values = new int[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.INT_ARRAY_OFFSET, size * 4);
+    return values;
+  }
 
-    Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, arr.length);
+  @Override
+  public long[] toLongArray() {
+    int size = numElements();
+    long[] values = new long[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.LONG_ARRAY_OFFSET, size * 8);
+    return values;
+  }
 
-    int offsetPosition = Platform.BYTE_ARRAY_OFFSET + 4;
-    int valueOffset = 4 + offsetRegionSize;
-    for (int i = 0; i < arr.length; i++) {
-      Platform.putInt(data, offsetPosition, valueOffset);
-      offsetPosition += 4;
-      valueOffset += 4;
+  @Override
+  public float[] toFloatArray() {
+    int size = numElements();
+    float[] values = new float[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.FLOAT_ARRAY_OFFSET, size * 4);
+    return values;
+  }
+
+  @Override
+  public double[] toDoubleArray() {
+    int size = numElements();
+    double[] values = new double[size];
+    Platform.copyMemory(
+      baseObject, baseOffset + headerInBytes, values, Platform.DOUBLE_ARRAY_OFFSET, size * 8);
+    return values;
+  }
+
+  private static UnsafeArrayData fromPrimitiveArray(Object arr, int length, final int elementSize) {
+    final int headerSize = calculateHeaderPortionInBytes(length);
+    if (length > (Integer.MAX_VALUE - headerSize) / elementSize) {
+      throw new UnsupportedOperationException("Cannot convert this array to unsafe format as " +
+        "it's too big.");
     }
 
+    final int valueRegionSize = elementSize * length;
+    final byte[] data = new byte[valueRegionSize + headerSize];
+
+    Platform.putInt(data, Platform.BYTE_ARRAY_OFFSET, length);
     Platform.copyMemory(arr, Platform.INT_ARRAY_OFFSET, data,
--- End diff --

why `INT_ARRAY_OFFSET` here?





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69371876
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -317,7 +301,6 @@ public boolean equals(Object other) {
     }
     return false;
   }
-
--- End diff --

keep this blank line please





[GitHub] spark pull request #13680: [SPARK-15962][SQL] Introduce implementation with ...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13680#discussion_r69371753
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -25,30 +25,36 @@
 import org.apache.spark.sql.types.*;
 import org.apache.spark.unsafe.Platform;
 import org.apache.spark.unsafe.array.ByteArrayMethods;
+import org.apache.spark.unsafe.bitset.BitSetMethods;
 import org.apache.spark.unsafe.hash.Murmur3_x86_32;
 import org.apache.spark.unsafe.types.CalendarInterval;
 import org.apache.spark.unsafe.types.UTF8String;
 
 /**
  * An Unsafe implementation of Array which is backed by raw memory instead of Java objects.
  *
- * Each tuple has three parts: [numElements] [offsets] [values]
+ * Each tuple has four parts: [numElements][null bits][values or offset][variable length portion]
  *
  * The `numElements` is 4 bytes storing the number of elements of this array.
  *
- * In the `offsets` region, we store 4 bytes per element, represents the relative offset (w.r.t. the
- * base address of the array) of this element in `values` region. We can get the length of this
- * element by subtracting next offset.
- * Note that offset can by negative which means this element is null.
+ * In the `null bits` region, we store 1 bit per element, representing whether an element is null.
+ * Its total size is ceil(numElements / 8) bytes, and it is aligned to 8-byte word boundaries.
 *
- * In the `values` region, we store the content of elements. As we can get length info, so elements
- * can be variable-length.
+ * In the `values or offset` region, we store the content of elements. For fields that hold
+ * fixed-length primitive types, such as long, double, or int, we store the value directly
+ * in the field. For fields with non-primitive or variable-length values, we store a relative
+ * offset (w.r.t. the base address of the row) that points to the beginning of the variable-length
--- End diff --

`the base address of the array data`
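
To make the header arithmetic of the new layout concrete, here is a small
sketch under the assumptions stated in that javadoc (a 4-byte numElements
field, 1 null bit per element, and the null-bit region rounded up to 8-byte
words; the PR's actual calculateHeaderPortionInBytes may differ in detail):

    // Assumed header size: 4-byte count + null bits padded to 8-byte words.
    def headerPortionInBytes(numElements: Int): Int =
      4 + ((numElements + 63) / 64) * 8

    assert(headerPortionInBytes(1) == 12)   // 4 + one 8-byte word of null bits
    assert(headerPortionInBytes(64) == 12)  // 64 null bits still fit in one word
    assert(headerPortionInBytes(65) == 20)  // a 65th element needs a second word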





[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371739
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
+    Key specifies which query to extract.
+    Examples:
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 'spark.apache.org'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 'query=1'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.size > 3 || children.size < 2) {
+      TypeCheckResult.TypeCheckFailure("parse_url function requires two or three arguments")
+    } else {
+      super[ImplicitCastInputTypes].checkInputDataTypes()
+    }
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+    if (url == null || partToExtract == null) {
+      null
+    } else {
+      if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+        try {
+          lastUrl = new URL(url.toString)
+          lastUrlStr = url.asInstanceOf[UTF8String].clone()
+        } catch {
+          case NonFatal(_) => return null
--- End diff --

oh this is just a code style suggestion: only catch the exceptions that may be thrown in the `try` block; `new URL` only throws some specific kinds of exceptions, right?
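
A minimal sketch of that narrower catch, relying only on the fact that
`new URL(String)` declares MalformedURLException:

    import java.net.{MalformedURLException, URL}

    // Return null for a malformed URL; let anything unexpected propagate.
    def tryParseUrl(s: String): URL =
      try {
        new URL(s)
      } catch {
        case _: MalformedURLException => null
      }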





[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371498
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
+    Key specifies which query to extract.
+    Examples:
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 'spark.apache.org'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 'query=1'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.size > 3 || children.size < 2) {
+      TypeCheckResult.TypeCheckFailure("parse_url function requires two or three arguments")
+    } else {
+      super[ImplicitCastInputTypes].checkInputDataTypes()
+    }
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+    if (url == null || partToExtract == null) {
+      null
+    } else {
+      if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+        try {
+          lastUrl = new URL(url.toString)
+          lastUrlStr = url.asInstanceOf[UTF8String].clone()
--- End diff --

Em, good point. I'll try to find a better way.





[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371479
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
+    Key specifies which query to extract.
+    Examples:
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 'spark.apache.org'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 'query=1'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.size > 3 || children.size < 2) {
+      TypeCheckResult.TypeCheckFailure("parse_url function requires two or three arguments")
+    } else {
+      super[ImplicitCastInputTypes].checkInputDataTypes()
+    }
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+    if (url == null || partToExtract == null) {
+      null
+    } else {
+      if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+        try {
+          lastUrl = new URL(url.toString)
+          lastUrlStr = url.asInstanceOf[UTF8String].clone()
+        } catch {
+          case NonFatal(_) => return null
--- End diff --

As Hive returns null for invalid URLs, it seems reasonable that Spark does the same thing.





[GitHub] spark pull request #13976: [SPARK-16288][SQL] Implement inline table generat...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13976#discussion_r69371427
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala ---
@@ -89,4 +89,32 @@ class GeneratorFunctionSuite extends QueryTest with SharedSQLContext {
       exploded.join(exploded, exploded("i") === exploded("i")).agg(count("*")),
       Row(3) :: Nil)
   }
+
+  test("inline raises exception on empty array") {
--- End diff --

`on array of null type`?





[GitHub] spark pull request #13976: [SPARK-16288][SQL] Implement inline table generat...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13976#discussion_r69371397
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala ---
@@ -195,3 +195,42 @@ case class Explode(child: Expression) extends ExplodeBase(child, position = fals
   extended = "> SELECT _FUNC_(array(10,20));\n  0\t10\n  1\t20")
 // scalastyle:on line.size.limit
 case class PosExplode(child: Expression) extends ExplodeBase(child, position = true)
+
+/**
+ * Explodes an array of structs into a table.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a) - Explodes an array of structs into a table.",
+  extended = "> SELECT _FUNC_(array(struct(1, 'a'), struct(2, 'b')));\n  [1,a]\n  [2,b]")
+case class Inline(child: Expression) extends UnaryExpression with Generator with CodegenFallback {
+
+  override def children: Seq[Expression] = child :: Nil
+
+  override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
+    case ArrayType(et, _) if et.isInstanceOf[StructType] =>
+      TypeCheckResult.TypeCheckSuccess
+    case _ =>
+      TypeCheckResult.TypeCheckFailure(
+        s"input to function inline should be array of struct type, not ${child.dataType}")
+  }
+
+  override def elementSchema: StructType = child.dataType match {
+    case ArrayType(et : StructType, _) =>
+      StructType(et.fields.zipWithIndex.map {
--- End diff --

why not return `et` directly?
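
In other words, the match could hand the element type back unchanged,
something like this sketch (the standalone signature is illustrative):

    import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

    // The array's element StructType already is the element schema.
    def elementSchema(dataType: DataType): StructType = dataType match {
      case ArrayType(et: StructType, _) => et
    }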





[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371334
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
+    Key specifies which query to extract.
+    Examples:
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 'spark.apache.org'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 'query=1'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.size > 3 || children.size < 2) {
+      TypeCheckResult.TypeCheckFailure("parse_url function requires two or three arguments")
+    } else {
+      super[ImplicitCastInputTypes].checkInputDataTypes()
+    }
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+    if (url == null || partToExtract == null) {
+      null
+    } else {
+      if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+        try {
+          lastUrl = new URL(url.toString)
+          lastUrlStr = url.asInstanceOf[UTF8String].clone()
+        } catch {
+          case NonFatal(_) => return null
--- End diff --

btw, does Hive return null for an invalid URL?





[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371328
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
+    Key specifies which query to extract.
+    Examples:
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 'spark.apache.org'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 'query=1'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.size > 3 || children.size < 2) {
+      TypeCheckResult.TypeCheckFailure("parse_url function requires two or three arguments")
+    } else {
+      super[ImplicitCastInputTypes].checkInputDataTypes()
+    }
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+    if (url == null || partToExtract == null) {
+      null
+    } else {
+      if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+        try {
+          lastUrl = new URL(url.toString)
+          lastUrlStr = url.asInstanceOf[UTF8String].clone()
+        } catch {
+          case NonFatal(_) => return null
--- End diff --

We should check the possible exceptions for `new URL` instead of catching 
NonFatal blindly.





[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69371274
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala ---
@@ -652,6 +656,129 @@ case class StringRPad(str: Expression, len: Expression, pad: Expression)
   override def prettyName: String = "rpad"
 }
 
+object ParseUrl {
+  private val HOST = UTF8String.fromString("HOST")
+  private val PATH = UTF8String.fromString("PATH")
+  private val QUERY = UTF8String.fromString("QUERY")
+  private val REF = UTF8String.fromString("REF")
+  private val PROTOCOL = UTF8String.fromString("PROTOCOL")
+  private val FILE = UTF8String.fromString("FILE")
+  private val AUTHORITY = UTF8String.fromString("AUTHORITY")
+  private val USERINFO = UTF8String.fromString("USERINFO")
+  private val REGEXPREFIX = "(&|^)"
+  private val REGEXSUBFIX = "=([^&]*)"
+}
+
+/**
+ * Extracts a part from a URL
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(url, partToExtract[, key]) - extracts a part from a URL",
+  extended = """Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO.
+    Key specifies which query to extract.
+    Examples:
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'HOST')\n 'spark.apache.org'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY')\n 'query=1'
+      > SELECT _FUNC_('http://spark.apache.org/path?query=1', 'QUERY', 'query')\n '1'""")
+case class ParseUrl(children: Seq[Expression])
+  extends Expression with ImplicitCastInputTypes with CodegenFallback {
+
+  override def nullable = true
+  override def inputTypes: Seq[DataType] = Seq.fill(children.size)(StringType)
+  override def dataType: DataType = StringType
+
+  @transient private var lastUrlStr: UTF8String = _
+  @transient private var lastUrl: URL = _
+  // last key in string, we will update the pattern if key value changed.
+  @transient private var lastKey: UTF8String = _
+  // last regex pattern, we cache it for performance concern
+  @transient private var pattern: Pattern = _
+
+  private lazy val stringExprs = children.toArray
+  import ParseUrl._
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (children.size > 3 || children.size < 2) {
+      TypeCheckResult.TypeCheckFailure("parse_url function requires two or three arguments")
+    } else {
+      super[ImplicitCastInputTypes].checkInputDataTypes()
+    }
+  }
+
+  def parseUrlWithoutKey(url: Any, partToExtract: Any): Any = {
+    if (url == null || partToExtract == null) {
+      null
+    } else {
+      if (lastUrlStr == null || !url.equals(lastUrlStr)) {
+        try {
+          lastUrl = new URL(url.toString)
+          lastUrlStr = url.asInstanceOf[UTF8String].clone()
--- End diff --

so if all URLs are different, we need to clone them every time?





[GitHub] spark issue #14020: [SPARK-16349][sql] Fall back to isolated class loader wh...

2016-07-01 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/14020
  
Might be a legitimate test failure, will look later.





[GitHub] spark issue #14020: [SPARK-16349][sql] Fall back to isolated class loader wh...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14020
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14020: [SPARK-16349][sql] Fall back to isolated class loader wh...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14020
  
**[Test build #61645 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61645/consoleFull)** for PR 14020 at commit [`693939c`](https://github.com/apache/spark/commit/693939cda31f67d4bab2351901d0dcebb40eb405).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14020: [SPARK-16349][sql] Fall back to isolated class loader wh...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14020
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61645/
Test FAILed.





[GitHub] spark issue #14024: [SPARK-15923][YARN] Spark Application rest api returns '...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14024
  
Can one of the admins verify this patch?





[GitHub] spark pull request #14024: [SPARK-15923][YARN] Spark Application rest api re...

2016-07-01 Thread Sherry302
GitHub user Sherry302 opened a pull request:

https://github.com/apache/spark/pull/14024

[SPARK-15923][YARN] Spark Application rest api returns 'no such app: …

## What changes were proposed in this pull request?
1. Updated the monitoring.md doc.
2. In YarnSchedulerBackend.scala: make applications running in YARN cluster mode have attemptId "1" by default.

## How was this patch tested?
Manual tests passed.

…'

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Sherry302/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14024.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14024


commit a15dee1aee3afa53a455c4b0aba5e3388a0129d3
Author: Weiqing Yang 
Date:   2016-07-02T01:45:38Z

[SPARK-15923][YARN] Spark Application rest api returns 'no such app: 
'







[GitHub] spark issue #14023: [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14023
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61644/
Test PASSed.





[GitHub] spark issue #14023: [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14023
  
**[Test build #61644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61644/consoleFull)** for PR 14023 at commit [`c631603`](https://github.com/apache/spark/commit/c631603e0cb2293b5b8070b2e6f36a353671effe).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14023: [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14023
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14022: [SPARK-16272][core] Allow config values to reference con...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14022
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61643/
Test PASSed.





[GitHub] spark issue #14022: [SPARK-16272][core] Allow config values to reference con...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14022
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14022: [SPARK-16272][core] Allow config values to reference con...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14022
  
**[Test build #61643 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61643/consoleFull)** for PR 14022 at commit [`f4e772a`](https://github.com/apache/spark/commit/f4e772ad2b5c8231c48b2cd1f8f75c81e754c2b4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14008: [SPARK-16281][SQL] Implement parse_url SQL functi...

2016-07-01 Thread janplus
Github user janplus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14008#discussion_r69370542
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala ---
@@ -725,4 +725,43 @@ class StringExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     checkEvaluation(FindInSet(Literal("abf"), Literal("abc,b,ab,c,def")), 0)
     checkEvaluation(FindInSet(Literal("ab,"), Literal("abc,b,ab,c,def")), 0)
   }
+
+  test("ParseUrl") {
+    def checkParseUrl(expected: String, urlStr: String, partToExtract: String): Unit = {
+      checkEvaluation(
+        ParseUrl(Literal.create(urlStr, StringType), Literal.create(partToExtract, StringType)),
+        expected)
+    }
+    def checkParseUrlWithKey(expected: String, urlStr: String,
+      partToExtract: String, key: String): Unit = {
+      checkEvaluation(
+        ParseUrl(Literal.create(urlStr, StringType), Literal.create(partToExtract, StringType),
+                 Literal.create(key, StringType)), expected)
+    }
+
+    checkParseUrl("spark.apache.org", "http://spark.apache.org/path?query=1", "HOST")
+    checkParseUrl("/path", "http://spark.apache.org/path?query=1", "PATH")
+    checkParseUrl("query=1", "http://spark.apache.org/path?query=1", "QUERY")
+    checkParseUrl("Ref", "http://spark.apache.org/path?query=1#Ref", "REF")
+    checkParseUrl("http", "http://spark.apache.org/path?query=1", "PROTOCOL")
+    checkParseUrl("/path?query=1", "http://spark.apache.org/path?query=1", "FILE")
+    checkParseUrl("spark.apache.org:8080", "http://spark.apache.org:8080/path?query=1", "AUTHORITY")
+    checkParseUrl("jian", "http://j...@spark.apache.org/path?query=1", "USERINFO")
+    checkParseUrlWithKey("1", "http://spark.apache.org/path?query=1", "QUERY", "query")
+
+    // Null checking
+    checkParseUrl(null, null, "HOST")
+    checkParseUrl(null, "http://spark.apache.org/path?query=1", null)
+    checkParseUrl(null, null, null)
+    checkParseUrl(null, "test", "HOST")
+    checkParseUrl(null, "http://spark.apache.org/path?query=1", "NO")
+    checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "HOST", "query")
+    checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "QUERY", "quer")
+    checkParseUrlWithKey(null, "http://spark.apache.org/path?query=1", "QUERY", null)
--- End diff --

Yes, you are right. An invalid `key` does the job. I'll fix this.
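
For reference, the key-specific QUERY lookup that REGEXPREFIX and
REGEXSUBFIX imply boils down to the following (a sketch, not this PR's
exact code):

    import java.util.regex.Pattern

    // "(&|^)" + key + "=([^&]*)": group(2) captures one parameter's value.
    val key = "query"
    val pattern = Pattern.compile("(&|^)" + key + "=([^&]*)")
    val matcher = pattern.matcher("query=1&other=2")
    val value = if (matcher.find()) matcher.group(2) else null  // "1"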





[GitHub] spark issue #14020: [SPARK-16349][sql] Fall back to isolated class loader wh...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14020
  
**[Test build #61645 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61645/consoleFull)** for PR 14020 at commit [`693939c`](https://github.com/apache/spark/commit/693939cda31f67d4bab2351901d0dcebb40eb405).





[GitHub] spark issue #13993: [SPARK-16144][SPARKR] update R API doc for mllib

2016-07-01 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/13993
  
CC @junyangq too





[GitHub] spark issue #14022: [SPARK-16272][core] Allow config values to reference con...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14022
  
**[Test build #61643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61643/consoleFull)** for PR 14022 at commit [`f4e772a`](https://github.com/apache/spark/commit/f4e772ad2b5c8231c48b2cd1f8f75c81e754c2b4).





[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13981
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13981
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61641/
Test PASSed.





[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13981
  
**[Test build #61641 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61641/consoleFull)** for PR 13981 at commit [`9eed400`](https://github.com/apache/spark/commit/9eed4003dd25c3915b73384259c63678910508ac).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14022: [SPARK-16272][core] Allow config values to reference con...

2016-07-01 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/14022
  
Note this also contains the code from #14020, since I needed that to test these changes.





[GitHub] spark pull request #14022: [SPARK-16272][core] Allow config values to refere...

2016-07-01 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/14022

[SPARK-16272][core] Allow config values to reference conf, env, system props.

This allows configuration to be more flexible when the cluster does not
have a homogeneous configuration (e.g. packages are installed on different
paths in different nodes). By allowing one to reference the environment from
the conf, it becomes possible to work around those in certain cases.

The feature is hooked up to spark.sql.hive.metastore.jars to show how to use
it, and because it's an example of said scenario that I ran into. It uses a
new "pathConf" config type that is a shorthand for enabling variable 
expansion
on string configs.

As part of the implementation, ConfigEntry now keeps track of all "known" 
configs
(i.e. those created through the use of ConfigBuilder), since that list is 
used
by the resolution code. This duplicates some code in SQLConf, which could 
potentially
be merged with this now. It will also make it simpler to implement some 
missing
features such as filtering which configs show up in the UI or in event logs 
- which
are not part of this change.

Another change is in the way ConfigEntry reads config data; it now takes a 
string
map and a function that reads env variables, so that it can be called both 
from
SparkConf and SQLConf. This makes it so both places follow the same read 
path,
instead of having to replicate certain logic in SQLConf. There are still a
couple of methods in SQLConf that peek into fields of ConfigEntry directly,
though.

Tested via unit tests, and by using the new variable expansion functionality
in a shell session with a custom spark.sql.hive.metastore.jars value.
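
As a hedged illustration of the idea (the reference syntax shown here,
e.g. "${env:...}", is an assumption based on this description rather than
a quote of the final implementation):

    import org.apache.spark.SparkConf

    // The metastore jars path is resolved from each node's own environment,
    // so heterogeneous install paths stop being a problem.
    val conf = new SparkConf()
      .set("spark.sql.hive.metastore.jars", "${env:HIVE_HOME}/lib/*")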

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-16272

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14022.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14022


commit f4e772ad2b5c8231c48b2cd1f8f75c81e754c2b4
Author: Marcelo Vanzin 
Date:   2016-06-28T22:39:49Z

[SPARK-16272][core] Allow config values to reference conf, env, system props.







[GitHub] spark issue #14021: [SPARK-16260][ML][Example]:PySpark ML Example Improvemen...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14021
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61642/
Test PASSed.





[GitHub] spark issue #14021: [SPARK-16260][ML][Example]:PySpark ML Example Improvemen...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14021
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14021: [SPARK-16260][ML][Example]:PySpark ML Example Improvemen...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14021
  
**[Test build #61642 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61642/consoleFull)**
 for PR 14021 at commit 
[`c95d4e4`](https://github.com/apache/spark/commit/c95d4e408b4f4ffa5f1d6631a286637d39f3d183).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14020: [SPARK-16349][sql] Fall back to isolated class loader wh...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14020
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61640/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14020: [SPARK-16349][sql] Fall back to isolated class loader wh...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14020
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14020: [SPARK-16349][sql] Fall back to isolated class loader wh...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14020
  
**[Test build #61640 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61640/consoleFull)**
 for PR 14020 at commit 
[`693939c`](https://github.com/apache/spark/commit/693939cda31f67d4bab2351901d0dcebb40eb405).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14021: [SPARK-16260][ML][Example]:PySpark ML Example Improvemen...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14021
  
**[Test build #61642 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61642/consoleFull)**
 for PR 14021 at commit 
[`c95d4e4`](https://github.com/apache/spark/commit/c95d4e408b4f4ffa5f1d6631a286637d39f3d183).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14021: [SPARK-16260][ML][Example]:PySpark ML Example Imp...

2016-07-01 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/14021

[SPARK-16260][ML][Example]:PySpark ML Example Improvements and Cleanup

## What changes were proposed in this pull request?
1). Remove unused imports in the Scala examples;

2). Move the SparkSession import outside the `$example on$`/`$example off$` markers (see the sketch below);

3). Make parameter settings consistent with the Scala examples;

4). Make comments consistent between the two languages;

5). Make sure the Scala and Python examples use the same data set.

I did one pass and fixed the above issues. There are still examples missing in
Python, which might be added later.

TODO: Some examples have comments on how to run them, but many are missing. We
can add those later.
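
For context on item 2): Spark example sources mark the region that gets
extracted into the published docs with special comments. A minimal sketch of
the convention (illustrative file content, not code from this PR; the Python
examples use the same markers with `#` comments):

```scala
import org.apache.spark.sql.SparkSession  // outside the markers: kept out of the docs

object ExampleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ExampleSketch").getOrCreate()
    // $example on$
    // ... only code between these markers appears in the generated docs ...
    // $example off$
    spark.stop()
  }
}
```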

## How was this patch tested?


Manually tested them.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark ann

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14021.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14021


commit c95d4e408b4f4ffa5f1d6631a286637d39f3d183
Author: wm...@hotmail.com 
Date:   2016-07-01T22:47:31Z

fix examples




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13962: [SPARK-16095][YARN] Yarn cluster mode should repo...

2016-07-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13962


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13962: [SPARK-16095][YARN] Yarn cluster mode should report corr...

2016-07-01 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/13962
  
LGTM. Merging to master / 2.0 (fix is simple enough and I'd rather have the 
whole 2.x line behave the same here).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...

2016-07-01 Thread MechCoder
Github user MechCoder commented on the issue:

https://github.com/apache/spark/pull/13981
  
I'm slightly in favour of keeping the original test because the impurity is 
set to "variance" explicitly by the `setImpurity` method, so it's a safe 
assumption that the `calculate` method returns the variance.
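
For readers following along, a rough sketch of what a variance impurity
computation does (illustrative only; MLlib's actual implementation accumulates
the same sufficient statistics incrementally):

```scala
// Variance impurity over a set of labels: E[x^2] - (E[x])^2,
// computed from the count, sum, and sum of squares of the labels.
def varianceImpurity(labels: Seq[Double]): Double = {
  val count = labels.size.toDouble
  val sum = labels.sum
  val sumSquares = labels.map(x => x * x).sum
  sumSquares / count - (sum / count) * (sum / count)
}
```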



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14020: [SPARK-16349][sql] Fall back to isolated class loader wh...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14020
  
**[Test build #61640 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61640/consoleFull)**
 for PR 14020 at commit 
[`693939c`](https://github.com/apache/spark/commit/693939cda31f67d4bab2351901d0dcebb40eb405).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13981: [SPARK-16307] [ML] Add test to verify the predicted vari...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13981
  
**[Test build #61641 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61641/consoleFull)**
 for PR 13981 at commit 
[`9eed400`](https://github.com/apache/spark/commit/9eed4003dd25c3915b73384259c63678910508ac).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13957: [SPARK-16114] [SQL] structured streaming event ti...

2016-07-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/13957#discussion_r69363852
  
--- Diff: 
examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py 
---
@@ -0,0 +1,103 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+ Counts words in UTF8 encoded, '\n' delimited text received from the 
network over a
+ sliding window of configurable duration. Each line from the network is 
tagged
+ with a timestamp that is used to determine the windows into which it 
falls.
+
+ Usage: structured_network_wordcount_windowed.py   
+   
+  and  describe the TCP server that Structured Streaming
+ would connect to receive data.
+  gives the size of window, specified as integer number 
of seconds
+  gives the amount of time successive windows are offset 
from one another,
+ given in the same units as above.  should be less than or 
equal to
+ . If the two are equal, successive windows have no 
overlap. If
+  is not provided, it defaults to .
+
+ To run this on your local machine, you need to first run a Netcat server
+`$ nc -lk `
+ and then run the example
+`$ bin/spark-submit
+
examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py
+localhost   `
+
+ One recommended ,  pair is 60, 30
+"""
+from __future__ import print_function
+
+import sys
+
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import explode
+from pyspark.sql.functions import split
+from pyspark.sql.functions import window
+
+if __name__ == "__main__":
+if len(sys.argv) != 5 and len(sys.argv) != 4:
+msg = ("Usage: structured_network_wordcount_windowed.py  
 "
+   " ")
--- End diff --

nit: optional args are usually written in square brackets, e.g. `[<slide duration>]`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13957: [SPARK-16114] [SQL] structured streaming event ti...

2016-07-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/13957#discussion_r69363771
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/TextSocketStreamSuite.scala
 ---
@@ -85,6 +86,47 @@ class TextSocketStreamSuite extends StreamTest with 
SharedSQLContext with Before
 }
   }
 
+  test("timestamped usage") {
+serverThread = new ServerThread()
+serverThread.start()
+
+val provider = new TextSocketSourceProvider
+val parameters = Map("host" -> "localhost", "port" -> 
serverThread.port.toString,
+  "includeTimestamp" -> "true")
+val schema = provider.sourceSchema(sqlContext, None, "", parameters)._2
+assert(schema === StructType(StructField("value", StringType) ::
+  StructField("timestamp", TimestampType) :: Nil))
+
+source = provider.createSource(sqlContext, "", None, "", parameters)
+
+failAfter(streamingTimeout) {
+  serverThread.enqueue("hello")
+  while (source.getOffset.isEmpty) {
+Thread.sleep(10)
+  }
+  val offset1 = source.getOffset.get
+  val batch1 = source.getBatch(None, offset1)
+  val batch1Seq = batch1.as[(String, Timestamp)].collect().toSeq
+  assert(batch1Seq.map(_._1) === Seq("hello"))
+  val batch1Stamp = batch1Seq(0)._2
+
+  serverThread.enqueue("world")
+  while (source.getOffset.get === offset1) {
+Thread.sleep(10)
+  }
+  val offset2 = source.getOffset.get
+  val batch2 = source.getBatch(Some(offset1), offset2)
+  val batch2Seq = batch2.as[(String, Timestamp)].collect().toSeq
+  assert(batch2Seq.map(_._1) === Seq("world"))
+  val batch2Stamp = batch2Seq(0)._2
+  assert(!batch2Stamp.before(batch1Stamp))
+
+  // Try stopping the source to make sure this does not block forever.
+  source.stop()
+  source = null
+}
+  }
+
--- End diff --

add tests below to make sure includeTimestamp errors are checked. 
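
For instance, a sketch of such a test (illustrative; it assumes the option
validation throws IllegalArgumentException, as requested elsewhere in this
review):

```scala
test("includeTimestamp must be 'true' or 'false'") {
  val provider = new TextSocketSourceProvider
  val params = Map("host" -> "localhost", "port" -> "0", "includeTimestamp" -> "blah")
  intercept[IllegalArgumentException] {
    provider.sourceSchema(sqlContext, None, "", params)
  }
}
```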


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14020: [SPARK-16349][sql] Fall back to isolated class lo...

2016-07-01 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/14020

[SPARK-16349][sql] Fall back to isolated class loader when classes not 
found.

Some Hadoop classes needed by the Hive metastore client jars are not present
in Spark's packaging (for example, "org/apache/hadoop/mapred/MRVersion"). So
if the parent class loader fails to find a class, try to load it from the
isolated class loader, in case it's available there.

I also took the opportunity to remove the HIVE_EXECUTION_VERSION constant 
since
it's not used anywhere.

Tested by setting spark.sql.hive.metastore.jars to local paths with 
Hive/Hadoop
libraries and verifying that Spark can talk to the metastore.
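
A minimal sketch of the fallback pattern described above (illustrative; the
class name is an assumption, not Spark's actual loader):

```scala
// Try the parent class loader first; if the class is missing there,
// fall back to the isolated loader that holds the metastore client jars.
class FallbackClassLoader(parent: ClassLoader, isolated: ClassLoader)
    extends ClassLoader(parent) {

  override protected def loadClass(name: String, resolve: Boolean): Class[_] = {
    try {
      super.loadClass(name, resolve)
    } catch {
      case _: ClassNotFoundException => isolated.loadClass(name)
    }
  }
}
```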

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-16349

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14020.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14020


commit 693939cda31f67d4bab2351901d0dcebb40eb405
Author: Marcelo Vanzin 
Date:   2016-07-01T22:31:43Z

[SPARK-16349][sql] Fall back to isolated class loader when classes not 
found.

Some Hadoop classes needed by the Hive metastore client jars are not present
in Spark's packaging (for example, "org/apache/hadoop/mapred/MRVersion"). So
if the parent class loader fails to find a class, try to load it from the
isolated class loader, in case it's available there.

I also took the opportunity to remove the HIVE_EXECUTION_VERSION constant 
since
it's not used anywhere.

Tested by setting spark.sql.hive.metastore.jars to local paths with 
Hive/Hadoop
libraries and verifying that Spark can talk to the metastore.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13957: [SPARK-16114] [SQL] structured streaming event ti...

2016-07-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/13957#discussion_r69363665
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/socket.scala 
---
@@ -136,7 +149,8 @@ class TextSocketSourceProvider extends StreamSourceProvider with DataSourceRegister
   parameters: Map[String, String]): Source = {
 val host = parameters("host")
 val port = parameters("port").toInt
-new TextSocketSource(host, port, sqlContext)
+new TextSocketSource(host, port,
+  parameters.getOrElse("includeTimestamp", "false").toBoolean, sqlContext)
--- End diff --

but it's better to have a single function that catches that error and prints a
better error message, like "value of option includeTimestamp cannot be parsed;
allowed values are 'true' or 'false'".
also, it needs to be an IllegalArgumentException
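
A sketch of what that validation could look like (illustrative, not the PR's
final code):

```scala
// Parse the includeTimestamp option strictly, failing with a clear
// IllegalArgumentException instead of an unhelpful parse error.
def parseIncludeTimestamp(parameters: Map[String, String]): Boolean =
  parameters.getOrElse("includeTimestamp", "false") match {
    case "true"  => true
    case "false" => false
    case other => throw new IllegalArgumentException(
      s"Value '$other' of option includeTimestamp cannot be parsed; " +
        "allowed values are 'true' or 'false'")
  }
```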


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13077: [SPARK-10748] [Mesos] Log error instead of crashing Spar...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13077
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13957: [SPARK-16114] [SQL] structured streaming event time wind...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13957
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13957: [SPARK-16114] [SQL] structured streaming event time wind...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13957
  
**[Test build #61639 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61639/consoleFull)**
 for PR 13957 at commit 
[`e7a81e1`](https://github.com/apache/spark/commit/e7a81e1a7455047d8f53d183f8f320dd77871882).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13957: [SPARK-16114] [SQL] structured streaming event time wind...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13957
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61639/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14019: [SPARK-16233][R][TEST] ORC test should be enabled only w...

2016-07-01 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14019
  
@dongjoon-hyun I merged this into branch-2.0, so it should be present in 
the next RC.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13957: [SPARK-16114] [SQL] structured streaming event ti...

2016-07-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/13957#discussion_r69363332
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/socket.scala 
---
@@ -92,7 +102,11 @@ class TextSocketSource(host: String, port: Int, 
sqlContext: SQLContext)
 val endIdx = end.asInstanceOf[LongOffset].offset.toInt + 1
 val data = synchronized { lines.slice(startIdx, endIdx) }
 import sqlContext.implicits._
-data.toDF("value")
+    if (includeTimestamp) {
+      data.toDF("value", "timestamp")
+    } else {
+      data.map(_._1).toDF("value")
--- End diff --

nit: `data.select("value")` does not work?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...

2016-07-01 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/13981#discussion_r69363260
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala
 ---
@@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite
   assert(variance === expectedVariance,
 s"Expected variance $expectedVariance but got $variance.")
 }
+
+val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0)
+dt.setMaxDepth(1)
+  .setMaxBins(6)
--- End diff --

"Explicit is better than implicit" ;)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14019: [SPARK-16233][R][TEST] ORC test should be enabled only w...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14019
  
Oh, thank you for merging, @shivaram.
This will be my fastest merging experience!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13957: [SPARK-16114] [SQL] structured streaming event time wind...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13957
  
**[Test build #61639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61639/consoleFull)**
 for PR 13957 at commit 
[`e7a81e1`](https://github.com/apache/spark/commit/e7a81e1a7455047d8f53d183f8f320dd77871882).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14019: [SPARK-16233][R][TEST] ORC test should be enabled...

2016-07-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14019


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13957: [SPARK-16114] [SQL] structured streaming event ti...

2016-07-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/13957#discussion_r69363245
  
--- Diff: 
examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py 
---
@@ -0,0 +1,103 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+ Counts words in UTF8 encoded, '\n' delimited text received from the 
network over a
+ sliding window of configurable duration. Each line from the network is 
tagged
+ with a timestamp that is used to determine the windows into which it 
falls.
+
+ Usage: structured_network_wordcount_windowed.py   
+   
+  and  describe the TCP server that Structured Streaming
+ would connect to receive data.
+  gives the size of window, specified as integer number 
of seconds
+  gives the amount of time successive windows are offset 
from one another,
+ given in the same units as above.  should be less than or 
equal to
+ . If the two are equal, successive windows have no 
overlap. If
+  is not provided, it defaults to .
+
+ To run this on your local machine, you need to first run a Netcat server
+`$ nc -lk `
+ and then run the example
+`$ bin/spark-submit
+
examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py
+localhost   `
+
+ One recommended ,  pair is 60, 30
+"""
+from __future__ import print_function
+
+import sys
+
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import explode
+from pyspark.sql.functions import split
+from pyspark.sql.functions import window
+
+if __name__ == "__main__":
+if len(sys.argv) != 5 and len(sys.argv) != 4:
+msg = ("Usage: structured_network_wordcount_windowed.py  
 "
+   " ")
+print(msg, file=sys.stderr)
+exit(-1)
+
+host = sys.argv[1]
+port = int(sys.argv[2])
+windowSize = int(sys.argv[3])
+slideSize = int(sys.argv[4]) if (len(sys.argv) == 5) else windowSize
+if slideSize > windowSize:
+print(" must be less than or equal to ", file=sys.stderr)
+windowArg = '{} seconds'.format(windowSize)
+slideArg = '{} seconds'.format(slideSize)
+
+
+spark = SparkSession\
+.builder\
+.appName("StructuredNetworkWordCountWindowed")\
+.getOrCreate()
+
+# Create DataFrame representing the stream of input lines from 
connection to host:port
+lines = spark\
+.readStream\
+.format('socket')\
+.option('host', host)\
+.option('port', port)\
+.option('includeTimestamp', 'true')\
+.load()
+
+# Split the lines into words, retaining timestamps
+words = lines.select(
+# explode turns each item in an array into a separate row
+explode(split(lines.value, ' ')).alias('word'),
+lines.timestamp
+)
+
+# Group the data by window and word and compute the count of each group
+windowedCounts = words.groupBy(
+window(words.timestamp, windowArg, slideArg),
--- End diff --

rename here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14019: [SPARK-16233][R][TEST] ORC test should be enabled only w...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14019
  
Is this included in the new 2.0 RC?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14019: [SPARK-16233][R][TEST] ORC test should be enabled only w...

2016-07-01 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14019
  
Thank you, @shivaram.
It passed on Jenkins. As you can see in the log, Jenkins ran the ORC tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14019: [SPARK-16233][R][TEST] ORC test should be enabled only w...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14019
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14019: [SPARK-16233][R][TEST] ORC test should be enabled only w...

2016-07-01 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14019
  
Thanks @dongjoon-hyun - I manually tested this as well and it works fine.

LGTM. Will merge after Jenkins passes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14019: [SPARK-16233][R][TEST] ORC test should be enabled only w...

2016-07-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14019
  
**[Test build #61637 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61637/consoleFull)**
 for PR 14019 at commit 
[`88141fe`](https://github.com/apache/spark/commit/88141fe4ea8a3a4386bc24815745026dfa0ec362).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14019: [SPARK-16233][R][TEST] ORC test should be enabled only w...

2016-07-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14019
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61637/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13957: [SPARK-16114] [SQL] structured streaming event ti...

2016-07-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/13957#discussion_r69362821
  
--- Diff: 
examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py 
---
@@ -0,0 +1,103 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+ Counts words in UTF8 encoded, '\n' delimited text received from the 
network over a
+ sliding window of configurable duration. Each line from the network is 
tagged
+ with a timestamp that is used to determine the windows into which it 
falls.
+
+ Usage: structured_network_wordcount_windowed.py   
+   
+  and  describe the TCP server that Structured Streaming
+ would connect to receive data.
+  gives the size of window, specified as integer number 
of seconds
+  gives the amount of time successive windows are offset 
from one another,
+ given in the same units as above.  should be less than or 
equal to
+ . If the two are equal, successive windows have no 
overlap. If
+  is not provided, it defaults to .
+
+ To run this on your local machine, you need to first run a Netcat server
+`$ nc -lk `
+ and then run the example
+`$ bin/spark-submit
+
examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py
+localhost   `
+
+ One recommended ,  pair is 60, 30
+"""
+from __future__ import print_function
+
+import sys
+
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import explode
+from pyspark.sql.functions import split
+from pyspark.sql.functions import window
+
+if __name__ == "__main__":
+if len(sys.argv) != 5 and len(sys.argv) != 4:
+msg = ("Usage: structured_network_wordcount_windowed.py  
 "
+   " ")
+print(msg, file=sys.stderr)
+exit(-1)
+
+host = sys.argv[1]
+port = int(sys.argv[2])
+windowSize = int(sys.argv[3])
+slideSize = int(sys.argv[4]) if (len(sys.argv) == 5) else windowSize
+if slideSize > windowSize:
+print(" must be less than or equal to ", file=sys.stderr)
+windowArg = '{} seconds'.format(windowSize)
+slideArg = '{} seconds'.format(slideSize)
+
+
+spark = SparkSession\
+.builder\
+.appName("StructuredNetworkWordCountWindowed")\
+.getOrCreate()
+
+# Create DataFrame representing the stream of input lines from 
connection to host:port
+lines = spark\
+.readStream\
+.format('socket')\
+.option('host', host)\
+.option('port', port)\
+.option('includeTimestamp', 'true')\
+.load()
+
+# Split the lines into words, retaining timestamps
+words = lines.select(
+# explode turns each item in an array into a separate row
--- End diff --

move this above with the other comment, and add a bit more explanation, e.g.
"split() splits each line into an array, and explode() turns the array into
multiple rows"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13957: [SPARK-16114] [SQL] structured streaming event ti...

2016-07-01 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/13957#discussion_r69362732
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java
 ---
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql.streaming;
+
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.sql.*;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.streaming.StreamingQuery;
+import scala.Tuple2;
+
+import java.sql.Timestamp;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+/**
+ * Counts words in UTF8 encoded, '\n' delimited text received from the 
network over a
+ * sliding window of configurable duration. Each line from the network is 
tagged
+ * with a timestamp that is used to determine the windows into which it 
falls.
+ *
+ * Usage: JavaStructuredNetworkWordCountWindowed <hostname> <port>
+ *   <window duration> [<slide duration>]
+ * <hostname> and <port> describe the TCP server that Structured Streaming
+ * would connect to receive data.
+ * <window duration> gives the size of window, specified as integer number of seconds
+ * <slide duration> gives the amount of time successive windows are offset from one another,
+ * given in the same units as above. <slide duration> should be less than or equal to
+ * <window duration>. If the two are equal, successive windows have no overlap. If
+ * <slide duration> is not provided, it defaults to <window duration>.
+ *
+ * To run this on your local machine, you need to first run a Netcat server
+ *    `$ nc -lk <port>`
+ * and then run the example
+ *    `$ bin/run-example sql.streaming.JavaStructuredNetworkWordCountWindowed
+ *    localhost <port> <window duration> [<slide duration>]`
+ *
+ * One recommended <window duration>, <slide duration> pair is 60, 30
+ */
+public final class JavaStructuredNetworkWordCountWindowed {
+
+  public static void main(String[] args) throws Exception {
+    if (args.length < 3) {
+      System.err.println("Usage: JavaStructuredNetworkWordCountWindowed <hostname> <port> " +
+        "<window duration> [<slide duration>]");
+      System.exit(1);
+    }
+
+    String host = args[0];
+    int port = Integer.parseInt(args[1]);
+    int windowSize = Integer.parseInt(args[2]);
+    int slideSize = (args.length == 3) ? windowSize : Integer.parseInt(args[3]);
+    if (slideSize > windowSize) {
+      System.err.println("<slide duration> must be less than or equal to <window duration>");
+    }
+    String windowArg = windowSize + " seconds";
+    String slideArg = slideSize + " seconds";
+
+    SparkSession spark = SparkSession
+      .builder()
+      .appName("JavaStructuredNetworkWordCountWindowed")
+      .getOrCreate();
+
+    // Create DataFrame representing the stream of input lines from connection to host:port
+    Dataset<Tuple2<String, Timestamp>> lines = spark
+      .readStream()
+      .format("socket")
+      .option("host", host)
+      .option("port", port)
+      .option("includeTimestamp", true)
+      .load().as(Encoders.tuple(Encoders.STRING(), Encoders.TIMESTAMP()));
+
+    // Split the lines into words, retaining timestamps
+    Dataset<Row> words = lines.flatMap(
+      new FlatMapFunction<Tuple2<String, Timestamp>, Tuple2<String, Timestamp>>() {
+        @Override
+        public Iterator<Tuple2<String, Timestamp>> call(Tuple2<String, Timestamp> t) {
+          List<Tuple2<String, Timestamp>> result = new ArrayList<>();
+          for (String word : t._1.split(" ")) {
+            result.add(new Tuple2<>(word, t._2));
+          }
+          return result.iterator();
+        }
+      },
+      Encoders.tuple(Encoders.STRING(), Encoders.TIMESTAMP())
+    ).toDF("word", "timestamp");
+
+    // Group the data by window and word and compute the count of each group
+    Dataset<Row> windowedCounts = words.groupBy(
+      functions.window(words.col("timestamp"), windowArg, slideArg),
--- End diff --

windowArg --> windowDuration
