[GitHub] spark pull request #23054: [SPARK-26085][SQL] Key attribute of non-struct ty...

2018-11-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23054#discussion_r234569150
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1594,6 +1594,15 @@ object SQLConf {
 "WHERE, which does not follow SQL standard.")
   .booleanConf
   .createWithDefault(false)
+
+  val LEGACY_ALIAS_NON_STRUCT_GROUPING_KEY =
+buildConf("spark.sql.legacy.dataset.aliasNonStructGroupingKey")
--- End diff --

Maybe aliasNonStructGroupingKeyAsValue, and default to true.

Then we can remove this in the future.
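
For reference, a minimal sketch of the suggested shape, following the SQLConf pattern
quoted in the diff above (the val name and doc text here are only illustrative, not the
final wording):

```
val LEGACY_ALIAS_NON_STRUCT_GROUPING_KEY_AS_VALUE =
  buildConf("spark.sql.legacy.dataset.aliasNonStructGroupingKeyAsValue")
    .internal()
    .doc("When true, the grouping key attribute of a non-struct type under typed " +
      "aggregation keeps its old name `value` instead of `key`.")
    .booleanConf
    .createWithDefault(true)
```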


---




[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...

2018-11-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23054
  
BTW what do the non-primitive types look like? Do they get flattened, or is there a struct?
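
A quick way to check empirically (illustrative snippet for a spark-shell session, where a
SparkSession `spark` and its implicits are available):

```
import spark.implicits._

case class Pair(a: Int, b: String)

// Primitive (non-struct) grouping key -- the case this PR is about.
Seq(1, 2, 3).toDS().groupByKey(_ % 2).keys.printSchema()

// Case-class (struct) grouping key -- compare the printed schema to see whether
// the key is flattened into its fields or kept as a single struct column.
Seq(Pair(1, "x"), Pair(2, "y")).toDS().groupByKey(identity).keys.printSchema()
```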


---




[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...

2018-11-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23054
  
We should add a “legacy” flag in case somebody’s workload gets broken 
by this. We can remove the legacy flag in a future release.


---




[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode

2018-11-16 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18784
  
Go for it.

On Fri, Nov 16, 2018 at 6:08 AM Stavros Kontopoulos <
notificati...@github.com> wrote:

> @imaxxs <https://github.com/imaxxs> @rxin <https://github.com/rxin> I
> think it's a good time to remove this; I will update the PR if you are all
> ok.
>



---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23021
  
One thing - I would put “pandas” right after test_ so you get the 
natural
logical grouping with sorting by file name.

On Tue, Nov 13, 2018 at 4:58 PM Hyukjin Kwon 
wrote:

> I am going to push after testing and double checking. The line counts
> would look like this
>
>   54 ./test_utils.py
>  199 ./test_catalog.py
>  503 ./test_grouped_agg_pandas_udf.py
>   45 ./test_group.py
>  320 ./test_session.py
>  153 ./test_readwriter.py
>  806 ./test_scalar_pandas_udf.py
>  216 ./test_pandas_udf.py
>  566 ./test_streaming.py
>   55 ./test_conf.py
>   16 ./__init__.py
>  530 ./test_grouped_map_pandas_udf.py
>  157 ./test_column.py
>  654 ./test_udf.py
>  262 ./test_window_pandas_udf.py
>  278 ./test_functions.py
>  263 ./test_context.py
>  138 ./test_serde.py
>  170 ./test_datasources.py
>  399 ./test_arrow.py
>   96 ./test_appsubmit.py
>  944 ./test_types.py
>  737 ./test_dataframe.py
>



---




[GitHub] spark issue #23021: [SPARK-26032][PYTHON] Break large sql/tests.py files int...

2018-11-13 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/23021
  
Great initiative!

I'd break the pandas udf one into smaller pieces too, as you suggested. We 
should also investigate why the runtime didn't improve ...


---




[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-07 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22957
  
i didn't look at your new code, but is your old code safe? e.g. a project 
that depends on the new alias.



---




[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15899
  
Thanks for the example. I didn't even know that was possible in earlier 
versions. I just looked it up: looks like Scala 2.11 rewrites for 
comprehensions into map, filter, and flatMap.

That said, I don't think it's a bad deal that this no longer works, given 
it was never intended to work and there's been a deprecation warning.

I still maintain that it is risky to support this, because Scala users learn for 
comprehensions not just as a simple “for filter yield”, but as a way to chain multiple 
generators together, which is not really well supported by Spark (and even where it is, 
it's a really easy way for users to shoot themselves in the foot, because it would be a 
cartesian product).

Rather than faking it as a local collection, users should know an RDD is not one.
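
For context, a small sketch of the desugaring being discussed. `withFilter` on RDD is what
this PR proposes, not something the current API provides, and `sc` is assumed to be a
SparkContext:

```
val rdd = sc.parallelize(1 to 10)

// A single generator with a guard desugars into withFilter + map:
val evens = for (x <- rdd if x % 2 == 0) yield x * 2
// ~ rdd.withFilter(x => x % 2 == 0).map(x => x * 2)

// Multiple generators desugar into flatMap, which would require nesting one RDD
// operation inside another -- exactly the pattern Spark does not support, and the
// reason chained generators are risky even if they appear to work:
// for (a <- rdd; b <- otherRdd) yield (a, b)
// ~ rdd.flatMap(a => otherRdd.map(b => (a, b)))
```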






---




[GitHub] spark pull request #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15899#discussion_r231390266
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -387,6 +387,14 @@ abstract class RDD[T: ClassTag](
   preservesPartitioning = true)
   }
 
+  /**
+* Return a new RDD containing only the elements that satisfy a 
predicate.
--- End diff --

Why bother unless we have consensus to introduce this API?



---




[GitHub] spark issue #22889: [SPARK-25882][SQL] Added a function to join two datasets...

2018-11-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22889
  
Yea good idea (prefer Array over Seq for short lists)


---




[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-01 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22921
  
seems good to me; might want to leave this open for a few days so more 
people can take a look


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230135473
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -62,17 +62,6 @@ class SQLContext private[sql](val sparkSession: 
SparkSession)
 
   sparkSession.sparkContext.assertNotStopped()
 
-  // Note: Since Spark 2.0 this class has become a wrapper of 
SparkSession, where the
--- End diff --

keep these two lines?


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-01 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r230132632
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -639,20 +639,6 @@ private[spark] object SparkConf extends Logging {
*/
   private val deprecatedConfigs: Map[String, DeprecatedConfig] = {
 val configs = Seq(
-  DeprecatedConfig("spark.cache.class", "0.8",
--- End diff --

do we need to remove these? they are warnings for users if they set the wrong config, right?


---




[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable

2018-10-29 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22830
  
Perhaps @jkbradley and @mengxr can comment on it. If the trait is 
inheritable, then protected still means it is part of the API contract.
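
A tiny standalone illustration of that point (the names mirror the trait under discussion,
but the code is schematic):

```
// A protected member of a public, inheritable trait is still part of the API contract:
// third-party subclasses compile against it, so removing or changing it breaks them.
trait Saveable {
  protected def formatVersion: String
  def save(path: String): Unit
}

class MyModel extends Saveable {
  override protected def formatVersion: String = "1.0"
  override def save(path: String): Unit = ()
}
```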



---




[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable

2018-10-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22830
  
Who introduced this? We should ask the person that introduced it whether it 
can be removed. 


---




[GitHub] spark pull request #22870: [SPARK-25862][SQL] Remove rangeBetween APIs intro...

2018-10-28 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22870

[SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608

## What changes were proposed in this pull request?
This patch removes the rangeBetween functions introduced in SPARK-21608. As 
explained in SPARK-25841, these functions are confusing and don't quite work. 
We will redesign them and introduce better ones in SPARK-25843.

## How was this patch tested?
Removed relevant test cases as well. These test cases will need to be added 
back in SPARK-25843.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-25862

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22870.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22870


commit 00b0c6c746f4dbd3aa69071b99cf09bc5b53524a
Author: Reynold Xin 
Date:   2018-10-28T21:11:16Z

[SPARK-25862][SQL] Remove rangeBetween APIs introduced in SPARK-21608




---




[GitHub] spark pull request #22853: [SPARK-25845][SQL] Fix MatchError for calendar in...

2018-10-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22853#discussion_r228608016
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFramesSuite.scala 
---
@@ -267,6 +267,25 @@ class DataFrameWindowFramesSuite extends QueryTest 
with SharedSQLContext {
 )
   }
 
+  test("range between should accept interval values as left boundary") {
--- End diff --

this is using intervals for both, rather than just left.
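
For contrast, a frame with an interval on just the left boundary would look like this
(a sketch via SQL; assumes a SparkSession `spark` and a temp view `events` with columns
`key`, `ts` of timestamp type, and `value`):

```
val trailingDaySum = spark.sql("""
  SELECT key, value,
         sum(value) OVER (
           PARTITION BY key
           ORDER BY ts
           RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND CURRENT ROW) AS trailing_day_sum
  FROM events
""")
```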



---




[GitHub] spark pull request #22815: [SPARK-25821][SQL] Remove SQLContext methods depr...

2018-10-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22815#discussion_r228594291
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -54,6 +54,7 @@ import org.apache.spark.sql.util.ExecutionListenerManager
  * @groupname Ungrouped Support functions for language integrated queries
  * @since 1.0.0
  */
+@deprecated("Use SparkSession instead", "3.0.0")
--- End diff --

Yea I wouldn't deprecate it now ... the data source API v1 still depends on 
it. 

Actually, now that I think about it, we should not deprecate SQLContext until dsv2 is 
stable. Otherwise we'd have a stable API (dsv1) depending on a deprecated API.




---




[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...

2018-10-26 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21588
  
Does this upgrade Hive for execution or also for metastore? Spark supports 
virtually all Hive metastore versions out there, and a lot of deployments do 
run different versions of Spark against the same old Hive metastore, and it'd 
be bad to break connectivity to old Hive metastores.

The execution part is a different story, and we can upgrade that easily.
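
For reference, the metastore connectivity being described is selected per application
through configuration rather than tied to the execution version, roughly like this
(version values are just examples):

```
import org.apache.spark.sql.SparkSession

// Execution uses Spark's built-in Hive libraries; the metastore client version is picked
// via configuration, which is what lets new Spark releases keep talking to old metastores.
val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "1.2.1")  // version of the existing metastore
  .config("spark.sql.hive.metastore.jars", "maven")     // or a classpath with matching jars
  .enableHiveSupport()
  .getOrCreate()
```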



---




[GitHub] spark pull request #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs in...

2018-10-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22841#discussion_r228376622
  
--- Diff: python/pyspark/sql/window.py ---
@@ -239,34 +212,27 @@ def rangeBetween(self, start, end):
 and "5" means the five off after the current row.
 
 We recommend users use ``Window.unboundedPreceding``, 
``Window.unboundedFollowing``,
-``Window.currentRow``, 
``pyspark.sql.functions.unboundedPreceding``,
-``pyspark.sql.functions.unboundedFollowing`` and 
``pyspark.sql.functions.currentRow``
-to specify special boundary values, rather than using integral 
values directly.
+and ``Window.currentRow`` to specify special boundary values, 
rather than using integral
--- End diff --

what do you mean? the old rangeBetween API is still valid.
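
For reference, the form that stays around looks like this in the Scala API (a minimal
sketch; `df` is assumed to be a DataFrame with columns `id`, `ts`, and `amount`):

```
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val w = Window
  .partitionBy("id")
  .orderBy("ts")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

val withRunningSum = df.withColumn("running_sum", sum("amount").over(w))
```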



---




[GitHub] spark pull request #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json...

2018-10-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22775#discussion_r228372331
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -770,8 +776,17 @@ case class SchemaOfJson(
 factory
   }
 
-  override def convert(v: UTF8String): UTF8String = {
-val dt = 
Utils.tryWithResource(CreateJacksonParser.utf8String(jsonFactory, v)) { parser 
=>
+  @transient
+  private lazy val json = child.eval().asInstanceOf[UTF8String]
--- End diff --

It's not weird at all that users want to use schema_of_json.

Imagine a very large JSON document with a very complicated structure; it's pretty 
difficult to actually write the DDL string by hand.
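
A small illustration of that use case (assumes a SparkSession `spark`; the JSON sample is
made up):

```
import org.apache.spark.sql.functions.{lit, schema_of_json}

// Let Spark infer the DDL schema string from one sample document instead of writing it
// by hand; the result can then be fed to from_json for the full dataset.
val sample = """{"user": {"id": 1, "tags": ["a", "b"]}, "events": [{"ts": 1, "ok": true}]}"""
spark.range(1).select(schema_of_json(lit(sample)).as("ddl")).show(false)
```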



---




[GitHub] spark issue #22775: [SPARK-24709][SQL][FOLLOW-UP] Make schema_of_json's inpu...

2018-10-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22775
  
I agree it should be a literal value.



---




[GitHub] spark pull request #22841: [SPARK-25842][SQL] Deprecate rangeBetween APIs in...

2018-10-25 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22841

[SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608

## What changes were proposed in this pull request?
See the detailed information at 
https://issues.apache.org/jira/browse/SPARK-25841 on why these APIs should be 
deprecated and redesigned.

## How was this patch tested?
Only deprecation and doc changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-25842

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22841.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22841


commit 0a49c859049a376872053dcfaacba81d47070d77
Author: Reynold Xin 
Date:   2018-10-25T23:44:36Z

[SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608




---




[GitHub] spark issue #22821: [SPARK-25832][SQL] remove newly added map related functi...

2018-10-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22821
  
We seem to be splitting hairs here. Why are we providing tech preview to
advanced users? Are you saying they construct expressions directly using
internal APIs? I doubt that’s tech preview.

Users can construct a lot of invalid plans that lead to weird semantics or
behaviors if they try, and this doesn’t really make it worse.

In either case it's not that difficult to remove them and add them back, so I could see 
it going either way.

On Thu, Oct 25, 2018 at 7:48 AM Dongjoon Hyun 
wrote:

> I'm just confused here. Shall we finish the discussion on the email
> thread? @cloud-fan <https://github.com/cloud-fan> and @gatorsmile
> <https://github.com/gatorsmile> . If the decision is officially made like
> that (providing tech. preview to advance users) in the email thread, I'm
> okay with this.
>



---




[GitHub] spark issue #22815: [SPARK-25821][SQL] Remove SQLContext methods deprecated ...

2018-10-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22815
  
LGTM.

On a related note, we should probably deprecate the entire SQLContext.


---




[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22144
  

@markhamstra how did you arrive at that conclusion? I said "it’s not a new regression and 
also somewhat esoteric"


---




[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22144
  
It’s certainly not a blocker since it’s not a new regression and also
somewhat esoteric. Would be good to fix though.

On Tue, Oct 23, 2018 at 8:20 AM Wenchen Fan 
wrote:

> This is not a PR that is ready to merge. We are likely talking about
> delaying 2.4.0 for multiple weeks because of this issue. Is it really 
worth it?
>
> I'm not sure what's the exact policy, let's ping more people. @rxin
> <https://github.com/rxin> @srowen <https://github.com/srowen> @vanzin
> <https://github.com/vanzin> @felixcheung <https://github.com/felixcheung>
> @gatorsmile <https://github.com/gatorsmile>
>



---




[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-10-12 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21157
  
But that would break both IPython notebooks and the REPL, right? Pretty significant 
breaking change.



---




[GitHub] spark issue #22010: [SPARK-21436][CORE] Take advantage of known partitioner ...

2018-10-10 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22010
  
If this is not yet in 2.4 it shouldn’t be merged now.

On Wed, Oct 10, 2018 at 10:57 AM Holden Karau 
wrote:

> Open question: is this suitable for branch-2.4 since it predates the
> branch cut or not? (I know we've gone back and forth on how we do that).
>
-- 
-x



---




[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-09-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21157
  
@superbobry which blog were you referring to?



---




[GitHub] spark issue #21157: [SPARK-22674][PYTHON] Removed the namedtuple pickling pa...

2018-09-27 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21157
  
so this change would introduce a pretty big regression?



---




[GitHub] spark pull request #22543: [SPARK-23715][SQL][DOC] improve document for from...

2018-09-25 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22543#discussion_r220410457
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1018,9 +1018,20 @@ case class TimeAdd(start: Expression, interval: 
Expression, timeZoneId: Option[S
 }
 
 /**
- * Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time 
in UTC, and renders
- * that time as a timestamp in the given time zone. For example, 'GMT+1' 
would yield
- * '2017-07-14 03:40:00.0'.
+ * This is a common function for databases supporting TIMESTAMP WITHOUT 
TIMEZONE. This function
+ * takes a timestamp which is timezone-agnostic, and interprets it as a 
timestamp in UTC, and
+ * renders that timestamp as a timestamp in the given time zone.
+ *
+ * However, timestamp in Spark represents number of microseconds from the 
Unix epoch, which is not
+ * timezone-agnostic. So in Spark this function just shift the timestamp 
value from UTC timezone to
+ * the given timezone.
+ *
+ * This function may return confusing result if the input is a string with 
timezone, e.g.
+ * '2018-03-13T06:18:23+00:00'. The reason is that, Spark firstly cast the 
string to timestamp
+ * according to the timezone in the string, and finally display the result 
by converting the
+ * timestamp to string according to the session local timezone.
+ *
+ * We may remove this function in Spark 3.0.
--- End diff --

should also update the sql doc?
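
A small illustration of the shift described in the new doc text (assumes a SparkSession
`spark`; the session time zone is pinned so the output is deterministic):

```
import org.apache.spark.sql.functions.from_utc_timestamp
import spark.implicits._

spark.conf.set("spark.sql.session.timeZone", "UTC")

// The input string is parsed as a timestamp in the session time zone and then shifted as
// if it had been a UTC wall-clock time, which is the behavior the doc change above explains.
Seq("2017-07-14 02:40:00").toDF("ts")
  .select(from_utc_timestamp($"ts", "GMT+1").as("shifted"))
  .show(false)  // 2017-07-14 03:40:00 with the settings above
```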



---




[GitHub] spark issue #22521: [SPARK-24519][CORE] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIG...

2018-09-25 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22521
  
seems like our tests are really flaky


---




[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-24 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22521
  
yup; just did


---




[GitHub] spark pull request #22541: [SPARK-23907][SQL] Revert regr_* functions entire...

2018-09-24 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22541

[SPARK-23907][SQL] Revert regr_* functions entirely

## What changes were proposed in this pull request?
This patch reverts entirely all the regr_* functions added in SPARK-23907. 
These were added by @mgaido91 (and proposed by @gatorsmile) to improve 
compatibility with other database systems, without any actual use cases. 
However, they are very rarely used, and in Spark there are much better ways to 
compute these functions, due to Spark's flexibility in exposing real 
programming APIs.

I'm going through all the APIs added in Spark 2.4 and I think we should 
revert these. If there is strong enough demand and more use cases, we can add them back 
in the future pretty easily.

## How was this patch tested?
Reverted test cases also. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-23907

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22541.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22541


commit 623e35f118e3d28a49eb84365079d037fa519186
Author: Reynold Xin 
Date:   2018-09-24T21:30:43Z

[SPARK-23907][SQL] Revert regr_* functions entirely

commit ef0f5b02dbce29d6fbaa6f79f9c2ad62e7a16bb0
Author: Reynold Xin 
Date:   2018-09-24T21:34:34Z

i




---




[GitHub] spark issue #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_CO...

2018-09-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22521
  
Jenkins, retest this please.



---




[GitHub] spark pull request #22521: [SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HI...

2018-09-21 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22521

[SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS only once - 
WIP

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-24519

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22521.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22521


commit 77442cf7e4b64b745079a1ee62684503c7b8c123
Author: Reynold Xin 
Date:   2018-09-19T00:58:24Z

[SPARK-24519] Compute SHUFFLE_MIN_NUM_PARTS_TO_HIGHLY_COMPRESS only once

commit f23c2202fbec04983d1181d92f7c124280ebcbe3
Author: Reynold Xin 
Date:   2018-09-21T16:48:59Z

Merge branch 'master' of github.com:apache/spark into SPARK-24519

commit ac3dee3227e4ceee4ec100bbe72988f791ae3c87
Author: Reynold Xin 
Date:   2018-09-21T16:49:52Z

x

commit f6f9658e19ae5e74697ee8846b6ab11ab8eba24c
Author: Reynold Xin 
Date:   2018-09-21T17:31:51Z

fix conflict




---




[GitHub] spark pull request #21527: [SPARK-24519] Make the threshold for highly compr...

2018-09-21 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21527#discussion_r219559889
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -50,7 +50,9 @@ private[spark] sealed trait MapStatus {
 private[spark] object MapStatus {
 
   def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): 
MapStatus = {
-if (uncompressedSizes.length > 2000) {
+if (uncompressedSizes.length >  Option(SparkEnv.get)
--- End diff --

the only tricky thing is how to write the test cases for this.



---




[GitHub] spark pull request #22515: [SPARK-19724][SQL] allowCreatingManagedTableUsing...

2018-09-21 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22515

[SPARK-19724][SQL] allowCreatingManagedTableUsingNonemptyLocation should 
have legacy prefix

One more legacy config to go ...

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark 
allowCreatingManagedTableUsingNonemptyLocation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22515.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22515


commit f7c372e6f803c86e189e984fa6c1dd81f84454e9
Author: Reynold Xin 
Date:   2018-09-21T02:10:10Z

[SPARK-19724][SQL] allowCreatingManagedTableUsingNonemptyLocation should 
have legacy prefix




---




[GitHub] spark pull request #22456: [SPARK-19355][SQL] Fix variable names numberOfOut...

2018-09-20 Thread rxin
Github user rxin closed the pull request at:

https://github.com/apache/spark/pull/22456


---




[GitHub] spark issue #22509: [SPARK-25384][SQL] Clarify fromJsonForceNullableSchema w...

2018-09-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22509
  
cc @dongjoon-hyun @MaxGekk, we still need this pr, don't we?


---




[GitHub] spark pull request #22509: [SPARK-25384][SQL] Clarify fromJsonForceNullableS...

2018-09-20 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22509

[SPARK-25384][SQL] Clarify fromJsonForceNullableSchema will be removed in 
Spark 3.0

See above. This should go into the 2.4 release.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-25384

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22509.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22509


commit 8ad50d5433ac5a0f888fb5909893317002d5aa51
Author: Reynold Xin 
Date:   2018-09-21T02:06:28Z

x




---




[GitHub] spark issue #22508: [SPARK-23549][SQL] Rename config spark.sql.legacy.compar...

2018-09-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22508
  
cc @gatorsmile who merged the original pr.



---




[GitHub] spark pull request #22508: [SPARK-23549][SQL] Rename config spark.sql.legacy...

2018-09-20 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22508

[SPARK-23549][SQL] Rename config 
spark.sql.legacy.compareDateTimestampInTimestamp

## What changes were proposed in this pull request?
See title.

## How was this patch tested?
Make sure all references have been updated:
```
> git grep compareDateTimestampInTimestamp
docs/sql-programming-guide.md:  - Since Spark 2.4, Spark compares a DATE 
type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set 
`false` to `spark.sql.legacy.compareDateTimestampInTimestamp` restores the 
previous behavior. This option will be removed in Spark 3.0.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
// if conf.compareDateTimestampInTimestamp is true

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
  => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else 
Some(StringType)

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
  => if (conf.compareDateTimestampInTimestamp) Some(TimestampType) else 
Some(StringType)
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
buildConf("spark.sql.legacy.compareDateTimestampInTimestamp")
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:  
def compareDateTimestampInTimestamp : Boolean = 
getConf(COMPARE_DATE_TIMESTAMP_IN_TIMESTAMP)

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala:
"spark.sql.legacy.compareDateTimestampInTimestamp" -> 
convertToTS.toString) {
```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark SPARK-23549

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22508.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22508


commit f29dd8905f0b14c937a47d7abe291828c7de48b9
Author: Reynold Xin 
Date:   2018-09-21T02:00:59Z

[SPARK-23549][SQL] Rename config 
spark.sql.legacy.compareDateTimestampInTimestamp




---




[GitHub] spark issue #22505: Revert "[SPARK-23715][SQL] the input of to/from_utc_time...

2018-09-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22505
  
lgtm - let's make sure tests pass


---




[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-20 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r219297029
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new 
SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using 
options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and 
type inference.
--- End diff --

do we fail if users currently specify an option in dataframereader that 
doesn't apply? if we don't i wouldn't fail here.



---




[GitHub] spark pull request #22471: [SPARK-25470][SQL][Performance] Concat.eval shoul...

2018-09-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22471#discussion_r219023998
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -2274,33 +2274,41 @@ case class Concat(children: Seq[Expression]) 
extends ComplexTypeMergingExpressio
 
   override def foldable: Boolean = children.forall(_.foldable)
 
-  override def eval(input: InternalRow): Any = dataType match {
+  override def eval(input: InternalRow): Any = evalFunction(input)
+
+  @transient private lazy val evalFunction: InternalRow => Any = dataType 
match {
--- End diff --

i think it's good to have a transient to reduce closure size


---




[GitHub] spark issue #22476: [SPARK-24157][SS][FOLLOWUP] Rename to spark.sql.streamin...

2018-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22476
  
Merged in master/2.4.



---




[GitHub] spark issue #22475: [SPARK-4502][SQL] Rename to spark.sql.optimizer.nestedSc...

2018-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22475
  
jenkins, retest this again


---




[GitHub] spark issue #22475: [SPARK-4502][SQL] Rename to spark.sql.optimizer.nestedSc...

2018-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22475
  
done


---




[GitHub] spark issue #22476: [SPARK-24157][SS][FOLLOWUP] Rename to spark.sql.streamin...

2018-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22476
  
done


---




[GitHub] spark issue #21169: [SPARK-23715][SQL] the input of to/from_utc_timestamp ca...

2018-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21169
  
i'm actually not sure if we should do this, given impala treats timestamp 
as timestamp without timezone, whereas spark treats it as a utc timestamp (with 
timezone). these functions are super confusing anyway and changing them to just 
match impala in some cases is still very confusing.



---




[GitHub] spark issue #22476: [SPARK-24157] spark.sql.streaming.noDataMicroBatches.ena...

2018-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22476
  
cc @tdas @marmbrus @jose-torres 


---




[GitHub] spark pull request #22476: [SPARK-24157] spark.sql.streaming.noDataMicroBatc...

2018-09-19 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22476

[SPARK-24157] spark.sql.streaming.noDataMicroBatches.enabled

## What changes were proposed in this pull request?
This patch changes the config option 
`spark.sql.streaming.noDataMicroBatchesEnabled` to 
`spark.sql.streaming.noDataMicroBatches.enabled` to be more consistent with 
rest of the configs. Unfortunately there is one streaming config called 
`spark.sql.streaming.metricsEnabled`. For that one we should just use a 
fallback config and change it in a separate patch.
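
The fallback-config approach mentioned above would look roughly like this in SQLConf (a
sketch only; apart from the two config keys quoted in this description, the names are
hypothetical):

```
// New-style key; when it is not set explicitly, the value of the old key is used.
val STREAMING_METRICS_ENABLED = buildConf("spark.sql.streaming.metrics.enabled")
  .doc("Same semantics as spark.sql.streaming.metricsEnabled, which is kept as a fallback.")
  .fallbackConf(OLD_STREAMING_METRICS_ENABLED)  // hypothetical entry for the old key
```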

## How was this patch tested?
Made sure no other references to this config are in the code base:
```
> git grep "noDataMicro"
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
buildConf("spark.sql.streaming.noDataMicroBatches.enabled")
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-24157

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22476.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22476


commit 37716c836a20a684ef7425addd5f43cf10cd857f
Author: Reynold Xin 
Date:   2018-09-19T21:40:33Z

[SPARK-24157] spark.sql.streaming.noDataMicroBatches.enabled




---




[GitHub] spark pull request #22475: [SPARK-4502][SQL] spark.sql.optimizer.nestedSchem...

2018-09-19 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22475

[SPARK-4502][SQL] spark.sql.optimizer.nestedSchemaPruning.enabled

## What changes were proposed in this pull request?
This patch adds an "optimizer" prefix to nested schema pruning.

## How was this patch tested?
Should be covered by existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-4502

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22475.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22475


commit 5159883f5b4a65ac8ecec8b0368e172680aa6897
Author: Reynold Xin 
Date:   2018-09-19T21:37:08Z

[SPARK-4502][SQL] spark.sql.optimizer.nestedSchemaPruning.enabled




---




[GitHub] spark issue #22475: [SPARK-4502][SQL] spark.sql.optimizer.nestedSchemaPrunin...

2018-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22475
  
cc @cloud-fan 


---




[GitHub] spark issue #22472: [SPARK-23173][SQL] Reverting of spark.sql.fromJsonForceN...

2018-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22472
  
i'm ok either way


---




[GitHub] spark issue #22471: [SPARK-25470][SQL][Performance] Concat.eval should use p...

2018-09-19 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22471
  
@ueshin can you review?


---




[GitHub] spark pull request #20858: [SPARK-23736][SQL] Extending the concat function ...

2018-09-19 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20858#discussion_r218677837
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -665,3 +667,219 @@ case class ElementAt(left: Expression, right: 
Expression) extends GetMapValueUti
 
   override def prettyName: String = "element_at"
 }
+
+/**
+ * Concatenates multiple input columns together into a single column.
+ * The function works with strings, binary and compatible array columns.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(col1, col2, ..., colN) - Returns the concatenation of 
col1, col2, ..., colN.",
+  examples = """
+Examples:
+  > SELECT _FUNC_('Spark', 'SQL');
+   SparkSQL
+  > SELECT _FUNC_(array(1, 2, 3), array(4, 5), array(6));
+ | [1,2,3,4,5,6]
+  """)
+case class Concat(children: Seq[Expression]) extends Expression {
+
+  private val MAX_ARRAY_LENGTH: Int = 
ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
+
+  val allowedTypes = Seq(StringType, BinaryType, ArrayType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (children.isEmpty) {
+  TypeCheckResult.TypeCheckSuccess
+} else {
+  val childTypes = children.map(_.dataType)
+  if (childTypes.exists(tpe => 
!allowedTypes.exists(_.acceptsType(tpe {
+return TypeCheckResult.TypeCheckFailure(
+  s"input to function $prettyName should have been StringType, 
BinaryType or ArrayType," +
+s" but it's " + childTypes.map(_.simpleString).mkString("[", 
", ", "]"))
+  }
+  TypeUtils.checkForSameTypeInputExpr(childTypes, s"function 
$prettyName")
+}
+  }
+
+  override def dataType: DataType = 
children.map(_.dataType).headOption.getOrElse(StringType)
+
+  lazy val javaType: String = CodeGenerator.javaType(dataType)
+
+  override def nullable: Boolean = children.exists(_.nullable)
+
+  override def foldable: Boolean = children.forall(_.foldable)
+
+  override def eval(input: InternalRow): Any = dataType match {
--- End diff --

so this pattern match will probably cause significant regression in the 
interpreted (non-codegen) mode, due to the way scala pattern matching is 
implemented.
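
The follow-up in #22471 (quoted earlier in this digest) avoids that by resolving the
dispatch once per expression instance. A standalone sketch of the shape, using toy types
rather than Spark's real classes:

```
trait SimpleExpr { def eval(input: Map[String, Any]): Any }

case class SimpleConcat(dataType: String, children: Seq[SimpleExpr]) extends SimpleExpr {

  // The type dispatch happens once per instance instead of once per row;
  // @transient keeps the resolved function out of serialized task closures.
  @transient private lazy val evalFunction: Map[String, Any] => Any = dataType match {
    case "string" => input => children.map(_.eval(input).asInstanceOf[String]).mkString
    case "array"  => input => children.flatMap(_.eval(input).asInstanceOf[Seq[Any]])
    case other    => throw new IllegalArgumentException(s"unsupported type: $other")
  }

  override def eval(input: Map[String, Any]): Any = evalFunction(input)
}
```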


---




[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...

2018-09-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19868
  
can somebody explain to me what the pr description has to do with 
missingFiles? I'm probably missing something but i feel the implementation is 
very different from the pr description.



---




[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-09-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16677
  
ok after thinking about it more, i think we should just revert all of these 
changes and go back to the drawing board. here's why:

1. the prs change some of the most common/core parts of spark, and are not 
properly designed (as in they haven't gone through actual discussions; there's 
not even a doc on how they work). the prs created much more complicated implementations 
for limit / top k. you might be able to justify the complexity 
with the perf improvements, but we better write them down, discuss them, and 
make sure they are the right design choices. this is just a comment about the 
process, not the actual design.

2. now onto the design, i am having issues with two major parts:

2a. this pr really wanted an abstraction to buffer data, and then have the 
driver analyze some statistics about data (records per map task), and then make 
decisions. because spark doesn't yet have that infrastructure, this pr just 
adds some hacks to shuffle to make it work. there is no proper abstraction here.

2b. i'm not even sure if the algorithm here is the right one. the pr tries 
to parallelize as much as possible by keeping the same number of tasks. imo a 
simpler design that would work for more common cases is to buffer the data, get 
the records per map task, and create a new rdd with the first N number of 
partitions that reach limit. that way, we don't launch too many asks, and we 
retain ordering.

3. the pr implementation quality is poor. variable names are confusing 
(output vs records); it's severely lacking documentation; the doc for the 
config option is arcane.

sorry about all of the above, but we gotta do better.
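
A pure-Scala sketch of the simpler design described in point 2b above (illustrative only,
no Spark APIs): given the per-map-task record counts, keep just the leading partitions
whose running total reaches the limit, preserving partition order.

```
def partitionsToTake(recordsPerPartition: Seq[Long], limit: Long): Int = {
  var taken = 0L
  var numPartitions = 0
  while (numPartitions < recordsPerPartition.length && taken < limit) {
    taken += recordsPerPartition(numPartitions)
    numPartitions += 1
  }
  numPartitions
}

// e.g. with map outputs of 100, 300, 50 and 500 records and limit = 350,
// only the first two partitions need to be fetched:
assert(partitionsToTake(Seq(100L, 300L, 50L, 500L), 350L) == 2)
```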




---




[GitHub] spark pull request #22456: [SPARK-19355][SQL] Fix variable names numberOfOut...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22456#discussion_r218666270
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -31,7 +31,7 @@ import org.apache.spark.util.Utils
 
 /**
  * Result returned by a ShuffleMapTask to a scheduler. Includes the block 
manager address that the
- * task ran on, the sizes of outputs for each reducer, and the number of 
outputs of the map task,
+ * task ran on, the sizes of outputs for each reducer, and the number of 
records of the map task,
--- End diff --

size was about bytes; so it doesn't really matter whether it's a record or 
a row or a block. it's also already pointed out below that it's about bytes.


---




[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16677#discussion_r218665902
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -93,25 +96,93 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
 }
 
 /**
- * Take the first `limit` elements of each child partition, but do not 
collect or shuffle them.
+ * Take the `limit` elements of the child output.
  */
-case class LocalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
+case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
 
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+  override def output: Seq[Attribute] = child.output
 
   override def outputPartitioning: Partitioning = child.outputPartitioning
-}
 
-/**
- * Take the first `limit` elements of the child's single output partition.
- */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
 
-  override def requiredChildDistribution: List[Distribution] = AllTuples 
:: Nil
+  private val serializer: Serializer = new 
UnsafeRowSerializer(child.output.size)
 
-  override def outputPartitioning: Partitioning = child.outputPartitioning
+  protected override def doExecute(): RDD[InternalRow] = {
+val childRDD = child.execute()
+val partitioner = LocalPartitioning(childRDD)
+val shuffleDependency = ShuffleExchangeExec.prepareShuffleDependency(
+  childRDD, child.output, partitioner, serializer)
+val numberOfOutput: Seq[Long] = if 
(shuffleDependency.rdd.getNumPartitions != 0) {
+  // submitMapStage does not accept RDD with 0 partition.
+  // So, we will not submit this dependency.
+  val submittedStageFuture = 
sparkContext.submitMapStage(shuffleDependency)
+  submittedStageFuture.get().recordsByPartitionId.toSeq
+} else {
+  Nil
+}
 
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+// During global limit, try to evenly distribute limited rows across 
data
+// partitions. If disabled, scanning data partitions sequentially 
until reaching limit number.
+// Besides, if child output has certain ordering, we can't evenly pick 
up rows from
+// each parititon.
+val flatGlobalLimit = sqlContext.conf.limitFlatGlobalLimit && 
child.outputOrdering == Nil
+
+val shuffled = new ShuffledRowRDD(shuffleDependency)
+
+val sumOfOutput = numberOfOutput.sum
+if (sumOfOutput <= limit) {
+  shuffled
+} else if (!flatGlobalLimit) {
+  var numRowTaken = 0
+  val takeAmounts = numberOfOutput.map { num =>
+if (numRowTaken + num < limit) {
+  numRowTaken += num.toInt
+  num.toInt
+} else {
+  val toTake = limit - numRowTaken
+  numRowTaken += toTake
+  toTake
+}
+  }
+  val broadMap = sparkContext.broadcast(takeAmounts)
+  shuffled.mapPartitionsWithIndexInternal { case (index, iter) =>
+iter.take(broadMap.value(index).toInt)
+  }
+} else {
+  // We try to evenly require the asked limit number of rows across 
all child rdd's partitions.
+  var rowsNeedToTake: Long = limit
+  val takeAmountByPartition: Array[Long] = 
Array.fill[Long](numberOfOutput.length)(0L)
+  val remainingRowsByPartition: Array[Long] = Array(numberOfOutput: _*)
+
+  while (rowsNeedToTake > 0) {
+val nonEmptyParts = remainingRowsByPartition.count(_ > 0)
+// If the rows needed to take are less the number of non-empty 
partitions, take one row from
+// each non-empty partitions until we reach `limit` rows.
+// Otherwise, evenly divide the needed rows to each non-empty 
partitions.
+val takePerPart = math.max(1, rowsNeedToTake / nonEmptyParts)
+remainingRowsByPartition.zipWithIndex.foreach { case (num, index) 
=>
+  // In case `rowsNeedToTake` < `nonEmptyParts`, we may run out of 
`rowsNeedToTake` during
+  // the traversal, so we need to add this check.
+  if (rowsNeedToTake > 0 && num > 0) {
+if (num >= takePerPart) {
+  rowsNeedToTake -= takePerPart
+  takeAmountByPartition(index) += takePerPart
+  remainingRowsByPartition(index) -= takePerPart
+} else {
+  rowsNeedToTake -= num
+  takeAmountByPartition(index) += num
+  remainingRowsByPartition(index) -= num
+}
+  }
  

[GitHub] spark pull request #21527: [SPARK-24519] Make the threshold for highly compr...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21527#discussion_r218640616
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -50,7 +50,9 @@ private[spark] sealed trait MapStatus {
 private[spark] object MapStatus {
 
   def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): 
MapStatus = {
-if (uncompressedSizes.length > 2000) {
+if (uncompressedSizes.length >  Option(SparkEnv.get)
--- End diff --

btw this creates the impression that this can be modified at runtime, but in practice the 
executors generate these MapStatuses, so the value can't really be changed on executors.



---




[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16677#discussion_r218640368
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -93,25 +96,93 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
 }
 
 /**
- * Take the first `limit` elements of each child partition, but do not 
collect or shuffle them.
+ * Take the `limit` elements of the child output.
  */
-case class LocalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
+case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
 
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+  override def output: Seq[Attribute] = child.output
 
   override def outputPartitioning: Partitioning = child.outputPartitioning
-}
 
-/**
- * Take the first `limit` elements of the child's single output partition.
- */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
 
-  override def requiredChildDistribution: List[Distribution] = AllTuples 
:: Nil
+  private val serializer: Serializer = new 
UnsafeRowSerializer(child.output.size)
 
-  override def outputPartitioning: Partitioning = child.outputPartitioning
+  protected override def doExecute(): RDD[InternalRow] = {
+val childRDD = child.execute()
+val partitioner = LocalPartitioning(childRDD)
+val shuffleDependency = ShuffleExchangeExec.prepareShuffleDependency(
+  childRDD, child.output, partitioner, serializer)
+val numberOfOutput: Seq[Long] = if 
(shuffleDependency.rdd.getNumPartitions != 0) {
+  // submitMapStage does not accept RDD with 0 partition.
+  // So, we will not submit this dependency.
+  val submittedStageFuture = 
sparkContext.submitMapStage(shuffleDependency)
+  submittedStageFuture.get().recordsByPartitionId.toSeq
+} else {
+  Nil
+}
 
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+// During global limit, try to evenly distribute limited rows across 
data
+// partitions. If disabled, scanning data partitions sequentially 
until reaching limit number.
+// Besides, if the child output has a certain ordering, we can't evenly pick up
+// rows from each partition.
+val flatGlobalLimit = sqlContext.conf.limitFlatGlobalLimit && 
child.outputOrdering == Nil
+
+val shuffled = new ShuffledRowRDD(shuffleDependency)
+
+val sumOfOutput = numberOfOutput.sum
+if (sumOfOutput <= limit) {
+  shuffled
+} else if (!flatGlobalLimit) {
+  var numRowTaken = 0
+  val takeAmounts = numberOfOutput.map { num =>
+if (numRowTaken + num < limit) {
+  numRowTaken += num.toInt
+  num.toInt
+} else {
+  val toTake = limit - numRowTaken
+  numRowTaken += toTake
+  toTake
+}
+  }
+  val broadMap = sparkContext.broadcast(takeAmounts)
+  shuffled.mapPartitionsWithIndexInternal { case (index, iter) =>
+iter.take(broadMap.value(index).toInt)
+  }
+} else {
+  // We try to take the requested `limit` number of rows evenly across all of
+  // the child RDD's partitions.
+  var rowsNeedToTake: Long = limit
+  val takeAmountByPartition: Array[Long] = 
Array.fill[Long](numberOfOutput.length)(0L)
+  val remainingRowsByPartition: Array[Long] = Array(numberOfOutput: _*)
+
+  while (rowsNeedToTake > 0) {
+val nonEmptyParts = remainingRowsByPartition.count(_ > 0)
+// If the rows needed to take are less than the number of non-empty partitions,
+// take one row from each non-empty partition until we reach `limit` rows.
+// Otherwise, evenly divide the needed rows among the non-empty partitions.
+val takePerPart = math.max(1, rowsNeedToTake / nonEmptyParts)
+remainingRowsByPartition.zipWithIndex.foreach { case (num, index) 
=>
+  // In case `rowsNeedToTake` < `nonEmptyParts`, we may run out of 
`rowsNeedToTake` during
+  // the traversal, so we need to add this check.
+  if (rowsNeedToTake > 0 && num > 0) {
+if (num >= takePerPart) {
+  rowsNeedToTake -= takePerPart
+  takeAmountByPartition(index) += takePerPart
+  remainingRowsByPartition(index) -= takePerPart
+} else {
+  rowsNeedToTake -= num
+  takeAmountByPartition(index) += num
+  remainingRowsByPartition(index) -= num
+}
+  }
  

[GitHub] spark pull request #21527: [SPARK-24519] Make the threshold for highly compr...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21527#discussion_r218639496
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -50,7 +50,9 @@ private[spark] sealed trait MapStatus {
 private[spark] object MapStatus {
 
   def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): 
MapStatus = {
-if (uncompressedSizes.length > 2000) {
+if (uncompressedSizes.length >  Option(SparkEnv.get)
--- End diff --

this should be done once, rather than on every constructor call, shouldn't it? 
the lookup might become a hot code path for very large workloads.
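
One possible way to address this, as a sketch only: resolve the threshold once per JVM instead of on every `MapStatus.apply` call. The object and config key names below are stand-ins, not the names used by the patch; 2000 is the historical hard-coded value shown in the diff.

```scala
import org.apache.spark.SparkEnv

object HighlyCompressedThreshold {
  // Resolved lazily, once per JVM, instead of on every MapStatus.apply call.
  // Falls back to the historical hard-coded value when no SparkEnv is available (tests).
  lazy val value: Int =
    Option(SparkEnv.get)
      .map(_.conf.getInt("spark.shuffle.minNumPartitionsToHighlyCompress", 2000))
      .getOrElse(2000)
}
```

This keeps the lookup off the hot path, at the cost of fixing the value for the lifetime of the JVM.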



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22459: [SPARK-23173] rename spark.sql.fromJsonForceNulla...

2018-09-18 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22459

[SPARK-23173] rename spark.sql.fromJsonForceNullableSchema

## What changes were proposed in this pull request?
`spark.sql.fromJsonForceNullableSchema` -> 
`spark.sql.function.fromJson.forceNullable`


## How was this patch tested?
Made sure there are no more references to 
`spark.sql.fromJsonForceNullableSchema`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-23173

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22459.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22459


commit cf2420e8c43e641d2e5b8814674e208a983b1643
Author: Reynold Xin 
Date:   2018-09-19T00:46:21Z

[SPARK-23173] rename spark.sql.fromJsonForceNullableSchema

commit f6f427fd06a87274d6950550137ce0f551276ff6
Author: Reynold Xin 
Date:   2018-09-19T00:47:38Z

move




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22459: [SPARK-23173][SQL] rename spark.sql.fromJsonForceNullabl...

2018-09-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22459
  
cc @mswit-databricks @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-09-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16677
  
actually looking at the design - this could cause perf regressions in some 
cases too, right? it introduces a barrier that was previously non-existent. if 
the number of records to take isn't substantially less than the actual number of 
records in each partition, perf would be much worse. also it feels to me this isn't 
really a shuffle at all, and we are piggybacking on the wrong infrastructure. what you 
really want is a way to buffer blocks temporarily and then launch a 2nd wave of 
tasks to rerun some of them.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22344: [SPARK-25352][SQL] Perform ordered global limit w...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22344#discussion_r218633220
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -68,22 +68,42 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   object SpecialLimits extends Strategy {
 override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
   case ReturnAnswer(rootPlan) => rootPlan match {
-case Limit(IntegerLiteral(limit), Sort(order, true, child))
-if limit < conf.topKSortFallbackThreshold =>
-  TakeOrderedAndProjectExec(limit, order, child.output, 
planLater(child)) :: Nil
-case Limit(IntegerLiteral(limit), Project(projectList, Sort(order, 
true, child)))
-if limit < conf.topKSortFallbackThreshold =>
-  TakeOrderedAndProjectExec(limit, order, projectList, 
planLater(child)) :: Nil
+case Limit(IntegerLiteral(limit), s@Sort(order, true, child)) =>
+  if (limit < conf.topKSortFallbackThreshold) {
--- End diff --

btw here we really need to document what the strategies are. when there 
were only two cases it wasn't a big deal because it'd take a few seconds to 
understand, but this block is pretty large now and is difficult to understand. 
see the join strategy documentation for an example.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22344: [SPARK-25352][SQL] Perform ordered global limit w...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22344#discussion_r218632551
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -98,7 +98,8 @@ case class LocalLimitExec(limit: Int, child: SparkPlan) 
extends UnaryExecNode wi
 /**
  * Take the `limit` elements of the child output.
  */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
+case class GlobalLimitExec(limit: Int, child: SparkPlan,
+   orderedLimit: Boolean = false) extends 
UnaryExecNode {
--- End diff --

thanks @viirya 

can you write a design doc, or put it in the classdoc of limit, on how we 
handle limits? your sequence of prs is making limits much more complicated 
(with optimizations) and very difficult to reason about. i think we can make it 
easier to reason about if we actually document the execution strategy.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16677#discussion_r218631745
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -93,25 +96,93 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
 }
 
 /**
- * Take the first `limit` elements of each child partition, but do not 
collect or shuffle them.
+ * Take the `limit` elements of the child output.
  */
-case class LocalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
+case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
 
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+  override def output: Seq[Attribute] = child.output
 
   override def outputPartitioning: Partitioning = child.outputPartitioning
-}
 
-/**
- * Take the first `limit` elements of the child's single output partition.
- */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
 
-  override def requiredChildDistribution: List[Distribution] = AllTuples 
:: Nil
+  private val serializer: Serializer = new 
UnsafeRowSerializer(child.output.size)
 
-  override def outputPartitioning: Partitioning = child.outputPartitioning
+  protected override def doExecute(): RDD[InternalRow] = {
+val childRDD = child.execute()
+val partitioner = LocalPartitioning(childRDD)
+val shuffleDependency = ShuffleExchangeExec.prepareShuffleDependency(
+  childRDD, child.output, partitioner, serializer)
+val numberOfOutput: Seq[Long] = if 
(shuffleDependency.rdd.getNumPartitions != 0) {
+  // submitMapStage does not accept RDD with 0 partition.
+  // So, we will not submit this dependency.
+  val submittedStageFuture = 
sparkContext.submitMapStage(shuffleDependency)
+  submittedStageFuture.get().recordsByPartitionId.toSeq
+} else {
+  Nil
+}
 
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+// During global limit, try to evenly distribute limited rows across 
data
+// partitions. If disabled, scanning data partitions sequentially 
until reaching limit number.
+// Besides, if the child output has a certain ordering, we can't evenly pick up
+// rows from each partition.
+val flatGlobalLimit = sqlContext.conf.limitFlatGlobalLimit && 
child.outputOrdering == Nil
+
+val shuffled = new ShuffledRowRDD(shuffleDependency)
+
+val sumOfOutput = numberOfOutput.sum
+if (sumOfOutput <= limit) {
+  shuffled
+} else if (!flatGlobalLimit) {
+  var numRowTaken = 0
+  val takeAmounts = numberOfOutput.map { num =>
+if (numRowTaken + num < limit) {
+  numRowTaken += num.toInt
+  num.toInt
+} else {
+  val toTake = limit - numRowTaken
+  numRowTaken += toTake
+  toTake
+}
+  }
+  val broadMap = sparkContext.broadcast(takeAmounts)
+  shuffled.mapPartitionsWithIndexInternal { case (index, iter) =>
+iter.take(broadMap.value(index).toInt)
+  }
+} else {
+  // We try to take the requested `limit` number of rows evenly across all of
+  // the child RDD's partitions.
+  var rowsNeedToTake: Long = limit
+  val takeAmountByPartition: Array[Long] = 
Array.fill[Long](numberOfOutput.length)(0L)
+  val remainingRowsByPartition: Array[Long] = Array(numberOfOutput: _*)
+
+  while (rowsNeedToTake > 0) {
+val nonEmptyParts = remainingRowsByPartition.count(_ > 0)
+// If the rows needed to take are less than the number of non-empty partitions,
+// take one row from each non-empty partition until we reach `limit` rows.
+// Otherwise, evenly divide the needed rows among the non-empty partitions.
+val takePerPart = math.max(1, rowsNeedToTake / nonEmptyParts)
+remainingRowsByPartition.zipWithIndex.foreach { case (num, index) 
=>
+  // In case `rowsNeedToTake` < `nonEmptyParts`, we may run out of 
`rowsNeedToTake` during
+  // the traversal, so we need to add this check.
+  if (rowsNeedToTake > 0 && num > 0) {
+if (num >= takePerPart) {
+  rowsNeedToTake -= takePerPart
+  takeAmountByPartition(index) += takePerPart
+  remainingRowsByPartition(index) -= takePerPart
+} else {
+  rowsNeedToTake -= num
+  takeAmountByPartition(index) += num
+  remainingRowsByPartition(index) -= num
+}
+  }
  

[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16677#discussion_r218631682
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -93,25 +96,93 @@ trait BaseLimitExec extends UnaryExecNode with 
CodegenSupport {
 }
 
 /**
- * Take the first `limit` elements of each child partition, but do not 
collect or shuffle them.
+ * Take the `limit` elements of the child output.
  */
-case class LocalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
+case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
 
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+  override def output: Seq[Attribute] = child.output
 
   override def outputPartitioning: Partitioning = child.outputPartitioning
-}
 
-/**
- * Take the first `limit` elements of the child's single output partition.
- */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
BaseLimitExec {
+  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
 
-  override def requiredChildDistribution: List[Distribution] = AllTuples 
:: Nil
+  private val serializer: Serializer = new 
UnsafeRowSerializer(child.output.size)
 
-  override def outputPartitioning: Partitioning = child.outputPartitioning
+  protected override def doExecute(): RDD[InternalRow] = {
+val childRDD = child.execute()
+val partitioner = LocalPartitioning(childRDD)
+val shuffleDependency = ShuffleExchangeExec.prepareShuffleDependency(
+  childRDD, child.output, partitioner, serializer)
+val numberOfOutput: Seq[Long] = if 
(shuffleDependency.rdd.getNumPartitions != 0) {
+  // submitMapStage does not accept RDD with 0 partition.
+  // So, we will not submit this dependency.
+  val submittedStageFuture = 
sparkContext.submitMapStage(shuffleDependency)
+  submittedStageFuture.get().recordsByPartitionId.toSeq
+} else {
+  Nil
+}
 
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
+// During global limit, try to evenly distribute limited rows across 
data
+// partitions. If disabled, scanning data partitions sequentially 
until reaching limit number.
+// Besides, if the child output has a certain ordering, we can't evenly pick up
+// rows from each partition.
+val flatGlobalLimit = sqlContext.conf.limitFlatGlobalLimit && 
child.outputOrdering == Nil
+
+val shuffled = new ShuffledRowRDD(shuffleDependency)
+
+val sumOfOutput = numberOfOutput.sum
+if (sumOfOutput <= limit) {
+  shuffled
+} else if (!flatGlobalLimit) {
+  var numRowTaken = 0
+  val takeAmounts = numberOfOutput.map { num =>
+if (numRowTaken + num < limit) {
+  numRowTaken += num.toInt
+  num.toInt
+} else {
+  val toTake = limit - numRowTaken
+  numRowTaken += toTake
+  toTake
+}
+  }
+  val broadMap = sparkContext.broadcast(takeAmounts)
+  shuffled.mapPartitionsWithIndexInternal { case (index, iter) =>
+iter.take(broadMap.value(index).toInt)
+  }
+} else {
+  // We try to take the requested `limit` number of rows evenly across all of
+  // the child RDD's partitions.
+  var rowsNeedToTake: Long = limit
+  val takeAmountByPartition: Array[Long] = 
Array.fill[Long](numberOfOutput.length)(0L)
+  val remainingRowsByPartition: Array[Long] = Array(numberOfOutput: _*)
+
+  while (rowsNeedToTake > 0) {
+val nonEmptyParts = remainingRowsByPartition.count(_ > 0)
+// If the rows needed to take are less than the number of non-empty partitions,
+// take one row from each non-empty partition until we reach `limit` rows.
+// Otherwise, evenly divide the needed rows among the non-empty partitions.
+val takePerPart = math.max(1, rowsNeedToTake / nonEmptyParts)
+remainingRowsByPartition.zipWithIndex.foreach { case (num, index) 
=>
+  // In case `rowsNeedToTake` < `nonEmptyParts`, we may run out of 
`rowsNeedToTake` during
+  // the traversal, so we need to add this check.
+  if (rowsNeedToTake > 0 && num > 0) {
+if (num >= takePerPart) {
+  rowsNeedToTake -= takePerPart
+  takeAmountByPartition(index) += takePerPart
+  remainingRowsByPartition(index) -= takePerPart
+} else {
+  rowsNeedToTake -= num
+  takeAmountByPartition(index) += num
+  remainingRowsByPartition(index) -= num
+}
+  }
  

[GitHub] spark pull request #22344: [SPARK-25352][SQL] Perform ordered global limit w...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22344#discussion_r218631461
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -98,7 +98,8 @@ case class LocalLimitExec(limit: Int, child: SparkPlan) 
extends UnaryExecNode wi
 /**
  * Take the `limit` elements of the child output.
  */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
+case class GlobalLimitExec(limit: Int, child: SparkPlan,
+   orderedLimit: Boolean = false) extends 
UnaryExecNode {
--- End diff --

code needs to be documented. a year from now, someone looking at the source code 
and trying to figure out what it means won't find this pr discussion. also the 
doc needs to be readable. the current doc for the config flag is unfortunately 
unparsable.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22344: [SPARK-25352][SQL] Perform ordered global limit w...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22344#discussion_r218630599
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -98,7 +98,8 @@ case class LocalLimitExec(limit: Int, child: SparkPlan) 
extends UnaryExecNode wi
 /**
  * Take the `limit` elements of the child output.
  */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
+case class GlobalLimitExec(limit: Int, child: SparkPlan,
+   orderedLimit: Boolean = false) extends 
UnaryExecNode {
--- End diff --

please document it in code.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16677#discussion_r218630513
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruningSuite.scala 
---
@@ -22,21 +22,29 @@ import scala.collection.JavaConverters._
 import org.scalatest.BeforeAndAfter
 
 import org.apache.spark.sql.hive.test.{TestHive, TestHiveQueryExecution}
+import org.apache.spark.sql.internal.SQLConf
 
 /**
  * A set of test cases that validate partition and column pruning.
  */
 class PruningSuite extends HiveComparisonTest with BeforeAndAfter {
 
+  private val originalLimitFlatGlobalLimit = 
TestHive.conf.limitFlatGlobalLimit
+
   override def beforeAll(): Unit = {
 super.beforeAll()
 TestHive.setCacheTables(false)
+TestHive.setConf(SQLConf.LIMIT_FLAT_GLOBAL_LIMIT, false)
--- End diff --

why do we set this flag here? we need to document it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16677#discussion_r218630488
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -557,11 +557,13 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
   }
 
   test("SPARK-18004 limit + aggregates") {
-val df = Seq(("a", 1), ("b", 2), ("c", 1), ("d", 5)).toDF("id", 
"value")
-val limit2Df = df.limit(2)
-checkAnswer(
-  limit2Df.groupBy("id").count().select($"id"),
-  limit2Df.select($"id"))
+withSQLConf(SQLConf.LIMIT_FLAT_GLOBAL_LIMIT.key -> "true") {
--- End diff --

why do we set this flag here? we need to document it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16677#discussion_r218630324
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -204,6 +204,13 @@ object SQLConf {
 .intConf
 .createWithDefault(4)
 
+  val LIMIT_FLAT_GLOBAL_LIMIT = 
buildConf("spark.sql.limit.flatGlobalLimit")
+.internal()
+.doc("During global limit, try to evenly distribute limited rows 
across data " +
+  "partitions. If disabled, scanning data partitions sequentially 
until reaching limit number.")
+.booleanConf
+.createWithDefault(true)
--- End diff --

so i read this config doc five times, and i still couldn't figure out what 
it does, until i went ahead and read the implementation.
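
For what it's worth, a possible clearer wording, sketched here against the flag's current name and default, and assuming it stays inside `SQLConf` where `buildConf` is defined (this is a suggestion, not the merged text):

```scala
val LIMIT_FLAT_GLOBAL_LIMIT = buildConf("spark.sql.limit.flatGlobalLimit")
  .internal()
  .doc("When true, a global LIMIT over unordered output takes roughly the same number " +
    "of rows from every shuffle partition, using map output statistics to decide how " +
    "many rows to take from each one. When false, rows are taken from partitions " +
    "sequentially (all of partition 0, then partition 1, and so on) until the limit " +
    "is reached.")
  .booleanConf
  .createWithDefault(true)
```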


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22344: [SPARK-25352][SQL] Perform ordered global limit w...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22344#discussion_r218629650
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -98,7 +98,8 @@ case class LocalLimitExec(limit: Int, child: SparkPlan) 
extends UnaryExecNode wi
 /**
  * Take the `limit` elements of the child output.
  */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
+case class GlobalLimitExec(limit: Int, child: SparkPlan,
+   orderedLimit: Boolean = false) extends 
UnaryExecNode {
--- End diff --

what do you mean by "it's not goes for TakeOrderedAndProjectExec"?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22344: [SPARK-25352][SQL] Perform ordered global limit when lim...

2018-09-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22344
  
guys - the whole sequence of prs for this feature is contributing a lot of 
cryptic code with arcane documentation everywhere. i worry a lot about the 
maintainability of the code that's coming in. can you submit a pr to improve 
the readability? otherwise i think we should revert all of them.

please document the algorithm, the change, and the config flags. it shouldn't 
require reading the actual implementation to understand the config flags.

we as committers should also do a better job of gating unreadable code.
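
As a concrete starting point, the kind of classdoc being asked for might look like the sketch below, reconstructed only from the code paths visible in the diffs above; the wording is a suggestion, not merged text.

```scala
/**
 * Takes the first `limit` rows of the child output.
 *
 * Row selection works in one of three ways:
 *  - If the total number of rows produced by the child is at most `limit`,
 *    every row is passed through unchanged.
 *  - If `spark.sql.limit.flatGlobalLimit` is disabled, or the child output has an
 *    ordering that must be preserved, rows are taken from partitions sequentially
 *    (all of partition 0, then partition 1, and so on) until `limit` rows are taken.
 *  - Otherwise, the `limit` rows are spread roughly evenly across the non-empty
 *    partitions, using per-partition record counts obtained from map output statistics.
 */
```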


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22344: [SPARK-25352][SQL] Perform ordered global limit w...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22344#discussion_r218623478
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala ---
@@ -98,7 +98,8 @@ case class LocalLimitExec(limit: Int, child: SparkPlan) 
extends UnaryExecNode wi
 /**
  * Take the `limit` elements of the child output.
  */
-case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode {
+case class GlobalLimitExec(limit: Int, child: SparkPlan,
+   orderedLimit: Boolean = false) extends 
UnaryExecNode {
--- End diff --

what does orderedLimit mean here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22457: [SPARK-24626] Add statistics prefix to parallelFi...

2018-09-18 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22457

[SPARK-24626] Add statistics prefix to parallelFileListingInStatsComputation

## What changes were proposed in this pull request?
To be more consistent with other statistics-based configs.

## How was this patch tested?
N/A - straightforward rename of a config option. Used `git grep` to make sure 
no references to the old name remain.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-24626

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22457.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22457


commit 8d46d7246fbac185e066cc542639d2fb5ce10a4b
Author: Reynold Xin 
Date:   2018-09-18T22:58:52Z

[SPARK-24626] Add statistics prefix to parallelFileListingInStatsComputation




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22456: [SPARK-19355][SQL] Fix variable names numberOfOutput -> ...

2018-09-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22456
  
cc @hvanhovell @cloud-fan 

also @viirya please don't use such cryptic variable names ... we also need 
to fix the documentation for the config flag - it's arcane.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22456: [SPARK-19355][SQL] Fix variable names numberOfOut...

2018-09-18 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/22456

[SPARK-19355][SQL] Fix variable names numberOfOutput

## What changes were proposed in this pull request?
SPARK-19355 introduced a variable / method called numberOfOutput, which is 
a really bad name because it is unclear whether it refers to blocks or rows. This 
patch renames it to numRecords, and also changes a couple of other places to make 
them consistent.

## How was this patch tested?
Should be covered by existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-19355

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22456.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22456


commit 793fc19d2519f47f5f3278b79e827f1159d9e440
Author: Reynold Xin 
Date:   2018-09-18T22:47:57Z

[SPARK-19355][SQL] Fix variable names.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2018-09-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16677
  
two questions about this (i just saw this from a different place):

1. is numOutput the number of records?

2. how much will memory usage on the driver increase at scale?
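
On question 2, a rough back-of-envelope, under the assumption that the added driver-side cost is one extra `Long` per map status (the `numberOfOutput` field in the diff) plus one `Long` per shuffle partition for the aggregated `recordsByPartitionId` array:

```scala
// Rough estimate only; ignores JVM object overhead and any boxing.
val mapTasks = 1000000L          // hypothetical very large stage
val shufflePartitions = 10000L
val extraBytes = 8L * (mapTasks + shufflePartitions)
println(f"~${extraBytes / 1e6}%.1f MB of extra Longs on the driver")  // ~8.1 MB
```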



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16677: [SPARK-19355][SQL] Use map output statistics to i...

2018-09-18 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16677#discussion_r218614872
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -44,18 +45,23 @@ private[spark] sealed trait MapStatus {
* necessary for correctness, since block fetchers are allowed to skip 
zero-size blocks.
*/
   def getSizeForBlock(reduceId: Int): Long
+
+  /**
+   * The number of outputs for the map task.
+   */
+  def numberOfOutput: Long
--- End diff --

what does this mean? output blocks? output files?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22395: [SPARK-16323][SQL] Add IntegralDivide expression

2018-09-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22395
  
Looks like a use case for a legacy config.

On Mon, Sep 17, 2018 at 6:41 PM Wenchen Fan 
wrote:

> To clarify, it's not following hive, but following the behavior of
> previous Spark versions, which is same as hive.
>
> I also think returning left operand's type is more reasonable, but we
> should do it in another PR since it's a behavior change, and we should 
also
> add migration guide for it.
>
> @mgaido91 <https://github.com/mgaido91> do you have time to do this
> change? Thanks!
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/22395#issuecomment-45132>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPDeW3F4Jsc-gS6CFrrGZY_lFXGxbks5ucE9WgaJpZM4Wjmfh>
> .
>
-- 
--
excuse the brevity and lower case due to wrist injury



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22395: [SPARK-16323][SQL] Add IntegralDivide expression

2018-09-17 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22395
  
why are we always returning long type here? shouldn't it be the same as 
the left expr's type? see mysql:

```mysql> create temporary table rxin_temp select 4 div 2, 123456789124 div 2, 4 / 2, 123456789124 / 2;
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> describe rxin_temp;
+--------------------+---------------+------+-----+---------+-------+
| Field              | Type          | Null | Key | Default | Extra |
+--------------------+---------------+------+-----+---------+-------+
| 4 div 2            | int(1)        | YES  |     | NULL    |       |
| 123456789124 div 2 | bigint(12)    | YES  |     | NULL    |       |
| 4 / 2              | decimal(5,4)  | YES  |     | NULL    |       |
| 123456789124 / 2   | decimal(16,4) | YES  |     | NULL    |       |
+--------------------+---------------+------+-----+---------+-------+
4 rows in set (0.01 sec)```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22442: [SPARK-25447][SQL] Support JSON options by schema...

2018-09-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22442#discussion_r218250393
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3611,6 +3611,20 @@ object functions {
*/
   def schema_of_json(e: Column): Column = withExpr(new 
SchemaOfJson(e.expr))
 
+  /**
+   * Parses a column containing a JSON string and infers its schema using 
options.
+   *
+   * @param e a string column containing JSON data.
+   * @param options JSON datasource options that control JSON parsing and 
type inference.
--- End diff --

maybe you can say refer to DataFrameReader.json for the list of options
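
One possible phrasing for the param doc, as a sketch:

```scala
/**
 * Parses a column containing a JSON string and infers its schema using options.
 *
 * @param e a string column containing JSON data.
 * @param options options to control how the JSON is parsed. Accepts the same options
 *                as the JSON data source; refer to `DataFrameReader.json` for the list.
 */
```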


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21433: [SPARK-23820][CORE] Enable use of long form of callsite ...

2018-09-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21433
  
Yea we can add this back easily.

On Tue, Sep 11, 2018 at 12:50 PM Sean Owen  wrote:

> Given lack of certainty, and that's this is small and easy to add back in
> a different form, and the fact that 2.4 is quickly teeing up, let me 
revert
> this for now. We can proceed with a different approach in a new PR.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21433#issuecomment-420401254>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPPKRRxsg30kJA9RAItGJDHPF4mX_ks5uaBQjgaJpZM4UONdo>
> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22010: [SPARK-21436][CORE] Take advantage of known partitioner ...

2018-09-08 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22010
  
Actually @holdenk is this change even correct? RDD.distinct is not key-based; 
it is based on the value of the elements in the RDD. Even if `numPartitions 
== partitions.length`, it doesn't mean the RDD is hash-partitioned this way.

Consider this RDD:

Partition 1: 1, 2, 3
Partition 2: 1, 2, 3

rdd.distinct() should return 1, 2, 3

with your change it'd still return 1, 2, 3, 1, 2, 3.
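
The concern is easy to check locally; a minimal sketch, assuming a local `SparkContext` named `sc`, showing that per-partition deduplication alone is not enough when equal elements live in different partitions:

```scala
// partition 0 gets (1, 2, 3) and partition 1 gets (1, 2, 3)
val rdd = sc.parallelize(Seq(1, 2, 3, 1, 2, 3), 2)

// deduplicating within each partition only: duplicates across partitions survive
val perPartitionOnly = rdd.mapPartitions(_.toSeq.distinct.iterator, preservesPartitioning = true)
perPartitionOnly.collect().sorted   // Array(1, 1, 2, 2, 3, 3)

// the full distinct (map to (x, null), reduceByKey, map back) gives the expected result
rdd.distinct().collect().sorted     // Array(1, 2, 3)
```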



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22010#discussion_r216145892
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -396,7 +396,26 @@ abstract class RDD[T: ClassTag](
* Return a new RDD containing the distinct elements in this RDD.
*/
   def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): 
RDD[T] = withScope {
-map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
+partitioner match {
--- End diff --

you can just create a new MapPartitionsRDD with preservesPartitioning set 
to true, can't you?
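
A minimal sketch of that suggestion, assuming the RDD is already partitioned so that equal elements are guaranteed to land in the same partition; `mapPartitions` builds the `MapPartitionsRDD` internally, so the private class does not need to be constructed directly:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Deduplicate within each partition while keeping the existing partitioner.
// Only correct when equal elements cannot appear in different partitions.
def distinctPerPartition[T: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.mapPartitions(iter => iter.toSet.iterator, preservesPartitioning = true)
```

(A streaming scan over a mutable HashSet would avoid materializing the whole partition at once.)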


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22332: [SPARK-25333][SQL] Ability add new columns in Dataset in...

2018-09-06 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22332
  
Thanks guys.

On Thu, Sep 6, 2018 at 2:12 AM Hyukjin Kwon 
wrote:

> Thanks, @wmellouli <https://github.com/wmellouli>.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/22332#issuecomment-419022484>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPJcYji4KcDEN2c9ruguA1X9MH5_Gks5uYOb5gaJpZM4WZYhq>
> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21721: [SPARK-24748][SS] Support for reporting custom metrics v...

2018-09-04 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21721
  
BTW I think this is probably SPIP-worthy. At the very least we should write 
a design doc on this, similar to the other docs for dsv2 sub-components. We 
should really think about whether it'd be possible to unify the three modes 
(batch, micro-batch streaming, and continuous processing).



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21721: [SPARK-24748][SS] Support for reporting custom metrics v...

2018-09-04 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21721
  
Given the uncertainty about how this works across batch, streaming, and CP, 
and given we are still fleshing out the main APIs, I think we should revert 
this and revisit when the main APIs are done.

In general for API design, it is best to flesh out the big skeletons first, 
and then work on filling in the gaps. Think about building a house: you build the 
frame, put the studs in, then the walls, and then do the final finish. You don't 
start by putting plumbing fixtures in one room when you are still moving the 
main plumbing lines.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21721: [SPARK-24748][SS] Support for reporting custom metrics v...

2018-08-31 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21721
  
I will take a look at this tomorrow, since I’m already looking at data
source apis myself. Can provide opinion after another look on whether we
should keep it unstable or revert.

On Fri, Aug 31, 2018 at 12:07 AM Hyukjin Kwon 
wrote:

> So .. @cloud-fan <https://github.com/cloud-fan>, and @rxin
> <https://github.com/rxin>, how about this:
>
>1. Mark this as Unstable for now - which means we likely happen to
>change this if we face a design issue.
>2. Write a design doc to cover up continuous and batch and find out
>better way (or check if the current way works).
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/21721#issuecomment-417572882>, or 
mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/AATvPIgkFv4_bAQyopGaxBUfF6Zt6Dvvks5uWODMgaJpZM4VExgx>
> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21721: [SPARK-24748][SS] Support for reporting custom metrics v...

2018-08-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21721
  
I'm confused by this api. Is this for streaming only? If yes, why is it not in 
the streaming package? If not, I only found a streaming implementation. Maybe 
I missed it.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21721: [SPARK-24748][SS] Support for reporting custom metrics v...

2018-08-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/21721
  
Stuff like this merits api discussions. Not just implementation changes ...



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22227: [SPARK-25202] [SQL] Implements split with limit sql func...

2018-08-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22227
  
Please remove the 0 semantics. IMO the zero vs negative number difference 
is too subtle. Java's String.split is the only API I can find that supports it; Python doesn't.
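
For reference, the three regimes of Java's `String.split` limit (which Scala strings delegate to), illustrating the subtlety in question:

```scala
"a,b,c,,".split(",", 2)    // Array("a", "b,c,,")        at most `limit` entries
"a,b,c,,".split(",", -1)   // Array("a", "b", "c", "", "")  trailing empty strings kept
"a,b,c,,".split(",", 0)    // Array("a", "b", "c")       trailing empty strings removed
```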


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

2018-08-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22227#discussion_r214135400
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -229,33 +229,58 @@ case class RLike(left: Expression, right: Expression) 
extends StringRegexExpress
 
 
 /**
- * Splits str around pat (pattern is a regular expression).
+ * Splits str around matches of the given regex.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match 
`regex`.",
+  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences 
that match `regex`" +
+" and returns an array of at most `limit`",
+  arguments = """
+Arguments:
+  * str - a string expression to split.
+  * regex - a string representing a regular expression. The regex 
string should be a
+Java regular expression.
+  * limit - an integer expression which controls the number of times 
the regex is applied.
+
+limit > 0: The resulting array's length will not be more than 
`limit`, and the resulting
+   array's last entry will contain all input beyond the 
last matched regex.
+
+limit < 0: `regex` will be applied as many times as possible, and 
the resulting
+   array can be of any size.
+
+limit = 0: `regex` will be applied as many times as possible, the 
resulting array can
--- End diff --

yea but i'd focus on what behavior we want to enable. do other database 
systems have this limit=0 semantics for split? if not, i'd rewrite limit=0 
internally to just -1.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



  1   2   3   4   5   6   7   8   9   10   >