[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@wzhfy @scwf Thanks for the comments.

Until we figure out a way to avoid the defect, I will close this.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@wzhfy Thanks for the comment.

I meant that we can deal only with the cases where we are confident that the total 
row count across all partitions must be much larger than the limit number. I am not 
sure if cbo can tell in which cases the estimation is more accurate. Can we obtain 
such a parameter from the cbo framework?

If we can't be sure that we won't hit the case where the total row count is nearly 
the same as the limit number, then the change proposed here can't be used, since we 
don't want to bring a performance regression in those cases.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/16633
  
Hi @viirya , the main concern of @scwf is that we can't afford a performance 
regression in any customer scenario. I think you can understand that :)

I went through the discussion above; it seems we already have solutions for 
both cases you mentioned 
[here](https://github.com/apache/spark/pull/16633#issuecomment-273963150), so the 
talking points become the following two:
1. how to decide the threshold between the two cases;
2. the rdd chain is broken.

Let's wait for @rxin 's comment on the second point. 

Here I'm just interested in the first one.
One possible way to get the number is to modify the mapoutput statistics, as 
suggested by @scwf .
For cbo, if the computing logic before the limit is complex, it's hard to get 
an accurate estimation. E.g. joins over filtered tables, where join keys and 
filter keys are probably different (that would need column correlation info).
You mentioned we can get an estimated number and a confidence; can you 
describe how?






[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
@viirya i suggest fixing case 2 in this pr; let's wait for some comments on case 1.  
/cc @rxin and @wzhfy, who may comment on the first case.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
we don't need an accurate number. we can have a confidence margin.

the downside of a broken rdd chain is re-processing the rows. anything else?

I don't think it is worth changing core and the scheduler for this purpose. too
risky, and it might introduce new bugs.

we can still avoid shuffling for case 2. we don't need to shuffle those
partitions.








[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
For 1, my idea is to not use the proposal in this PR, because:
1. how do you determine `total rows in all partitions are (much) more than 
limit number` and then go into this code path, and how do you decide the `much more 
than`? i can not use cbo estimated stats here because the locallimit plan may be 
complex and we can not ensure the accuracy of the estimated row number.
2. as @rxin suggests, this breaks the rdd chain.

So for 1, i think it needs some improvement of spark core and the scheduler, as i 
mentioned above.

For 2 it is ok to me; the solution is the same as i described above (still 
shuffle + shuffle to multiple partitions + modified mapoutput statistics), right?





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
Ok. I think it is clearer now. We have two cases to solve:

1. After the local limit, the total rows in all partitions are (much) more than 
the limit number.
2. After the local limit, the total rows in all partitions are nearly the limit 
number.

For 1, the current change in this PR is effective: we can save the shuffling 
and most of the local limit processing.

For 2, the current change will re-process all the rows, so it is not 
efficient. Falling back to the old global limit will degrade parallelism, so if the 
limit number is big, the performance will be bad. One solution is to get the exact 
number of rows in each partition after the local limit via modified mapoutput 
statistics, and then take only the partitions needed to reach the required number 
of rows (see the sketch below).

@scwf What do you think?
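
A minimal sketch of that partition-selection step, assuming a modified mapoutput 
statistics that reports per-partition row counts (`rowCounts` below stands in for 
that assumed output; Spark's actual MapOutputStatistics tracks bytes per partition, 
not rows):

```scala
// Sketch only: pick the shortest prefix of partitions whose row counts,
// reported by a hypothetical row-count-aware mapoutput statistics, cover
// the limit; if no prefix suffices, every partition is needed (case 2).
def partitionsToTake(rowCounts: Array[Long], limit: Long): Seq[Int] = {
  val cumulative = rowCounts.scanLeft(0L)(_ + _).tail  // running row totals
  val enough = cumulative.indexWhere(_ >= limit)       // first prefix covering the limit
  if (enough < 0) rowCounts.indices else 0 to enough
}
```

For example, `partitionsToTake(Array(40L, 30L, 30L, 30L), 100)` would fetch only 
the first three partitions, which together hold exactly 100 rows.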








[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
the total rows in all partitions after the local limit are about/nearly 100,000,000 rows





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
Do you mean the total rows in all partitions after the local limit are 
about/nearly 100,000,000 rows? Or that each partition after the local limit has 
about/nearly 100,000,000 rows?





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
Again, to be clear, I am against the performance regression in the following case:
0.  the limit num is 100,000,000
1.  the original table row count is very big, much larger than 100,000,000 rows
2.  after the local limit stage, the output row num is about/nearly 100,000,000 
rows







[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
That is why I propose avoiding the shuffle to a single partition. We can save 
the shuffling and keep parallelism. So I don't know what you are against.






[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
I think the shuffle is ok, but shuffling to one partition leads to the 
performance issue.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf So it sounds like the problem is the shuffling.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
Assume the local limit outputs 100,000,000 rows; then the global limit will 
take them in a single partition, so it is very slow and can not use other free 
cores to improve parallelism.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf I am not sure you have really thought this through. Can you describe the 
single partition issue based on your understanding?





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
@viirya my team member posted to the mail list; actually we mean the case i 
listed above. The main issue is the single partition issue in the global limit; 
if you fall back to the old global limit in that case, it is still unresolved.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
That case only happens when the total row count across all partitions is less 
than or (nearly) equal to the limit number, so it needs to scan (almost) all 
partitions.

One possible way to deal with this case is to use row count statistics to 
decide whether we do this global limit without a shuffle, or the old global limit.
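
A rough sketch of the shape of that decision, assuming the planner can hand us an 
estimated total row count and a safety margin (both `estimatedRows` and `margin` 
are assumptions here, not existing Spark APIs):

```scala
// Only choose the shuffle-free global limit when the estimated row count
// exceeds the limit by a comfortable margin; otherwise keep the old plan.
// `estimatedRows` and `margin` are hypothetical inputs; sketch only.
def shuffleFreeLimitIsSafe(estimatedRows: BigInt, limit: Long, margin: Int = 10): Boolean =
  estimatedRows >= BigInt(limit) * margin
```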





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
I think the local limit cost is important. Assume the number of recomputed 
partitions is m and the total number of partitions is n:
m = 1, n = 100 is a favorable case, but there are also cases where m is very 
close to n (even m = n).

Our customers have this scenario, so i care a lot about this issue. 





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
Your proposal avoids the cost of computing all partitions and shuffling for 
the local limit, but it introduces recomputation of some partitions in the local 
limit stage.

We can not decide which cost is cheaper (in most cases); note the computation 
logic of the local limit stage may be very complex and costly. 





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf You still don't get the point. Although a few partitions need to be 
recomputed in the local limit, most of the other partitions are saved from 
computation. In most cases, it is worth it. You can refer to my previous comment, 
which shows a simple calculation.

It doesn't matter whether the local limit is simple or complex. In simple words, 
one option recomputes 1 partition, the other computes all 100 partitions; 
which one is better?







[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf that is fair. anyway, i don't think a proposal that can't improve any of 
the issues is worth so many requested changes...





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
I think before comparing our proposals, we should first make sure each 
proposal will not bring a performance regression. 





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf I understand your point. But the main issue is, you can't save the 
local limit cost and the shuffling cost: you still need to process all rows in 
all partitions and shuffle (some of) them to a single partition.

Simply said, my previous comment shows that, in most cases, your 
proposal won't perform better than this change.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
I don't get you, but let me explain more.
If we use map output statistics to decide how many elements each global limit 
task should take:
1.  the local limit shuffles with the maillist partitioner and returns the map 
output statistics
2.  each global limit partition takes or drops some rows (just like what you 
do in this pr) based on the statistics (see the sketch below)

Then,
1. the shuffle cost is almost the same as now
2. the global limit has no single partition issue with a big limit number
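
A sketch of what step 2 could look like, assuming the statistics report exact 
per-partition row counts (a hypothetical extension; real map output statistics 
report bytes):

```scala
// Decide how many rows each reduce partition keeps so the kept rows sum to
// exactly `limit`; earlier partitions are filled first, the rest is dropped.
def takePerPartition(rowCounts: Array[Long], limit: Long): Array[Long] = {
  var remaining = limit
  rowCounts.map { count =>
    val take = math.min(count, remaining)  // keep what this partition can supply
    remaining -= take                      // anything beyond the limit is dropped
    take
  }
}
```

For instance, `takePerPartition(Array(30L, 30L, 30L), 70)` yields `Array(30, 30, 10)`.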





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf Even if it works in the end, I don't think it is better in performance.

Let's simply calculate it. Assume the limit number is `n`, the partition number is 
`N`, and each partition has `n / r` rows on average.

For this change, in a worst case, suppose the first scanned partition 
returns 0 rows; then we quadruple the partitions to scan and each partition 
returns `n / r` rows. So we scan `4 * n / r + n = n * (4 + r) / r` rows 
in total.

Suppose you can know how many elements to retrieve from each partition back 
to a single partition for the global limit operation. You need to produce all rows 
in all partitions (`N * n / r`) plus shuffle `n` rows to the single partition.

If we don't consider the shuffling cost and compare `n * (4 + r) / r` with `N * 
n / r`, your solution scans fewer rows only if `N < 4 + r`. For example, if each 
partition has `n / 2` rows, `N` must be less than `6`. So your solution will only 
perform better when the partition number is relatively small.
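
That comparison is easy to sanity-check numerically; the following is just a 
direct transcription of the two formulas above:

```scala
// Rows scanned in the worst case of this PR vs. the shuffle-based
// alternative, with n = limit, bigN = partitions, n / r = rows per partition.
def prWorstCase(n: Double, r: Double): Double = n * (4 + r) / r
def shuffleAlternative(n: Double, r: Double, bigN: Double): Double = bigN * n / r

// With r = 2 the break-even point is N = 4 + r = 6:
// prWorstCase(1000, 2) == 3000.0
// shuffleAlternative(1000, 2, 6) == 3000.0
```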







[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
we need to define a new map output statistics to do this





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf I don't think that would work. The map output statistics are just the 
approximate number of output bytes. You can't use them to get a correct row count.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
Yes, you are right; we can not ensure a uniform distribution for the global 
limit.
An idea is to not use a special partitioner: after the shuffle, we get 
the mapoutput statistics for the row count of each bucket and decide how many 
elements each global limit task should take.






[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71633/
Test PASSed.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16633
  
**[Test build #71633 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)**
 for PR 16633 at commit 
[`3cbd6ee`](https://github.com/apache/spark/commit/3cbd6ee19a994d368a4130da47a2554bd0019679).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf No. A simple example: suppose there are 5 local limits which produce 1, 2, 
1, 1, 1 rows when the limit is 10. If you shuffle to 5 partitions, the 
distribution from each local limit looks like:

1: (1, 0, 0, 0, 0)
2: (1, 1, 0, 0, 0)
3: (1, 0, 0, 0, 0)
4: (1, 0, 0, 0, 0)
5: (1, 0, 0, 0, 0)

So the final rows in the 5 partitions are (5, 1, 0, 0, 0), which is not 
uniformly distributed.

You don't know how many rows each local limit can produce. So how do you know 
how many partitions to use and how many rows to retrieve from each partition?






[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
refer to the maillist:
> One issue left is how to decide the shuffle partition number. 
> We can have a config for the maximum number of elements each GlobalLimit task 
> processes, then do a factorization to get a number closest to that config. 
> E.g. the config is 2000: 
> if limit = 10,000: 10,000 = 2000 * 5, we shuffle to 5 partitions 
> if the limit factors as (a number close to 2000) * 9, we shuffle to 9 partitions 
> if the limit is a prime number, we just fall back to a single partition 

You mean for the prime number case?
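
A naive sketch of that factorization idea (the O(limit) divisor scan and the 
single-partition fallback are only illustrative, not the proposal's actual 
implementation):

```scala
// Pick the largest divisor of `limit` that fits the per-task row budget and
// shuffle into limit / divisor partitions; a limit with no usable factor
// (e.g. a prime) falls back to a single partition, as in the quote above.
def shufflePartitions(limit: Long, maxRowsPerTask: Long): Long = {
  val budget = math.min(limit, maxRowsPerTask)
  val perTask = (budget to 1L by -1L).find(limit % _ == 0).getOrElse(1L)
  if (perTask == 1L) 1L else limit / perTask  // e.g. limit = 10000, budget = 2000 -> 5
}
```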





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16633
  
**[Test build #71633 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71633/testReport)**
 for PR 16633 at commit 
[`3cbd6ee`](https://github.com/apache/spark/commit/3cbd6ee19a994d368a4130da47a2554bd0019679).





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf 
> it uses a special partitioner to do this; the partitioner is like `row_number` 
in sql: it gives each row a partition id uniformly, so in the reduce stage, each 
task handles a very close number of rows.

I see @wzhfy wants to use a partitioner to uniformly distribute the rows from 
each local limit. However, because each local limit can produce a different 
number of rows, you can't get a truly uniform distribution. So in the global 
limit operation, you can't know how many partitions you need to use in order to 
satisfy the final limit number.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
To be clear, now we have these issues:
1.  the local limit computes all partitions; that means it launches many tasks, 
but actually maybe a few tasks would be enough.
2.  the global limit single partition issue: the global limit currently shuffles 
all the data to one partition, so if the limit num is very big, it causes a 
performance bottleneck 

It would be perfect if we could combine the global limit and local limit into one 
stage and avoid the shuffle, but for now i can not find a very good 
solution (with no performance regression) to do this without changing spark 
core/scheduler. Your solution is trying to do that, but as i suggested, there are 
some cases where the performance may be worse.

@wzhfy 's idea just resolves the single partition issue: still a shuffle, 
still a local limit on all the partitions, but it does not bring performance down 
in those cases compared with the current code path.

> Another issue is, how do you make sure you create a uniform distribution 
of the result of local limit. Each local limit can produce different number of 
rows.

it uses a special partitioner to do this; the partitioner is like `row_number` 
in sql: it gives each row a partition id uniformly, so in the reduce stage, each 
task handles a very close number of rows, as in the sketch below.
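
A bare-bones sketch of such a partitioner (an assumption of what "row_number-like" 
could mean: a counter handing out partition ids round-robin; not the actual 
proposed implementation):

```scala
import org.apache.spark.Partitioner

// Rows are assigned to reduce partitions in round-robin order, so every
// reduce task receives a nearly equal share of the rows it sees. Sketch
// only: real code would keep the counter per map task rather than as a
// field on the (serialized, per-task copy of the) partitioner.
class RoundRobinPartitioner(override val numPartitions: Int) extends Partitioner {
  private var nextRow = -1L
  override def getPartition(key: Any): Int = {  // the key is ignored entirely
    nextRow += 1
    (nextRow % numPartitions).toInt
  }
}
```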






[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71627/
Test FAILed.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16633
  
**[Test build #71627 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71627/testReport)**
 for PR 16633 at commit 
[`6ba8b28`](https://github.com/apache/spark/commit/6ba8b284ec8f43a76c9ba54349438e484a097223).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class LocalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode with CodegenSupport `
  * `case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode `





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@scwf The main issue the user posted on the mailing list is that the limit is 
big enough, or the partition number is big enough, to cause a performance 
bottleneck when shuffling the output of the local limit. But @wzhfy's idea also 
involves shuffling.

Another issue is, how do you make sure you create a uniform distribution of 
the results of the local limits? Each local limit can produce a different number 
of rows.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread scwf
Github user scwf commented on the issue:

https://github.com/apache/spark/pull/16633
  
@viirya @rxin i support the idea of @wzhfy in the maillist 
http://apache-spark-developers-list.1001551.n3.nabble.com/Limit-Query-Performance-Suggestion-td20570.html,
 which solves the single partition issue in the global limit without breaking the 
job chain. 

For the local limit it still computes all the partitions; i think we can 
consider resolving the local limit issue with some changes in the core scheduler 
in the future. We could provide a mechanism: do not compute all the tasks in a 
stage if some condition is satisfied for the stage.

what do you think?





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@rxin even if it breaks the RDD job chain, I think it is still useful in some 
cases; for example, when the number of partitions is big and you only need 
one or a few partitions to satisfy the limit.

A broken RDD job chain means we do an extra scan of those few partitions, but 
you save the time of scanning all the other partitions and of the shuffle.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@rxin ok. I see what you mean by breaking the RDD job chain.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16633
  
**[Test build #71627 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71627/testReport)**
 for PR 16633 at commit 
[`6ba8b28`](https://github.com/apache/spark/commit/6ba8b284ec8f43a76c9ba54349438e484a097223).





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16633
  
@rxin Can you explain it more? I don't get it. Why does it break the RDD job 
chain?





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16633
  
This breaks the RDD job chain, doesn't it?






[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16633
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71605/
Test FAILed.





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16633
  
**[Test build #71605 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71605/testReport)**
 for PR 16633 at commit 
[`b26488f`](https://github.com/apache/spark/commit/b26488f77acf442db768b41f94bbda9773b523a2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class LocalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode with CodegenSupport `
  * `case class GlobalLimitExec(limit: Int, child: SparkPlan) extends 
UnaryExecNode `





[GitHub] spark issue #16633: [SPARK-19274][SQL] Make GlobalLimit without shuffling da...

2017-01-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16633
  
**[Test build #71605 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71605/testReport)**
 for PR 16633 at commit 
[`b26488f`](https://github.com/apache/spark/commit/b26488f77acf442db768b41f94bbda9773b523a2).

