[GitHub] spark issue #22202: [SPARK-25211][Core] speculation and fetch failed result ...

2018-08-24 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/22202
  
@jinxing64 Do you have any idea?


---




[GitHub] spark pull request #22202: [SPARK-25211][Core] speculation and fetch failed ...

2018-08-24 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22202#discussion_r212601673
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2246,58 +2247,6 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 assertDataStructuresEmpty()
   }
 
-  test("Trigger mapstage's job listener in submitMissingTasks") {
--- End diff --

Because that PR conflicts with this PR.
In that PR, the shuffleMapStage waits for the completion of the parent stages' rerun.
In this PR, the shuffleMapStage completes immediately when all partitions are ready.


---




[GitHub] spark issue #22202: [SPARK-25211][Core] speculation and fetch failed result ...

2018-08-24 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/22202
  
@Ngone51 Because some shuffleMapStages have mapStageJobs (JobWaiters) created via `SparkContext.submitMapStage`.


---




[GitHub] spark pull request #22202: [SPARK-25211][Core] speculation and fetch failed ...

2018-08-23 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/22202

[SPARK-25211][Core] speculation and fetch failed result in hang of job

## What changes were proposed in this pull request?

In the current `DAGScheduler.handleTaskCompletion` code, when a shuffleMapStage that has a job is not in runningStages and its `pendingPartitions` set is empty, the job of this shuffleMapStage will never complete.

*Consider the following scenario:*

1. Stage 0 runs and generates shuffle output data.

2. Stage 1 reads the output from stage 0 and generates more shuffle data. It has two task attempts for the same partition: ShuffleMapTask0 and its speculative copy ShuffleMapTask0.1.

3. ShuffleMapTask0 fails to fetch blocks and sends a FetchFailed to the 
driver. The driver resubmits stage 0 and stage 1. The driver will place stage 0 
in runningStages and place stage 1 in waitingStages.

4. ShuffleMapTask0.1 finishes successfully and sends Success back to the driver. The driver adds its map status to the set of output locations of stage 1. Because stage 1 is not in runningStages, the job will not complete.

5. Stage 0 completes and the driver reruns stage 1. But, because the output set of stage 1 is already complete, the driver will not submit any tasks and marks stage 1 complete right away. Because job completion relies on a `CompletionEvent`, and no `CompletionEvent` will ever arrive, the job will hang (see the sketch below).
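
The following is a minimal, hypothetical Scala sketch of that sequence (names such as `StageState` and `handleSuccess` are invented for illustration; this is not the actual DAGScheduler code): a late speculative Success registers the stage's outputs, but because the stage is no longer in runningStages the map-stage job is never notified, so it waits forever.

```
// Hypothetical illustration only -- not DAGScheduler code.
case class StageState(var running: Boolean, var outputsComplete: Boolean, var jobNotified: Boolean)

def handleSuccess(stage: StageState): Unit = {
  stage.outputsComplete = true   // the map status is registered in any case
  if (stage.running) {           // but completion handling only happens for running stages
    stage.jobNotified = true
  }
}

val stage1 = StageState(running = false, outputsComplete = false, jobNotified = false)
handleSuccess(stage1)            // ShuffleMapTask0.1 succeeds after the FetchFailed resubmission
assert(stage1.outputsComplete && !stage1.jobNotified)  // outputs are complete, yet the job hangs
```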

## How was this patch tested?

UT

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-25211

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22202.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22202


commit 4f51199daafec0466a5ac836c4f6281f5ba45381
Author: liulijia 
Date:   2018-08-23T13:42:13Z

[SPARK-25211][Core] speculation and fetch failed result in hang of job




---




[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongToUnsafeRowMap in ex...

2018-07-24 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21772
  
Jenkins, test this please


---




[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongHashedRelation in ex...

2018-07-23 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21772
  
@viirya This case occurred in our cluster and it took us a lot of time to find this bug.
For some man-made reasons, the small table's max id had become abnormally large. The LongHashedRelation generated from that table was not optimized to `dense` and became abnormally big (approximately 400 MB).


---




[GitHub] spark pull request #21772: [SPARK-24809] [SQL] Serializing LongHashedRelatio...

2018-07-23 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21772#discussion_r204613880
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/HashedRelationSuite.scala
 ---
@@ -278,6 +278,39 @@ class HashedRelationSuite extends SparkFunSuite with 
SharedSQLContext {
 map.free()
   }
 
+  test("SPARK-24809: Serializing LongHashedRelation in executor may result 
in data error") {
--- End diff --

I think this UT can cover the case I encountered.
An end-to-end test is too hard to construct, because this case only occurs when the executor's memory is not enough to hold the block and the broadcast cache is removed by the garbage collector.


---




[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongHashedRelation in ex...

2018-07-22 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21772
  
@viirya Hi, could you find some time to review this PR?


---




[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongHashedRelation in ex...

2018-07-19 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21772
  
Jenkins test this please


---




[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongHashedRelation in ex...

2018-07-19 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21772
  
@viirya Yes, absolutely right. :)


---




[GitHub] spark pull request #21772: [SPARK-24809] [SQL] Serializing LongHashedRelatio...

2018-07-18 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21772#discussion_r203365167
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 ---
@@ -726,8 +726,9 @@ private[execution] final class LongToUnsafeRowMap(val 
mm: TaskMemoryManager, cap
 
 writeLong(array.length)
 writeLongArray(writeBuffer, array, array.length)
-val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
-writeLong(used)
+val cursorFlag = cursor - Platform.LONG_ARRAY_OFFSET
+writeLong(cursorFlag)
+val used = (cursorFlag / 8).toInt
--- End diff --


![image](https://issues.apache.org/jira/secure/attachment/12932027/Spark%20LongHashedRelation%20serialization.svg)



---




[GitHub] spark issue #21772: [SPARK-24809] [SQL] Serializing LongHashedRelation in ex...

2018-07-17 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21772
  
@hvanhovell Thanks for reviewing. Data is lost because the variable **cursor** in the executor is 0 and serialization depends on it. I will add a UT later.


---




[GitHub] spark pull request #21772: [SPARK-24809] [SQL] Serializing LongHashedRelatio...

2018-07-17 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21772#discussion_r203241485
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
 ---
@@ -726,8 +726,9 @@ private[execution] final class LongToUnsafeRowMap(val 
mm: TaskMemoryManager, cap
 
 writeLong(array.length)
 writeLongArray(writeBuffer, array, array.length)
-val used = ((cursor - Platform.LONG_ARRAY_OFFSET) / 8).toInt
-writeLong(used)
+val cursorFlag = cursor - Platform.LONG_ARRAY_OFFSET
+writeLong(cursorFlag)
+val used = (cursorFlag / 8).toInt
--- End diff --

Data is lost when serializing the LongHashedRelation in the executor. Can you see [this picture](http://oi67.tinypic.com/2z5pzs7.jpg)? In the executor, the cursor is 0.


---




[GitHub] spark pull request #21772: [SPARK-24809] [SQL] Serializing LongHashedRelatio...

2018-07-15 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/21772

[SPARK-24809] [SQL] Serializing LongHashedRelation in executor may result 
in data error

When the join key is long or int in a broadcast join, Spark uses LongHashedRelation as the broadcast value (see SPARK-14419 for details). But if the broadcast value is abnormally big, the executor will serialize it to disk, and data is lost during that serialization.
For a flow chart, [see here](http://oi67.tinypic.com/2z5pzs7.jpg).

## What changes were proposed in this pull request?
Write the cursor instead when serializing, and set the cursor value when deserializing (a minimal sketch of the idea follows).
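
A minimal sketch of that idea, using a hypothetical simplified map rather than the real `LongToUnsafeRowMap`: the write cursor is persisted together with the data, so a map that is deserialized on an executor can later be serialized again without losing rows.

```
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical simplified map -- illustration only, not the real LongToUnsafeRowMap.
class TinyMap {
  val page: Array[Long] = new Array[Long](16)
  var cursor: Long = 0L                        // number of longs written so far

  def append(v: Long): Unit = { page(cursor.toInt) = v; cursor += 1 }

  def write(out: DataOutputStream): Unit = {
    out.writeLong(cursor)                      // persist the cursor itself
    (0 until cursor.toInt).foreach(i => out.writeLong(page(i)))
  }

  def read(in: DataInputStream): Unit = {
    cursor = in.readLong()                     // restore the cursor on deserialization
    (0 until cursor.toInt).foreach(i => page(i) = in.readLong())
  }
}

// Round trip: if `cursor` were left at 0 after deserialization, a second serialization on the
// executor (e.g. when the broadcast block is written to disk) would write no data at all.
val m = new TinyMap; m.append(1L); m.append(2L)
val buf = new ByteArrayOutputStream(); m.write(new DataOutputStream(buf))
val m2 = new TinyMap; m2.read(new DataInputStream(new ByteArrayInputStream(buf.toByteArray)))
assert(m2.cursor == 2L && m2.page(1) == 2L)
```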

## How was this patch tested?
manual test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-24809

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21772.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21772


commit a72fe61863e119c0e902cef3054d9140b6d04f77
Author: liulijia 
Date:   2018-07-15T11:24:55Z

[SPARK-24809] [SQL] Serializing LongHashedRelation in executor may result 
in data error




---




[GitHub] spark issue #21164: [SPARK-24098][SQL] ScriptTransformationExec should wait ...

2018-05-07 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21164
  
@gatorsmile Could you please give some comments when you have time? Thanks 
so much.
In addition, I think this is a critical bug!!!


---




[GitHub] spark pull request #21164: [SPARK-24098][SQL] ScriptTransformationExec shoul...

2018-05-02 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21164#discussion_r185693669
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala
 ---
@@ -137,13 +137,12 @@ case class ScriptTransformationExec(
 throw writerThread.exception.get
   }
 
-  if (!proc.isAlive) {
-val exitCode = proc.exitValue()
-if (exitCode != 0) {
-  logError(stderrBuffer.toString) // log the stderr circular 
buffer
-  throw new SparkException(s"Subprocess exited with status 
$exitCode. " +
-s"Error: ${stderrBuffer.toString}", cause)
-}
+  proc.waitFor()
--- End diff --

Although writerThread._exception is a volatile variable, the writerThread.exception function may sometimes be called before writerThread._exception is assigned, because the read thread and the writer thread work in parallel (a small sketch of this race is below).
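
A small, hypothetical sketch of that race (not the real ScriptTransformationExec code): the writer thread records its failure in a volatile field, but the reader may check that field before the assignment happens, so the reader must wait for the child process first.

```
// Illustration only; a trivially failing command stands in for the transform script.
@volatile var writerException: Option[Throwable] = None

val proc = new ProcessBuilder("sh", "-c", "exit 3").start()
val writerThread = new Thread(() => {
  try {
    // ... feed input rows to proc.getOutputStream here ...
    throw new RuntimeException("script rejected the input")
  } catch {
    case t: Throwable => writerException = Some(t)   // may happen after the reader's check
  }
})
writerThread.start()

// Reader side: checking writerException alone could miss the failure, but waiting for the
// process (as this PR does) makes a non-zero exit code visible regardless of the race.
val exitCode = proc.waitFor()
if (writerException.isDefined || exitCode != 0) {
  println(s"Subprocess exited with status $exitCode")
}
```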



---




[GitHub] spark issue #21164: [SPARK-24098][SQL] ScriptTransformationExec should wait ...

2018-05-01 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21164
  
@rxin Hi, do you have time to look at this PR?


---




[GitHub] spark issue #21164: [SPARK-24098][SQL] ScriptTransformationExec should wait ...

2018-04-27 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21164
  
@cloud-fan Hi, do you have time to look at this PR? I think this is a critical bug.


---




[GitHub] spark issue #21164: [SPARK-24098][SQL] ScriptTransformationExec should wait ...

2018-04-26 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/21164
  
To reproduce the bug:
1. Add `Thread.sleep(1000 * 600)` before the assignment of `_exception`.
2. Create a Python script which throws an exception, like the following:
test.py
```
import sys
for line in sys.stdin:
  raise Exception('error')
  print line
```
3. Use the script created in step 2 in a transform:
```
spark-sql> add files /path_to/test.py;
spark-sql> select transform(id) using 'python test.py' as id from city;
```
The result is that Spark will finish successfully (instead of failing).


---




[GitHub] spark pull request #21164: [SPARK-24098][SQL] ScriptTransformationExec shoul...

2018-04-26 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/21164

[SPARK-24098][SQL] ScriptTransformationExec should wait process exiting 
before output iterator finish

When the feed thread doesn't set its _exception variable and the process hasn't exited completely, the output iterator will return false in its hasNext function.

## What changes were proposed in this pull request?
Wait for the script process to exit before the output iterator finishes.

## How was this patch tested?
manual test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-24098

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21164


commit a77ac06180cbecc28383f3b73e05259502896613
Author: liutang123 <liutang123@...>
Date:   2018-04-26T08:54:44Z

[SPARK-24098][SQL] ScriptTransformationExec should wait process exiting 
before output iterator finish




---




[GitHub] spark pull request #21100: [SPARK-24012][SQL] Union of map and other compati...

2018-04-23 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21100#discussion_r183608838
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -896,6 +896,25 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  test("SPARK-24012 Union of map and other compatible columns") {
--- End diff --

@cloud-fan Yes, I am not familiar with TypeCoercionSuite. To save time, in my opinion, this PR can be merged first. Thanks a lot.


---




[GitHub] spark pull request #21100: [SPARK-24012][SQL] Union of map and other compati...

2018-04-22 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21100#discussion_r183269702
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -171,6 +171,15 @@ object TypeCoercion {
   .orElse((t1, t2) match {
 case (ArrayType(et1, containsNull1), ArrayType(et2, 
containsNull2)) =>
   findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || 
containsNull2))
+case (MapType(keyType1, valueType1, n1), MapType(keyType2, 
valueType2, n2))
--- End diff --

Hi, I implemented this logic in `findTightestCommonType`; looking forward to further review.


---




[GitHub] spark pull request #21100: [SPARK-24012][SQL] Union of map and other compati...

2018-04-20 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21100#discussion_r183005177
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -896,6 +896,19 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  test("SPARK-24012 Union of map and other compatible columns") {
+checkAnswer(
+  sql(
+"""
+  |SELECT map(1, 2), 'str'
+  |UNION ALL
+  |SELECT map(1, 2, 3, NULL), 1""".stripMargin),
--- End diff --

Of course we can. There are two solutions:
1. Try to cast two map types to one even when the key types or the value types differ.
Then `select map(1, 2) union all select map(1, 'str')` would work.
2. Cast two map types to one only when the key type and value type are the same. This solution only resolves the problem that map<t1, nullable t2> and map<t1, not nullable t2> can't be unioned (see the sketch after this list).
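
A sketch of solution 2 in Scala, aligned with the diff earlier in this thread but hedged as an illustration rather than the merged implementation: two MapTypes are widened to one only when the key and value types already match, and only valueContainsNull is merged.

```
import org.apache.spark.sql.types._

// Illustration of solution 2: widen only the value nullability.
def widenMapTypes(t1: DataType, t2: DataType): Option[DataType] = (t1, t2) match {
  case (MapType(k1, v1, n1), MapType(k2, v2, n2)) if k1 == k2 && v1 == v2 =>
    Some(MapType(k1, v1, n1 || n2))
  case _ => None
}

// map<int, int not null> and map<int, nullable int> widen to map<int, nullable int>.
assert(widenMapTypes(MapType(IntegerType, IntegerType, false),
                     MapType(IntegerType, IntegerType, true))
  .contains(MapType(IntegerType, IntegerType, true)))
```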


---




[GitHub] spark pull request #21100: [SPARK-24012][SQL] Union of map and other compati...

2018-04-20 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21100#discussion_r182979953
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -896,6 +896,19 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 }
   }
 
+  test("SPARK-24012 Union of map and other compatible columns") {
+checkAnswer(
+  sql(
+"""
+  |SELECT map(1, 2), 'str'
+  |UNION ALL
+  |SELECT map(1, 2, 3, NULL), 1""".stripMargin),
--- End diff --

map<int, nullable int> and map<int, not nullable int> are accepted by Union, but string and int are not.

If the types of one column cannot be accepted by Union, TCWSOT (TypeCoercion.WidenSetOperationTypes) will try to coerce them to a completely identical type. TCWSOT works only when all of the columns can be coerced, and does nothing when any column cannot be coerced.
map<int, nullable int> and map<int, not nullable int> cannot be coerced, so TCWSOT did not apply, and the string and int columns were not coerced either.


---




[GitHub] spark pull request #21100: [SPARK-24012][SQL] Union of map and other compati...

2018-04-18 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/21100

[SPARK-24012][SQL] Union of map and other compatible column

## What changes were proposed in this pull request?
A Union of a map column and another compatible column results in an `unresolved operator 'Union` exception.

Reproduction:
`spark-sql> select map(1,2), 'str' union all select map(1,2,3,null), 1`
Output:
```
Error in query: unresolved operator 'Union;;
'Union
:- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
:  +- OneRowRelation$
+- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 
AS INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
   +- OneRowRelation$
```
So, we should cast some of the columns to compatible types when appropriate.

## How was this patch tested?
unit test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-24012

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21100.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21100


commit a422a7f1c1fb0f055fbb8736a364c5a641afc2a9
Author: liutang123 <liutang123@...>
Date:   2018-04-18T14:29:15Z

[SPARK-24012][SQL] Union of map and other compatible column




---




[GitHub] spark issue #20846: [SPARK-5498][SQL][FOLLOW] add schema to table partition

2018-03-17 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/20846
  
The exception is not thrown by `ALTER TABLE`.
We should prevent users from changing a table's column type. But, for historical data, should we take some compatibility measures?


---




[GitHub] spark pull request #20846: [SPARK-5498][SQL][FOLLOW] add schema to table par...

2018-03-17 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20846#discussion_r175274916
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ---
@@ -99,7 +99,8 @@ case class CatalogTablePartition(
 spec: CatalogTypes.TablePartitionSpec,
 storage: CatalogStorageFormat,
 parameters: Map[String, String] = Map.empty,
-stats: Option[CatalogStatistics] = None) {
+stats: Option[CatalogStatistics] = None,
+schema: Option[StructType] = None) {
--- End diff --

Sometimes, a partition's schema is different from the table's.


---




[GitHub] spark pull request #20846: [SPARK-5498][SQL][FOLLOW] add schema to table par...

2018-03-16 Thread liutang123
GitHub user liutang123 reopened a pull request:

https://github.com/apache/spark/pull/20846

[SPARK-5498][SQL][FOLLOW] add schema to table partition

## What changes were proposed in this pull request?

When querying an ORC table in which some partition schemas are different from the table schema, a ClassCastException will occur.
Reproduction:
```
create table test_par(a string)
PARTITIONED BY (`b` bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
ALTER TABLE test_par CHANGE a a bigint restrict;  -- in hive
select * from test_par;
```

## How was this patch tested?

manual test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-5498

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20846.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20846


commit 2317bfdf18fc1a7b21cd43e0ec12f5e957fb1895
Author: liutang123 <liutang123@...>
Date:   2017-06-21T04:27:42Z

Merge pull request #1 from apache/master

20170521 pull request

commit 821b1f88e15bbe2bf7147f9cfa2e39ce7cb52b12
Author: liutang123 <liutang123@...>
Date:   2017-11-24T08:54:11Z

Merge branch 'master' of https://github.com/liutang123/spark

commit 1841f60861a96fb1508257c84e8703ca1ffb57de
Author: liutang123 <liutang123@...>
Date:   2017-11-24T08:54:59Z

Merge branch 'master' of https://github.com/apache/spark

commit 16f4a52aa556cdc5182570979bad9b4cf6f092d5
Author: liutang123 <liutang123@...>
Date:   2018-03-16T10:10:57Z

Merge branch 'master' of https://github.com/apache/spark

commit cdd5987178280f0424e9a828dd348df11e62758a
Author: liutang123 <liutang123@...>
Date:   2018-03-16T10:29:23Z

[SPARK-5498][SQL][FOLLOW] add schema to table partition.




---




[GitHub] spark pull request #20846: [SPARK-5498][SQL][FOLLOW] add schema to table par...

2018-03-16 Thread liutang123
Github user liutang123 closed the pull request at:

https://github.com/apache/spark/pull/20846


---




[GitHub] spark pull request #20846: [SPARK-5498][SQL][FOLLOW] add schema to table par...

2018-03-16 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/20846

[SPARK-5498][SQL][FOLLOW] add schema to table partition

## What changes were proposed in this pull request?

When querying an ORC table in which some partition schemas are different from the table schema, a ClassCastException will occur.
Reproduction:
```
create table test_par(a string)
PARTITIONED BY (`b` bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
ALTER TABLE test_par CHANGE a a bigint restrict;  -- in hive
select * from test_par;
```

## How was this patch tested?

manual test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-5498

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20846.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20846


commit 2317bfdf18fc1a7b21cd43e0ec12f5e957fb1895
Author: liutang123 <liutang123@...>
Date:   2017-06-21T04:27:42Z

Merge pull request #1 from apache/master

20170521 pull request

commit 821b1f88e15bbe2bf7147f9cfa2e39ce7cb52b12
Author: liutang123 <liutang123@...>
Date:   2017-11-24T08:54:11Z

Merge branch 'master' of https://github.com/liutang123/spark

commit 1841f60861a96fb1508257c84e8703ca1ffb57de
Author: liutang123 <liutang123@...>
Date:   2017-11-24T08:54:59Z

Merge branch 'master' of https://github.com/apache/spark

commit 16f4a52aa556cdc5182570979bad9b4cf6f092d5
Author: liutang123 <liutang123@...>
Date:   2018-03-16T10:10:57Z

Merge branch 'master' of https://github.com/apache/spark




---




[GitHub] spark issue #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OOM when ...

2018-01-17 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/20184
  
Hi @jerryshao, I tried to lazily allocate the InputStream and the byte array in UnsafeSorterSpillReader (a minimal sketch of the idea is below).
Would you please look at this when you have time?
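
A hypothetical sketch of that lazy-allocation idea (not the actual UnsafeSorterSpillReader code): the stream and its large buffer are created only when the reader is first consumed, so readers that are queued but not yet read hold no buffer memory.

```
import java.io.{BufferedInputStream, File, FileInputStream, InputStream}

// Illustration only: the buffer is allocated on first read, not at construction time.
class LazySpillReader(file: File, bufferSize: Int = 1 << 20) {
  private lazy val in: InputStream =
    new BufferedInputStream(new FileInputStream(file), bufferSize)
  def readByte(): Int = in.read()
}
```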


---




[GitHub] spark issue #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OOM when ...

2018-01-15 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/20184
  
Jenkins, retest this please


---




[GitHub] spark issue #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OOM when ...

2018-01-15 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/20184
  
I think that a lazy buffer allocation cannot thoroughly solve this problem, because UnsafeSorterSpillReader has a BufferedFileInputStream which allocates off-heap memory.


---




[GitHub] spark issue #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OOM when ...

2018-01-12 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/20184
  
Hi @jerryshao, we can reproduce this issue as follows:
```
$ bin/spark-shell --master local --conf spark.sql.windowExec.buffer.spill.threshold=1 --driver-memory 1G
scala> sc.range(1, 2000).toDF.registerTempTable("test_table")
scala> spark.sql("select row_number() over (partition by 1) from test_table").collect
```
This will cause an OOM.
The above is an extreme case.
Normally, spark.sql.windowExec.buffer.spill.threshold is 4096 by default and the buffer used in each UnsafeSorterSpillReader is more than 1 MB. When a window contains more than 4,096,000 rows, UnsafeExternalSorter.ChainedIterator will have a queue which contains 1000 UnsafeSorterSpillReader(s). So the queue costs a lot of memory and is liable to cause an OOM (see the rough estimate sketched below).
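
A back-of-the-envelope sketch of that memory math (rough numbers, not measurements):

```
val spillThreshold  = 4096                              // default spark.sql.windowExec.buffer.spill.threshold
val rowsInWindow    = 4096000L
val spillReaders    = rowsInWindow / spillThreshold     // 1000 UnsafeSorterSpillReader(s) in the queue
val bufferPerReader = 1L << 20                          // > 1 MB read buffer per reader
val totalBuffered   = spillReaders * bufferPerReader
println(s"$spillReaders readers x 1 MB ~ ${totalBuffered >> 20} MB held by the queue at once")
```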


---




[GitHub] spark pull request #20184: [SPARK-22987][Core] UnsafeExternalSorter cases OO...

2018-01-07 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/20184

[SPARK-22987][Core] UnsafeExternalSorter cases OOM when invoking 
`getIterator` function.

## What changes were proposed in this pull request?

UnsafeExternalSorter.ChainedIterator maintains a Queue of UnsafeSorterIterator. When the `getIterator` function of UnsafeExternalSorter is called, UnsafeExternalSorter passes an ArrayList of UnsafeSorterSpillReader to the constructor of ChainedIterator. But UnsafeSorterSpillReader maintains a byte array as a buffer, whose capacity is more than 1 MB. When spilling happens frequently, this can cause an OOM.

I tried to change the Queue in ChainedIterator to an Iterator.

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-22987

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20184


commit d57ce865729ce4d0d84a0fee0edf4dd6febe54bc
Author: liutang123 <liutang123@...>
Date:   2018-01-08T04:33:51Z

[SPARK-22987][Core] UnsafeExternalSorter cases OOM when invoking 
`getIterator` function.




---




[GitHub] spark issue #19364: [SPARK-22144][SQL] ExchangeCoordinator combine the parti...

2017-12-06 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19364
  
@cloud-fan Would you please look at this when you have time?


---




[GitHub] spark pull request #19812: [SPARK-22598][CORE] ExecutorAllocationManager doe...

2017-11-29 Thread liutang123
Github user liutang123 closed the pull request at:

https://github.com/apache/spark/pull/19812


---




[GitHub] spark issue #19812: [SPARK-22598][CORE] ExecutorAllocationManager does not r...

2017-11-29 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19812
  
Sorry, I cannot reproduce it now. But sometimes `ExecutorAllocationManager` did not request new executors and `YarnSchedulerBackend.requestedTotalExecutors` was 0. I will close this PR now and reopen it the next time I encounter this situation.


---




[GitHub] spark issue #19812: [SPARK-22598][CORE] ExecutorAllocationManager does not r...

2017-11-28 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19812
  
Hi @jerryshao, I updated the description of this PR.


---




[GitHub] spark issue #19812: [SPARK-22598][CORE] ExecutorAllocationManager does not r...

2017-11-26 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19812
  
@srowen Would you please look at this when you have time?


---




[GitHub] spark issue #19812: [SPARK-22598][CORE] ExecutorAllocationManager does not r...

2017-11-24 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19812
  
Jenkins, retest this please


---




[GitHub] spark pull request #19812: [SPARK-22598][CORE] ExecutorAllocationManager doe...

2017-11-24 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/19812

[SPARK-22598][CORE] ExecutorAllocationManager does not requests new 
executors when executor has failed and target has not changed

## What changes were proposed in this pull request?

Check the number of current executors when the target has not changed.

## How was this patch tested?

manual test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-22598

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19812.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19812


commit 6db1b8687060de9a67c0225d42e32afcfb60faf8
Author: liutang123 <liutang...@yeah.net>
Date:   2017-11-24T08:40:40Z

[SPARK-22598][CORE]ExecutorAllocationManager does not requests new 
executors when executor has failed and target has not changed.




---




[GitHub] spark issue #19692: [SPARK-22469][SQL] Accuracy problem in comparison with s...

2017-11-16 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19692
  
Sorry, I just saw it.
Thanks, fan, for doing this.



---




[GitHub] spark issue #19692: [SPARK-22469][SQL] Accuracy problem in comparison with s...

2017-11-14 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19692
  
Jenkins, retest this please


---




[GitHub] spark pull request #19692: [SPARK-22469][SQL] Accuracy problem in comparison...

2017-11-13 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19692#discussion_r150726583
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -137,6 +137,8 @@ object TypeCoercion {
 case (DateType, TimestampType) => Some(StringType)
 case (StringType, NullType) => Some(StringType)
 case (NullType, StringType) => Some(StringType)
+case (n: NumericType, s: StringType) => Some(DoubleType)
+case (s: StringType, n: NumericType) => Some(DoubleType)
--- End diff --

Because `select '1.1' > 1` returns false, I prefer casting all NumericTypes to double like Hive does.
Therefore, casting decimal to double looks better to me.
But in our cluster many users write SQL like `select '1.1' > 1`, and this compatibility issue brings great difficulty when migrating Hive tasks to Spark. So, don't we really need to think about casting all NumericTypes to double?


---




[GitHub] spark pull request #19692: [SPARK-22469][SQL] Accuracy problem in comparison...

2017-11-08 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19692#discussion_r149659840
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -137,6 +137,8 @@ object TypeCoercion {
 case (DateType, TimestampType) => Some(StringType)
 case (StringType, NullType) => Some(StringType)
 case (NullType, StringType) => Some(StringType)
+case (n: NumericType, s: StringType) => Some(DoubleType)
+case (s: StringType, n: NumericType) => Some(DoubleType)
--- End diff --

This solves DecimalType, but it may still have a known error.
For example:
`select '1.01' > 1.0;` // false 

Is it possible to use DecimalType.SYSTEM_DEFAULT?


---




[GitHub] spark issue #19692: [SPARK-22469][SQL] Accuracy problem in comparison with s...

2017-11-08 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19692
  
@cloud-fan Would you please look at this when you have time?


---




[GitHub] spark pull request #19692: [SPARK-22469][SQL] Accuracy problem in comparison...

2017-11-08 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/19692

[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric

## What changes were proposed in this pull request?

When comparing string and numeric values, cast both of them to double, like Hive does.

## How was this patch tested?



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-22469

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19692.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19692


commit aada8eec812348062d7ef6413342f7676e2b5fab
Author: liutang123 <liutang...@yeah.net>
Date:   2017-11-08T09:30:14Z

[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric




---




[GitHub] spark pull request #19504: [SPARK-22233] [CORE] [FOLLOW-UP] Allow user to fi...

2017-10-16 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19504#discussion_r144823226
  
--- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
@@ -549,9 +551,11 @@ class FileSuite extends SparkFunSuite with 
LocalSparkContext {
   expectedPartitionNum = 2)
   }
 
-  test("spark.files.ignoreEmptySplits work correctly (new Hadoop API)") {
+  test("spark.hadoopRDD.ignoreEmptySplits work correctly (new Hadoop 
API)") {
 val conf = new SparkConf()
-conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, 
true)
+  .setAppName("test")
+  .setMaster("local")
+  .set(HADOOP_RDD_IGNORE_EMPTY_SPLITS, true)
 sc = new SparkContext(conf)
 
 def testIgnoreEmptySplits(
--- End diff --

```
testIgnoreEmptySplits(
   data = Array.empty[Tuple2[String, String]],
   actualPartitionNum = 1,
   expectedPartitionNum = 0)
```
=>
```
testIgnoreEmptySplits(
   data = Array.empty[(String, String)],
   actualPartitionNum = 1,
   expectedPartitionNum = 0)
```


---




[GitHub] spark issue #19504: [SPARK-22233] [CORE] [FOLLOW-UP] Allow user to filter ou...

2017-10-16 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19504
  
It looks better.


---




[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-13 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144683771
  
--- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
@@ -510,4 +510,87 @@ class FileSuite extends SparkFunSuite with 
LocalSparkContext {
 }
   }
 
+  test("spark.files.ignoreEmptySplits work correctly (old Hadoop API)") {
+val conf = new SparkConf()
+conf.setAppName("test").setMaster("local").set(IGNORE_EMPTY_SPLITS, 
true)
+sc = new SparkContext(conf)
+
+def testIgnoreEmptySplits(
+  data: Array[Tuple2[String, String]],
+  actualPartitionNum: Int,
+  expectedPart: String,
+  expectedPartitionNum: Int): Unit = {
+  val output = new File(tempDir, "output")
+  sc.parallelize(data, actualPartitionNum)
+.saveAsHadoopFile[TextOutputFormat[String, String]](output.getPath)
+  assert(new File(output, expectedPart).exists() === true)
+  val hadoopRDD = sc.textFile(new File(output, "part-*").getPath)
+  assert(hadoopRDD.partitions.length === expectedPartitionNum)
+  Utils.deleteRecursively(output)
--- End diff --

I think we don't need `try... finally` here, because `Utils.deleteRecursively(output)` is only there to ensure the success of the next invocation of `testIgnoreEmptySplits`. When the test finishes, whether it passed or not, the `tempDir` will be deleted in `FileSuite.afterEach()`.


---




[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-12 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144235787
  
--- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
@@ -510,4 +510,16 @@ class FileSuite extends SparkFunSuite with 
LocalSparkContext {
 }
   }
 
+  test("spark.hadoop.filterOutEmptySplit") {
+val sf = new SparkConf()
+
sf.setAppName("test").setMaster("local").set("spark.hadoop.filterOutEmptySplit",
 "true")
+sc = new SparkContext(sf)
+val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
+emptyRDD.saveAsHadoopFile[TextOutputFormat[String, 
String]](tempDir.getPath + "/output")
+assert(new File(tempDir.getPath + "/output/part-0").exists() === 
true)
+
+val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-0")
+assert(hadoopRDD.partitions.length === 0)
--- End diff --

The resources will be recycled by default in the afterEach function.


---




[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-12 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144235417
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -122,7 +122,10 @@ class NewHadoopRDD[K, V](
   case _ =>
 }
 val jobContext = new JobContextImpl(_conf, jobId)
-val rawSplits = inputFormat.getSplits(jobContext).toArray
+var rawSplits = 
inputFormat.getSplits(jobContext).toArray(Array.empty[InputSplit])
+if 
(sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
+  rawSplits = rawSplits.filter(_.getLength>0)
--- End diff --

Does anyone use empty files to do something?
For example:
sc.textFile("/somepath/*").mapPartitions()
Setting this flag to true by default may change the behavior of users' applications (see the sketch below).
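
A hedged illustration of that concern, using the spark.hadoopRDD.ignoreEmptySplits name from the follow-up PR in this thread (sketch only, not a test from this patch): when every matched input file is empty, filtering yields 0 partitions, so a mapPartitions-based job runs no tasks at all.

```
import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

// One empty input file in a temporary directory.
val dir = java.nio.file.Files.createTempDirectory("empty-split-demo").toFile
new File(dir, "part-00000").createNewFile()

val conf = new SparkConf().setAppName("demo").setMaster("local")
  .set("spark.hadoopRDD.ignoreEmptySplits", "true")
val sc = new SparkContext(conf)
println(sc.textFile(dir.getPath + "/*").partitions.length)  // 0 when filtering, 1 otherwise
sc.stop()
```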


---




[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

2017-10-11 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19464
  
@kiszk Any other suggestions, and can this PR be merged?


---




[GitHub] spark pull request #19464: Spark 22233

2017-10-10 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/19464

Spark 22233

## What changes were proposed in this pull request?
Add a spark.hadoop.filterOutEmptySplit configuration to allow users to filter out empty splits in HadoopRDD.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-22233

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19464.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19464


commit 2317bfdf18fc1a7b21cd43e0ec12f5e957fb1895
Author: liutang123 <liutang...@yeah.net>
Date:   2017-06-21T04:27:42Z

Merge pull request #1 from apache/master

20170521 pull request

commit e3f993959fabdb80b966a42bf40d1cb5c6b44d95
Author: liulijia <liuli...@meituan.com>
Date:   2017-09-28T06:12:04Z

Merge branch 'master' of https://github.com/apache/spark

commit 8f57d43b6bf127fc67e3e391d851efae3a859206
Author: liulijia <liuli...@meituan.com>
Date:   2017-10-10T02:16:18Z

Merge branch 'master' of https://github.com/apache/spark

commit 3610f78837f4a5623f6d47b9feab1e565ed6
Author: liulijia <liuli...@meituan.com>
Date:   2017-10-10T10:19:29Z

allow user to filter empty split in HadoopRDD




---




[GitHub] spark issue #19364: [SPARK-22144][SQL] ExchangeCoordinator combine the parti...

2017-10-08 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19364
  
@maropu Any other suggestions and can this PR be merged?


---




[GitHub] spark pull request #19364: [SPARK-22144][SQL] ExchangeCoordinator combine th...

2017-09-28 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19364#discussion_r141785140
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ExchangeCoordinator.scala
 ---
@@ -232,7 +232,7 @@ class ExchangeCoordinator(
   // number of post-shuffle partitions.
   val partitionStartIndices =
 if (mapOutputStatistics.length == 0) {
-  None
+  Some(Array[Int]())
--- End diff --

partitionStartIndices should be an Option[Array[Int]], since it is the second parameter of ShuffleExchange.preparePostShuffleRDD. If Array.empty[Int] is used directly, a type error appears.


---




[GitHub] spark pull request #19364: [SPARK-22144][SQL] ExchangeCoordinator combine th...

2017-09-28 Thread liutang123
Github user liutang123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19364#discussion_r141780697
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ExchangeCoordinator.scala
 ---
@@ -232,7 +232,7 @@ class ExchangeCoordinator(
   // number of post-shuffle partitions.
   val partitionStartIndices =
 if (mapOutputStatistics.length == 0) {
-  None
+  Some(Array[Int]())
--- End diff --

1. If we don't use Some(Array[Int]()) here, the number of post-shuffle partitions will be spark.sql.shuffle.partitions.
2. mapOutputStatistics.length == 0 indicates that estimatePartitionStartIndices will not be invoked, so minNumPostShufflePartitions will be ignored anyway.
In my opinion, when the number of pre-shuffle partitions is 0, the number of post-shuffle partitions should be 0 rather than spark.sql.adaptive.minNumPostShufflePartitions or spark.sql.shuffle.partitions (see the sketch below).
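
A sketch of the effect being discussed, with a hypothetical helper rather than the real ExchangeCoordinator: an empty partitionStartIndices array yields 0 post-shuffle partitions, while None falls back to the default number of shuffle partitions.

```
// Hypothetical helper -- illustration only.
def numPostShufflePartitions(startIndices: Option[Array[Int]], defaultNum: Int): Int =
  startIndices.map(_.length).getOrElse(defaultNum)

assert(numPostShufflePartitions(Some(Array.empty[Int]), defaultNum = 200) == 0)
assert(numPostShufflePartitions(None, defaultNum = 200) == 200)
```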


---




[GitHub] spark issue #19364: [SPARK-22144] ExchangeCoordinator combine the partitions...

2017-09-28 Thread liutang123
Github user liutang123 commented on the issue:

https://github.com/apache/spark/pull/19364
  
@yhuai @hvanhovell  Would you please look at this when you have time?


---




[GitHub] spark pull request #19364: [SPARK-22144] ExchangeCoordinator combine the par...

2017-09-27 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/19364

[SPARK-22144] ExchangeCoordinator combine the partitions of an 0 sized 
pre-shuffle to 0

## What changes were proposed in this pull request?
When the number of pre-shuffle partitions is 0, the number of post-shuffle partitions should be 0 instead of spark.sql.shuffle.partitions.

## How was this patch tested?
Verified that ExchangeCoordinator converts a pre-shuffle with 0 partitions into a post-shuffle with 0 partitions, instead of one with spark.sql.shuffle.partitions partitions.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark SPARK-22144

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19364.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19364


commit 589d6f01d95ce8d37083fb48294fce31128ac9f2
Author: liulijia <liuli...@meituan.com>
Date:   2017-09-27T09:36:28Z

[SPARK-22144] ExchangeCoordinator combine the partitions of an 0 sized 
pre-shuffle to 0

commit 843df8603db242bc56dc4acfe3c7f64c0f5722be
Author: liulijia <liuli...@meituan.com>
Date:   2017-09-27T09:44:16Z

Merge remote-tracking branch 'upstream/master' into SPARK-22144




---




[GitHub] spark pull request #18871: Merge pull request #1 from apache/master

2017-08-07 Thread liutang123
Github user liutang123 closed the pull request at:

https://github.com/apache/spark/pull/18871


---



[GitHub] spark pull request #18871: Merge pull request #1 from apache/master

2017-08-07 Thread liutang123
GitHub user liutang123 opened a pull request:

https://github.com/apache/spark/pull/18871

Merge pull request #1 from apache/master

20170521 pull request

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liutang123/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18871.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18871


commit 2317bfdf18fc1a7b21cd43e0ec12f5e957fb1895
Author: liutang123 <liutang...@yeah.net>
Date:   2017-06-21T04:27:42Z

Merge pull request #1 from apache/master

20170521 pull request




---