[GitHub] spark pull request #19067: [SPARK-21849][Core]Make the serializer function m...

2017-08-30 Thread djvulee
Github user djvulee closed the pull request at:

https://github.com/apache/spark/pull/19067





[GitHub] spark issue #19067: [SPARK-21849][Core]Make the serializer function more rob...

2017-08-28 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/19067
  
Yes, I agree. It is better to include this in a normal pull request.





[GitHub] spark pull request #19067: [SPARK-21849][Core]Make the serializer function m...

2017-08-28 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/19067

[SPARK-21849][Core]Make the serializer function more robust

## What changes were proposed in this pull request?

Make sure the `close` function is called in the `finally` block.
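
A minimal sketch of the intended pattern (plain `java.io` streams, not the actual Spark serializer code; names are illustrative):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative only: close() lives in a finally block, so the stream is
// released even when writeObject throws.
def serialize(value: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bos)
  try {
    out.writeObject(value) // may throw
  } finally {
    out.close()            // always runs
  }
  bos.toByteArray
}
```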

## How was this patch tested?

No test; compile only.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark serializer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19067.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19067


commit b523ecbef727df73b3b018eb851fb66981e98770
Author: DjvuLee <l...@bytedance.com>
Date:   2017-08-28T07:00:56Z

[SPARK-21849][Core]Make the serializer function more robust







[GitHub] spark issue #18651: [SPARK-21383][Core] Fix the YarnAllocator allocates more...

2017-07-25 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/18651
  
I have updated the code; please take a look, @vanzin @tgravescs.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-23 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128943316
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
   } catch {
 case NonFatal(e) =>
   logError(s"Failed to launch executor $executorId on 
container $containerId", e)
-  // Assigned container should be released immediately to 
avoid unnecessary resource
-  // occupation.
+  // Assigned container should be released immediately
+  // to avoid unnecessary resource occupation.
   amClient.releaseAssignedContainer(containerId)
+  } finally {
+numExecutorsStarting.decrementAndGet()
--- End diff --

I agree that putting `numExecutorsStarting.decrementAndGet()` together with
`numExecutorsRunning.incrementAndGet()` in `updateInternalState` is better if
we can.

The reason I put `numExecutorsStarting.decrementAndGet()` in the `finally`
block is that if an exception that is not `NonFatal` is thrown and escapes the
catch block, we may end up unable to allocate the resources we asked for; this
is the same concern @vanzin raised.

With the current code we may count an executor twice for a moment, but that
only slows down the allocation rate for a short time.
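
A simplified sketch of that concern (the counter name mirrors the PR, but the surrounding code is invented for illustration):

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.util.control.NonFatal

val numExecutorsStarting = new AtomicInteger(0)

// Illustrative only: if an exception that is not NonFatal escapes the catch,
// only a finally block still decrements the "starting" counter.
def launch(run: () => Unit): Unit = {
  numExecutorsStarting.incrementAndGet()
  try {
    run()
  } catch {
    case NonFatal(e) => () // release the container, log the error, ...
  } finally {
    numExecutorsStarting.decrementAndGet() // runs even for fatal errors
  }
}
```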






[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-20 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128435807
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ---
@@ -525,8 +535,9 @@ private[yarn] class YarnAllocator(
   } catch {
 case NonFatal(e) =>
   logError(s"Failed to launch executor $executorId on 
container $containerId", e)
-  // Assigned container should be released immediately to 
avoid unnecessary resource
-  // occupation.
+  // Assigned container should be released immediately
+  // to avoid unnecessary resource occupation.
+  numExecutorsStarting.decrementAndGet()
--- End diff --

Yes, it is more robust. I have updated the code.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-20 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128432387
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ---
@@ -242,7 +244,7 @@ private[yarn] class YarnAllocator(
 if (executorIdToContainer.contains(executorId)) {
   val container = executorIdToContainer.get(executorId).get
   internalReleaseContainer(container)
-  numExecutorsRunning -= 1
--- End diff --

Yes, I just tried to keep it consistent with `numExecutorsStarting`.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-19 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128290766
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ---
@@ -294,7 +296,8 @@ private[yarn] class YarnAllocator(
   def updateResourceRequests(): Unit = {
 val pendingAllocate = getPendingAllocate
 val numPendingAllocate = pendingAllocate.size
-val missing = targetNumExecutors - numPendingAllocate - 
numExecutorsRunning
+val missing = targetNumExecutors - numPendingAllocate -
+  numExecutorsStarting.get - numExecutorsRunning.get
 
--- End diff --

Thanks for your advice! I have added the debug info.





[GitHub] spark issue #18651: [SPARK-21383][Core] Fix the YarnAllocator allocates more...

2017-07-19 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/18651
  
I have updated the code and tested it by experiment; can you take a look,
@vanzin?





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-18 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128144194
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ---
@@ -505,32 +508,37 @@ private[yarn] class YarnAllocator(
 
   if (numExecutorsRunning < targetNumExecutors) {
 if (launchContainers) {
-  launcherPool.execute(new Runnable {
-override def run(): Unit = {
-  try {
-new ExecutorRunnable(
-  Some(container),
-  conf,
-  sparkConf,
-  driverUrl,
-  executorId,
-  executorHostname,
-  executorMemory,
-  executorCores,
-  appAttemptId.getApplicationId.toString,
-  securityMgr,
-  localResources
-).run()
-updateInternalState()
-  } catch {
-case NonFatal(e) =>
-  logError(s"Failed to launch executor $executorId on 
container $containerId", e)
-  // Assigned container should be released immediately to 
avoid unnecessary resource
-  // occupation.
-  amClient.releaseAssignedContainer(containerId)
+  try {
+numExecutorToBeLaunched += 1
+launcherPool.execute(new Runnable {
+  override def run(): Unit = {
+try {
+  new ExecutorRunnable(
+Some(container),
+conf,
+sparkConf,
+driverUrl,
+executorId,
+executorHostname,
+executorMemory,
+executorCores,
+appAttemptId.getApplicationId.toString,
+securityMgr,
+localResources
+  ).run()
+  updateInternalState()
+} catch {
+  case NonFatal(e) =>
+logError(s"Failed to launch executor $executorId on 
container $containerId", e)
+// Assigned container should be released immediately
+// to avoid unnecessary resource occupation.
+amClient.releaseAssignedContainer(containerId)
+}
   }
-}
-  })
+})
+  } finally {
+numExecutorToBeLaunched -= 1
--- End diff --

Yes, you're right. When I tested the code by experiment, I decremented
`numExecutorToBeLaunched` in the `updateInternalState` function, but I later
found this may impact the test.

I will fix this soon.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-18 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/18651#discussion_r128143898
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ---
@@ -82,6 +82,8 @@ private[yarn] class YarnAllocator(
 
   @volatile private var numExecutorsRunning = 0
 
+  @volatile private var numExecutorToBeLaunched = 0
--- End diff --

OK, I will change the name.





[GitHub] spark pull request #18651: [SPARK-21383][Core] Fix the YarnAllocator allocat...

2017-07-17 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/18651

[SPARK-21383][Core] Fix the YarnAllocator allocates more Resource

## What changes were proposed in this pull request?
When NodeManagers are launching executors, the `missing` value can exceed the
real value if the launch is slow, which can lead YARN to allocate more
resources than needed.

We add `numExecutorToBeLaunched` to the calculation of `missing` to avoid this.
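
A rough sketch of the adjusted accounting (the numbers are invented; in `YarnAllocator` they come from `targetNumExecutors`, the pending allocations, and the two counters):

```scala
// Illustrative only: without the "to be launched" count, slow launches are
// invisible and `missing` over-requests containers from YARN.
val targetNumExecutors      = 10
val numPendingAllocate      = 2
val numExecutorToBeLaunched = 3 // containers received, executors still launching
val numExecutorsRunning     = 4

val missingOld = targetNumExecutors - numPendingAllocate - numExecutorsRunning
val missingNew = targetNumExecutors - numPendingAllocate -
  numExecutorToBeLaunched - numExecutorsRunning
println(s"old missing = $missingOld, new missing = $missingNew") // 4 vs 1
```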


## How was this patch tested?
Tested by experiment.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark YarnAllocate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18651.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18651


commit 818c9126959e8576861478e18389e6ed8fdbeac4
Author: DjvuLee <l...@bytedance.com>
Date:   2017-07-17T07:54:09Z

[Core] Fix the YarnAllocator allocate more Resource

When NodeManagers launch the executors, the missing value will exceed the
real value, which can lead YARN to allocate more resources.







[GitHub] spark pull request #18280: [SPARK-21064][Core][Test] Fix the default value b...

2017-06-12 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/18280

[SPARK-21064][Core][Test] Fix the default value bug in 
NettyBlockTransferServiceSuite

## What changes were proposed in this pull request?

The default value for `spark.port.maxRetries` is 100,
but the suite uses 10, so we change it to 100 to avoid test failures.
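
A minimal sketch of the kind of change (assuming the suite builds its own `SparkConf`; the exact fix in the PR may differ):

```scala
import org.apache.spark.SparkConf

// Illustrative only: keep the retry count the test relies on explicit,
// instead of hard-coding a value that disagrees with the effective default.
val conf = new SparkConf(loadDefaults = false)
  .set("spark.app.name", "NettyBlockTransferServiceSuite")
  .set("spark.port.maxRetries", "100")
```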

## How was this patch tested?
No test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark NettyTestBug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18280


commit 273f76d183eeda9aef7c9c10dbcd9307773c3eec
Author: DjvuLee <l...@bytedance.com>
Date:   2017-06-12T12:13:03Z

[SPARK-21064][Core][Test] Fix the default value bug in 
NettyBlockTransferServiceSuite

The default value for `spark.port.maxRetries` is 100,
but we use 10 in the suite file.







[GitHub] spark pull request #18279: [SPARK-21064][Core][Test] Fix the default value b...

2017-06-12 Thread djvulee
Github user djvulee closed the pull request at:

https://github.com/apache/spark/pull/18279





[GitHub] spark issue #18279: [SPARK-21064][Core][Test] Fix the default value bug in N...

2017-06-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/18279
  
Ok, thanks! 





[GitHub] spark issue #18279: [SPARK-21064][Core][Test] Fix the default value bug in N...

2017-06-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/18279
  
We should port this to master too.





[GitHub] spark pull request #18279: [SPARK-21064][Core][Test] Fix the default value b...

2017-06-12 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/18279

[SPARK-21064][Core][Test] Fix the default value bug in NettyBlockTran…

## What changes were proposed in this pull request?

Fix the default value bug in NettyBlockTransferServiceSuite.
The default value for `spark.port.maxRetries` is 100, but the suite uses 10,
so we change 10 to 100.

## How was this patch tested?
No Test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18279.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18279


commit 5de1790783f07737432f75ef7ed7ea8804fc6b20
Author: DjvuLee <l...@bytedance.com>
Date:   2017-06-12T11:50:02Z

[SPARK-21064][Core][Test] Fix the default value bug in 
NettyBlockTransferServiceSuite

The default value for `spark.port.maxRetries` is 100, but we use 10
in the suite file.







[GitHub] spark issue #15505: [SPARK-18890][CORE] Move task serialization from the Tas...

2017-05-16 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15505
  
>I agree with Kay that putting in a smaller change first is better, 
assuming it still has the performance gains. That doesn't preclude any further 
optimizations that are bigger changes.

>I'm a little surprised that the serializing tasks has much of an impact, 
given how little data is getting serialized. But if it really is, I feel like 
there is a much bigger optimization we're completely missing. Why are we 
repeating the work of serialization for each task in a taskset? The serialized 
data is almost exactly the same for every task. they only differ in the 
partition id (an int) and the preferred locations (which aren't even used by 
the executor at all).

>Task serialization already leverages the idea of having info across all 
the tasks in the Broadcast for the task binary. We just need to use that same 
idea for all the rest of the task data that is sent to the executor. Then the 
only difference between the serialized task data sent to executors is the int 
for the partitionId. You'd serialize into a bytebuffer once, and then your 
per-task "serialization" becomes copying the buffer and modifying that int 
directly.




@squito I like this idea very much. I just encountered deserialization times
that are too long (more than 10s for some tasks). Is there any PR that tries
to solve this?
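
A toy sketch of the idea quoted above (serialize the shared part once, then patch only the per-task int; the layout and offsets here are invented, not Spark's actual task format):

```scala
import java.nio.ByteBuffer

// Illustrative only: a fixed layout where the partition id occupies the first
// 4 bytes, so per-task "serialization" is a buffer copy plus one putInt.
val payload  = "task data shared by the whole task set".getBytes("UTF-8")
val template = ByteBuffer.allocate(4 + payload.length)
template.putInt(0) // placeholder for the partition id
template.put(payload)

def serializeTask(partitionId: Int): ByteBuffer = {
  val buf = ByteBuffer.allocate(template.capacity())
  template.rewind()
  buf.put(template)          // copy the shared bytes
  buf.putInt(0, partitionId) // patch only the int that differs per task
  buf.rewind()
  buf
}

println(serializeTask(7).getInt(0)) // 7
```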





[GitHub] spark pull request #16671: [SPARK-19327][SparkSQL] a better balance partitio...

2017-02-01 Thread djvulee
Github user djvulee closed the pull request at:

https://github.com/apache/spark/pull/16671





[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-22 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16671
  
@HyukjinKwon One assumption behind this design is that the specified column
has an index in most real scenarios, so the table scan cost is not very high.

What I observed is that most large tables use sharding, so the count cost is
acceptable; this is the reason why we spend less time on a 5M-row table than
on a 1M-row table. If we use `repartition`, there is a bottleneck when loading
data from the DB and a high cost for the `repartition` itself.

Anyway, this solution is indeed expensive and not a good one; maybe the best
way is to use the Spark connectors provided by the DBMS vendors, as
@gatorsmile suggested.





[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-22 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16671
  
Yes. I will leave this PR open for a few days to see if others are interested
in it, and then close it.





[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-22 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16671
  
Yes, I agree with you: a sampling-based approach is the right choice, but it
is not possible to achieve this through the `jdbc` API.





[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-21 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16671
  
Using the *predicates* parameter to split the table seems reasonable, but in
my personal opinion it just pushes work that should be done by Spark onto
users. Users first need to know how to split the table uniformly, so they may
need extra `count(*)` queries to explore the distribution of the table.
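
For reference, a sketch of how the existing *predicates* overload is used (assuming a `spark` SparkSession in scope; the URL, table, and ranges are made up). Each predicate becomes one partition's WHERE clause, so the caller has to know the data distribution up front:

```scala
import java.util.Properties

// Illustrative only: the caller hand-picks the ranges, which is exactly the
// work this PR tried to move into Spark.
val predicates = Array(
  "id >= 0 AND id < 100000",
  "id >= 100000 AND id < 5000000", // wide range because ids are sparse here
  "id >= 5000000"
)
val df = spark.read.jdbc(
  "jdbc:mysql://db-host:3306/mydb",
  "orders",
  predicates,
  new Properties()
)
```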





[GitHub] spark issue #16671: [SPARK-19327][SparkSQL] a better balance partition metho...

2017-01-21 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16671
  
Yes, this solution is not suitable for very large tables, but I cannot find a
better one; this is the best optimisation I can find. So we just add it as a
choice, make sure users know what they are doing, and require an explicit
enable.

From my experience, the original equal-step method can cause problems with
real data. This conclusion can be drawn from the spark-user mailing list and
from our real scenarios. For example, users will partition the table by `id`
because `id` is unique and indexed, but after many inserts and deletes the
`id` range becomes very large and the data ends up skewed by `id`.

Very large tables are not that common, and if a large table uses sharding,
this method may be acceptable.

My personal opinion is:
>Giving users another choice may be valuable, as long as we do not enable it
by default.





[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...

2017-01-21 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16671
  
@gatorsmile can you take a look?





[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...

2017-01-21 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16671
  
Table2 with about 5M rows, 200 partitions by SparkSQL.

(The table uses MySQL sharding, and every partition will return at most 10K
rows.)


Old partition result (elements in each partition):


>1,49,54,53,60,59,48,61,52,57,60,69,58,57,50,52,51,66,58,45,59,52,61,56,67,51,45,49,70,49,58,59,61,53,50,53,47,50,46,53,55,53,62,55,48,58,52,62,62,37,65,59,58,55,61,59,46,53,49,49,61,72,60,46,50,51,45,47,55,63,64,63,55,47,65,57,60,60,51,45,48,77,58,57,59,39,50,62,55,57,49,63,51,38,49,66,62,58,53,54,50,54,52,69,51,49,61,60,64,49,52,50,54,58,48,51,50,49,41,68,54,45,65,62,44,52,64,58,47,51,65,47,37,42,39,44,51,65,56,54,69,51,61,63,51,52,47,55,58,66,47,54,53,53,60,66,66,68,64,66,55,58,64,55,50,57,46,56,39,60,57,63,40,51,56,58,44,46,46,44,42,52,52,44,53,46,55,57,68,57,62,48,47,52,59,58,49,44,52,47

(Most of the data is in partition 0, but each partition returns at most 10K
rows because of our sharding limit.)


New partition result (elements in each partition):


>2083,1,1,6932,9799,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,8150,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,9,70,2,1,1,1,655,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
 
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,40,76,145,38,86,176,369,696,1338,2776,5381'


count cost time: 0.8ms





[GitHub] spark issue #16671: [SparkSQL] a better balance partition method for jdbc AP...

2017-01-21 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16671
  
Here is the real-data test result:
Table with 1.2 million rows, 50 partitions by SparkSQL.



Old partition result (elements in each partition):


>100061,100064,100059,100066,100065,100065,100066,100066,100063,100061,100066,100065,70747,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


New partition result (elements in each partition):

>19543,19544,39083,39088,19544,19545,39085,19544,19542,19543,19545,39086,39087,19544,19545,39088,19544,19544,39088,19543,19545,39088,19544,19545,39088,19544,19544,39088,19544,19545,19543,19544,39086,19543,19545,39086,39086,19544,19545,39088,19544,19545,39088,19544,19544,39088,19544,19545,20701,0


count cost time: 1.27s















[GitHub] spark pull request #16671: [SparkSQL] a better balance partition method for ...

2017-01-21 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/16671

[SparkSQL] a better balance partition method for jdbc API

## What changes were proposed in this pull request?

The partition method in `jdbc` uses an equal step, which can lead to skew
between partitions. The new method introduces a balanced partitioning method
based on the elements when splitting, which can relieve the skew problem at a
small query cost.
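
For contrast, a sketch of the existing equal-step split that this refers to (simplified; the real logic lives in Spark's JDBC relation code). When the column values are sparse or clustered, equal-width ranges produce very unbalanced partitions:

```scala
// Illustrative only: equal-width ranges over [lower, upper).
def equalStepBounds(lower: Long, upper: Long, numPartitions: Int): Seq[(Long, Long)] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val start = lower + i * stride
    val end   = if (i == numPartitions - 1) upper else start + stride
    (start, end)
  }
}

// With ids clustered near the low end of a huge range, most ranges hold few rows:
println(equalStepBounds(0L, 10000000L, 4))
// Vector((0,2500000), (2500000,5000000), (5000000,7500000), (7500000,10000000))
```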


## How was this patch tested?
Unit tests and real data.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark balancePartition

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16671.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16671


commit 88cdf294aa579f65b8272870d762548cf54349ce
Author: DjvuLee <l...@bytedance.com>
Date:   2017-01-20T09:53:57Z

[SparkSQL] a better balance partition method for jdbc API

The partition method in jdbc, when a column is specified, uses an equal
step, which can lead to skew between partitions. The new method
introduces a partitioning method based on the elements when splitting,
which keeps the elements balanced between partitions.







[GitHub] spark issue #16599: [SPARK-19239][PySpark] Check the lowerBound and upperBou...

2017-01-17 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16599
  
I have updated the PR and tested the change in the pyspark shell.





[GitHub] spark pull request #16599: [SPARK-19239][PySpark] Check the lowerBound and u...

2017-01-16 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/16599#discussion_r96357233
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, 
lowerBound=None, upperBound=None, numPar
 if column is not None:
 if numPartitions is None:
 numPartitions = self._spark._sc.defaultParallelism
--- End diff --

I am a little worried that this change will break the API. If some users
specify only `column`, `lowerBound`, and `upperBound` in some Spark version,
their program will fail after an upgrade, even though very few people rely on
the default parallelism.

In my personal opinion, I prefer to make the change and keep the API
consistent.

If your opinion is to add the assert on `numPartitions`, I will update the
PR soon.





[GitHub] spark pull request #16599: [SPARK-19239][PySpark] Check the lowerBound and u...

2017-01-16 Thread djvulee
Github user djvulee commented on a diff in the pull request:

https://github.com/apache/spark/pull/16599#discussion_r96339764
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -431,6 +432,8 @@ def jdbc(self, url, table, column=None, 
lowerBound=None, upperBound=None, numPar
 if column is not None:
 if numPartitions is None:
 numPartitions = self._spark._sc.defaultParallelism
+assert lowerBound != None, "lowerBound can not be None when 
``column`` is specified"
+assert upperBound != None, "upperBound can not be None when 
``column`` is specified"
--- End diff --

Yes, the Scala code could check this, but the PySpark code will fail at
```int(lowerBound)``` first, so the user is confused.





[GitHub] spark issue #16599: [SPARK-19239][PySpark] Check the lowerBound and upperBou...

2017-01-16 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16599
  
@zsxwing can you take a look?





[GitHub] spark pull request #16599: [SPARK-19239][PySpark] Check the lowerBound and u...

2017-01-16 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/16599

[SPARK-19239][PySpark] Check the lowerBound and upperBound whether equals 
None in jdbc API

## What changes were proposed in this pull request?

The `jdbc` API does not check `lowerBound` and `upperBound` when we
specify the ``column``, and just throws the following exception:

>```int() argument must be a string or a number, not 'NoneType'```


If we check the parameters, we can give a friendlier suggestion.


## How was this patch tested?
Tested using the pyspark shell, without the lowerBound and upperBound
parameters.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark pysparkFix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16599.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16599


commit 94c44ba368acb3c7fa648ad66cfd3cac352af911
Author: DjvuLee <l...@bytedance.com>
Date:   2017-01-16T08:43:34Z

[SPARK-19239][PySparK] Check the lowerBound and upperBound whether equal 
None in jdbc API

The ``jdbc`` API does not check the lowerBound and upperBound when we
specify the ``column``, and just throws the following exception:
```int() argument must be a string or a number, not 'NoneType'```
If we check the parameters, we can give a friendlier suggestion.







[GitHub] spark pull request #16210: [Core][SPARK-18778]Fix the scala classpath under ...

2016-12-13 Thread djvulee
Github user djvulee closed the pull request at:

https://github.com/apache/spark/pull/16210





[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...

2016-12-13 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16210
  
Yes, this PR was not well thought out. I will close it and update the JIRA.





[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...

2016-12-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16210
  
@srowen Sorry for the late reply, and thanks for trying to reproduce it!

As I mentioned in my last reply, this is not an environment problem but our
own misunderstanding of SPARK_SUBMIT_OPTS, or a deployment problem. This works
for everyone else because few people use the ```SPARK_SUBMIT_OPTS``` option or
put ```SPARK_SUBMIT_OPTS``` in the spark-env.sh file.

It may be better to separate ```-Dscala.usejavacp=true``` from
```SPARK_SUBMIT_OPTS``` in the spark-shell file to avoid the misunderstanding.







[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...

2016-12-08 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16210
  
I found the reason: because we pass some SPARK_SUBMIT_OPTS defined by
ourselves, it seems that Spark only parses the opts we defined and ignores the
```-Dscala.usejavacp=true```.

Since we want users to be able to use `SPARK_SUBMIT_OPTS`, the best way is to
separate ```-Dscala.usejavacp=true``` from SPARK_SUBMIT_OPTS; maybe moving it
to SparkSubmitCommandBuilder is a good idea, as suggested by @vanzin.









[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...

2016-12-08 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16210
  
@jodersky Yes. I tried different ways; here are the results:

```
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true -usejavacp"
```

and

```
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true -Dusejavacp"
```

will output
```
Exception in thread "main" java.lang.AssertionError: assertion failed: null
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:247)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
```

```
SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -usejavacp"
```
will output:
```
Unrecognized option: -usejavacp
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
```







[GitHub] spark issue #16210: [Core][SPARK-18778]Fix the scala classpath under some en...

2016-12-07 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/16210
  
@rxin Our JDK is jdk1.8.0_91, we did not install Scala, and the OS is
Debian 4.6.4.





[GitHub] spark pull request #16210: [Core][SPARK-18778]Fix the scala classpath under ...

2016-12-07 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/16210

[Core][SPARK-18778]Fix the scala classpath under some environment

## What changes were proposed in this pull request?
Under some environments, the -Dscala.usejavacp=true option seems not to work;
passing -usejavacp directly to the REPL fixes this.

## How was this patch tested?
We tested it in our cluster environment.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark sparkShell

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16210.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16210


commit ab81a7af165c7287c0356758097dfa5ded6adea3
Author: DjvuLee <l...@bytedance.com>
Date:   2016-12-08T07:15:59Z

[Core] Fix the Scala classpath under some environments

Under some environments, the -Dscala.usejavacp=true option seems not to work;
passing -usejavacp directly to the REPL fixes this.







[GitHub] spark issue #15249: [SPARK-17675] [CORE] Expand Blacklist for TaskSets

2016-09-26 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15249
  
I would say this is a very important PR. In our experience, sometimes we just
need to skip some nodes because of bad disks, and the existing blacklist
mechanism helps little.





[GitHub] spark pull request #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metri...

2016-09-12 Thread djvulee
Github user djvulee closed the pull request at:

https://github.com/apache/spark/pull/15052





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15052
  
@srowen Yes, the file always seems to be empty before the write, so the
original way is OK. Sorry that this PR was not thought through enough; I was
misled by the other method in shuffle.py, which uses the position before the
write.

I will close this PR.





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15052
  
@srowen No, it does not matter whether the file is empty or not; if the file
is empty, `getsize()` just returns 0, and this should be OK.





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15052
  
@srowen I have updated the PR to increase the DiskBytesSpilled metric
incrementally.





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-12 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15052
  
@srowen  you are right, I will correct it soon.





[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-11 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15052
  
@srowen  @davies  mind taking a look? This PR is very simple.





[GitHub] spark pull request #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metri...

2016-09-11 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/15052

[SPARK-17500][PySpark]Make DiskBytesSpilled metric in PySpark shuffle right

## What changes were proposed in this pull request?

The original code increases the DiskBytesSpilled metric by the file size
during each spill in ExternalMerger && ExternalGroupBy, but we only need the
last size.

## How was this patch tested?

No extra tests, because this just updates the metrics.

Author: Li Hu <l...@bytedance.com>

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark PyDiskSpillMetric

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15052.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15052


commit 1b90b0dd61c22ffba6d578f73cf5aca88629b1be
Author: DjvuLee <l...@bytedance.com>
Date:   2016-09-11T19:41:32Z

Make DiskBytesSpilled metric in PySpark shuffle right

The original code increases the DiskBytesSpilled metric by the file
size during each spill in ExternalMerger && ExternalGroupBy, but we only
need the last size.

No extra tests, because this just updates the metrics.

Author: Li Hu <l...@bytedance.com>







[GitHub] spark pull request: Branch 1.1 typo error in HistoryServer

2014-12-02 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/3566

Branch 1.1 typo error in HistoryServer 

There is a typo on lines 167 and 168 of the HistoryServer.scala file:
./sbin/spark-history-server.sh should be ./sbin/start-history-server.sh.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-1.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3566.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3566


commit 9a62cf3655dcab49b5c0f94ad094603eaf288251
Author: Michael Armbrust mich...@databricks.com
Date:   2014-08-27T22:14:08Z

[SPARK-3235][SQL] Ensure in-memory tables don't always broadcast.

Author: Michael Armbrust mich...@databricks.com

Closes #2147 from marmbrus/inMemDefaultSize and squashes the following 
commits:

5390360 [Michael Armbrust] Merge remote-tracking branch 'origin/master' 
into inMemDefaultSize
14204d3 [Michael Armbrust] Set the context before creating 
SparkLogicalPlans.
8da4414 [Michael Armbrust] Make sure we throw errors when leaf nodes fail
to provide statistics
18ce029 [Michael Armbrust] Ensure in-memory tables don't always broadcast.

(cherry picked from commit 7d2a7a91f263bb9fbf24dc4dbffde8fe5e2c7442)
Signed-off-by: Michael Armbrust mich...@databricks.com

commit 0c03fb621e5b080f24863cfc17032bd828b65b99
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-08-27T22:48:00Z

Revert [maven-release-plugin] prepare for next development iteration

This reverts commit 9af3fb7385d1f9f221962f1d2d725ff79bd82033.

commit 0b17c7d4f2176f0c0e8aaab95e034be54467ff30
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-08-27T22:48:13Z

Revert [maven-release-plugin] prepare release v1.1.0-snapshot2

This reverts commit e1535ad3c6f7400f2b7915ea91da9c60510557ba.

commit d4cf7a068da099f0f07f04a834d7edf6b743ceb3
Author: Matthew Farrellee m...@redhat.com
Date:   2014-08-27T22:50:30Z

Add line continuation for script to work w/ py2.7.5

Error was -

$ SPARK_HOME=$PWD/dist ./dev/create-release/generate-changelist.py
  File ./dev/create-release/generate-changelist.py, line 128
if day < SPARK_REPO_CHANGE_DATE1 or
  ^
SyntaxError: invalid syntax

Author: Matthew Farrellee m...@redhat.com

Closes #2139 from mattf/master-fix-generate-changelist.py-0 and squashes 
the following commits:

6b3a900 [Matthew Farrellee] Add line continuation for script to work w/ 
py2.7.5
(cherry picked from commit 64d8ecbbe94c47236ff2d8c94d7401636ba6fca4)

Signed-off-by: Patrick Wendell pwend...@gmail.com

commit 8597e9cf356b0d8e17600a49efc4c4a0356ecb5d
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-08-27T22:55:59Z

BUILD: Updating CHANGES.txt for Spark 1.1

commit 58b0be6a29eab817d350729710345e9f39e4c506
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-08-27T23:28:08Z

[maven-release-plugin] prepare release v1.1.0-rc1

commit 78e3c036eee7113b2ed144eec5061e070b479e56
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-08-27T23:28:27Z

[maven-release-plugin] prepare for next development iteration

commit 54ccd93e621c1bc4afc709a208b609232ab701d1
Author: Andrew Or andrewo...@gmail.com
Date:   2014-08-28T06:03:46Z

[HOTFIX] Wait for EOF only for the PySpark shell

In `SparkSubmitDriverBootstrapper`, we wait for the parent process to send 
us an `EOF` before finishing the application. This is applicable for the 
PySpark shell because we terminate the application the same way. However if we 
run a python application, for instance, the JVM actually never exits unless it 
receives a manual EOF from the user. This is causing a few tests to timeout.

We only need to do this for the PySpark shell because Spark submit runs as 
a python subprocess only in this case. Thus, the normal Spark shell doesn't 
need to go through this case even though it is also a REPL.

Thanks davies for reporting this.

Author: Andrew Or andrewo...@gmail.com

Closes #2170 from andrewor14/bootstrap-hotfix and squashes the following 
commits:

42963f5 [Andrew Or] Do not wait for EOF unless this is the pyspark shell
(cherry picked from commit dafe343499bbc688e266106e4bb897f9e619834e)

Signed-off-by: Patrick Wendell pwend...@gmail.com

commit 233c283e3d946bdcbf418375122c5763559c0119
Author: Michael Armbrust mich...@databricks.com
Date:   2014-08-28T06:05:34Z

[HOTFIX][SQL] Remove cleaning of UDFs

It is not safe to run the closure cleaner on slaves.  #2153 introduced this 
which broke all UDF execution on slaves.  Will re-add cleaning of UDF closures 
in a follow-up PR.

Author