[GitHub] spark pull request: [SPARK-3781] code format and little improvemen...

2014-10-13 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2734#issuecomment-58849017
  
Hi @shijinkui,

After some discussion, we've decided that we'd like to avoid merging pull 
requests that make large, sweeping style changes/improvements, since these 
changes tend to create maintenance headaches for us by making `git blame` less 
useful and creating merge-conflicts when backporting to maintenance branches.  
However, we'd be open to automatic style checks if they can be conditionally 
applied only to new / modified code (see 
https://issues.apache.org/jira/browse/SPARK-3849 for more details).

In the meantime, do you mind closing this pull request?  Thanks!





[GitHub] spark pull request: [SPARK-3873] [build] Add style checker to enfo...

2014-10-13 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2757#issuecomment-58849258
  
Hi @vanzin,

We'd like to avoid making large refactorings for style, since these changes 
tend to create merge-conflicts when backporting to maintenance branches and 
make `git blame` significantly less useful.  However, we'd be open to automatic 
style checks if they can be enforced only for new code (see 
https://issues.apache.org/jira/browse/SPARK-3849 for more details).

In the meantime, do you mind closing this pull request? Thanks!





[GitHub] spark pull request: Add echo Run streaming tests ...

2014-10-13 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2778#issuecomment-58849321
  
LGTM; thanks!





[GitHub] spark pull request: [SPARK-3854] Scala style: require spaces befor...

2014-10-13 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2761#issuecomment-58849280
  
Hi @sarutak,

We'd like to avoid making large refactorings for style, since these changes 
tend to create merge-conflicts when backporting to maintenance branches and 
make git blame significantly less useful. However, we'd be open to automatic 
style checks if they can be enforced only for new code (see 
https://issues.apache.org/jira/browse/SPARK-3849 for more details).

In the meantime, do you mind closing this pull request? Thanks!





[GitHub] spark pull request: Add echo Run streaming tests ...

2014-10-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2778





[GitHub] spark pull request: [SPARK-3869] ./bin/spark-class miss Java versi...

2014-10-13 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2725#issuecomment-58849438
  
Jenkins, add to whitelist.





[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-13 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-58849802
  
@zhzhan @scwf - I think this should be okay now for protobuf. We made some 
other changes this week that update the protobuf version used in Akka from 
2.4 to 2.5, so protobuf 2.5 is now used throughout Spark. Mind rebasing 
this? I think the protobuf issue will go away.





[GitHub] spark pull request: [SPARK-3899][Doc]fix wrong links in streaming ...

2014-10-13 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2749#issuecomment-58849985
  
Hmm, but #implementing-and-using-a-custom-actor-based-receiver is not a 
valid link. Sorry, I did not get you; can you explain more?





[GitHub] spark pull request: [SPARK-3869] ./bin/spark-class miss Java versi...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2725#issuecomment-58850138
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21677/
Test FAILed.





[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-13 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-58850172
  
Ok, thanks for that. I will also test it in 
https://github.com/apache/spark/pull/2685





[GitHub] spark pull request: [SPARK-3899][Doc]fix wrong links in streaming ...

2014-10-13 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2749#issuecomment-58850273
  
Oh, I meant that you could have linked the page like this, so that the link 
jumps to the Akka-specific section: 
https://spark.apache.org/docs/latest/streaming-custom-receivers.html#implementing-and-using-a-custom-actor-based-receiver.





[GitHub] spark pull request: Bug Fix: without unpersist method in RandomFor...

2014-10-13 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2775#issuecomment-58850468
  
test this please





[GitHub] spark pull request: [SPARK-3869] ./bin/spark-class miss Java versi...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2725#issuecomment-58850691
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/354/consoleFull)
 for   PR 2725 at commit 
[`f894ebd`](https://github.com/apache/spark/commit/f894ebd0b6799af4037134fadf6c515af09181fc).
 * This patch merges cleanly.





[GitHub] spark pull request: Bug Fix: without unpersist method in RandomFor...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2775#issuecomment-58851060
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21678/consoleFull)
 for   PR 2775 at commit 
[`815d543`](https://github.com/apache/spark/commit/815d543606efb0f90da8c5a1c87b3e12924d25a7).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3899][Doc]fix wrong links in streaming ...

2014-10-13 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2749#issuecomment-58851126
  
Got it; using 
```streaming-custom-receivers.html#implementing-and-using-a-custom-actor-based-receiver```
 here jumps to the Akka-specific section :)





[GitHub] spark pull request: [SPARK-3812] [BUILD] Adapt maven build to publ...

2014-10-13 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/2673#issuecomment-58851300
  
@pwendell I don't see an easy way with the maven shade plugin either; do 
you? One way is to include a fake dependency and then ask it to shade that 
across all artifacts, but that somehow felt more invasive to me.





[GitHub] spark pull request: [SPARK-3899][Doc]fix wrong links in streaming ...

2014-10-13 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2749#issuecomment-58851418
  
Updated.





[GitHub] spark pull request: [SPARK-3899][Doc]fix wrong links in streaming ...

2014-10-13 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2749#issuecomment-58851490
  
LGTM.  Thanks!





[GitHub] spark pull request: [SPARK-3899][Doc]fix wrong links in streaming ...

2014-10-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2749





[GitHub] spark pull request: [SPARK-3854] Scala style: require spaces befor...

2014-10-13 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/2761#issuecomment-58852435
  
O.K. I'll close this PR for now.





[GitHub] spark pull request: [SPARK-3921] Fix CoarseGrainedExecutorBackend'...

2014-10-13 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/spark/pull/2779

[SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone 
mode

The goal of this patch is to fix the swapped arguments in standalone mode, 
which was caused by  
https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153.

More details can be found in the JIRA: 
[SPARK-3921](https://issues.apache.org/jira/browse/SPARK-3921)

Tested in Standalone mode, but not in Mesos.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/spark fix-standalone

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2779.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2779


commit 9d703feb03b3201d73012cb1081b8d20d7ba4ac1
Author: Aaron Davidson aa...@databricks.com
Date:   2014-10-13T06:26:46Z

[SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone 
mode

The goal of this patch is to fix the swapped arguments in standalone mode, 
which was caused by  
https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153.

More details can be found in the JIRA: 
[SPARK-3921](https://issues.apache.org/jira/browse/SPARK-3921)

Tested in Standalone mode, but not in Mesos.







[GitHub] spark pull request: [SPARK-3854] Scala style: require spaces befor...

2014-10-13 Thread sarutak
Github user sarutak closed the pull request at:

https://github.com/apache/spark/pull/2761





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread chouqin
GitHub user chouqin opened a pull request:

https://github.com/apache/spark/pull/2780

[SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree 
more adaptively

DecisionTree splits on continuous features by choosing an array of values 
from a subsample of the data.
Currently, it does not check for identical values in the subsample, so it 
could end up having multiple copies of the same split. In this PR, we choose 
splits for a continuous feature in 3 steps:

1. Sort the sampled values for this feature
2. Get the number of occurrences of each distinct value
3. Iterate over the value-count array computed in step 2 to choose splits.

After the splits are found, `numSplits` and `numBins` in the metadata are 
updated.
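
Below is a minimal Scala sketch of the three steps (my illustration, not the 
PR's actual code), assuming thresholds are placed at distinct sample values so 
that each bin receives roughly the same number of samples:

```scala
// Choose up to numSplits thresholds for one continuous feature.
def chooseSplits(featureSamples: Array[Double], numSplits: Int): Array[Double] = {
  // Steps 1 + 2: sort the sampled values and count each distinct value.
  val valueCounts: Array[(Double, Int)] =
    featureSamples.sorted.foldLeft(List.empty[(Double, Int)]) {
      case ((v, c) :: rest, x) if x == v => (v, c + 1) :: rest
      case (acc, x) => (x, 1) :: acc
    }.reverse.toArray

  // Step 3: walk the value-count array, emitting a split each time the
  // running count passes the next multiple of the expected bin size.
  // Identical sample values can never produce duplicate splits.
  val expectedBinSize = featureSamples.length.toDouble / (numSplits + 1)
  val splits = scala.collection.mutable.ArrayBuffer.empty[Double]
  var runningCount = 0
  var i = 0
  while (i < valueCounts.length - 1 && splits.length < numSplits) {
    runningCount += valueCounts(i)._2
    if (runningCount >= expectedBinSize * (splits.length + 1)) {
      splits += valueCounts(i)._1
    }
    i += 1
  }
  splits.toArray
}
```

E.g. `chooseSplits(Array[Double](0, 1, 2, 3, 4, 5), numSplits = 1)` returns 
`Array(2.0)` here, consistent with the example discussed later in this thread.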


CC: @mengxr @manishamde @jkbradley, please help me review this, thanks.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chouqin/spark dt-findsplits

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2780.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2780


commit af7cb7962ff9f5041981ea5e4fe2465eceb6f0e5
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-09T11:47:09Z

Choose splits for continuous features in DecisionTree more adaptively

commit 365282375ce3d1a26664695893ebad13d1b3bc47
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-09T12:40:55Z

fix bug

commit 0cd744a4e710463591324b36f01d9dab028e79ef
Author: liqi liqiping1...@gmail.com
Date:   2014-10-10T04:33:24Z

fix bug

commit 1b25a3530f5429b245a50d4c706ebad2d2875726
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-11T01:36:38Z

Merge branch 'master' of https://github.com/apache/spark into dt-findsplits

commit 9e7138e09dfe27c41d8d20ba6fcf9cb59d64a46b
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-13T01:11:31Z

Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into 
dt-findsplits

Conflicts:
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

commit 8f46af6b57149fefd1e32120947ebe3291730af0
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-13T03:48:42Z

add comments and unit test

commit 369f812a9ffce7dd10fc37e4a937158f2fa93e1c
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-13T03:53:07Z

fix style

commit c339a614362f3045ee95975f99b6fde884657d48
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-13T04:31:23Z

fix bug

commit 2a8267ab9bd8853fa1f638b69373dbbbf0d1a329
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-13T04:43:44Z

fix bug

commit af6dc974258a9b07020e233e16cbbb584f501122
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-13T05:03:43Z

fix bug

commit ab303a4ab1931b0c1a90ae2c3923f25d8f266178
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-13T06:10:33Z

fix bug

commit f69f47f25f292995aa8710da6384bf631787711a
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-13T06:12:10Z

fix bug

commit 092efcb89c4113eba8374e47587c6f1272aa7125
Author: Qiping Li liqiping1...@gmail.com
Date:   2014-10-13T06:31:58Z

fix bug







[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58853083
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58853271
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21680/consoleFull)
 for   PR 2780 at commit 
[`092efcb`](https://github.com/apache/spark/commit/092efcb89c4113eba8374e47587c6f1272aa7125).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3921] Fix CoarseGrainedExecutorBackend'...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2779#issuecomment-58853441
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21679/
Test FAILed.





[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-13 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-58853725
  
Hi @zhzhan and @scwf  - I made some changes to the build to simplify it a 
bit. I made a PR into your branch. I tested it locally compiling for 0.12 and 
0.13, but it would be good if you tested it as well to make sure it works.

https://github.com/zhzhan/spark/pull/1/files





[GitHub] spark pull request: [SPARK-3921] Fix CoarseGrainedExecutorBackend'...

2014-10-13 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2779#issuecomment-58853884
  
Jenkins, retest this please.





[GitHub] spark pull request: [Spark 3922] Refactor spark-core to use Utils....

2014-10-13 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/2781

[Spark 3922] Refactor spark-core to use Utils.UTF_8

A global UTF-8 constant is very helpful for handling encoding problems when 
converting between String and bytes. There are several possible solutions:

1. Add `val UTF_8 = Charset.forName("UTF-8")` to Utils.scala
2. java.nio.charset.StandardCharsets.UTF_8 (requires JDK 7)
3. io.netty.util.CharsetUtil.UTF_8
4. com.google.common.base.Charsets.UTF_8
5. org.apache.commons.lang.CharEncoding.UTF_8
6. org.apache.commons.lang3.CharEncoding.UTF_8

IMO, I prefer option 1) because people can find it easily.

This is a PR for option 1) and only fixes Spark Core.
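
For concreteness, a minimal sketch of what option 1) might look like (the 
constant's placement and visibility here are my assumption, not necessarily 
the PR's exact code):

```scala
import java.nio.charset.Charset

object Utils {
  // Shared UTF-8 Charset. The Charset-based String overloads avoid the
  // checked UnsupportedEncodingException of the name-based overloads.
  val UTF_8: Charset = Charset.forName("UTF-8")
}

// Example call sites:
//   new String(bytes, Utils.UTF_8)
//   "hello".getBytes(Utils.UTF_8)
```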

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-3922

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2781.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2781


commit 65b6b8ef68aa71ac45d292eefd7b3e4de0de3bf8
Author: zsxwing zsxw...@gmail.com
Date:   2014-10-13T06:53:11Z

Add UTF_8 to Utils

commit 80f4af8812d3f36a3807e574478a10511916dfbc
Author: zsxwing zsxw...@gmail.com
Date:   2014-10-13T06:53:26Z

Refactor spark-core to use Utils.UTF_8







[GitHub] spark pull request: [Spark 3922] Refactor spark-core to use Utils....

2014-10-13 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/2781#issuecomment-58854034
  
/cc @rxin, @JoshRosen





[GitHub] spark pull request: [Spark 3922] Refactor spark-core to use Utils....

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2781#issuecomment-58854193
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3921] Fix CoarseGrainedExecutorBackend'...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2779#issuecomment-58854400
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21682/consoleFull)
 for   PR 2779 at commit 
[`9d703fe`](https://github.com/apache/spark/commit/9d703feb03b3201d73012cb1081b8d20d7ba4ac1).
 * This patch merges cleanly.





[GitHub] spark pull request: [Spark 3922] Refactor spark-core to use Utils....

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2781#issuecomment-58854409
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21681/consoleFull)
 for   PR 2781 at commit 
[`80f4af8`](https://github.com/apache/spark/commit/80f4af8812d3f36a3807e574478a10511916dfbc).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...

2014-10-13 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2241#issuecomment-58854626
  
Note @scwf there are some TODOs in there that need to be addressed in your 
patch for JDBC.





[GitHub] spark pull request: [SPARK-3453] Netty-based BlockTransferService,...

2014-10-13 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2753#discussion_r18755797
  
--- Diff: 
core/src/main/scala/org/apache/spark/network/netty/NettyBlockFetcher.scala ---
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.netty
+
+import java.nio.ByteBuffer
+import java.util
+
+import org.apache.spark.Logging
+import org.apache.spark.network.BlockFetchingListener
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.network.buffer.ManagedBuffer
+import org.apache.spark.network.client.{RpcResponseCallback, 
ChunkReceivedCallback, SluiceClient}
+import org.apache.spark.storage.BlockId
+import org.apache.spark.util.Utils
+
+/**
+ * Responsible for holding the state for a request for a single set of 
blocks. This assumes that
+ * the chunks will be returned in the same order as requested, and that 
there will be exactly
+ * one chunk per block.
+ *
+ * Upon receipt of any block, the listener will be called back. Upon 
failure part way through,
+ * the listener will receive a failure callback for each outstanding block.
+ */
+class NettyBlockFetcher(
+serializer: Serializer,
+client: SluiceClient,
+blockIds: Seq[String],
+listener: BlockFetchingListener)
+  extends Logging {
+
+  require(blockIds.nonEmpty)
+
+  val ser = serializer.newInstance()
+
+  var streamHandle: ShuffleStreamHandle = _
+
+  val chunkCallback = new ChunkReceivedCallback {
+// On receipt of a chunk, pass it upwards as a block.
+def onSuccess(chunkIndex: Int, buffer: ManagedBuffer): Unit = 
Utils.logUncaughtExceptions {
+  buffer.retain()
+  listener.onBlockFetchSuccess(blockIds(chunkIndex), buffer)
+}
+
+// On receipt of a failure, fail every block from chunkIndex onwards.
+def onFailure(chunkIndex: Int, e: Throwable): Unit = {
+  blockIds.drop(chunkIndex).foreach { blockId =>
+listener.onBlockFetchFailure(blockId, e);
+  }
+}
+  }
+
+  // Send the RPC to open the given set of blocks. This will return a 
ShuffleStreamHandle.
+  
client.sendRpc(ser.serialize(OpenBlocks(blockIds.map(BlockId.apply))).array(),
--- End diff --

Does this even need to be a class on its own? If yes, maybe have a separate 
init method so we don't get a weird object ctor failure.
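
A tiny sketch of that suggestion (names and shape illustrative only, not the 
actual PR code):

```scala
// Keep the constructor free of side effects and move the block-open RPC into
// an explicit start(), so a send failure surfaces as an ordinary method error
// rather than an exception thrown while the object is being constructed.
class SafeFetcher(sendOpenBlocksRpc: () => Unit) {
  def start(): Unit = {
    sendOpenBlocksRpc() // formerly executed in the constructor body
  }
}
```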





[GitHub] spark pull request: [Spark 3922] Refactor spark-core to use Utils....

2014-10-13 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/2781#issuecomment-58855147
  
I vote for `com.google.common.base.Charsets.UTF_8` now, and 
`java.nio.charset.StandardCharsets.UTF_8` when Spark moves to Java 7+. No need 
to define this constant yet again.
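
Both of those are existing library constants, so nothing new needs defining; 
for illustration (REPL-style, my example):

```scala
import com.google.common.base.Charsets // Guava; works on Java 6

val bytes = "hello".getBytes(Charsets.UTF_8) // no checked exception to handle
val back  = new String(bytes, Charsets.UTF_8)

// Java 7+ equivalent, once Spark drops Java 6 support:
//   import java.nio.charset.StandardCharsets
//   "hello".getBytes(StandardCharsets.UTF_8)
```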





[GitHub] spark pull request: [SPARK-3869] ./bin/spark-class miss Java versi...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2725#issuecomment-58855812
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/354/consoleFull)
 for   PR 2725 at commit 
[`f894ebd`](https://github.com/apache/spark/commit/f894ebd0b6799af4037134fadf6c515af09181fc).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Bug Fix: without unpersist method in RandomFor...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2775#issuecomment-58856471
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21678/
Test PASSed.





[GitHub] spark pull request: [SPARK-3921] Fix CoarseGrainedExecutorBackend'...

2014-10-13 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2779#issuecomment-58856481
  
LGTM





[GitHub] spark pull request: Bug Fix: without unpersist method in RandomFor...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2775#issuecomment-58856468
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21678/consoleFull)
 for   PR 2775 at commit 
[`815d543`](https://github.com/apache/spark/commit/815d543606efb0f90da8c5a1c87b3e12924d25a7).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3826][SQL]enable hive-thriftserver to s...

2014-10-13 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/2685#issuecomment-58856535
  
@pwendell, I am resolving the conflicts; are there other TODOs here?





[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2388#issuecomment-58856594
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21683/consoleFull)
 for   PR 2388 at commit 
[`1e2485c`](https://github.com/apache/spark/commit/1e2485c05c77dbca4332b9af616c27c45f2f5e32).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58857589
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21680/consoleFull)
 for   PR 2780 at commit 
[`092efcb`](https://github.com/apache/spark/commit/092efcb89c4113eba8374e47587c6f1272aa7125).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Spark 3922] Refactor spark-core to use Utils....

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2781#issuecomment-58859149
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21681/consoleFull)
 for   PR 2781 at commit 
[`80f4af8`](https://github.com/apache/spark/commit/80f4af8812d3f36a3807e574478a10511916dfbc).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Spark 3922] Refactor spark-core to use Utils....

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2781#issuecomment-58859154
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21681/
Test FAILed.





[GitHub] spark pull request: [SPARK-3921] Fix CoarseGrainedExecutorBackend'...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2779#issuecomment-58859175
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21682/
Test FAILed.





[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

2014-10-13 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/2705#issuecomment-58862691
  
@chanda what's your problem formulation?

min x'Hx + c'x
s.t. Ax = B

You can write it as

min x'Hx + c'x + g(z)
s.t. Ax = B + z

where g(z) is the indicator function that z = 0.
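
Typeset, the same splitting reads as follows (operators kept exactly as they 
appear above; the archive may have stripped any inequality signs the original 
email contained):

```latex
\min_{x}\; x^{\top} H x + c^{\top} x \quad \text{s.t.}\quad Ax = B
\;\;\longrightarrow\;\;
\min_{x,\,z}\; x^{\top} H x + c^{\top} x + g(z) \quad \text{s.t.}\quad Ax = B + z,
\qquad
g(z) = \begin{cases} 0 & \text{if } z = 0, \\ +\infty & \text{otherwise.} \end{cases}
```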

Now we can solve this using QuadraticMinimizer.scala... Let me know if this 
formulation makes sense and I will point you to the rest of the steps...

By the way, I am working on adding H as a sparse matrix, but it will take some 
time since we need LDL factorization and that's in the ECOS code base... Once 
I make the ECOS jar available we should be able to use LDL from there...

Is your matrix sparse because you keep a sparse kernel for the SVM rather than 
all entries from the RBF?

For now I suggest using the dense formulation: partition your kernel matrix, 
solve a QP on each worker, and then combine the results using treeAggregate 
on the master...





[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2388#issuecomment-58862844
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21683/consoleFull)
 for   PR 2388 at commit 
[`1e2485c`](https://github.com/apache/spark/commit/1e2485c05c77dbca4332b9af616c27c45f2f5e32).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class TopicModelingKryoRegistrator extends KryoRegistrator `
  * `class StreamingContext(object):`
  * `class DStream(object):`
  * `class TransformedDStream(DStream):`
  * `class TransformFunction(object):`
  * `class TransformFunctionSerializer(object):`






[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2388#issuecomment-58862848
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21683/
Test PASSed.





[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

2014-10-13 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/2705#issuecomment-58863216
  
@Chanda breeze's sparse matrix does not solve your problem, since breeze does 
not have a sparse LDL, but the ECOS jar has the ldl and amd native libraries, 
which we will use for sparse LDL...





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread chouqin
Github user chouqin commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58865080
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58865274
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21684/consoleFull)
 for   PR 2780 at commit 
[`9e64699`](https://github.com/apache/spark/commit/9e64699f67e64424f877aea8fc1e6282e32c8595).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58865777
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21685/consoleFull)
 for   PR 2780 at commit 
[`9e64699`](https://github.com/apache/spark/commit/9e64699f67e64424f877aea8fc1e6282e32c8595).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread chouqin
Github user chouqin commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58865951
  
@jkbradley, RandomForestSuite fails because the original splits are a better 
fit for the training data (for example, 899.5 is a split threshold, which is 
close to 900). I think this PR's method of choosing splits is more reasonable 
than the original one, in that the first threshold found by the original 
method is simply the average of the first two `featureSamples`.

For example, if `featureSamples` is `Array(0, 1, 2, 3, 4, 5)`, finding a split 
point with the original method returns 0.5, while this PR's method returns 2.





[GitHub] spark pull request: [SPARK-3814][SQL] Bitwise does not work in H...

2014-10-13 Thread ravipesala
Github user ravipesala commented on the pull request:

https://github.com/apache/spark/pull/2736#issuecomment-58872032
  
Thank you @scwf, I have created a new PR since this one has merge conflicts. 
It would not be neat if I rebased and pushed to the old PR, because it would 
show all the changed files that were merged in while rebasing.





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58872054
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21684/
Test FAILed.





[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58872046
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21684/consoleFull)
 for   PR 2780 at commit 
[`9e64699`](https://github.com/apache/spark/commit/9e64699f67e64424f877aea8fc1e6282e32c8595).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58872500
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21685/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58872495
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21685/consoleFull)
 for   PR 2780 at commit 
[`9e64699`](https://github.com/apache/spark/commit/9e64699f67e64424f877aea8fc1e6282e32c8595).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3562]Periodic cleanup event logs

2014-10-13 Thread viper-kun
Github user viper-kun commented on a diff in the pull request:

https://github.com/apache/spark/pull/2471#discussion_r18763243
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -214,6 +224,43 @@ private[history] class FsHistoryProvider(conf: 
SparkConf) extends ApplicationHis
 }
--- End diff --

@vanzin sorry, I don't understand what you mean. Do you mean that we should 
not throw Throwable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58876511
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21686/consoleFull)
 for   PR 2780 at commit 
[`d353596`](https://github.com/apache/spark/commit/d3535963cf69bf36e7b059f2c7fd6ee148892135).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3911] [SQL] HiveSimpleUdf can not be op...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2771#discussion_r18764783
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala 
---
@@ -99,6 +99,16 @@ private[hive] case class 
HiveSimpleUdf(functionClassName: String, children: Seq[
   @transient
   protected lazy val arguments = children.map(c => toInspector(c.dataType)).toArray
 
+  @transient
+  protected lazy val isUDFDeterministic = {
+    val udfType = function.getClass().getAnnotation(classOf[HiveUDFType])
+    (udfType != null && udfType.deterministic())
--- End diff --

Nit: redundant parentheses.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3911] [SQL] HiveSimpleUdf can not be op...

2014-10-13 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2771#issuecomment-58881732
  
This LGTM. Would you mind adding some tests? Probably in 
`ExpressionOptimizationSuite`. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58883501
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21686/consoleFull)
 for   PR 2780 at commit 
[`d353596`](https://github.com/apache/spark/commit/d3535963cf69bf36e7b059f2c7fd6ee148892135).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3207][MLLIB]Choose splits for continuou...

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2780#issuecomment-58883509
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21686/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18765864
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/hive/HiveFromSpark.scala 
---
@@ -62,6 +62,16 @@ object HiveFromSpark {
     println("Result of SELECT *:")
     sql("SELECT * FROM records r JOIN src s ON r.key = s.key").collect().foreach(println)
 
+    // Write out an RDD as an ORC file.
+    rdd.saveAsOrcFile("pair.orc")
+
+    // Read in the ORC file. ORC files are self-describing, so the schema is preserved.
+    val orcFile = hiveContext.orcFile("pair.orc")
+
+    // These files can also be registered as tables.
+    orcFile.registerTempTable("orcFile")
+    sql("SELECT * FROM records r JOIN orcFile s ON r.key = s.key").collect().foreach(println)
+
--- End diff --

I think test cases and documentation are better places to illustrate API 
usage. This example exists to show how Spark SQL cooperates with Hive, and 
with this PR we don't need Hive (the metastore) to access ORC files.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18765873
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala
 ---
@@ -114,6 +114,22 @@ case class InsertIntoTable(
   }
 }
 
+case class InsertIntoOrcTable(
+    table: LogicalPlan,
--- End diff --

Why is the type of `table` `LogicalPlan` rather than `OrcRelation`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18766058
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala
 ---
@@ -128,6 +144,13 @@ case class WriteToFile(
   override def output = child.output
 }
 
+case class WriteToOrcFile(
--- End diff --

I think we should rename the original `WriteToFile` class to 
`WriteToParquetFile`. That name was too general and rather confusing in the 
first place, and it becomes even more confusing after ORC is supported.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18766091
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala 
---
@@ -77,6 +77,18 @@ private[sql] trait SchemaRDDLike {
   }
 
   /**
+   * Saves the contents of this `SchemaRDD` as a orc file, preserving the 
schema.  Files that
--- End diff --

Please use ORC instead of orc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

2014-10-13 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2478#issuecomment-58884300
  
@JoshRosen @pwendell any further comment on this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18766361
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 ---
@@ -504,19 +505,41 @@ private[parquet] object FileSystemHelper {
 fs.listStatus(path).map(_.getPath)
   }
 
-  /**
-   * Finds the maximum taskid in the output file names at the given path.
-   */
-  def findMaxTaskId(pathStr: String, conf: Configuration): Int = {
+  /**
+   * List files with special extension
+   */
+  def listFiles(origPath: Path, conf: Configuration, extension: String): Seq[Path] = {
+    val fs = origPath.getFileSystem(conf)
+    if (fs == null) {
+      throw new IllegalArgumentException(
+        s"OrcTableOperations: Path $origPath is incorrectly formatted")
--- End diff --

This helper class is not specific to ORC support; please reword the 
exception message.
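
For example, something like:

```scala
throw new IllegalArgumentException(
  s"FileSystemHelper: Path $origPath is incorrectly formatted")
```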


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3818] Graph coarsening

2014-10-13 Thread uncleGen
Github user uncleGen commented on the pull request:

https://github.com/apache/spark/pull/2679#issuecomment-58885039
  
@ankurdave I have some doubts, though not about this patch. In the [GraphX 
OSDI paper](http://ankurdave.com/dl/graphx-osdi14.pdf), I see that you 
implemented a memory-based shuffle manager, but I cannot find it in any 
release. Do you have any concerns about it? I am actually working on a 
memory-based shuffle manager myself. Please give me some advice, thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18766443
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 ---
@@ -504,19 +505,41 @@ private[parquet] object FileSystemHelper {
 fs.listStatus(path).map(_.getPath)
   }
 
-  /**
-   * Finds the maximum taskid in the output file names at the given path.
-   */
-  def findMaxTaskId(pathStr: String, conf: Configuration): Int = {
+  /**
+   * List files with special extension
+   */
+  def listFiles(origPath: Path, conf: Configuration, extension: String): Seq[Path] = {
+    val fs = origPath.getFileSystem(conf)
+    if (fs == null) {
+      throw new IllegalArgumentException(
+        s"OrcTableOperations: Path $origPath is incorrectly formatted")
+    }
+    val path = origPath.makeQualified(fs)
+    if (fs.exists(path) && fs.getFileStatus(path).isDir) {
+      fs.listStatus(path).map(_.getPath).filter(p => p.getName.endsWith(extension))
--- End diff --

I think `FileSystem.globStatus` can be convenient and more efficient here.
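
For reference, a minimal sketch of what that could look like (keeping the 
current signature, and assuming `extension` is matched as a simple suffix):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def listFiles(origPath: Path, conf: Configuration, extension: String): Seq[Path] = {
  val fs = origPath.getFileSystem(conf)
  val path = origPath.makeQualified(fs)
  // globStatus pushes the suffix matching into the file system; it returns
  // null when nothing matches, hence the Option wrapper.
  Option(fs.globStatus(new Path(path, s"*$extension"))).toSeq.flatten.map(_.getPath)
}
```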


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3562]Periodic cleanup event logs

2014-10-13 Thread mattf
Github user mattf commented on the pull request:

https://github.com/apache/spark/pull/2471#issuecomment-58885115
  
> @mattf I understand what you're trying to say, but think about it in
> context. As I said above, the "when to poll the file system" code is the
> most trivial part of this change. The only advantage of using cron for that
> is that you'd have more scheduling options - e.g., absolute times instead
> of a period.
>
> To achieve that, you'd be considerably complicating everything else.
> You'd be creating a new command line tool in Spark, that needs to deal with
> command line arguments, be documented, and handle security settings (e.g.
> kerberos) - so it's more burden for everybody, maintainers of the code and
> admins alike.
>
> And all that for a trivial, and I'd say, not really needed gain in
> functionality.

@aw-altiscale pointed me to Camus, which has a nearly separable component: 
https://github.com/linkedin/camus/tree/master/camus-sweeper

My objection is about the architecture and responsibilities of the Spark 
components; I don't object to having the functionality.

I think you should implement the ability to sweep/rotate/clean log files in 
HDFS, but not as part of a Spark process.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18766608
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala
 ---
@@ -504,19 +505,41 @@ private[parquet] object FileSystemHelper {
 fs.listStatus(path).map(_.getPath)
   }
 
-  /**
-   * Finds the maximum taskid in the output file names at the given path.
-   */
-  def findMaxTaskId(pathStr: String, conf: Configuration): Int = {
+  /**
+   * List files with special extension
+   */
+  def listFiles(origPath: Path, conf: Configuration, extension: String): Seq[Path] = {
+    val fs = origPath.getFileSystem(conf)
+    if (fs == null) {
+      throw new IllegalArgumentException(
+        s"OrcTableOperations: Path $origPath is incorrectly formatted")
+    }
+    val path = origPath.makeQualified(fs)
+    if (fs.exists(path) && fs.getFileStatus(path).isDir) {
+      fs.listStatus(path).map(_.getPath).filter(p => p.getName.endsWith(extension))
+    } else {
+      Seq.empty
+    }
+  }
+
+  /**
+   * Finds the maximum taskid in the output file names at the given path.
+   */
+  def findMaxTaskId(pathStr: String, conf: Configuration, extension: String): Int = {
     val files = FileSystemHelper.listFiles(pathStr, conf)
-    // filename pattern is part-r-<int>.parquet
-    val nameP = new scala.util.matching.Regex("""part-r-(\d{1,}).parquet""", "taskid")
+    // filename pattern is part-r-<int>.$extension
+    val nameP = extension match {
+      case "parquet" => new scala.util.matching.Regex("""part-r-(\d{1,}).parquet""", "taskid")
+      case "orc" => new scala.util.matching.Regex("""part-r-(\d{1,}).orc""", "taskid")
+      case _ =>
+        sys.error(s"ERROR: unsupported extension: $extension")
+    }
--- End diff --

Move this `match` expression to the beginning of this function since 
`.listFiles` can be expensive. Also this expression can be simplified to:

```scala
require(Seq("orc", "parquet").contains(extension), s"Unsupported extension: $extension")
val nameP = new Regex(s"""part-r-(\d{1,}).$extension""", "taskid")
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18766670
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -121,6 +122,48 @@ class HiveContext(sc: SparkContext) extends 
SQLContext(sc) {
   }
 
   /**
+   * Loads a Orc file, returning the result as a [[SchemaRDD]].
+   *
+   * @group userf
+   */
+  def orcFile(path: String): SchemaRDD =
+    new SchemaRDD(this, orc.OrcRelation(path, Some(sparkContext.hadoopConfiguration), this))
+
+  /**
+   * :: Experimental ::
+   * Creates an empty orc file with the schema of class `A`, which can be 
registered as a table.
--- End diff --

Capitalize orc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18766697
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -121,6 +122,48 @@ class HiveContext(sc: SparkContext) extends 
SQLContext(sc) {
   }
 
   /**
+   * Loads a Orc file, returning the result as a [[SchemaRDD]].
+   *
+   * @group userf
+   */
+  def orcFile(path: String): SchemaRDD =
+    new SchemaRDD(this, orc.OrcRelation(path, Some(sparkContext.hadoopConfiguration), this))
+
+  /**
+   * :: Experimental ::
+   * Creates an empty orc file with the schema of class `A`, which can be 
registered as a table.
+   * This registered table can be used as the target of future 
`insertInto` operations.
+   *
+   * {{{
+   *   val sqlContext = new SQLContext(...)
--- End diff --

Should be `HiveContext`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18766749
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -28,6 +28,7 @@ import org.apache.spark.sql.catalyst.types.StringType
 import org.apache.spark.sql.execution.{DescribeCommand, OutputFaker, 
SparkPlan}
 import org.apache.spark.sql.hive
 import org.apache.spark.sql.hive.execution._
+import org.apache.spark.sql.hive.orc.{InsertIntoOrcTable, OrcTableScan, 
OrcRelation}
--- End diff --

Sort imported classes in this line.
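
For example:

```scala
import org.apache.spark.sql.hive.orc.{InsertIntoOrcTable, OrcRelation, OrcTableScan}
```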


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18766771
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -221,4 +222,24 @@ private[hive] trait HiveStrategies {
   case _ = Nil
 }
   }
+
+  object OrcOperations extends Strategy {
+    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
+      case logical.WriteToOrcFile(path, child) =>
+        val relation =
+          OrcRelation.create(path, child, sparkContext.hadoopConfiguration, sqlContext)
+        InsertIntoOrcTable(relation, planLater(child), overwrite=true) :: Nil
--- End diff --

Spaces around `=`.
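
i.e.:

```scala
InsertIntoOrcTable(relation, planLater(child), overwrite = true) :: Nil
```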


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread ScrapCodes
GitHub user ScrapCodes opened a pull request:

https://github.com/apache/spark/pull/2782

SPARK-3874, Provide stable TaskContext API



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ScrapCodes/spark-1 SPARK-3874/stable-tc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2782.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2782


commit ef633f5e4857400c8711ee800b01016b6bd406b2
Author: Prashant Sharma prashan...@imaginea.com
Date:   2014-10-13T12:41:11Z

SPARK-3874, Provide stable TaskContext API




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2782#issuecomment-58886892
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18767300
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.orc
+
+import java.util.Properties
+import java.io.IOException
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+
+import scala.collection.mutable
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.permission.FsAction
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
+import org.apache.hadoop.hive.ql.io.orc._
+import org.apache.hadoop.hive.ql.io.orc.OrcProto.Type.Kind
+
+import org.apache.spark.sql.parquet.FileSystemHelper
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LeafNode}
+import org.apache.spark.sql.catalyst.analysis.{UnresolvedException, 
MultiInstanceRelation}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.types._
+
+
--- End diff --

Remove redundant new line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2782#issuecomment-58887166
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21687/consoleFull)
 for   PR 2782 at commit 
[`ef633f5`](https://github.com/apache/spark/commit/ef633f5e4857400c8711ee800b01016b6bd406b2).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2782#issuecomment-58887515
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21687/consoleFull)
 for   PR 2782 at commit 
[`ef633f5`](https://github.com/apache/spark/commit/ef633f5e4857400c8711ee800b01016b6bd406b2).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public abstract class TaskContext implements Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2782#issuecomment-58887519
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21687/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/2782#issuecomment-5896
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18768084
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.orc
+
+import java.util.Properties
+import java.io.IOException
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+
+import scala.collection.mutable
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.permission.FsAction
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
+import org.apache.hadoop.hive.ql.io.orc._
+import org.apache.hadoop.hive.ql.io.orc.OrcProto.Type.Kind
+
+import org.apache.spark.sql.parquet.FileSystemHelper
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LeafNode}
+import org.apache.spark.sql.catalyst.analysis.{UnresolvedException, 
MultiInstanceRelation}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.types._
+
+
+private[sql] case class OrcRelation(
+    path: String,
+    @transient conf: Option[Configuration],
+    @transient sqlContext: SQLContext,
+    partitioningAttributes: Seq[Attribute] = Nil)
+  extends LeafNode with MultiInstanceRelation {
+  self: Product =>
+
+  val prop: Properties = new Properties
+
+  var rowClass: Class[_] = null
+
+  val fieldIdCache: mutable.Map[String, Int] = new mutable.HashMap[String, Int]
+
+  val fieldNameTypeCache: mutable.Map[String, String] = new mutable.HashMap[String, String]
+
+  override val output = orcSchema
+
+  override lazy val statistics = Statistics(sizeInBytes = sqlContext.defaultSizeInBytes)
+
+  def orcSchema: Seq[Attribute] = {
+    val origPath = new Path(path)
+    val reader = OrcFileOperator.readMetaData(origPath, conf)
+
+    if (null != reader) {
+      val inspector = reader.getObjectInspector.asInstanceOf[StructObjectInspector]
+      val fields = inspector.getAllStructFieldRefs
+
+      if (fields.size() == 0) {
+        return Seq.empty
+      }
+
+      val totalType = reader.getTypes.get(0)
+      val keys = totalType.getFieldNamesList
+      val types = totalType.getSubtypesList
+      log.info("field name is {}", keys)
+      log.info("types is {}", types)
+
+      val colBuff = new StringBuilder
+      val typeBuff = new StringBuilder
+      for (i <- 0 until fields.size()) {
+        val fieldName = fields.get(i).getFieldName
+        val typeName = fields.get(i).getFieldObjectInspector.getTypeName
+        colBuff.append(fieldName)
+        fieldNameTypeCache.put(fieldName, typeName)
+        fieldIdCache.put(fieldName, i)
+        colBuff.append(",")
+        typeBuff.append(typeName)
+        typeBuff.append(":")
+      }
+      colBuff.setLength(colBuff.length - 1)
+      typeBuff.setLength(typeBuff.length - 1)
+      prop.setProperty("columns", colBuff.toString())
+      prop.setProperty("columns.types", typeBuff.toString())
+      val attributes = convertToAttributes(reader, keys, types)
+      attributes
+    } else {
+      Seq.empty
+    }
+  }
+
+  def convertToAttributes(
+      reader: Reader,
+      keys: java.util.List[String],
+      types: java.util.List[Integer]): Seq[Attribute] = {
+    val range = 0.until(keys.size())
+    range.map {
+      i => reader.getTypes.get(types.get(i)).getKind match {
+        case Kind.BOOLEAN =>
+          new AttributeReference(keys.get(i), BooleanType, false)()
+        case Kind.STRING =>
+          new AttributeReference(keys.get(i), StringType, true)()
+        case

[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2782#issuecomment-58889380
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21688/consoleFull)
 for   PR 2782 at commit 
[`ef633f5`](https://github.com/apache/spark/commit/ef633f5e4857400c8711ee800b01016b6bd406b2).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18768122
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.orc
+
+import java.util.Properties
+import java.io.IOException
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+
+import scala.collection.mutable
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.permission.FsAction
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
+import org.apache.hadoop.hive.ql.io.orc._
+import org.apache.hadoop.hive.ql.io.orc.OrcProto.Type.Kind
+
+import org.apache.spark.sql.parquet.FileSystemHelper
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LeafNode}
+import org.apache.spark.sql.catalyst.analysis.{UnresolvedException, 
MultiInstanceRelation}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.types._
+
+
+private[sql] case class OrcRelation(
+    path: String,
+    @transient conf: Option[Configuration],
+    @transient sqlContext: SQLContext,
+    partitioningAttributes: Seq[Attribute] = Nil)
+  extends LeafNode with MultiInstanceRelation {
+  self: Product =>
+
+  val prop: Properties = new Properties
+
+  var rowClass: Class[_] = null
+
+  val fieldIdCache: mutable.Map[String, Int] = new mutable.HashMap[String, Int]
+
+  val fieldNameTypeCache: mutable.Map[String, String] = new mutable.HashMap[String, String]
+
+  override val output = orcSchema
+
+  override lazy val statistics = Statistics(sizeInBytes = sqlContext.defaultSizeInBytes)
+
+  def orcSchema: Seq[Attribute] = {
+    val origPath = new Path(path)
+    val reader = OrcFileOperator.readMetaData(origPath, conf)
+
+    if (null != reader) {
+      val inspector = reader.getObjectInspector.asInstanceOf[StructObjectInspector]
+      val fields = inspector.getAllStructFieldRefs
+
+      if (fields.size() == 0) {
+        return Seq.empty
+      }
+
+      val totalType = reader.getTypes.get(0)
+      val keys = totalType.getFieldNamesList
+      val types = totalType.getSubtypesList
+      log.info("field name is {}", keys)
+      log.info("types is {}", types)
+
+      val colBuff = new StringBuilder
+      val typeBuff = new StringBuilder
+      for (i <- 0 until fields.size()) {
+        val fieldName = fields.get(i).getFieldName
+        val typeName = fields.get(i).getFieldObjectInspector.getTypeName
+        colBuff.append(fieldName)
+        fieldNameTypeCache.put(fieldName, typeName)
+        fieldIdCache.put(fieldName, i)
+        colBuff.append(",")
+        typeBuff.append(typeName)
+        typeBuff.append(":")
+      }
+      colBuff.setLength(colBuff.length - 1)
+      typeBuff.setLength(typeBuff.length - 1)
+      prop.setProperty("columns", colBuff.toString())
+      prop.setProperty("columns.types", typeBuff.toString())
+      val attributes = convertToAttributes(reader, keys, types)
+      attributes
+    } else {
+      Seq.empty
+    }
+  }
+
+  def convertToAttributes(
+      reader: Reader,
+      keys: java.util.List[String],
+      types: java.util.List[Integer]): Seq[Attribute] = {
+    val range = 0.until(keys.size())
+    range.map {
+      i => reader.getTypes.get(types.get(i)).getKind match {
+        case Kind.BOOLEAN =>
+          new AttributeReference(keys.get(i), BooleanType, false)()
+        case Kind.STRING =>
+          new AttributeReference(keys.get(i), StringType, true)()
+        case

[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2782#issuecomment-58889799
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21688/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2782#issuecomment-58889796
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21688/consoleFull)
 for   PR 2782 at commit 
[`ef633f5`](https://github.com/apache/spark/commit/ef633f5e4857400c8711ee800b01016b6bd406b2).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public abstract class TaskContext implements Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1405][MLLIB] topic modeling on Graphx

2014-10-13 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/2388#discussion_r18768316
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/TopicModeling.scala ---
@@ -0,0 +1,682 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, sum => brzSum}
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.graphx._
+import org.apache.spark.Logging
+import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
+import org.apache.spark.mllib.linalg.{DenseVector => SDV, SparseVector => SSV, Vector => SV}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.serializer.KryoRegistrator
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.SparkContext._
+
+import TopicModeling._
+
+class TopicModeling private[mllib](
+  @transient var corpus: Graph[VD, ED],
+  val numTopics: Int,
+  val numTerms: Int,
+  val alpha: Double,
+  val beta: Double,
+  @transient val storageLevel: StorageLevel)
+  extends Serializable with Logging {
+
+  def this(docs: RDD[(TopicModeling.DocId, SSV)],
+    numTopics: Int,
+    alpha: Double,
+    beta: Double,
+    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK,
+    computedModel: Broadcast[TopicModel] = null) {
+    this(initializeCorpus(docs, numTopics, storageLevel, computedModel),
+      numTopics, docs.first()._2.size, alpha, beta, storageLevel)
+  }
+
+
+  /**
+   * The number of documents in the corpus
+   */
+  val numDocs = docVertices.count()
+
+  /**
+   * The number of terms in the corpus
+   */
+  private val sumTerms = corpus.edges.map(e => e.attr.size.toDouble).sum().toLong
+
+  /**
+   * The total counts for each topic
+   */
+  @transient private var globalTopicCounter: BDV[Count] = collectGlobalCounter(corpus, numTopics)
+  assert(brzSum(globalTopicCounter) == sumTerms)
+
+  @transient private val sc = corpus.vertices.context
+  @transient private val seed = new Random().nextInt()
+  @transient private var innerIter = 1
+  @transient private var cachedEdges: EdgeRDD[ED, VD] = corpus.edges
+  @transient private var cachedVertices: VertexRDD[VD] = corpus.vertices
+
+  private def termVertices = corpus.vertices.filter(t => t._1 >= 0)
+
+  private def docVertices = corpus.vertices.filter(t => t._1 < 0)
+
+  private def checkpoint(): Unit = {
+    if (innerIter % 10 == 0 && sc.getCheckpointDir.isDefined) {
+      val edges = corpus.edges.map(t => t)
+      edges.checkpoint()
+      val newCorpus: Graph[VD, ED] = Graph.fromEdges(edges, null,
+        storageLevel, storageLevel)
+      corpus = updateCounter(newCorpus, numTopics).cache()
+    }
+  }
+
+  private def gibbsSampling(): Unit = {
+    val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter,
+      sumTerms, numTerms, numTopics, alpha, beta)
+
+    val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter,
+      sumTerms, innerIter + seed, numTerms, numTopics, alpha, beta)
+    corpusSampleTopics.edges.setName(s"edges-$innerIter").cache().count()
+    Option(cachedEdges).foreach(_.unpersist())
+    cachedEdges = corpusSampleTopics.edges
+
+    corpus = updateCounter(corpusSampleTopics, numTopics)
+    corpus.vertices.setName(s"vertices-$innerIter").cache()
+    globalTopicCounter = collectGlobalCounter(corpus, numTopics)
+    assert(brzSum(globalTopicCounter) == sumTerms)
+    Option(cachedVertices).foreach(_.unpersist())
+    cachedVertices = corpus.vertices
+
+    checkpoint()
+    innerIter += 1
+  }
+
+  def saveTopicModel(burnInIter: Int): TopicModel = {
+    val topicModel =

[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18768380
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.orc
+
+import java.util.Properties
+import java.io.IOException
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+
+import scala.collection.mutable
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.permission.FsAction
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
+import org.apache.hadoop.hive.ql.io.orc._
+import org.apache.hadoop.hive.ql.io.orc.OrcProto.Type.Kind
+
+import org.apache.spark.sql.parquet.FileSystemHelper
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LeafNode}
+import org.apache.spark.sql.catalyst.analysis.{UnresolvedException, 
MultiInstanceRelation}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.types._
+
+
+private[sql] case class OrcRelation(
+    path: String,
+    @transient conf: Option[Configuration],
+    @transient sqlContext: SQLContext,
+    partitioningAttributes: Seq[Attribute] = Nil)
+  extends LeafNode with MultiInstanceRelation {
+  self: Product =>
+
+  val prop: Properties = new Properties
+
+  var rowClass: Class[_] = null
+
+  val fieldIdCache: mutable.Map[String, Int] = new mutable.HashMap[String, Int]
+
+  val fieldNameTypeCache: mutable.Map[String, String] = new mutable.HashMap[String, String]
+
+  override val output = orcSchema
+
+  override lazy val statistics = Statistics(sizeInBytes = sqlContext.defaultSizeInBytes)
+
+  def orcSchema: Seq[Attribute] = {
--- End diff --

Please add comments to explain how you get column info from the metadata of 
an ORC file.
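
For example, a short standalone sketch of the flow (hypothetical path, using 
Hive's ORC reader API as this PR does):

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.OrcFile
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector

// Read the ORC file footer; its ObjectInspector describes the top-level
// struct, whose fields are the table's columns.
val path = new Path("/tmp/example.orc")
val reader = OrcFile.createReader(path.getFileSystem(new Configuration()), path)
val inspector = reader.getObjectInspector.asInstanceOf[StructObjectInspector]
inspector.getAllStructFieldRefs.asScala.foreach { field =>
  println(s"${field.getFieldName}: ${field.getFieldObjectInspector.getTypeName}")
}
```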


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18768433
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.orc
+
+import java.util.Properties
+import java.io.IOException
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+
+import scala.collection.mutable
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.permission.FsAction
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
+import org.apache.hadoop.hive.ql.io.orc._
+import org.apache.hadoop.hive.ql.io.orc.OrcProto.Type.Kind
+
+import org.apache.spark.sql.parquet.FileSystemHelper
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LeafNode}
+import org.apache.spark.sql.catalyst.analysis.{UnresolvedException, 
MultiInstanceRelation}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.types._
+
+
+private[sql] case class OrcRelation(
+path: String,
+@transient conf: Option[Configuration],
+@transient sqlContext: SQLContext,
+partitioningAttributes: Seq[Attribute] = Nil)
+  extends LeafNode with MultiInstanceRelation {
+  self: Product =>
+
+  val prop: Properties = new Properties
+
+  var rowClass: Class[_] = null
+
+  val fieldIdCache: mutable.Map[String, Int] = new mutable.HashMap[String, 
Int]
+
+  val fieldNameTypeCache: mutable.Map[String, String] = new 
mutable.HashMap[String, String]
+
+  override val output = orcSchema
+
+  override lazy val statistics = Statistics(sizeInBytes = 
sqlContext.defaultSizeInBytes)
+
+  def orcSchema: Seq[Attribute] = {
+val origPath = new Path(path)
+val reader = OrcFileOperator.readMetaData(origPath, conf)
+
+if (null != reader) {
+  val inspector = 
reader.getObjectInspector.asInstanceOf[StructObjectInspector]
+  val fields = inspector.getAllStructFieldRefs
+
+  if (fields.size() == 0) {
+return Seq.empty
+  }
+
+  val totalType = reader.getTypes.get(0)
+  val keys = totalType.getFieldNamesList
+  val types = totalType.getSubtypesList
+  log.info("field name is {}", keys)
+  log.info("types is {}", types)
+
+  val colBuff = new StringBuilder
+  val typeBuff = new StringBuilder
+for (i <- 0 until fields.size()) {
+val fieldName = fields.get(i).getFieldName
+val typeName = fields.get(i).getFieldObjectInspector.getTypeName
+colBuff.append(fieldName)
+fieldNameTypeCache.put(fieldName, typeName)
+fieldIdCache.put(fieldName, i)
+colBuff.append(",")
+typeBuff.append(typeName)
+typeBuff.append(":")
+  }
+  colBuff.setLength(colBuff.length - 1)
+  typeBuff.setLength(typeBuff.length - 1)
+  prop.setProperty("columns", colBuff.toString())
+  prop.setProperty("columns.types", typeBuff.toString())
+  val attributes = convertToAttributes(reader, keys, types)
+  attributes
+} else {
+  Seq.empty
+}
+  }
+
+  def convertToAttributes(
+  reader: Reader,
+  keys: java.util.List[String],
+  types: java.util.List[Integer]): Seq[Attribute] = {
+val range = 0.until(keys.size())
+range.map {
+  i => reader.getTypes.get(types.get(i)).getKind match {
+case Kind.BOOLEAN =>
+  new AttributeReference(keys.get(i), BooleanType, false)()
+case Kind.STRING =>
+  new AttributeReference(keys.get(i), StringType, true)()
+case Kind.BYTE =>
 

[GitHub] spark pull request: SPARK-3874, Provide stable TaskContext API

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2782#issuecomment-58890576
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21689/consoleFull)
 for   PR 2782 at commit 
[`bbd9e05`](https://github.com/apache/spark/commit/bbd9e057a24cd25336a806dce41b2cbd1ebc3233).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18768525
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.orc
+
+import java.util.Properties
+import java.io.IOException
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+
+import scala.collection.mutable
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.permission.FsAction
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
+import org.apache.hadoop.hive.ql.io.orc._
+import org.apache.hadoop.hive.ql.io.orc.OrcProto.Type.Kind
+
+import org.apache.spark.sql.parquet.FileSystemHelper
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LeafNode}
+import org.apache.spark.sql.catalyst.analysis.{UnresolvedException, 
MultiInstanceRelation}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.types._
+
+
+private[sql] case class OrcRelation(
+path: String,
+@transient conf: Option[Configuration],
+@transient sqlContext: SQLContext,
+partitioningAttributes: Seq[Attribute] = Nil)
+  extends LeafNode with MultiInstanceRelation {
+  self: Product =>
+
+  val prop: Properties = new Properties
+
+  var rowClass: Class[_] = null
+
+  val fieldIdCache: mutable.Map[String, Int] = new mutable.HashMap[String, 
Int]
+
+  val fieldNameTypeCache: mutable.Map[String, String] = new 
mutable.HashMap[String, String]
+
+  override val output = orcSchema
+
+  override lazy val statistics = Statistics(sizeInBytes = 
sqlContext.defaultSizeInBytes)
+
+  def orcSchema: Seq[Attribute] = {
+val origPath = new Path(path)
+val reader = OrcFileOperator.readMetaData(origPath, conf)
+
+if (null != reader) {
+  val inspector = 
reader.getObjectInspector.asInstanceOf[StructObjectInspector]
+  val fields = inspector.getAllStructFieldRefs
+
+  if (fields.size() == 0) {
+return Seq.empty
+  }
+
+  val totalType = reader.getTypes.get(0)
+  val keys = totalType.getFieldNamesList
+  val types = totalType.getSubtypesList
+  log.info("field name is {}", keys)
+  log.info("types is {}", types)
+
+  val colBuff = new StringBuilder
+  val typeBuff = new StringBuilder
+for (i <- 0 until fields.size()) {
+val fieldName = fields.get(i).getFieldName
+val typeName = fields.get(i).getFieldObjectInspector.getTypeName
+colBuff.append(fieldName)
+fieldNameTypeCache.put(fieldName, typeName)
+fieldIdCache.put(fieldName, i)
+colBuff.append(",")
+typeBuff.append(typeName)
+typeBuff.append(":")
+  }
+  colBuff.setLength(colBuff.length - 1)
+  typeBuff.setLength(typeBuff.length - 1)
+  prop.setProperty("columns", colBuff.toString())
+  prop.setProperty("columns.types", typeBuff.toString())
+  val attributes = convertToAttributes(reader, keys, types)
+  attributes
+} else {
+  Seq.empty
+}
+  }
+
+  def convertToAttributes(
+  reader: Reader,
+  keys: java.util.List[String],
+  types: java.util.List[Integer]): Seq[Attribute] = {
+val range = 0.until(keys.size())
+range.map {
+  i => reader.getTypes.get(types.get(i)).getKind match {
+case Kind.BOOLEAN =>
+  new AttributeReference(keys.get(i), BooleanType, false)()
+case Kind.STRING =>
+  new AttributeReference(keys.get(i), StringType, true)()
+case Kind.BYTE =>
 

[GitHub] spark pull request: [spark-3586][streaming]Support nested director...

2014-10-13 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/2765#issuecomment-58890751
  
Hi @wangxiaojing, a small suggestion: why not make this improvement more 
flexible by adding a parameter to control the search depth of directories? 
That would be more general than the current one-level implementation. Like:

```scala
class FileInputDStream[K: ClassTag, V: ClassTag, F <: NewInputFormat[K,V] : ClassTag](
@transient ssc_ : StreamingContext,
directory: String,
filter: Path => Boolean = FileInputDStream.defaultFilter,
depth: Int = 1,
newFilesOnly: Boolean = true)
```
People can use this parameter to control the search depth; the default of 1 
keeps the same semantics as the current code. A minimal sketch of such a 
depth-limited scan is below.
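
For instance (a hypothetical `listFiles` helper, using only standard Hadoop 
`FileSystem` calls; `isDir` is the Hadoop 1.x-compatible check):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: collect files up to `depth` directory levels below
// `root`; depth = 1 reproduces the current single-level behaviour.
def listFiles(fs: FileSystem, root: Path,
    filter: Path => Boolean, depth: Int): Seq[Path] = {
  fs.listStatus(root).toSeq.flatMap { status =>
    if (status.isDir) {
      // Recurse only while we still have depth budget left.
      if (depth > 1) listFiles(fs, status.getPath, filter, depth - 1)
      else Seq.empty
    } else if (filter(status.getPath)) {
      Seq(status.getPath)
    } else {
      Seq.empty
    }
  }
}
```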

Besides, some whitespace-related style issues should be fixed to align with 
the Scala style guide.

 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18768639
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.orc
+
+import java.util.Properties
+import java.io.IOException
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+
+import scala.collection.mutable
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.permission.FsAction
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
+import org.apache.hadoop.hive.ql.io.orc._
+import org.apache.hadoop.hive.ql.io.orc.OrcProto.Type.Kind
+
+import org.apache.spark.sql.parquet.FileSystemHelper
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LeafNode}
+import org.apache.spark.sql.catalyst.analysis.{UnresolvedException, 
MultiInstanceRelation}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.types._
+
+
+private[sql] case class OrcRelation(
+path: String,
+@transient conf: Option[Configuration],
+@transient sqlContext: SQLContext,
+partitioningAttributes: Seq[Attribute] = Nil)
+  extends LeafNode with MultiInstanceRelation {
+  self: Product =>
+
+  val prop: Properties = new Properties
+
+  var rowClass: Class[_] = null
+
+  val fieldIdCache: mutable.Map[String, Int] = new mutable.HashMap[String, 
Int]
+
+  val fieldNameTypeCache: mutable.Map[String, String] = new 
mutable.HashMap[String, String]
+
+  override val output = orcSchema
+
+  override lazy val statistics = Statistics(sizeInBytes = 
sqlContext.defaultSizeInBytes)
+
+  def orcSchema: Seq[Attribute] = {
+val origPath = new Path(path)
+val reader = OrcFileOperator.readMetaData(origPath, conf)
+
+if (null != reader) {
+  val inspector = 
reader.getObjectInspector.asInstanceOf[StructObjectInspector]
+  val fields = inspector.getAllStructFieldRefs
+
+  if (fields.size() == 0) {
+return Seq.empty
+  }
+
+  val totalType = reader.getTypes.get(0)
+  val keys = totalType.getFieldNamesList
+  val types = totalType.getSubtypesList
+  log.info("field name is {}", keys)
--- End diff --

Field names are...

Also use `logInfo` instead.
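
For example, with the by-name `logInfo` from Spark's `Logging` trait (a 
sketch; assumes `keys` is the `java.util.List[String]` from the diff):

```scala
import scala.collection.JavaConversions._

// logInfo takes a by-name String, so the interpolation only runs when
// the INFO level is enabled.
logInfo(s"Field names are: ${keys.mkString(", ")}")
```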


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18768648
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.orc
+
+import java.util.Properties
+import java.io.IOException
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+
+import scala.collection.mutable
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.permission.FsAction
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
+import org.apache.hadoop.hive.ql.io.orc._
+import org.apache.hadoop.hive.ql.io.orc.OrcProto.Type.Kind
+
+import org.apache.spark.sql.parquet.FileSystemHelper
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LeafNode}
+import org.apache.spark.sql.catalyst.analysis.{UnresolvedException, 
MultiInstanceRelation}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.types._
+
+
+private[sql] case class OrcRelation(
+path: String,
+@transient conf: Option[Configuration],
+@transient sqlContext: SQLContext,
+partitioningAttributes: Seq[Attribute] = Nil)
+  extends LeafNode with MultiInstanceRelation {
+  self: Product =>
+
+  val prop: Properties = new Properties
+
+  var rowClass: Class[_] = null
+
+  val fieldIdCache: mutable.Map[String, Int] = new mutable.HashMap[String, 
Int]
+
+  val fieldNameTypeCache: mutable.Map[String, String] = new 
mutable.HashMap[String, String]
+
+  override val output = orcSchema
+
+  override lazy val statistics = Statistics(sizeInBytes = 
sqlContext.defaultSizeInBytes)
+
+  def orcSchema: Seq[Attribute] = {
+val origPath = new Path(path)
+val reader = OrcFileOperator.readMetaData(origPath, conf)
+
+if (null != reader) {
+  val inspector = 
reader.getObjectInspector.asInstanceOf[StructObjectInspector]
+  val fields = inspector.getAllStructFieldRefs
+
+  if (fields.size() == 0) {
+return Seq.empty
+  }
+
+  val totalType = reader.getTypes.get(0)
+  val keys = totalType.getFieldNamesList
+  val types = totalType.getSubtypesList
+  log.info("field name is {}", keys)
+  log.info("types is {}", types)
--- End diff --

Types are ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3720][SQL]initial support ORC in spark ...

2014-10-13 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2576#discussion_r18768785
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.orc
+
+import java.util.Properties
+import java.io.IOException
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+
+import scala.collection.mutable
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.permission.FsAction
+import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
+import org.apache.hadoop.hive.ql.io.orc._
+import org.apache.hadoop.hive.ql.io.orc.OrcProto.Type.Kind
+
+import org.apache.spark.sql.parquet.FileSystemHelper
+import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, LeafNode}
+import org.apache.spark.sql.catalyst.analysis.{UnresolvedException, 
MultiInstanceRelation}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.catalyst.types._
+
+
+private[sql] case class OrcRelation(
+path: String,
+@transient conf: Option[Configuration],
+@transient sqlContext: SQLContext,
+partitioningAttributes: Seq[Attribute] = Nil)
+  extends LeafNode with MultiInstanceRelation {
+  self: Product =>
+
+  val prop: Properties = new Properties
+
+  var rowClass: Class[_] = null
+
+  val fieldIdCache: mutable.Map[String, Int] = new mutable.HashMap[String, 
Int]
+
+  val fieldNameTypeCache: mutable.Map[String, String] = new 
mutable.HashMap[String, String]
--- End diff --

Seems that this field is not used anywhere...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


