[jira] [Comment Edited] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306688#comment-14306688 ] Manoj Kumar edited comment on SPARK-5016 at 2/5/15 8:09 AM: Hi, I would like to fix this (since I'm familiar to an extent with this part of the code) and maybe we could merge this before the sparse input issue. 1. As a heuristic, how large should k be? 2. By distribute, do you mean to store samples (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L140) as a collection using sc.parallelize, so that it can be operated on in parallel across k? What role does numFeatures have? Thanks. was (Author: mechcoder): Hi, I would like to fix this (since I'm familiar to an extent with this part of the code) and maybe we could merge this before the sparse input issue. 1. As a heuristic, how large should k be? 2. By distribute, do you mean to store samples (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L140) as a collection using sc.parallelize, so that it can be operated on in parallel across k. Thanks. GaussianMixtureEM should distribute matrix inverse for large numFeatures, k --- Key: SPARK-5016 URL: https://issues.apache.org/jira/browse/SPARK-5016 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
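A minimal sketch of the distribution idea discussed above, assuming one task per mixture component; the method name and Breeze-based types are illustrative only and are not the actual GaussianMixture code:
{code}
import breeze.linalg.{det, inv, DenseMatrix => BDM}
import org.apache.spark.SparkContext

// Compute each component's covariance inverse (and determinant) in its own
// task instead of looping over the k components on the driver.
def distributeGaussianInit(
    sc: SparkContext,
    covariances: Array[BDM[Double]]): Array[(BDM[Double], Double)] = {
  sc.parallelize(covariances.zipWithIndex, covariances.length)
    .map { case (sigma, i) => (i, (inv(sigma), det(sigma))) }
    .collect()
    .sortBy(_._1)
    .map(_._2)
}
{code}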
[jira] [Commented] (SPARK-5604) Remove setCheckpointDir from LDA and tree Strategy
[ https://issues.apache.org/jira/browse/SPARK-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306810#comment-14306810 ] Apache Spark commented on SPARK-5604: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4390 Remove setCheckpointDir from LDA and tree Strategy -- Key: SPARK-5604 URL: https://issues.apache.org/jira/browse/SPARK-5604 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Continue the discussion from the LDA PR. checkpointDir is a global Spark configuration, which should not be altered by an ML algorithm. We could check whether checkpointDir is set if checkpointInterval is positive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
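A minimal sketch of the suggested check, assuming it lives next to the algorithm's other parameter validation; the method name is hypothetical:
{code}
import org.apache.spark.SparkContext

// Fail fast if checkpointing is requested but no global checkpoint directory
// has been configured on the SparkContext.
def validateCheckpointing(sc: SparkContext, checkpointInterval: Int): Unit = {
  if (checkpointInterval > 0) {
    require(sc.getCheckpointDir.isDefined,
      "checkpointInterval is positive but no checkpoint directory is set; " +
        "call SparkContext.setCheckpointDir first")
  }
}
{code}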
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308596#comment-14308596 ] Shekhar Bansal commented on SPARK-5081: --- I faced the same problem; moving to lz4 compression did the trick for me. Try spark.io.compression.codec=lz4 Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of shuffle write shown in the spark web UI is much different when I execute the same spark job with the same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set the spark.shuffle.manager option to hash because its default value has changed, but spark 1.2 still writes more shuffle output than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs to take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
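For reference, a hedged example of switching the codec as suggested above (valid short names in Spark 1.2 include lz4, lzf, and snappy); the application name is arbitrary:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Use lz4 instead of the default codec for shuffle and other internal data.
val conf = new SparkConf()
  .setAppName("shuffle-codec-test")
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)
{code}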
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308598#comment-14308598 ] Andrew Or commented on SPARK-5388: -- [~tigerquoll] I still don't think we should use DELETE for kill for the following reason. In normal REST servers that host static resources, if you GET after a DELETE, you run into a 404. Here, our resources are by no means static, and if you GET after a DELETE you actually get a different status (that your driver is now KILLED instead of RUNNING). Because of these side-effects I think it is safest to use POST. [~vanzin] - The action field is actually required, especially since many of the responses look quite alike. We need to know how to deserialize the messages safely in case the response we get from the server is not the type that we expect it to be (e.g. ErrorResponse). - Yes, I could rename the protocolVersion field. - The issue with having non-String types is that you will need to deal with numeric and boolean values specially. For instance, if the user does not explicitly set the field there is no easy way to not include it in the JSON without doing some Option hack. I went down that route and opted instead for simpler code. - The unknown fields reporting is added in the PR but is missing in the spec. In the PR it is reported in its own explicit field. - Even in the existing interface you can use o.a.s.deploy.Client to kill an application, and the security guarantees there are the same. I agree that it is something we need to address at some point, but I prefer to keep that outside the scope of this patch. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308615#comment-14308615 ] Guoqiang Li commented on SPARK-5556: LightLDA's sampling complexity is O(1) per token. The paper: http://arxiv.org/abs/1412.1576 The code (work in progress): https://github.com/witgo/spark/tree/LightLDA Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5598) Model import/export for ALS
[ https://issues.apache.org/jira/browse/SPARK-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308692#comment-14308692 ] Sean Owen commented on SPARK-5598: -- [~mengxr] No, no other tool could usefully read such a PMML file. The only argument for it would be consistency: you probably need *some* file to hold some metadata about the model, so, you could just use PMML rather than also invent another format for that too. The actual data can't feasibly be serialized in PMML since it would be far too large as XML. I'm not suggesting that text-based serialization of the vectors should be used; I was pointing more to the PMML container idea. Yes, if this only concerns data that will only be written/read by Spark, and is not intended for export, there isn't any value at all in PMML. I thought this might be covering model export, meaning, for some kind of external consumption. In that case, there's no good answer, but at least reusing PMML for the container could have small value. Model import/export for ALS --- Key: SPARK-5598 URL: https://issues.apache.org/jira/browse/SPARK-5598 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Xiangrui Meng Please see parent JIRA for details on model import/export plans. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-926) spark_ec2 script when ssh/scp-ing should pipe UserKnownHostsFile to /dev/null
[ https://issues.apache.org/jira/browse/SPARK-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-926. - Resolution: Duplicate Going to make this one the duplicate since SPARK-5403 has an active PR. spark_ec2 script when ssh/scp-ing should pipe UserKnownHostsFile to /dev/null --- Key: SPARK-926 URL: https://issues.apache.org/jira/browse/SPARK-926 Project: Spark Issue Type: New Feature Components: EC2 Affects Versions: 0.8.0 Reporter: Shay Seng Priority: Trivial The known hosts file on the local machine accumulates all kinds of cruft after a few cluster launches. When SSHing or SCPing, please add -o UserKnownHostsFile=/dev/null. Also remove the -t option from SSH, and only add it in when necessary, to reduce chatter on the console. e.g.
{code}
import subprocess
import time

# Copy a file to a given host through scp, throwing an exception if scp fails
def scp(host, opts, local_file, dest_file):
  subprocess.check_call(
      "scp -q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i %s '%s' '%s@%s:%s'" %
      (opts.identity_file, local_file, opts.user, host, dest_file), shell=True)

# Run a command on a host through ssh, retrying up to two times
# and then throwing an exception if ssh continues to fail.
def ssh(host, opts, command, sshopts=""):
  tries = 0
  while True:
    try:
      # removed -t option from ssh command, not sure why it is required all the time.
      return subprocess.check_call(
          "ssh %s -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i %s %s@%s '%s'" %
          (sshopts, opts.identity_file, opts.user, host, command), shell=True)
    except subprocess.CalledProcessError as e:
      if tries > 2:
        raise e
      print "Couldn't connect to host {0}, waiting 30 seconds".format(e)
      time.sleep(30)
      tries = tries + 1
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5647) Output metrics do not show up for older hadoop versions (< 2.5)
Kostas Sakellis created SPARK-5647: -- Summary: Output metrics do not show up for older hadoop versions (< 2.5) Key: SPARK-5647 URL: https://issues.apache.org/jira/browse/SPARK-5647 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis Need to add output metrics for hadoop < 2.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308584#comment-14308584 ] Apache Spark commented on SPARK-5563: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/4419 LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking in to the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4279) Implementing TinkerPop on top of GraphX
[ https://issues.apache.org/jira/browse/SPARK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308595#comment-14308595 ] Jianshi Huang commented on SPARK-4279: -- Is anyone working on this? Implementing TinkerPop on top of GraphX --- Key: SPARK-4279 URL: https://issues.apache.org/jira/browse/SPARK-4279 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Brennon York Priority: Minor [TinkerPop|https://github.com/tinkerpop] is a great abstraction for graph databases and has been implemented across various graph database backends. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308603#comment-14308603 ] Pedro Rodriguez commented on SPARK-5556: Posting here as a status update. I will be working on and opening a pull request for adding a collapsed Gibbs sampling version which uses FastLDA for super linear scaling with number of topics. Below is the design document (same as from the original LDA JIRA issue), along with the repository/branch I am working on. https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing https://github.com/EntilZha/spark/tree/LDA-Refactor Tasks * Rebase from the merged implementation, refactor appropriately * Merge/implement the required inheritance/trait/abstract classes to support two implementations (EM and Gibbs) using only the entry points exposed in the EM version, plus an optional argument to select between EM/Gibbs. * Do performance tests comparable to those run for EM LDA. Some details for inheritance/trait/abstract: General idea would be to create an API which LDA implementations must satisfy using a trait/abstract class. All implementation details would be encapsulated within a state object satisfying the trait/abstract class. LDA would be responsible for creating an EM or Gibbs state object based on a user argument switch/flag. Linked below is a sample implementation based on an earlier version of the merged EM code (which needs to be updated to reflect the changes since then, but it should show the idea well enough): https://github.com/EntilZha/spark/blob/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling/LDA.scala#L216-L242 Timeline: I have been busier than expected, but rebase/refactoring should be done in the next few days, then I will open a PR to get feedback while running performance tests. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
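As an illustration of the trait/abstract-class idea described above, a minimal self-contained outline follows; every name is made up for the example, and the real design lives in the linked document and branch:
{code}
// Hide each inference algorithm behind a common state trait and let the LDA
// entry point pick the implementation from a flag.
trait LdaOptimizerState {
  def iterate(): LdaOptimizerState   // one EM or Gibbs iteration
  def logLikelihood: Double
}

// Dummy stand-ins so the sketch compiles; real implementations would hold the
// distributed topic-term counts (EM graph, Gibbs assignment tables, etc.).
class EmState(val logLikelihood: Double) extends LdaOptimizerState {
  override def iterate(): LdaOptimizerState = new EmState(logLikelihood + 1.0)
}
class GibbsState(val logLikelihood: Double) extends LdaOptimizerState {
  override def iterate(): LdaOptimizerState = new GibbsState(logLikelihood + 1.0)
}

object Lda {
  // The existing public entry points stay the same; an optional argument
  // selects the optimizer.
  def initialState(algorithm: String): LdaOptimizerState = algorithm match {
    case "em"    => new EmState(0.0)
    case "gibbs" => new GibbsState(0.0)
    case other   => throw new IllegalArgumentException(s"Unknown LDA algorithm: $other")
  }
}
{code}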
[jira] [Commented] (SPARK-4279) Implementing TinkerPop on top of GraphX
[ https://issues.apache.org/jira/browse/SPARK-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308651#comment-14308651 ] Sean Owen commented on SPARK-4279: -- This sounds like something that should live outside Spark, no? I suggest closing this. Implementing TinkerPop on top of GraphX --- Key: SPARK-4279 URL: https://issues.apache.org/jira/browse/SPARK-4279 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Brennon York Priority: Minor [TinkerPop|https://github.com/tinkerpop] is a great abstraction for graph databases and has been implemented across various graph database backends. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5625) Spark binaries do not include Spark Core
[ https://issues.apache.org/jira/browse/SPARK-5625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5625. -- Resolution: Not a Problem All of these distributions include an assembly JAR with the entire Spark codebase. None are supposed to contain individual artifacts. Spark binaries do not include Spark Core --- Key: SPARK-5625 URL: https://issues.apache.org/jira/browse/SPARK-5625 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: CDH4 Reporter: DeepakVohra Spark binaries for CDH 4 do not include the Spark Core Jar. http://spark.apache.org/downloads.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5635) Allow users to run .scala files directly from spark-submit
[ https://issues.apache.org/jira/browse/SPARK-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308776#comment-14308776 ] Sean Owen commented on SPARK-5635: -- spark-shell uses spark-submit, and spark-shell is already the thing that can ingest source code. As you say, you can already run a .scala file this way. What is needed beyond this? Allow users to run .scala files directly from spark-submit -- Key: SPARK-5635 URL: https://issues.apache.org/jira/browse/SPARK-5635 Project: Spark Issue Type: New Feature Components: Spark Core, Spark Shell Reporter: Grant Henke Priority: Minor Similar to the Python functionality, allow users to submit .scala files. Currently the way I simulate this is to use spark-shell and run: `spark-shell -i myscript.scala` Note: the user needs to add an exit call to the bottom of the script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
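A hedged example of the workaround described above; the script contents and file name are hypothetical, and `sc` is the SparkContext that spark-shell pre-defines:
{code}
// myscript.scala -- run with: spark-shell -i myscript.scala
// The explicit exit at the end keeps the REPL from staying open once the
// script has finished.
val counts = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)

System.exit(0)
{code}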
[jira] [Created] (SPARK-5645) Track local bytes read for shuffles - update UI
Kostas Sakellis created SPARK-5645: -- Summary: Track local bytes read for shuffles - update UI Key: SPARK-5645 URL: https://issues.apache.org/jira/browse/SPARK-5645 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis Currently we do not track the local bytes read for a shuffle read. The UI only shows the remote bytes read. This is pretty confusing to the user because: 1) In local mode all shuffle reads are local 2) the shuffle bytes written from the previous stage might not add up if there are some bytes that are read locally on the shuffle read side 3) With https://github.com/apache/spark/pull/4067 we display the total number of records so that won't line up with only showing the remote bytes read. I propose we track the remote and local bytes read separately. In the UI show the total bytes read and in brackets show the remote bytes read for a shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
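A minimal sketch of the proposal, with field and method names assumed for illustration (the real shuffle-read metrics class is more involved):
{code}
// Track local and remote bytes separately and render "total (remote)" in the UI.
case class ShuffleReadStats(remoteBytesRead: Long, localBytesRead: Long) {
  def totalBytesRead: Long = remoteBytesRead + localBytesRead
}

def uiReadLabel(stats: ShuffleReadStats): String =
  s"${stats.totalBytesRead} B (${stats.remoteBytesRead} B remote)"
{code}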
[jira] [Comment Edited] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308599#comment-14308599 ] Andrew Or edited comment on SPARK-5388 at 2/6/15 5:35 AM: -- By the way for the more specific comments it would be good if you can leave them on the PR itself: https://github.com/apache/spark/pull/4216. The specs and the actual code will diverge after some review so the most up-to-date version will likely be there. was (Author: andrewor14): By the way for the more specific comments it would be good if you can leave them on the PR itself: https://github.com/apache/spark/pull/4216. The specs and the actual code will likely diverge after some review so the most up-to-date version will likely be there. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5643) Add a show method to print the content of a DataFrame in columnar format
Reynold Xin created SPARK-5643: -- Summary: Add a show method to print the content of a DataFrame in columnar format Key: SPARK-5643 URL: https://issues.apache.org/jira/browse/SPARK-5643 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
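A hedged usage sketch of the proposed method; the data file is the people.json example shipped with Spark, and `sqlContext` is assumed to be in scope (e.g. from spark-shell):
{code}
// Print the first rows of the DataFrame in a readable, space-padded columnar
// layout instead of the Row.toString output you get from collect().
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.show()
{code}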
[jira] [Created] (SPARK-5644) Delete tmp dir when sc is stopped
Weizhong created SPARK-5644: --- Summary: Delete tmp dir when sc is stopped Key: SPARK-5644 URL: https://issues.apache.org/jira/browse/SPARK-5644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weizhong Priority: Minor When we run the driver as a long-running service, the service process creates a SparkContext, runs a job, and then stops the context. Because we only call sc.stop but never exit the service process, the tmp dirs created by HttpFileServer and SparkEnv are not deleted after the SparkContext is stopped. This leads to too many tmp dirs accumulating if we create many SparkContexts to run jobs in this service process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
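A sketch of the long-running-service pattern being described, with an assumed toy job, to make the life cycle concrete:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Each request creates a context, runs a job, and stops it, but the JVM never
// exits, so the per-context tmp dirs created by HttpFileServer/SparkEnv pile up.
def runJob(jobId: Int): Long = {
  val sc = new SparkContext(
    new SparkConf().setAppName(s"service-job-$jobId").setMaster("local[2]"))
  try {
    sc.parallelize(1 to 1000).map(_.toLong * 2).reduce(_ + _)
  } finally {
    sc.stop()   // the tmp dirs survive this call
  }
}
{code}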
[jira] [Resolved] (SPARK-5031) ml.LogisticRegression score column should be renamed probability
[ https://issues.apache.org/jira/browse/SPARK-5031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5031. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3637 [https://github.com/apache/spark/pull/3637] ml.LogisticRegression score column should be renamed probability Key: SPARK-5031 URL: https://issues.apache.org/jira/browse/SPARK-5031 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.3.0 In the spark.ml package, LogisticRegression has an output column score which contains the estimated probability of label 1. Score is a very overloaded term, so probability would be better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4942) ML Transformers should allow output cols to be turned on,off
[ https://issues.apache.org/jira/browse/SPARK-4942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4942. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3637 [https://github.com/apache/spark/pull/3637] ML Transformers should allow output cols to be turned on,off Key: SPARK-4942 URL: https://issues.apache.org/jira/browse/SPARK-4942 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.3.0 ML Transformers will eventually output multiple columns (e.g., predicted labels, predicted confidences, probabilities, etc.). These columns should be optional. Benefits: * more efficient (though Spark SQL may be able to optimize) * cleaner column namespace if people do not want all output columns Proposal: * If a column name parameter (e.g., predictionCol) is an empty string, then do not output that column. This will require updating validateAndTransformSchema() to ignore empty output column names in addition to updating transform(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
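A minimal sketch of the proposal above (method and column names assumed; not the merged patch), showing how an empty column name can switch an output column off:
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Only append an output column when its name parameter is non-empty.
def withOptionalColumns(
    dataset: DataFrame,
    predictionCol: String,
    probabilityCol: String): DataFrame = {
  var output = dataset
  if (predictionCol.nonEmpty) {
    output = output.withColumn(predictionCol, lit(0.0))    // placeholder value
  }
  if (probabilityCol.nonEmpty) {
    output = output.withColumn(probabilityCol, lit(0.5))   // placeholder value
  }
  output
}
{code}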
[jira] [Resolved] (SPARK-4789) Standardize ML Prediction APIs
[ https://issues.apache.org/jira/browse/SPARK-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4789. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3637 [https://github.com/apache/spark/pull/3637] Standardize ML Prediction APIs -- Key: SPARK-4789 URL: https://issues.apache.org/jira/browse/SPARK-4789 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Fix For: 1.3.0 Create a standard set of abstractions for prediction in spark.ml. This will follow the design doc specified in [SPARK-3702]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5616) Add examples for PySpark API
[ https://issues.apache.org/jira/browse/SPARK-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308555#comment-14308555 ] Apache Spark commented on SPARK-5616: - User 'lazyman500' has created a pull request for this issue: https://github.com/apache/spark/pull/4417 Add examples for PySpark API Key: SPARK-5616 URL: https://issues.apache.org/jira/browse/SPARK-5616 Project: Spark Issue Type: Improvement Components: PySpark Reporter: dongxu Labels: examples, pyspark, python Fix For: 1.3.0 There are fewer PySpark API examples than Spark Scala API examples. For example: 1. Broadcast: how to use the broadcast operation API. 2. Modules: how to import another Python file from within a zip file. Add more examples for newcomers who want to use PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308599#comment-14308599 ] Andrew Or commented on SPARK-5388: -- By the way for the more specific comments it would be good if you can leave them on the PR itself: https://github.com/apache/spark/pull/4216. The specs and the actual code will likely diverge after some review so the most up-to-date version will likely be there. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308608#comment-14308608 ] Kevin Jung commented on SPARK-5081: --- Sorry, I will make an effort to provide new code to reproduce this problem, because I don't have the old code anymore. Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of shuffle write shown in the spark web UI is much different when I execute the same spark job with the same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set the spark.shuffle.manager option to hash because its default value has changed, but spark 1.2 still writes more shuffle output than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs to take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308635#comment-14308635 ] Patrick Wendell commented on SPARK-5388: I think it's reasonable to use DELETE per [~tigerquoll]'s suggestion. It's not a perfect match with DELETE semantics, but I think it's fine to use it if it's not too much work. I also think calling it maxProtocolVersion is a good idea if those are indeed the semantics. For security, yeah the killing is the same as it is in the current mode, which is that there is no security. One thing we could do if there is user demand is add a flag that globally disables killing, but let's see if users request this first. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Dai updated SPARK-5563: - Assignee: yuhao yang LDA with online variational inference - Key: SPARK-5563 URL: https://issues.apache.org/jira/browse/SPARK-5563 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: yuhao yang Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. “Online Learning for Latent Dirichlet Allocation.” NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. This algorithm will also be important for supporting Streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but use a different underlying optimizer. This will require hooking in to the existing mllib.optimization frameworks. This will require some discussion about whether batch versions of online variational inference should be supported, as well as what variational approximation should be used now or in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5566) Tokenizer for mllib package
[ https://issues.apache.org/jira/browse/SPARK-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308733#comment-14308733 ] yuhao yang commented on SPARK-5566: --- I mean only the underlying implementation. Tokenizer for mllib package --- Key: SPARK-5566 URL: https://issues.apache.org/jira/browse/SPARK-5566 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley There exist tokenizer classes in the spark.ml.feature package and in the LDAExample in the spark.examples.mllib package. The Tokenizer in the LDAExample is more advanced and should be made into a full-fledged public class in spark.mllib.feature. The spark.ml.feature.Tokenizer class should become a wrapper around the new Tokenizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5644) Delete tmp dir when sc is stopped
[ https://issues.apache.org/jira/browse/SPARK-5644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308719#comment-14308719 ] Apache Spark commented on SPARK-5644: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/4412 Delete tmp dir when sc is stopped -- Key: SPARK-5644 URL: https://issues.apache.org/jira/browse/SPARK-5644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weizhong Priority: Minor When we run the driver as a long-running service, the service process creates a SparkContext, runs a job, and then stops the context. Because we only call sc.stop but never exit the service process, the tmp dirs created by HttpFileServer and SparkEnv are not deleted after the SparkContext is stopped. This leads to too many tmp dirs accumulating if we create many SparkContexts to run jobs in this service process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308620#comment-14308620 ] Kevin Jung commented on SPARK-5081: --- To test under the same conditions, I set this to snappy for all Spark versions, but the problem still occurs. As far as I know, lz4 needs more CPU time than snappy but has a better compression ratio. Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of shuffle write shown in the spark web UI is much different when I execute the same spark job with the same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set the spark.shuffle.manager option to hash because its default value has changed, but spark 1.2 still writes more shuffle output than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs to take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308619#comment-14308619 ] Pedro Rodriguez commented on SPARK-5556: I will read that paper; it seems interesting. It is probably worth discussing at some point: what is the philosophy behind supporting different algorithms? It seems like there are a good number (at least 2 Gibbs, 1 EM right now). Along the same line of thought, perhaps it would be better to open two pull requests: one which refactors the current LDA to allow multiple algorithms, and a second for the Gibbs itself? Thoughts? Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER
[ https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308644#comment-14308644 ] Florian Verhein commented on SPARK-3185: [~dvohra] Sure, but the exception is thrown by tachyon... so you're not going to be able to fix it by changing the spark build SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER --- Key: SPARK-3185 URL: https://issues.apache.org/jira/browse/SPARK-3185 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.2 Environment: Amazon Linux AMI [ec2-user@ip-172-30-1-145 ~]$ uname -a Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/ The build I used (and MD5 verified): [ec2-user@ip-172-30-1-145 ~]$ wget http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz Reporter: Jeremy Chambers {code} org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 {code} When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon exception is thrown when Formatting JOURNAL_FOLDER. No exception occurs when I launch on Hadoop 1. Launch used: {code} ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch sparkProd {code} {code} log snippet Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/ Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73) at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53) at tachyon.UnderFileSystem.get(UnderFileSystem.java:53) at tachyon.Format.main(Format.java:54) Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1070) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:203) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:69) ... 
3 more Killed 0 processes Killed 0 processes ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes ---end snippet--- {code} *I don't have this problem when I launch without the --hadoop-major-version=2 (which defaults to Hadoop 1.x).* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4808) Spark fails to spill with small number of large objects
[ https://issues.apache.org/jira/browse/SPARK-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308677#comment-14308677 ] Apache Spark commented on SPARK-4808: - User 'mingyukim' has created a pull request for this issue: https://github.com/apache/spark/pull/4420 Spark fails to spill with small number of large objects --- Key: SPARK-4808 URL: https://issues.apache.org/jira/browse/SPARK-4808 Project: Spark Issue Type: Bug Affects Versions: 1.0.2, 1.1.0, 1.2.0, 1.2.1 Reporter: Dennis Lawler Spillable's maybeSpill does not allow spill to occur until at least 1000 elements have been spilled, and then will only evaluate spill every 32nd element thereafter. When there is a small number of very large items being tracked, out-of-memory conditions may occur. I suspect that this and the every-32nd-element behavior was to reduce the impact of the estimateSize() call. This method was extracted into SizeTracker, which implements its own exponential backup for size estimation, so now we are only avoiding using the resulting estimated size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
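A paraphrased sketch (not the exact Spark source) of the maybeSpill behavior the report describes, to make the failure mode concrete; the threshold value is illustrative:
{code}
// elementsRead counts records inserted since the last spill. The size check
// only runs every 32 records and only after the first 1000, so a handful of
// very large records can blow past the memory threshold before a spill is
// ever considered.
var elementsRead = 0L
val myMemoryThreshold = 5L * 1024 * 1024   // illustrative initial threshold

def maybeSpill(currentMemory: Long): Boolean = {
  elementsRead += 1
  elementsRead > 1000 && elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold
}
{code}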
[jira] [Resolved] (SPARK-5639) Support DataFrame.renameColumn
[ https://issues.apache.org/jira/browse/SPARK-5639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5639. Resolution: Fixed Fix Version/s: 1.3.0 Support DataFrame.renameColumn -- Key: SPARK-5639 URL: https://issues.apache.org/jira/browse/SPARK-5639 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 It is incredibly hard to rename a column using the existing DSL. Let's provide that out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe
[ https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308627#comment-14308627 ] Muthupandi K edited comment on SPARK-5391 at 2/6/15 5:13 AM: - Same error occurred when a table is created with json serde in hive table and queried from SparkQL. was (Author: muthu): Same error occoured when a table is created with json serde in hive table and queried from SparkQL. SparkSQL fails to create tables with custom JSON SerDe -- Key: SPARK-5391 URL: https://issues.apache.org/jira/browse/SPARK-5391 Project: Spark Issue Type: Bug Components: SQL Reporter: David Ross - Using Spark built from trunk on this commit: https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd - Build for Hive13 - Using this JSON serde: https://github.com/rcongiu/Hive-JSON-Serde First download jar locally: {code} $ curl http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar /tmp/json-serde-1.3-jar-with-dependencies.jar {code} Then add it in SparkSQL session: {code} add jar /tmp/json-serde-1.3-jar-with-dependencies.jar {code} Finally create table: {code} create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'; {code} Logs for add jar: {code} 15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar' 15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 15/01/23 23:48:34 INFO SessionState: Added /tmp/json-serde-1.3-jar-with-dependencies.jar to class path 15/01/23 23:48:34 INFO SessionState: Added resource: /tmp/json-serde-1.3-jar-with-dependencies.jar 15/01/23 23:48:34 INFO spark.SparkContext: Added JAR /tmp/json-serde-1.3-jar-with-dependencies.jar at http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with timestamp 1422056914776 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List() 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List() {code} Logs (with error) for create table: {code} 15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running query 'create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'' 15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed 15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=parse start=1422056941103 end=1422056941104 duration=1 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json position=13 15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=semanticAnalyze start=1422056941104 end=1422056941240 duration=136 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null) 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=compile start=1422056941071 end=1422056941252 duration=181 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=TimeToSubmit start=1422056941067 end=1422056941258 duration=191
[jira] [Commented] (SPARK-5391) SparkSQL fails to create tables with custom JSON SerDe
[ https://issues.apache.org/jira/browse/SPARK-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308627#comment-14308627 ] Muthupandi K commented on SPARK-5391: - Same error occoured when a table is created with json serde in hive table and queried from SparkQL. SparkSQL fails to create tables with custom JSON SerDe -- Key: SPARK-5391 URL: https://issues.apache.org/jira/browse/SPARK-5391 Project: Spark Issue Type: Bug Components: SQL Reporter: David Ross - Using Spark built from trunk on this commit: https://github.com/apache/spark/commit/bc20a52b34e826895d0dcc1d783c021ebd456ebd - Build for Hive13 - Using this JSON serde: https://github.com/rcongiu/Hive-JSON-Serde First download jar locally: {code} $ curl http://www.congiu.net/hive-json-serde/1.3/cdh5/json-serde-1.3-jar-with-dependencies.jar /tmp/json-serde-1.3-jar-with-dependencies.jar {code} Then add it in SparkSQL session: {code} add jar /tmp/json-serde-1.3-jar-with-dependencies.jar {code} Finally create table: {code} create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'; {code} Logs for add jar: {code} 15/01/23 23:48:33 INFO thriftserver.SparkExecuteStatementOperation: Running query 'add jar /tmp/json-serde-1.3-jar-with-dependencies.jar' 15/01/23 23:48:34 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 15/01/23 23:48:34 INFO SessionState: Added /tmp/json-serde-1.3-jar-with-dependencies.jar to class path 15/01/23 23:48:34 INFO SessionState: Added resource: /tmp/json-serde-1.3-jar-with-dependencies.jar 15/01/23 23:48:34 INFO spark.SparkContext: Added JAR /tmp/json-serde-1.3-jar-with-dependencies.jar at http://192.168.99.9:51312/jars/json-serde-1.3-jar-with-dependencies.jar with timestamp 1422056914776 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List() 15/01/23 23:48:34 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List() {code} Logs (with error) for create table: {code} 15/01/23 23:49:00 INFO thriftserver.SparkExecuteStatementOperation: Running query 'create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'' 15/01/23 23:49:00 INFO parse.ParseDriver: Parsing command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed 15/01/23 23:49:01 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr. 
15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Concurrency mode is disabled, not creating a lock manager 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO parse.ParseDriver: Parsing command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO parse.ParseDriver: Parse Completed 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=parse start=1422056941103 end=1422056941104 duration=1 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Starting Semantic Analysis 15/01/23 23:49:01 INFO parse.SemanticAnalyzer: Creating table test_json position=13 15/01/23 23:49:01 INFO ql.Driver: Semantic Analysis Completed 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=semanticAnalyze start=1422056941104 end=1422056941240 duration=136 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null) 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=compile start=1422056941071 end=1422056941252 duration=181 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO ql.Driver: Starting command: create table test_json (c1 boolean) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 15/01/23 23:49:01 INFO log.PerfLogger: /PERFLOG method=TimeToSubmit start=1422056941067 end=1422056941258 duration=191 from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver 15/01/23 23:49:01 INFO log.PerfLogger: PERFLOG
[jira] [Resolved] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
[ https://issues.apache.org/jira/browse/SPARK-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5631. -- Resolution: Not a Problem The right place to ask questions and discuss this is the mailing list. This means you have mismatched Hadoop versions, either between your Spark and Hadoop deployment, or because you included Hadoop code in your app. Server IPC version 7 cannot communicate with client version 4 -- Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
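For context: the error "Server IPC version 7 cannot communicate with client version 4" indicates Hadoop 1.x client classes talking to a Hadoop 2.x (CDH4) cluster. A minimal sketch of the usual fix for an sbt build follows; the Cloudera repository URL and the exact CDH4.2 artifact version are assumptions to verify against the actual cluster:
{code}
// build.sbt (sketch): make the Hadoop client on the classpath match the cluster,
// and depend on a Spark build for the same Hadoop version instead of bundling a
// mismatched hadoop-client in the application jar.
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.2.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.0.0-mr1-cdh4.2.0"
)
{code}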
[jira] [Resolved] (SPARK-5531) Spark download .tgz file does not get unpacked
[ https://issues.apache.org/jira/browse/SPARK-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5531. -- Resolution: Not a Problem Spark download .tgz file does not get unpacked -- Key: SPARK-5531 URL: https://issues.apache.org/jira/browse/SPARK-5531 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: Linux Reporter: DeepakVohra The spark-1.2.0-bin-cdh4.tgz file downloaded from http://spark.apache.org/downloads.html does not get unpacked. tar xvf spark-1.2.0-bin-cdh4.tgz gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error is not recoverable: exiting now -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5645) Track local bytes read for shuffles - update UI
[ https://issues.apache.org/jira/browse/SPARK-5645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-5645: -- Assignee: Kostas Sakellis Track local bytes read for shuffles - update UI --- Key: SPARK-5645 URL: https://issues.apache.org/jira/browse/SPARK-5645 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Currently we do not track the local bytes read for a shuffle read. The UI only shows the remote bytes read. This is pretty confusing to the user because: 1) In local mode all shuffle reads are local 2) the shuffle bytes written from the previous stage might not add up if there are some bytes that are read locally on the shuffle read side 3) With https://github.com/apache/spark/pull/4067 we display the total number of records so that won't line up with only showing the remote bytes read. I propose we track the remote and local bytes read separately. In the UI show the total bytes read and in brackets show the remote bytes read for a shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5643) Add a show method to print the content of a DataFrame in columnar format
[ https://issues.apache.org/jira/browse/SPARK-5643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308543#comment-14308543 ] Apache Spark commented on SPARK-5643: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4416 Add a show method to print the content of a DataFrame in columnar format Key: SPARK-5643 URL: https://issues.apache.org/jira/browse/SPARK-5643 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
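As a rough usage sketch of what such a method could look like (the method name comes from the issue title; the output format shown is assumed, and the linked pull request defines the real behaviour):
{code}
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.show()
// Prints something along the lines of:
// age  name
// null Michael
// 30   Andy
// 19   Justin
{code}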
[jira] [Created] (SPARK-5646) Record output metrics for cache
Kostas Sakellis created SPARK-5646: -- Summary: Record output metrics for cache Key: SPARK-5646 URL: https://issues.apache.org/jira/browse/SPARK-5646 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis We currently show the input metrics when coming from the cache but we don't track/show the output metrics when we write to the cache -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5646) Record output metrics for cache
[ https://issues.apache.org/jira/browse/SPARK-5646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-5646: -- Assignee: Kostas Sakellis Record output metrics for cache --- Key: SPARK-5646 URL: https://issues.apache.org/jira/browse/SPARK-5646 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis We currently show the input metrics when coming from the cache but we don't track/show the output metrics when we write to the cache -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307196#comment-14307196 ] Travis Galoppo commented on SPARK-5021: --- [~MechCoder] It is probably better to get something working, submit a PR (perhaps mark it [WIP]) and work out the kinks in the review process. GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
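To make "linear in the number of non-zero values" concrete, here is a sketch of the kind of per-sample update that avoids densifying the input; the helper below is hypothetical and not part of MLlib:
{code}
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

// Accumulate a running sum of feature values by touching only the non-zero
// entries, so the cost is O(nnz) rather than O(numFeatures).
def addInPlace(sum: Array[Double], v: SparseVector): Unit = {
  var i = 0
  while (i < v.indices.length) {
    sum(v.indices(i)) += v.values(i)
    i += 1
  }
}

val v = Vectors.sparse(1000000, Array(3, 17), Array(2.0, 4.0)).asInstanceOf[SparseVector]
val sum = new Array[Double](1000000)
addInPlace(sum, v) // only two entries are visited
{code}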
[jira] [Commented] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307406#comment-14307406 ] Twinkle Sachdeva commented on SPARK-4705: - Hi [~vanzin], Regarding adding that for other modes, I just need to override an API after figuring out how to get the attempt id. I will plan for that. Thanks for the HTML stuff; I will upload the UI snapshot too. Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin yarn-cluster mode will retry to run the driver in certain failure modes. If event logging is enabled, this will most probably fail, because: {noformat} Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists! at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129) at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) at org.apache.spark.SparkContext.<init>(SparkContext.scala:353) {noformat} The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307463#comment-14307463 ] Josh Rosen commented on SPARK-4897: --- Hi [~ianozsvald], Until now, the main motivation for Python 2.6 support was that it's the default system Python on a few Linux distributions. So far, I think the overhead of supporting 2.6 has been fairly minimal, mostly involving a handful of small changes such as not treating certain objects as context managers (e.g. ZipFile objects). Let's try porting to 2.7 / 3.4 and then re-assess how hard Python 2.6 support will be. If it's really easy (a couple hours of work, max) then I don't see a reason to drop it, but if we have to go to increasingly convoluted lengths to keep it then it's probably not worth it if we're gaining 3.4 support in return. I think the main blocker to Python 3.4 support is the fact that nobody has really had time to work on it. I'd be happy to work with anyone who is interested in taking this on. Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type "help", "copyright", "credits" or "license" for more information. Traceback (most recent call last): File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, in <module> import pyspark File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 41, in <module> from pyspark.context import SparkContext File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, in <module> from pyspark import accumulators File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", line 97, in <module> from pyspark.cloudpickle import CloudPickler File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 120, in <module> class CloudPickler(pickle.Pickler): File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4897: -- Target Version/s: 1.4.0 (was: 1.3.0) Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type help, copyright, credits or license for more information. Traceback (most recent call last): File /Users/joshrosen/Documents/Spark/python/pyspark/shell.py, line 28, in module import pyspark File /Users/joshrosen/Documents/spark/python/pyspark/__init__.py, line 41, in module from pyspark.context import SparkContext File /Users/joshrosen/Documents/spark/python/pyspark/context.py, line 26, in module from pyspark import accumulators File /Users/joshrosen/Documents/spark/python/pyspark/accumulators.py, line 97, in module from pyspark.cloudpickle import CloudPickler File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 120, in module class CloudPickler(pickle.Pickler): File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be hard difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307474#comment-14307474 ] thom neale commented on SPARK-4897: --- I'm still very interested in helping with the 3.4 port, have only been prohibited by lack of free time. I'll ask if work will give me a half day to work on it. Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type help, copyright, credits or license for more information. Traceback (most recent call last): File /Users/joshrosen/Documents/Spark/python/pyspark/shell.py, line 28, in module import pyspark File /Users/joshrosen/Documents/spark/python/pyspark/__init__.py, line 41, in module from pyspark.context import SparkContext File /Users/joshrosen/Documents/spark/python/pyspark/context.py, line 26, in module from pyspark import accumulators File /Users/joshrosen/Documents/spark/python/pyspark/accumulators.py, line 97, in module from pyspark.cloudpickle import CloudPickler File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 120, in module class CloudPickler(pickle.Pickler): File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be hard difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5616) Add examples for PySpark API
dongxu created SPARK-5616: - Summary: Add examples for PySpark API Key: SPARK-5616 URL: https://issues.apache.org/jira/browse/SPARK-5616 Project: Spark Issue Type: Improvement Components: PySpark Reporter: dongxu Fix For: 1.3.0 There are fewer PySpark API examples than Spark Scala API examples. For example: 1. Broadcast: how to use the broadcast operation API. 2. Module: how to import another Python file packaged in a zip file. Add more examples for newcomers who want to use PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5617) test failure of SQLQuerySuite
wangfei created SPARK-5617: -- Summary: test failure of SQLQuerySuite Key: SPARK-5617 URL: https://issues.apache.org/jira/browse/SPARK-5617 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: wangfei SQLQuerySuite test failure: [info] - simple select (22 milliseconds) [info] - sorting (722 milliseconds) [info] - external sorting (728 milliseconds) [info] - limit (95 milliseconds) [info] - date row *** FAILED *** (35 milliseconds) [info] Results do not match for query: [info] 'Limit 1 [info]'Project [CAST(2015-01-28, DateType) AS c0#3630] [info] 'UnresolvedRelation [testData], None [info] [info] == Analyzed Plan == [info] Limit 1 [info]Project [CAST(2015-01-28, DateType) AS c0#3630] [info] LogicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35 [info] [info] == Physical Plan == [info] Limit 1 [info]Project [16463 AS c0#3630] [info] PhysicalRDD [key#0,value#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:35 [info] [info] == Results == [info] !== Correct Answer - 1 == == Spark Answer - 1 == [info] ![2015-01-28] [2015-01-27] (QueryTest.scala:77) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.fail(Assertions.scala:1328) [info] at org.scalatest.FunSuite.fail(FunSuite.scala:1555) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:77) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:95) [info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply$mcV$sp(SQLQuerySuite.scala:300) [info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300) [info] at org.apache.spark.sql.SQLQuerySuite$$anonfun$23.apply(SQLQuerySuite.scala:300) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5618) Optimise utility code.
Makoto Fukuhara created SPARK-5618: -- Summary: Optimise utility code. Key: SPARK-5618 URL: https://issues.apache.org/jira/browse/SPARK-5618 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Makoto Fukuhara Priority: Minor I refactored the evaluation timing and removed an unnecessary Regex API call, because the Regex API is relatively heavy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
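The Regex concern generalizes; a small illustration (not the actual patch) of moving pattern compilation out of the hot path and deferring it until first use:
{code}
object RegexExample {
  // Wasteful: builds a new Regex object on every call.
  def isSparkClassSlow(name: String): Boolean =
    "^org\\.apache\\.spark\\..*".r.findFirstIn(name).isDefined

  // Better: the Regex is compiled once, and only if it is ever needed.
  private lazy val SparkClassRegex = "^org\\.apache\\.spark\\..*".r
  def isSparkClass(name: String): Boolean =
    SparkClassRegex.findFirstIn(name).isDefined
}
{code}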
[jira] [Created] (SPARK-5631) Server IPC version 7 cannot communicate with client version 4
DeepakVohra created SPARK-5631: -- Summary: Server IPC version 7 cannot communicate with client version 4 Key: SPARK-5631 URL: https://issues.apache.org/jira/browse/SPARK-5631 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0 Environment: Scala 2.10.4 Spark 1.2 CDH4.2 Reporter: DeepakVohra A Spark application generates the error Server IPC version 7 cannot communicate with client version 4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5135) Add support for describe [extended] table to DDL in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5135. Resolution: Fixed Add support for describe [extended] table to DDL in SQLContext -- Key: SPARK-5135 URL: https://issues.apache.org/jira/browse/SPARK-5135 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.0 Reporter: shengli Priority: Minor Fix For: 1.3.0 Original Estimate: 72h Remaining Estimate: 72h Support the Describe Table command: describe [extended] tableName. This also supports external datasource tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5615) Fix testPackage in StreamingContextSuite
Liang-Chi Hsieh created SPARK-5615: -- Summary: Fix testPackage in StreamingContextSuite Key: SPARK-5615 URL: https://issues.apache.org/jira/browse/SPARK-5615 Project: Spark Issue Type: Bug Reporter: Liang-Chi Hsieh Priority: Minor testPackage in StreamingContextSuite often throws SparkException because its ssc is not shut down gracefully. It does not affect the unit test, but I think we can make the shutdown graceful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
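For reference, a graceful StreamingContext shutdown, which waits for received data to be processed before stopping, looks roughly like this; whether this is exactly what the fix changes is an assumption:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("graceful-shutdown-sketch")
val ssc = new StreamingContext(conf, Seconds(1))
// ... define streams, ssc.start(), run the assertions under test ...
// Stop the underlying SparkContext too, and drain in-flight data first.
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}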
[jira] [Resolved] (SPARK-5608) Improve SEO of Spark documentation site to let Google find latest docs
[ https://issues.apache.org/jira/browse/SPARK-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-5608. -- Resolution: Fixed Fix Version/s: 1.3.0 Improve SEO of Spark documentation site to let Google find latest docs -- Key: SPARK-5608 URL: https://issues.apache.org/jira/browse/SPARK-5608 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Matei Zaharia Assignee: Matei Zaharia Fix For: 1.3.0 Google currently has trouble finding spark.apache.org/docs/latest, so a lot of the results returned for various queries are from random previous versions of Spark where someone created a link. I'd like to do the following: - Add a sitemap.xml to spark.apache.org that lists all the docs/latest pages (already done) - Add meta description tags on some of the most important doc pages - Shorten the titles of some pages to have more relevant keywords; for example there's no reason to have Spark SQL Programming Guide - Spark 1.2.0 documentation, we can just say Spark SQL - Spark 1.2.0 documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307254#comment-14307254 ] koert kuipers commented on SPARK-2808: -- what is the motivation for this upgrade? update kafka to version 0.8.2 - Key: SPARK-2808 URL: https://issues.apache.org/jira/browse/SPARK-2808 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2808) update kafka to version 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307254#comment-14307254 ] koert kuipers edited comment on SPARK-2808 at 2/5/15 2:28 PM: -- what is the motivation for this upgrade? the offset storage in kafka? was (Author: koert): what is the motivation for this upgrade? update kafka to version 0.8.2 - Key: SPARK-2808 URL: https://issues.apache.org/jira/browse/SPARK-2808 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati First kafka_2.11 0.8.1 has to be released -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307289#comment-14307289 ] Takeshi Yamamuro commented on SPARK-5480: - This code didn't throw such exceptions in my environment. What are the predicate in subgraph() and the input graph? GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: --- Key: SPARK-5480 URL: https://issues.apache.org/jira/browse/SPARK-5480 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Environment: Yarn client Reporter: Stephane Maarek Running the following code: val subgraph = graph.subgraph ( vpred = (id,article) => /* working predicate */ ).cache() println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges") val prGraph = subgraph.staticPageRank(5).cache val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) { (v, title, rank) => (rank.getOrElse(0.0), title) } titleAndPrGraph.vertices.top(13) { Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1) }.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id:" + t._1)) Returns a graph with 5000 nodes and 4000 edges. Then it crashes during the PageRank with the following: 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes) 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64) at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110) at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at
[jira] [Created] (SPARK-5632) not able to resolve dot('.') in field name
Lishu Liu created SPARK-5632: Summary: not able to resolve dot('.') in field name Key: SPARK-5632 URL: https://issues.apache.org/jira/browse/SPARK-5632 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 Reporter: Lishu Liu My cassandra table task_trace has a field sm.result which contains dot in the name. So SQL tried to look up sm instead of full name 'sm.result'. Here is my code: scala import org.apache.spark.sql.cassandra.CassandraSQLContext scala val cc = new CassandraSQLContext(sc) scala val task_trace = cc.jsonFile(/task_trace.json) scala task_trace.registerTempTable(task_trace) scala cc.setKeyspace(cerberus_data_v4) scala val res = cc.sql(SELECT received_datetime, task_body.cerberus_id, task_body.sm.result FROM task_trace WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce') res: org.apache.spark.sql.SchemaRDD = SchemaRDD[57] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, cerberus_id, couponId, coupon_code, created, description, domain, expires, message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, validity The full schema look like this: scala task_trace.printSchema() root |-- received_datetime: long (nullable = true) |-- task_body: struct (nullable = true) ||-- cerberus_batch_id: string (nullable = true) ||-- cerberus_id: string (nullable = true) ||-- couponId: integer (nullable = true) ||-- coupon_code: string (nullable = true) ||-- created: string (nullable = true) ||-- description: string (nullable = true) ||-- domain: string (nullable = true) ||-- expires: string (nullable = true) ||-- message_id: string (nullable = true) ||-- neverShowAfter: string (nullable = true) ||-- neverShowBefore: string (nullable = true) ||-- offerTitle: string (nullable = true) ||-- screenshots: array (nullable = true) |||-- element: string (containsNull = false) ||-- sm.result: struct (nullable = true) |||-- cerberus_batch_id: string (nullable = true) |||-- cerberus_id: string (nullable = true) |||-- code: string (nullable = true) |||-- couponId: integer (nullable = true) |||-- created: string (nullable = true) |||-- description: string (nullable = true) |||-- domain: string (nullable = true) |||-- expires: string (nullable = true) |||-- message_id: string (nullable = true) |||-- neverShowAfter: string (nullable = true) |||-- neverShowBefore: string (nullable = true) |||-- offerTitle: string (nullable = true) |||-- result: struct (nullable = true) ||||-- post: struct (nullable = true) |||||-- alchemy_out_of_stock: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: boolean (nullable = true) |||||-- meta: struct (nullable = true) ||||||-- None_tx_value: array (nullable = true) |||||||-- element: string (containsNull = false) ||||||-- exceptions: array (nullable = true) |||||||-- element: string (containsNull = false) ||||||-- no_input_value: array (nullable = true) |||||||-- element: string (containsNull = false) ||||||-- not_mapped: array (nullable = true) |||||||-- element: string (containsNull = false) ||||||-- not_transformed: array (nullable = true) |||||||-- element: array (containsNull = false) ||||||||-- element: string (containsNull = false) |||||-- now_price_checkout: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: double 
(nullable = true) |||||-- shipping_price: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: double (nullable = true) |||||-- tax: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: double (nullable = true) |||||-- total: struct (nullable = true) ||||||-- ci: double (nullable = true) ||||||-- value: double (nullable = true) ||||-- pre: struct (nullable = true) |||||-- alchemy_out_of_stock: struct (nullable = true) |||
[jira] [Created] (SPARK-5633) pyspark saveAsTextFile support for compression codec
Vladimir Vladimirov created SPARK-5633: -- Summary: pyspark saveAsTextFile support for compression codec Key: SPARK-5633 URL: https://issues.apache.org/jira/browse/SPARK-5633 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.2.0 Reporter: Vladimir Vladimirov Priority: Minor Scala and Java API allows to provide compression codec with saveAsTextFile(path, codec) PySpark saveAsTextFile API does not support passing codec class. This story is about adding saveAsTextFile(path, codec) support into pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
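For reference, the existing Scala API the issue wants mirrored in PySpark takes the codec class as a second argument:
{code}
import org.apache.hadoop.io.compress.GzipCodec

// Scala API: the codec class selects the compression applied to the output part files.
sc.parallelize(1 to 100).map(_.toString).saveAsTextFile("/tmp/out-gzip", classOf[GzipCodec])
{code}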
[jira] [Created] (SPARK-5634) History server shows misleading message when there are no incomplete apps
Marcelo Vanzin created SPARK-5634: - Summary: History server shows misleading message when there are no incomplete apps Key: SPARK-5634 URL: https://issues.apache.org/jira/browse/SPARK-5634 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Marcelo Vanzin Priority: Minor If you go to the history server, and click on Show incomplete applications, but there are no incomplete applications, you get a misleading message: {noformat} No completed applications found! Did you specify the correct logging directory? (etc etc) {noformat} That's the same message used when no complete applications are found; it should probably be tweaked for the incomplete apps case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5633) pyspark saveAsTextFile support for compression codec
[ https://issues.apache.org/jira/browse/SPARK-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308128#comment-14308128 ] Vladimir Vladimirov commented on SPARK-5633: Here is a workaround until the proposed functionality is accepted: {code}
from pyspark import SparkContext

def saveAsTextFileCompressed(t, path, codec="org.apache.hadoop.io.compress.GzipCodec"):
    # Encode every record as UTF-8 bytes, bypass the Python serializer, and
    # hand the underlying Java RDD the requested Hadoop compression codec.
    def func(split, iterator):
        for x in iterator:
            if not isinstance(x, basestring):
                x = unicode(x)
            if isinstance(x, unicode):
                x = x.encode("utf-8")
            yield x
    keyed = t.mapPartitionsWithIndex(func)
    keyed._bypass_serializer = True
    codecClass = SparkContext._jvm.java.lang.Class.forName(codec)
    keyed._jrdd.map(t.ctx._jvm.BytesToString()).saveAsTextFile(path, codecClass)
{code} pyspark saveAsTextFile support for compression codec Key: SPARK-5633 URL: https://issues.apache.org/jira/browse/SPARK-5633 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.2.0 Reporter: Vladimir Vladimirov Priority: Minor Scala and Java API allows to provide compression codec with saveAsTextFile(path, codec) PySpark saveAsTextFile API does not support passing codec class. This story is about adding saveAsTextFile(path, codec) support into pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5622) Add connector/handler hive configuration settings to hive-thrift-server
[ https://issues.apache.org/jira/browse/SPARK-5622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308141#comment-14308141 ] Apache Spark commented on SPARK-5622: - User 'alexliu68' has created a pull request for this issue: https://github.com/apache/spark/pull/4406 Add connector/handler hive configuration settings to hive-thrift-server --- Key: SPARK-5622 URL: https://issues.apache.org/jira/browse/SPARK-5622 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0, 1.1.1 Reporter: Alex Liu When integrate Cassandra Storage handler to Spark SQL, we need pass some configuration settings to Hive-thrift-server hiveConf during server starting process. e.g. {code} ./sbin/start-thriftserver.sh --hiveconf cassandra.username=cassandra --hiveconf cassandra.password=cassandra {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308131#comment-14308131 ] Apache Spark commented on SPARK-5493: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/4405 Support proxy users under kerberos -- Key: SPARK-5493 URL: https://issues.apache.org/jira/browse/SPARK-5493 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Brock Noland When using kerberos, services may want to use spark-submit to submit jobs as a separate user. For example a service like oozie might want to submit jobs as a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups
[ https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308133#comment-14308133 ] Nicholas Chammas commented on SPARK-5335: - For the record: [AWS says|https://forums.aws.amazon.com/thread.jspa?messageID=572559] you must use the group ID (as opposed to the name) when deleting groups within a VPC. Destroying cluster in VPC with --delete-groups fails to remove security groups Key: SPARK-5335 URL: https://issues.apache.org/jira/browse/SPARK-5335 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor When I try to remove security groups using the --delete-groups option of the script, it fails because in a VPC security groups must be removed by ID, not by name as the script does now. {code} $ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups destroy SparkByScript Are you sure you want to destroy the cluster SparkByScript? The following instances will be terminated: Searching for existing cluster SparkByScript... ALL DATA ON ALL NODES WILL BE LOST!! Destroy cluster SparkByScript (y/N): y Terminating master... Terminating slaves... Deleting security groups (this will take some time)... Waiting for cluster to enter 'terminated' state. Cluster is now in 'terminated' state. Waited 0 seconds. Attempt 1 Deleting rules in security group SparkByScript-slaves Deleting rules in security group SparkByScript-master ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>60313fac-5d47-48dd-a8bf-e9832948c0a6</RequestID></Response> Failed to delete security group SparkByScript-slaves ERROR:boto:400 Bad Request ERROR:boto:<?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>74ff8431-c0c1-4052-9ecb-c0adfa7eeeac</RequestID></Response> Failed to delete security group SparkByScript-master Attempt 2 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5604) Remove setCheckpointDir from LDA and tree Strategy
[ https://issues.apache.org/jira/browse/SPARK-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308166#comment-14308166 ] Apache Spark commented on SPARK-5604: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4407 Remove setCheckpointDir from LDA and tree Strategy -- Key: SPARK-5604 URL: https://issues.apache.org/jira/browse/SPARK-5604 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Continue the discussion from the LDA PR. CheckpoingDir is a global Spark configuration, which should not be altered by an ML algorithm. We could check whether checkpointDir is set if checkpointInterval is positive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5633) pyspark saveAsTextFile support for compression codec
[ https://issues.apache.org/jira/browse/SPARK-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308077#comment-14308077 ] Vladimir Vladimirov commented on SPARK-5633: Here is pull request that adds mentioned functionality https://github.com/apache/spark/pull/4403 pyspark saveAsTextFile support for compression codec Key: SPARK-5633 URL: https://issues.apache.org/jira/browse/SPARK-5633 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.2.0 Reporter: Vladimir Vladimirov Priority: Minor Scala and Java API allows to provide compression codec with saveAsTextFile(path, codec) PySpark saveAsTextFile API does not support passing codec class. This story is about adding saveAsTextFile(path, codec) support into pyspark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5620) Group methods in generated unidoc
[ https://issues.apache.org/jira/browse/SPARK-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308084#comment-14308084 ] Apache Spark commented on SPARK-5620: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4404 Group methods in generated unidoc - Key: SPARK-5620 URL: https://issues.apache.org/jira/browse/SPARK-5620 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Having methods show up in groups makes the doc more readable. For ML, we have many parameters and their setters/getters, it is necessary to group them. Same applies to the new DataFrame API. The grouping disappeared in recent versions of sbt-unidoc. We may miss some compiler options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lishu Liu updated SPARK-5632: - Description: My cassandra table task_trace has a field sm.result which contains dot in the name. So SQL tried to look up sm instead of full name 'sm.result'. Here is my code: scala import org.apache.spark.sql.cassandra.CassandraSQLContext scala val cc = new CassandraSQLContext(sc) scala val task_trace = cc.jsonFile(/task_trace.json) scala task_trace.registerTempTable(task_trace) scala cc.setKeyspace(cerberus_data_v4) scala val res = cc.sql(SELECT received_datetime, task_body.cerberus_id, task_body.sm.result FROM task_trace WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce') res: org.apache.spark.sql.SchemaRDD = SchemaRDD[57] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, cerberus_id, couponId, coupon_code, created, description, domain, expires, message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, validity The full schema look like this: scala task_trace.printSchema() root \|-- received_datetime: long (nullable = true) \|-- task_body: struct (nullable = true) \|\|-- cerberus_batch_id: string (nullable = true) \|\|-- cerberus_id: string (nullable = true) \|\|-- couponId: integer (nullable = true) \|\|-- coupon_code: string (nullable = true) \|\|-- created: string (nullable = true) \|\|-- description: string (nullable = true) \|\|-- domain: string (nullable = true) \|\|-- expires: string (nullable = true) \|\|-- message_id: string (nullable = true) \|\|-- neverShowAfter: string (nullable = true) \|\|-- neverShowBefore: string (nullable = true) \|\|-- offerTitle: string (nullable = true) \|\|-- screenshots: array (nullable = true) \|\|\|-- element: string (containsNull = false) \|\|-- sm.result: struct (nullable = true) \|\|\|-- cerberus_batch_id: string (nullable = true) \|\|\|-- cerberus_id: string (nullable = true) \|\|\|-- code: string (nullable = true) \|\|\|-- couponId: integer (nullable = true) \|\|\|-- created: string (nullable = true) \|\|\|-- description: string (nullable = true) \|\|\|-- domain: string (nullable = true) \|\|\|-- expires: string (nullable = true) \|\|\|-- message_id: string (nullable = true) \|\|\|-- neverShowAfter: string (nullable = true) \|\|\|-- neverShowBefore: string (nullable = true) \|\|\|-- offerTitle: string (nullable = true) \|\|\|-- result: struct (nullable = true) \|\|\|\|-- post: struct (nullable = true) \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: boolean (nullable = true) \|\|\|\|\|-- meta: struct (nullable = true) \|\|\|\|\|\|-- None_tx_value: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- exceptions: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- no_input_value: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- not_mapped: array (nullable = true) \|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|\|-- not_transformed: array (nullable = true) \|\|\|\|\|\|\|-- element: array (containsNull = false) \|\|\|\|\|\|\|\|-- element: string (containsNull = false) \|\|\|\|\|-- now_price_checkout: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: double 
(nullable = true) \|\|\|\|\|-- shipping_price: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: double (nullable = true) \|\|\|\|\|-- tax: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: double (nullable = true) \|\|\|\|\|-- total: struct (nullable = true) \|\|\|\|\|\|-- ci: double (nullable = true) \|\|\|\|\|\|-- value: double (nullable = true) \|\|\|\|-- pre: struct (nullable = true) \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true) \|\|\|\|\|\|-- ci: double
[jira] [Resolved] (SPARK-5528) Support schema merging while reading Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5528. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4308 [https://github.com/apache/spark/pull/4308] Support schema merging while reading Parquet files -- Key: SPARK-5528 URL: https://issues.apache.org/jira/browse/SPARK-5528 Project: Spark Issue Type: Improvement Reporter: Cheng Lian Fix For: 1.3.0 Spark 1.2.0 and prior versions only reads Parquet schema from {{_metadata}} or a random Parquet part-file, and assumes all part-files share exactly the same schema. In practice, it's common that users append new columns to existing Parquet schema. Parquet has native schema merging support for such scenarios. Spark SQL should also support this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308263#comment-14308263 ] Marcelo Vanzin commented on SPARK-5388: --- Also, a fun fact about the Jersey dependency. Here's an excerpt of the output of mvn dependency:tree for the yarn module: {noformat} [INFO] +- org.apache.hadoop:hadoop-yarn-common:jar:2.4.0:compile [INFO] | +- javax.xml.bind:jaxb-api:jar:2.2.2:compile [INFO] | | +- javax.xml.stream:stax-api:jar:1.0-2:compile [INFO] | | \- javax.activation:activation:jar:1.1:compile [INFO] | +- org.apache.commons:commons-compress:jar:1.4.1:compile [INFO] | | \- org.tukaani:xz:jar:1.0:compile [INFO] | +- commons-codec:commons-codec:jar:1.5:compile [INFO] | +- com.sun.jersey:jersey-core:jar:1.9:compile {noformat} Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5509) EqualTo operator doesn't handle binary type properly
[ https://issues.apache.org/jira/browse/SPARK-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5509. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4308 [https://github.com/apache/spark/pull/4308] EqualTo operator doesn't handle binary type properly Key: SPARK-5509 URL: https://issues.apache.org/jira/browse/SPARK-5509 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1 Reporter: Cheng Lian Fix For: 1.3.0 Binary type is mapped to {{Array\[Byte\]}}, which can't be compared with {{==}} directly. However, {{EqualTo.eval()}} uses plain {{==}} to compare values. Run the following {{spark-shell}} snippet with Spark 1.2.0 to reproduce this issue: {code}
import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
case class KV(key: Int, value: Array[Byte])
def toBinary(s: String): Array[Byte] = s.toString.getBytes("UTF-8")
registerFunction("toBinary", toBinary _)
parallelize(1 to 1024).map(i => KV(i, toBinary(i.toString))).registerTempTable("bin")
// OK
sql("select * from bin where value > toBinary('100')").collect()
// Oops, returns nothing
sql("select * from bin where value = toBinary('100')").collect()
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
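The root cause the description points at is plain Scala/Java array semantics: {{==}} on {{Array\[Byte\]}} is reference equality, so two equal byte sequences never compare equal. A quick illustration:
{code}
val a = "100".getBytes("UTF-8")
val b = "100".getBytes("UTF-8")
a == b                        // false: arrays compare by reference
a.sameElements(b)             // true: element-wise comparison
java.util.Arrays.equals(a, b) // true
{code}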
[jira] [Created] (SPARK-5635) Allow users to run .scala files directly from spark-submit
Grant Henke created SPARK-5635: -- Summary: Allow users to run .scala files directly from spark-submit Key: SPARK-5635 URL: https://issues.apache.org/jira/browse/SPARK-5635 Project: Spark Issue Type: New Feature Components: Spark Core, Spark Shell Reporter: Grant Henke Priority: Minor Similar to the python functionality allow users to submit .scala files. Currently the way I simulate this is to use spark-shell and run: `spark-shell -i myscript.scala` Note: user needs to add exit to the bottom of the script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5637) Expose spark_ec2 as a StarCluster Plugin
Alex Rothberg created SPARK-5637: Summary: Expose spark_ec2 as a StarCluster Plugin Key: SPARK-5637 URL: https://issues.apache.org/jira/browse/SPARK-5637 Project: Spark Issue Type: Improvement Reporter: Alex Rothberg Priority: Minor StarCluster has a lot of features in place for starting EC2 instances, and it would be great to have an option to leverage that as a plugin. See: http://star.mit.edu/cluster/docs/latest/index.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5638) Add a config flag to disable eager analysis of DataFrames
Reynold Xin created SPARK-5638: -- Summary: Add a config flag to disable eager analysis of DataFrames Key: SPARK-5638 URL: https://issues.apache.org/jira/browse/SPARK-5638 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Since DataFrames are eagerly analyzed, there is no way to construct a DataFrame that is invalid anymore (which can be very useful for debugging invalid queries). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5638) Add a config flag to disable eager analysis of DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308286#comment-14308286 ] Apache Spark commented on SPARK-5638: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4408 Add a config flag to disable eager analysis of DataFrames - Key: SPARK-5638 URL: https://issues.apache.org/jira/browse/SPARK-5638 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Since DataFrames are eagerly analyzed, there is no longer a way to construct an invalid DataFrame (which can be very useful for debugging invalid queries). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5639) Support DataFrame.renameColumn
Reynold Xin created SPARK-5639: -- Summary: Support DataFrame.renameColumn Key: SPARK-5639 URL: https://issues.apache.org/jira/browse/SPARK-5639 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is incredibly hard to rename a column using the existing DSL. Let's provide that out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
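To make the "incredibly hard" part concrete: with the current DSL a rename means re-selecting every column and aliasing the one being changed. A rough sketch with made-up column names:
{code}
// Rename "name" to "full_name" today: every other column must be listed explicitly.
val renamed = df.select(
  df("name").as("full_name"),
  df("age"),
  df("city"))

// The requested API would reduce this to a single call along the lines of
// df.renameColumn("name", "full_name")
{code}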
[jira] [Commented] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308324#comment-14308324 ] Marcelo Vanzin commented on SPARK-3454: --- Hi [~imranr], There are two ways I can see to solve the routing problem: The first is the one you mention. I like it because, as you say, it keeps the API consistent across different UIs. You always look at /json, not some subpath that depends on which daemon you're looking at. The second is to remove the notion of an application list from this spec. That means the /json tree would be mounted under the application's path, not at the root of the web server. The downside is that when you add an API to list applications to the master / history server, things will look weird (you have /json/v1/applications and /{applicationId}/json/v1 instead of a single tree). Clients would have to adapt depending on whether they're talking to an app directly, or to the master / history server. So yeah, I like your suggestion better. Expose JSON representation of data shown in WebUI - Key: SPARK-3454 URL: https://issues.apache.org/jira/browse/SPARK-3454 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Kousuke Saruta Attachments: sparkmonitoringjsondesign.pdf If WebUI support to JSON format extracting, it's helpful for user who want to analyse stage / task / executor information. Fortunately, WebUI has renderJson method so we can implement the method in each subclass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5639) Support DataFrame.renameColumn
[ https://issues.apache.org/jira/browse/SPARK-5639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308328#comment-14308328 ] Apache Spark commented on SPARK-5639: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4410 Support DataFrame.renameColumn -- Key: SPARK-5639 URL: https://issues.apache.org/jira/browse/SPARK-5639 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is incredibly hard to rename a column using the existing DSL. Let's provide that out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5620) Group methods in generated unidoc
[ https://issues.apache.org/jira/browse/SPARK-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5620. Resolution: Fixed Fix Version/s: 1.3.0 Group methods in generated unidoc - Key: SPARK-5620 URL: https://issues.apache.org/jira/browse/SPARK-5620 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 Having methods show up in groups makes the doc more readable. For ML, we have many parameters and their setters/getters, so it is necessary to group them. The same applies to the new DataFrame API. The grouping disappeared in recent versions of sbt-unidoc; we may be missing some compiler options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
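For reference, the grouping relies on scaladoc group tags in the sources plus the {{-groups}} scaladoc option; the snippet below is only a sketch of that mechanism with invented names, not the project's actual build wiring:
{code}
/**
 * @groupname param Parameters
 * @groupname getParam Parameter getters
 */
abstract class ExampleEstimator {
  /** Sets the regularization parameter. @group param */
  def setRegParam(value: Double): this.type = this

  /** Gets the regularization parameter. @group getParam */
  def getRegParam: Double = 0.0
}

// In the sbt build, the doc task must be passed the option that enables
// grouping, e.g. (exact location assumed):
//   scalacOptions in (ScalaUnidoc, unidoc) += "-groups"
{code}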
[jira] [Resolved] (SPARK-5604) Remove setCheckpointDir from LDA and tree Strategy
[ https://issues.apache.org/jira/browse/SPARK-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5604. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4390 [https://github.com/apache/spark/pull/4390] Remove setCheckpointDir from LDA and tree Strategy -- Key: SPARK-5604 URL: https://issues.apache.org/jira/browse/SPARK-5604 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 Continue the discussion from the LDA PR. CheckpointDir is a global Spark configuration, which should not be altered by an ML algorithm. We could check whether checkpointDir is set if checkpointInterval is positive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
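A minimal sketch of the check suggested in the description, assuming it lives in the algorithm's parameter validation; the method name is illustrative:
{code}
import org.apache.spark.SparkContext

// Fail fast if periodic checkpointing is requested but no checkpoint
// directory has been configured on the SparkContext.
def validateCheckpointing(sc: SparkContext, checkpointInterval: Int): Unit = {
  require(checkpointInterval <= 0 || sc.getCheckpointDir.isDefined,
    "checkpointInterval is positive, but no checkpoint directory is set; " +
      "call SparkContext.setCheckpointDir first")
}
{code}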
[jira] [Resolved] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5182. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4308 [https://github.com/apache/spark/pull/4308] Partitioning support for tables created by the data source API -- Key: SPARK-5182 URL: https://issues.apache.org/jira/browse/SPARK-5182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3575. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4308 [https://github.com/apache/spark/pull/4308] Hive Schema is ignored when using convertMetastoreParquet - Key: SPARK-3575 URL: https://issues.apache.org/jira/browse/SPARK-3575 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.3.0 This can cause problems when for example one of the columns is defined as TINYINT. A class cast exception will be thrown since the parquet table scan produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5624) Can't find new column
[ https://issues.apache.org/jira/browse/SPARK-5624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308261#comment-14308261 ] Alex Liu commented on SPARK-5624: - Test it on the latest master branch it doesn't have this issue. Can't find new column -- Key: SPARK-5624 URL: https://issues.apache.org/jira/browse/SPARK-5624 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1 Reporter: Alex Liu Priority: Minor The following test fails {code} 0: jdbc:hive2://localhost:1 DROP TABLE IF EXISTS alter_test_table; +-+ | Result | +-+ +-+ No rows selected (0.175 seconds) 0: jdbc:hive2://localhost:1 DROP TABLE IF EXISTS alter_test_table_ctas; +-+ | Result | +-+ +-+ No rows selected (0.155 seconds) 0: jdbc:hive2://localhost:1 DROP TABLE IF EXISTS alter_test_table_renamed; +-+ | Result | +-+ +-+ No rows selected (0.162 seconds) 0: jdbc:hive2://localhost:1 CREATE TABLE alter_test_table (foo INT, bar STRING) COMMENT 'table to test DDL ops' PARTITIONED BY (ds STRING) STORED AS TEXTFILE; +-+ | result | +-+ +-+ No rows selected (0.247 seconds) 0: jdbc:hive2://localhost:1 LOAD DATA LOCAL INPATH '/Users/alex/project/automaton/resources/tests/data/files/kv1.txt' OVERWRITE INTO TABLE alter_test_table PARTITION (ds='2008-08-08'); +-+ | result | +-+ +-+ No rows selected (0.367 seconds) 0: jdbc:hive2://localhost:1 CREATE TABLE alter_test_table_ctas as SELECT * FROM alter_test_table; +--+--+-+ | foo | bar | ds | +--+--+-+ +--+--+-+ No rows selected (0.641 seconds) 0: jdbc:hive2://localhost:1 ALTER TABLE alter_test_table ADD COLUMNS (new_col1 INT); +-+ | result | +-+ +-+ No rows selected (0.226 seconds) 0: jdbc:hive2://localhost:1 INSERT OVERWRITE TABLE alter_test_table PARTITION (ds='2008-08-15') SELECT foo, bar, 3 FROM alter_test_table_ctas WHERE ds='2008-08-08'; +--+--+--+ | foo | bar | c_2 | +--+--+--+ +--+--+--+ No rows selected (0.522 seconds) 0: jdbc:hive2://localhost:1 select * from alter_test_table ; Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 35.0 failed 4 times, most recent failure: Lost task 0.3 in stage 35.0 (TID 66, 127.0.0.1): java.lang.RuntimeException: cannot find field new_col1 from [0:foo, 1:bar] org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:367) org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldRef(LazySimpleStructObjectInspector.java:168) org.apache.spark.sql.hive.HadoopTableReader$$anonfun$9.apply(TableReader.scala:275) org.apache.spark.sql.hive.HadoopTableReader$$anonfun$9.apply(TableReader.scala:275) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) scala.collection.TraversableLike$class.map(TraversableLike.scala:244) scala.collection.AbstractTraversable.map(Traversable.scala:105) org.apache.spark.sql.hive.HadoopTableReader$.fillObject(TableReader.scala:275) org.apache.spark.sql.hive.HadoopTableReader$$anonfun$3$$anonfun$apply$1.apply(TableReader.scala:193) org.apache.spark.sql.hive.HadoopTableReader$$anonfun$3$$anonfun$apply$1.apply(TableReader.scala:187) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308260#comment-14308260 ] Marcelo Vanzin commented on SPARK-5388: --- Hi [~andrewor14], Thanks for updating the spec! This one looks much, much better. I think most of my concerns have been addressed. Adherence to RESTfulness is not super important since this is an internal API, although I really would suggest picking a better name for the Scala package (e.g. org.apache.spark.deploy.proto or something, instead of rest). A few questions: - is the action field required? Since you have different URIs handling different messages, it seems redundant now. And responses having an action is kinda weird. - what is the protocolVersion field in ErrorResponse? From the spec, it sounds like the maximum protocol version supported by the server. If that's the case, can the property be renamed to maxProtocolVersion? - the message definitions use strings for all data, is that intentional? It would feel more natural to have proper types, e.g.: jars : [ one.jar, two.jar ], driverCores: 8, superviseDriver: false. - The spec says the server should report unknown fields back to the client. There's nothing in the response type that supports that; is the server expected to embed that information in the message field? Feels like it would be better to have an explicit field for that. - Is the kill endpoint protected in any way? Right now it seems like anyone can post to that and kill a driver, if they know (or guess) the submission ID. If there's no special protection, I'd say in the spec that the submission ID should be, at least, cryptographically secure. At that point, as long as the server has SSL enabled, it should be hard enough to kill a random driver. Provide a stable application submission gateway in standalone cluster mode -- Key: SPARK-5388 URL: https://issues.apache.org/jira/browse/SPARK-5388 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Attachments: stable-spark-submit-in-standalone-mode-2-4-15.pdf The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel. For more detail, please see the most recent design doc attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
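To illustrate the "proper types" point above, here is what a typed submission message could look like instead of the all-strings encoding; the class and field names follow the examples in the comment and are not the spec's final schema:
{code}
// Illustrative shape only: lists, numbers and booleans instead of strings.
case class CreateSubmissionRequest(
  appResource: String,
  mainClass: String,
  jars: Seq[String] = Seq("one.jar", "two.jar"),
  driverCores: Int = 8,
  superviseDriver: Boolean = false)
{code}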
[jira] [Created] (SPARK-5636) Lower dynamic allocation add interval
Andrew Or created SPARK-5636: Summary: Lower dynamic allocation add interval Key: SPARK-5636 URL: https://issues.apache.org/jira/browse/SPARK-5636 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or The current default of 1 min is a little long, especially since a recent patch causes the number of executors to start at 0 by default. We should ramp up much more quickly in the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5557) spark-shell failed to start
[ https://issues.apache.org/jira/browse/SPARK-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308346#comment-14308346 ] Patrick Wendell commented on SPARK-5557: I can send a fix for this shortly. It also works fine if you build with Hadoop 2 support. spark-shell failed to start --- Key: SPARK-5557 URL: https://issues.apache.org/jira/browse/SPARK-5557 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Guoqiang Li Priority: Blocker the log: {noformat} 5/02/03 19:06:39 INFO spark.HttpServer: Starting HTTP Server Exception in thread main java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse at org.apache.spark.HttpServer.org$apache$spark$HttpServer$$doStart(HttpServer.scala:75) at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1774) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1765) at org.apache.spark.HttpServer.start(HttpServer.scala:62) at org.apache.spark.repl.SparkIMain.init(SparkIMain.scala:130) at org.apache.spark.repl.SparkILoop$SparkILoopInterpreter.init(SparkILoop.scala:185) at org.apache.spark.repl.SparkILoop.createInterpreter(SparkILoop.scala:214) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:946) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1039) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:403) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:77) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: javax.servlet.http.HttpServletResponse at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 25 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5621) Cannot fetch dependencies for mllib
Luca Venturini created SPARK-5621: - Summary: Cannot fetch dependencies for mllib Key: SPARK-5621 URL: https://issues.apache.org/jira/browse/SPARK-5621 Project: Spark Issue Type: Bug Reporter: Luca Venturini The mllib docs say to include com.github.fommil.netlib:all:1.1.2, but I cannot fetch any jar for this dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307580#comment-14307580 ] Philippe Girolami commented on SPARK-1867: -- Has anyone figured this out ? I'm seeing this happen when running spark-shell off the master branch (at cd5da42), using the same example as [~ansonism]. Works fine in 1.2.0, downloaded from the website. {code} val source = sc.textFile(/tmp/testfile.txt) source.saveAsTextFile(/tmp/test_spark_output) {code} I built master using {code} mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -Pbigtop-dist -DskipTests clean package install {code} on MacOS using Sun Java 7 {quote} java version 1.7.0_60 Java(TM) SE Runtime Environment (build 1.7.0_60-b19) Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode) {quote} Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307584#comment-14307584 ] Kostas Sakellis commented on SPARK-5081: Can you add a sample of the code too? Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of shuffle write showing in spark web UI is much different when I execute same spark job with same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set spark.shuffle.manager option to hash because it's default value is changed but spark 1.2 still writes shuffle output more than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2. spark 1.1 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |9|saveAsTextFile| |1169.4KB| | |12|combineByKey| |1265.4KB|1275.0KB| |6|sortByKey| |1276.5KB| | |8|mapPartitions| |91.0MB|1383.1KB| |4|apply| |89.4MB| | |5|sortBy|155.6MB| |98.1MB| |3|sortBy|155.6MB| | | |1|collect| |2.1MB| | |2|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | spark 1.2 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |12|saveAsTextFile| |1170.2KB| | |11|combineByKey| |1264.5KB|1275.0KB| |8|sortByKey| |1273.6KB| | |7|mapPartitions| |134.5MB|1383.1KB| |5|zipWithIndex| |132.5MB| | |4|sortBy|155.6MB| |146.9MB| |3|sortBy|155.6MB| | | |2|collect| |2.0MB| | |1|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5622) Add connector/handler hive configuration settings to hive-thrift-server
Alex Liu created SPARK-5622: --- Summary: Add connector/handler hive configuration settings to hive-thrift-server Key: SPARK-5622 URL: https://issues.apache.org/jira/browse/SPARK-5622 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.1, 1.1.0 Reporter: Alex Liu When integrating the Cassandra storage handler with Spark SQL, we need to pass some configuration settings to the hive-thrift-server hiveConf during the server start-up process. e.g. {code} ./sbin/start-thriftserver.sh --hiveconf cassandra.username=cassandra --hiveconf cassandra.password=cassandra {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5623) Replace an obsolete mapReduceTriplets with a new aggregateMessages in GraphSuite
Takeshi Yamamuro created SPARK-5623: --- Summary: Replace an obsolete mapReduceTriplets with a new aggregateMessages in GraphSuite Key: SPARK-5623 URL: https://issues.apache.org/jira/browse/SPARK-5623 Project: Spark Issue Type: Test Components: GraphX Reporter: Takeshi Yamamuro -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
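For a sense of the migration (this is not the GraphSuite change itself): the same aggregation expressed with the deprecated operator and the new one, here counting in-degrees.
{code}
import org.apache.spark.graphx._

// Old, deprecated API: emit (dstId, 1) per edge and sum per vertex.
def inDegreesOld(graph: Graph[Int, Int]): VertexRDD[Int] =
  graph.mapReduceTriplets[Int](et => Iterator((et.dstId, 1)), _ + _)

// New API: the same computation via aggregateMessages.
def inDegreesNew(graph: Graph[Int, Int]): VertexRDD[Int] =
  graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)
{code}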
[jira] [Commented] (SPARK-5013) User guide for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307605#comment-14307605 ] Apache Spark commented on SPARK-5013: - User 'tgaloppo' has created a pull request for this issue: https://github.com/apache/spark/pull/4401 User guide for Gaussian Mixture Model - Key: SPARK-5013 URL: https://issues.apache.org/jira/browse/SPARK-5013 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Travis Galoppo Add GMM user guide with code examples in Scala/Java/Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
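Roughly the kind of Scala example the guide would contain; the input path and k below are placeholders:
{code}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric vectors from a text file.
val data = sc.textFile("data/mllib/gmm_data.txt")
  .map(line => Vectors.dense(line.trim.split(' ').map(_.toDouble)))
  .cache()

// Fit a mixture of 2 Gaussians with EM.
val model = new GaussianMixture().setK(2).run(data)

model.gaussians.zipWithIndex.foreach { case (g, i) =>
  println(s"weight=${model.weights(i)} mu=${g.mu} sigma=\n${g.sigma}")
}
{code}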
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307628#comment-14307628 ] Stephane Maarek commented on SPARK-5480: It happened once after one of my server failed, but the graph vertices and edges count did work. Doesn't happen systematically... having issues reproducing it val subgraph = graph.subgraph ( vpred = (id,article) = article._1.toLowerCase.contains(stringToSearchFor) || article._3.exists(keyword = keyword.contains(stringToSearchFor)) || (article._2 match { case None = false case Some(articleAbstract) = articleAbstract.toLowerCase.contains(stringToSearchFor) }) ).cache() GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException: --- Key: SPARK-5480 URL: https://issues.apache.org/jira/browse/SPARK-5480 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Environment: Yarn client Reporter: Stephane Maarek Running the following code: val subgraph = graph.subgraph ( vpred = (id,article) = //working predicate) ).cache() println( sSubgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges) val prGraph = subgraph.staticPageRank(5).cache val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) { (v, title, rank) = (rank.getOrElse(0.0), title) } titleAndPrGraph.vertices.top(13) { Ordering.by((entry: (VertexId, (Double, _))) = entry._2._1) }.foreach(t = println(t._2._2._1 + : + t._2._1 + , id: + t._1)) Returns a graph with 5000 nodes and 4000 edges. Then it crashes during the PageRank with the following: 15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes) 15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64) at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110) at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) at org.apache.spark.rdd.RDD.iterator(RDD.scala:228) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
[jira] [Resolved] (SPARK-5621) Cannot fetch dependencies for mllib
[ https://issues.apache.org/jira/browse/SPARK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5621. -- Resolution: Not a Problem It certainly exists: http://search.maven.org/#artifactdetails%7Ccom.github.fommil.netlib%7Call%7C1.1.2%7Cpom The docs actually suggest using the {{netlib-lgpl}} profile, and if you have a look at both of these you'll see that it's a pom-only artifact, so you need {{<type>pom</type>}}. Cannot fetch dependencies for mllib --- Key: SPARK-5621 URL: https://issues.apache.org/jira/browse/SPARK-5621 Project: Spark Issue Type: Bug Reporter: Luca Venturini The mllib docs say to include com.github.fommil.netlib:all:1.1.2, but I cannot fetch any jar for this dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
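Concretely, a dependency along these lines should resolve (sbt shown as a sketch; the key detail is depending on the artifact as a POM, since it publishes no jar of its own, and the {{netlib-lgpl}} build profile remains the documented route):
{code}
// build.sbt -- direct-dependency alternative to the netlib-lgpl profile.
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
{code}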
[jira] [Commented] (SPARK-5610) Generate Java docs without package private classes and methods
[ https://issues.apache.org/jira/browse/SPARK-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307599#comment-14307599 ] Sean Owen commented on SPARK-5610: -- From looking at the Javadoc 8 + unidoc issue, I recall that we had a problem where {{private[foo]}} classes were being rendered as private top-level Java classes, which isn't legal to Javadoc 8. This bit of code you change is the bit that fixed that particular problem. Is this going to cause such classes to be private again? That feels a bit wrong, since such classes aren't really meaningful in Java anyway. You can certainly tell javadoc to only generate docs for public / protected classes. In fact that should be the default. So I wonder if the right-er change is to render such classes as package-private in Java? It doesn't mean quite the same thing but may be entirely close enough for genjavadoc purposes. Generate Java docs without package private classes and methods -- Key: SPARK-5610 URL: https://issues.apache.org/jira/browse/SPARK-5610 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Xiangrui Meng Assignee: Xiangrui Meng The current generated Java doc is a mix of public and package-private classes and methods. We can update genjavadoc to hide them. Upstream PR: https://github.com/typesafehub/genjavadoc/pull/47 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1867. -- Resolution: Not a Problem I think there are a number of manifestations of the same basic problem here: mismatching versions of Spark. In each case it seems like bits and pieces of Hadoop and Spark have been built into an app, or the cluster's version of Spark was not matched with the Hadoop version. I am not 100% sure since there is a load of stuff being talked about here, but I do not see a clear problem in Spark or actionable change. Philippe I imagine you are reporting something different: SPARK-5557 Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt then trying to run it on the cluster I've also included code snippets, and sbt deps at the bottom. When I've Googled this, there seems to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: you have user code using version X and node Y has version Z. I would be very grateful for advice on this. 
The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet: val conf = new SparkConf() .setMaster(clusterMaster) .setAppName(appName) .setSparkHome(sparkHome) .setJars(SparkContext.jarOfClass(this.getClass)) println(count = + new SparkContext(conf).textFile(someHdfsPath).count()) My SBT dependencies: // relevant org.apache.spark % spark-core_2.10 % 0.9.1, org.apache.hadoop % hadoop-client % 2.3.0-mr1-cdh5.0.0, // standard, probably unrelated com.github.seratch %% awscala % [0.2,), org.scalacheck %% scalacheck % 1.10.1 % test, org.specs2 %% specs2 % 1.14 % test, org.scala-lang % scala-reflect % 2.10.3, org.scalaz %% scalaz-core % 7.0.5, net.minidev % json-smart %
[jira] [Commented] (SPARK-2827) Add DegreeDist function support
[ https://issues.apache.org/jira/browse/SPARK-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307499#comment-14307499 ] Apache Spark commented on SPARK-2827: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/4399 Add DegreeDist function support --- Key: SPARK-2827 URL: https://issues.apache.org/jira/browse/SPARK-2827 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Lu Lu Add degree distribution operators in GraphOps for GraphX. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
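For a sense of what such an operator computes, a sketch built from existing GraphOps (not the patch itself): the number of vertices at each degree.
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Map each vertex to its degree, then count vertices per degree value.
def degreeDistribution[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): RDD[(Int, Long)] =
  graph.degrees.map { case (_, deg) => (deg, 1L) }.reduceByKey(_ + _)
{code}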
[jira] [Created] (SPARK-5620) Group methods in generated unidoc
Xiangrui Meng created SPARK-5620: Summary: Group methods in generated unidoc Key: SPARK-5620 URL: https://issues.apache.org/jira/browse/SPARK-5620 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Having methods show up in groups makes the doc more readable. For ML, we have many parameters and their setters/getters, so it is necessary to group them. The same applies to the new DataFrame API. The grouping disappeared in recent versions of sbt-unidoc; we may be missing some compiler options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14307482#comment-14307482 ] Josh Rosen commented on SPARK-4897: --- By the way, it might be nice to see if we can figure out a good way of subdividing this task across multiple PRs so that the pieces that we have already figured out don't end up bitrotting / becoming merge-conflicts. For instance, if we can test the `cloudpickle.py` file separately from the other modules, then we could submit a PR that only adds 3.4 support to that file. If you can spot any other natural subproblems here, leave a comment or create a sub-task on this JIRA ticket. Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type help, copyright, credits or license for more information. Traceback (most recent call last): File /Users/joshrosen/Documents/Spark/python/pyspark/shell.py, line 28, in module import pyspark File /Users/joshrosen/Documents/spark/python/pyspark/__init__.py, line 41, in module from pyspark.context import SparkContext File /Users/joshrosen/Documents/spark/python/pyspark/context.py, line 26, in module from pyspark import accumulators File /Users/joshrosen/Documents/spark/python/pyspark/accumulators.py, line 97, in module from pyspark.cloudpickle import CloudPickler File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 120, in module class CloudPickler(pickle.Pickler): File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be hard difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5557) Servlet API classes now missing after jetty shading
[ https://issues.apache.org/jira/browse/SPARK-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308374#comment-14308374 ] Kostas Sakellis commented on SPARK-5557: [~pwendell] recommended this which did the trick: {code} diff --git a/core/pom.xml b/core/pom.xml index 2dc5f74..f03ec47 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -132,6 +132,11 @@ artifactIdjetty-servlet/artifactId scopecompile/scope /dependency +dependency + groupIdorg.eclipse.jetty.orbit/groupId + artifactIdjavax.servlet/artifactId + version3.0.0.v201112011016/version +/dependency dependency groupIdorg.apache.commons/groupId {code} Servlet API classes now missing after jetty shading --- Key: SPARK-5557 URL: https://issues.apache.org/jira/browse/SPARK-5557 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Guoqiang Li Priority: Blocker the log: {noformat} 5/02/03 19:06:39 INFO spark.HttpServer: Starting HTTP Server Exception in thread main java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse at org.apache.spark.HttpServer.org$apache$spark$HttpServer$$doStart(HttpServer.scala:75) at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1774) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1765) at org.apache.spark.HttpServer.start(HttpServer.scala:62) at org.apache.spark.repl.SparkIMain.init(SparkIMain.scala:130) at org.apache.spark.repl.SparkILoop$SparkILoopInterpreter.init(SparkILoop.scala:185) at org.apache.spark.repl.SparkILoop.createInterpreter(SparkILoop.scala:214) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:946) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:942) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1039) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:403) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:77) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: javax.servlet.http.HttpServletResponse at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 
25 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups
[ https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308133#comment-14308133 ] Nicholas Chammas edited comment on SPARK-5335 at 2/6/15 1:15 AM: - For the record: [AWS says|https://forums.aws.amazon.com/thread.jspa?messageID=572559] you must use the group ID (as opposed to the name) when deleting groups within a VPC. That appears to be the root cause of the issue reported here. was (Author: nchammas): For the record: [AWS says|https://forums.aws.amazon.com/thread.jspa?messageID=572559] you must use the group ID (as opposed to the name) when deleting groups within a VPC. Destroying cluster in VPC with --delete-groups fails to remove security groups Key: SPARK-5335 URL: https://issues.apache.org/jira/browse/SPARK-5335 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor When I try to remove security groups using option of the script, it fails because in VPC one should remove security groups by id, not name as it is now. {code} $ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups destroy SparkByScript Are you sure you want to destroy the cluster SparkByScript? The following instances will be terminated: Searching for existing cluster SparkByScript... ALL DATA ON ALL NODES WILL BE LOST!! Destroy cluster SparkByScript (y/N): y Terminating master... Terminating slaves... Deleting security groups (this will take some time)... Waiting for cluster to enter 'terminated' state. Cluster is now in 'terminated' state. Waited 0 seconds. Attempt 1 Deleting rules in security group SparkByScript-slaves Deleting rules in security group SparkByScript-master ERROR:boto:400 Bad Request ERROR:boto:?xml version=1.0 encoding=UTF-8? ResponseErrorsErrorCodeInvalidParameterValue/CodeMessageInvalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation./Message/Error/ErrorsRequestID60313fac-5d47-48dd-a8bf-e9832948c0a6/RequestID/Response Failed to delete security group SparkByScript-slaves ERROR:boto:400 Bad Request ERROR:boto:?xml version=1.0 encoding=UTF-8? ResponseErrorsErrorCodeInvalidParameterValue/CodeMessageInvalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation./Message/Error/ErrorsRequestID74ff8431-c0c1-4052-9ecb-c0adfa7eeeac/RequestID/Response Failed to delete security group SparkByScript-master Attempt 2 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org