Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Pedro Rodriguez
I would be open to working on Dataset documentation if no one else is already working on it. Thoughts? On Fri, Jun 17, 2016 at 11:44 PM, Cheng Lian wrote: > As mentioned in the PR description, this is just an initial PR to bring > existing contents up to date, so that

Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian
As mentioned in the PR description, this is just an initial PR to bring existing contents up to date, so that people can add more contents incrementally. We should definitely cover more about Dataset. Cheng On 6/17/16 10:28 PM, Pedro Rodriguez wrote: The updates look great! Looks like

Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Pedro Rodriguez
The updates look great! Looks like many places are updated to the new APIs, but there still isn't a section on working with Datasets (most of the docs work with DataFrames). Are you planning on adding more? I am thinking of something that would address common questions like the one I posted on the

Re: Spark 2.0 Dataset Documentation

2016-06-17 Thread Cheng Lian
Hey Pedro, SQL programming guide is being updated. Here's the PR, but not merged yet: https://github.com/apache/spark/pull/13592 Cheng On 6/17/16 9:13 PM, Pedro Rodriguez wrote: Hi All, At my workplace we are starting to use Datasets in 1.6.1 and even more with Spark 2.0 in place of

Question about equality of o.a.s.sql.Row

2016-06-17 Thread Kazuaki Ishizaki
Dear all, I have three questions about equality of org.apache.spark.sql.Row. (1) If a Row has a complex type (e.g. Array), is the following behavior expected? If two Rows have the same array instance, Row.equals returns true in the second assert. If two Rows have different array instances (a1
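A minimal Scala sketch of the scenario the question describes; the array contents are hypothetical, and the final comparison is left as a comment since its outcome is exactly what the thread is asking about.

```scala
import org.apache.spark.sql.Row

// Two Rows wrapping the same array instance: per the question above,
// Row.equals returns true in this case.
val a1 = Array(1, 2, 3)
val r1 = Row(a1)
val r2 = Row(a1)
assert(r1 == r2)

// Two Rows wrapping different array instances with equal elements.
// Whether these compare equal is the behavior being asked about, since
// Array.equals in Scala/Java is reference equality, not element-wise.
val a2 = Array(1, 2, 3)
val r3 = Row(a2)
// r1 == r3  -- the result under discussion in this thread
```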

Re: Skew data

2016-06-17 Thread Pedro Rodriguez
I am going to take a guess that this means that your partitions within an RDD are not balanced (one or more partitions are much larger than the rest). This would mean a single core would need to do much more work than the rest leading to poor performance. In general, the way to fix this is to
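A minimal sketch of this advice, with a hypothetical RDD: count the records per partition to confirm the skew, then reshuffle into evenly sized partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup; the data and partition count are for illustration only.
val sc = new SparkContext(new SparkConf().setAppName("skew-check").setMaster("local[*]"))
val rdd = sc.parallelize(0 until 1000000, numSlices = 8)

// Count records per partition to see whether one partition dominates.
val sizes = rdd.mapPartitions(it => Iterator(it.size)).collect()
println(sizes.mkString(", "))

// repartition() shuffles the data into roughly even partitions,
// spreading the work back across all cores.
val rebalanced = rdd.repartition(sc.defaultParallelism)
```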

Spark 2.0 Dataset Documentation

2016-06-17 Thread Pedro Rodriguez
Hi All, At my workplace we are starting to use Datasets in 1.6.1, and even more with Spark 2.0, in place of DataFrames. I looked at the 1.6.1 documentation and then the 2.0 documentation, and it looks like not much time has been spent writing a Dataset guide/tutorial. Preview Docs:
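For reference, a short hedged sketch (hypothetical case class and data) of the kind of Dataset example such a guide might cover with the Spark 2.0 API.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical domain class for the example.
case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("dataset-example").master("local[*]").getOrCreate()
import spark.implicits._

// Build a typed Dataset[Person] and use lambda-based transformations,
// which is the main difference from untyped DataFrame column expressions.
val people = Seq(Person("Ann", 30), Person("Bo", 25)).toDS()
val adultNames = people.filter(_.age >= 18).map(_.name)
adultNames.show()
```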

Re: Hello

2016-06-17 Thread Michael Armbrust
Another good signal is the "target version" (which by convention is only set by committers). When I set this for the upcoming version, it means I think it's important enough that I will prioritize reviewing a patch for it. On Fri, Jun 17, 2016 at 3:22 PM, Pedro Rodriguez

Re: Hello

2016-06-17 Thread Pedro Rodriguez
What is the best way to determine what the library maintainers believe is important work to be done? I have looked through the JIRA and it's unclear which items are priorities one could work on. I am guessing this is in part because things are a little hectic with final work for 2.0, but it would

Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Ted Yu
Docker Integration Tests failed on Linux: http://pastebin.com/Ut51aRV3 Here was the command I used: mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr -Dhadoop.version=2.7.0 package Has anyone seen a similar error? Thanks On Thu, Jun 16, 2016 at 9:49 PM, Reynold Xin

Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Marcelo Vanzin
-1 (non-binding) SPARK-16017 shows a severe perf regression in YARN compared to 1.6.1. On Thu, Jun 16, 2016 at 9:49 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.2! > > The vote is open until Sunday, June 19, 2016 at

Re: [VOTE] Release Apache Spark 1.6.2 (RC1)

2016-06-17 Thread Jonathan Kelly
+1 (non-binding) On Thu, Jun 16, 2016 at 9:49 PM Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.2! > > The vote is open until Sunday, June 19, 2016 at 22:00 PDT and passes if a > majority of at least 3 +1 PMC votes are cast. >

Re: Regarding on the dataframe stat frequent

2016-06-17 Thread Sean Owen
If you have a clean test case demonstrating the desired behavior, and a change which makes it work that way, yes, make a JIRA and PR. On Fri, Jun 17, 2016 at 1:35 AM, Luyi Wang wrote: > Hey there: > > The frequent item in the dataframe stat package seems not accurate. In the >
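For context, a minimal sketch of the call being discussed; the data and support threshold are hypothetical. freqItems uses an approximate single-pass algorithm, which is presumably why its results can look inaccurate.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical data illustrating the df.stat.freqItems call under discussion.
val spark = SparkSession.builder().appName("freq-items").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 1, 1, 2, 2, 3, 4, 4).toDF("value")

// The returned set can contain false positives by design; `support` is the
// minimum frequency of interest.
val freq = df.stat.freqItems(Seq("value"), support = 0.3)
freq.show()
```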

Re: Spark internal Logging trait potential thread unsafe

2016-06-17 Thread Sean Owen
I think that's OK to change, yes. I don't see why it's necessary to init log_ the way it is now. initializeLogIfNecessary() has a purpose, though. On Fri, Jun 17, 2016 at 2:39 AM, Prajwal Tuladhar wrote: > Hi, > > The way the log instance inside the Logging trait is currently being
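A hedged sketch (not Spark's actual Logging trait) of a thread-safe alternative: Scala's lazy val initialization is synchronized by the compiler, so concurrent first access creates the logger only once.

```scala
import org.slf4j.{Logger, LoggerFactory}

// Illustrative trait only; Spark's real Logging trait keeps an explicit
// initializeLogIfNecessary() hook because it also has to bootstrap log4j.
trait SafeLogging {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)

  protected def logInfo(msg: => String): Unit = {
    if (log.isInfoEnabled) log.info(msg)
  }
}
```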

testing the kafka 0.10 connector

2016-06-17 Thread Reynold Xin
Cody has graciously worked on a new connector for dstream for Kafka 0.10. Can people who use Kafka test this connector out? The patch is at https://github.com/apache/spark/pull/11863 Although we have stopped merging new features into branch-2.0, this connector is very decoupled from the rest of
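For anyone who wants to try it, a hedged sketch of a minimal consumer; the broker address, topic, and group id are placeholders, and the API shown is the 0.10 integration API that eventually shipped as spark-streaming-kafka-0-10, which may differ slightly from the PR while it is still under review.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = new SparkConf().setAppName("kafka-0-10-smoke-test").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(5))

// Placeholder Kafka settings; point these at a real broker and topic to test.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "connector-smoke-test",
  "auto.offset.reset" -> "earliest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("test-topic"), kafkaParams)
)

// Print the message values as a quick smoke test.
stream.map(_.value).print()

ssc.start()
ssc.awaitTermination()
```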

Re: ImportError: No module named numpy

2016-06-17 Thread Bhupendra Mishra
The issue has been fixed. After a lot of digging around, I finally found the pretty simple thing causing this problem: it was a permission issue on the Python libraries. The user I was logged in as did not have enough permission to read/execute the following Python libraries.