Re: BlockManager issues

2014-09-21 Thread Reynold Xin
It seems like you just need to raise the ulimit? On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi nr...@cloudera.com wrote: Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the workloads. Tried tracing the problem through change set analysis. Looks like the offending commit

Re: Question about SparkSQL and Hive-on-Spark

2014-09-23 Thread Reynold Xin
On Tue, Sep 23, 2014 at 12:47 AM, Yi Tian tianyi.asiai...@gmail.com wrote: Hi all, I have some questions about the SparkSQL and Hive-on-Spark Will SparkSQL support all the hive feature in the future? or just making hive as a datasource of Spark? Most likely not *ALL* Hive features, but

Re: thank you for reviewing our patches

2014-09-26 Thread Reynold Xin
Keep the patches coming :) On Fri, Sep 26, 2014 at 1:50 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I recently came across this mailing list post by Linus Torvalds https://lkml.org/lkml/2004/12/20/255 about the value of reviewing even “trivial” patches. The following passages

Spark meetup on Oct 15 in NYC

2014-09-28 Thread Reynold Xin
Hi Spark users and developers, Some of the most active Spark developers (including Matei Zaharia, Michael Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to host a meetup event. This might be the event

Re: FYI: i've doubled the jenkins executors for every build node

2014-09-29 Thread Reynold Xin
Thanks. We might see more failures due to contention on resources. Fingers crossed ... At some point it might make sense to run the tests in a VM or container. On Mon, Sep 29, 2014 at 2:20 PM, shane knapp skn...@berkeley.edu wrote: we were running at 8 executors per node, and BARELY even

Re: Extending Scala style checks

2014-10-01 Thread Reynold Xin
There is scalariform but it can be disruptive. Last time I ran it on Spark it didn't compile due to some xml interpolation problem. On Wednesday, October 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com wrote: Does anyone know if Scala has something equivalent to autopep8

Re: Unneeded branches/tags

2014-10-07 Thread Reynold Xin
Those branches are no longer active. However, I don't think we can delete branches from github due to the way ASF mirroring works. I might be wrong there. On Tue, Oct 7, 2014 at 6:25 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Just curious: Are there branches and/or tags on the

Re: Scalastyle improvements / large code reformatting

2014-10-12 Thread Reynold Xin
I actually think we should just take the bite and follow through with the reformatting. Many rules are simply not possible to enforce only on deltas (e.g. import ordering). That said, maybe there are better windows to do this, e.g. during the QA period. On Sun, Oct 12, 2014 at 9:37 PM, Josh

Re: accumulators

2014-10-17 Thread Reynold Xin
is to have pagination of these and always sort them by the last update time. --  Reynold Xin On October 16, 2014 at 12:11:00 PM, Sean McNamara (sean.mcnam...@webtrends.com) wrote: Accumulators on the stage info page show the rolling life time value of accumulators as well as per task which

Re: Get attempt number in a closure

2014-10-20 Thread Reynold Xin
I also ran into this earlier. It is a bug. Do you want to file a jira? I think part of the problem is that we don't actually have the attempt id on the executors. If we do, that's great. If not, we'd need to propagate that over. On Mon, Oct 20, 2014 at 7:17 AM, Yin Huai huaiyin@gmail.com

Re: Get attempt number in a closure

2014-10-20 Thread Reynold Xin
/SPARK-4014. On Mon, Oct 20, 2014 at 1:57 PM, Reynold Xin r...@databricks.com wrote: I also ran into this earlier. It is a bug. Do you want to file a jira? I think part of the problem is that we don't actually have the attempt id on the executors. If we do, that's great. If not, we'd

Re: Building and Running Spark on OS X

2014-10-20 Thread Reynold Xin
I usually use SBT on Mac and that one doesn't require any setup ... On Mon, Oct 20, 2014 at 4:43 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If one were to put together a short but comprehensive guide to setting up Spark to run locally on OS X, would it look like this? # Install

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Reynold Xin
/10/spark-breaks-previous-large-scale-sort-record.html. Summary: while Hadoop MapReduce held last year's 100 TB world record by sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 nodes; and we also scaled up to sort 1 PB in 234 minutes. I want to thank Reynold Xin

Re: Surprising Spark SQL benchmark

2014-11-05 Thread Reynold Xin
Steve, I wouldn't say Hadoop MR is a 2001 Toyota Celica :) In either case, I updated the blog post to actually include CPU / disk / network measures. You should see that in any measure that matters to this benchmark, the old 2100 node cluster is vastly superior. The data even fit in memory! On

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Reynold Xin
+1 (binding) We are already doing this implicitly. In my experience, this can create longer term personal commitment, which usually leads to better design decisions if somebody knows they would need to look after something for a while. On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Reynold Xin
cc Matthias In the past we talked with Matthias and there were some discussions about this. On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com wrote: All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a

Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Reynold Xin
to maintain the features described within the TinkerPop API as that might change in the future. From: Kushal Datta kushal.da...@gmail.com Date: Thursday, November 6, 2014 at 4:00 PM To: York, Brennon brennon.y...@capitalone.com Cc: Kyle Ellrott kellr...@soe.ucsc.edu, Reynold Xin r

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Reynold Xin
Greg, Thanks a lot for commenting on this, but I feel we are splitting hairs here. Matei did mention -1, followed by or give feedback. The original process outlined by Matei was exactly about review, rather than fighting. Nobody wants to spend their energy fighting. Everybody is doing it to

Re: proposal / discuss: multiple Serializers within a SparkContext?

2014-11-07 Thread Reynold Xin
Technically you can already do custom serializer for each shuffle operation (it is part of the ShuffledRDD). I've seen Matei suggesting on jira issues (or github) in the past a storage policy in which you can specify how data should be stored. I think that would be a great API to have in the long
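
For reference, a minimal sketch of the per-shuffle hook mentioned above, assuming a SparkContext named sc (the data and partitioner are illustrative only):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.ShuffledRDD
    import org.apache.spark.serializer.KryoSerializer

    // Attach a serializer to one specific shuffle instead of the whole application.
    val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))
    val shuffled = new ShuffledRDD[Int, Int, Int](pairs, new HashPartitioner(4))
      .setSerializer(new KryoSerializer(sc.getConf))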

Re: Terasort example

2014-11-11 Thread Reynold Xin
This is great. I think the consensus from last time was that we would put performance stuff into spark-perf, so it is easy to test different Spark versions. On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs ewan.hi...@ugent.be wrote: Hi all, I saw that Reynold Xin had a Terasort example PR

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Reynold Xin
Do people usually import o.a.spark.rdd._ ? Also, in order to maintain source and binary compatibility, we would need to keep both, right? On Thu, Nov 6, 2014 at 3:12 AM, Shixiong Zhu zsxw...@gmail.com wrote: I saw many people ask how to convert an RDD to a PairRDDFunctions. I would like to
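
For context, a minimal sketch of how that implicit is picked up in Spark 1.x, assuming a SparkContext named sc:

    import org.apache.spark.SparkContext._   // brings rddToPairRDDFunctions into scope in pre-1.3 Spark

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // reduceByKey is not defined on RDD itself; the implicit wraps RDD[(K, V)] in PairRDDFunctions
    val counts = pairs.reduceByKey(_ + _)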

Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Reynold Xin
`rddToPairRDDFunctions` in the SparkContext but remove `implicit`. The disadvantage is there are two copies of the same code. Best Regards, Shixiong Zhu 2014-11-14 3:57 GMT+08:00 Reynold Xin r...@databricks.com: Do people usually import o.a.spark.rdd._ ? Also, in order to maintain source and binary

Re: send currentJars and currentFiles to exetutor with actor?

2014-11-16 Thread Reynold Xin
The current design is not ideal, but the size of dependencies should be fairly small since we only send the path and timestamp, not the jars themselves. Executors can come and go. This is essentially a state replication problem where you have to be very careful with consistency. On Sun, Nov 16,

Re: Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread Reynold Xin
That's a great idea and it is also a pain point for some users. However, it is not possible to solve this problem at compile time, because the content of serialization can only be determined at runtime. There are some efforts in Scala to help users avoid mistakes like this. One example project

Re: Regarding RecordReader of spark

2014-11-16 Thread Reynold Xin
I don't think the code is immediately obvious. Davies - I think you added the code, and Josh reviewed it. Can you guys explain and maybe submit a patch to add more documentation on the whole thing? Thanks. On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad vibhanshugs...@gmail.com wrote:

Re: Apache infra github sync down

2014-11-18 Thread Reynold Xin
This basically stops us from merging patches. I'm wondering if it is possible for ASF to give some Spark committers write permission to github repo. In that case, if the sync tool is down, we can manually push periodically. On Tue, Nov 18, 2014 at 10:24 PM, Patrick Wendell pwend...@gmail.com

Re: Eliminate copy while sending data : any Akka experts here ?

2014-11-20 Thread Reynold Xin
but not yet ACKed? The buffer will be cheap since the mapOutputStatuses messages are same and the memory cost is only a few pointers. Best Regards, Shixiong Zhu 2014-09-20 16:24 GMT+08:00 Reynold Xin r...@databricks.com: BTW - a partial solution here: https://github.com/apache/spark/pull/2470

Re: Troubleshooting JVM OOM during Spark Unit Tests

2014-11-22 Thread Reynold Xin
What does /tmp/jvm-21940/hs_error.log tell you? It might give hints to what threads are allocating the extra off-heap memory. On Fri, Nov 21, 2014 at 1:50 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Howdy folks, I’m trying to understand why I’m getting “insufficient memory”

Re: How to resolve Spark site issues?

2014-11-25 Thread Reynold Xin
The website is hosted on some svn server by ASF and unfortunately it doesn't have a github mirror, so we will have to manually patch it ... On Tue, Nov 25, 2014 at 11:12 AM, York, Brennon brennon.y...@capitalone.com wrote: For JIRA tickets like SPARK-4046

Re: How to resolve Spark site issues?

2014-11-26 Thread Reynold Xin
is to make a diff and attach it to the JIRA. How old school. On Tue, Nov 25, 2014 at 7:30 PM, Reynold Xin r...@databricks.com wrote: The website is hosted on some svn server by ASF and unfortunately it doesn't have a github mirror, so we will have to manually patch it ... On Tue, Nov 25

Re: Standalone scheduling - document inconsistent

2014-11-27 Thread Reynold Xin
The 1st was referring to different Spark applications connecting to the standalone cluster manager, and the 2nd one was referring to within a single Spark application, the jobs can be scheduled using a fair scheduler. On Thu, Nov 27, 2014 at 3:47 AM, Praveen Sripati praveensrip...@gmail.com
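
A hedged sketch of the two levels being distinguished, using standard Spark 1.x configuration properties (the master URL and values are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("scheduling-example")
      .setMaster("spark://master:7077")      // placeholder standalone master URL
      .set("spark.cores.max", "8")           // across applications: cap this app's share of the cluster
      .set("spark.scheduler.mode", "FAIR")   // within this application: fair scheduling across its jobs
    val sc = new SparkContext(conf)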

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-28 Thread Reynold Xin
Krishna, Docs don't block the rc voting because docs can be updated in parallel with release candidates, until the point a release is made. On Fri, Nov 28, 2014 at 9:55 PM, Krishna Sankar ksanka...@gmail.com wrote: Looks like the documentation hasn't caught up with the new features. On the

Re: Can the Scala classes in the spark source code, be inherited in Java classes?

2014-12-01 Thread Reynold Xin
Oops my previous response wasn't sent properly to the dev list. Here you go for archiving. Yes you can. Scala classes are compiled down to classes in bytecode. Take a look at this: https://twitter.github.io/scala_school/java.html Note that questions like this are not exactly what this dev list

Re: HA support for Spark

2014-12-10 Thread Reynold Xin
This would be plausible for specific purposes such as Spark streaming or Spark SQL, but I don't think it is doable for general Spark driver since it is just a normal JVM process with arbitrary program state. On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu liuj...@cn.ibm.com wrote: Do we have any

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Reynold Xin
+1 Tested on OS X. On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.0! The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):

Re: Tachyon in Spark

2014-12-11 Thread Reynold Xin
I don't think the lineage thing is even turned on in Tachyon - it was mostly a research prototype, so I don't think it'd make sense for us to use that. On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash and...@andrewash.com wrote: I'm interested in understanding this as well. One of the main ways

Re: Scala's Jenkins setup looks neat

2014-12-16 Thread Reynold Xin
without giving us push access. - Patrick On Tue, Dec 16, 2014 at 6:06 PM, Reynold Xin r...@databricks.com wrote: It's worth trying :) On Tue, Dec 16, 2014 at 6:02 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: News flash! From the latest version of the GitHub API

Re: What RDD transformations trigger computations?

2014-12-18 Thread Reynold Xin
Alessandro was probably referring to some transformations whose implementations depend on some actions. For example: sortByKey requires sampling the data to get the histogram. There is a ticket tracking this: https://issues.apache.org/jira/browse/SPARK-2992 On Thu, Dec 18, 2014 at 11:52 AM,

Re: Highly interested in contributing to spark

2015-01-01 Thread Reynold Xin
Hi Manoj, Thanks for the email. Yes - you should start with the starter task before attempting larger ones. Last year I signed up as a mentor for GSoC, but no student signed up. I don't think I'd have time to be a mentor this year, but others might. On Thu, Jan 1, 2015 at 4:54 PM, Manoj Kumar

ASF Git / GitHub sync is down

2015-02-03 Thread Reynold Xin
Haven't sync-ed anything for the last 4 hours. Seems like this little piece of infrastructure always stops working around our own code freeze time ...

Re: ASF Git / GitHub sync is down

2015-02-03 Thread Reynold Xin
I filed an INFRA ticket: https://issues.apache.org/jira/browse/INFRA-9115 I wish ASF would reconsider requests like this in order to handle downtime gracefully https://issues.apache.org/jira/browse/INFRA-8738 On Tue, Feb 3, 2015 at 9:09 PM, Reynold Xin r...@databricks.com wrote: Haven't sync

Re: multi-line comment style

2015-02-04 Thread Reynold Xin
We should update the style doc to reflect what we have in most places (which I think is //). On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: FWIW I like the multi-line // over /* */ from a purely style standpoint. The Google Java style guide[1] has

Re: [spark-sql] JsonRDD

2015-02-02 Thread Reynold Xin
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util methods. The case sensitivity issues seem orthogonal, and would be great to be able to control that with a flag. On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov daniil.osi...@shazam.com wrote: Hey Spark developers, Is

Re: SparkSubmit.scala and stderr

2015-02-03 Thread Reynold Xin
We can also use ScalaTest's PrivateMethodTester instead of exposing that. On Tue, Feb 3, 2015 at 2:18 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Jay, On Tue, Feb 3, 2015 at 6:28 AM, jayhutfles jayhutf...@gmail.com wrote: // Exposed for testing private[spark] var printStream:

Re: Data source API | sizeInBytes should be to *Scan

2015-02-08 Thread Reynold Xin
We thought about this today after seeing this email. I actually built a patch for this (adding filter/column to data source stat estimation), but ultimately dropped it due to the potential problems the change could cause. The main problem I see is that column pruning/predicate pushdowns are

Re: Spark SQL Window Functions

2015-02-08 Thread Reynold Xin
This is the original ticket: https://issues.apache.org/jira/browse/SPARK-1442 I believe it will happen, one way or another :) On Fri, Feb 6, 2015 at 5:29 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Currently there's no standard way of handling time series data in Spark. We were kicking

Re: Graphx TripletFields written in Java?

2015-01-15 Thread Reynold Xin
The static fields - Scala can't express JVM static fields unfortunately. Those will be important once we provide the Java API. On Thu, Jan 15, 2015 at 8:58 AM, Jay Hutfles jayhutf...@gmail.com wrote: Hi all, Does anyone know the reasoning behind implementing

Re: Spark SQL API changes and stabilization

2015-01-15 Thread Reynold Xin
that the APIs to programmatically construct SchemaRDDs from an RDD[Row] and a StructType remain public. All the SparkSQL data type objects should be exposed by the API, and the jekyll build should not hide the docs as it does now. Thanks. Alex On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin r

Re: Join implementation in SparkSQL

2015-01-15 Thread Reynold Xin
It's a bunch of strategies defined here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala In most common use cases (e.g. inner equi join), filters are pushed below the join or into the join. Doing a cartesian product followed

Re: RDD order guarantees

2015-01-18 Thread Reynold Xin
.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html -Ewan On 01/16/2015 07:41 PM, Reynold Xin wrote: You are running on a local file system, right? HDFS orders the files based on names, but local file systems often don't. I think that's why the difference. We might be able to do a sort

Re: GraphX doc: triangleCount() requirement overstatement?

2015-01-18 Thread Reynold Xin
We will merge https://issues.apache.org/jira/browse/SPARK-3650 for 1.3. Thanks for reminding! On Sun, Jan 18, 2015 at 8:34 PM, Michael Malak michaelma...@yahoo.com.invalid wrote: According to: https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting Note that

Re: Will Spark-SQL support vectorized query engine someday?

2015-01-19 Thread Reynold Xin
It will probably eventually make its way into part of the query engine, one way or another. Note that there are in general a lot of other lower hanging fruits before you have to do vectorization. As far as I know, Hive doesn't really have vectorization because the vectorization in Hive is simply

Re: Will Spark-SQL support vectorized query engine someday?

2015-01-20 Thread Reynold Xin
them in JIRA? On Tue, Jan 20, 2015 at 3:55 PM, Reynold Xin r...@databricks.com wrote: It will probably eventually make its way into part of the query engine, one way or another. Note that there are in general a lot of other lower hanging fruits before you have to do vectorization. As far

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Reynold Xin
Definitely go for a pull request! On Mon, Jan 19, 2015 at 10:10 AM, Mick Davies michael.belldav...@gmail.com wrote: Looking at Parquet code - it looks like hooks are already in place to support this. In particular PrimitiveConverter has methods hasDictionarySupport and

Re: RDD order guarantees

2015-01-16 Thread Reynold Xin
You are running on a local file system, right? HDFS orders the files based on names, but local file systems often don't. I think that's why the difference. We might be able to do a sort and order the partitions when we create an RDD to make this universal, though. On Fri, Jan 16, 2015 at 8:26 AM,

Re: SciSpark: NASA AIST14 proposal

2015-01-14 Thread Reynold Xin
Chris, This is really cool. Congratulations and thanks for sharing the news. On Wed, Jan 14, 2015 at 6:08 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hi Spark Devs, Just wanted to FYI that I was funded on a 2 year NASA proposal to build out the concept of a

Spark SQL API changes and stabilization

2015-01-14 Thread Reynold Xin
Hi Spark devs, Given the growing number of developers that are building on Spark SQL, we would like to stabilize the API in 1.3 so users and developers can be confident to build on it. This also gives us a chance to improve the API. In particular, we are proposing the following major changes.

Re: not found: type LocalSparkContext

2015-01-20 Thread Reynold Xin
You don't need the LocalSparkContext. It is only for Spark's own unit tests. You can just create a SparkContext and use it in your unit tests, e.g. val sc = new SparkContext("local", "my test app", new SparkConf) On Tue, Jan 20, 2015 at 7:27 PM, James alcaid1...@gmail.com wrote: I could not
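
A minimal self-contained version of that suggestion, assuming ScalaTest (the data and assertion are illustrative only):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // needed for reduceByKey/collectAsMap in pre-1.3 Spark
    import org.scalatest.FunSuite

    class MyAppSuite extends FunSuite {
      test("word count") {
        val sc = new SparkContext("local", "my test app", new SparkConf())
        try {
          val counts = sc.parallelize(Seq("a", "b", "a"))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .collectAsMap()
          assert(counts("a") == 2)
        } finally {
          sc.stop()   // stop the context so later tests can create their own
        }
      }
    }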

Re: Semantics of LGTM

2015-01-18 Thread Reynold Xin
Maybe just to avoid LGTM as a single token when it is not actually according to Patrick's definition, but anybody can still leave comments like: The direction of the PR looks good to me. or +1 on the direction The build part looks good to me ... On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout

Re: Data source API | sizeInBytes should be to *Scan

2015-02-11 Thread Reynold Xin
this makes sense. Thanks, Aniket On Sat, Feb 7, 2015, 4:50 AM Reynold Xin r...@databricks.com wrote: We thought about this today after seeing this email. I actually built a patch for this (adding filter/column to data source stat estimation), but ultimately dropped it due

Re: HiveContext cannot be serialized

2015-02-16 Thread Reynold Xin
Michael - it is already transient. This should probably be considered a bug in the Scala compiler, but we can easily work around it by removing the use of destructuring binding. On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust mich...@databricks.com wrote: I'd suggest marking the HiveContext as

Re: HiveContext cannot be serialized

2015-02-16 Thread Reynold Xin
this through the tuple extraction. This is only a workaround. We can also remove the tuple extraction. On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin r...@databricks.com wrote: Michael - it is already transient. This should probably be considered a bug in the Scala compiler, but we can easily work around

Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread Reynold Xin
Evan articulated it well. On Thu, Feb 12, 2015 at 9:29 AM, Evan R. Sparks evan.spa...@gmail.com wrote: Well, you can always join as many RDDs as you want by chaining them together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of RDDs in this way but 10 is probably doable.

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread Reynold Xin
Can you use the new aggregateNeighbors method? I suspect the null is coming from automatic join elimination, which inspects bytecode to see if you need the src or dst vertex data. Occasionally it can fail to detect. In the new aggregateNeighbors API, the caller needs to explicitly specify that,
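
For reference, a hedged sketch using aggregateMessages, the GraphX method the follow-up message switches to (the in-degree computation is illustrative); declaring TripletFields explicitly sidesteps the bytecode-based detection described above:

    import org.apache.spark.graphx._

    // Count each vertex's in-degree; TripletFields.None states explicitly that neither
    // vertex attribute is needed, instead of relying on bytecode inspection.
    def inDegrees(graph: Graph[Int, Int]): VertexRDD[Int] =
      graph.aggregateMessages[Int](
        ctx => ctx.sendToDst(1),   // one message per edge, sent to the destination vertex
        _ + _,                     // merge messages arriving at the same vertex
        TripletFields.None)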

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread Reynold Xin
Then maybe you actually had a null in your vertex attribute? On Thu, Feb 12, 2015 at 10:47 PM, James alcaid1...@gmail.com wrote: I changed the mapReduceTriplets() func to aggregateMessages(), but it still failed. 2015-02-13 6:52 GMT+08:00 Reynold Xin r...@databricks.com: Can you use

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Reynold Xin
Yes, that's a bug and should be using the standard serializer. On Wed, Feb 18, 2015 at 2:58 PM, Sean Owen so...@cloudera.com wrote: That looks, at the least, inconsistent. As far as I know this should be changed so that the zero value is always cloned via the non-closure serializer. Any

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Reynold Xin
on this idea (mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ From: Patrick Wendell pwend...@gmail.com To: Reynold Xin r...@databricks.com Cc: dev@spark.apache.org dev@spark.apache.org Sent: Monday, January 26, 2015 4:01 PM

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Reynold Xin
(mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ From: Patrick Wendell pwend...@gmail.com To: Reynold Xin r...@databricks.com Cc: dev@spark.apache.org dev@spark.apache.org Sent: Monday, January

Re: talk on interface design

2015-01-27 Thread Reynold Xin
/~blanchet/api-design.pdf Chapter 4's way of showing a principle and then an example from Qt is particularly instructional. On Tue, Jan 27, 2015 at 1:05 AM, Reynold Xin r...@databricks.com wrote: Hi all, In Spark, we have done reasonable well historically in interface and API design

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Reynold Xin
+1 Tested on Mac OS X On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar ksanka...@gmail.com wrote: +1 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests 2. Tested pyspark, mlib -

Re: spark akka fork : is the source anywhere?

2015-01-28 Thread Reynold Xin
Hopefully problems like this will go away entirely in the next couple of releases. https://issues.apache.org/jira/browse/SPARK-5293 On Wed, Jan 28, 2015 at 3:12 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi spark. Where is akka coming from in spark ? I see the distribution referenced

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Reynold Xin
DataFrame and SchemaRDD 2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com: Dirceu, That is not possible because one cannot overload return types. SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both

Re: emergency jenkins restart soon

2015-01-28 Thread Reynold Xin
Thanks for doing that, Shane! On Wed, Jan 28, 2015 at 10:29 PM, shane knapp skn...@berkeley.edu wrote: jenkins is back up and all builds have been retriggered... things are building and looking good, and i'll keep an eye on the spark master builds tonite and tomorrow. On Wed, Jan 28, 2015

renaming SchemaRDD -> DataFrame

2015-01-26 Thread Reynold Xin
Hi, We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion. The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline

Re: Data source API | Support for dynamic schema

2015-01-28 Thread Reynold Xin
It's an interesting idea, but there are major challenges with per row schema. 1. Performance - query optimizer and execution use assumptions about schema and data to generate optimized query plans. Having to re-reason about schema for each row can substantially slow down the engine, but due to

Re: create a SchemaRDD from a custom datasource

2015-01-13 Thread Reynold Xin
there a straightforward way of creating RDD[Row] out of it without writing a custom RDD? ie - a utility method Thanks Malith On Tue, Jan 13, 2015 at 2:29 PM, Reynold Xin r...@databricks.com wrote: Depends on what the other side is doing. You can create your own RDD implementation by subclassing RDD

Re: create a SchemaRDD from a custom datasource

2015-01-13 Thread Reynold Xin
Depends on what the other side is doing. You can create your own RDD implementation by subclassing RDD, or it might work if you use sc.parallelize(1 to n, n).mapPartitionsWithIndex( /* code to read the data and return an iterator */ ) where n is the number of partitions. On Tue, Jan 13, 2015 at
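
A minimal sketch of the second suggestion, assuming a SparkContext named sc; readSlice is a hypothetical reader you would replace with real client code:

    // Hypothetical reader for one slice of the external source.
    def readSlice(idx: Int): Iterator[String] =
      Iterator(s"record-$idx-a", s"record-$idx-b")

    val n = 8   // number of partitions to read in parallel
    val rows = sc.parallelize(1 to n, n).mapPartitionsWithIndex { (idx, _) =>
      readSlice(idx)   // each partition reads only its own slice
    }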

Re: renaming SchemaRDD -> DataFrame

2015-02-10 Thread Reynold Xin
://www.r-bloggers.com/r-na-vs-null/ On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin r...@databricks.com wrote: Isn't that just null in SQL? On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan velvia.git...@gmail.com wrote: I believe that most DataFrame implementations out

Re: renaming SchemaRDD -> DataFrame

2015-02-10 Thread Reynold Xin
10, 2015 at 2:58 PM, Reynold Xin r...@databricks.com wrote: Koert, Don't get too hung up on the name SQL. This is exactly what you want: a collection with record-like objects with field names and runtime types. Almost all of the 40 methods are transformations for structured data

Re: multi-line comment style

2015-02-09 Thread Reynold Xin
it is easier for IDEs to recognize it as a block comment. If you press enter in the comment block with the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com wrote: We should update the style doc to reflect what we have

Re: How to track issues that must wait for Spark 2.x in JIRA?

2015-02-12 Thread Reynold Xin
It seems to me having a version that is 2+ is good for that? Once we move to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1 or 2.1.0 . On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen so...@cloudera.com wrote: Patrick and I were chatting about how to handle several issues

Re: Replacing Jetty with TomCat

2015-02-15 Thread Reynold Xin
Most likely no. We are using the embedded mode of Jetty, rather than using servlets. Even if it is possible, you probably wouldn't want to embed Spark in your application server ... On Sun, Feb 15, 2015 at 9:08 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi, We are thinking of

Re: Spark Hive

2015-02-15 Thread Reynold Xin
Spark SQL is not the same as Hive on Spark. Spark SQL is a query engine that is designed from ground up for Spark without the historic baggage of Hive. It also does more than SQL now -- it is meant for structured data processing (e.g. the new DataFrame API) and SQL. Spark SQL is mostly compatible

Re: Replacing Jetty with TomCat

2015-02-15 Thread Reynold Xin
server inside Spark? Is it used for Spark core functionality or is it there for Spark jobs UI purposes? cheers On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin r...@databricks.com wrote: Most likely no. We are using the embedded mode of Jetty, rather than using servlets. Even if it is possible

Re: Spark on teradata?

2015-01-08 Thread Reynold Xin
Depending on your use cases. If the use case is to extract small amount of data out of teradata, then you can use the JdbcRDD and soon a jdbc input source based on the new Spark SQL external data source API. On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote: Hi, I have a
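
A hedged sketch of the JdbcRDD route, assuming a SparkContext named sc; the JDBC URL, query, and bounds are placeholders rather than a tested Teradata setup:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    // The query must contain two '?' markers, which JdbcRDD fills with per-partition bounds.
    val orders = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:teradata://host/database"),   // placeholder URL
      "SELECT id, amount FROM orders WHERE id >= ? AND id <= ?",
      1L,         // lowerBound
      1000000L,   // upperBound
      10,         // numPartitions
      (rs: ResultSet) => (rs.getLong("id"), rs.getDouble("amount")))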

Re: Spilling when not expected

2015-03-17 Thread Reynold Xin
), it seems to us that it is accepting it. Also, in IBM's J9 health center, I see it reserve the 900g, and use up to 68g. Thanks, Tom On 13 March 2015 at 02:05, Reynold Xin r...@databricks.com wrote: How did you run the Spark command? Maybe the memory setting didn't actually apply? How much memory

Re: Spark config option 'expression language' feedback request

2015-03-13 Thread Reynold Xin
This is an interesting idea. Are there well known libraries for doing this? Config is the one place where it would be great to have something ridiculously simple, so it is more or less bug free. I'm concerned about the complexity in this patch and subtle bugs that it might introduce to config

Re: Jira Issues

2015-03-25 Thread Reynold Xin
Igor, Welcome -- everything is open here: https://issues.apache.org/jira/browse/SPARK You should be able to see them even if you are not an ASF member. On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa igorco...@apache.org wrote: Hi there Guys. I want to be more collaborative to Spark, but I

Re: [sql] How to uniquely identify Dataframe?

2015-03-30 Thread Reynold Xin
The only reason I can think of right now is that you might want to change the config parameter to change the behavior of the optimizer and regenerate the plan. However, maybe that's not a strong enough reason to regenerate the RDD every time. On Mon, Mar 30, 2015 at 5:38 AM, Cheng Lian

Re: Spark config option 'expression language' feedback request

2015-03-31 Thread Reynold Xin
Reviving this to see if others would like to chime in about this expression language for config options. On Fri, Mar 13, 2015 at 7:57 PM, Dale Richardson dale...@hotmail.com wrote: Mridul, I may have added some confusion by giving examples in completely different areas. For example the number

Re: Migrating from 1.2.1 to 1.3.0 - org.apache.spark.sql.api.java.Row

2015-04-01 Thread Reynold Xin
Yup - we merged the Java and Scala API so there is now a single set of API to support both languages. See more at http://spark.apache.org/docs/latest/sql-programming-guide.html#unification-of-the-java-and-scala-apis On Tue, Mar 31, 2015 at 11:40 PM, Niranda Perera niranda.per...@gmail.com

Re: Some praise and comments on Spark

2015-02-25 Thread Reynold Xin
Thanks for the email and encouragement, Devl. Responses to the 3 requests: -tonnes of configuration properties and go faster type flags. For example Hadoop and Hbase users will know that there are a whole catalogue of properties for regions, caches, network properties, block sizes, etc etc.

Help vote for Spark talks at the Hadoop Summit

2015-02-24 Thread Reynold Xin
Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed 3 talks that are important to

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
Once the data frame API is released for 1.3, you can write your thing in Python and get the same performance. It can't express everything, but for basic things like projection, filter, join, aggregate and simple numeric computation, it should work pretty well. On Thu, Jan 29, 2015 at 12:45 PM,

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
are we talking about pandas or this is something internal to spark py api. If you could elaborate a bit on this or point me to alternate documentation. Thanks much --sasha On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin r...@databricks.com wrote: Once the data frame API is released for 1.3, you can

Re: enum-like types in Spark

2015-03-23 Thread Reynold Xin
If scaladoc can show the Java enum types, I do think the best way is then just Java enum types. On Mon, Mar 23, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote: If the official solution from the Scala community is to use Java enums, then it seems strange they aren't generated in

Re: Review request for SPARK-6112:Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Reynold Xin
I created a ticket to separate the API refactoring from the implementation. Would be great to have these as two separate patches to make it easier to review (similar to the way we are doing RPC refactoring -- first introducing an internal RPC api, port akka to it, and then add an alternative

Re: Integrating Spark with Ignite File System

2015-04-11 Thread Reynold Xin
Welcome, Dmitriy, to the Spark dev list! On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan dsetrak...@apache.org wrote: Hello Everyone, I am one of the committers to Apache Ignite and have noticed some talks on this dev list about integrating Ignite In-Memory File System (IgniteFS) with

Re: Dataframe.fillna from 1.3.0

2015-04-20 Thread Reynold Xin
:) On Mon, Apr 20, 2015 at 22:22, Reynold Xin r...@databricks.com wrote: You can just create a fillna function based on the 1.3.1 implementation of fillna, no? On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: a UDF might be a good idea, no? On Mon
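
A hedged sketch of that suggestion for 1.3.0, built only on the public 1.3.0 DataFrame API (the column types and fill value are illustrative; it assumes numeric columns):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{coalesce, col, lit}

    // Replace nulls in every column with the given value, in the spirit of 1.3.1's fillna.
    def fillna(df: DataFrame, value: Double): DataFrame =
      df.select(df.columns.map(c => coalesce(col(c), lit(value)).as(c)): _*)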

Re: [pyspark] Drop __getattr__ on DataFrame

2015-04-21 Thread Reynold Xin
I replied on JIRA. Let's move the discussion there. On Tue, Apr 21, 2015 at 8:13 AM, Karlson ksonsp...@siberie.de wrote: I think the __getattr__ method should be removed from the DataFrame API in pyspark. May I draw the Python folk's attention to the issue
