It seems like you just need to raise the ulimit?
On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi nr...@cloudera.com wrote:
Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the
workloads. Tried tracing the problem through change set analysis. Looks
like the offending commit
On Tue, Sep 23, 2014 at 12:47 AM, Yi Tian tianyi.asiai...@gmail.com wrote:
Hi all,
I have some questions about Spark SQL and Hive-on-Spark.
Will Spark SQL support all the Hive features in the future, or will it just
make Hive a data source for Spark?
Most likely not *ALL* Hive features, but
Keep the patches coming :)
On Fri, Sep 26, 2014 at 1:50 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I recently came across this mailing list post by Linus Torvalds
https://lkml.org/lkml/2004/12/20/255 about the value of reviewing even
“trivial” patches. The following passages
Hi Spark users and developers,
Some of the most active Spark developers (including Matei Zaharia, Michael
Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for
Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to
host a meetup event. This might be the event
Thanks. We might see more failures due to contention on resources. Fingers
crossed ... At some point it might make sense to run the tests in a VM or
container.
On Mon, Sep 29, 2014 at 2:20 PM, shane knapp skn...@berkeley.edu wrote:
we were running at 8 executors per node, and BARELY even
There is scalariform but it can be disruptive. Last time I ran it on Spark
it didn't compile due to some xml interpolation problem.
On Wednesday, October 1, 2014, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Does anyone know if Scala has something equivalent to autopep8
Those branches are no longer active. However, I don't think we can delete
branches from github due to the way ASF mirroring works. I might be wrong
there.
On Tue, Oct 7, 2014 at 6:25 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Just curious: Are there branches and/or tags on the
I actually think we should just bite the bullet and follow through with the
reformatting. Many rules are simply not possible to enforce only on deltas
(e.g. import ordering).
That said, maybe there are better windows to do this, e.g. during the QA
period.
On Sun, Oct 12, 2014 at 9:37 PM, Josh
is to have
pagination of these and always sort them by the last update time.
--
Reynold Xin
On October 16, 2014 at 12:11:00 PM, Sean McNamara (sean.mcnam...@webtrends.com)
wrote:
Accumulators on the stage info page show the rolling life time value of
accumulators as well as per task which
I also ran into this earlier. It is a bug. Do you want to file a jira?
I think part of the problem is that we don't actually have the attempt id
on the executors. If we do, that's great. If not, we'd need to propagate
that over.
On Mon, Oct 20, 2014 at 7:17 AM, Yin Huai huaiyin@gmail.com
/SPARK-4014.
On Mon, Oct 20, 2014 at 1:57 PM, Reynold Xin r...@databricks.com
wrote:
I also ran into this earlier. It is a bug. Do you want to file a jira?
I think part of the problem is that we don't actually have the
attempt id
on the executors. If we do, that's great. If not, we'd
I usually use SBT on Mac and that one doesn't require any setup ...
On Mon, Oct 20, 2014 at 4:43 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
If one were to put together a short but comprehensive guide to setting up
Spark to run locally on OS X, would it look like this?
# Install
/10/spark-breaks-previous-large-scale-sort-record.html.
Summary: while Hadoop MapReduce held last year's 100 TB world record by
sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
I want to thank Reynold Xin
Steve,
I wouldn't say Hadoop MR is a 2001 Toyota Celica :) In either case, I
updated the blog post to actually include CPU / disk / network measures.
You should see that in any measure that matters to this benchmark, the old
2100 node cluster is vastly superior. The data even fit in memory!
On
+1 (binding)
We are already doing this implicitly. In my experience, this can create
longer term personal commitment, which usually leads to better design
decisions if somebody knows they would need to look after something for a
while.
On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia
cc Matthias
In the past we talked with Matthias and there were some discussions about
this.
On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com
wrote:
All, was wondering if there had been any discussion around this topic yet?
TinkerPop https://github.com/tinkerpop is a
to maintain the
features described within the TinkerPop API as that might change in the
future.
From: Kushal Datta kushal.da...@gmail.com
Date: Thursday, November 6, 2014 at 4:00 PM
To: York, Brennon brennon.y...@capitalone.com
Cc: Kyle Ellrott kellr...@soe.ucsc.edu, Reynold Xin
r
Greg,
Thanks a lot for commenting on this, but I feel we are splitting hairs
here. Matei did mention "-1", followed by "or give feedback". The original
process outlined by Matei was exactly about review, rather than fighting.
Nobody wants to spend their energy fighting. Everybody is doing it to
Technically you can already use a custom serializer for each shuffle operation
(it is part of the ShuffledRDD). I've seen Matei suggest on JIRA issues
(or GitHub) in the past a storage policy in which you can specify how
data should be stored. I think that would be a great API to have in the
long
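For reference, a minimal sketch of that per-shuffle serializer hook, assuming Spark 1.1+ with a local SparkContext and Kryo used purely as an example serializer:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.ShuffledRDD
import org.apache.spark.serializer.KryoSerializer

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("shuffle-serializer"))
val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))

// Build the shuffle explicitly and attach a serializer to this one operation only.
val shuffled = new ShuffledRDD[Int, Int, Int](pairs, new HashPartitioner(4))
  .setSerializer(new KryoSerializer(sc.getConf))

println(shuffled.count())
```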
This is great. I think the consensus from last time was that we would put
performance stuff into spark-perf, so it is easy to test different Spark
versions.
On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs ewan.hi...@ugent.be wrote:
Hi all,
I saw that Reynold Xin had a Terasort example PR
Do people usually import o.a.spark.rdd._ ?
Also in order to maintain source and binary compatibility, we would need to
keep both right?
On Thu, Nov 6, 2014 at 3:12 AM, Shixiong Zhu zsxw...@gmail.com wrote:
I saw many people ask how to convert an RDD to a PairRDDFunctions. I would
like to
`rddToPairRDDFunctions` in the
SparkContext but remove `implicit`. The disadvantage is that there would be
two copies of the same code.
Best Regards,
Shixiong Zhu
2014-11-14 3:57 GMT+08:00 Reynold Xin r...@databricks.com:
Do people usually import o.a.spark.rdd._ ?
Also in order to maintain source and binary
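For context, a small sketch of the implicit conversion being discussed, as it worked before 1.3 (assuming an existing SparkContext named sc):

```scala
import org.apache.spark.SparkContext._  // brings rddToPairRDDFunctions into scope

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// reduceByKey is defined on PairRDDFunctions; the implicit conversion makes it
// available on an RDD of pairs without any explicit wrapping.
val counts = pairs.reduceByKey(_ + _)
```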
The current design is not ideal, but the size of dependencies should be
fairly small since we only send the path and timestamp, not the jars
themselves.
Executors can come and go. This is essentially a state replication problem,
and you have to be very careful with consistency.
On Sun, Nov 16,
That's a great idea and it is also a pain point for some users. However, it
is not possible to solve this problem at compile time, because the content
of serialization can only be determined at runtime.
There are some efforts in Scala to help users avoid mistakes like this. One
example project
I don't think the code is immediately obvious.
Davies - I think you added the code, and Josh reviewed it. Can you guys
explain and maybe submit a patch to add more documentation on the whole
thing?
Thanks.
On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad vibhanshugs...@gmail.com
wrote:
This basically stops us from merging patches. I'm wondering if it is
possible for ASF to give some Spark committers write permission to github
repo. In that case, if the sync tool is down, we can manually push
periodically.
On Tue, Nov 18, 2014 at 10:24 PM, Patrick Wendell pwend...@gmail.com
but not yet ACKed? The buffer
will be cheap since the mapOutputStatuses messages are the same and the memory
cost is only a few pointers.
Best Regards,
Shixiong Zhu
2014-09-20 16:24 GMT+08:00 Reynold Xin r...@databricks.com:
BTW - a partial solution here: https://github.com/apache/spark/pull/2470
What does /tmp/jvm-21940/hs_error.log tell you? It might give hints to what
threads are allocating the extra off-heap memory.
On Fri, Nov 21, 2014 at 1:50 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Howdy folks,
I’m trying to understand why I’m getting “insufficient memory”
The website is hosted on some svn server by ASF and unfortunately it
doesn't have a github mirror, so we will have to manually patch it ...
On Tue, Nov 25, 2014 at 11:12 AM, York, Brennon brennon.y...@capitalone.com
wrote:
For JIRA tickets like SPARK-4046
is to make a diff and attach it to the JIRA.
How old school.
On Tue, Nov 25, 2014 at 7:30 PM, Reynold Xin r...@databricks.com wrote:
The website is hosted on some svn server by ASF and unfortunately it
doesn't have a github mirror, so we will have to manually patch it ...
On Tue, Nov 25
The 1st was referring to different Spark applications connecting to the
standalone cluster manager, and the 2nd one was referring to scheduling within
a single Spark application, where jobs can be scheduled using a fair scheduler.
On Thu, Nov 27, 2014 at 3:47 AM, Praveen Sripati praveensrip...@gmail.com
Krishna,
Docs don't block the rc voting because docs can be updated in parallel with
release candidates, until the point a release is made.
On Fri, Nov 28, 2014 at 9:55 PM, Krishna Sankar ksanka...@gmail.com wrote:
Looks like the documentation hasn't caught up with the new features.
On the
Oops my previous response wasn't sent properly to the dev list. Here you go
for archiving.
Yes you can. Scala classes are compiled down to classes in bytecode. Take a
look at this: https://twitter.github.io/scala_school/java.html
Note that questions like this are not exactly what this dev list
This would be plausible for specific purposes such as Spark streaming or
Spark SQL, but I don't think it is doable for general Spark driver since it
is just a normal JVM process with arbitrary program state.
On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu liuj...@cn.ibm.com wrote:
Do we have any
+1
Tested on OS X.
On Wednesday, December 10, 2014, Patrick Wendell pwend...@gmail.com wrote:
Please vote on releasing the following candidate as Apache Spark version
1.2.0!
The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
I don't think the lineage thing is even turned on in Tachyon - it was
mostly a research prototype, so I don't think it'd make sense for us to use
that.
On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash and...@andrewash.com wrote:
I'm interested in understanding this as well. One of the main ways
without giving us
push access.
- Patrick
On Tue, Dec 16, 2014 at 6:06 PM, Reynold Xin r...@databricks.com wrote:
It's worth trying :)
On Tue, Dec 16, 2014 at 6:02 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
News flash!
From the latest version of the GitHub API
Alessandro was probably referring to some transformations whose
implementations depend on some actions. For example: sortByKey requires
sampling the data to get the histogram.
There is a ticket tracking this:
https://issues.apache.org/jira/browse/SPARK-2992
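A tiny sketch of the behavior in question (assuming an existing SparkContext named sc):

```scala
import org.apache.spark.SparkContext._  // ordered-RDD implicits for sortByKey (pre-1.3)

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
// sortByKey is declared as a transformation, but it samples the data to compute
// range-partition boundaries, so a job can run here, before any action is called.
val sorted = pairs.sortByKey()
```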
On Thu, Dec 18, 2014 at 11:52 AM,
Hi Manoj,
Thanks for the email.
Yes - you should start with the starter task before attempting larger ones.
Last year I signed up as a mentor for GSoC, but no student signed up. I
don't think I'd have time to be a mentor this year, but others might.
On Thu, Jan 1, 2015 at 4:54 PM, Manoj Kumar
Haven't sync-ed anything for the last 4 hours. Seems like this little piece
of infrastructure always stops working around our own code freeze time ...
I filed an INFRA ticket: https://issues.apache.org/jira/browse/INFRA-9115
I wish ASF would reconsider requests like this in order to handle downtime
gracefully https://issues.apache.org/jira/browse/INFRA-8738
On Tue, Feb 3, 2015 at 9:09 PM, Reynold Xin r...@databricks.com wrote:
Haven't sync
We should update the style doc to reflect what we have in most places
(which I think is //).
On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
FWIW I like the multi-line // over /* */ from a purely style standpoint.
The Google Java style guide[1] has
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util
methods.
The case sensitivity issues seem orthogonal, and it would be great to be able
to control that with a flag.
On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:
Hey Spark developers,
Is
We can also use ScalaTest's PrivateMethodTester instead of exposing that.
On Tue, Feb 3, 2015 at 2:18 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Jay,
On Tue, Feb 3, 2015 at 6:28 AM, jayhutfles jayhutf...@gmail.com wrote:
// Exposed for testing
private[spark] var printStream:
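For illustration, a minimal sketch of the PrivateMethodTester approach; the class and method names below are made up for the example, not Spark code:

```scala
import org.scalatest.{FunSuite, PrivateMethodTester}

class Greeter {
  private def greet(name: String): String = s"hello, $name"
}

class GreeterSuite extends FunSuite with PrivateMethodTester {
  test("call a private method without widening its visibility") {
    // Look up the private method by name and invoke it reflectively.
    val greet = PrivateMethod[String]('greet)
    val greeter = new Greeter
    assert((greeter invokePrivate greet("spark")) === "hello, spark")
  }
}
```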
We thought about this today after seeing this email. I actually built a
patch for this (adding filter/column to data source stat estimation), but
ultimately dropped it due to the potential problems the change could cause.
The main problem I see is that column pruning/predicate pushdowns are
This is the original ticket:
https://issues.apache.org/jira/browse/SPARK-1442
I believe it will happen, one way or another :)
On Fri, Feb 6, 2015 at 5:29 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:
Currently there's no standard way of handling time series data in Spark. We
were kicking
The static fields - Scala can't express JVM static fields unfortunately.
Those will be important once we provide the Java API.
On Thu, Jan 15, 2015 at 8:58 AM, Jay Hutfles jayhutf...@gmail.com wrote:
Hi all,
Does anyone know the reasoning behind implementing
that the APIs to
programmatically construct SchemaRDDs from an RDD[Row] and a StructType
remain public. All the SparkSQL data type objects should be exposed by the
API, and the jekyll build should not hide the docs as it does now.
Thanks.
Alex
On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin r
It's a bunch of strategies defined here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
In most common use cases (e.g. inner equi join), filters are pushed below
the join or into the join. Doing a cartesian product followed
.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html
-Ewan
On 01/16/2015 07:41 PM, Reynold Xin wrote:
You are running on a local file system, right? HDFS orders the files based
on names, but local file systems often don't. I think that's why you see the
difference.
We might be able to do a sort
We will merge https://issues.apache.org/jira/browse/SPARK-3650 for 1.3.
Thanks for reminding!
On Sun, Jan 18, 2015 at 8:34 PM, Michael Malak
michaelma...@yahoo.com.invalid wrote:
According to:
https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting
Note that
It will probably eventually make its way into part of the query engine, one
way or another. Note that there are in general a lot of other lower hanging
fruits before you have to do vectorization.
As far as I know, Hive doesn't really have vectorization because the
vectorization in Hive is simply
them in JIRA?
On Tue, Jan 20, 2015 at 3:55 PM, Reynold Xin r...@databricks.com wrote:
It will probably eventually make its way into part of the query engine,
one way or another. Note that there are in general a lot of other lower
hanging fruits before you have to do vectorization.
As far
Definitely go for a pull request!
On Mon, Jan 19, 2015 at 10:10 AM, Mick Davies michael.belldav...@gmail.com
wrote:
Looking at Parquet code - it looks like hooks are already in place to
support this.
In particular PrimitiveConverter has methods hasDictionarySupport and
You are running on a local file system, right? HDFS orders the files based on
names, but local file systems often don't. I think that's why you see the difference.
We might be able to do a sort and order the partitions when we create an RDD
to make this universal though.
On Fri, Jan 16, 2015 at 8:26 AM,
Chris,
This is really cool. Congratulations and thanks for sharing the news.
On Wed, Jan 14, 2015 at 6:08 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
Hi Spark Devs,
Just wanted to FYI that I was funded on a 2 year NASA proposal
to build out the concept of a
Hi Spark devs,
Given the growing number of developers that are building on Spark SQL, we
would like to stabilize the API in 1.3 so users and developers can be
confident to build on it. This also gives us a chance to improve the API.
In particular, we are proposing the following major changes.
You don't need the LocalSparkContext. It is only for Spark's own unit tests.
You can just create a SparkContext and use it in your unit tests, e.g.
val sc = new SparkContext("local", "my test app", new SparkConf)
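Expanded into a self-contained test, as a sketch assuming ScalaTest's FunSuite:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits (needed before 1.3)
import org.scalatest.FunSuite

class MyAppSuite extends FunSuite {
  test("word count on a local SparkContext") {
    val sc = new SparkContext("local", "my test app", new SparkConf)
    try {
      val counts = sc.parallelize(Seq("a", "b", "a"))
        .map((_, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("a") === 2)
    } finally {
      sc.stop()  // stop the context so later tests can create their own
    }
  }
}
```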
On Tue, Jan 20, 2015 at 7:27 PM, James alcaid1...@gmail.com wrote:
I could not
Maybe just to avoid LGTM as a single token when it is not actually
according to Patrick's definition, but anybody can still leave comments
like:
"The direction of the PR looks good to me." or "+1 on the direction"
"The build part looks good to me"
...
On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout
this
makes sense.
Thanks,
Aniket
On Sat, Feb 7, 2015, 4:50 AM Reynold Xin r...@databricks.com wrote:
We thought about this today after seeing this email. I actually built a
patch for this (adding filter/column to data source stat estimation), but
ultimately dropped it due
Michael - it is already transient. This should probably be considered a bug in
the Scala compiler, but we can easily work around it by removing the use of
destructuring binding.
On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust mich...@databricks.com
wrote:
I'd suggest marking the HiveContext as
this
through the tuple extraction. This is only a workaround. We can also
remove the tuple extraction.
On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin r...@databricks.com wrote:
Michael - it is already transient. This should probably be considered a bug
in the Scala compiler, but we can easily work around
Evan articulated it well.
On Thu, Feb 12, 2015 at 9:29 AM, Evan R. Sparks evan.spa...@gmail.com
wrote:
Well, you can always join as many RDDs as you want by chaining them
together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of
RDDs in this way but 10 is probably doable.
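Spelled out as a tiny sketch (assuming an existing SparkContext named sc):

```scala
import org.apache.spark.SparkContext._  // pair-RDD implicits for join (pre-1.3)

val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
val b = sc.parallelize(Seq((1, "b1"), (2, "b2")))
val c = sc.parallelize(Seq((1, "c1")))

// Each join yields (key, (left, right)), so chaining nests the value tuples.
val joined = a.join(b).join(c)   // RDD[(Int, ((String, String), String))]
joined.collect().foreach(println)
```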
Can you use the new aggregateNeighbors method? I suspect the null is coming
from automatic join elimination, which inspects the bytecode to see whether you
need the src or dst vertex data. Occasionally it can fail to detect that. In the
new aggregateNeighbors API, the caller needs to explicitly specify that,
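For reference, a sketch of the explicit-fields form of that API, using aggregateMessages (the GraphX 1.2 replacement for mapReduceTriplets) and assuming a Graph[Double, Double] named graph:

```scala
import org.apache.spark.graphx._

// Sum incoming edge attributes per vertex, declaring up front that neither the
// src nor dst vertex attributes are needed (TripletFields.EdgeOnly), so no
// bytecode inspection is involved in deciding which fields to ship.
val sums: VertexRDD[Double] = graph.aggregateMessages[Double](
  ctx => ctx.sendToDst(ctx.attr),
  _ + _,
  TripletFields.EdgeOnly
)
```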
Then maybe you actually had a null in your vertex attribute?
On Thu, Feb 12, 2015 at 10:47 PM, James alcaid1...@gmail.com wrote:
I changed the mapReduceTriplets() func to aggregateMessages(), but it
still failed.
2015-02-13 6:52 GMT+08:00 Reynold Xin r...@databricks.com:
Can you use
Yes, that's a bug and should be using the standard serializer.
On Wed, Feb 18, 2015 at 2:58 PM, Sean Owen so...@cloudera.com wrote:
That looks, at the least, inconsistent. As far as I know this should
be changed so that the zero value is always cloned via the non-closure
serializer. Any
on
this
idea (mostly from Patrick and Reynold :-).
https://www.youtube.com/watch?v=YWppYPWznSQ
From: Patrick Wendell pwend...@gmail.com
To: Reynold Xin r...@databricks.com
Cc: dev@spark.apache.org dev@spark.apache.org
Sent: Monday, January 26, 2015 4:01 PM
(mostly from Patrick and Reynold :-).
https://www.youtube.com/watch?v=YWppYPWznSQ
From: Patrick Wendell pwend...@gmail.com
To: Reynold Xin r...@databricks.com
Cc: dev@spark.apache.org dev@spark.apache.org
Sent: Monday, January
/~blanchet/api-design.pdf
Chapter 4's way of showing a principle and then an example from Qt is
particularly instructional.
On Tue, Jan 27, 2015 at 1:05 AM, Reynold Xin r...@databricks.com wrote:
Hi all,
In Spark, we have historically done reasonably well in interface and API
design
+1
Tested on Mac OS X
On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar ksanka...@gmail.com
wrote:
+1
1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min
mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests
2. Tested pyspark, mllib -
Hopefully problems like this will go away entirely in the next couple of
releases. https://issues.apache.org/jira/browse/SPARK-5293
On Wed, Jan 28, 2015 at 3:12 PM, jay vyas jayunit100.apa...@gmail.com
wrote:
Hi spark. Where is akka coming from in spark ?
I see the distribution referenced
DataFrame and SchemaRDD
2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com:
Dirceu,
That is not possible because one cannot overload return types.
SQLContext.parquetFile (and many other methods) needs to return some
type,
and that type cannot be both
Thanks for doing that, Shane!
On Wed, Jan 28, 2015 at 10:29 PM, shane knapp skn...@berkeley.edu wrote:
jenkins is back up and all builds have been retriggered... things are
building and looking good, and i'll keep an eye on the spark master builds
tonite and tomorrow.
On Wed, Jan 28, 2015
Hi,
We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
get the community's opinion.
The context is that SchemaRDD is becoming a common data format used for
bringing data into Spark from external systems, and used for various
components of Spark, e.g. MLlib's new pipeline
It's an interesting idea, but there are major challenges with per row
schema.
1. Performance - the query optimizer and execution engine use assumptions about schema
and data to generate optimized query plans. Having to re-reason about
schema for each row can substantially slow down the engine, but due to
there a straightforward way of creating RDD[Row] out of it
without writing a custom RDD?
ie - a utility method
Thanks
Malith
On Tue, Jan 13, 2015 at 2:29 PM, Reynold Xin r...@databricks.com wrote:
Depends on what the other side is doing. You can create your own RDD
implementation by subclassing RDD
Depends on what the other side is doing. You can create your own RDD
implementation by subclassing RDD, or it might work if you use
sc.parallelize(1 to n, n).mapPartitionsWithIndex( /* code to read the data
and return an iterator */ ) where n is the number of partitions.
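Filled in as a runnable sketch; fetchPartition below is a hypothetical stand-in for whatever reads one shard of the external source:

```scala
import org.apache.spark.{SparkConf, SparkContext}

def fetchPartition(i: Int): Iterator[String] =
  Iterator(s"record-from-shard-$i")   // placeholder for the real read

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("external-read"))
val n = 8   // number of partitions, matching the layout of the external data
val rdd = sc.parallelize(1 to n, n).mapPartitionsWithIndex { (idx, _) =>
  fetchPartition(idx)   // each task reads only its own shard
}
println(rdd.count())
```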
On Tue, Jan 13, 2015 at
://www.r-bloggers.com/r-na-vs-null/
On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin r...@databricks.com
wrote:
Isn't that just null in SQL?
On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan
velvia.git...@gmail.com
wrote:
I believe that most DataFrame implementations out
10, 2015 at 2:58 PM, Reynold Xin r...@databricks.com wrote:
Koert,
Don't get too hang up on the name SQL. This is exactly what you want: a
collection with record-like objects with field names and runtime types.
Almost all of the 40 methods are transformations for structured data
it is easier for IDEs to
recognize it as a block comment. If you press enter in the comment
block with the `//` style, IDEs won't add `//` for you. -Xiangrui
On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com
wrote:
We should update the style doc to reflect what we have
It seems to me having a version that is 2+ is good for that? Once we move
to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1
or 2.1.0 .
On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen so...@cloudera.com wrote:
Patrick and I were chatting about how to handle several issues
Most likely no. We are using the embedded mode of Jetty, rather than using
servlets.
Even if it is possible, you probably wouldn't want to embed Spark in your
application server ...
On Sun, Feb 15, 2015 at 9:08 PM, Niranda Perera niranda.per...@gmail.com
wrote:
Hi,
We are thinking of
Spark SQL is not the same as Hive on Spark.
Spark SQL is a query engine that is designed from ground up for Spark
without the historic baggage of Hive. It also does more than SQL now -- it
is meant for structured data processing (e.g. the new DataFrame API) and
SQL. Spark SQL is mostly compatible
server inside Spark? Is it used for Spark core functionality or is it there
for Spark jobs UI purposes?
cheers
On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin r...@databricks.com wrote:
Most likely no. We are using the embedded mode of Jetty, rather than
using servlets.
Even if it is possible
Depending on your use case. If the use case is to extract a small amount of
data out of Teradata, then you can use the JdbcRDD and soon a JDBC input
source based on the new Spark SQL external data source API.
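As a rough sketch of the JdbcRDD route; the URL, table, query, and bounds below are placeholders, and sc is an existing SparkContext:

```scala
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:teradata://host/database", "user", "password"),
  // The query must contain two '?' placeholders, which JdbcRDD fills with
  // per-partition lower/upper bounds on the key.
  "SELECT id, name FROM some_table WHERE id >= ? AND id <= ?",
  1L,        // lower bound
  1000000L,  // upper bound
  10,        // number of partitions
  rs => (rs.getLong(1), rs.getString(2))  // map each ResultSet row to a tuple
)
println(rows.count())
```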
On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote:
Hi,
I have a
), it seems to us that it
is accepting it. Also, in IBM's J9 health center, I see it reserve the
900g, and use up to 68g.
Thanks,
Tom
On 13 March 2015 at 02:05, Reynold Xin r...@databricks.com wrote:
How did you run the Spark command? Maybe the memory setting didn't
actually apply? How much memory
This is an interesting idea.
Are there well known libraries for doing this? Config is the one place
where it would be great to have something ridiculously simple, so it is
more or less bug free. I'm concerned about the complexity in this patch and
subtle bugs that it might introduce to config
Igor,
Welcome -- everything is open here:
https://issues.apache.org/jira/browse/SPARK
You should be able to see them even if you are not an ASF member.
On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa igorco...@apache.org wrote:
Hi there Guys.
I want to be more collaborative to Spark, but I
The only reason I can think of right now is that you might want to change
the config parameter to change the behavior of the optimizer and regenerate
the plan. However, maybe that's not a strong enough reason to regenerate
the RDD every time.
On Mon, Mar 30, 2015 at 5:38 AM, Cheng Lian
Reviving this to see if others would like to chime in about this
expression language for config options.
On Fri, Mar 13, 2015 at 7:57 PM, Dale Richardson dale...@hotmail.com
wrote:
Mridul, I may have added some confusion by giving examples in completely
different areas. For example the number
Yup - we merged the Java and Scala API so there is now a single set of API
to support both languages.
See more at
http://spark.apache.org/docs/latest/sql-programming-guide.html#unification-of-the-java-and-scala-apis
On Tue, Mar 31, 2015 at 11:40 PM, Niranda Perera niranda.per...@gmail.com
Thanks for the email and encouragement, Devl. Responses to the 3 requests:
-tonnes of configuration properties and go faster type flags. For example
Hadoop and Hbase users will know that there are a whole catalogue of
properties for regions, caches, network properties, block sizes, etc etc.
Hi all,
The Hadoop Summit uses community choice voting to decide which talks to
feature. It would be great if the community could help vote for Spark talks
so that Spark has a good showing at this event. You can make three votes on
each track. Below I've listed 3 talks that are important to
Once the data frame API is released for 1.3, you can write your thing in
Python and get the same performance. It can't express everything, but for
basic things like projection, filter, join, aggregate and simple numeric
computation, it should work pretty well.
On Thu, Jan 29, 2015 at 12:45 PM,
are we talking about pandas, or is this something
internal to the Spark Python API?
If you could elaborate a bit on this or point me to alternate
documentation.
Thanks much --sasha
On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin r...@databricks.com wrote:
Once the data frame API is released for 1.3, you can
If scaladoc can show the Java enum types, I do think the best approach is then
just to use Java enum types.
On Mon, Mar 23, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:
If the official solution from the Scala community is to use Java
enums, then it seems strange they aren't generated in
I created a ticket to separate the API refactoring from the implementation.
Would be great to have these as two separate patches to make it easier to
review (similar to the way we are doing RPC refactoring -- first
introducing an internal RPC api, port akka to it, and then add an
alternative
Welcome, Dmitriy, to the Spark dev list!
On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan dsetrak...@apache.org
wrote:
Hello Everyone,
I am one of the committers to Apache Ignite and have noticed some talks on
this dev list about integrating Ignite In-Memory File System (IgniteFS)
with
:)
On Mon, Apr 20, 2015 at 10:22 PM, Reynold Xin r...@databricks.com wrote:
You can just create a fillna function based on the 1.3.1 implementation of
fillna, no?
On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
a UDF might be a good idea no ?
Le lun
I replied on JIRA. Let's move the discussion there.
On Tue, Apr 21, 2015 at 8:13 AM, Karlson ksonsp...@siberie.de wrote:
I think the __getattr__ method should be removed from the DataFrame API in
pyspark.
May I draw the Python folk's attention to the issue