Hi Alexander,
I just noticed the error in my logic. There will always be a shuffle due to
the `cogroup`; `join` also uses cogroup, so a shuffle is inevitable.
However, `reduceByKey` will not cause a shuffle. I forgot that
cogroup will try to match keys on both sides, even when they don't exist.
A
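To make the distinction concrete, here is a plain-Scala sketch (no Spark dependency; the object and method names are mine, not Spark's) of the map-side combine that reduceByKey performs: each "partition" is reduced locally first, so only one (key, partialSum) pair per key per partition would ever cross the network.

```scala
// Plain-Scala sketch of reduceByKey's map-side combine. Not Spark code:
// "partitions" is just a Seq of Seqs standing in for RDD partitions.
object MapSideCombineSketch {
  // Map side: pre-aggregate within a single partition.
  def localCombine(partition: Seq[(String, Int)]): Map[String, Int] =
    partition.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // Only the pre-aggregated pairs would be shuffled, then merged.
  def reduceByKey(partitions: Seq[Seq[(String, Int)]]): Map[String, Int] =
    partitions
      .map(localCombine)   // map side
      .flatten             // the only pairs that would be shuffled
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sum) }  // reduce side
}
```

A cogroup, by contrast, must move every raw value for a key to one place, which is why the shuffle there is unavoidable.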
Hi Burak,
I’ve modified my code as you suggested, however it still leads to shuffling.
Could you suggest what’s wrong with my code or provide an example code with
block matrices multiplication that preserves data locality and does not cause
shuffling?
Modified code:
import org.apache.spark.ml
Hi All,
I'm happy to announce the Spark 1.4.1 maintenance release.
We recommend all users on the 1.4 branch upgrade to
this release, which contains several important bug fixes.
Download Spark 1.4.1 - http://spark.apache.org/downloads.html
Release notes - http://spark.apache.org/releases/spark-rele
I haven't gotten a response on user@ yet for these questions, but these are
probably better questions for dev@ anyway, aren't they? Could somebody on dev@
please respond?
Thanks,
Jonathan
From: Jonathan Kelly <jonat...@amazon.com>
Date: Wednesday, July 15, 2015 at 12:18 PM
To: "u...@spar
Per recent comments on SPARK-6442, I'd recommend not working on that one
for now. Instead, even if tasks are not that interesting to you, you
should try some small tasks at first to get used to contributing. I am
quite sure we'll want to solve SPARK-3703 by May 2016; that's pretty far in
the futu
Granted, the 1=0 trick is ugly, and it either assumes constant-folding support
or reads way too much data.
Submitted JIRA SPARK-9078 (thanks for pointers) and expounded on possible
solutions a little bit more there.
Cheers, and thanks, Bob
Or, alternatively, the bus could catch that error and ignore / log it,
instead of stopping the context...
On Wed, Jul 15, 2015 at 12:20 PM, Marcelo Vanzin
wrote:
> Hmm, the Java listener was added in 1.3, so I think it will work for my
> needs.
>
> Might be worth it to make it clear in the Spark
Hmm, the Java listener was added in 1.3, so I think it will work for my
needs.
Might be worth it to make it clear in the SparkListener documentation that
people should avoid using it directly. Or follow Reynold's suggestion.
On Wed, Jul 15, 2015 at 12:14 PM, Patrick Wendell
wrote:
> One related
It's bad that we expose a trait - even though we want to mix in stuff. We
should really audit all of these and expose only abstract classes for
anything beyond an extremely simple interface. That itself, however, would
break binary compatibility.
On Wed, Jul 15, 2015 at 12:15 PM, Patrick Wendell
wrote
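A sketch of the concern being discussed, with hypothetical names rather than Spark's actual listener API: a method added to a trait after the fact, even with a default body, is not binary-compatible with implementors compiled against the old trait (under the pre-2.12 trait encoding they can fail at runtime with AbstractMethodError), whereas a concrete method added to an abstract class is an ordinary inherited method that old subclasses pick up safely.

```scala
// Hypothetical names; not Spark's real SparkListener.
trait EventListenerTrait {
  def onStart(): String = "start"
  def onNewEvent(): String = "event" // added later: source-compatible only
}

// A concrete method added to an abstract class compiles to a plain
// inherited method, safe for previously compiled subclasses.
abstract class EventListenerBase {
  def onStart(): String = "start"
  def onNewEvent(): String = "event" // added later: binary-compatible
}

class MyListener extends EventListenerBase {
  override def onStart(): String = "my start"
}
```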
Actually the Java one is a concrete class.
On Wed, Jul 15, 2015 at 12:14 PM, Patrick Wendell wrote:
> One related note here is that we have a Java version of this that is
> an abstract class - in the doc it says that it exists more or less to
> allow for binary compatibility (it says it's for Jav
One related note here is that we have a Java version of this that is
an abstract class - in the doc it says that it exists more or less to
allow for binary compatibility (it says it's for Java users, but
really Scala could use this also):
https://github.com/apache/spark/blob/master/core/src/main/j
Why does Spark need to depend on it? I'm missing that bit. If an
openstack artifact is needed for openstack, shouldn't openstack add
it? otherwise everybody gets it in their build.
On Wed, Jul 15, 2015 at 7:52 PM, Gil Vernik wrote:
> I mean currently users that wish to use Spark and configure Spa
I mean currently users that wish to use Spark and configure Spark to use
OpenStack Swift need to manually edit Spark's pom.xml (main, core, yarn)
and add hadoop-openstack.jar to it and then compile Spark.
My question is why not to include this dependency in Spark for Hadoop
profiles 2.4 and
Hey all,
Just noticed this when some of our tests started to fail. SPARK-4072 added
a new method to the "SparkListener" trait, and even though it has a default
implementation, it doesn't seem like that applies retroactively.
Namely, if you have an existing, compiled app that has an implementation
Hi Bob,
Thanks for the email. You can select Spark as the project when you file a
JIRA ticket at https://issues.apache.org/jira/browse/SPARK
For "select 1 from $table where 0=1" -- if the database's optimizer doesn't
do constant folding and short-circuit execution, could the query end up
scanni
You shouldn't get dependencies you need from Spark, right? You declare
direct dependencies. Are we talking about re-scoping or excluding this
dep from Hadoop transitively?
On Wed, Jul 15, 2015 at 7:33 PM, Gil Vernik wrote:
> Right, it's not currently dependence in Spark.
> If we already mention i
tableExists in
spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses
non-standard SQL (specifically, the LIMIT keyword) to determine whether a
table exists in a JDBC data source. This will cause an exception in
many/most JDBC databases that don't support the LIMIT keyword. See
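A hypothetical sketch of a more portable probe (the object and method names here are mine, not the actual JdbcUtils API): a contradictory WHERE clause is plain SQL-92, unlike LIMIT, though as noted elsewhere in this thread it relies on the database not actually scanning the table for it.

```scala
// Hypothetical sketch; not the real JdbcUtils.tableExists.
object TableProbe {
  // Always-false predicate instead of the non-standard LIMIT keyword.
  def existsQuery(table: String): String =
    s"SELECT * FROM $table WHERE 1=0"

  // With a live java.sql.Connection one would run something like:
  //   conn.prepareStatement(existsQuery(table)).executeQuery()
  // and treat a thrown SQLException as "table does not exist".
  // JDBC's DatabaseMetaData.getTables is another portable option.
}
```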
Right, it's not currently a dependency in Spark.
If we already mention it, is it possible to make it part of the current
dependencies, but only for Hadoop profiles 2.4 and up?
This will save a lot of headache for those who use Spark + OpenStack Swift
and need to manually edit pom.xml every time to add de
Yea - I'd just add a bunch of columns. Doesn't seem like that big of a deal.
On Wed, Jul 15, 2015 at 10:53 AM, RJ Nowling wrote:
> I'm considering a few approaches -- one of which is to provide new
> functions like mapLeft, mapRight, filterLeft, etc.
>
> But this all falls shorts with DataFrame
I'm considering a few approaches -- one of which is to provide new
functions like mapLeft, mapRight, filterLeft, etc.
But this all falls short with DataFrames. RDDs can easily be extended
from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add
special columns?
On Wed, Jul 15, 2015
How about just using two fields, one boolean field to mark good/bad, and
another to get the source file?
On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling wrote:
> Hi all,
>
> I'm working on an ETL task with Spark. As part of this work, I'd like to
> mark records with some info such as:
>
> 1. Whet
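The two-extra-fields idea could look like the following sketch (the case class and field names are mine, purely illustrative): carry a validity flag and the originating file alongside the payload.

```scala
// Illustrative sketch of the "just add fields" approach; names are mine.
case class Tagged[T](value: T, isGood: Boolean, sourceFile: String)

object Tagged {
  def good[T](v: T, file: String): Tagged[T] = Tagged(v, isGood = true, file)
  def bad[T](v: T, file: String): Tagged[T] = Tagged(v, isGood = false, file)
}
```

Bad records then filter out with `records.filter(_.isGood)`, and the same two fields map directly onto extra DataFrame columns.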
Hi all,
I'm working on an ETL task with Spark. As part of this work, I'd like to
mark records with some info such as:
1. Whether the record is good or bad (e.g., Either)
2. Originating file and lines
Part of my motivation is to prevent errors with individual records from
stopping the entire pipe
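The Either-based marking could be sketched like this (all names hypothetical): a parse failure becomes a Left value carrying provenance, instead of an exception that stops the whole pipeline.

```scala
// Illustrative sketch of Either-marked records; names are mine.
case class Provenance(file: String, line: Long)
case class EtlRecord[T](result: Either[String, T], origin: Provenance)

object Etl {
  // A failing parse yields Left(message) rather than throwing.
  def parseInt(raw: String, origin: Provenance): EtlRecord[Int] =
    EtlRecord(
      try Right(raw.trim.toInt)
      catch { case _: NumberFormatException => Left(s"not an int: '$raw'") },
      origin)
}
```

Downstream, `records.collect { case EtlRecord(Right(v), _) => v }` keeps the good rows while the Lefts (with their provenance) can be routed to an error sink.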
I think we should start without map-side-combine for Scala, because
it's easier to OOM in the JVM than in Python (we don't have a hard limit
in Python yet).
On Wed, Jul 15, 2015 at 9:52 AM, Matt Cheah wrote:
> Should we actually enable map-side-combine for groupByKey in Scala RDD as
> well, then? If we i
I know Django users would love to use Apache Spark with Python 3 now that
the 1.4 release is out. However, I was thinking of making a PHP library for
Spark. Does anyone know what could help me get started? I am kind of new to
developing libraries; I have only done library design for Amazon MWS.
Should we actually enable map-side-combine for groupByKey in Scala RDD as
well, then? If we implement external-group-by, should we implement it with
the map-side-combine semantics that PySpark uses?
-Matt Cheah
On 7/15/15, 8:21 AM, "Davies Liu" wrote:
>If the map-side-combine is not that necessa
If I understand correctly, hadoop-openstack is not currently a dependency of
Spark.
> On Jul 15, 2015, at 8:21 AM, Josh Rosen wrote:
>
> We may be able to fix this from the Spark side by adding appropriate
> exclusions in our Hadoop dependencies, right? If possible, I think that we
> should
We may be able to fix this from the Spark side by adding appropriate
exclusions in our Hadoop dependencies, right? If possible, I think that we
should do this.
On Wed, Jul 15, 2015 at 7:10 AM, Ted Yu wrote:
> I attached a patch for HADOOP-12235
>
> BTW openstack was not mentioned in the first e
If the map-side-combine is not that necessary: it cannot reduce the size of
the data for shuffling by much (we still need to serialize the key for each
value), but it can reduce the number of key-value pairs, and potentially
reduce the number of operations later (repartition and groupBy).
On Tu
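That trade-off can be quantified with a plain-Scala sketch (no Spark; names are mine): a map-side combine barely shrinks each pair, since the key still gets serialized, but it does shrink the number of pairs shuffled.

```scala
// Plain-Scala sketch of the combine trade-off; "partitions" stands in
// for RDD partitions.
object CombineTradeoff {
  // Without combine: every raw pair crosses the wire.
  def pairsWithoutCombine(partitions: Seq[Seq[(String, Int)]]): Int =
    partitions.map(_.size).sum

  // With combine: at most one pair per distinct key per partition.
  def pairsWithCombine(partitions: Seq[Seq[(String, Int)]]): Int =
    partitions.map(_.map(_._1).distinct.size).sum
}
```

With heavily repeated keys the second count is much smaller; with mostly unique keys the two counts converge, which is when the combine stops paying for itself.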
I attached a patch for HADOOP-12235
BTW openstack was not mentioned in the first email from Gil.
My email and Gil's second email were sent around the same moment.
Cheers
On Wed, Jul 15, 2015 at 2:06 AM, Steve Loughran
wrote:
>
> On 14 Jul 2015, at 12:22, Ted Yu wrote:
>
> Looking at Jenkins
Thanks Akhil! For the one where I change the REST client, how likely would it
be that a change like that goes through? Would it be rejected as an uncommon
scenario? I really don't want to maintain this as a separate fork of the branch.
Thanks,
Joel
From: Akhil Das
Sent
Hey everyone,
Consider the following use of spark.sql.shuffle.partitions:
case class Data(A:String = f"${(math.random*1e8).toLong}%09.0f", B: String
= f"${(math.random*1e8).toLong}%09.0f")
val dataFrame = (1 to 1000).map(_ => Data()).toDF
dataFrame.registerTempTable("data")
sqlContext.setConf( "
Either way is fine. A relay proxy would be much easier; adding authentication
to the REST client would require you to rebuild and test the piece of code
that you wrote for authentication.
Thanks
Best Regards
On Wed, Jul 15, 2015 at 4:51 AM, Joel Zambrano wrote:
> Hi! We have a gateway with basic
On 14 Jul 2015, at 12:22, Ted Yu
mailto:yuzhih...@gmail.com>> wrote:
Looking at Jenkins, master branch compiles.
Can you try the following command ?
mvn -Phive -Phadoop-2.6 -DskipTests clean package
What version of Java are you using ?
Ted, Giles has stuck in hadoop-openstack, it's that whic
The code can continue to be a good reference implementation, no matter
where it lives. In fact, it can be a better more complete one, and
easier to update.
I agree that ec2/ needs to retain some kind of pointer to the new
location. Yes, maybe a script as well that does the checkout as you
say. We
Hi, devs
I found that the case of 'Expression.resolved !=
(Expression.childrenResolved && checkInputDataTypes().isSuccess)'
occurs in the output of Analyzer.
That is, some tests in o.a.s.sql.* fail if the code below is added in
CheckAnalysis:
https://github.com/maropu/spark/commit/a488eee8351f5