foreach is intended to produce a side effect, and map is for something that will
return a new dataset.
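A minimal illustration (assuming an RDD of integers called rdd):

rdd.foreach(x => println(x))       // runs purely for the side effect, returns Unit
val doubled = rdd.map(x => x * 2)  // returns a new RDD you can keep transforming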
On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas wrote:
> Patrick,
>
> I was wondering why one would choose for rdd.map vs rdd.foreach to execute a
> side-effecting function on an RDD.
>
> -kr, Gera
various types of execution services
for spark apps.
- Patrick
On Fri, Dec 12, 2014 at 10:06 AM, Manoj Samel wrote:
> Thanks Marcelo.
>
> Spark Gurus/Databricks team - do you have something in roadmap for such a
> spark server ?
>
> Thanks,
>
> On Thu, Dec 11, 2014 at 5:43 P
Yeah the main way to do this would be to have your own static cache of
connections. These could be using an object in Scala or just a static
variable in Java (for instance a set of connections that you can
borrow from).
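A rough sketch of that pattern (Connection, createConnection() and send() are
placeholders here, not a real API):

object ConnectionCache {
  // created lazily, once per executor JVM, and never serialized with the closure
  lazy val connection = createConnection()
}

rdd.foreachPartition { iter =>
  val conn = ConnectionCache.connection
  iter.foreach(record => conn.send(record))
}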
- Patrick
On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer wrote:
>
The second choice is better. Once you call collect() you are pulling
all of the data onto a single node, you want to do most of the
processing in parallel on the cluster, which is what map() will do.
Ideally you'd try to summarize the data or reduce it before calling
collect().
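For example, a small sketch of summarizing first (assuming a pair RDD of (key, value)):

val summary = rdd
  .mapValues(_ => 1)
  .reduceByKey(_ + _)  // runs in parallel on the cluster
  .collect()           // only the small per-key counts come back to the driver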
On Fri, Dec 5, 201
Thanks for flagging this. I reverted the relevant YARN fix in Spark
1.2 release. We can try to debug this in master.
On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang wrote:
> I created a ticket for this:
>
> https://issues.apache.org/jira/browse/SPARK-4757
>
>
> Jianshi
>
> On Fri, Dec 5, 2014 at
classes present it can cause issues.
On Sun, Nov 30, 2014 at 10:53 PM, Judy Nash
wrote:
> Thanks Patrick and Cheng for the suggestions.
>
> The issue was Hadoop common jar was added to a classpath. After I removed
> Hadoop common jar from both master and slave, I was able to bypass the
I recently posted instructions on loading Spark in Intellij from scratch:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-BuildingSparkinIntelliJIDEA
You need to do a few extra steps for the YARN project to work.
Also, for questions like this that re
"org/spark-project/guava/common/base/Preconditions".checkArgument:(ZLjava/lang/Object;)V
50: invokestatic #502  // Method
"org/spark-project/guava/common/base/Preconditions".checkArgument:(ZLjava/lang/Object;)V
On Wed, Nov 26, 2014 at 11:08 AM, Patri
should not do this.
- Patrick
On Wed, Nov 26, 2014 at 1:45 AM, Judy Nash
wrote:
> Looks like a config issue. I ran spark-pi job and still failing with the
> same guava error
>
> Command ran:
>
> .\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
> org.apa
Dear all,
Currently, I am running spark standalone cluster with ~100 nodes.
Multiple users can connect to the cluster by Spark-shell or PyShell.
However, I can't find an efficient way to control the resources among multiple
users.
I can set "spark.deploy.defaultCores" in the server side to lim
It looks like you are trying to directly import the toLocalIterator
function. You can't import functions, it should just appear as a
method of an existing RDD if you have one.
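For example (assuming an existing SparkContext sc):

val rdd = sc.parallelize(1 to 100)
// no import needed, toLocalIterator is already a method on the RDD
rdd.toLocalIterator.foreach(println)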
- Patrick
On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan
wrote:
> Hi,
>
> I am using Spark 1.0.0 an
Hi There,
Because Akka versions are not binary compatible with one another, it
might not be possible to integrate Play with Spark 1.1.0.
- Patrick
On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya wrote:
> Hi,
>
> Sorry if this has been asked before; I didn't find a satisfactory
The doc build appears to be broken in master. We'll get it patched up
before the release:
https://issues.apache.org/jira/browse/SPARK-4326
On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta
wrote:
> Nichols and Patrick,
>
> Thanks for your help, but, no, it still does not wo
In one or two cases we've exposed functions that rely
on this:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L334
I would expect more robust support for online aggregation to show up
in a future version of Spark.
- Patrick
On T
://issues.apache.org/jira/browse/SPARK-4114
This is a very important issue for Spark SQL, so I'd welcome comments
on that JIRA from anyone who is familiar with Hive/HCatalog internals.
- Patrick
On Mon, Oct 27, 2014 at 9:54 PM, Cheng, Hao wrote:
> Hi, all
>
>I have some PRs
is in the assembled jar file. Please see the mails below,
which I sent to the Akka group for details.
Is there something I am doing wrong? Is there a way to get the Akka
Cluster to load the reference.conf from Camel?
Any help greatly appreciated!
Best regards,
Patrick
On 27 October 2014 11:3
orks. When deployed to the Spark Cluster the following
error is logged by the worker who tries to use Akka Camel:
-- Forwarded message --
From: Patrick McGloin
Date: 24 October 2014 15:09
Subject: Re: [akka-user] Akka Camel plus Spark Streaming
To: akka-u...@googlegroups.com
Hi
It shows the amount of memory used to store RDD blocks, which are created
when you run .cache()/.persist() on an RDD.
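For example (the HDFS path is just a placeholder):

val data = sc.textFile("hdfs:///some/path").cache()
data.count()  // the first action materializes the blocks, which then show up as "Memory Used"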
On Wed, Oct 22, 2014 at 10:07 PM, Haopu Wang wrote:
> Hi, please take a look at the attached screen-shot. I wonders what's the
> "Memory Used" column mean.
>
>
>
> I give 2GB me
maven it's more clunky but if you do a "mvn install" first then (I
think) you can test sub-modules independently:
mvn test -pl streaming ...
- Patrick
On Wed, Oct 22, 2014 at 10:00 PM, Ryan Williams
wrote:
> I started building Spark / running Spark tests this weekend and on
Spark will need to connect both to the hive metastore and to all HDFS
nodes (NN and DN's). If that is all in place then it should work. In
this case it looks like maybe it can't connect to a datanode in HDFS
to get the raw data. Keep in mind that the performance might not be
very good if you are tr
IIRC - the random is seeded with the index, so it will always produce
the same result for the same index. Maybe I don't totally follow
though. Could you give a small example of how this might change the
RDD ordering in a way that you don't expect? In general repartition()
will not preserve the orde
FYI, in case anybody else has this problem, we switched to Spark 1.1
(outside CDH) and the same Spark application worked first time (once
recompiled with Spark 1.1 libs of course). I assume this is because Spark
1.1 is compiled with Hive.
On 29 September 2014 17:41, Patrick McGloin
wrote:
>
ar) but the Executor
doesn't find the class. Here is the command:
sudo ./spark-submit --class aac.main.SparkDriver --master
spark://localhost:7077 --jars AAC-assembly-1.0.jar aacApp_2.10-1.0.jar
Any pointers would be appreciated!
Best regards,
Patrick
Hey Grzegorz,
EMR is a service that is not maintained by the Spark community. So
this list isn't the right place to ask EMR questions.
- Patrick
On Thu, Sep 18, 2014 at 3:19 AM, Grzegorz Białek
wrote:
> Hi,
> I would like to run Spark application on Amazon EMR. I have some questi
I agree, that's a good idea Marcelo. There isn't AFAIK any reason the
client needs to hang there for correct operation.
On Thu, Sep 18, 2014 at 9:39 AM, Marcelo Vanzin wrote:
> Yes, what Sandy said.
>
> On top of that, I would suggest filing a bug for a new command line
> argument for spark-submi
wrote:
> Patrick,
>
> If I understand this correctly, I won't be able to do this in the closure
> provided to mapPartitions() because that's going to be stateless, in the
> sense that a hash map that I create within the closure would only be useful
> for one call of MapPartitio
If each partition can fit in memory, you can do this using
mapPartitions and then building an inverse mapping within each
partition. You'd need to construct a hash map within each partition
yourself.
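A rough sketch of that approach, assuming an RDD[(String, String)] named rdd:

val inverted = rdd.mapPartitions { iter =>
  val m = scala.collection.mutable.HashMap[String, String]()
  iter.foreach { case (k, v) => m(v) = k }  // inverse value -> key mapping for this partition
  m.iterator
}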
On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya wrote:
> I have a use case where my RDD is set up
Yeah that issue has been fixed by adding better docs, it just didn't make
it in time for the release:
https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54
On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo
wrote:
> resolved:
>
> ./make-distribution.sh --name spark-hadoop-2.3.0
Hey SK,
Yeah, the documented format is the same (we expect users to add the
jar at the end) but the old spark-submit had a bug where it would
actually accept inputs that did not match the documented format. Sorry
if this was difficult to find!
- Patrick
On Fri, Sep 12, 2014 at 1:50 PM, SK
[moving to user@]
This would typically be accomplished with a union() operation. You
can't mutate an RDD in-place, but you can create a new RDD with a
union() which is an inexpensive operator.
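For example (existingRdd and newRecords are just placeholders for your data):

val updated = existingRdd.union(sc.parallelize(newRecords))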
On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur
wrote:
> Hi,
>
> We have a use case where we are plannin
g.
Thanks, and congratulations!
- Patrick
I would say that the first three are all used pretty heavily. Mesos
was the first one supported (long ago), the standalone is the
simplest and most popular today, and YARN is newer but growing a lot
in activity.
SIMR is not used as much... it was designed mostly for environments
where users had a
Changing this is not supported; it is immutable, similar to other Spark
configuration settings.
On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 wrote:
> Dear all:
>
> Spark uses memory to cache RDD and the memory size is specified by
> "spark.storage.memoryFraction".
>
> One the Executor starts, does Spark su
Yeah - each batch will produce a new RDD.
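A small sketch of using it that way (dstream and the per-batch logic are placeholders):

dstream.foreachRDD { rdd =>
  val rng = new scala.util.Random(rdd.id)  // rdd.id differs for each batch's RDD
  // ... per-batch logic using rng ...
}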
On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar
wrote:
> Thanks.
>
> Just to double check, rdd.id would be unique for a batch in a DStream?
>
>
> On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng wrote:
>>
>> You can use RDD id as the seed, which is unique
any new entries here:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
- Patrick
Hey Andrew,
We might create a new JIRA for it, but it doesn't exist yet. We'll create
JIRA's for the major 1.2 issues at the beginning of September.
- Patrick
On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash wrote:
> Hi Patrick,
>
> For the spilling within on key work y
Yep - that's correct. As an optimization we save the shuffle output and
re-use it if you execute a stage twice. So this can make A/B tests like
this a bit confusing.
- Patrick
On Friday, August 22, 2014, Nieyuan wrote:
> Because map-reduce tasks like join will save shuffle data to d
The reason is that some operators get pipelined into a single stage.
rdd.map(XX).filter(YY) - this executes in a single stage since there is no
data movement needed in between these operations.
If you call toDebugString on the final RDD it will give you some
information about the exact lineage. In
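For example (the input path is a placeholder):

val result = sc.textFile("input.txt").map(_.length).filter(_ > 10)
println(result.toDebugString)  // prints the lineage; pipelined operations share a stage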
For large objects, it will be more efficient to broadcast it. If your array
is small it won't really matter. How many centers do you have? Unless you
are finding that you have very large tasks (and Spark will print a warning
about this), it could be okay to just reference it directly.
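A sketch of the broadcast approach (centers, points and closestCenter are placeholders):

val centersBc = sc.broadcast(centers)  // shipped to each executor once
val assigned = points.map(p => closestCenter(p, centersBc.value))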
On Wed, Aug
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact
effect of caching. In rdd3, in addition to the fact that rdd will be
cached, you are also doing a bunch of extra random number generation. So it
will be hard to isolate the effect of caching.
On Wed, Aug 20, 2014 at 7:48 AM, Gr
collection of
types I had.
Best regards,
Patrick
On 6 August 2014 07:58, Amit Kumar wrote:
> Hi All,
>
> I am having some trouble trying to write generic code that uses sqlContext
> and RDDs. Can you suggest what might be wrong?
>
> class SparkTable[T : ClassTag](val sqlConte
lay each
group out sequentially on disk on one big file, you can call `sortByKey`
with a hashed suffix as well. The sort functions are externalized in Spark
1.1 (which is in pre-release).
- Patrick
On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti wrote:
> Patrick Wendell wrote
> > In
single task.
In the latest version of Spark we've added documentation to make this
distinction more clear to users:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
- Patrick
On Tue, Aug 5, 2014 at 6:13 AM, Jens Kristian Geyti wro
>
> Thanks,
> Ron
>
> On Aug 4, 2014, at 10:01 AM, Ron's Yahoo! wrote:
>
> That failed since it defaulted the versions for yarn and hadoop
> I'll give it a try with just 2.4.0 for both yarn and hadoop...
>
> Thanks,
> Ron
>
> On Aug 4, 2014, at 9:44
4 -Dhadoop.version=2.4.0.2.1.1.0-385
> -DskipTests clean package
>
> I haven't tried building a distro, but it should be similar.
>
>
> - SteveN
>
> On 8/4/14, 1:25, "Sean Owen" wrote:
>
> For any Hadoop 2.4 distro, yes, set hadoop.version but also set
> -Phadoop
You are hitting this issue:
https://issues.apache.org/jira/browse/SPARK-2075
On Mon, Jul 28, 2014 at 5:40 AM, lmk
wrote:
> Hi
> I was using saveAsTextFile earlier. It was working fine. When we migrated
> to
> spark-1.0, I started getting the following error:
> java.lang.ClassNotFoundException:
Are you directly caching files from Hadoop or are you doing some
transformation on them first? If you are doing a groupBy or some type of
transformation, then you could be causing data skew that way.
On Sun, Aug 3, 2014 at 1:19 PM, iramaraju wrote:
> I am running spark 1.0.0, Tachyon 0.5 and Ha
For hortonworks, I believe it should work to just link against the
corresponding upstream version. I.e. just set the Hadoop version to "2.4.0"
Does that work?
- Patrick
On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo!
wrote:
> Hi,
> Not sure whose issue this is, but if I
BTW - the reason the workaround could help is that when persisting
to DISK_ONLY, we explicitly avoid materializing the RDD partition in
memory... we just pass it through to disk
On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell wrote:
> It seems possible that you are running out of mem
thub.com/apache/spark/pull/1165
A (potential) workaround would be to first persist your data to disk, then
re-partition it, then cache it. I'm not 100% sure whether that will work
though.
val a =
sc.textFile("s3n://some-path/*.json").persist(DISK_ONLY).repartition(larger
nr of parti
I'll let TD chime on on this one, but I'm guessing this would be a welcome
addition. It's great to see community effort on adding new
streams/receivers, adding a Java API for receivers was something we did
specifically to allow this :)
- Patrick
On Sat, Aug 2, 2014 at 10:
Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Fri, Aug 1, 2014 at 3:31 PM, Patrick Wendell
> wrote:
> > I've had intermittent access to the artifacts themselves, but for me the
> > directory listing always 404's.
> >
>
If you want to customize the logging behavior - the simplest way is to copy
conf/log4j.properties.template to conf/log4j.properties. Then you can go and
modify the log level in there. The spark shells should pick this up.
On Sun, Aug 3, 2014 at 6:16 AM, Sean Owen wrote:
> That's just a templat
We're unsure of the best
practice for loading data into Parquet tables. Is the way we are doing the
Spark part correct in your opinion?
Best regards,
Patrick
On 1 August 2014 19:32, Michael Armbrust wrote:
> So is the only issue that impala does not see changes until you refresh
I've had intermittent access to the artifacts themselves, but for me the
directory listing always 404's.
I think if sbt hits a 404 on the directory, it sends a somewhat confusing
error message that it can't download the artifact.
- Patrick
On Fri, Aug 1, 2014 at 3:28 PM, Shivar
This is a Scala bug - I filed something upstream, hopefully they can fix it
soon and/or we can provide a work around:
https://issues.scala-lang.org/browse/SI-8772
- Patrick
On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau wrote:
> Currently scala 2.10.2 can't be pulled in from maven ce
How should we insert data from SparkSQL into a Parquet table which can be
directly queried by Impala?
Best regards,
Patrick
On 1 August 2014 16:18, Patrick McGloin wrote:
> Hi,
>
> We would like to use Spark SQL to store data in Parquet format and then
> query that data using Impa
Hi,
We would like to use Spark SQL to store data in Parquet format and then
query that data using Impala.
We've tried to come up with a solution and it is working but it doesn't
seem good. So I was wondering if you guys could tell us what is the
correct way to do this. We are using Spark 1.0 an
All of the scripts we use to publish Spark releases are in the Spark
repo itself, so you could follow these as a guideline. The publishing
process in Maven is similar to in SBT:
https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65
On Mon, Jul 28, 2014 at 12:39 PM,
Adding new build modules is pretty high overhead, so if this is a case
where a small amount of duplicated code could get rid of the
dependency, that could also be a good short-term option.
- Patrick
On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia wrote:
> Yeah, I'd just add a spark-util
> -Brad
>
> On Fri, Jul 11, 2014 at 8:44 PM, Henry Saputra
> wrote:
>> Congrats to the Spark community !
>>
>> On Friday, July 11, 2014, Patrick Wendell wrote:
>>>
>>> I am happy to announce the availability of Spark 1.0.1! This release
>>
I am happy to announce the availability of Spark 1.0.1! This release
includes contributions from 70 developers. Spark 1.0.1 includes fixes
across several areas of Spark, including the core API, PySpark, and
MLlib. It also includes new features in Spark's (alpha) SQL library,
including support for J
Hey Mikhail,
I think (hope?) the -em and -dm options were never in an official
Spark release. They were just in the master branch at some point. Did
you use these during a previous Spark release or were you just on
master?
- Patrick
On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov wrote
It fulfills a few different functions. The main one is giving users a
way to inject Spark as a runtime dependency separately from their
program and make sure they get exactly the right version of Spark. So
a user can bundle an application and then use spark-submit to send it
to different types of c
There isn't currently a way to do this, but it will start dropping
older applications once more than 200 are stored.
On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang wrote:
> Besides restarting the Master, is there any other way to clear the
> Completed Applications in Master web UI?
Hi There,
There is an issue with PySpark-on-YARN that requires users build with
Java 6. The issue has to do with how Java 6 and 7 package jar files
differently.
Can you try building spark with Java 6 and trying again?
- Patrick
On Fri, Jun 27, 2014 at 5:00 PM, sdeb wrote:
> Hello,
>
>
Hey There,
I'd like to start voting on this release shortly because there are a
few important fixes that have queued up. We're just waiting to fix an
akka issue. I'd guess we'll cut a vote in the next few days.
- Patrick
On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim wro
I'll make a comment on the JIRA - thanks for reporting this, let's get
to the bottom of it.
On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman
wrote:
> I've created an issue for this but if anyone has any advice, please let me
> know.
>
> Basically, on about 10 GBs of data, saveAsTextFile()
Out of curiosity - are you guys using speculation, shuffle
consolidation, or any other non-default option? If so that would help
narrow down what's causing this corruption.
On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman
wrote:
> Matt/Ryan,
>
> Did you make any headway on this? My team is
These paths get passed directly to the Hadoop FileSystem API and I
think they support globbing out of the box. So AFAIK it should just
work.
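For example, both of these are expanded by the Hadoop FileSystem layer (the paths are
placeholders):

val logs = sc.textFile("hdfs:///logs/2014-06-*/part-*")
val ab   = sc.textFile("hdfs:///data/{a,b}/*.gz")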
On Tue, Jun 17, 2014 at 9:09 PM, MEETHU MATHEW wrote:
> Hi Jianshi,
>
> I have used wild card characters (*) in my program and it worked..
> My code was like
which will be present in the 1.0 branch of Spark.
- Patrick
On Tue, Jun 17, 2014 at 9:29 PM, Jeremy Lee
wrote:
> I am about to spin up some new clusters, so I may give that a go... any
> special instructions for making them work? I assume I use the "
> --spark-git-repo=" option
By the way, in case it's not clear, I mean our maintenance branches:
https://github.com/apache/spark/tree/branch-1.0
On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell wrote:
> Hey Jeremy,
>
> This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
> make a 1.
Hey Jeremy,
This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
make a 1.0.1 release soon (this patch being one of the main reasons),
but if you are itching for this sooner, you can just checkout the head
of branch-1.0 and you will be able to use r3.XXX instances.
- Patric
If you run locally then Spark doesn't launch remote executors. However,
in this case you can set the memory with the --driver-memory flag to
spark-submit. Does that work?
- Patrick
On Mon, Jun 9, 2014 at 3:24 PM, Henggang Cui wrote:
> Hi,
>
> I'm trying to run the Simple
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA:
https://issues.apache.org/jira/browse/SPARK-2075
On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown wrote:
>
> Hi, Patrick --
>
> Java 7 on the development machines:
>
> » java -version
> 1 ↵
>
Also I should add - thanks for taking time to help narrow this down!
On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell wrote:
> Paul,
>
> Could you give the version of Java that you are building with and the
> version of Java you are running with? Are they the same?
>
> Just off
the jar
because they go beyond the extended zip boundary `jar tvf` won't list
them.
- Patrick
On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown wrote:
> Moving over to the dev list, as this isn't a user-scope issue.
>
> I just ran into this issue with the missing saveAsTextFile, an
ke it work. I think it's being tracked by this JIRA:
https://issues.apache.org/jira/browse/HIVE-5733
- Patrick
On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito
wrote:
> Is there a repo somewhere with the code for the Hive dependencies
> (hive-exec, hive-serde, & hive-metastore) u
In 1.0+ you can just pass the --executor-memory flag to ./bin/spark-shell.
On Fri, Jun 6, 2014 at 12:32 AM, Oleg Proudnikov
wrote:
> Thank you, Hassan!
>
>
> On 6 June 2014 03:23, hassan wrote:
>>
>> just use -Dspark.executor.memory=
>>
>>
>>
>> --
>> View this message in context:
>> http://apac
If that's still an issue, one thing to try is just changing the name
of the cluster. We create groups that are identified with the cluster
name, and there might be something that just got screwed up with the
original group creation and AWS isn't happy.
- Patrick
On Wed, Jun 4, 2014 at 12:
same):
https://github.com/pwendell/kafka-spark-example
You'll want to make an uber jar that includes these packages (run sbt
assembly) and then submit that jar to spark-submit. Also, I'd try
running it locally first (if you aren't already) just to make the
debugging simpler.
- Patrick
Hey Chirag,
Those init scripts are part of the Cloudera Spark package (they are
not in the Spark project itself) so you might try e-mailing their
support lists directly.
- Patrick
On Wed, Jun 4, 2014 at 7:19 AM, chirag lakhani wrote:
> I recently spun up an AWS cluster with cdh 5 us
Hey There,
This is only possible in Scala right now. However, this is almost
never needed since the core API is fairly flexible. I have the same
question as Andrew... what are you trying to do with your RDD?
- Patrick
On Wed, Jun 4, 2014 at 7:49 AM, Andrew Ash wrote:
> Just curious, what
Hey, thanks a lot for reporting this. Do you mind making a JIRA with
the details so we can track it?
- Patrick
On Wed, Jun 4, 2014 at 9:24 AM, Marek Wiewiorka
wrote:
> Exactly the same story - it used to work with 0.9.1 and does not work
> anymore with 1.0.0.
> I ran tests using spark
You can set an arbitrary properties file by adding --properties-file
argument to spark-submit. It would be nice to have spark-submit also
look in SPARK_CONF_DIR by default. If you opened a JIRA for
that I'm sure someone would pick it up.
On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi wrote:
Good catch! Yes I meant 1.0 and later.
On Mon, Jun 2, 2014 at 8:33 PM, Kexin Xie wrote:
> +1 on Option (B) with flag to allow semantics in (A) for back compatibility.
>
> Kexin
>
>
>
> On Tue, Jun 3, 2014 at 1:18 PM, Nicholas Chammas
> wrote:
>>
>> On M
standard installation guide didn't say
> anything about java 7 and suggested to do "-DskipTests" for the build..
> http://spark.apache.org/docs/latest/building-with-maven.html
>
> So, I didn't see the warning message...
>
>
> On Mon, Jun 2, 2014 at 3:48 PM, Pat
(B) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
Spark will delete/clobber an existing destination directory if it
exists, then fully over-write it with new data.
I'm fine to add a flag that allows (B) for backwards-compatibility
reasons, but my point was I'd prefer not t
easily lead to users deleting data by mistake if they don't understand
the exact semantics.
2. It would introduce a third set of semantics here for saveAsXX...
3. It's trivial for users to implement this with two lines of code (if
output dir exists, delete it) before calling saveAsHadoopFile.
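For example, a sketch of that check (the output path is a placeholder):

import org.apache.hadoop.fs.{FileSystem, Path}
val out = new Path("hdfs:///tmp/output")
val fs = FileSystem.get(sc.hadoopConfiguration)
if (fs.exists(out)) fs.delete(out, true)  // recursively remove any existing output
rdd.saveAsTextFile(out.toString)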
- P
Are you building Spark with Java 6 or Java 7? Java 6 uses the extended
Zip format and Java 7 uses Zip64. I think we've tried to add some
build warnings if Java 7 is used, for this reason:
https://github.com/apache/spark/blob/master/make-distribution.sh#L102
Any luck if you use JDK 6 to compile?
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
I accidentally assigned myself way back when I created it). This
should be an easy fix.
On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu wrote:
> Hi, Patrick,
>
> I think https://issues.apache.org/jira/browse/SPARK-1
/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
However, it would be very easy to add an option that allows preserving
the old behavior. Is anyone here interested in contributing that? I
created a JIRA for it:
https://issues.apache.org/jira/browse/SPARK-1993
- Patrick
On Mon, Jun 2
>>
>> I am using the hadoop 2 prebuild package. Probably it doesn't have the
>> latest yarn client.
>>
>> -Simon
>>
>>
>>
>>
>> On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell
>> wrote:
>>>
>>> As a debugging step,
.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/opt/hadoop/conf
> -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m
> org.apache.spark.deploy.SparkSubmit spark-shell --master yarn-client --class
> org.apache.spark.repl.Main
>
> I do see "/opt/hadoop/conf" included
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350
On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell wrote:
> One potential issue here is that mesos is using classifiers now to
> publish their jars. It might be that sbt-pack has trouble with
> dependencies
One potential issue here is that mesos is using classifiers now to
publish their jars. It might be that sbt-pack has trouble with
dependencies that are published using classifiers. I'm pretty sure
mesos is the only dependency in Spark that is using classifiers, so
that's why I mention it.
On Sun,
Hey just to clarify this - my understanding is that the poster
(Jeremy) was using a custom AMI to *launch* spark-ec2. I normally
launch spark-ec2 from my laptop. And he was looking for an AMI that
had a high enough version of python.
Spark-ec2 itself has a flag "-a" that allows you to give a spec
classpath is being
set up correctly.
- Patrick
On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen wrote:
> Hi all,
>
> I tried a couple ways, but couldn't get it to work..
>
> The following seems to be what the online document
> (http://spark.apache.org/docs/latest/running
is pseudo-code):
files = fs.listStatus("s3n://bucket/stuff/*.gz")
files = files.filter(not the bad file)
fileStr = files.map(f => f.getPath.toString).mkString(",")
sc.textFile(fileStr)...
- Patrick
On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas <
nicholas.cham..
> 1) Is there a guarantee that a partition will only be processed on a node
> which is in the "getPreferredLocations" set of nodes returned by the RDD ?
No, there isn't; by default Spark may schedule in a "non preferred"
location after `spark.locality.wait` has expired.
http://spark.apache.org/doc