Whoa, wait, the docker scripts are only used for testing purposes right
now. They have not been designed with the intention of replacing the
spark-ec2 scripts. For instance, there isn't an ssh server running that would
let you stop and restart the cluster (as sbin/stop-all.sh does). Also, we
currently mount
Spark 0.9.0 does include standalone scheduler HA, but it requires running
multiple masters. The docs are located here:
https://spark.apache.org/docs/0.9.0/spark-standalone.html#high-availability
0.9.0 also includes driver HA (for long-running normal or streaming jobs),
allowing you to submit a
Looks like everything from 0.8.0 and before errors similarly (though Spark
0.3 for Scala 2.9 has a malformed link as well).
On Thu, Mar 13, 2014 at 10:52 AM, Walrus theCat walrusthe...@gmail.com wrote:
Sup,
Where can I get Spark 0.7.3? It's 404 here:
http://spark.apache.org/downloads.html
The amplab spark internals talk you mentioned is actually referring to the
RDD persistence levels, where by default we do not persist RDDs to disk (
https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence
).
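For reference, a minimal sketch of opting in to disk persistence via the
standard RDD API (the path is a hypothetical stand-in):

import org.apache.spark.storage.StorageLevel
val data = sc.textFile("hdfs://namenode:8020/data/input.txt")  // hypothetical path
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), the default.
data.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions to disk when memory fills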
spark.shuffle.spill refers to a different behavior -- if the
This could be related to the hash collision bug in ExternalAppendOnlyMap in
0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045
You might try setting spark.shuffle.spill to false and see if that runs any
longer (turning off shuffle spill is dangerous, though, as it may cause
Spark to OOM
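If you do want to experiment with that setting, a minimal sketch using the
0.9-era system-property style of configuration (master URL hypothetical):

import org.apache.spark.SparkContext
// Must be set before the SparkContext is created.
// Caution: with spilling off, large shuffles may OOM, as noted above.
System.setProperty("spark.shuffle.spill", "false")
val sc = new SparkContext("spark://master:7077", "SpillTest")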
Andrew, this should be fixed in 0.9.1, assuming it is the same hash
collision error we found there.
Kane, is it possible your bigger data is corrupt, such that any
operations on it fail?
On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com wrote:
FWIW I've seen correctness
These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209
The orbit errors are quite possibly due to using https instead of http,
whether or not the SSL cert was bad. Let us know if they go away with
reverting to http.
On Sun, Mar 23, 2014 at 11:48 AM,
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size
limit. It's possible that this limit has been exceeded. You might try
running the df command to check the free space of /tmp, or of root if /tmp
isn't listed separately.
3 GB also seems pretty low for the remaining free space of a disk.
many open files failures on any of the slaves
nor on the master :(
Thanks
Ognen
On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle stage and
post-shuffle), there are P^2 files created.
With spark.shuffle.consolidateFiles turned on, we would
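For anyone wanting to try consolidation, a sketch using the SparkConf style
available from 0.9 on (master URL and app name hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setMaster("spark://master:7077")  // hypothetical
  .setAppName("ConsolidationTest")
  .set("spark.shuffle.consolidateFiles", "true")  // off by default in this era
val sc = new SparkContext(conf)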
To be clear on what your configuration will do:
- SPARK_DAEMON_MEMORY=8g will make your standalone master and worker
schedulers have a lot of memory. These do not impact the actual amount of
useful memory given to executors or your driver, however, so you probably
don't need to set this.
-
1. Not sure on this; I don't believe we change the defaults from Java.
2. SPARK_JAVA_OPTS can be used to set the various Java properties (other
than memory heap size itself)
3. If you want to have 8 GB executors then, yes, only two can run on each
16 GB node. (In fact, you should also keep a
Assuming you're using a new enough version of Spark, you should use
spark.executor.memory to set the memory for your executors, without
changing the driver memory. See the docs for your version of Spark.
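A sketch of what that looks like (SparkConf style; the value is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().set("spark.executor.memory", "8g")  // per-executor heap
val sc = new SparkContext(conf)  // driver memory is configured separately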
On Thu, Mar 27, 2014 at 10:48 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
My worker
PM, Aaron Davidson ilike...@gmail.com wrote:
Spark will only use each core for one task at a time, so doing
sc.textFile(s3 location, num reducers)
where you set num reducers to at least as many as the total number of
cores in your cluster, is about as fast as you can get out of the box. Same
nicholas.cham...@gmail.com wrote:
OK sweet. Thanks for walking me through that.
I wish this were StackOverflow so I could bestow some nice rep on all you
helpful people.
On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.com wrote:
Note that you may have minSplits set to more than
Koert's answer is very likely correct. This implicit definition which
converts an RDD[(K, V)] to provide PairRDDFunctions requires that a ClassTag
be available for K:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1124
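To see the requirement concretely, a sketch (assuming a pre-1.3 Spark, where
the conversion must be imported from SparkContext._):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._  // brings rddToPairRDDFunctions into scope

// Does not compile: no ClassTag is available for the abstract type K.
// def sumByKey[K](rdd: RDD[(K, Int)]) = rdd.reduceByKey(_ + _)

// Compiles: the context bound supplies the required ClassTag[K].
def sumByKey[K: ClassTag](rdd: RDD[(K, Int)]) = rdd.reduceByKey(_ + _)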
To fully understand what's
Hm, yeah, the docs are not clear on this one. The function you're looking
for to change the number of partitions on any ol' RDD is repartition(),
which is available in master but for some reason doesn't seem to show up in
the latest docs. Sorry about that, I also didn't realize partitionBy() had
Yup, sorry about that. This error message should not produce incorrect
behavior, but it is annoying. Posted a patch to fix it:
https://github.com/apache/spark/pull/361
Thanks for reporting it!
On Tue, Apr 8, 2014 at 9:54 AM, Koert Kuipers ko...@tresata.com wrote:
when i start spark-shell i
It is as Jagat said. The Masters do not need to know about one another, as
ZooKeeper manages their implicit communication. As for Workers (and
applications, such as spark-shell), once a Worker is registered with
*some* Master,
its metadata is stored in ZooKeeper such that if another Master is
This is likely because hdfs's core-site.xml (or something similar) provides
an fs.default.name which changes the default FileSystem and Spark uses
the Hadoop FileSystem API to resolve paths. Anyway, your solution is
definitely a good one -- another would be to remove hdfs from Spark's
classpath if
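One way to sidestep the default-FileSystem question entirely is to use fully
qualified URIs (hosts and paths here are hypothetical):

val localFile = sc.textFile("file:///home/user/data.txt")          // always local filesystem
val hdfsFile  = sc.textFile("hdfs://namenode:8020/user/data.txt")  // always HDFS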
This was actually a bug in the log message itself, where the Master would
print its own ip and port instead of the registered worker's. It has been
fixed in 0.9.1 and 1.0.0 (here's the patch:
https://github.com/apache/spark/commit/c0795cf481d47425ec92f4fd0780e2e0b3fdda85
).
Sorry about the
successfully) ;)
br, Gerd
On 13 April 2014 10:17, Aaron Davidson ilike...@gmail.com wrote:
By the way, 64 MB of RAM per machine is really small, I'm surprised Spark
can even start up on that! Perhaps you meant to set SPARK_DAEMON_MEMORY so
that the actual worker process itself would
It's likely the Ints are getting boxed at some point along the journey
(perhaps starting with parallelize()). I could definitely see boxed Ints
being 7 times larger than primitive ones.
If you wanted to be very careful, you could try making an RDD[Array[Int]],
where each element is simply a
, this feature is not supported in YARN or Mesos fine-grained mode.
On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel manojsamelt...@gmail.com wrote:
Could you please elaborate how drivers can be restarted automatically?
Thanks,
On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson ilike
Cool! It's pretty rare to actually get logs from a wild hardware failure.
The problem is as you said, that the executor keeps failing, but the worker
doesn't get the hint, so it keeps creating new, bad executors.
However, this issue should not have caused your cluster to fail to start
up. In the
if the same error happens.
On Tue, Apr 15, 2014 at 9:17 AM, Aaron Davidson ilike...@gmail.com wrote:
Cool! It's pretty rare to actually get logs from a wild hardware failure.
The problem is as you said, that the executor keeps failing, but the worker
doesn't get the hint, so it keeps creating new, bad
Hey, I was talking about something more like:
val size = 1024 * 1024
val numSlices = 8
val arr = Array.fill[Array[Int]](numSlices) { new Array[Int](size /
numSlices) }
val rdd = sc.parallelize(arr, numSlices).cache()
val size2 = rdd.map(_.length).sum()
assert(size2 == size)
Ah, I think I can see where your issue may be coming from. In spark-shell,
the MASTER is "local[*]", which just means it uses a pre-set number of
cores. This distinction only matters because the default number of slices
created from sc.parallelize() is based on the number of cores.
So when you run
Take a look at the minSplits argument for SparkContext#textFile [1] -- the
default value is 2. You can simply set this to 1 if you'd prefer not to
split your data.
[1]
http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
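A sketch (path hypothetical); note that minSplits is only a lower bound, so
splittable formats may still produce more partitions:

// Request a single split instead of the default minimum of 2.
val rdd = sc.textFile("hdfs://namenode:8020/data/input.txt", 1)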
On Tue, Apr 15, 2014 at 8:44 AM, Diana
This is probably related to the Scala bug that :cp does not work:
https://issues.scala-lang.org/browse/SI-6502
On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat walrusthe...@gmail.com wrote:
Actually altering the classpath in the REPL causes the provided
SparkContext to disappear:
scala
change for spark also ??
On Tue, Apr 15, 2014 at 9:53 PM, Manoj Samel manojsamelt...@gmail.com wrote:
Thanks Aaron, this is useful !
- Manoj
On Mon, Apr 14, 2014 at 8:12 PM, Aaron Davidson ilike...@gmail.com wrote:
Launching drivers inside the cluster was a feature added in 0.9
Looks like you're depending on Spark 0.9.1, which doesn't have Spark SQL.
Assuming you've downloaded Spark, just run 'mvn install' to publish Spark
locally, and depend on Spark version 1.0.0-SNAPSHOT.
On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru
diplomaticg...@gmail.com wrote:
It's a simple
Could it be that the default number of partitions used by
parallelize() is too small in this case? Try something like
spark.parallelize(word_mapping.value.toSeq, 60). (Given your setup, it
should already be 30, but perhaps that's not the case in YARN mode...)
On Fri, Apr 25, 2014 at
I'd just like to update this thread by pointing to the PR based on our
initial design: https://github.com/apache/spark/pull/640
This solution is a little more general and avoids catching IOException
altogether. Long live exception propagation!
On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell
If you're using standalone mode, you need to make sure the Spark Workers
know about the extra memory. This can be configured in spark-env.sh on the
workers as
export SPARK_WORKER_MEMORY=4g
On Tue, May 6, 2014 at 5:29 PM, Ian Ferreira ianferre...@hotmail.com wrote:
Hi there,
Why can’t I seem
I didn't get the original message, only the reply. Ruh-roh.
On Sun, May 11, 2014 at 8:09 AM, Azuryy azury...@gmail.com wrote:
Got.
But it doesn't indicate all can receive this test.
Mail list is unstable recently.
Sent from my iPhone5s
On May 10, 2014, at 13:31, Matei Zaharia
One way to ensure Spark writes more partitions is by using
RDD#repartition() to make each partition smaller. One Spark partition
always corresponds to one file in the underlying store, and it's usually a
good idea to have each partition's size range somewhere between 64 MB and 256
MB. Too few
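As a sketch (sizes and paths hypothetical), targeting roughly 128 MB files
for a ~10 GB dataset:

val rdd = sc.textFile("hdfs://namenode:8020/input")
// ~10 GB / 80 partitions is roughly 128 MB per output file.
rdd.repartition(80).saveAsTextFile("hdfs://namenode:8020/output")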
Out of curiosity, do you have a library in mind that would make it easy to
setup a bit torrent network and distribute files in an rsync (i.e., apply a
diff to a tree, ideally) fashion? I'm not familiar with this space, but we
do want to minimize the complexity of our standard ec2 launch scripts to
, 2014 at 11:32 PM, Aaron Davidson ilike...@gmail.com wrote:
One issue with using Spark itself is that this rsync is required to get
Spark to work...
Also note that a similar strategy is used for *updating* the spark
cluster on ec2, where the diff aspect is much more important, as you
might
One issue is that new jars can be added during the lifetime of a
SparkContext, which can mean after executors are already started. Off-heap
storage is always serialized, correct.
On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote:
just for my clarification: off heap cannot
I'd just like to point out that, along with Matei, I have not seen workers
drop even under the most exotic job failures. We're running pretty close to
master, though; perhaps it is related to an uncaught exception in the
Worker from a prior version of Spark.
On Tue, May 20, 2014 at 11:36 AM,
:
spark.storage.memoryFraction spark.shuffle.memoryFraction ??
On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson [hidden email] wrote:
This is very likely because the serialized map output locations buffer
exceeds the akka frame size. Please try setting
Unfortunately, those errors are actually due to an Executor that exited,
such that the connection between the Worker and Executor failed. This is
not a fatal issue, unless there are analogous messages from the Worker to
the Master (which should be present, if they exist, at around the same
point
How many partitions are in your input data set? A possibility is that your
input data has 10 unsplittable files, so you end up with 10 partitions. You
could improve this by using RDD#repartition().
Note that mapPartitionsWithIndex is sort of the main processing loop for
many Spark functions. It
() in shark CLI;
does shark support SET mapred.max.split.size to control file size?
If yes, after I create a table I can control the file count, and then I can
control the task number.
If not, does anyone know another way to control the task number in the shark CLI?
2014-05-26 9:36 GMT+08:00 Aaron Davidson ilike
=6400, it does not work either.
Is there another way to control the task number in the shark CLI?
2014-05-26 10:38 GMT+08:00 Aaron Davidson ilike...@gmail.com:
You can try setting mapred.map.tasks to get Hive to do the right thing.
On Sun, May 25, 2014 at 7:27 PM, qingyang li liqingyang1...@gmail.com wrote
I suppose you actually ran "publish-local" and not "publish local" like
your example showed. That being the case, could you show the compile error
that occurs? It could be related to the hadoop version.
On Sun, May 25, 2014 at 7:51 PM, ABHISHEK abhi...@gmail.com wrote:
Hi,
I'm trying to install
-configuration) Repository for publishing is not
specified.
[error] Total time: 58 s, completed May 25, 2014 8:20:46 PM
On Mon, May 26, 2014 at 8:46 AM, Aaron Davidson ilike...@gmail.com wrote:
I suppose you actually ran "publish-local" and not "publish local" like
your example showed. That being
Sorry, to clarify: Spark *does* effectively turn Akka's failure detector
off.
On Tue, May 27, 2014 at 10:47 AM, Aaron Davidson ilike...@gmail.com wrote:
Spark should effectively turn Akka's failure detector off, because we
historically had problems with GCs and other issues causing
Spark should effectively turn Akka's failure detector off, because we
historically had problems with GCs and other issues causing
disassociations. The only thing that should cause these messages nowadays
is if the TCP connection (which Akka sustains between Actor Systems on
different machines)
Currently there is not a way to do this using textFile(). However, you
could pretty straightforwardly define your own subclass of HadoopRDD [1] in
order to get access to this information (likely using
mapPartitionsWithIndex to look up the InputSplit for a particular
partition).
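As a sketch of an alternative that avoids subclassing: later versions expose
HadoopRDD.mapPartitionsWithInputSplit (a developer API), assuming your
version has it (the path is hypothetical):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

val hadoopRdd = sc.hadoopFile(
  "hdfs://namenode:8020/data",  // hypothetical path
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text]
).asInstanceOf[HadoopRDD[LongWritable, Text]]

// Tag each line with the name of the file it came from.
val withFileNames = hadoopRdd.mapPartitionsWithInputSplit { (split, iter) =>
  val file = split.asInstanceOf[FileSplit].getPath.toString
  iter.map { case (_, line) => (file, line.toString) }
}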
Note that
Also, the Spark examples can run out of the box on a single machine, as
well as a cluster. See the Master URLs heading here:
http://spark.apache.org/docs/latest/submitting-applications.html#master-urls
On Fri, May 30, 2014 at 9:24 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
With
There is no fundamental issue if you're running on data that is larger than
cluster memory size. Many operations can stream data through, and thus
memory usage is independent of input data size. Certain operations require
an entire *partition* (not dataset) to fit in memory, but there are not
many
--
Chanwit Kaewkasi
linkedin.com/in/chanwit
On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson ilike...@gmail.com
wrote:
Spark should effectively turn Akka's failure detector off, because we
historically had problems with GCs and other issues causing
disassociations.
The only thing that should
straightforward ways of
producing assembly jars.
On Sat, May 31, 2014 at 11:23 PM, Russell Jurney russell.jur...@gmail.com
wrote:
Thanks for the fast reply.
I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
standalone mode.
On Saturday, May 31, 2014, Aaron Davidson ilike
require that I know
where $SPARK_HOME is. However, I have no idea. Any idea where that might be?
On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson ilike...@gmail.com
wrote:
Gotcha. The easiest way to get your dependencies to your Executors would
probably be to construct your SparkContext
. Then when I run rdd.first, I get this over and over:
14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient memory
On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson ilike
In addition to setting the Standalone memory, you'll also need to tell your
SparkContext to claim the extra resources. Set spark.executor.memory to
1600m as well. This should be a system property set in SPARK_JAVA_OPTS in
conf/spark-env.sh (in 0.9.1, which you appear to be using) -- e.g.,
export SPARK_JAVA_OPTS="-Dspark.executor.memory=1600m"
+1 please re-add this feature
On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com wrote:
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
I accidentally assigned myself way back when I created it). This
should be an easy fix.
On Mon, Jun 2, 2014 at
/parcels/CDH/lib/hadoop/lib/native
-Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker
spark://hivecluster2:7077
On Sun, Jun 1, 2014 at 7:41 PM, Aaron Davidson ilike...@gmail.com
wrote:
Sounds like you have two shells running, and the first one is taking all
your resources. Do
- files may be left over from previous
saves, which is dangerous.
Is this correct?
On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson ilike...@gmail.com wrote:
+1 please re-add this feature
On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com
wrote:
Thanks for pointing
It is not a very good idea to save the results in the exact same place as
the data. Any failures during the job could lead to corrupted data, because
recomputing the lost partitions would involve reading the original
(now-nonexistent) data.
As such, the only safe way to do this would be to do as
Looks like your problem is local mode:
https://github.com/apache/spark/blob/640f9a0efefd42cff86aecd4878a3a57f5ae85fa/core/src/main/scala/org/apache/spark/SparkContext.scala#L1430
For some reason, someone decided not to do retries when running in local
mode. Not exactly sure why, feel free to
The scripts for Spark 1.0 actually specify this property in
/root/spark/conf/spark-defaults.conf
I didn't know that this would override the --executor-memory flag, though,
that's pretty odd.
On Thu, Jun 12, 2014 at 6:02 PM, Aliaksei Litouka
aliaksei.lito...@gmail.com wrote:
Yes, I am
Note also that Java does not work well with very large JVMs due to this
exact issue. There are two commonly used workarounds:
1) Spawn multiple (smaller) executors on the same machine. This can be done
by creating multiple Workers (via SPARK_WORKER_INSTANCES in standalone
mode[1]); see the sketch after this list.
2) Use Tachyon
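For option 1, a sketch of the spark-env.sh entries on each worker machine
(the numbers are hypothetical; check the standalone docs for your version):

export SPARK_WORKER_INSTANCES=4  # four smaller workers on this machine
export SPARK_WORKER_MEMORY=16g   # memory each worker may allocate to executors
export SPARK_WORKER_CORES=8      # cores each worker may allocate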
I remember having to do a similar thing in the spark docker scripts for
testing purposes. Were you able to modify the /etc/hosts directly? I
remember issues with that as docker apparently mounts it as part of its
read-only filesystem.
On Tue, Jun 17, 2014 at 4:36 PM, Mohit Jaggi
repartition() is actually just an alias of coalesce(), but with the
shuffle flag set to true. This shuffle is probably what you're seeing as
taking longer, but it is required when you go from a smaller number of
partitions to a larger one.
When actually decreasing the number of partitions,
Yup, alright, same solution then :)
On Tue, Jun 17, 2014 at 7:39 PM, Mohit Jaggi mohitja...@gmail.com wrote:
I used --privileged to start the container and then unmounted /etc/hosts.
Then I created a new /etc/hosts file
On Tue, Jun 17, 2014 at 4:58 PM, Aaron Davidson ilike...@gmail.com
Note that regarding a long load time, data format means a whole lot in
terms of query performance. If you load all your data into compressed,
columnar Parquet files on local hardware, Spark SQL would also perform far,
far better than it would reading from gzipped S3 files. You must also be
careful
Please note that this:
for (sentence <- sourcerdd) {
  ...
}
is actually Scala syntactic sugar which is converted into
sourcerdd.foreach { sentence => ... }
What this means is that this will actually run on the cluster, which is
probably not what you want if you're trying to print them.
Try
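For instance (a sketch, since the original suggestion is truncated here),
pull the data back to the driver before printing:

sourcerdd.take(100).foreach(println)   // print a sample on the driver
// or, if the RDD is known to be small:
sourcerdd.collect().foreach(println)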
Is it possible you have blank lines in your input? Not that this should be
an error condition, but it may be what's causing it.
On Wed, Jun 25, 2014 at 11:57 AM, durin m...@simon-schaefer.net wrote:
Hi Zongheng Yang,
thanks for your response. Reading your answer, I did some more tests and
If you're using the spark-ec2 scripts, you may have to change
/root/ephemeral-hdfs/conf/log4j.properties or something like that, as that
is added to the classpath before Spark's own conf.
On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp wrote:
I have a log4j.xml in
I don't have specific solutions for you, but the general things to try are:
- Decrease task size by broadcasting any non-trivial objects.
- Increase duration of tasks by making them less fine-grained.
How many tasks are you sending? I've seen in the past something like 25
seconds for ~10k total
A simple throughput test is also repartition()ing a large RDD. This also
stresses the disks, though, so you might try to mount your spark temporary
directory as a ramfs.
On Fri, Jun 27, 2014 at 5:57 PM, danilopds danilob...@gmail.com wrote:
Hi,
According to the research paper below from
If you want to stick with Java serialization and need to serialize a
non-Serializable object, your best choices are probably to either subclass
it with a Serializable one or wrap it in a class of your own which
implements its own writeObject/readObject methods (see here:
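A minimal sketch of the wrapper approach (NonSerializableThing and its single
String field are hypothetical stand-ins for the real class):

import java.io.{IOException, ObjectInputStream, ObjectOutputStream}

// Hypothetical third-party class that is not Serializable.
class NonSerializableThing(val state: String)

class SerializableWrapper(@transient var thing: NonSerializableThing)
    extends Serializable {

  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    out.writeUTF(thing.state)  // write only the state needed to rebuild it
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    thing = new NonSerializableThing(in.readUTF())  // rebuild on the other side
  }
}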
Where are you running the spark-class version? Hopefully also on the
workers.
If you're trying to centrally start/stop all workers, you can add a
slaves file to the spark conf/ directory which is just a list of your
hosts, one per line. Then you can just use ./sbin/start-slaves.sh to
start the
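The slaves file is just hostnames, one per line, e.g. (hosts hypothetical):

# conf/slaves
worker1.example.com
worker2.example.com
worker3.example.com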
In your spark-env.sh, do you happen to set SPARK_PUBLIC_DNS or something of
that kind? This error suggests the worker is trying to bind a server on the
master's IP, which clearly doesn't make sense
On Mon, Jun 30, 2014 at 11:59 PM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:
Hi,
I did
There is a difference from actual GC overhead, which can be reduced by
reusing objects, versus this error, which actually means you ran out of
memory. This error can probably be relieved by increasing your executor
heap size, unless your data is corrupt and it is allocating huge arrays, or
you are
and they
all compete for running GCs which slow things down and trigger the error
you saw. By reducing the number of cores, there are more cpu resources
available to a task so the GC could finish before the error gets thrown.
HTH,
Jerry
On Tue, Jul 8, 2014 at 1:35 PM, Aaron Davidson ilike
Hmm, looks like the Executor is trying to connect to the driver on
localhost, from this line:
14/07/08 11:07:13 INFO CoarseGrainedExecutorBackend: Connecting to driver:
akka.tcp://spark@localhost:39701/user/CoarseGrainedScheduler
What is your setup? Standalone mode with 4 separate machines? Are
Not sure exactly what is happening but perhaps there are ways to
restructure your program for it to work better. Spark is definitely able to
handle much, much larger workloads.
+1 @Reynold
Spark can handle big big data. There are known issues with informing the
user about what went wrong
You actually should avoid setting SPARK_PUBLIC_DNS unless necessary, I
thought you might have preemptively done so. I think the issue is actually
related to your network configuration, as Spark probably failed to find
your driver's ip address. Do you see a warning on the driver that looks
By the way, you can run the sc.getConf.get("spark.driver.host") thing
inside spark-shell, whether or not the Executors actually start up
successfully.
On Tue, Jul 8, 2014 at 8:23 PM, Aaron Davidson ilike...@gmail.com wrote:
You actually should avoid setting SPARK_PUBLIC_DNS unless necessary, I
Yes, groupByKey() does partition by the hash of the key unless you specify
a custom Partitioner.
(1) If you were to use groupByKey() when the data was already partitioned
correctly, the data would indeed not be shuffled. Here is the associated
code, you'll see that it simply checks that the
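A sketch of the already-partitioned case (partition count hypothetical;
assumes the pre-1.3 imports):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // PairRDDFunctions

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val partitioned = pairs.partitionBy(new HashPartitioner(8)).cache()
// groupByKey picks up the existing partitioner, so no second shuffle occurs.
val grouped = partitioned.groupByKey()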
The "netlib.BLAS: Failed to load implementation" warning only means that
the BLAS implementation may be slower than using a native one. The reason
why it only shows up at the end is that the library is only used for the
finalization step of the KMeans algorithm, so your job should've been
wrapping
If you don't really care about the batchedDegree, but rather just want to
do operations over some set of elements rather than one at a time, then
just use mapPartitions().
Otherwise, if you really do want certain-sized batches and you are able to
relax the constraints slightly, one option is to construct
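For the simple case, a mapPartitions sketch (batch size hypothetical):

val rdd = sc.parallelize(1 to 10000, 4)
// Work on up to 100 elements at a time within each partition;
// the final batch in a partition may be smaller.
val batchSums = rdd.mapPartitions { iter =>
  iter.grouped(100).map(_.sum)
}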
I think this is probably dying on the driver itself, as you are probably
materializing the whole dataset inside your python driver. How large is
spark_data_array compared to your driver memory?
On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi mohitja...@gmail.com wrote:
I put the same dataset into
Also check the web ui for that. Each iteration will have one or more stages
associated with it in the driver web ui.
On Sat, Jul 12, 2014 at 6:47 PM, crater cq...@ucmerced.edu wrote:
Hi Xiangrui,
Thanks for the information. Also, is it possible to figure out the
execution time per
I also did a quick glance through the code and couldn't find anything
worrying that should be included in the task closures. The only possibly
unsanitary part is the Updater you pass in -- what is your Updater and is
it possible it's dragging in a significant amount of extra state?
On Sat, Jul
at 9:13 AM, Guanhua Yan gh...@lanl.gov wrote:
Thanks, Aaron. I replaced groupByKey with reduceByKey along with some list
concatenation operations, and found that the performance becomes even
worse. So groupByKey is not that bad in my code.
Best regards,
- Guanhua
From: Aaron Davidson ilike
for running union? Could that create larger task sizes?
Kyle
On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson ilike...@gmail.com
wrote:
I also did a quick glance through the code and couldn't find anything
worrying that should be included in the task closures. The only possibly
unsanitary part
Hm, this is not a public API, but you should theoretically be able to use
TestBlockId if you like. Internally, we just use the BlockId's natural
hashing and equality to do lookups and puts, so it should work fine.
However, since it is in no way public API, it may change even in
maintenance
What's the exception you're seeing? Is it an OOM?
On Mon, Jul 21, 2014 at 11:20 AM, chutium teng@gmail.com wrote:
Hi,
unfortunately it is not so straightforward:
xxx_parquet.db
is a folder for a managed database created by hive/impala, so every
sub-element in it is a table in
In particular, take a look at the TorrentBroadcast, which should be much
more efficient than HttpBroadcast (which was the default in 1.0) for large
files.
If you find that TorrentBroadcast doesn't work for you, then another way to
solve this problem is to place the data on all nodes' local disks,
information here about executors but is
ambiguous about whether there are single executors or multiple executors
on
each machine.
This message from Aaron Davidson implies that the executor memory
should be set to total available memory on the machine divided by the
number of cores:
*http
I see. There should not be a significant algorithmic difference between
those two cases, as far as I can think, but there is a good bit of
local-mode-only logic in Spark.
One typical problem we see on large-heap, many-core JVMs, though, is much
more time spent in garbage collection. I'm not sure
rdd.toLocalIterator will do almost what you want, but requires that each
individual partition fits in memory (rather than each individual line).
Hopefully that's sufficient, though.
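Usage is a one-liner; it fetches one partition at a time to the driver rather
than collect()'s everything-at-once:

rdd.toLocalIterator.foreach(println)  // streams partition by partition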
On Fri, Aug 1, 2014 at 1:38 AM, Andrei faithlessfri...@gmail.com wrote:
Is there a way to get an iterator from an RDD?
understand that returned byte array somehow corresponds to actual data,
but how can I get it?
On Fri, Aug 1, 2014 at 8:49 PM, Aaron Davidson ilike...@gmail.com wrote:
rdd.toLocalIterator will do almost what you want, but requires that each
individual partition fits in memory (rather than
The driver must initially compute the partitions and their preferred
locations for each part of the file, which results in a serial
getFileBlockLocations() on each part. However, I would expect this to take
several seconds, not minutes, to perform on 1000 parts. Is your driver
inside or outside of
Yes, good point, I believe the masterLock is now unnecessary altogether.
The reason for its initial existence was that changeMaster() originally
could be called out-of-band of the actor, and so we needed to make sure the
master reference did not change out from under us. Now it appears that all
This is likely due to a bug in shuffle file consolidation (which you have
enabled) which was hopefully fixed in 1.1 with this patch:
https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd
Until 1.0.3 or 1.1 are released, the simplest solution is to disable