Re: State of spark docker script

2014-03-09 Thread Aaron Davidson
Whoa, wait, the docker scripts are only used for testing purposes right now. They have not been designed with the intention of replacing the spark-ec2 scripts. For instance, there isn't an ssh server running so you can stop and restart the cluster (like sbin/stop-all.sh). Also, we currently mount

Re: is spark 0.9.0 HA?

2014-03-10 Thread Aaron Davidson
Spark 0.9.0 does include standalone scheduler HA, but it requires running multiple masters. The docs are located here: https://spark.apache.org/docs/0.9.0/spark-standalone.html#high-availability 0.9.0 also includes driver HA (for long-running normal or streaming jobs), allowing you to submit a

Re: links for the old versions are broken

2014-03-13 Thread Aaron Davidson
Looks like everything from 0.8.0 and before errors similarly (though Spark 0.3 for Scala 2.9 has a malformed link as well). On Thu, Mar 13, 2014 at 10:52 AM, Walrus theCat walrusthe...@gmail.com wrote: Sup, Where can I get Spark 0.7.3? It's 404 here: http://spark.apache.org/downloads.html

Re: Local Standalone Application and shuffle spills

2014-03-13 Thread Aaron Davidson
The amplab spark internals talk you mentioned is actually referring to the RDD persistence levels, where by default we do not persist RDDs to disk ( https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence ). spark.shuffle.spill refers to a different behavior -- if the

Re: distinct on huge dataset

2014-03-22 Thread Aaron Davidson
This could be related to the hash collision bug in ExternalAppendOnlyMap in 0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045 You might try setting spark.shuffle.spill to false and see if that runs any longer (turning off shuffle spill is dangerous, though, as it may cause Spark to OOM
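
For illustration, a minimal sketch of the spark.shuffle.spill workaround described above, assuming a SparkConf-based setup (the app name and input path are placeholders, not from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // Disable shuffle spilling so the ExternalAppendOnlyMap path is avoided.
    // As noted above, this risks OOM if the shuffle data cannot fit in memory.
    val conf = new SparkConf()
      .setAppName("distinct-spill-workaround")
      .set("spark.shuffle.spill", "false")
    val sc = new SparkContext(conf)

    val deduped = sc.textFile("hdfs:///path/to/input").distinct()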

Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Andrew, this should be fixed in 0.9.1, assuming it is the same hash collision error we found there. Kane, is it possible your bigger data is corrupt, such that any operations on it fail? On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com wrote: FWIW I've seen correctness

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Aaron Davidson
These errors should be fixed on master with Sean's PR: https://github.com/apache/spark/pull/209 The orbit errors are quite possibly due to using https instead of http, whether or not the SSL cert was bad. Let us know if they go away with reverting to http. On Sun, Mar 23, 2014 at 11:48 AM,

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size limit. It's possible that this limit has been exceeded. You might try running the df command to check the free space of /tmp (or root, if tmp isn't listed). 3 GB also seems pretty low for the remaining free space of a disk.

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
many open files failures on any of the slaves nor on the master :( Thanks Ognen On 3/23/14, 8:38 PM, Aaron Davidson wrote: By default, with P partitions (for both the pre-shuffle stage and post-shuffle), there are P^2 files created. With spark.shuffle.consolidateFiles turned on, we would

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Aaron Davidson
To be clear on what your configuration will do: - SPARK_DAEMON_MEMORY=8g will make your standalone master and worker schedulers have a lot of memory. These do not impact the actual amount of useful memory given to executors or your driver, however, so you probably don't need to set this. -

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Aaron Davidson
1. Not sure on this; I don't believe we change the defaults from Java. 2. SPARK_JAVA_OPTS can be used to set the various Java properties (other than the memory heap size itself). 3. If you want to have 8 GB executors then, yes, only two can run on each 16 GB node. (In fact, you should also keep a

Re: Setting SPARK_MEM higher than available memory in driver

2014-03-28 Thread Aaron Davidson
Assuming you're using a new enough version of Spark, you should use spark.executor.memory to set the memory for your executors, without changing the driver memory. See the docs for your version of Spark. On Thu, Mar 27, 2014 at 10:48 PM, Tsai Li Ming mailingl...@ltsai.com wrote: Hi, My worker

Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Aaron Davidson
PM, Aaron Davidson ilike...@gmail.com wrote: Spark will only use each core for one task at a time, so doing sc.textFile(s3 location, num reducers), where you set num reducers to at least as many as the total number of cores in your cluster, is about as fast as you can get out of the box. Same

Re: Best practices: Parallelized write to / read from S3

2014-04-01 Thread Aaron Davidson
nicholas.cham...@gmail.com wrote: OK sweet. Thanks for walking me through that. I wish this were StackOverflow so I could bestow some nice rep on all you helpful people. On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson ilike...@gmail.com wrote: Note that you may have minSplits set to more than

Re: Generic types and pair RDDs

2014-04-01 Thread Aaron Davidson
Koert's answer is very likely correct. This implicit definition which converts an RDD[(K, V)] to provide PairRDDFunctions requires a ClassTag is available for K: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1124 To fully understand what's
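
A small sketch of the ClassTag requirement being described, using a hypothetical helper (not code from the thread):

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._   // pre-1.3 implicit that provides PairRDDFunctions
    import org.apache.spark.rdd.RDD

    // Without the ClassTag bounds on K and V, the implicit conversion to
    // PairRDDFunctions cannot be materialized and reduceByKey will not compile.
    def sumByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)])(implicit num: Numeric[V]): RDD[(K, V)] =
      rdd.reduceByKey(num.plus)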

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Aaron Davidson
Hm, yeah, the docs are not clear on this one. The function you're looking for to change the number of partitions on any ol' RDD is repartition(), which is available in master but for some reason doesn't seem to show up in the latest docs. Sorry about that, I also didn't realize partitionBy() had

Re: assumption that lib_managed is present

2014-04-08 Thread Aaron Davidson
Yup, sorry about that. This error message should not produce incorrect behavior, but it is annoying. Posted a patch to fix it: https://github.com/apache/spark/pull/361 Thanks for reporting it! On Tue, Apr 8, 2014 at 9:54 AM, Koert Kuipers ko...@tresata.com wrote: when i start spark-shell i

Re: Multi master Spark

2014-04-09 Thread Aaron Davidson
It is as Jagat said. The Masters do not need to know about one another, as ZooKeeper manages their implicit communication. As for Workers (and applications, such as spark-shell), once a Worker is registered with some Master, its metadata is stored in ZooKeeper such that if another Master is

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread Aaron Davidson
This is likely because hdfs's core-site.xml (or something similar) provides an fs.default.name which changes the default FileSystem, and Spark uses the Hadoop FileSystem API to resolve paths. Anyway, your solution is definitely a good one -- another would be to remove hdfs from Spark's classpath if

Re: Master registers itself at startup?

2014-04-13 Thread Aaron Davidson
This was actually a bug in the log message itself, where the Master would print its own ip and port instead of the registered worker's. It has been fixed in 0.9.1 and 1.0.0 (here's the patch: https://github.com/apache/spark/commit/c0795cf481d47425ec92f4fd0780e2e0b3fdda85 ). Sorry about the

Re: Master registers itself at startup?

2014-04-13 Thread Aaron Davidson
successfully) ;) br, Gerd On 13 April 2014 10:17, Aaron Davidson ilike...@gmail.com wrote: By the way, 64 MB of RAM per machine is really small, I'm surprised Spark can even start up on that! Perhaps you meant to set SPARK_DAEMON_MEMORY so that the actual worker process itself would

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-14 Thread Aaron Davidson
It's likely the Ints are getting boxed at some point along the journey (perhaps starting with parallelize()). I could definitely see boxed Ints being 7 times larger than primitive ones. If you wanted to be very careful, you could try making an RDD[Array[Int]], where each element is simply a

Re: Spark resilience

2014-04-14 Thread Aaron Davidson
, this feature is not supported in YARN or Mesos fine-grained mode. On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel manojsamelt...@gmail.com wrote: Could you please elaborate how drivers can be restarted automatically? Thanks, On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson ilike

Re: Lost an executor error - Jobs fail

2014-04-14 Thread Aaron Davidson
Cool! It's pretty rare to actually get logs from a wild hardware failure. The problem is as you said, that the executor keeps failing, but the worker doesn't get the hint, so it keeps creating new, bad executors. However, this issue should not have caused your cluster to fail to start up. In the

Re: Lost an executor error - Jobs fail

2014-04-15 Thread Aaron Davidson
if the same error happens. On Tue, Apr 15, 2014 at 9:17 AM, Aaron Davidson ilike...@gmail.com wrote: Cool! It's pretty rare to actually get logs from a wild hardware failure. The problem is as you said, that the executor keeps failing, but the worker doesn't get the hint, so it keeps creating new, bad

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread Aaron Davidson
Hey, I was talking about something more like: val size = 1024 * 1024 val numSlices = 8 val arr = Array.fill[Array[Int]](numSlices) { new Array[Int](size / numSlices) } val rdd = sc.parallelize(arr, numSlices).cache() val size2 = rdd.map(_.length).sum() assert( size2 ==

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread Aaron Davidson
Ah, I think I can see where your issue may be coming from. In spark-shell, the MASTER is local[*], which just means it uses a pre-set number of cores. This distinction only matters because the default number of slices created from sc.parallelize() is based on the number of cores. So when you run

Re: partitioning of small data sets

2014-04-15 Thread Aaron Davidson
Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data. [1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext On Tue, Apr 15, 2014 at 8:44 AM, Diana
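
A quick sketch of that parameter in use, assuming an existing SparkContext named sc (the path is a placeholder):

    // textFile's minSplits defaults to 2, so even a tiny file is split in two.
    // Passing 1 keeps a small data set in a single partition.
    val small = sc.textFile("hdfs:///path/to/small.txt", 1)
    println(small.partitions.length)   // 1 for a small file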

Re: can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-15 Thread Aaron Davidson
This is probably related to the Scala bug that :cp does not work: https://issues.scala-lang.org/browse/SI-6502 On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat walrusthe...@gmail.comwrote: Actually altering the classpath in the REPL causes the provided SparkContext to disappear: scala

Re: Spark resilience

2014-04-16 Thread Aaron Davidson
change for spark also ?? On Tue, Apr 15, 2014 at 9:53 PM, Manoj Samel manojsamelt...@gmail.com wrote: Thanks Aaron, this is useful ! - Manoj On Mon, Apr 14, 2014 at 8:12 PM, Aaron Davidson ilike...@gmail.com wrote: Launching drivers inside the cluster was a feature added in 0.9

Re: How do I access the SPARK SQL

2014-04-24 Thread Aaron Davidson
Looks like you're depending on Spark 0.9.1, which doesn't have Spark SQL. Assuming you've downloaded Spark, just run 'mvn install' to publish Spark locally, and depend on Spark version 1.0.0-SNAPSHOT. On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru diplomaticg...@gmail.com wrote: It's a simple

Re: parallelize for a large Seq is extreamly slow.

2014-04-26 Thread Aaron Davidson
Could it be that the default number of partitions from parallelize() is too small in this case? Try something like spark.parallelize(word_mapping.value.toSeq, 60). (Given your setup, it should already be 30, but perhaps that's not the case in YARN mode...) On Fri, Apr 25, 2014 at

Re: pySpark memory usage

2014-05-04 Thread Aaron Davidson
I'd just like to update this thread by pointing to the PR based on our initial design: https://github.com/apache/spark/pull/640 This solution is a little more general and avoids catching IOException altogether. Long live exception propagation! On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell

Re: Easy one

2014-05-06 Thread Aaron Davidson
If you're using standalone mode, you need to make sure the Spark Workers know about the extra memory. This can be configured in spark-env.sh on the workers as export SPARK_WORKER_MEMORY=4g On Tue, May 6, 2014 at 5:29 PM, Ian Ferreira ianferre...@hotmail.com wrote: Hi there, Why can’t I seem

Re: Test

2014-05-11 Thread Aaron Davidson
I didn't get the original message, only the reply. Ruh-roh. On Sun, May 11, 2014 at 8:09 AM, Azuryy azury...@gmail.com wrote: Got. But it doesn't indicate all can receive this test. Mail list is unstable recently. Sent from my iPhone5s On May 10, 2014, at 13:31, Matei Zaharia

Re: How to read a multipart s3 file?

2014-05-12 Thread Aaron Davidson
One way to ensure Spark writes more partitions is by using RDD#repartition() to make each partition smaller. One Spark partition always corresponds to one file in the underlying store, and it's usually a good idea to have each partition size range somewhere between 64 MB to 256 MB. Too few
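
A sketch of controlling the output file count via repartition(), as suggested above (the bucket name and part count are placeholders; size numParts roughly as total bytes divided by the target part size):

    import org.apache.spark.rdd.RDD

    // Write numParts output files so that each part lands roughly in the
    // 64 MB - 256 MB range mentioned above.
    def writeInParts(rdd: RDD[String], numParts: Int): Unit =
      rdd.repartition(numParts).saveAsTextFile("s3n://my-bucket/output")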

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Aaron Davidson
Out of curiosity, do you have a library in mind that would make it easy to set up a BitTorrent network and distribute files in an rsync (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with this space, but we do want to minimize the complexity of our standard ec2 launch scripts to

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Aaron Davidson
, 2014 at 11:32 PM, Aaron Davidson ilike...@gmail.comwrote: One issue with using Spark itself is that this rsync is required to get Spark to work... Also note that a similar strategy is used for *updating* the spark cluster on ec2, where the diff aspect is much more important, as you might

Re: life if an executor

2014-05-20 Thread Aaron Davidson
One issue is that new jars can be added during the lifetime of a SparkContext, which can mean after executors are already started. Off-heap storage is always serialized, correct. On Tue, May 20, 2014 at 6:48 AM, Koert Kuipers ko...@tresata.com wrote: just for my clarification: off heap cannot

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
I'd just like to point out that, along with Matei, I have not seen workers drop even under the most exotic job failures. We're running pretty close to master, though; perhaps it is related to an uncaught exception in the Worker from a prior version of Spark. On Tue, May 20, 2014 at 11:36 AM,

Re: Spark stalling during shuffle (maybe a memory issue)

2014-05-20 Thread Aaron Davidson
: spark.storage.memoryFraction spark.shuffle.memoryFraction ?? On Tue, May 20, 2014 at 12:09 AM, Aaron Davidson [hidden email] wrote: This is very likely because the serialized map output locations buffer exceeds the Akka frame size. Please try setting

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Aaron Davidson
Unfortunately, those errors are actually due to an Executor that exited, such that the connection between the Worker and Executor failed. This is not a fatal issue, unless there are analogous messages from the Worker to the Master (which should be present, if they exist, at around the same point

Re: how to set task number?

2014-05-25 Thread Aaron Davidson
How many partitions are in your input data set? A possibility is that your input data has 10 unsplittable files, so you end up with 10 partitions. You could improve this by using RDD#repartition(). Note that mapPartitionsWithIndex is sort of the main processing loop for many Spark functions. It

Re: how to set task number?

2014-05-25 Thread Aaron Davidson
() in shark CLI; if shark support SET mapred.max.split.size to control file size ? if yes, after i create table, i can control file num, then I can control task number. if not , do anyone know other way to control task number in shark CLI? 2014-05-26 9:36 GMT+08:00 Aaron Davidson ilike

Re: how to set task number?

2014-05-25 Thread Aaron Davidson
=6400, it does not work, too. is there other way to control task number in shark CLI ? 2014-05-26 10:38 GMT+08:00 Aaron Davidson ilike...@gmail.com: You can try setting mapred.map.tasks to get Hive to do the right thing. On Sun, May 25, 2014 at 7:27 PM, qingyang li liqingyang1...@gmail.com wrote

Re: Fails: Spark sbt/sbt publish local

2014-05-25 Thread Aaron Davidson
I suppose you actually ran publish-local and not publish local like your example showed. That being the case, could you show the compile error that occurs? It could be related to the hadoop version. On Sun, May 25, 2014 at 7:51 PM, ABHISHEK abhi...@gmail.com wrote: Hi, I'm trying to install

Re: Fails: Spark sbt/sbt publish local

2014-05-25 Thread Aaron Davidson
-configuration) Repository for publishing is not specified. [error] Total time: 58 s, completed May 25, 2014 8:20:46 PM On Mon, May 26, 2014 at 8:46 AM, Aaron Davidson ilike...@gmail.com wrote: I suppose you actually ran publish-local and not publish local like your example showed. That being

Re: Akka disassociation on Java SE Embedded

2014-05-27 Thread Aaron Davidson
Sorry, to clarify: Spark *does* effectively turn Akka's failure detector off. On Tue, May 27, 2014 at 10:47 AM, Aaron Davidson ilike...@gmail.com wrote: Spark should effectively turn Akka's failure detector off, because we historically had problems with GCs and other issues causing

Re: Akka disassociation on Java SE Embedded

2014-05-27 Thread Aaron Davidson
Spark should effectively turn Akka's failure detector off, because we historically had problems with GCs and other issues causing disassociations. The only thing that should cause these messages nowadays is if the TCP connection (which Akka sustains between Actor Systems on different machines)

Re: access hdfs file name in map()

2014-05-29 Thread Aaron Davidson
Currently there is not a way to do this using textFile(). However, you could pretty straightforwardly define your own subclass of HadoopRDD [1] in order to get access to this information (likely using mapPartitionsWithIndex to look up the InputSplit for a particular partition). Note that

Re: Spark 1.0.0 - Java 8

2014-05-30 Thread Aaron Davidson
Also, the Spark examples can run out of the box on a single machine, as well as a cluster. See the Master URLs heading here: http://spark.apache.org/docs/latest/submitting-applications.html#master-urls On Fri, May 30, 2014 at 9:24 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: With

Re: Using Spark on Data size larger than Memory size

2014-06-01 Thread Aaron Davidson
There is no fundamental issue if you're running on data that is larger than cluster memory size. Many operations can stream data through, and thus memory usage is independent of input data size. Certain operations require an entire *partition* (not dataset) to fit in memory, but there are not many

Re: Akka disassociation on Java SE Embedded

2014-06-01 Thread Aaron Davidson
-- Chanwit Kaewkasi linkedin.com/in/chanwit On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson ilike...@gmail.com wrote: Spark should effectively turn Akka's failure detector off, because we historically had problems with GCs and other issues causing disassociations. The only thing that should

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
straightforward ways of producing assembly jars. On Sat, May 31, 2014 at 11:23 PM, Russell Jurney russell.jur...@gmail.com wrote: Thanks for the fast reply. I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone mode. On Saturday, May 31, 2014, Aaron Davidson ilike

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
require that I know where $SPARK_HOME is. However, I have no idea. Any idea where that might be? On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson ilike...@gmail.com wrote: Gotcha. The easiest way to get your dependencies to your Executors would probably be to construct your SparkContext

Re: hadoopRDD stalls reading entire directory

2014-06-01 Thread Aaron Davidson
. Then when I run rdd.first, I get this over and over: 14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson ilike

Re: Can anyone help me set memory for standalone cluster?

2014-06-01 Thread Aaron Davidson
In addition to setting the Standalone memory, you'll also need to tell your SparkContext to claim the extra resources. Set spark.executor.memory to 1600m as well. This should be a system property set in SPARK_JAVA_OPTS in conf/spark-env.sh (in 0.9.1, which you appear to be using) -- e.g., export

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Aaron Davidson
+1 please re-add this feature On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com wrote: Thanks for pointing that out. I've assigned you to SPARK-1677 (I think I accidentally assigned myself way back when I created it). This should be an easy fix. On Mon, Jun 2, 2014 at

Re: hadoopRDD stalls reading entire directory

2014-06-02 Thread Aaron Davidson
/parcels/CDH/lib/hadoop/lib/native -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://hivecluster2:7077 On Sun, Jun 1, 2014 at 7:41 PM, Aaron Davidson ilike...@gmail.com wrote: Sounds like you have two shells running, and the first one is taking all your resources. Do

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Aaron Davidson
- files may be left over from previous saves, which is dangerous. Is this correct? On Mon, Jun 2, 2014 at 4:17 PM, Aaron Davidson ilike...@gmail.com wrote: +1 please re-add this feature On Mon, Jun 2, 2014 at 12:44 PM, Patrick Wendell pwend...@gmail.com wrote: Thanks for pointing

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-09 Thread Aaron Davidson
It is not a very good idea to save the results in the exact same place as the data. Any failures during the job could lead to corrupted data, because recomputing the lost partitions would involve reading the original (now-nonexistent) data. As such, the only safe way to do this would be to do as
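
One way to follow this advice is to write to a temporary location and swap it in only after the job succeeds; a rough sketch (paths and names are illustrative, and not necessarily the exact procedure the message goes on to describe):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def safeOverwrite(sc: SparkContext, rdd: RDD[String], dest: String): Unit = {
      val tmp = dest + "_tmp"
      rdd.saveAsTextFile(tmp)                    // a failure here leaves the original untouched
      val fs = FileSystem.get(sc.hadoopConfiguration)
      fs.delete(new Path(dest), true)            // drop the old output
      fs.rename(new Path(tmp), new Path(dest))   // promote the new output
    }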

Re: How to enable fault-tolerance?

2014-06-09 Thread Aaron Davidson
Looks like your problem is local mode: https://github.com/apache/spark/blob/640f9a0efefd42cff86aecd4878a3a57f5ae85fa/core/src/main/scala/org/apache/spark/SparkContext.scala#L1430 For some reason, someone decided not to do retries when running in local mode. Not exactly sure why, feel free to

Re: How to specify executor memory in EC2 ?

2014-06-12 Thread Aaron Davidson
The scripts for Spark 1.0 actually specify this property in /root/spark/conf/spark-defaults.conf I didn't know that this would override the --executor-memory flag, though, that's pretty odd. On Thu, Jun 12, 2014 at 6:02 PM, Aliaksei Litouka aliaksei.lito...@gmail.com wrote: Yes, I am

Re: long GC pause during file.cache()

2014-06-15 Thread Aaron Davidson
Note also that Java does not work well with very large JVMs due to this exact issue. There are two commonly used workarounds: 1) Spawn multiple (smaller) executors on the same machine. This can be done by creating multiple Workers (via SPARK_WORKER_INSTANCES in standalone mode[1]). 2) Use Tachyon

Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Aaron Davidson
I remember having to do a similar thing in the spark docker scripts for testing purposes. Were you able to modify the /etc/hosts directly? I remember issues with that as docker apparently mounts it as part of its read-only filesystem. On Tue, Jun 17, 2014 at 4:36 PM, Mohit Jaggi

Re: Executors not utilized properly.

2014-06-17 Thread Aaron Davidson
repartition() is actually just an alias of coalesce(), but with the shuffle flag set to true. This shuffle is probably what you're seeing as taking longer, but it is required when you go from a smaller number of partitions to a larger one. When actually decreasing the number of partitions,
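
A short sketch of the distinction, using hypothetical helpers:

    import org.apache.spark.rdd.RDD

    // Decreasing partitions can skip the shuffle entirely...
    def shrink[T](rdd: RDD[T], n: Int): RDD[T] = rdd.coalesce(n, shuffle = false)

    // ...while increasing them requires one; repartition(n) is coalesce(n, shuffle = true).
    def grow[T](rdd: RDD[T], n: Int): RDD[T] = rdd.repartition(n)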

Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Aaron Davidson
Yup, alright, same solution then :) On Tue, Jun 17, 2014 at 7:39 PM, Mohit Jaggi mohitja...@gmail.com wrote: I used --privileged to start the container and then unmounted /etc/hosts. Then I created a new /etc/hosts file On Tue, Jun 17, 2014 at 4:58 PM, Aaron Davidson ilike...@gmail.com

Re: Shark vs Impala

2014-06-23 Thread Aaron Davidson
Note that regarding a long load time, data format means a whole lot in terms of query performance. If you load all your data into compressed, columnar Parquet files on local hardware, Spark SQL would also perform far, far better than it would reading from gzipped S3 files. You must also be careful

Re: DAGScheduler: Failed to run foreach

2014-06-23 Thread Aaron Davidson
Please note that this: for (sentence <- sourcerdd) { ... } is actually Scala syntactic sugar which is converted into sourcerdd.foreach { sentence => ... }. What this means is that this will actually run on the cluster, which is probably not what you want if you're trying to print them. Try
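
To print on the driver instead, bring the data (or a sample of it) back first; a sketch using the sourcerdd name from the message:

    sourcerdd.take(10).foreach(println)    // print the first 10 elements on the driver
    sourcerdd.collect().foreach(println)   // print everything -- only if it fits in driver memory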

Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Aaron Davidson
Is it possible you have blank lines in your input? Not that this should be an error condition, but it may be what's causing it. On Wed, Jun 25, 2014 at 11:57 AM, durin m...@simon-schaefer.net wrote: Hi Zongheng Yang, thanks for your response. Reading your answer, I did some more tests and

Re: Changing log level of spark

2014-06-25 Thread Aaron Davidson
If you're using the spark-ec2 scripts, you may have to change /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that is added to the classpath before Spark's own conf. On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp wrote: I have a log4j.xml in

Re: Improving Spark multithreaded performance?

2014-06-26 Thread Aaron Davidson
I don't have specific solutions for you, but the general things to try are: - Decrease task size by broadcasting any non-trivial objects. - Increase duration of tasks by making them less fine-grained. How many tasks are you sending? I've seen in the past something like 25 seconds for ~10k total

Re: Interconnect benchmarking

2014-06-27 Thread Aaron Davidson
A simple throughput test is also repartition()ing a large RDD. This also stresses the disks, though, so you might try to mount your spark temporary directory as a ramfs. On Fri, Jun 27, 2014 at 5:57 PM, danilopds danilob...@gmail.com wrote: Hi, According with the research paper below of

Re: Serialization of objects

2014-07-01 Thread Aaron Davidson
If you want to stick with Java serialization and need to serialize a non-Serializable object, your best choices are probably to either subclass it with a Serializable one or wrap it in a class of your own which implements its own writeObject/readObject methods (see here:
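
A sketch of the wrapper approach with a hypothetical non-serializable class (all names here are illustrative):

    import java.io.{ObjectInputStream, ObjectOutputStream}

    // Stand-in for a third-party class that is not Serializable.
    class LegacyParser(val pattern: String) {
      def parse(line: String): Array[String] = line.split(pattern)
    }

    // Serializable wrapper: the parser itself is transient and is rebuilt
    // from its constructor argument when the object is deserialized.
    class SerializableParser(pattern: String) extends Serializable {
      @transient private var parser = new LegacyParser(pattern)

      def parse(line: String): Array[String] = parser.parse(line)

      private def writeObject(out: ObjectOutputStream): Unit = out.defaultWriteObject()

      private def readObject(in: ObjectInputStream): Unit = {
        in.defaultReadObject()
        parser = new LegacyParser(pattern)   // recreate the transient field
      }
    }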

Re: Failed to launch Worker

2014-07-01 Thread Aaron Davidson
Where are you running the spark-class version? Hopefully also on the workers. If you're trying to centrally start/stop all workers, you can add a slaves file to the spark conf/ directory which is just a list of your hosts, one per line. Then you can just use ./sbin/start-slaves.sh to start the

Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-07-01 Thread Aaron Davidson
In your spark-env.sh, do you happen to set SPARK_PUBLIC_DNS or something of that kind? This error suggests the worker is trying to bind a server on the master's IP, which clearly doesn't make sense. On Mon, Jun 30, 2014 at 11:59 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I did

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
There is a difference from actual GC overhead, which can be reduced by reusing objects, versus this error, which actually means you ran out of memory. This error can probably be relieved by increasing your executor heap size, unless your data is corrupt and it is allocating huge arrays, or you are

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
and they all compete for running GCs which slow things down and trigger the error you saw. By reducing the number of cores, there are more CPU resources available to a task so the GC could finish before the error gets thrown. HTH, Jerry On Tue, Jul 8, 2014 at 1:35 PM, Aaron Davidson ilike

Re: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Aaron Davidson
Hmm, looks like the Executor is trying to connect to the driver on localhost, from this line: 14/07/08 11:07:13 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@localhost:39701/user/CoarseGrainedScheduler What is your setup? Standalone mode with 4 separate machines? Are

Re: Comparative study

2014-07-08 Thread Aaron Davidson
Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. +1 @Reynold Spark can handle big big data. There are known issues with informing the user about what went wrong

Re: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Aaron Davidson
You actually should avoid setting SPARK_PUBLIC_DNS unless necessary, I thought you might have preemptively done so. I think the issue is actually related to your network configuration, as Spark probably failed to find your driver's ip address. Do you see a warning on the driver that looks

Re: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Aaron Davidson
By the way, you can run the sc.getConf.get("spark.driver.host") thing inside spark-shell, whether or not the Executors actually start up successfully. On Tue, Jul 8, 2014 at 8:23 PM, Aaron Davidson ilike...@gmail.com wrote: You actually should avoid setting SPARK_PUBLIC_DNS unless necessary, I

Re: Confused by groupByKey() and the default partitioner

2014-07-12 Thread Aaron Davidson
Yes, groupByKey() does partition by the hash of the key unless you specify a custom Partitioner. (1) If you were to use groupByKey() when the data was already partitioned correctly, the data would indeed not be shuffled. Here is the associated code, you'll see that it simply checks that the
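
A sketch of the pre-partitioned case, assuming an existing SparkContext sc (example data and partition count are placeholders):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pre-1.3 implicit that provides PairRDDFunctions

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val partitioner = new HashPartitioner(8)

    // partitionBy shuffles once; groupByKey with the same partitioner then
    // finds the data already correctly laid out and does not shuffle again.
    val partitioned = pairs.partitionBy(partitioner).cache()
    val grouped = partitioned.groupByKey(partitioner)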

Re: KMeans for large training data

2014-07-12 Thread Aaron Davidson
The netlib.BLAS: Failed to load implementation warning only means that the BLAS implementation may be slower than using a native one. The reason why it only shows up at the end is that the library is only used for the finalization step of the KMeans algorithm, so your job should've been wrapping

Re: Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Aaron Davidson
If you don't really care about the batchedDegree, but rather just want to do operations over some set of elements rather than one at a time, then just use mapPartitions(). Otherwise, if you really do want certain-sized batches and are able to relax the constraints slightly, one option is to construct
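
A minimal sketch of the mapPartitions route, as a hypothetical helper:

    import org.apache.spark.rdd.RDD

    // Process elements in batches within each partition; exact batch sizes are
    // not guaranteed at partition boundaries.
    def inBatches[T](rdd: RDD[T], batchSize: Int): RDD[Seq[T]] =
      rdd.mapPartitions(_.grouped(batchSize))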

Re: pyspark sc.parallelize running OOM with smallish data

2014-07-12 Thread Aaron Davidson
I think this is probably dying on the driver itself, as you are probably materializing the whole dataset inside your python driver. How large is spark_data_array compared to your driver memory? On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi mohitja...@gmail.com wrote: I put the same dataset into

Re: Putting block rdd failed when running example svm on large data

2014-07-12 Thread Aaron Davidson
Also check the web ui for that. Each iteration will have one or more stages associated with it in the driver web ui. On Sat, Jul 12, 2014 at 6:47 PM, crater cq...@ucmerced.edu wrote: Hi Xiangrui, Thanks for the information. Also, it is possible to figure out the execution time per

Re: Large Task Size?

2014-07-12 Thread Aaron Davidson
I also did a quick glance through the code and couldn't find anything worrying that should be included in the task closures. The only possibly unsanitary part is the Updater you pass in -- what is your Updater and is it possible it's dragging in a significant amount of extra state? On Sat, Jul

Re: Confused by groupByKey() and the default partitioner

2014-07-13 Thread Aaron Davidson
at 9:13 AM, Guanhua Yan gh...@lanl.gov wrote: Thanks, Aaron. I replaced groupByKey with reduceByKey along with some list concatenation operations, and found that the performance becomes even worse. So groupByKey is not that bad in my code. Best regards, - Guanhua From: Aaron Davidson ilike

Re: Large Task Size?

2014-07-15 Thread Aaron Davidson
for running union? Could that create larger task sizes? Kyle On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson ilike...@gmail.com wrote: I also did a quick glance through the code and couldn't find anything worrying that should be included in the task closures. The only possibly unsanitary part

Re: which kind of BlockId should I use?

2014-07-20 Thread Aaron Davidson
Hm, this is not a public API, but you should theoretically be able to use TestBlockId if you like. Internally, we just use the BlockId's natural hashing and equality to do lookups and puts, so it should work fine. However, since it is in no way public API, it may change even in maintenance

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread Aaron Davidson
What's the exception you're seeing? Is it an OOM? On Mon, Jul 21, 2014 at 11:20 AM, chutium teng@gmail.com wrote: Hi, unfortunately it is not so straightforward xxx_parquet.db is a folder of managed database created by hive/impala, so, every sub element in it is a table in

Re: What if there are large, read-only variables shared by all map functions?

2014-07-23 Thread Aaron Davidson
In particular, take a look at the TorrentBroadcast, which should be much more efficient than HttpBroadcast (which was the default in 1.0) for large files. If you find that TorrentBroadcast doesn't work for you, then another way to solve this problem is to place the data on all nodes' local disks,
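
A sketch of broadcasting a large read-only table, with the Torrent broadcast factory selected explicitly via spark.broadcast.factory (class name as of the 1.x line; the small literal map stands in for the large variable, and the other names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("broadcast-example")
      .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
    val sc = new SparkContext(conf)

    // Shipped to each executor once, instead of being captured in every task closure.
    val bigLookup = Map("a" -> 1, "b" -> 2)   // stands in for a large read-only table
    val lookupBc = sc.broadcast(bigLookup)

    val resolved = sc.textFile("hdfs:///path/to/keys")
      .map(k => k -> lookupBc.value.getOrElse(k, -1))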

Re: Configuring Spark Memory

2014-07-24 Thread Aaron Davidson
information here about executors but is ambiguous about whether there are single executors or multiple executors on each machine. This message from Aaron Davidson implies that the executor memory should be set to total available memory on the machine divided by the number of cores: *http

Re: Spilling in-memory... messages in log even with MEMORY_ONLY

2014-07-27 Thread Aaron Davidson
I see. There should not be a significant algorithmic difference between those two cases, as far as I can think, but there is a good bit of local-mode-only logic in Spark. One typical problem we see on large-heap, many-core JVMs, though, is much more time spent in garbage collection. I'm not sure

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
rdd.toLocalIterator will do almost what you want, but requires that each individual partition fits in memory (rather than each individual line). Hopefully that's sufficient, though. On Fri, Aug 1, 2014 at 1:38 AM, Andrei faithlessfri...@gmail.com wrote: Is there a way to get iterator from RDD?
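
For reference, a sketch of the Scala-side equivalent, assuming an existing SparkContext sc (the path is a placeholder):

    val rdd = sc.textFile("hdfs:///path/to/data")

    // Streams one partition at a time back to the driver, so only a single
    // partition (not the whole RDD, as with collect()) must fit in driver memory.
    for (line <- rdd.toLocalIterator) {
      println(line)
    }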

Re: Iterator over RDD in PySpark

2014-08-01 Thread Aaron Davidson
understand that returned byte array somehow corresponds to actual data, but how can I get it? On Fri, Aug 1, 2014 at 8:49 PM, Aaron Davidson ilike...@gmail.com wrote: rdd.toLocalIterator will do almost what you want, but requires that each individual partition fits in memory (rather than

Re: s3:// sequence file startup time

2014-08-17 Thread Aaron Davidson
The driver must initially compute the partitions and their preferred locations for each part of the file, which results in a serial getFileBlockLocations() on each part. However, I would expect this to take several seconds, not minutes, to perform on 1000 parts. Is your driver inside or outside of

Re: Spark: why need a masterLock when sending heartbeat to master

2014-08-17 Thread Aaron Davidson
Yes, good point, I believe the masterLock is now unnecessary altogether. The reason for its initial existence was that changeMaster() originally could be called out-of-band of the actor, and so we needed to make sure the master reference did not change out from under us. Now it appears that all

Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Aaron Davidson
This is likely due to a bug in shuffle file consolidation (which you have enabled) which was hopefully fixed in 1.1 with this patch: https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd Until 1.0.3 or 1.1 are released, the simplest solution is to disable
