Re: an OOM while persisting as DISK_ONLY

2016-03-03 Thread Eugen Cepoi
-08:00 Ted Yu <yuzhih...@gmail.com>: > bq. that solved some problems > > Is there any problem that was not solved by the tweak ? > > Thanks > > On Thu, Mar 3, 2016 at 4:11 PM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote: > >> You can limit the amount of mem

Re: Bad Digest error while doing aws s3 put

2016-02-08 Thread Eugen Cepoi
I had similar problems with multipart uploads. In my case the real error was something else, which was being masked by this issue: https://issues.apache.org/jira/browse/SPARK-6560. In the end this Bad Digest exception was a side effect and not the original issue. For me it was some library version

Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Eugen Cepoi
Do you have a large number of tasks? This can happen if you have a large number of tasks and a driver with little memory, or if you use accumulators of list-like data structures. 2015-12-11 11:17 GMT-08:00 Zhan Zhang : > I think you are fetching too many results to the driver.
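A minimal sketch of the two usual mitigations (the config key is real Spark 1.x configuration; the app name, input path, and partition count are made-up examples):

  import org.apache.spark.{SparkConf, SparkContext}

  // Either raise the cap on serialized results sent back to the driver...
  val conf = new SparkConf()
    .setAppName("reduce-by-key-job")
    .set("spark.driver.maxResultSize", "2g") // default is 1g
  val sc = new SparkContext(conf)

  // ...or reduce the number of tasks, so fewer task results flow back.
  val data = sc.textFile("hdfs:///some/input")
  val fewerTasks = data.coalesce(500)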

Mllib explain feature for tree ensembles

2015-10-28 Thread Eugen Cepoi
Hey, Is there some kind of "explain" feature implemented in mllib for the algorithms based on tree ensembles? Some method to which you would feed in a single feature vector and it would return/print what features contributed to the decision or how much each feature contributed "negatively" and

Re: Mllib explain feature for tree ensembles

2015-10-28 Thread Eugen Cepoi
lassifier.scala#L213> > to > estimate the importance of each feature. > > 2015-10-28 18:29 GMT+08:00 Eugen Cepoi <cepoi.eu...@gmail.com>: > >> Hey, >> >> Is there some kind of "explain" feature implemented in mllib for the >> algorithms ba

Re: spark streaming failing to replicate blocks

2015-10-23 Thread Eugen Cepoi
> the aws console and make sure the ports are accessible within the cluster. > > Thanks > Best Regards > > On Thu, Oct 22, 2015 at 8:53 PM, Eugen Cepoi <cepoi.eu...@gmail.com> > wrote: > >> Huh indeed this worked, thanks. Do you know why this happens, is that >>

Re: spark streaming failing to replicate blocks

2015-10-22 Thread Eugen Cepoi
Thanks > Best Regards > > On Mon, Oct 19, 2015 at 6:21 PM, Eugen Cepoi <cepoi.eu...@gmail.com> > wrote: > >> Hi, >> >> I am running Spark Streaming 1.4.1 on EMR (AMI 3.9) over YARN. >> The job is reading data from Kinesis and the batch size is 30s (I used >&

spark streaming failing to replicate blocks

2015-10-19 Thread Eugen Cepoi
Hi, I am running Spark Streaming 1.4.1 on EMR (AMI 3.9) over YARN. The job is reading data from Kinesis and the batch size is 30s (I used the same value for the Kinesis checkpointing). In the executor logs I can see, every 5 seconds, a sequence of stack traces indicating that the block

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Eugen Cepoi
Hey, A quick update on other things that have been tested. When looking at the compiled code of the spark-streaming-kinesis-asl jar everything looks normal (there is a class that implements SyncMap and it is used inside the receiver). Starting a spark shell and using introspection to instantiate

Re: Spark 1.5 Streaming and Kinesis

2015-10-15 Thread Eugen Cepoi
this is the issue, need to find a way to confirm that now... 2015-10-15 16:12 GMT+07:00 Eugen Cepoi <cepoi.eu...@gmail.com>: > Hey, > > A quick update on other things that have been tested. > > When looking at the compiled code of the spark-streaming-kinesis-asl jar >

Re: map vs foreach for sending data to external system

2015-07-02 Thread Eugen Cepoi
*The thing is that foreach forces materialization of the RDD and it seems to be executed on the driver program* What makes you think that? No, foreach runs on the executors (distributed), not on the driver. 2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues alex.jose.rodrig...@gmail.com: Hi
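To illustrate the point, a hedged sketch of sending data from the executors with foreachPartition (MyClient, its endpoint, and pushToExternalSystem are invented for the example):

  import org.apache.spark.rdd.RDD

  // Hypothetical client for the external system.
  class MyClient(endpoint: String) {
    def send(record: String): Unit = println(s"$endpoint <- $record")
    def close(): Unit = ()
  }

  def pushToExternalSystem(rdd: RDD[String]): Unit =
    rdd.foreachPartition { records =>
      // This closure runs on the executors, not on the driver:
      // one client per partition, closed when the partition is done.
      val client = new MyClient("external-system:9000")
      try records.foreach(client.send)
      finally client.close()
    }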

Re: map vs foreach for sending data to external system

2015-07-02 Thread Eugen Cepoi
noticed much faster executions with map although I don't like the map approach. I'll look at it with new eyes if foreach is the way to go. [1] – https://spark.apache.org/docs/latest/programming-guide.html#actions Thanks guys! -- Alexandre Rodrigues On Thu, Jul 2, 2015 at 5:37 PM, Eugen Cepoi

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Eugen Cepoi
Comma-separated paths work only with Spark 1.4 and up. 2015-06-26 18:56 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com: You can comma-separate them or use globbing patterns 2015-06-26 18:54 GMT+02:00 Ted Yu yuzhih...@gmail.com: See this related thread: http://search-hadoop.com/m

Re: Multiple dir support : newApiHadoopFile

2015-06-26 Thread Eugen Cepoi
You can comma-separate them or use globbing patterns. 2015-06-26 18:54 GMT+02:00 Ted Yu yuzhih...@gmail.com: See this related thread: http://search-hadoop.com/m/q3RTtiYm8wgHego1 On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain bahub...@gmail.com wrote: Hi, How do we read files from multiple
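A minimal sketch of both options (sc is an existing SparkContext and the paths are made up; shown with textFile for brevity, and as noted in the follow-up above, the comma-separated form for the new Hadoop API needs Spark 1.4+):

  // Several directories in one call, comma-separated:
  val combined = sc.textFile("/data/2015-06-24,/data/2015-06-25,/data/2015-06-26")

  // Or a glob pattern expanded by the Hadoop FileSystem:
  val globbed = sc.textFile("/data/2015-06-*/part-*")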

Re: What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-06-26 Thread Eugen Cepoi
Are you using YARN? If yes, increase the YARN memory overhead option. YARN is probably killing your executors. On 26 Jun 2015 20:43, XianXing Zhang xianxing.zh...@gmail.com wrote: Do we have any update on this thread? Has anyone met and solved similar problems before? Any pointers will be
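For reference, a sketch of the setting being referred to (Spark 1.x on YARN; the value in MB is an example, not a recommendation):

  import org.apache.spark.SparkConf

  // Extra off-heap headroom (in MB) YARN grants each executor container.
  // If YARN kills executors for exceeding memory limits, raise this.
  val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")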

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
Hey, I am not 100% sure but from my understanding accumulators are per partition (so per task, as it's the same) and are sent back to the driver with the task result and merged. When a task needs to be run n times (multiple RDDs depend on this one, some partition loss later in the chain, etc.) then

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
that the threads are being started at the beginning and will last until the end of the JVM. 2015-06-18 15:32 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com: 2015-06-18 15:17 GMT+02:00 Guillaume Pitel guillaume.pi...@exensa.com: I was thinking exactly the same. I'm going to try it. It doesn't

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
Yeah, that's the problem. There is probably some optimal number of partitions that provides the best balance between partition size, memory and merge overhead. Though it's not an ideal solution :( There could be another way but very hacky... for example if you store one sketch in a singleton per

Re: Accumulators / Accumulables : thread-local, task-local, executor-local ?

2015-06-18 Thread Eugen Cepoi
2015-06-18 15:17 GMT+02:00 Guillaume Pitel guillaume.pi...@exensa.com: I was thinking exactly the same. I'm going to try it. It doesn't really matter if I lose an executor, since its sketch will be lost, but then re-executed somewhere else. I mean that between the action that will update the

Re: Intermediate stage will be cached automatically?

2015-06-17 Thread Eugen Cepoi
Cache is more general. reduceByKey involves a shuffle step where the data will be in memory and on disk (for what doesn't fit in memory). The shuffle files will remain around until the end of the job. The blocks from memory will be dropped if memory is needed for other things. This is an

Re: Spark on EMR

2015-06-17 Thread Eugen Cepoi
It looks like it is a wrapper around https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark So basically adding an option -v,1.4.0.a should work. https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html 2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda

Re: How to set KryoRegistrator class in spark-shell

2015-06-11 Thread Eugen Cepoi
Or launch the spark-shell with --conf spark.kryo.registrator=foo.bar.MyClass 2015-06-11 14:30 GMT+02:00 Igor Berman igor.ber...@gmail.com: Another option would be to close sc and open new context with your custom configuration On Jun 11, 2015 01:17, bhomass bhom...@gmail.com wrote: you need
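For context, a minimal registrator that such a --conf flag could point to (foo.bar is the placeholder package from the message; MyRegistrator and the registered type are illustrative, and spark.serializer must also be set to the Kryo serializer):

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.serializer.KryoRegistrator

  // Registered via, e.g.:
  //   spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  //               --conf spark.kryo.registrator=foo.bar.MyRegistrator
  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      kryo.register(classOf[Array[Byte]]) // replace with your own classes
    }
  }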

Re: Optimisation advice for Avro-Parquet merge job

2015-06-04 Thread Eugen Cepoi
Hi 2015-06-04 15:29 GMT+02:00 James Aley james.a...@swiftkey.com: Hi, We have a load of Avro data coming into our data systems in the form of relatively small files, which we're merging into larger Parquet files with Spark. I've been following the docs and the approach I'm taking seemed

Re: How to give multiple directories as input ?

2015-05-27 Thread Eugen Cepoi
) } This is my method, can you show me where I should modify it to use FileInputFormat? If you add the path there, what should you give while invoking newAPIHadoopFile? On Wed, May 27, 2015 at 2:20 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: You can do that using FileInputFormat.addInputPath

Re: How to give multiple directories as input ?

2015-05-27 Thread Eugen Cepoi
You can do that using FileInputFormat.addInputPath 2015-05-27 10:41 GMT+02:00 ayan guha guha.a...@gmail.com: What about /blah/*/blah/out*.avro? On 27 May 2015 18:08, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I am doing that now. Is there no other way ? On Wed, May 27, 2015 at 12:40 PM,
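A hedged sketch of the FileInputFormat route with the new Hadoop API (assumes Hadoop 2's Job.getInstance; TextInputFormat and the paths stand in for whatever format and directories are actually used):

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.Job
  import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

  // Register each directory explicitly, then hand the configuration to Spark.
  val job = Job.getInstance(sc.hadoopConfiguration)
  FileInputFormat.addInputPath(job, new Path("/data/dir1"))
  FileInputFormat.addInputPath(job, new Path("/data/dir2"))

  val rdd = sc.newAPIHadoopRDD(
    job.getConfiguration,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text])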

Re: Questions about Accumulators

2015-05-03 Thread Eugen Cepoi
Yes that's it. If a partition is lost, to recompute it, some steps will need to be re-executed. Perhaps the map function in which you update the accumulator. I think you can do it more safely in a transformation near the action, where it is less likely that an error will occur (not always
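A small sketch of placing the accumulator update late in the chain (Spark 1.x accumulator API; the paths and the trivial parsing step are placeholders):

  // Count records in the last transformation before the action, so retries of
  // earlier, failure-prone stages are less likely to double-count.
  val processedCount = sc.accumulator(0L, "processed records")

  val parsed = sc.textFile("hdfs:///input").map(_.trim) // earlier stages
  val counted = parsed.map { r => processedCount += 1L; r }
  counted.saveAsTextFile("hdfs:///output")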

Re: Multipart upload to S3 fails with Bad Digest Exceptions

2015-04-13 Thread Eugen Cepoi
using a plain TextOutputFormat, the multipart upload works; this confirms that the LZO compression is probably the problem... but it is not a solution :( 2015-04-13 18:46 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com: Hi, I am not sure my problem is relevant to Spark, but perhaps someone else

Multipart upload to S3 fails with Bad Digest Exceptions

2015-04-13 Thread Eugen Cepoi
Hi, I am not sure my problem is relevant to spark, but perhaps someone else had the same error. When I try to write files that need multipart upload to S3 from a job on EMR I always get this error: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified did not match

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-18 Thread Eugen Cepoi
situation. Was able to work around by forcefully committing one of the rdds right before the union into cache, and forcing that by executing take(1). Nothing else ever helped. Seems like yet-undiscovered 1.2.x thing. On Tue, Mar 17, 2015 at 4:21 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-17 Thread Eugen Cepoi
+01:00 Eugen Cepoi cepoi.eu...@gmail.com: Hum, increased it to 1024 but it doesn't help, still have the same problem :( 2015-03-13 18:28 GMT+01:00 Eugen Cepoi cepoi.eu...@gmail.com: The default one, 0.07 of the executor memory. I'll try increasing it and post back the result. Thanks 2015-03-13 18

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Eugen Cepoi
Hum, increased it to 1024 but it doesn't help, still have the same problem :( 2015-03-13 18:28 GMT+01:00 Eugen Cepoi cepoi.eu...@gmail.com: The default one, 0.07 of the executor memory. I'll try increasing it and post back the result. Thanks 2015-03-13 18:09 GMT+01:00 Ted Yu yuzhih...@gmail.com

Re: Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Eugen Cepoi
, Eugen Cepoi cepoi.eu...@gmail.com wrote: Hi, I have a job that hangs after upgrading to spark 1.2.1 from 1.1.1. Strange thing, the exact same code does work (after upgrade) in the spark-shell. But this information might be misleading as it works with 1.1.1... *The job takes as input two

Hanging tasks in spark 1.2.1 while working with 1.1.1

2015-03-13 Thread Eugen Cepoi
Hi, I have a job that hangs after upgrading to Spark 1.2.1 from 1.1.1. Strange thing: the exact same code does work (after the upgrade) in the spark-shell. But this information might be misleading, as it works with 1.1.1... *The job takes as input two data sets:* - RDD A of 170+ GB (with less it is

Re: How to design a long live spark application

2015-02-05 Thread Eugen Cepoi
Yes, you can submit multiple actions from different threads to the same SparkContext. It is safe. Indeed, what you want to achieve is quite common: expose some operations over a SparkContext through HTTP. I have used spray for this and it just worked fine. At bootstrap of your web app, start a

Re: application as a service

2014-08-17 Thread Eugen Cepoi
Hi, You can achieve it by running a spray service for example that has access to the RDD in question. When starting the app you first build your RDD and cache it. In your spray endpoints you will translate the HTTP requests to operations on that RDD. 2014-08-17 17:27 GMT+02:00 Zhanfeng Huo
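A heavily condensed sketch of that setup, assuming spray-routing's SimpleRoutingApp and a plain SparkContext (the app name, port, path, and dataset are invented; real endpoints would parse parameters and run richer operations on the cached RDD):

  import akka.actor.ActorSystem
  import org.apache.spark.{SparkConf, SparkContext}
  import spray.routing.SimpleRoutingApp

  object RddService extends App with SimpleRoutingApp {
    implicit val system = ActorSystem("rdd-service")

    // Build and cache the RDD once, at application startup.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-service"))
    val data = sc.textFile("hdfs:///some/dataset").cache()
    data.count() // materialize now so the first HTTP call does not pay for it

    // Each endpoint translates an HTTP request into operations on the cached RDD.
    startServer(interface = "0.0.0.0", port = 8080) {
      path("count") {
        get {
          complete(data.count().toString)
        }
      }
    }
  }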

Re: collect() on small group of Avro files causes plain NullPointerException

2014-07-22 Thread Eugen Cepoi
Do you have a list/array in your avro record? If yes this could cause the problem. I experienced this kind of problem and solved it by providing custom Kryo ser/de for avro lists. Also be careful: Spark reuses records, so if you just read and then don't copy/transform them you would end up with
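A hedged sketch of the "copy before you keep it" workaround using Avro's deepCopy (avroRdd and its record type are hypothetical; this sidesteps record reuse, it does not replace the custom Kryo ser/de mentioned above):

  import org.apache.avro.specific.SpecificData

  // Spark reuses the same Avro record instance while reading a partition,
  // so copy each record before holding on to it (e.g. before collect()).
  val copied = avroRdd.map { record =>
    SpecificData.get().deepCopy(record.getSchema, record)
  }
  val safeToCollect = copied.collect()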

Re: Using Spark as web app backend

2014-06-25 Thread Eugen Cepoi
Yeah I agree with Koert, it would be the lightest solution. I have used it quite successfully and it just works. There is not much Spark-specific here; you can follow this example https://github.com/jacobus/s4 on how to build your spray service. Then the easy solution would be to have a

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Eugen Cepoi
On 20 Jun 2014 01:46, Shivani Rao raoshiv...@gmail.com wrote: Hello Andrew, I wish I could share the code, but for proprietary reasons I can't. But I can give some idea of what I am trying to do. The job reads a file and processes each of its lines. I am

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Eugen Cepoi
17:15 GMT+02:00 Shivani Rao raoshiv...@gmail.com: Hello Abhi, I did try that and it did not work And Eugene, Yes I am assembling the argonaut libraries in the fat jar. So how did you overcome this problem? Shivani On Fri, Jun 20, 2014 at 1:59 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Eugen Cepoi
about ADD_JARS. In order to ensure my spark_shell has all required jars, I added the jars to the $CLASSPATH in the compute_classpath.sh script. is there another way of doing it? Shivani On Fri, Jun 20, 2014 at 9:47 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: In my case it was due

Re: spark 1.0 not using properties file from SPARK_CONF_DIR

2014-06-06 Thread Eugen Cepoi
by default. If you opened a JIRA for that I'm sure someone would pick it up. On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Is it on purpose that when setting SPARK_CONF_DIR, spark-submit still loads the properties file from SPARK_HOME/conf/spark-defaults.conf? IMO

spark 1.0 not using properties file from SPARK_CONF_DIR

2014-06-03 Thread Eugen Cepoi
Is it on purpose that when setting SPARK_CONF_DIR, spark-submit still loads the properties file from SPARK_HOME/conf/spark-defaults.conf? IMO it would be more natural to override what is defined in SPARK_HOME/conf by SPARK_CONF_DIR when defined (and SPARK_CONF_DIR being overridden by command line

Re: Packaging a spark job using maven

2014-05-19 Thread Eugen Cepoi
2014-05-19 10:35 GMT+02:00 Laurent T laurent.thou...@ldmobile.net: Hi Eugen, Thanks for your help. I'm not familiar with the shaded plugin and i was wondering: does it replace the assembly plugin ? Nope it doesn't replace it. It allows you to make fat jars and other nice things such as

Re: Packaging a spark job using maven

2014-05-16 Thread Eugen Cepoi
Laurent, the problem is that the reference.conf that is embedded in the Akka jars is being overridden by some other conf. This happens when multiple files have the same name. I am using Spark with Maven. In order to build the fat jar I use the shade plugin and it works pretty well. The trick here is to

spark 0.9.1 textFile hdfs unknown host exception

2014-05-16 Thread Eugen Cepoi
Hi, I have some strange behaviour when using textFile to read some data from HDFS in spark 0.9.1. I get UnknownHost exceptions, where the Hadoop client tries to resolve dfs.nameservices and fails. So far: - this has been tested inside the shell - the exact same code works with spark-0.8.1 -

Re: spark 0.9.1 textFile hdfs unknown host exception

2014-05-15 Thread Eugen Cepoi
is that HADOOP_CONF_DIR is not shared with the workers when set only on the driver (it was not defined in spark-env)? Also wouldn't it be more natural to create the conf on driver side and then share it with the workers? 2014-05-09 10:51 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com: Hi, I have some strange

Re: Pig on Spark

2014-04-25 Thread Eugen Cepoi
It depends; personally I have the opposite opinion. IMO expressing pipelines in a functional language feels natural, you just have to get used to the language (Scala). Testing Spark jobs is easy, whereas testing a Pig script is much harder and not natural. If you want a more high-level language

Re: what is the best way to do cartesian

2014-04-25 Thread Eugen Cepoi
Depending on the size of the RDD, you could also do a collect + broadcast and then compute the product in a map function over the other RDD. If this is the same RDD you might also want to cache it. This pattern worked quite well for me. On 25 Apr 2014 18:33, Alex Boisvert alex.boisv...@gmail.com wrote
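A minimal sketch of that pattern (smallRdd, largeRdd and the pairing logic are placeholders; the collected side must fit in driver and executor memory):

  // Instead of smallRdd.cartesian(largeRdd): collect the small side, broadcast it,
  // and build the pairs in a map over the large side.
  val smallSide = smallRdd.collect()          // must fit in driver memory
  val broadcastSmall = sc.broadcast(smallSide)

  val product = largeRdd.flatMap { x =>
    broadcastSmall.value.map(y => (x, y))
  }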

Re: RDD collect help

2014-04-18 Thread Eugen Cepoi
GMT+02:00 Flavio Pompermaier pomperma...@okkam.it: Thanks again Eugen! I don't get the point... why do you prefer to avoid Kryo ser for closures? Is there any problem with that? On Apr 17, 2014 11:10 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: You have two kinds of ser: data and closures

Re: RDD collect help

2014-04-18 Thread Eugen Cepoi
...@mail.gmail.com%3E . On Fri, Apr 18, 2014 at 10:31 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Because it happens to reference something outside the closure's scope, which references some other objects (that you don't need) and so on, resulting in a lot of things being serialized with your task

Re: RDD collect help

2014-04-17 Thread Eugen Cepoi
wrong or this is a limit of Spark? On Apr 15, 2014 1:36 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Ok thanks for the help! Best, Flavio On Tue, Apr 15, 2014 at 12:43 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Nope, those operations are lazy, meaning it will create the RDDs

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-16 Thread Eugen Cepoi
rather than the partition results (which is the collection of points). So is there a way to reduce the data at the granularity of partitions? Thanks, Yanzhe On Wednesday, April 16, 2014 at 2:24 AM, Eugen Cepoi wrote: It depends on your algorithm but I guess that you probably should use

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Eugen Cepoi
It depends on your algorithm but I guess that you probably should use reduce (the code probably doesn't compile but it shows you the idea). val result = data.reduce { case (left, right) => skyline(left ++ right) } Or in the case you want to merge the result of a partition with another one you
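Fleshing the same idea out with placeholder types, including the partition-granularity variant asked about in the reply quoted above (Point and skyline are stand-ins for the real data and merge logic):

  // Placeholders for the real data model and merge logic.
  case class Point(x: Double, y: Double)
  def skyline(points: Seq[Point]): Seq[Point] = points // stand-in: keep everything

  // data: RDD[Seq[Point]] -- each element is a collection of points (assumed).
  val result = data.reduce { case (left, right) => skyline(left ++ right) }

  // Partition-granularity variant: one skyline per partition, then merge the partials.
  val perPartition = data.mapPartitions(iter => Iterator(skyline(iter.toSeq.flatten)))
  val merged = perPartition.reduce { case (left, right) => skyline(left ++ right) }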

Re: RDD collect help

2014-04-14 Thread Eugen Cepoi
: Thanks Eugen for the reply. Could you explain to me why I have the problem? Why doesn't my serialization work? On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Hi, as an easy workaround you can enable Kryo serialization http://spark.apache.org/docs/latest/configuration.html Eugen

Re: RDD collect help

2014-04-14 Thread Eugen Cepoi
(collect, shuffle, maybe persist to disk - but I am not sure for this one). 2014-04-15 0:34 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it: Ok, that's fair enough. But why do things work up to the collect? During map and filter, objects are not serialized? On Apr 15, 2014 12:31 AM, Eugen Cepoi