-08:00 Ted Yu <yuzhih...@gmail.com>:
> bq. that solved some problems
>
> Is there any problem that was not solved by the tweak?
>
> Thanks
>
> On Thu, Mar 3, 2016 at 4:11 PM, Eugen Cepoi <cepoi.eu...@gmail.com> wrote:
>
>> You can limit the amount of mem
I had similar problems with multi part uploads. In my case the real error
was something else which was being masked by this issue
https://issues.apache.org/jira/browse/SPARK-6560. In the end this bad
digest exception was a side effect and not the original issue. For me it
was some library version
Do you have a large number of tasks? This can happen if you have a large
number of tasks and a small driver, or if you use accumulators of list-like
data structures.
2015-12-11 11:17 GMT-08:00 Zhan Zhang :
> I think you are fetching too many results to the driver.
Hey,
Is there some kind of "explain" feature implemented in mllib for the
algorithms based on tree ensembles?
Some method to which you would feed in a single feature vector and it would
return/print what features contributed to the decision or how much each
feature contributed "negatively" and
lassifier.scala#L213>
> to
> estimate the importance of each feature.
>
> 2015-10-28 18:29 GMT+08:00 Eugen Cepoi <cepoi.eu...@gmail.com>:
>
>> Hey,
>>
>> Is there some kind of "explain" feature implemented in mllib for the
>> algorithms ba
> the aws console and make sure the ports are accessible within the cluster.
>
> Thanks
> Best Regards
>
> On Thu, Oct 22, 2015 at 8:53 PM, Eugen Cepoi <cepoi.eu...@gmail.com>
> wrote:
>
>> Huh indeed this worked, thanks. Do you know why this happens, is that
>>
nks
> Best Regards
>
> On Mon, Oct 19, 2015 at 6:21 PM, Eugen Cepoi <cepoi.eu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN.
>> The job is reading data from Kinesis and the batch size is 30s (I used
>&
Hi,
I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN.
The job is reading data from Kinesis and the batch size is 30s (I used
the same value for the kinesis checkpointing).
In the executor logs I can see, every 5 seconds, a sequence of stacktraces
indicating that the block
Hey,
A quick update on other things that have been tested.
When looking at the compiled code of the spark-streaming-kinesis-asl jar
everything looks normal (there is a class that implements SyncMap and it is
used inside the receiver).
Starting a spark shell and using introspection to instantiate
this is
the issue, need to find a way to confirm that now...
2015-10-15 16:12 GMT+07:00 Eugen Cepoi <cepoi.eu...@gmail.com>:
> Hey,
>
> A quick update on other things that have been tested.
>
> When looking at the compiled code of the spark-streaming-kinesis-asl jar
>
*The thing is that foreach forces materialization of the RDD and it seems
to be executed on the driver program*
What makes you think that? No, foreach runs on the executors
(distributed), not on the driver.
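A minimal sketch of the distinction (assuming an existing SparkContext `sc`; everything here is illustrative, not from the thread):

```scala
// Sketch only: requires a running Spark application with a SparkContext `sc`.
val rdd = sc.parallelize(1 to 100)

// Runs the closure on the executors, in parallel; any println output
// ends up in the executor logs, not on the driver console.
rdd.foreach(x => println(x))

// Only this variant iterates on the driver, after collect() has pulled
// the whole RDD back into driver memory.
rdd.collect().foreach(x => println(x))
```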
2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues
alex.jose.rodrig...@gmail.com:
Hi
noticed much faster executions with map although I don't like the
map approach. I'll look at it with new eyes if foreach is the way to go.
[1] – https://spark.apache.org/docs/latest/programming-guide.html#actions
Thanks guys!
--
Alexandre Rodrigues
On Thu, Jul 2, 2015 at 5:37 PM, Eugen Cepoi
Comma-separated paths work only with Spark 1.4 and up
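A sketch of both options mentioned in this thread (paths are hypothetical; the comma-separated form is the one said to need Spark 1.4+):

```scala
// Sketch only: assumes a SparkContext `sc` and made-up input paths.
// Comma-separated list of paths:
val merged = sc.textFile("/data/2015-06-25,/data/2015-06-26")

// Globbing pattern matching several directories/files at once:
val globbed = sc.textFile("/data/2015-06-*/out*.avro")
```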
2015-06-26 18:56 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com:
You can comma separate them or use globbing patterns
2015-06-26 18:54 GMT+02:00 Ted Yu yuzhih...@gmail.com:
See this related thread:
http://search-hadoop.com/m
You can comma separate them or use globbing patterns
2015-06-26 18:54 GMT+02:00 Ted Yu yuzhih...@gmail.com:
See this related thread:
http://search-hadoop.com/m/q3RTtiYm8wgHego1
On Fri, Jun 26, 2015 at 9:43 AM, Bahubali Jain bahub...@gmail.com wrote:
Hi,
How do we read files from multiple
Are you using YARN?
If yes, increase the YARN memory overhead option. YARN is probably killing
your executors.
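As a sketch, the option being referred to is most likely the Spark-on-YARN overhead setting (property name as in Spark 1.x; the value, in megabytes, is only an example):

```scala
import org.apache.spark.SparkConf

// Sketch: reserve more off-heap headroom per executor so YARN does not
// kill containers that exceed their memory limit.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024")
```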
On 26 Jun 2015 at 20:43, XianXing Zhang xianxing.zh...@gmail.com wrote:
Do we have any update on this thread? Has anyone met and solved similar
problems before?
Any pointers will be
Hey,
I am not 100% sure, but from my understanding accumulators are per partition
(so per task, as it's the same) and are sent back to the driver with the task
result and merged. When a task needs to be run n times (multiple RDDs
depend on this one, some partition loss later in the chain, etc.) then
that the threads are being started at the beginning and will last until the
end of the JVM.
2015-06-18 15:32 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com:
2015-06-18 15:17 GMT+02:00 Guillaume Pitel guillaume.pi...@exensa.com:
I was thinking exactly the same. I'm going to try it. It doesn't
Yeah, that's the problem. There is probably some perfect number of partitions
that provides the best balance between partition size, memory, and merge
overhead. Though it's not an ideal solution :(
There could be another way but very hacky... for example if you store one
sketch in a singleton per
2015-06-18 15:17 GMT+02:00 Guillaume Pitel guillaume.pi...@exensa.com:
I was thinking exactly the same. I'm going to try it. It doesn't really
matter if I lose an executor, since its sketch will be lost, but then
re-executed somewhere else.
I mean that between the action that will update the
Cache is more general. ReduceByKey involves a shuffle step where the data
will be in memory and on disk (for what doesn't hold in memory). The
shuffle files will remain around until the end of the job. The blocks from
memory will be dropped if memory is needed for other things. This is an
It looks like it is a wrapper around
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark
So basically adding an option -v,1.4.0.a should work.
https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html
2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda
Or launch the spark-shell with --conf spark.kryo.registrator=foo.bar.MyClass
2015-06-11 14:30 GMT+02:00 Igor Berman igor.ber...@gmail.com:
Another option would be to close sc and open new context with your custom
configuration
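A sketch combining both suggestions, with foo.bar.MyClass as the placeholder registrator name from the thread (everything else is assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: stop the current context and start a fresh one whose
// configuration points Kryo at a custom registrator.
sc.stop()
val conf = new SparkConf()
  .setAppName("custom-kryo")
  .set("spark.kryo.registrator", "foo.bar.MyClass")
val newSc = new SparkContext(conf)
```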
On Jun 11, 2015 01:17, bhomass bhom...@gmail.com wrote:
you need
Hi
2015-06-04 15:29 GMT+02:00 James Aley james.a...@swiftkey.com:
Hi,
We have a load of Avro data coming into our data systems in the form of
relatively small files, which we're merging into larger Parquet files with
Spark. I've been following the docs and the approach I'm taking seemed
)
}
This is my method. Can you show me what I should modify to use
FileInputFormat? If you add the path there, what should you pass while
invoking newAPIHadoopFile?
On Wed, May 27, 2015 at 2:20 PM, Eugen Cepoi cepoi.eu...@gmail.com
wrote:
You can do that using FileInputFormat.addInputPath
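A hedged sketch of that suggestion: register each path on a Hadoop Job and hand its configuration to newAPIHadoopRDD, instead of passing a single path to newAPIHadoopFile. The paths are hypothetical, and TextInputFormat stands in for whatever input format the job actually uses:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

// Sketch: accumulate several input paths on one Job configuration.
val job = Job.getInstance()
FileInputFormat.addInputPath(job, new Path("/blah/1/blah"))
FileInputFormat.addInputPath(job, new Path("/blah/2/blah"))

// Then read them all as a single RDD (assumes a SparkContext `sc`).
val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```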
You can do that using FileInputFormat.addInputPath
2015-05-27 10:41 GMT+02:00 ayan guha guha.a...@gmail.com:
What about /blah/*/blah/out*.avro?
On 27 May 2015 18:08, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I am doing that now.
Is there no other way ?
On Wed, May 27, 2015 at 12:40 PM,
Yes that's it. If a partition is lost, to recompute it, some steps will
need to be re-executed. Perhaps the map function in which you update the
accumulator.
I think you can do it more safely in a transformation near the action,
where it is less likely that an error will occur (not always
using a plain TextOutputFormat, the multi
part upload works, this confirms that the lzo compression is probably the
problem... but it is not a solution :(
2015-04-13 18:46 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com:
Hi,
I am not sure my problem is relevant to spark, but perhaps someone else
Hi,
I am not sure my problem is relevant to spark, but perhaps someone else had
the same error. When I try to write files that need multipart upload to S3
from a job on EMR I always get this error:
com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you
specified did not match
situation. Was able to work
around by forcefully committing one of the rdds right before the union
into cache, and forcing that by executing take(1). Nothing else ever
helped.
Seems like yet-undiscovered 1.2.x thing.
On Tue, Mar 17, 2015 at 4:21 PM, Eugen Cepoi cepoi.eu...@gmail.com
wrote
+01:00 Eugen Cepoi cepoi.eu...@gmail.com:
Hum, increased it to 1024 but it doesn't help, still the same problem :(
2015-03-13 18:28 GMT+01:00 Eugen Cepoi cepoi.eu...@gmail.com:
The default one, 0.07 of the executor memory. I'll try increasing it and
post back the result.
Thanks
2015-03-13 18
Hum, increased it to 1024 but it doesn't help, still the same problem :(
2015-03-13 18:28 GMT+01:00 Eugen Cepoi cepoi.eu...@gmail.com:
The default one, 0.07 of the executor memory. I'll try increasing it and
post back the result.
Thanks
2015-03-13 18:09 GMT+01:00 Ted Yu yuzhih...@gmail.com
, Eugen Cepoi cepoi.eu...@gmail.com
wrote:
Hi,
I have a job that hangs after upgrading to spark 1.2.1 from 1.1.1.
Strange thing, the exact same code does work (after upgrade) in the
spark-shell. But this information might be misleading as it works with
1.1.1...
*The job takes as input two
Hi,
I have a job that hangs after upgrading to spark 1.2.1 from 1.1.1. Strange
thing, the exact same code does work (after upgrade) in the spark-shell.
But this information might be misleading as it works with 1.1.1...
*The job takes as input two data sets:*
- rdd A of +170gb (with less it is
Yes you can submit multiple actions from different threads to the same
SparkContext. It is safe.
Indeed what you want to achieve is quite common. Expose some operations
over a SparkContext through HTTP.
I have used spray for this and it just worked fine.
At bootstrap of your web app, start a
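The concurrent-actions part can be sketched like this (names are hypothetical; in a real spray endpoint each HTTP request would run one such Future):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch: two actions submitted concurrently to the same SparkContext.
// Spark schedules them as independent jobs; this is thread-safe.
val rdd = sc.parallelize(1 to 1000).cache()
val countF = Future { rdd.count() }
val sumF   = Future { rdd.map(_.toLong).reduce(_ + _) }
```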
Hi,
You can achieve it by running a spray service for example that has access
to the RDD in question. When starting the app you first build your RDD and
cache it. In your spray endpoints you will translate the HTTP requests to
operations on that RDD.
2014-08-17 17:27 GMT+02:00 Zhanfeng Huo
Do you have a list/array in your Avro record? If yes, this could cause the
problem. I experienced this kind of problem and solved it by providing
custom Kryo ser/de for Avro lists. Also be careful: Spark reuses records,
so if you just read and then don't copy/transform them you would end up
with
Yeah I agree with Koert, it would be the lightest solution. I have
used it quite successfully and it just works.
There is not much spark specifics here, you can follow this example
https://github.com/jacobus/s4 on how to build your spray service.
Then the easy solution would be to have a
On 20 Jun 2014 at 01:46, Shivani Rao raoshiv...@gmail.com wrote:
Hello Andrew,
I wish I could share the code, but for proprietary reasons I can't. But I
can give some idea of what I am trying to do. The job reads a file
and processes each of its lines. I am
17:15 GMT+02:00 Shivani Rao raoshiv...@gmail.com:
Hello Abhi, I did try that and it did not work
And Eugen, yes I am assembling the argonaut libraries in the fat jar. So
how did you overcome this problem?
Shivani
On Fri, Jun 20, 2014 at 1:59 AM, Eugen Cepoi cepoi.eu...@gmail.com
wrote
about ADD_JARS. In order to ensure
my spark_shell has all required jars, I added the jars to the $CLASSPATH
in the compute_classpath.sh script. Is there another way of doing it?
Shivani
On Fri, Jun 20, 2014 at 9:47 AM, Eugen Cepoi cepoi.eu...@gmail.com
wrote:
In my case it was due
by default. If you opened a JIRA for
that I'm sure someone would pick it up.
On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
Is it on purpose that when setting SPARK_CONF_DIR, spark-submit still
loads
the properties file from SPARK_HOME/conf/spark-defaults.conf?
IMO
Is it on purpose that when setting SPARK_CONF_DIR, spark-submit still loads
the properties file from SPARK_HOME/conf/spark-defaults.conf?
IMO it would be more natural to override what is defined in SPARK_HOME/conf
by SPARK_CONF_DIR when defined (and SPARK_CONF_DIR being overridden by
command line
2014-05-19 10:35 GMT+02:00 Laurent T laurent.thou...@ldmobile.net:
Hi Eugen,
Thanks for your help. I'm not familiar with the shade plugin and I was
wondering: does it replace the assembly plugin?
Nope, it doesn't replace it. It allows you to make fat jars and other nice
things such as
Laurent, the problem is that the reference.conf that is embedded in the akka
jars is being overridden by some other conf. This happens when multiple
files have the same name.
I am using Spark with maven. In order to build the fat jar I use the shade
plugin and it works pretty well. The trick here is to
Hi,
I have some strange behaviour when using textFile to read some data from
HDFS in spark 0.9.1.
I get UnknownHost exceptions, where the Hadoop client tries to resolve
dfs.nameservices and fails.
So far:
- this has been tested inside the shell
- the exact same code works with spark-0.8.1
-
is that HADOOP_CONF_DIR is not shared with the workers when set
only on the driver (it was not defined in spark-env)?
Also wouldn't it be more natural to create the conf on driver side and then
share it with the workers?
2014-05-09 10:51 GMT+02:00 Eugen Cepoi cepoi.eu...@gmail.com:
Hi,
I have some strange
It depends; personally I have the opposite opinion.
IMO expressing pipelines in a functional language feels natural, you just
have to get used to the language (Scala).
Testing Spark jobs is easy, whereas testing a Pig script is much harder and
not natural.
If you want a more high-level language
Depending on the size of the RDD, you could also do a collect + broadcast and
then compute the product in a map function over the other RDD. If this is
the same RDD you might also want to cache it. This pattern worked quite
well for me.
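A sketch of the collect + broadcast pattern (RDD names are hypothetical; it only works while the small side fits comfortably in memory):

```scala
// Sketch: pull the small RDD to the driver, broadcast it once per
// executor, then do a map-side lookup instead of a shuffle join.
val smallMap = smallRdd.collect().toMap
val bcast    = sc.broadcast(smallMap)
val joined   = bigRdd.map { case (k, v) =>
  (k, v, bcast.value.get(k))  // Option[...] when the key is missing
}
```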
On 25 Apr 2014 at 18:33, Alex Boisvert alex.boisv...@gmail.com
GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:
Thanks again Eugen! I don't get the point... why do you prefer to avoid Kryo
ser for closures? Is there any problem with that?
On Apr 17, 2014 11:10 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
You have two kind of ser : data and closures
...@mail.gmail.com%3E
.
On Fri, Apr 18, 2014 at 10:31 AM, Eugen Cepoi cepoi.eu...@gmail.comwrote:
Because it happens to reference something outside the closure's scope that
will reference some other objects (that you don't need) and so on,
resulting in serializing a lot of things with your task
wrong or
this is a limit of Spark?
On Apr 15, 2014 1:36 PM, Flavio Pompermaier pomperma...@okkam.it
wrote:
Ok thanks for the help!
Best,
Flavio
On Tue, Apr 15, 2014 at 12:43 AM, Eugen Cepoi cepoi.eu...@gmail.comwrote:
Nope, those operations are lazy, meaning it will create the RDDs
rather than the
partition results (which is the collection of points). So is there a way to
reduce the data at the granularity of partitions?
Thanks,
Yanzhe
On Wednesday, April 16, 2014 at 2:24 AM, Eugen Cepoi wrote:
It depends on your algorithm but I guess that you probably should use
It depends on your algorithm but I guess that you probably should use
reduce (the code probably doesn't compile but it shows you the idea).
val result = data.reduce { case (left, right) =>
  skyline(left ++ right)
}
Or in the case you want to merge the result of a partition with another one
you
:
Thanks Eugen for the reply. Could you explain to me why I have the
problem? Why doesn't my serialization work?
On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
Hi,
as an easy workaround you can enable Kryo serialization
http://spark.apache.org/docs/latest/configuration.html
Eugen
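The linked page boils down to one setting; a sketch of the workaround:

```scala
import org.apache.spark.SparkConf

// Sketch: switch data serialization to Kryo, as described on the
// configuration page linked above.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```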
(collect, shuffle, maybe persist to disk - but I am not sure about
this one).
2014-04-15 0:34 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:
Ok, that's fair enough. But why do things work up to the collect? During map
and filter, are objects not serialized?
On Apr 15, 2014 12:31 AM, Eugen Cepoi