Access several s3 buckets, with credentials containing "/"

2015-06-05 Thread Pierre B
Hi list!

My problem is quite simple.
I need to access several S3 buckets, using different credentials:
```
val c1 = sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket1/file.csv").count
val c2 = sc.textFile("s3n://[ACCESS_KEY_ID2:SECRET_ACCESS_KEY2]@bucket2/file.csv").count
val c3 = sc.textFile("s3n://[ACCESS_KEY_ID3:SECRET_ACCESS_KEY3]@bucket3/file.csv").count
...
```

One or several of those AWS credentials might contain "/" in the secret access
key.
This is a known problem and, from my research, the only ways to deal with
these "/" are:
1/ use environment variables to set the AWS credentials, then access the S3
buckets without specifying the credentials in the URL;
2/ set the Hadoop configuration to contain the credentials (see the sketch below).
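
For reference, here is roughly what I mean by option 2: a minimal sketch, assuming the s3n connector and its usual Hadoop property names:
```
// Sketch of option 2: put a single pair of credentials in the Hadoop configuration,
// so the "/" never has to appear in the URL (fs.s3n.* property names assumed).
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY_ID1")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "SECRET/ACCESS/KEY1")

val c1 = sc.textFile("s3n://bucket1/file.csv").count
```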

However, none of these solutions allows me to access different buckets with
different credentials.

Can anyone help me on this?

Thanks

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Access-several-s3-buckets-with-credentials-containing-tp23172.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.





[SQL] Self join with ArrayType columns problems

2015-01-26 Thread Pierre B
Using Spark 1.2.0, we are facing some weird behaviour when performing a self
join on a table with an ArrayType field (potential bug?).

I have set up a minimal non working example here: 
https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
  
In a nutshell, if the ArrayType column used for the pivot is created
manually in the StructType definition, everything works as expected.
However, if the ArrayType pivot column is obtained from a SQL query (be it by
using an "array" wrapper or a "collect_list" operator, for instance),
then the results are completely off.
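
To make the second case a bit more concrete, here is a hypothetical sketch of the kind of code involved (table, column and class names are invented; the actual minimal example is in the gist above):
```
// Hypothetical sketch only; see the gist for the real minimal example.
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
import hc._  // brings the implicit RDD -> SchemaRDD conversion into scope

case class Rec(id: Int, tag: String)
sc.parallelize(Seq(Rec(1, "a"), Rec(1, "b"), Rec(2, "c"))).registerTempTable("recs")

// ArrayType column produced by a SQL query (collect_list), as described above
hc.sql("SELECT id, collect_list(tag) AS tags FROM recs GROUP BY id").registerTempTable("grouped")

// Self join on the derived table: this is where the results go wrong for us
hc.sql("SELECT a.id, a.tags, b.tags FROM grouped a JOIN grouped b ON a.id = b.id").collect().foreach(println)
```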

Could anyone have a look? This really is a blocking issue for us.

Thanks! 

Cheers 

P.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Self-join-with-ArrayType-columns-problems-tp21364.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: MissingRequirementError with spark

2015-01-15 Thread Pierre B
I found this, which might be useful:

https://github.com/deanwampler/spark-workshop/blob/master/project/Build.scala

It seems that forking is needed.
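
Concretely, I believe that boils down to something like this in the build definition (a sketch in sbt 0.13-style syntax; the memory options are just an example):
```
// Sketch: run the tests in a forked JVM.
fork in Test := true

// Optionally give the forked JVM more headroom (example values only).
javaOptions in Test ++= Seq("-Xmx2G", "-XX:MaxPermSize=256M")
```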



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149p21153.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: ScalaReflectionException when using saveAsParquetFile in sbt

2015-01-15 Thread Pierre B
Same problem here...
Did you find a solution for this?

P.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ScalaReflectionException-when-using-saveAsParquetFile-in-sbt-tp21020p21150.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




MissingRequirementError with spark

2015-01-15 Thread Pierre B
After upgrading our project to Spark 1.2.0, we get this error when running
"sbt test":

scala.reflect.internal.MissingRequirementError: class
org.apache.spark.sql.catalyst.ScalaReflection 

The strange thing is that when running our test suites from IntelliJ,
everything runs smoothly...

Any idea what the difference might be?

Thanks,

P.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: [SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
Ok thanks Michael.

In general, what's the easiest way to figure out what's already implemented?

The exception I was getting was not really helpful here.

Also, is there a roadmap document somewhere?

Thanks!

P.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909p16942.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




[SQL] Is RANK function supposed to work in SparkSQL 1.1.0?

2014-10-21 Thread Pierre B
Hi!

The RANK function has been available in Hive since version 0.11.
When trying to use it in SparkSQL, I'm getting the following exception (full
stack trace below):
java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer
cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer
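
For context, the kind of query I'm trying to run looks roughly like this (table and column names are invented for illustration; a HiveContext is assumed):
```
// Illustration only: table/column names are made up.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql(
  "SELECT name, score, RANK() OVER (PARTITION BY dept ORDER BY score DESC) AS rnk FROM scores"
).collect()
```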

Is this function supposed to be available?

Thanks

P.

---


java.lang.ClassCastException: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer
cannot be cast to org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$AbstractAggregationBuffer
at org.apache.spark.sql.hive.HiveUdafFunction.<init>(hiveUdfs.scala:334)
at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:233)
at org.apache.spark.sql.hive.HiveGenericUdaf.newInstance(hiveUdfs.scala:207)
at org.apache.spark.sql.execution.Aggregate.org$apache$spark$sql$execution$Aggregate$$newAggregateBuffer(Aggregate.scala:97)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:129)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Is-RANK-function-supposed-to-work-in-SparkSQL-1-1-0-tp16909.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark SQL - custom aggregation function (UDAF)

2014-10-13 Thread Pierre B
Is it planned in the "near" future?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-custom-aggregation-function-UDAF-tp15784p16275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-09 Thread Pierre B
To add a bit on this one: if you look at RDD.scala in the Spark code, you'll see
that both the "parent" and "firstParent" methods are protected[spark].

I guess that, for good reasons which I must admit I don't completely understand,
you are not supposed to explore an RDD's lineage programmatically...
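
A quick sketch of what I mean:
```
// Sketch of the restriction described above: parent/firstParent are protected[spark],
// so the commented line does not compile from user code; toDebugString at least
// gives a textual view of the lineage.
val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)

println(rdd.toDebugString)   // prints the lineage as text
// rdd.firstParent[Int]      // does not compile: protected[spark]
```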

I had a use case myself and was disappointed to find out about this.

Could anyone enlighten me on the reasons for this restriction?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-look-at-RDD-s-lineage-Or-debug-a-fault-tolerance-error-tp15959p16041.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




[SQL] Set Parquet block size?

2014-10-09 Thread Pierre B
Hi there!

Is there a way to modify the default Parquet block size?

I didn't see any reference to ParquetOutputFormat.setBlockSize in the Spark code,
so I was wondering if there is a way to provide this option.

I'm asking because we are facing out-of-memory issues when writing Parquet
files.
The RDDs we are saving to Parquet have a fairly high number of columns (in
the thousands, around 3k for the moment).

The only way we can get rid of this for the moment is by doing a .coalesce
on the SchemaRDD before saving to Parquet, but as we get more columns, even
this approach stops working.
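
For what it's worth, this is the kind of knob I was hoping for; a sketch, assuming Parquet picks its block size up from the job's Hadoop configuration (I haven't verified that Spark's Parquet path actually honours this):
```
// Unverified sketch: parquet.block.size is the property ParquetOutputFormat reads.
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)  // e.g. 32 MB

// "schemaRDD" stands for the wide SchemaRDD mentioned above.
schemaRDD.saveAsParquetFile("hdfs://xxx.xxx.xxx.xxx/out.parquet")
```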

Any help is appreciated!

Thanks

Pierre 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Set-Parquet-block-size-tp16039.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: [Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Pierre B
Cool Thanks Michael!

Message sent from a mobile device - excuse typos and abbreviations

> On 8 Jul 2014, at 22:17, Michael Armbrust [via Apache Spark User List] wrote:
> 
>> On Tue, Jul 8, 2014 at 12:43 PM, Pierre B <[hidden email]> wrote:
>> 1/ Is there a way to convert a SchemaRDD (for instance loaded from a parquet
>> file) back to a RDD of a given case class?
> 
> There may be someday, but doing so will either require a lot of reflection or 
> a bunch of macro magic.  So while I think this would be cool, it will 
> probably be a while before we can implement it, and it'll likely be 
> experimental.
>  
>> 2/ Even better, is there a way to get the schema information from a
>> SchemaRDD ? I am trying to figure out how to properly get the various fields
>> of the Rows of a SchemaRDD. Knowing the schema (in the form of a Map?), I
>> guess I could nicely use getInt, getString, ..., on each row.
> 
> We are actively working on this (SPARK-2179).  Hopefully there will be a PR 
> soon, and we are targeting the 1.1 release.
> 
> 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-tp9071p9090.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

[Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Pierre B
Hi there!

1/ Is there a way to convert a SchemaRDD (for instance loaded from a Parquet
file) back to an RDD of a given case class?

2/ Even better, is there a way to get the schema information from a
SchemaRDD? I am trying to figure out how to properly get the various fields
of the Rows of a SchemaRDD. Knowing the schema (in the form of a Map?), I
guess I could nicely use getInt, getString, ..., on each row (see the sketch below).
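
To make question 2 concrete, the kind of manual mapping I have in mind looks roughly like this (sketch: the case class, the column order and the path are made up, and a SQLContext named sqlContext is assumed):
```
// Sketch only: positional getters on each Row, with the column order assumed known in advance.
case class Person(id: Int, name: String)

val people = sqlContext.parquetFile("hdfs://xxx.xxx.xxx.xxx/people.parquet")
val typed = people.map(row => Person(row.getInt(0), row.getString(1)))
```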

Parquet is really appealing for our project, for compression, columnar
access and embedded meta-data, but it would make much more sense if the
schema was available when loading.

Is there any plan to make this accessible?

Thanks

Pierre
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-tp9071.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Pierre B
Hi Michaël,

Thanks for this. We could indeed do that.

But I guess the question is more about the change of behaviour from 0.9.1 to
1.0.0.
We never had to care about that in previous versions.

Does that mean we have to manually remove existing files, or is there a way
to automatically overwrite when using saveAsTextFile?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-tp6696p6700.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Pierre B
Hi all!

We've been using the sbt-pack sbt plugin
(https://github.com/xerial/sbt-pack) to build our standalone Spark
application for a while now. Until Spark 1.0.0, that worked nicely.

For those who don't know the sbt-pack plugin, it basically copies all the
dependency JARs from your local Ivy/Maven cache to your target folder
(in target/pack/lib), and creates launch scripts (in target/pack/bin) for
your application (notably putting all these JARs on the classpath).

Now, since Spark 1.0.0 was released, we are encountering a weird error where
running our project with "sbt run" is fine but running our app with the
launch scripts generated by sbt-pack fails.

After a (quite painful) investigation, it turns out some JARs are NOT copied
from the local ivy2 cache to the lib folder. I noticed that all the missing
JARs contain "shaded" in their file name (but not all JARs with such a name
are missing).
One of the missing JARs is explicitly listed in the Spark build definition
(SparkBuild.scala, line 350): ``mesos-0.18.1-shaded-protobuf.jar``.

This file is clearly present in my local ivy cache, but is not copied by
sbt-pack.

Is there an evident reason for that?

I don't know much about the shading mechanism, maybe I'm missing something
here?


Any help would be appreciated!

Cheers

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: SparkContext startup time out

2014-05-30 Thread Pierre B
I was annoyed by this as well.
It appears that just permuting the order of dependency inclusion solves this
problem:

first Spark, then your CDH Hadoop distro (see the example below).
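
In sbt terms, I mean something like this (the artifacts and versions below are just an example, assuming the Cloudera repository is in your resolvers):
```
// Example only: list Spark before the CDH Hadoop client in the build definition.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "0.9.1",           // Spark first
  "org.apache.hadoop"  % "hadoop-client" % "2.0.0-cdh4.6.0"   // then the CDH Hadoop distro
)
```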

HTH,

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-tp1753p6582.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark Summit 2014 (Hotel suggestions)

2014-05-27 Thread Pierre B
Hi everyone!

Any recommendations, anyone?


Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Summit-2014-Hotel-suggestions-tp5457p6424.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Pierre B
Thanks Philip,

I don’t want to go the JobLogger way (too hacky ;) )

In version 1.0, if I’m not mistaken, you can even do what I’m asking for, since 
they removed the “private” for TaskInfo and such and replaced it with the 
“@DeveloperApi” annotation.

I was looking for a simple way to do this in 0.9.1, but thanks anyway!

Pierre


On 23 May 2014, at 17:41, Philip Ogren [via Apache Spark User List] 
 wrote:

> Hi Pierre,
> 
> I asked a similar question on this list about 6 weeks ago.  Here is one 
> answer I got that is of particular note:
> 
> In the upcoming release of Spark 1.0 there will be a feature that provides 
> for exactly what you describe: capturing the information displayed on the UI 
> in JSON. More details will be provided in the documentation, but for now, 
> anything before 0.9.1 can only go through JobLogger.scala, which outputs 
> information in a somewhat arbitrary format and will be deprecated soon. If 
> you find this feature useful, you can test it out by building the master 
> branch of Spark yourself, following the instructions in 
> https://github.com/apache/spark/pull/42.
> 
> 
> 
> On 05/22/2014 08:51 AM, Pierre B wrote:
>> Is there a simple way to monitor the overall progress of an action using
>> SparkListener or anything else?
>> 
>> I see that one can name an RDD... Could that be used to determine which
>> action triggered a stage, ... ?
>> 
>> 
>> Thanks
>> 
>> Pierre
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> 
> 





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256p6327.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Pierre B
I’ve been looking at how this is implemented in the UI:
https://github.com/apache/spark/blob/branch-0.9/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala

1/ it’s easy to get the RDD name at the stage events level
2/ the tricky part is that, at the task level, we cannot link the tasks back to
their corresponding stage the way the UI does it, because TaskInfo is private (in fact
private[spark]):

val stageIdToTaskInfos =
  HashMap[Int, HashSet[(TaskInfo, Option[TaskMetrics], Option[ExceptionFailure])]]()

Tell me if I’m wrong, but I guess that’s the end of the story: there’s no way
to do that without doing a custom build of Spark…

HTH




Pierre Borckmans
Software team

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans






On 23 May 2014, at 16:40, Otávio Carvalho [via Apache Spark User List] 
 wrote:

> Mayur,
> 
> I'm interested on it as well. Can you send me?
> 
> Cheers,
> 
> 
> Otávio Carvalho.
> Undergrad. Student at Federal University of Rio Grande do Sul
> Porto Alegre, Brazil.
> 
> 
> 2014-05-23 11:00 GMT-03:00 Pierre Borckmans <[hidden email]>:
> That would be great, Mayur, thanks!
> 
> Anyhow, to be more specific, my question really was the following:
> 
> Is there any way to link events in the SparkListener to an action triggered 
> in your code?
> 
> Cheers
> 
> 
> 
> 
> Pierre Borckmans
> Software team
> 
> RealImpact Analytics | Brussels Office
> www.realimpactanalytics.com | [hidden email]
> 
> FR +32 485 91 87 31 | Skype pierre.borckmans
> 
> 
> 
> 
> 
> 
> On 23 May 2014, at 10:17, Mayur Rustagi <[hidden email]> wrote:
> 
>> We have an internal patched version of Spark webUI which exports application 
>> related data as Json. We use monitoring systems as well as alternate UI for 
>> that json data for our specific application. Found it much cleaner. Can 
>> provide 0.9.1 version.
>> Would submit as a pull request soon. 
>> 
>> 
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi
>> 
>> 
>> 
>> On Fri, May 23, 2014 at 10:57 AM, Chester <[hidden email]> wrote:
>> This is something we are interested as well. We are planning to investigate 
>> more on this. If someone has suggestions, we would love to hear.
>> 
>> Chester
>> 
>> Sent from my iPad
>> 
>> On May 22, 2014, at 8:02 AM, Pierre B <[hidden email]> wrote:
>> 
>>> Hi Andy!
>>> 
>>> Yes Spark UI provides a lot of interesting informations for debugging 
>>> purposes.
>>> 
>>> Here I’m trying to integrate a simple progress monitoring in my app ui.
>>> 
>>> I’m typically running a few “jobs” (or rather actions), and I’d like to be 
>>> able to display the progress of each of those in my ui.
>>> 
>>> I don’t really see how i could do that using SparkListener for the moment …
>>> 
>>> Thanks for your help!
>>> 
>>> Cheers!
>>> 
>>> 
>>> 
>>> 
>>> Pierre Borckmans
>>> Software team
>>> 
>>> RealImpact Analytics | Brussels Office
>>> www.realimpactanalytics.com | [hidden email]
>>> 
>>> FR +32 485 91 87 31 | Skype pierre.borckmans
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] 
>>> <[hidden email]> wrote:
>>> 
>>>> SparkListener offers good stuffs.
>>>> But I also completed it with another metrics stuffs on my own that use 
>>>> Akka to aggregate metrics from anywhere I'd like to collect them (without 
>>>> any deps on ganglia yet on Codahale).
>>>> However, this was useful to gather some custom metrics (from within the 
>>>> tasks then) not really to collect overall monitoring information about the 
>>>> spark thingies themselves.
>>>> For that Spark UI offers already a pretty good insight no?
>>>> 
>>>> Cheers,
>>>> 
>>>> aℕdy ℙetrella
>>>> about.me/noootsab
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, May 22, 2014 at 4:51 PM, Pierre B <[hidden email]> wrote:
>>>> Is ther

Re: Use SparkListener to get overall progress of an action

2014-05-22 Thread Pierre B
Hi Andy!

Yes, the Spark UI provides a lot of interesting information for debugging purposes.

Here I’m trying to integrate a simple progress monitoring in my app ui.

I’m typically running a few “jobs” (or rather actions), and I’d like to be able
to display the progress of each of those in my UI.

I don’t really see how I could do that using SparkListener for the moment…

Thanks for your help!

Cheers!




Pierre Borckmans
Software team

RealImpact Analytics | Brussels Office
www.realimpactanalytics.com | pierre.borckm...@realimpactanalytics.com

FR +32 485 91 87 31 | Skype pierre.borckmans






On 22 May 2014, at 16:58, andy petrella [via Apache Spark User List] 
 wrote:

> SparkListener offers good stuffs.
> But I also completed it with another metrics stuffs on my own that use Akka 
> to aggregate metrics from anywhere I'd like to collect them (without any deps 
> on ganglia yet on Codahale).
> However, this was useful to gather some custom metrics (from within the tasks 
> then) not really to collect overall monitoring information about the spark 
> thingies themselves.
> For that Spark UI offers already a pretty good insight no?
> 
> Cheers,
> 
> aℕdy ℙetrella
> about.me/noootsab
> 
> 
> 
> 
> On Thu, May 22, 2014 at 4:51 PM, Pierre B <[hidden email]> wrote:
> Is there a simple way to monitor the overall progress of an action using
> SparkListener or anything else?
> 
> I see that one can name an RDD... Could that be used to determine which
> action triggered a stage, ... ?
> 
> 
> Thanks
> 
> Pierre
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> 
> 





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256p6259.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Use SparkListener to get overall progress of an action

2014-05-22 Thread Pierre B
Is there a simple way to monitor the overall progress of an action using
SparkListener or anything else?

I see that one can name an RDD... Could that be used to determine which
action triggered a stage, etc.?
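
To make it a bit more concrete, the kind of hook I have in mind is roughly this (a sketch against the 1.0-style listener API; event class names differ slightly in 0.9.x, and counting completed stages is only a crude notion of progress):
```
import org.apache.spark.scheduler._

// Rough sketch: count completed stages as a crude measure of overall progress.
class ProgressListener extends SparkListener {
  @volatile private var completedStages = 0

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    completedStages += 1
    println(s"Stages completed so far: $completedStages")
  }
}

sc.addSparkListener(new ProgressListener)
```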


Thanks

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Nested method in a class: Task not serializable?

2014-05-16 Thread Pierre B
Hi!

I understand the usual "Task not serializable" issue that arises when
accessing a field or a method that is out of scope of a closure.

To fix it, I usually define a local copy of these fields/methods, which
avoids the need to serialize the whole class:

class MyClass(val myField: Any) {
  def run() = {
    val f = sc.textFile("hdfs://xxx.xxx.xxx.xxx/file.csv")

    val myField = this.myField
    println(f.map( _ + myField ).count)
  }
}

===

Now, if I define a nested function in the run method, the task cannot be
serialized:

class MyClass() {
  def run() = {
    val f = sc.textFile("hdfs://xxx.xxx.xxx.xxx/file.csv")

    def mapFn(line: String) = line.split(";")

    println(f.map( mapFn( _ ) ).count)
  }
}

I don't understand since I thought "mapFn" would be in scope...
Even stranger, if I define mapFn to be a val instead of a def, then it
works:

class MyClass() {
  def run() = {
    val f = sc.textFile("hdfs://xxx.xxx.xxx.xxx/file.csv")

    val mapFn = (line: String) => line.split(";")

    println(f.map( mapFn( _ ) ).count)
  }
}

Is this related to the way Scala represents nested functions?

What's the recommended way to deal with this issue?
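
If I understand correctly, a nested def is compiled as a method of the enclosing class, so the closure passed to map ends up capturing "this", whereas a local val is a standalone function object. Assuming that's right, the workaround I'm considering is to hoist the helper into a top-level object (sketch):

```
// Sketch: move the helper into a standalone object, so the closure no longer
// drags the enclosing class into serialization.
object LineParsing {
  def mapFn(line: String): Array[String] = line.split(";")
}

class MyClass() {
  def run() = {
    val f = sc.textFile("hdfs://xxx.xxx.xxx.xxx/file.csv")
    println(f.map(LineParsing.mapFn(_)).count)
  }
}
```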

Thanks for your help,

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Nested-method-in-a-class-Task-not-serializable-tp5869.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark 0.9.0 - local mode - sc.addJar problem (bug?)

2014-03-02 Thread Pierre B
I'm still puzzled why wget with my IP is not working properly, whereas
it works if I use 127.0.0.1 or localhost...


 


 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-0-9-0-local-mode-sc-addJar-problem-bug-tp2218p2221.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Spark 0.9.0 - local mode - sc.addJar problem (bug?)

2014-03-02 Thread Pierre B
Hi all!

In Spark 0.9.0, local mode, whenever I try to add jar(s), using either
SparkConf.setJars or SparkContext.addJar, in the shell or in a standalone
app, I observe a strange behaviour.

I investigated this because my standalone app works perfectly on my cluster
but is getting stuck in local mode.

So, as can be seen in the following screenshot, the jar file is supposedly
made available at the given http address:

 


However, when I try to get the file from http (in a browser or using wget),
the download always gets stuck after a while (usually around 66,724 bytes,
as can be seen in the next screenshot):

 

This happens for all the jars I've tried, except the ones smaller than
~60 KB.

Could this be a bug or just a problem on my machine? (MacOS Mavericks)

Cheers

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-0-9-0-local-mode-sc-addJar-problem-bug-tp2218.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.