Re: Running Spark on EMR

2017-01-15 Thread Darren Govoni
So what was the answer?


Sent from my Verizon, Samsung Galaxy smartphone
 Original message From: Andrew Holway 
 Date: 1/15/17  11:37 AM  (GMT-05:00) To: Marco 
Mistroni  Cc: Neil Jonkers , User 
 Subject: Re: Running Spark on EMR 
Darn. I didn't respond to the list. Sorry.


On Sun, Jan 15, 2017 at 5:29 PM, Marco Mistroni  wrote:
thanks Neil. I followed the original suggestion from Andrew and everything is working fine now.
kr
On Sun, Jan 15, 2017 at 4:27 PM, Neil Jonkers  wrote:
Hello,
Can you drop the URL spark://master:7077? That URL is only used when running Spark in standalone mode.
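For reference, on EMR Spark runs on YARN, so a submission typically looks more like the sketch below (an illustration only; the exact fix suggested off-list isn't in this archive):

# on the EMR master node, let spark-submit pick up the YARN config:
# spark-submit --master yarn myscript.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataProcess").getOrCreate()   # no hard-coded .master()
# (alternatively, pass "yarn" as the sparkHost parameter instead of spark://master:7077)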
Regards

 Original message From: Marco Mistroni  Date:15/01/2017  16:34  
(GMT+02:00) To: User  Subject: Running Spark on EMR 
hi all, could anyone assist here? I am trying to run Spark 2.0.0 on an EMR cluster, but I am having issues connecting to the master node. Below is a snippet of what I am doing:

from pyspark.sql import SparkSession
sc = SparkSession.builder.master(sparkHost).appName("DataProcess").getOrCreate()

sparkHost is passed as an input parameter. The idea was that I can run the script locally on my local Spark instance as well as submit it to any cluster I want.

Now I have:
1 - set up a cluster on EMR
2 - connected to the master node
3 - launched the command: spark-submit myscripts.py spark://master:7077
But that results in a connection refused exception. Then I tried to remove the .master call above and launch the script with the following command:
spark-submit --master spark://master:7077 myscript.py
but I am still getting a connection refused exception.

I am using Spark 2.0.0. Could anyone advise on how I should build the Spark session and how I can submit a Python script to the cluster?
kr marco





-- 
Otter Networks UG
http://otternetworks.de
Gotenstraße 17
10829 Berlin



Spark in docker over EC2

2017-01-10 Thread Darren Govoni
Anyone got a good guide for getting the Spark master to talk to remote workers inside Docker containers? I followed the tips found by searching but it still doesn't work. Spark 1.6.2.
I exposed all the ports and tried to set the local IP inside the container to the host IP, but Spark complains it can't bind the UI ports.
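For context, these are the settings usually involved (a rough sketch based on the Spark 1.6 standalone/configuration docs, not a verified recipe; the addresses and port numbers are placeholders):

# in the container's spark-env.sh
export SPARK_LOCAL_IP=<IP the master can reach>
export SPARK_PUBLIC_DNS=<externally reachable hostname>
export SPARK_WORKER_WEBUI_PORT=8081        # fixed so Docker can publish it with -p 8081:8081
# in spark-defaults.conf, pin the normally-random ports so they can be published too
spark.driver.port         7078
spark.blockManager.port   7079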
Thanks in advance!


Sent from my Verizon, Samsung Galaxy smartphone

RE: AMQP extension for Apache Spark Streaming (messaging/IoT)

2016-07-03 Thread Darren Govoni


This is fantastic news.


Sent from my Verizon 4G LTE smartphone

 Original message 
From: Paolo Patierno  
Date: 7/3/16  4:41 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: AMQP extension for Apache Spark Streaming (messaging/IoT) 

Hi all,

I'm working on an AMQP extension for Apache Spark Streaming, developing a 
reliable receiver for that. 

After MQTT support (I see it in the Apache Bahir repository), another messaging/IoT protocol could be very useful for the Apache Spark Streaming ecosystem. A lot of brokers out there (with "store and forward" mechanisms) support AMQP as a first-class protocol, in addition to the Apache Qpid Dispatch Router, which is based on it for message routing.
Currently the source code is in my own GitHub account and it's in an early stage; the first step was just having something working end-to-end. I'm going to add features like QoS and flow control in AMQP terms very soon. I was inspired by the spark-packages directory structure, using Scala (as the main language) and SBT (as the build tool).

https://github.com/ppatierno/dstream-amqp

What do you think about that?

Looking forward to hearing from you.

Thanks,
Paolo.
  

Re: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni


Well that could be the problem. A SQL database is essentially a big synchronizer. If you have a lot of Spark tasks all bottlenecking on a single database socket (is the database clustered or colocated with the Spark workers?) then you will have blocked threads on the database server.
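As a generic illustration of easing that kind of bottleneck (a sketch only, not code from this thread; the DB driver, host, table, and kafka_rdd names are hypothetical), the usual pattern is one connection per partition rather than one per record:

import psycopg2   # hypothetical choice of DB-API driver

def enrich_partition(rows):
    # open one connection per partition instead of one per record
    conn = psycopg2.connect(host="dbhost", dbname="mydb", user="spark")
    cur = conn.cursor()
    for row in rows:
        cur.execute("SELECT info FROM existing_data WHERE key = %s", (row[0],))
        yield (row, cur.fetchone())
    conn.close()

enriched = kafka_rdd.mapPartitions(enrich_partition)   # kafka_rdd assumed to come from the Kafka stream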


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Malcolm Lockyer <malcolm.lock...@hapara.com> 
Date: 05/30/2016  10:40 PM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Re: Spark + Kafka processing trouble 

On Tue, May 31, 2016 at 1:56 PM, Darren Govoni <dar...@ontrenet.com> wrote:
> So you are calling a SQL query (to a single database) within a spark
> operation distributed across your workers?

Yes, but currently with very small sets of data (1-10,000) and on a single (dev) machine.





(sorry didn't reply to the list)




RE: Spark + Kafka processing trouble

2016-05-30 Thread Darren Govoni


So you are calling a SQL query (to a single database) within a spark operation 
distributed across your workers? 


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Malcolm Lockyer  
Date: 05/30/2016  9:45 PM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Spark + Kafka processing trouble 

Hopefully this is not off topic for this list, but I am hoping to
reach some people who have used Kafka + Spark before.

We are new to Spark and are setting up our first production
environment and hitting a speed issue that maybe configuration related
- and we have little experience in configuring Spark environments.

So we've got a Spark streaming job that seems to take an inordinate
amount of time to process. I realize that without specifics, it is
difficult to trace - however the most basic primitives in Spark are
performing horribly. The lazy nature of Spark is making it difficult
for me to understand what is happening - any suggestions are very much
appreciated.

Environment is MBP 2.2 i7. Spark master is "local[*]". We are using
Kafka and PostgreSQL, both local. The job is designed to:

a) grab some data from Kafka
b) correlate with existing data in PostgreSQL
c) output data to Kafka

I am isolating timings by calling System.nanoTime() before and after
something that forces calculation, for example .count() on a
DataFrame. It seems like every operation has a MASSIVE fixed overhead, and that is stacking up, making each iteration on the RDD extremely slow. Slow operations include pulling a single item from the Kafka queue, running a simple query against PostgreSQL, and running a Spark aggregation on an RDD with a handful of rows.
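(In PySpark terms, the timing pattern described above looks roughly like the sketch below, with df standing in for one of the DataFrames in question.)

import time
start = time.time()               # analogous to System.nanoTime() on the JVM
n = df.count()                    # count() forces the lazy plan to actually execute
print("count=%d took %.3fs" % (n, time.time() - start))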

The machine is not maxing out on memory, disk or CPU. The machine
seems to be doing nothing for a high percentage of the execution time.
We have reproduced this behavior on two other machines, so we're suspecting a configuration issue.

As a concrete example, we have a DataFrame produced by running a JDBC
query by mapping over an RDD from Kafka. Calling count() (I guess
forcing execution) on this DataFrame when there is *1* item/row (Note:
SQL database is EMPTY at this point so this is not a factor) takes 4.5
seconds, calling count when there are 10,000 items takes 7 seconds.

Can anybody offer experience of something like this happening for
them? Any suggestions on how to understand what is going wrong?

I have tried tuning the number of Kafka partitions - increasing this
seems to increase the concurrency and ultimately number of things
processed per minute, but to get something half decent, I'm going to
need to run with 1024 or more partitions. Is 1024 partitions a reasonable number? What do you use in your environments?

I've tried different options for batchDuration. The calculation seems
to be batchDuration * Kafka partitions for number of items per
iteration, but this is always still extremely slow (many per iteration
vs. very few doesn't seem to really improve things). Can you suggest a
list of the Spark configuration parameters related to speed that you
think are key - preferably with the values you use for those
parameters?

I'd really really appreciate any help or suggestions as I've been
working on this speed issue for 3 days without success and my head is
starting to hurt. Thanks in advance.



Thanks,

--

Malcolm Lockyer




Submit python egg?

2016-05-18 Thread Darren Govoni


Hi, I have a Python egg with a __main__.py in it. I am able to execute the egg by itself fine.
Is there a way to just submit the egg to Spark and have it run? It seems an external .py script is needed, which would be unfortunate if true.
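(For anyone finding this later: the common workaround is exactly that thin driver script, shipped together with the egg via --py-files; the names below are hypothetical.)

# driver.py -- thin wrapper submitted alongside the egg
from mypkg.job import main   # hypothetical entry point packaged inside the egg

if __name__ == "__main__":
    main()

# submit with:
# spark-submit --py-files myapp.egg driver.py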
Thanks


Sent from my Verizon Wireless 4G LTE smartphone

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni


Our data is made up of single text documents scraped off the web. We store these in an RDD. A DataFrame or similar structure makes no sense at that point, and the RDD is transient.
So my point is: DataFrames should not replace plain old RDDs, since RDDs allow for more flexibility, and SQL etc. is not even usable on our data while it's in an RDD. So all those nice DataFrame APIs aren't usable until the data is structured, which is the core problem anyway.
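(As a rough sketch of where that transition happens for data like ours, assuming the usual sc/sqlContext; the parser below is purely hypothetical:)

from pyspark.sql import Row

raw = sc.textFile("scraped_docs/")                # unstructured documents: stays an RDD

def extract_fields(doc):
    # hypothetical parser that pulls whatever structure it can out of the raw text
    words = doc.split()
    return Row(length=len(doc), first_word=words[0] if words else "")

df = sqlContext.createDataFrame(raw.map(extract_fields))   # only now do DataFrames/SQL apply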


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Nicholas Chammas <nicholas.cham...@gmail.com> 
Date: 03/02/2016  5:43 PM  (GMT-05:00) 
To: Darren Govoni <dar...@ontrenet.com>, Jules Damji <dmat...@comcast.net>, 
Joshua Sorrell <jsor...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: Does pyspark still lag far behind the Scala API in terms of 
features 

Plenty of people get their data in Parquet, Avro, or ORC files; or from a 
database; or do their initial loading of un- or semi-structured data using one 
of the various data source libraries which help with type-/schema-inference.
All of these paths help you get to a DataFrame very quickly.
Nick
On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:


DataFrames are essentially structured tables with schemas. So where does the untyped data sit before it becomes structured, if not in a traditional RDD?
For us, almost all the processing comes before there is structure to it.




Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Nicholas Chammas <nicholas.cham...@gmail.com> 
Date: 03/02/2016  5:13 PM  (GMT-05:00) 
To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: Does pyspark still lag far behind the Scala API in terms of 
features 

> However, I believe, investing (or having some members of your group) learn 
>and invest in Scala is worthwhile for few reasons. One, you will get the 
>performance gain, especially now with Tungsten (not sure how it relates to 
>Python, but some other knowledgeable people on the list, please chime in).
The more your workload uses DataFrames, the less of a difference there will be 
between the languages (Scala, Java, Python, or R) in terms of performance.
One of the main benefits of Catalyst (which DFs enable) is that it 
automatically optimizes DataFrame operations, letting you focus on _what_ you 
want while Spark will take care of figuring out _how_.
Tungsten takes things further by tightly managing memory using the type 
information made available to it via DataFrames. This benefit comes into play 
regardless of the language used.
So in short, DataFrames are the "new RDD"--i.e. the new base structure you 
should be using in your Spark programs wherever possible. And with DataFrames, 
what language you use matters much less in terms of performance.
Nick
On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
Hello Joshua,
comments are inline...

On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
I haven't used Spark in the last year and a half. I am about to start a project 
with a new team, and we need to decide whether to use pyspark or Scala.
Indeed, good questions, and they come up a lot in trainings that I have attended, where this inevitable question is raised. I believe it depends on your level of comfort or sense of adventure into newer things.
True, for the most part the Apache Spark committers have been committed to keeping the APIs at parity across all the language offerings, even though in some cases, in particular Python, they have lagged by a minor release. The extent to which they're committed to level parity is a good sign. It might be the case with some experimental APIs that they lag behind, but for the most part they have been admirably consistent.
With Python there’s a minor performance hit, since there’s an extra level of 
indirection in the architecture and an additional Python PID that the executors 
launch to execute your pickled Python lambdas. Other than that it boils down to 
your comfort zone. I recommend looking at Sameer’s slides on (Advanced Spark 
for DevOps Training) where he walks through the pySpark and Python 
architecture. 

We are NOT a java shop. So some of the build tools/procedures will require some 
learning overhead if we go the Scala route. What I want to know is: is the 
Scala version of Spark still far enough ahead of pyspark to be well worth any 
initial training overhead?  
If you are a very advanced Python shop, you have in-house libraries written in Python that don't exist in Scala, or there are some ML libs that don't exist in the Scala version and would require a fair amount of porting, and the gap is too large, then perhaps it makes sense to stay put with Python.
However, I believe, investing (or having some members of your 

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Darren Govoni


DataFrames are essentially structured tables with schemas. So where does the untyped data sit before it becomes structured, if not in a traditional RDD?
For us, almost all the processing comes before there is structure to it.




Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Nicholas Chammas  
Date: 03/02/2016  5:13 PM  (GMT-05:00) 
To: Jules Damji , Joshua Sorrell  
Cc: user@spark.apache.org 
Subject: Re: Does pyspark still lag far behind the Scala API in terms of 
features 

> However, I believe, investing (or having some members of your group) learn 
>and invest in Scala is worthwhile for few reasons. One, you will get the 
>performance gain, especially now with Tungsten (not sure how it relates to 
>Python, but some other knowledgeable people on the list, please chime in).
The more your workload uses DataFrames, the less of a difference there will be 
between the languages (Scala, Java, Python, or R) in terms of performance.
One of the main benefits of Catalyst (which DFs enable) is that it 
automatically optimizes DataFrame operations, letting you focus on _what_ you 
want while Spark will take care of figuring out _how_.
Tungsten takes things further by tightly managing memory using the type 
information made available to it via DataFrames. This benefit comes into play 
regardless of the language used.
So in short, DataFrames are the "new RDD"--i.e. the new base structure you 
should be using in your Spark programs wherever possible. And with DataFrames, 
what language you use matters much less in terms of performance.
Nick
On Tue, Mar 1, 2016 at 12:07 PM Jules Damji  wrote:
Hello Joshua,
comments are inline...

On Mar 1, 2016, at 5:03 AM, Joshua Sorrell  wrote:
I haven't used Spark in the last year and a half. I am about to start a project 
with a new team, and we need to decide whether to use pyspark or Scala.
Indeed, good questions, and they come up a lot in trainings that I have attended, where this inevitable question is raised. I believe it depends on your level of comfort or sense of adventure into newer things.
True, for the most part the Apache Spark committers have been committed to keeping the APIs at parity across all the language offerings, even though in some cases, in particular Python, they have lagged by a minor release. The extent to which they're committed to level parity is a good sign. It might be the case with some experimental APIs that they lag behind, but for the most part they have been admirably consistent.
With Python there’s a minor performance hit, since there’s an extra level of 
indirection in the architecture and an additional Python PID that the executors 
launch to execute your pickled Python lambdas. Other than that it boils down to 
your comfort zone. I recommend looking at Sameer’s slides on (Advanced Spark 
for DevOps Training) where he walks through the pySpark and Python 
architecture. 

We are NOT a java shop. So some of the build tools/procedures will require some 
learning overhead if we go the Scala route. What I want to know is: is the 
Scala version of Spark still far enough ahead of pyspark to be well worth any 
initial training overhead?  
If you are a very advanced Python shop, you have in-house libraries written in Python that don't exist in Scala, or there are some ML libs that don't exist in the Scala version and would require a fair amount of porting, and the gap is too large, then perhaps it makes sense to stay put with Python.
However, I believe investing in (or having some members of your group learn and invest in) Scala is worthwhile for a few reasons. One, you will get the performance gain, especially now with Tungsten (not sure how it relates to Python, but some other knowledgeable people on the list, please chime in). Two, since Spark is written in Scala, it gives you an enormous advantage to be able to read the sources (which are well documented and highly readable) should you have to consult or learn the nuances of a certain API method or action not covered comprehensively in the docs. And finally, there's a long-term benefit in learning Scala for reasons other than Spark, for example writing other scalable and distributed applications.

Particularly, we will be using Spark Streaming. I know a couple of years ago 
that practically forced the decision to use Scala.  Is this still the case?
You'll notice that certain API calls are not available, at least for now, in Python: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Cheers,
Jules
--
The Best Ideas Are Simple
Jules S. Damji
e-mail:dmat...@comcast.net
e-mail:jules.da...@gmail.com




RE: How could I do this algorithm in Spark?

2016-02-25 Thread Darren Govoni


This might be hard to do. One generalization of this problem is 
https://en.m.wikipedia.org/wiki/Longest_path_problem
Given a node (e.g. A), find the longest path. All interior relations are transitive and can be inferred.
But finding a distributed Spark way of doing it in P time would be interesting.
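(For illustration only: a naive pointer-chasing sketch in PySpark, assuming the chains are acyclic apart from self-loops. It extends each path by one hop per iteration, so it is not the pointer-doubling version.)

edges = sc.parallelize([("a", "b"), ("x", "y"), ("b", "c"), ("y", "y"), ("c", "d")])

reach = edges   # current best known (node, furthest destination) pairs
while True:
    # try to extend every path by one hop: look up each destination's own outgoing edge
    extended = (reach.map(lambda p: (p[1], p[0]))             # (dst, src)
                     .leftOuterJoin(edges)                    # (dst, (src, next or None))
                     .map(lambda kv: (kv[1][0], kv[1][1] if kv[1][1] is not None else kv[0])))
    if extended.subtract(reach).isEmpty():                    # fixpoint reached
        break
    reach = extended

print(sorted(reach.collect()))   # [('a','d'), ('b','d'), ('c','d'), ('x','y'), ('y','y')]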

Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Guillermo Ortiz  
Date: 02/24/2016  5:26 PM  (GMT-05:00) 
To: user  
Subject: How could I do this algorithm in Spark? 

I want to do some algorithm in Spark. I know how to do it on a single machine where all the data is together, but I don't know a good way to do it in Spark.
If someone has an idea... I have some data like this:
a , b
x , y
b , c
y , y
c , d
I want something like:
a , d
b , d
c , d
x , y
y , y
I need to know that a->b->c->d, so a->d, b->d and c->d. I don't want the code, just an idea of how I could deal with it.
Any idea?


RE: Unusually large deserialisation time

2016-02-16 Thread Darren Govoni


I meant to write 'last task in stage'.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Darren Govoni <dar...@ontrenet.com> 
Date: 02/16/2016  6:55 AM  (GMT-05:00) 
To: Abhishek Modi <abshkm...@gmail.com>, user@spark.apache.org 
Subject: RE: Unusually large deserialisation time 



I think this is part of the bigger issue of serious deadlock conditions 
occurring in spark many of us have posted on.
Would the task in question be the past task of a stage by chance?


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Abhishek Modi <abshkm...@gmail.com> 
Date: 02/16/2016  4:12 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Unusually large deserialisation time 

I'm doing a mapPartitions on an RDD cached in memory, followed by a reduce. Here is my code snippet:

// myRdd is an RDD consisting of Tuple2[Int,Long]
myRdd.mapPartitions(rangify).reduce( (x,y) => (x._1+y._1, x._2 ++ y._2) )

// The rangify function: sums the Int values and collapses consecutive Long values into (begin, end) ranges
import scala.collection.mutable.ArrayBuffer

def rangify(l: Iterator[ Tuple2[Int,Long] ]): Iterator[ Tuple2[Long, List[ ArrayBuffer[ Tuple2[Long,Long] ] ] ] ] = {
  var sum = 0L
  val mylist = ArrayBuffer[ Tuple2[Long,Long] ]()

  if (l.isEmpty)
    return List( (0L, List[ ArrayBuffer[ Tuple2[Long,Long] ] ]()) ).toIterator

  var prev = -1000L
  var begin = -1000L

  for (x <- l) {
    sum += x._1

    if (prev < 0) {              // first element seen in this partition
      prev = x._2
      begin = x._2
    }
    else if (x._2 == prev + 1)   // still contiguous, extend the current range
      prev = x._2
    else {                       // gap found: close the current range, start a new one
      mylist += ((begin, prev))
      prev = x._2
      begin = x._2
    }
  }

  mylist += ((begin, prev))      // close the last open range

  List( (sum, List(mylist)) ).toIterator
}


The RDD is cached in memory. I'm using 20 executors with 1 core each. The cached RDD has 60 blocks. The problem is that for every 2-3 runs of the job, there is a task with an abnormally large deserialisation time. Screenshot attached.

Thank you,
Abhishek




RE: Unusually large deserialisation time

2016-02-16 Thread Darren Govoni


I think this is part of the bigger issue of serious deadlock conditions 
occurring in spark many of us have posted on.
Would the task in question be the past task of a stage by chance?


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Abhishek Modi  
Date: 02/16/2016  4:12 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Unusually large deserialisation time 

I'm doing a mapPartitions on an RDD cached in memory, followed by a reduce. Here is my code snippet:

// myRdd is an RDD consisting of Tuple2[Int,Long]
myRdd.mapPartitions(rangify).reduce( (x,y) => (x._1+y._1, x._2 ++ y._2) )

// The rangify function: sums the Int values and collapses consecutive Long values into (begin, end) ranges
import scala.collection.mutable.ArrayBuffer

def rangify(l: Iterator[ Tuple2[Int,Long] ]): Iterator[ Tuple2[Long, List[ ArrayBuffer[ Tuple2[Long,Long] ] ] ] ] = {
  var sum = 0L
  val mylist = ArrayBuffer[ Tuple2[Long,Long] ]()

  if (l.isEmpty)
    return List( (0L, List[ ArrayBuffer[ Tuple2[Long,Long] ] ]()) ).toIterator

  var prev = -1000L
  var begin = -1000L

  for (x <- l) {
    sum += x._1

    if (prev < 0) {              // first element seen in this partition
      prev = x._2
      begin = x._2
    }
    else if (x._2 == prev + 1)   // still contiguous, extend the current range
      prev = x._2
    else {                       // gap found: close the current range, start a new one
      mylist += ((begin, prev))
      prev = x._2
      begin = x._2
    }
  }

  mylist += ((begin, prev))      // close the last open range

  List( (sum, List(mylist)) ).toIterator
}


The RDD is cached in memory. I'm using 20 executors with 1 core each. The cached RDD has 60 blocks. The problem is that for every 2-3 runs of the job, there is a task with an abnormally large deserialisation time. Screenshot attached.

Thank you,
Abhishek




Re: Launching EC2 instances with Spark compiled for Scala 2.11

2016-01-25 Thread Darren Govoni


Why not deploy it, then build a custom distribution with Scala 2.11 and just overlay it?
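(Roughly, per the Spark 1.x build docs; the profile names are placeholders and the flags should be double-checked against the docs for your exact version:)

./dev/change-scala-version.sh 2.11
./make-distribution.sh --name scala-2.11 --tgz -Dscala-2.11 -Pyarn -Phadoop-2.6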


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Nuno Santos  
Date: 01/25/2016  7:38 AM  (GMT-05:00) 
To: user@spark.apache.org 
Subject: Re: Launching EC2 instances with Spark compiled for Scala 2.11 

Hello, 

Any updates on this question? I'm also very interested in a solution, as I'm trying to use Spark on EC2 but need Scala 2.11 support. The scripts in the ec2 directory of the Spark distribution install Scala 2.10 by default, and I can't see any obvious option to change to Scala 2.11.

Regards, 
Nuno







Re: 10hrs of Scheduler Delay

2016-01-25 Thread Darren Govoni


Yeah. I have screenshots and stack traces. I will post them to the ticket. 
Nothing informative.
I should also mention I'm using pyspark but I think the deadlock is inside the 
Java scheduler code.



Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B" <sande...@rose-hulman.edu> 
Date: 01/25/2016  8:59 AM  (GMT-05:00) 
To: Ted Yu <yuzhih...@gmail.com> 
Cc: Darren Govoni <dar...@ontrenet.com>, Renu Yadav <yren...@gmail.com>, Muthu 
Jayakumar <bablo...@gmail.com>, user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 



Is the thread dump the stack trace you are talking about? If so, I will see if 
I can capture the few different stages I have seen it in.



Thanks for the help, I was able to do it for 0.1% of my data. I will create the 
JIRA.



Thanks,
Isaac


On Jan 25, 2016, at 8:51 AM, Ted Yu <yuzhih...@gmail.com> wrote:







Opening a JIRA is fine. 



See if you can capture stack trace during the hung stage and attach to JIRA so 
that we have more clue. 



Thanks


On Jan 25, 2016, at 4:25 AM, Darren Govoni <dar...@ontrenet.com> wrote:






Probably we should open a ticket for this.
There's definitely a deadlock situation occurring in spark under certain 
conditions.



The only clue I have is that it always happens on the last stage. And it does seem sensitive to scale: if my job has 300MB of data I'll see the deadlock, but if I only run 10MB of it, it will succeed. This suggests a serious fundamental scaling problem.



Workers have plenty of resources.

Sent from my Verizon Wireless 4G LTE smartphone





 Original message 

From: "Sanders, Isaac B" <sande...@rose-hulman.edu>


Date: 01/24/2016 2:54 PM (GMT-05:00) 

To: Renu Yadav <yren...@gmail.com> 

Cc: Darren Govoni <dar...@ontrenet.com>, Muthu Jayakumar <bablo...@gmail.com>, 
Ted Yu <yuzhih...@gmail.com>,
user@spark.apache.org 

Subject: Re: 10hrs of Scheduler Delay 



I am not getting anywhere with any of the suggestions so far. :(



Trying some more outlets, I will share any solution I find.



- Isaac




On Jan 23, 2016, at 1:48 AM, Renu Yadav <yren...@gmail.com> wrote:



If you turn spark.speculation on, then that might help. It worked for me.
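(For reference, that is just the spark.speculation setting; a minimal sketch, with a placeholder app name:)

from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("myjob").set("spark.speculation", "true")
sc = SparkContext(conf=conf)
# or on the command line: spark-submit --conf spark.speculation=true ...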




On Sat, Jan 23, 2016 at 3:21 AM, Darren Govoni 
<dar...@ontrenet.com> wrote:



Thanks for the tip. I will try it. But this is the kind of thing spark is 
supposed to figure out and handle. Or at least not get stuck forever.

Sent from my Verizon Wireless 4G LTE smartphone





 Original message ----



From: Muthu Jayakumar <bablo...@gmail.com>


Date: 01/22/2016 3:50 PM (GMT-05:00) 

To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" 
<sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com>


Cc: user@spark.apache.org


Subject: Re: 10hrs of Scheduler Delay 



Does increasing the number of partitions help? You could try something 3 times what you currently have.
Another trick I used was to partition the problem into multiple DataFrames, run them sequentially, persist the results, and then run a union on the results.
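(A rough PySpark sketch of both suggestions; df is assumed, and current_partitions, bucket and process() are placeholders:)

from functools import reduce

wider = df.repartition(3 * current_partitions)           # try roughly 3x the current partition count

parts = [df.filter(df.bucket == i) for i in range(4)]    # split the work by a hypothetical bucket column
results = [process(p).persist() for p in parts]          # process each piece and persist the result
combined = reduce(lambda a, b: a.unionAll(b), results)   # Spark 1.x DataFrame union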



Hope this helps. 




On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:




Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.

Sent from my Verizon Wireless 4G LTE smartphone





 Original message 


From: "Sanders, Isaac B" <sande...@rose-hulman.edu>


Date: 01/21/2016 11:18 PM (GMT-05:00) 

To: Ted Yu <yuzhih...@gmail.com>


Cc: user@spark.apache.org


Subject: Re: 10hrs of Scheduler Delay 




I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/Distance

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Darren Govoni


Probably we should open a ticket for this. There's definitely a deadlock situation occurring in Spark under certain conditions.
The only clue I have is that it always happens on the last stage. And it does seem sensitive to scale: if my job has 300MB of data I'll see the deadlock, but if I only run 10MB of it, it will succeed. This suggests a serious fundamental scaling problem.
Workers have plenty of resources.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B" <sande...@rose-hulman.edu> 
Date: 01/24/2016  2:54 PM  (GMT-05:00) 
To: Renu Yadav <yren...@gmail.com> 
Cc: Darren Govoni <dar...@ontrenet.com>, Muthu Jayakumar <bablo...@gmail.com>, 
Ted Yu <yuzhih...@gmail.com>, user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 






I am not getting anywhere with any of the suggestions so far. :(



Trying some more outlets, I will share any solution I find.



- Isaac




On Jan 23, 2016, at 1:48 AM, Renu Yadav <yren...@gmail.com> wrote:



If you turn spark.speculation on, then that might help. It worked for me.




On Sat, Jan 23, 2016 at 3:21 AM, Darren Govoni 
<dar...@ontrenet.com> wrote:



Thanks for the tip. I will try it. But this is the kind of thing spark is 
supposed to figure out and handle. Or at least not get stuck forever.

Sent from my Verizon Wireless 4G LTE smartphone





 Original message 



From: Muthu Jayakumar <bablo...@gmail.com>


Date: 01/22/2016 3:50 PM (GMT-05:00) 

To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" 
<sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com>


Cc: user@spark.apache.org


Subject: Re: 10hrs of Scheduler Delay 



Does increasing the number of partitions help? You could try something 3 times what you currently have.
Another trick I used was to partition the problem into multiple DataFrames, run them sequentially, persist the results, and then run a union on the results.



Hope this helps. 




On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:




Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.

Sent from my Verizon Wireless 4G LTE smartphone





 Original message 


From: "Sanders, Isaac B" <sande...@rose-hulman.edu>


Date: 01/21/2016 11:18 PM (GMT-05:00) 

To: Ted Yu <yuzhih...@gmail.com>


Cc: user@spark.apache.org


Subject: Re: 10hrs of Scheduler Delay 




I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala



- Isaac




On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have noticed the following - did this indicate prolonged computation in 
your code ?




Re: 10hrs of Scheduler Delay

2016-01-22 Thread Darren Govoni


Thanks for the tip. I will try it. But this is the kind of thing spark is 
supposed to figure out and handle. Or at least not get stuck forever.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Muthu Jayakumar <bablo...@gmail.com> 
Date: 01/22/2016  3:50 PM  (GMT-05:00) 
To: Darren Govoni <dar...@ontrenet.com>, "Sanders, Isaac B" 
<sande...@rose-hulman.edu>, Ted Yu <yuzhih...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 

Does increasing the number of partitions help? You could try something 3 times what you currently have. Another trick I used was to partition the problem into multiple DataFrames, run them sequentially, persist the results, and then run a union on the results.
Hope this helps. 

On Fri, Jan 22, 2016, 3:48 AM Darren Govoni <dar...@ontrenet.com> wrote:


Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B" <sande...@rose-hulman.edu> 
Date: 01/21/2016  11:18 PM  (GMT-05:00) 
To: Ted Yu <yuzhih...@gmail.com> 
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 


I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala



- Isaac




On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote:



You may have noticed the following - did this indicate prolonged computation in 
your code ?


org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)




On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



Hadoop is: HDP 2.3.2.0-2950



Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b



Thanks







On Jan 21, 2016, at 7:44 PM, Ted Yu <yuzhih...@gmail.com> wrote:



Looks like you were running on YARN.



What hadoop version are you using ?



Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?



Thanks



On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:



The Spark Version is 1.4.1



The logs are full of standard fare, nothing like an exception or even interesting [INFO] lines.



Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2



Thanks
Isaac




On Jan 21, 2016, at 11:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:



Can you provide a bit more information ?



command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?



Thanks 







On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
<sande...@rose-hulman.edu> wrote:


Hey all,



I am a CS student in the United States working on my senior thesis.



My thesis uses Spark, and I am encountering some trouble.



I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.



I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.



I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.



It was stuck in Scheduler Delay for 10 hours overnight, and I have tried

Re: 10hrs of Scheduler Delay

2016-01-22 Thread Darren Govoni


Me too. I had to shrink my dataset to get it to work. For us at least Spark 
seems to have scaling issues.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: "Sanders, Isaac B"  
Date: 01/21/2016  11:18 PM  (GMT-05:00) 
To: Ted Yu  
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 


I have run the driver on a smaller dataset (k=2, n=5000) and it worked quickly 
and didn’t hang like this. This dataset is closer to k=10, n=4.4m, but I am 
using more resources on this one.



- Isaac






On Jan 21, 2016, at 11:06 PM, Ted Yu  wrote:



You may have seen the following on github page:


Latest commit 50fdf0e  on Feb 22, 2015






That was 11 months ago.



Can you search for similar algorithm which runs on Spark and is newer ?



If nothing found, consider running the tests coming from the project to 
determine whether the delay is intrinsic.



Cheers



On Thu, Jan 21, 2016 at 7:46 PM, Sanders, Isaac B 
 wrote:



That thread seems to be moving, it oscillates between a few different traces… 
Maybe it is working. It seems odd that it would take that long.



This is 3rd party code, and after looking at some of it, I think it might not 
be as Spark-y as it could be.



I linked it below. I don’t know a lot about spark, so it might be fine, but I 
have my suspicions.



https://github.com/alitouka/spark_dbscan/blob/master/src/src/main/scala/org/alitouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala



- Isaac




On Jan 21, 2016, at 10:08 PM, Ted Yu  wrote:



You may have noticed the following - did this indicate prolonged computation in 
your code ?


org.apache.commons.math3.util.MathArrays.distance(MathArrays.java:205)
org.apache.commons.math3.ml.distance.EuclideanDistance.compute(EuclideanDistance.java:34)
org.alitouka.spark.dbscan.spatial.DistanceCalculation$class.calculateDistance(DistanceCalculation.scala:15)
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver$.calculateDistance(DistanceToNearestNeighborDriver.scala:16)




On Thu, Jan 21, 2016 at 5:13 PM, Sanders, Isaac B 
 wrote:



Hadoop is: HDP 2.3.2.0-2950



Here is a gist (pastebin) of my versions en masse and a stacktrace: 
https://gist.github.com/isaacsanders/2e59131758469097651b



Thanks







On Jan 21, 2016, at 7:44 PM, Ted Yu  wrote:



Looks like you were running on YARN.



What hadoop version are you using ?



Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?



Thanks



On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B 
 wrote:



The Spark Version is 1.4.1



The logs are full of standard fare, nothing like an exception or even interesting [INFO] lines.



Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2



Thanks
Isaac




On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:



Can you provide a bit more information ?



command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?



Thanks 







On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
 wrote:


Hey all,



I am a CS student in the United States working on my senior thesis.



My thesis uses Spark, and I am encountering some trouble.



I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.



I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.



I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.



It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.



I have tried:

- Increasing heap sizes and numbers of cores

- More/less executors with different amounts of resources.

- Kryo Serialization

- FAIR Scheduling



It doesn’t seem like it should require this much. Any ideas?



- Isaac

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Darren Govoni


I've experienced this same problem. Always the last stage hangs. Indeterminate. No errors in the logs. I run Spark 1.5.2. Can't find an explanation, but it's definitely a showstopper.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Ted Yu  
Date: 01/21/2016  7:44 PM  (GMT-05:00) 
To: "Sanders, Isaac B"  
Cc: user@spark.apache.org 
Subject: Re: 10hrs of Scheduler Delay 

Looks like you were running on YARN.
What hadoop version are you using ?
Can you capture a few stack traces of the AppMaster during the delay and 
pastebin them ?
Thanks
On Thu, Jan 21, 2016 at 8:08 AM, Sanders, Isaac B  
wrote:





The Spark Version is 1.4.1



The logs are full of standard fare, nothing like an exception or even interesting [INFO] lines.



Here is the script I am using: 
https://gist.github.com/isaacsanders/660f480810fbc07d4df2



Thanks
Isaac




On Jan 21, 2016, at 11:03 AM, Ted Yu  wrote:



Can you provide a bit more information ?



command line for submitting Spark job
version of Spark
anything interesting from driver / executor logs ?



Thanks 





On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B 
 wrote:


Hey all,



I am a CS student in the United States working on my senior thesis.



My thesis uses Spark, and I am encountering some trouble.



I am using 
https://github.com/alitouka/spark_dbscan, and to determine parameters, I am 
using the utility class they supply, 
org.alitouka.spark.dbscan.exploratoryAnalysis.DistanceToNearestNeighborDriver.



I am on a 10 node cluster with one machine with 8 cores and 32G of memory and 
nine machines with 6 cores and 16G of memory.



I have 442M of data, which seems like it would be a joke, but the job stalls at 
the last stage.



It was stuck in Scheduler Delay for 10 hours overnight, and I have tried a 
number of things for the last couple days, but nothing seems to be helping.



I have tried:

- Increasing heap sizes and numbers of cores

- More/less executors with different amounts of resources.

- Kryo Serialization

- FAIR Scheduling



It doesn’t seem like it should require this much. Any ideas?



- Isaac

Re: Docker/Mesos with Spark

2016-01-19 Thread Darren Govoni


I would also be interested in some best practices for making this work.
Where will the writeup be posted? On the Mesosphere website?


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Sathish Kumaran Vairavelu  
Date: 01/19/2016  7:00 PM  (GMT-05:00) 
To: Tim Chen  
Cc: John Omernik , user  
Subject: Re: Docker/Mesos with Spark 

Thank you! Looking forward for it..

On Tue, Jan 19, 2016 at 4:03 PM Tim Chen  wrote:
Hi Sathish,
Sorry about that, I think that's a good idea and I'll write up a section in the Spark documentation page to explain how it can work. We (Mesosphere) have been doing this for our DCOS Spark for our past releases and it has been working well so far.
Thanks!
Tim
On Tue, Jan 19, 2016 at 12:28 PM, Sathish Kumaran Vairavelu 
 wrote:
Hi Tim

Do you have any materials/blog for running Spark in a container in Mesos 
cluster environment? I have googled it but couldn't find info on it. Spark 
documentation says it is possible, but no details provided.. Please help


Thanks 

Sathish



On Mon, Sep 21, 2015 at 11:54 AM Tim Chen  wrote:
Hi John,
There is no other blog post yet, I'm thinking to do a series of posts but so 
far haven't get time to do that yet.
Running Spark in Docker containers makes distributing Spark versions easy: it's simple to upgrade, and the image is automatically cached on the slaves so the same image just runs right away. Most of the Docker performance overhead is usually related to network and filesystem, but I think with the recent changes in Spark to make the Mesos sandbox the default temp dir, the filesystem won't be a big concern as it's mostly writing to the mounted-in Mesos sandbox. Also, Mesos uses the host network by default, so the network isn't affected much.
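(For reference, the property behind this in Spark's Mesos backend is spark.mesos.executor.docker.image, per the running-on-mesos docs; a minimal sketch with placeholder names:)

spark-submit --master mesos://<mesos-master>:5050 \
  --conf spark.mesos.executor.docker.image=<your-spark-image> \
  my_app.py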
Most of the cluster mode limitation is that you need to make the spark job 
files available somewhere that all the slaves can access remotely (http, s3, 
hdfs, etc) or available on all slaves locally by path. 
I'll try to make more doc efforts once I get my existing patches and testing 
infra work done.
Let me know if you have more questions,
Tim
On Sat, Sep 19, 2015 at 5:42 AM, John Omernik  wrote:
I was searching in the 1.5.0 docs on the Docker on Mesos capabilities and just 
found you CAN run it this way.  Are there any user posts, blog posts, etc on 
why and how you'd do this? 
Basically, at first I was questioning why you'd run Spark in a Docker container, i.e., if you run with a tarballed executor, what are you really gaining? And in this setup, are you losing out on performance somehow? (I am guessing smarter people than I have figured that out.)
Then I came across a situation where I wanted to use a Python library with Spark, and it had to be installed on every node, and I realized one big advantage of dockerized Spark would be that Spark apps that needed other libraries could be contained and built well.
OK, that's huge, let's do that. For my next question, there are a lot of questions I have on how this actually works. Does cluster mode/client mode apply here? If so, how? Is there a good walkthrough on getting this set up? Limitations? Gotchas? Should I just dive in and start working with it? Has anyone done any write-ups/rough documentation? This seems like a really helpful feature for scaling out Spark, and for letting developers truly build what they need without tons of admin overhead, so I really want to explore.
Thanks!
John








Re: rdd.foreach return value

2016-01-18 Thread Darren Govoni


What's the rationale behind that? It certainly limits the kind of flow logic we 
can do in one statement.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: David Russell  
Date: 01/18/2016  10:44 PM  (GMT-05:00) 
To: charles li  
Cc: user@spark.apache.org 
Subject: Re: rdd.foreach return value 

The foreach operation on RDD has a void (Unit) return type. See attached. So 
there is no return value to the driver.
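(A quick PySpark illustration of the point: foreach runs purely for its side effects, so values come back to the driver via something like map/collect or an accumulator instead.)

rdd = sc.parallelize([1, 2, 3])
result = rdd.foreach(lambda x: x * 2)          # executes on the workers; foreach returns None
print(result)                                  # None on the driver
print(rdd.map(lambda x: x * 2).collect())      # [2, 4, 6] -- use map/collect to bring values back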

David

"All that is gold does not glitter, Not all those who wander are lost."


 Original Message 
Subject: rdd.foreach return value
Local Time: January 18 2016 10:34 pm
UTC Time: January 19 2016 3:34 am
From: charles.up...@gmail.com
To: user@spark.apache.org

code snippet: [inline image not included in the archive]

The 'print' actually prints info on the worker node, but I'm confused about where the 'return' value goes to, since I get nothing on the driver node.
-- 
--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao



Task hang problem

2015-12-29 Thread Darren Govoni


Hi,

  I've had this nagging problem where a task will hang and the
entire job hangs. Using pyspark. Spark 1.5.1



The job output looks like this, and hangs after the last task:



..

15/12/29 17:00:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in
memory on 10.65.143.174:34385 (size: 5.8 KB, free: 2.1 GB)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 15.0 in stage
0.0 (TID 15) in 11668 ms on 10.65.143.174 (29/32)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 23.0 in stage
0.0 (TID 23) in 11684 ms on 10.65.143.174 (30/32)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 7.0 in stage
0.0 (TID 7) in 11717 ms on 10.65.143.174 (31/32)

{nothing here for a while, ~6mins}





Here is the executor status, from the UI (one row):
31  31  0  RUNNING  PROCESS_LOCAL  2 / 10.65.143.174  2015/12/29 17:00:28  6.8 min  0 ms  0 ms  60 ms  0 ms  0 ms  0.0 B



Here is executor 2 from 10.65.143.174. I never see task 31 get to the executor. Any ideas?



.

15/12/29 17:00:38 INFO TorrentBroadcast: Started reading broadcast
variable 0

15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(5979) called
with curMem=0, maxMem=2223023063

15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 5.8 KB, free 2.1 GB)

15/12/29 17:00:38 INFO TorrentBroadcast: Reading broadcast variable
0 took 208 ms

15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(8544) called
with curMem=5979, maxMem=2223023063

15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 8.3 KB, free 2.1 GB)

15/12/29 17:00:39 INFO PythonRunner: Times: total = 913, boot = 747,
init = 166, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 15.0 in stage 0.0
(TID 15). 967 bytes result sent to driver

15/12/29 17:00:39 INFO PythonRunner: Times: total = 955, boot = 735,
init = 220, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 23.0 in stage 0.0
(TID 23). 967 bytes result sent to driver

15/12/29 17:00:39 INFO PythonRunner: Times: total = 970, boot = 812,
init = 158, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 7.0 in stage 0.0 (TID
7). 967 bytes result sent to driver

root@ip-10-65-143-174 2]$ 


Sent from my Verizon Wireless 4G LTE smartphone

Re: Task hang problem

2015-12-29 Thread Darren Govoni

Here's the executor trace.

Thread 58: Executor task launch worker-3 (RUNNABLE)
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:152)
java.net.SocketInputStream.read(SocketInputStream.java:122)
java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
java.io.BufferedInputStream.read(BufferedInputStream.java:254)
java.io.DataInputStream.readInt(DataInputStream.java:387)
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:88)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
  

  
  
Thread 41: BLOCK_MANAGER cleanup timer (WAITING)
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.util.TimerThread.mainLoop(Timer.java:526)
java.util.TimerThread.run(Timer.java:505)
  

  
  
Thread 42: BROADCAST_VARS cleanup timer (WAITING)
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.util.TimerThread.mainLoop(Timer.java:526)
java.util.TimerThread.run(Timer.java:505)
  

  
  
Thread 54: driver-heartbeater (TIMED_WAITING)
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
  

  
  
Thread 3: Finalizer (WAITING)

  
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
  

  
  
Thread 25: ForkJoinPool-3-worker-15 (WAITING)
sun.misc.Unsafe.park(Native Method)
scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
  

  
  
Thread 35: Hashed wheel timer #2 (TIMED_WAITING)
java.lang.Thread.sleep(Native Method)
org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:483)
org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:392)
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
java.lang.Thread.run(Thread.java:745)
  

  
  
Thread 68: Idle Worker Monitor for /usr/bin/python2.7 (TIMED_WAITING)
java.lang.Thread.sleep(Native Method)
org.apache.spark.api.python.PythonWorkerFactory$MonitorThread.run(PythonWorkerFactory.scala:229)
  

  
  
Thread 1: main (WAITING)

  
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:819)

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Darren Govoni


I'll throw a thought in here.
DataFrames are nice if your data is uniform and clean, with a consistent schema.
However, in many big data problems this is seldom the case.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Chris Fregly  
Date: 12/28/2015  5:22 PM  (GMT-05:00) 
To: Richard Eggert  
Cc: Daniel Siegmann , Divya Gehlot 
, "user @spark"  
Subject: Re: DataFrame Vs RDDs ... Which one to use When ? 

here's a good article that sums it up, in my opinion: 
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
basically, building apps with RDDs is like building with apps with primitive 
JVM bytecode.  haha.
@richard:  remember that even if you're currently writing RDDs in Java/Scala, 
you're not gaining the code gen/rewrite performance benefits of the Catalyst 
optimizer.
i agree with @daniel who suggested that you start with DataFrames and revert to 
RDDs only when DataFrames don't give you what you need.
the only time i use RDDs directly these days is when i'm dealing with a Spark 
library that has not yet moved to DataFrames - ie. GraphX - and it's kind of 
annoying switching back and forth.
almost everything you need should be in the DataFrame API.
Datasets are similar to RDDs, but give you strong compile-time typing, tabular 
structure, and Catalyst optimizations.
hopefully Datasets is the last API we see from Spark SQL...  i'm getting tired 
of re-writing slides and book chapters!  :)
On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert  
wrote:
One advantage of RDDs over DataFrames is that RDDs allow you to use your own 
data types, whereas DataFrames are backed by RDDs of Row objects, which are 
pretty flexible but don't give you much in the way of compile-time type 
checking. If you have an RDD of case class elements or JSON, then Spark SQL can 
automatically figure out how to convert it into an RDD of Row objects (and 
therefore a DataFrame), but there's no way to automatically go the other way 
(from DataFrame/Row back to custom types).
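
A minimal PySpark sketch of that round trip, assuming sc and sqlContext from a 1.x shell; the field names are made up, and note that coming back from the DataFrame yields generic Row objects rather than the original types:

from pyspark.sql import Row

people = sc.parallelize([Row(name="alice", age=34), Row(name="bob", age=29)])
df = sqlContext.createDataFrame(people)   # schema is inferred from the Row fields
back = df.rdd                             # an RDD again, but of Row objects, not custom types
print(back.map(lambda r: r.name).collect())
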
In general, you can ultimately do more with RDDs than DataFrames, but 
DataFrames give you a lot of niceties (automatic query optimization, table 
joins, SQL-like syntax, etc.) for free, and can avoid some of the runtime 
overhead associated with writing RDD code in a non-JVM language (such as Python 
or R), since the query optimizer is effectively creating the required JVM code 
under the hood. There's little to no performance benefit if you're already 
writing Java or Scala code, however (and RDD-based code may actually perform 
better in some cases, if you're willing to carefully tune your code).
On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann  
wrote:
DataFrames are a higher level API for working with tabular data - RDDs are used 
underneath. You can use either and easily convert between them in your code as 
necessary.

DataFrames provide a nice abstraction for many cases, so it may be easier to 
code against them. Though if you're used to thinking in terms of collections 
rather than tables, you may find RDDs more natural. Data frames can also be 
faster, since Spark will do some optimizations under the hood - if you are 
using PySpark, this avoids much of the Python per-row processing overhead, 
because the work is executed in the JVM. Data frames may also perform 
better if you're reading structured data, such as a Hive table or Parquet files.

I recommend you prefer data frames, switching over to RDDs as necessary (when 
you need to perform an operation not supported by data frames / Spark SQL).
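
A short PySpark sketch of that recommendation, with a hypothetical Parquet path and column names; custom_parse is a made-up stand-in for logic the DataFrame API doesn't cover:

def custom_parse(payload):                      # stand-in for logic Spark SQL can't express
    return payload.strip().lower()

df = sqlContext.read.parquet("/data/events.parquet")            # hypothetical path
active = df.filter(df.status == "active").select("user_id", "payload")
parsed = active.rdd.map(lambda row: custom_parse(row.payload))  # drop to the RDD only here
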

HOWEVER (and this is a big one), Spark 1.6 will have yet another API - 
datasets. The release of Spark 1.6 is currently being finalized and I would 
expect it in the next few days. You will probably want to use the new API once 
it's available.


On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot  wrote:
Hi,
I am new bee to spark and a bit confused about RDDs and DataFames in Spark.
Can somebody explain me with the use cases which one to use when ?

Would really appreciate the clarification .

Thanks,
Divya 






-- 
Rich




-- 

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com



Re: Scala VS Java VS Python

2015-12-16 Thread Darren Govoni


I use Python too. I'm actually surprised it's not the primary language, since it 
is by far more used in data science than Java and Scala combined.
If I had a second choice of scripting language for general apps, I'd want Groovy 
over Scala.


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Daniel Lopes  
Date: 12/16/2015  4:16 PM  (GMT-05:00) 
To: Daniel Valdivia  
Cc: user  
Subject: Re: Scala VS Java VS Python 

For me Scala is better since Spark is written in Scala, and I like Python because I 
always used Python for data science. :)
On Wed, Dec 16, 2015 at 5:54 PM, Daniel Valdivia  
wrote:
Hello,



This is more of a "survey" question for the community, you can reply to me 
directly so we don't flood the mailing list.



I'm having a hard time learning Spark using Python since the API seems to be 
slightly incomplete, so I'm looking at my options to start doing all my apps in 
either Scala or Java. Being a Java developer, Java 1.8 looks like the logical 
way; however, I'd like to ask here which is the most common (Scala or Java), since 
I'm seeing mixed results in the community documentation, although Scala seems to 
be the predominant language for Spark examples.



Thanks for the advice

-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org






-- 
Daniel Lopes, B.Eng
Data Scientist - BankFacil
CREA/SP 5069410560
Mob +55 (18) 99764-2733
Ph +55 (11) 3522-8009
http://about.me/dannyeuu
Av. Nova Independência, 956, São Paulo, SP
Bairro Brooklin Paulista
CEP 04570-001
https://www.bankfacil.com.br




Re: Pyspark submitted app just hangs

2015-12-02 Thread Darren Govoni

The pyspark app stdout/err log shows this oddity.

Traceback (most recent call last):
  File "/root/spark/notebooks/ingest/XXX.py", line 86, in 
    print pdfRDD.collect()[:5]
  File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 773, in collect
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 536, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 364, in send_command
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 473, in send_command
  File "/usr/lib64/python2.7/socket.py", line 430, in readline
    data = recv(1)
KeyboardInterrupt
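
The driver here is blocked inside collect(), which materializes the whole RDD on the driver; if only a preview is needed, take(n) fetches just the first n elements. A minimal sketch, assuming the pdfRDD from the traceback above:

# instead of: print pdfRDD.collect()[:5]
print pdfRDD.take(5)   # pulls only the first 5 elements back to the driver
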


On 12/02/2015 08:57 PM, Jim Lohse wrote:
Is this the stderr output from a worker? Are any files being written? 
Can you run in debug and see how far it's getting?


Without the actual logs from $SPARK_HOME or the stderr from the worker UI, 
this doesn't give me a direction to look in.


Just IMHO, maybe someone knows what this means, but it seems like it 
could be caused by a lot of things.


On 12/2/2015 6:48 PM, Darren Govoni wrote:

Hi all,
  Wondering if someone can provide some insight into why this pyspark app 
is just hanging. Here is the output.


...
15/12/03 01:47:05 INFO TaskSetManager: Starting task 21.0 in stage 0.0 (TID 21, 10.65.143.174, PROCESS_LOCAL, 1794787 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 22.0 in stage 0.0 (TID 22, 10.97.144.52, PROCESS_LOCAL, 1801814 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 23.0 in stage 0.0 (TID 23, 10.65.67.146, PROCESS_LOCAL, 1823921 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 24.0 in stage 0.0 (TID 24, 10.144.176.22, PROCESS_LOCAL, 1820713 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 25.0 in stage 0.0 (TID 25, 10.65.143.174, PROCESS_LOCAL, 1850492 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 26.0 in stage 0.0 (TID 26, 10.97.144.52, PROCESS_LOCAL, 1845557 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 27.0 in stage 0.0 (TID 27, 10.65.67.146, PROCESS_LOCAL, 1876187 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 28.0 in stage 0.0 (TID 28, 10.144.176.22, PROCESS_LOCAL, 2054748 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 29.0 in stage 0.0 (TID 29, 10.65.143.174, PROCESS_LOCAL, 1967659 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 30.0 in stage 0.0 (TID 30, 10.97.144.52, PROCESS_LOCAL, 1977909 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 31.0 in stage 0.0 (TID 31, 10.65.67.146, PROCESS_LOCAL, 2084044 bytes)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.65.143.174:39356 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.144.176.22:40904 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.97.144.52:35646 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.65.67.146:44110 (size: 5.2 KB, free: 4.1 GB)


...

In the spark console, it says 0/32 tasks and just sits there. No 
movement.


Thanks in advance,
D





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Pyspark submitted app just hangs

2015-12-02 Thread Darren Govoni

Hi all,
  Wondering if someone can provide some insight into why this pyspark app is 
just hanging. Here is the output.


...
15/12/03 01:47:05 INFO TaskSetManager: Starting task 21.0 in stage 0.0 (TID 21, 10.65.143.174, PROCESS_LOCAL, 1794787 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 22.0 in stage 0.0 (TID 22, 10.97.144.52, PROCESS_LOCAL, 1801814 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 23.0 in stage 0.0 (TID 23, 10.65.67.146, PROCESS_LOCAL, 1823921 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 24.0 in stage 0.0 (TID 24, 10.144.176.22, PROCESS_LOCAL, 1820713 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 25.0 in stage 0.0 (TID 25, 10.65.143.174, PROCESS_LOCAL, 1850492 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 26.0 in stage 0.0 (TID 26, 10.97.144.52, PROCESS_LOCAL, 1845557 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 27.0 in stage 0.0 (TID 27, 10.65.67.146, PROCESS_LOCAL, 1876187 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 28.0 in stage 0.0 (TID 28, 10.144.176.22, PROCESS_LOCAL, 2054748 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 29.0 in stage 0.0 (TID 29, 10.65.143.174, PROCESS_LOCAL, 1967659 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 30.0 in stage 0.0 (TID 30, 10.97.144.52, PROCESS_LOCAL, 1977909 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 31.0 in stage 0.0 (TID 31, 10.65.67.146, PROCESS_LOCAL, 2084044 bytes)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.65.143.174:39356 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.144.176.22:40904 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.97.144.52:35646 (size: 5.2 KB, free: 4.1 GB)
15/12/03 01:47:06 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.65.67.146:44110 (size: 5.2 KB, free: 4.1 GB)


...

In the spark console, it says 0/32 tasks and just sits there. No movement.

Thanks in advance,
D

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Python Kafka support?

2015-11-10 Thread Darren Govoni

Hi,
 I read on this page 
http://spark.apache.org/docs/latest/streaming-kafka-integration.html 
about Python support for "receiverless" Kafka integration (Approach 2), 
but it says it's incomplete as of version 1.4.


Has this been updated in version 1.5.1?

Darren
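
A minimal PySpark sketch of the direct (receiverless) approach that page describes, assuming a Spark build where the Python API for it is available and that the spark-streaming-kafka package is supplied to spark-submit; the topic name and broker address are made up:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-direct-sketch")
ssc = StreamingContext(sc, 10)   # 10-second batches

stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1:9092"})

stream.map(lambda kv: kv[1]).pprint()   # records arrive as (key, value) pairs

ssc.start()
ssc.awaitTermination()
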

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org