Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Chawla,Sumit
Tim,

We will try to run the application in coarse-grained mode and share the
findings with you.

Regards
Sumit Chawla


On Mon, Dec 19, 2016 at 3:11 PM, Timothy Chen  wrote:

> Dynamic allocation works with coarse-grained mode only; we weren't aware of
> a need for fine-grained mode after we enabled dynamic allocation support
> in coarse-grained mode.
>
> What's the reason you're running fine grain mode instead of coarse
> grain + dynamic allocation?
>
> Tim
>
> On Mon, Dec 19, 2016 at 2:45 PM, Mehdi Meziane
>  wrote:
> > We will be interested by the results if you give a try to Dynamic
> allocation
> > with mesos !
> >
> >
> > - Mail Original -
> > De: "Michael Gummelt" 
> > À: "Sumit Chawla" 
> > Cc: u...@mesos.apache.org, d...@mesos.apache.org, "User"
> > , d...@spark.apache.org
> > Envoyé: Lundi 19 Décembre 2016 22h42:55 GMT +01:00 Amsterdam / Berlin /
> > Berne / Rome / Stockholm / Vienne
> > Objet: Re: Mesos Spark Fine Grained Execution - CPU count
> >
> >
> >> Is this problem of idle executors sticking around solved in Dynamic
> >> Resource Allocation?  Is there some timeout after which Idle executors
> can
> >> just shutdown and cleanup its resources.
> >
> > Yes, that's exactly what dynamic allocation does.  But again I have no
> idea
> > what the state of dynamic allocation + mesos is.
> >
> > On Mon, Dec 19, 2016 at 1:32 PM, Chawla,Sumit 
> > wrote:
> >>
> >> Great.  Makes much better sense now.  What will be reason to have
> >> spark.mesos.mesosExecutor.cores more than 1, as this number doesn't
> include
> >> the number of cores for tasks.
> >>
> >> So in my case it seems like 30 CPUs are allocated to executors.  And
> there
> >> are 48 tasks so 48 + 30 =  78 CPUs.  And i am noticing this gap of 30 is
> >> maintained till the last task exits.  This explains the gap.   Thanks
> >> everyone.  I am still not sure how this number 30 is calculated.  ( Is
> it
> >> dynamic based on current resources, or is it some configuration.  I
> have 32
> >> nodes in my cluster).
> >>
> >> Is this problem of idle executors sticking around solved in Dynamic
> >> Resource Allocation?  Is there some timeout after which Idle executors
> can
> >> just shutdown and cleanup its resources.
> >>
> >>
> >> Regards
> >> Sumit Chawla
> >>
> >>
> >> On Mon, Dec 19, 2016 at 12:45 PM, Michael Gummelt <
> mgumm...@mesosphere.io>
> >> wrote:
> >>>
> >>> >  I should preassume that No of executors should be less than number
> of
> >>> > tasks.
> >>>
> >>> No.  Each executor runs 0 or more tasks.
> >>>
> >>> Each executor consumes 1 CPU, and each task running on that executor
> >>> consumes another CPU.  You can customize this via
> >>> spark.mesos.mesosExecutor.cores
> >>> (https://github.com/apache/spark/blob/v1.6.3/docs/running-on-mesos.md)
> and
> >>> spark.task.cpus
> >>> (https://github.com/apache/spark/blob/v1.6.3/docs/configuration.md)
> >>>
> >>> On Mon, Dec 19, 2016 at 12:09 PM, Chawla,Sumit  >
> >>> wrote:
> 
>  Ah thanks. looks like i skipped reading this "Neither will executors
>  terminate when they’re idle."
> 
>  So in my job scenario,  I should preassume that No of executors should
>  be less than number of tasks. Ideally one executor should execute 1
> or more
>  tasks.  But i am observing something strange instead.  I start my job
> with
>  48 partitions for a spark job. In mesos ui i see that number of tasks
> is 48,
>  but no. of CPUs is 78 which is way more than 48.  Here i am assuming
> that 1
>  CPU is 1 executor.   I am not specifying any configuration to set
> number of
>  cores per executor.
> 
>  Regards
>  Sumit Chawla
> 
> 
>  On Mon, Dec 19, 2016 at 11:35 AM, Joris Van Remoortere
>   wrote:
> >
> > That makes sense. From the documentation it looks like the executors
> > are not supposed to terminate:
> >
> > http://spark.apache.org/docs/latest/running-on-mesos.html#
> fine-grained-deprecated
> >>
> >> Note that while Spark tasks in fine-grained will relinquish cores as
> >> they terminate, they will not relinquish memory, as the JVM does
> not give
> >> memory back to the Operating System. Neither will executors
> terminate when
> >> they’re idle.
> >
> >
> > I suppose your task to executor CPU ratio is low enough that it looks
> > like most of the resources are not being reclaimed. If your tasks
> were using
> > significantly more CPU the amortized cost of the idle executors
> would not be
> > such a big deal.
> >
> >
> > —
> > Joris Van Remoortere
> > Mesosphere
> >
> > On Mon, Dec 19, 2016 at 11:26 AM, Timothy Chen 
> > wrote:
> >>
> >> Hi Chawla,
> >>
> >> One possible reason is that Mesos fine grain mode also 

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Timothy Chen
Dynamic allocation works with coarse-grained mode only; we weren't aware of
a need for fine-grained mode after we enabled dynamic allocation support
in coarse-grained mode.

What's the reason you're running fine-grained mode instead of coarse-grained
+ dynamic allocation?

Tim
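
For reference, a minimal PySpark sketch of the switch Tim is suggesting. The app
name and master URL are placeholders, the spark.* keys are the documented
coarse-grained and dynamic-allocation settings, and dynamic allocation also needs
an external shuffle service running on each Mesos agent (see the Spark docs
linked elsewhere in this thread):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("coarse-plus-dynamic-allocation")        # placeholder app name
            .setMaster("mesos://zk://zk-host:2181/mesos")        # placeholder Mesos master URL
            .set("spark.mesos.coarse", "true")                   # coarse-grained mode instead of fine-grained
            .set("spark.dynamicAllocation.enabled", "true")      # grow/shrink executors with the workload
            .set("spark.shuffle.service.enabled", "true")        # needed so executors can be removed safely
            .set("spark.dynamicAllocation.executorIdleTimeout", "60s")  # release executors idle this long
            .set("spark.dynamicAllocation.minExecutors", "1")    # example bounds
            .set("spark.dynamicAllocation.maxExecutors", "48"))
    sc = SparkContext(conf=conf)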

On Mon, Dec 19, 2016 at 2:45 PM, Mehdi Meziane
 wrote:
> We will be interested by the results if you give a try to Dynamic allocation
> with mesos !
>
>
> - Mail Original -
> De: "Michael Gummelt" 
> À: "Sumit Chawla" 
> Cc: u...@mesos.apache.org, d...@mesos.apache.org, "User"
> , d...@spark.apache.org
> Envoyé: Lundi 19 Décembre 2016 22h42:55 GMT +01:00 Amsterdam / Berlin /
> Berne / Rome / Stockholm / Vienne
> Objet: Re: Mesos Spark Fine Grained Execution - CPU count
>
>
>> Is this problem of idle executors sticking around solved in Dynamic
>> Resource Allocation?  Is there some timeout after which Idle executors can
>> just shutdown and cleanup its resources.
>
> Yes, that's exactly what dynamic allocation does.  But again I have no idea
> what the state of dynamic allocation + mesos is.
>
> On Mon, Dec 19, 2016 at 1:32 PM, Chawla,Sumit 
> wrote:
>>
>> Great.  Makes much better sense now.  What will be reason to have
>> spark.mesos.mesosExecutor.cores more than 1, as this number doesn't include
>> the number of cores for tasks.
>>
>> So in my case it seems like 30 CPUs are allocated to executors.  And there
>> are 48 tasks so 48 + 30 =  78 CPUs.  And i am noticing this gap of 30 is
>> maintained till the last task exits.  This explains the gap.   Thanks
>> everyone.  I am still not sure how this number 30 is calculated.  ( Is it
>> dynamic based on current resources, or is it some configuration.  I have 32
>> nodes in my cluster).
>>
>> Is this problem of idle executors sticking around solved in Dynamic
>> Resource Allocation?  Is there some timeout after which Idle executors can
>> just shutdown and cleanup its resources.
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>> On Mon, Dec 19, 2016 at 12:45 PM, Michael Gummelt 
>> wrote:
>>>
>>> >  I should preassume that No of executors should be less than number of
>>> > tasks.
>>>
>>> No.  Each executor runs 0 or more tasks.
>>>
>>> Each executor consumes 1 CPU, and each task running on that executor
>>> consumes another CPU.  You can customize this via
>>> spark.mesos.mesosExecutor.cores
>>> (https://github.com/apache/spark/blob/v1.6.3/docs/running-on-mesos.md) and
>>> spark.task.cpus
>>> (https://github.com/apache/spark/blob/v1.6.3/docs/configuration.md)
>>>
>>> On Mon, Dec 19, 2016 at 12:09 PM, Chawla,Sumit 
>>> wrote:

 Ah thanks. looks like i skipped reading this "Neither will executors
 terminate when they’re idle."

 So in my job scenario,  I should preassume that No of executors should
 be less than number of tasks. Ideally one executor should execute 1 or more
 tasks.  But i am observing something strange instead.  I start my job with
 48 partitions for a spark job. In mesos ui i see that number of tasks is 
 48,
 but no. of CPUs is 78 which is way more than 48.  Here i am assuming that 1
 CPU is 1 executor.   I am not specifying any configuration to set number of
 cores per executor.

 Regards
 Sumit Chawla


 On Mon, Dec 19, 2016 at 11:35 AM, Joris Van Remoortere
  wrote:
>
> That makes sense. From the documentation it looks like the executors
> are not supposed to terminate:
>
> http://spark.apache.org/docs/latest/running-on-mesos.html#fine-grained-deprecated
>>
>> Note that while Spark tasks in fine-grained will relinquish cores as
>> they terminate, they will not relinquish memory, as the JVM does not give
>> memory back to the Operating System. Neither will executors terminate 
>> when
>> they’re idle.
>
>
> I suppose your task to executor CPU ratio is low enough that it looks
> like most of the resources are not being reclaimed. If your tasks were 
> using
> significantly more CPU the amortized cost of the idle executors would not 
> be
> such a big deal.
>
>
> —
> Joris Van Remoortere
> Mesosphere
>
> On Mon, Dec 19, 2016 at 11:26 AM, Timothy Chen 
> wrote:
>>
>> Hi Chawla,
>>
>> One possible reason is that Mesos fine grain mode also takes up cores
>> to run the executor per host, so if you have 20 agents running Fine
>> grained executor it will take up 20 cores while it's still running.
>>
>> Tim
>>
>> On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit 
>> wrote:
>> > Hi
>> >
>> > I am using Spark 1.6. I have one query about Fine Grained model in
>> > Spark.
>> > I have a simple Spark application which 

Loading a class from a dependency jar

2016-12-19 Thread viraj
Hi,

I am currently using the Kite library (https://github.com/kite-sdk/kite) to
persist to HBase from my Spark job. All of this happens in the driver.

I am on Spark version 1.6.1.

The problem I am facing is that a particular class in one of the dependency
jars is not found by kite when it uses Class.forName and I get a
ClassNotFoundException.
For example:

I have my spark job and 2 relevant dependencies:
1. kite jar
2. dependency1 jar

Kite is trying to read a class which is present in dependency1 jar using
https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-hbase/src/main/java/org/kitesdk/data/hbase/HBaseDatasetRepository.java#L163

I have tried:
1. Adding all the jars using the --jars option in spark-submit.
2. Adding all the jars using --driver-class-path option in spark-submit.

I am also registering the class using the KryoRegistrator present in the
twitter-chill library without any issues.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Loading-a-class-from-a-dependency-jar-tp28238.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Mehdi Meziane
We would be interested in the results if you give Dynamic Allocation with
Mesos a try!



- Mail Original - 
De: "Michael Gummelt"  
À: "Sumit Chawla"  
Cc: u...@mesos.apache.org, d...@mesos.apache.org, "User" 
, d...@spark.apache.org 
Envoyé: Lundi 19 Décembre 2016 22h42:55 GMT +01:00 Amsterdam / Berlin / Berne / 
Rome / Stockholm / Vienne 
Objet: Re: Mesos Spark Fine Grained Execution - CPU count 



> Is this problem of idle executors sticking around solved in Dynamic Resource 
> Allocation? Is there some timeout after which Idle executors can just 
> shutdown and cleanup its resources. 

Yes, that's exactly what dynamic allocation does. But again I have no idea what 
the state of dynamic allocation + mesos is. 



On Mon, Dec 19, 2016 at 1:32 PM, Chawla,Sumit < sumitkcha...@gmail.com > wrote: 



Great. Makes much better sense now. What will be reason to have 
spark.mesos.mesosExecutor. cores more than 1, as this number doesn't include 
the number of cores for tasks. 


So in my case it seems like 30 CPUs are allocated to executors. And there are 
48 tasks so 48 + 30 = 78 CPUs. And i am noticing this gap of 30 is maintained 
till the last task exits. This explains the gap. Thanks everyone. I am still 
not sure how this number 30 is calculated. ( Is it dynamic based on current 
resources, or is it some configuration. I have 32 nodes in my cluster). 


Is this problem of idle executors sticking around solved in Dynamic Resource 
Allocation? Is there some timeout after which Idle executors can just shutdown 
and cleanup its resources. 





Regards 
Sumit Chawla 





On Mon, Dec 19, 2016 at 12:45 PM, Michael Gummelt < mgumm...@mesosphere.io > 
wrote: 





> I should preassume that No of executors should be less than number of tasks. 

No. Each executor runs 0 or more tasks. 

Each executor consumes 1 CPU, and each task running on that executor consumes 
another CPU. You can customize this via spark.mesos.mesosExecutor.cores ( 
https://github.com/apache/spark/blob/v1.6.3/docs/running-on-mesos.md ) and 
spark.task.cpus ( 
https://github.com/apache/spark/blob/v1.6.3/docs/configuration.md ) 





On Mon, Dec 19, 2016 at 12:09 PM, Chawla,Sumit < sumitkcha...@gmail.com > 
wrote: 



Ah thanks. looks like i skipped reading this " Neither will executors terminate 
when they’re idle." 


So in my job scenario, I should preassume that No of executors should be less 
than number of tasks. Ideally one executor should execute 1 or more tasks. But 
i am observing something strange instead. I start my job with 48 partitions for 
a spark job. In mesos ui i see that number of tasks is 48, but no. of CPUs is 
78 which is way more than 48. Here i am assuming that 1 CPU is 1 executor. I am 
not specifying any configuration to set number of cores per executor. 



Regards 
Sumit Chawla 





On Mon, Dec 19, 2016 at 11:35 AM, Joris Van Remoortere < jo...@mesosphere.io > 
wrote: 



That makes sense. From the documentation it looks like the executors are not 
supposed to terminate: 
http://spark.apache.org/docs/latest/running-on-mesos.html#fine-grained-deprecated
 


Note that while Spark tasks in fine-grained will relinquish cores as they 
terminate, they will not relinquish memory, as the JVM does not give memory 
back to the Operating System. Neither will executors terminate when they’re 
idle. 


I suppose your task to executor CPU ratio is low enough that it looks like most 
of the resources are not being reclaimed. If your tasks were using 
significantly more CPU the amortized cost of the idle executors would not be 
such a big deal. 






— 
Joris Van Remoortere 
Mesosphere 

On Mon, Dec 19, 2016 at 11:26 AM, Timothy Chen < tnac...@gmail.com > wrote: 


Hi Chawla, 

One possible reason is that Mesos fine grain mode also takes up cores 
to run the executor per host, so if you have 20 agents running Fine 
grained executor it will take up 20 cores while it's still running. 

Tim 

On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit < sumitkcha...@gmail.com > wrote: 


> Hi 
> 
> I am using Spark 1.6. I have one query about Fine Grained model in Spark. 
> I have a simple Spark application which transforms A -> B. Its a single 
> stage application. To begin the program, It starts with 48 partitions. 
> When the program starts running, in mesos UI it shows 48 tasks and 48 CPUs 
> allocated to job. Now as the tasks get done, the number of active tasks 
> number starts decreasing. How ever, the number of CPUs does not decrease 
> propotionally. When the job was about to finish, there was a single 
> remaininig task, however CPU count was still 20. 
> 
> My questions, is why there is no one to one mapping between tasks and cpus 
> in Fine grained? How can these CPUs be released when the job is done, so 
> that other jobs can start. 
> 
> 
> Regards 
> Sumit Chawla 





-- 







Michael Gummelt 
Software Engineer 
Mesosphere 




-- 








Re: PySpark: [Errno 8] nodename nor servname provided, or not known

2016-12-19 Thread Jain, Nishit
Found it. Somehow my host mapping was messing it up. Changing it to point to
localhost worked:

/etc/hosts

#127.0.0.1 XX.com
127.0.0.1 localhost

From: "Jain, Nishit" >
Date: Monday, December 19, 2016 at 2:54 PM
To: "user@spark.apache.org" 
>
Subject: PySpark: [Errno 8] nodename nor servname provided, or not known

Hi,

I am using pre-built 'spark-2.0.1-bin-hadoop2.7’ and when I try to start 
pyspark, I get following message.
Any ideas what could be wrong? I tried using python3, setting SPARK_LOCAL_IP to 
127.0.0.1 but same error.


~ -> cd /Applications/spark-2.0.1-bin-hadoop2.7/bin/
/Applications/spark-2.0.1-bin-hadoop2.7/bin -> pyspark
Python 2.7.12 (default, Oct 11 2016, 05:24:00)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/12/19 14:50:47 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/12/19 14:50:47 WARN Utils: Your hostname, XX.com resolves to a loopback 
address: 127.0.0.1; using XX.XX.XX.XXX instead (on interface en0)
16/12/19 14:50:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
Traceback (most recent call last):
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/shell.py", line 
43, in 
spark = SparkSession.builder\
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/session.py", 
line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 294, in getOrCreate
SparkContext(conf=conf or SparkConf())
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 115, in __init__
conf, jsc, profiler_cls)
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 174, in _do_init
self._accumulatorServer = accumulators._start_update_server()
  File 
"/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/accumulators.py", line 
259, in _start_update_server
server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler)
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py",
 line 417, in __init__
self.server_bind()
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py",
 line 431, in server_bind
self.socket.bind(self.server_address)
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py",
 line 228, in meth
return getattr(self._sock,name)(*args)
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

Thanks,
Nishit


Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Michael Gummelt
> Is this problem of idle executors sticking around solved in Dynamic
Resource Allocation?  Is there some timeout after which Idle executors can
just shutdown and cleanup its resources.

Yes, that's exactly what dynamic allocation does.  But again I have no idea
what the state of dynamic allocation + mesos is.

On Mon, Dec 19, 2016 at 1:32 PM, Chawla,Sumit 
wrote:

> Great.  Makes much better sense now.  What will be reason to have
> spark.mesos.mesosExecutor.cores more than 1, as this number doesn't
> include the number of cores for tasks.
>
> So in my case it seems like 30 CPUs are allocated to executors.  And there
> are 48 tasks so 48 + 30 =  78 CPUs.  And i am noticing this gap of 30 is
> maintained till the last task exits.  This explains the gap.   Thanks
> everyone.  I am still not sure how this number 30 is calculated.  ( Is it
> dynamic based on current resources, or is it some configuration.  I have 32
> nodes in my cluster).
>
> Is this problem of idle executors sticking around solved in Dynamic
> Resource Allocation?  Is there some timeout after which Idle executors can
> just shutdown and cleanup its resources.
>
>
> Regards
> Sumit Chawla
>
>
> On Mon, Dec 19, 2016 at 12:45 PM, Michael Gummelt 
> wrote:
>
>> >  I should preassume that No of executors should be less than number of
>> tasks.
>>
>> No.  Each executor runs 0 or more tasks.
>>
>> Each executor consumes 1 CPU, and each task running on that executor
>> consumes another CPU.  You can customize this via
>> spark.mesos.mesosExecutor.cores (https://github.com/apache/spa
>> rk/blob/v1.6.3/docs/running-on-mesos.md) and spark.task.cpus (
>> https://github.com/apache/spark/blob/v1.6.3/docs/configuration.md)
>>
>> On Mon, Dec 19, 2016 at 12:09 PM, Chawla,Sumit 
>> wrote:
>>
>>> Ah thanks. looks like i skipped reading this *"Neither will executors
>>> terminate when they’re idle."*
>>>
>>> So in my job scenario,  I should preassume that No of executors should
>>> be less than number of tasks. Ideally one executor should execute 1 or more
>>> tasks.  But i am observing something strange instead.  I start my job with
>>> 48 partitions for a spark job. In mesos ui i see that number of tasks is
>>> 48, but no. of CPUs is 78 which is way more than 48.  Here i am assuming
>>> that 1 CPU is 1 executor.   I am not specifying any configuration to set
>>> number of cores per executor.
>>>
>>> Regards
>>> Sumit Chawla
>>>
>>>
>>> On Mon, Dec 19, 2016 at 11:35 AM, Joris Van Remoortere <
>>> jo...@mesosphere.io> wrote:
>>>
 That makes sense. From the documentation it looks like the executors
 are not supposed to terminate:
 http://spark.apache.org/docs/latest/running-on-mesos.html#fi
 ne-grained-deprecated

> Note that while Spark tasks in fine-grained will relinquish cores as
> they terminate, they will not relinquish memory, as the JVM does not give
> memory back to the Operating System. Neither will executors terminate when
> they’re idle.


 I suppose your task to executor CPU ratio is low enough that it looks
 like most of the resources are not being reclaimed. If your tasks were
 using significantly more CPU the amortized cost of the idle executors would
 not be such a big deal.


 —
 *Joris Van Remoortere*
 Mesosphere

 On Mon, Dec 19, 2016 at 11:26 AM, Timothy Chen 
 wrote:

> Hi Chawla,
>
> One possible reason is that Mesos fine grain mode also takes up cores
> to run the executor per host, so if you have 20 agents running Fine
> grained executor it will take up 20 cores while it's still running.
>
> Tim
>
> On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit 
> wrote:
> > Hi
> >
> > I am using Spark 1.6. I have one query about Fine Grained model in
> Spark.
> > I have a simple Spark application which transforms A -> B.  Its a
> single
> > stage application.  To begin the program, It starts with 48
> partitions.
> > When the program starts running, in mesos UI it shows 48 tasks and
> 48 CPUs
> > allocated to job.  Now as the tasks get done, the number of active
> tasks
> > number starts decreasing.  How ever, the number of CPUs does not
> decrease
> > propotionally.  When the job was about to finish, there was a single
> > remaininig task, however CPU count was still 20.
> >
> > My questions, is why there is no one to one mapping between tasks
> and cpus
> > in Fine grained?  How can these CPUs be released when the job is
> done, so
> > that other jobs can start.
> >
> >
> > Regards
> > Sumit Chawla
>


>>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Chawla,Sumit
Great. Makes much better sense now. What would be the reason to set
spark.mesos.mesosExecutor.cores to more than 1, given that this number doesn't
include the number of cores for tasks?

So in my case it seems like 30 CPUs are allocated to executors. And there
are 48 tasks, so 48 + 30 = 78 CPUs. I am noticing this gap of 30 is
maintained till the last task exits, which explains the gap. Thanks
everyone. I am still not sure how this number 30 is calculated. (Is it
dynamic based on current resources, or is it some configuration? I have 32
nodes in my cluster.)

Is this problem of idle executors sticking around solved by Dynamic
Resource Allocation? Is there some timeout after which idle executors can
just shut down and clean up their resources?


Regards
Sumit Chawla


On Mon, Dec 19, 2016 at 12:45 PM, Michael Gummelt 
wrote:

> >  I should preassume that No of executors should be less than number of
> tasks.
>
> No.  Each executor runs 0 or more tasks.
>
> Each executor consumes 1 CPU, and each task running on that executor
> consumes another CPU.  You can customize this via 
> spark.mesos.mesosExecutor.cores
> (https://github.com/apache/spark/blob/v1.6.3/docs/running-on-mesos.md)
> and spark.task.cpus (https://github.com/apache/spark/blob/v1.6.3/docs/
> configuration.md)
>
> On Mon, Dec 19, 2016 at 12:09 PM, Chawla,Sumit 
> wrote:
>
>> Ah thanks. looks like i skipped reading this *"Neither will executors
>> terminate when they’re idle."*
>>
>> So in my job scenario,  I should preassume that No of executors should be
>> less than number of tasks. Ideally one executor should execute 1 or more
>> tasks.  But i am observing something strange instead.  I start my job with
>> 48 partitions for a spark job. In mesos ui i see that number of tasks is
>> 48, but no. of CPUs is 78 which is way more than 48.  Here i am assuming
>> that 1 CPU is 1 executor.   I am not specifying any configuration to set
>> number of cores per executor.
>>
>> Regards
>> Sumit Chawla
>>
>>
>> On Mon, Dec 19, 2016 at 11:35 AM, Joris Van Remoortere <
>> jo...@mesosphere.io> wrote:
>>
>>> That makes sense. From the documentation it looks like the executors are
>>> not supposed to terminate:
>>> http://spark.apache.org/docs/latest/running-on-mesos.html#fi
>>> ne-grained-deprecated
>>>
 Note that while Spark tasks in fine-grained will relinquish cores as
 they terminate, they will not relinquish memory, as the JVM does not give
 memory back to the Operating System. Neither will executors terminate when
 they’re idle.
>>>
>>>
>>> I suppose your task to executor CPU ratio is low enough that it looks
>>> like most of the resources are not being reclaimed. If your tasks were
>>> using significantly more CPU the amortized cost of the idle executors would
>>> not be such a big deal.
>>>
>>>
>>> —
>>> *Joris Van Remoortere*
>>> Mesosphere
>>>
>>> On Mon, Dec 19, 2016 at 11:26 AM, Timothy Chen 
>>> wrote:
>>>
 Hi Chawla,

 One possible reason is that Mesos fine grain mode also takes up cores
 to run the executor per host, so if you have 20 agents running Fine
 grained executor it will take up 20 cores while it's still running.

 Tim

 On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit 
 wrote:
 > Hi
 >
 > I am using Spark 1.6. I have one query about Fine Grained model in
 Spark.
 > I have a simple Spark application which transforms A -> B.  Its a
 single
 > stage application.  To begin the program, It starts with 48
 partitions.
 > When the program starts running, in mesos UI it shows 48 tasks and 48
 CPUs
 > allocated to job.  Now as the tasks get done, the number of active
 tasks
 > number starts decreasing.  How ever, the number of CPUs does not
 decrease
 > propotionally.  When the job was about to finish, there was a single
 > remaininig task, however CPU count was still 20.
 >
 > My questions, is why there is no one to one mapping between tasks and
 cpus
 > in Fine grained?  How can these CPUs be released when the job is
 done, so
 > that other jobs can start.
 >
 >
 > Regards
 > Sumit Chawla

>>>
>>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


PySpark: [Errno 8] nodename nor servname provided, or not known

2016-12-19 Thread Jain, Nishit
Hi,

I am using the pre-built 'spark-2.0.1-bin-hadoop2.7' and when I try to start
pyspark, I get the following message.
Any ideas what could be wrong? I tried using python3 and setting SPARK_LOCAL_IP
to 127.0.0.1, but got the same error.


~ -> cd /Applications/spark-2.0.1-bin-hadoop2.7/bin/
/Applications/spark-2.0.1-bin-hadoop2.7/bin -> pyspark
Python 2.7.12 (default, Oct 11 2016, 05:24:00)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/12/19 14:50:47 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/12/19 14:50:47 WARN Utils: Your hostname, XX.com resolves to a loopback 
address: 127.0.0.1; using XX.XX.XX.XXX instead (on interface en0)
16/12/19 14:50:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
Traceback (most recent call last):
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/shell.py", line 
43, in 
spark = SparkSession.builder\
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/session.py", 
line 169, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 294, in getOrCreate
SparkContext(conf=conf or SparkConf())
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 115, in __init__
conf, jsc, profiler_cls)
  File "/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", 
line 174, in _do_init
self._accumulatorServer = accumulators._start_update_server()
  File 
"/Applications/spark-2.0.1-bin-hadoop2.7/python/pyspark/accumulators.py", line 
259, in _start_update_server
server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler)
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py",
 line 417, in __init__
self.server_bind()
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/SocketServer.py",
 line 431, in server_bind
self.socket.bind(self.server_address)
  File 
"/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py",
 line 228, in meth
return getattr(self._sock,name)(*args)
socket.gaierror: [Errno 8] nodename nor servname provided, or not known

Thanks,
Nishit


Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Michael Gummelt
>  I should preassume that No of executors should be less than number of
tasks.

No.  Each executor runs 0 or more tasks.

Each executor consumes 1 CPU, and each task running on that executor
consumes another CPU.  You can customize this via
spark.mesos.mesosExecutor.cores (
https://github.com/apache/spark/blob/v1.6.3/docs/running-on-mesos.md) and
spark.task.cpus (
https://github.com/apache/spark/blob/v1.6.3/docs/configuration.md)
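
To make the accounting concrete, here is a small sketch with both knobs at
their default value of 1, plus the arithmetic from the numbers reported earlier
in this thread (the 30 live executors are taken from Sumit's observation on a
32-node cluster, not something Spark computes):

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.mesos.coarse", "false")            # fine-grained mode (deprecated)
            .set("spark.mesos.mesosExecutor.cores", "1")   # CPUs each executor holds for itself (default 1)
            .set("spark.task.cpus", "1"))                  # CPUs each running task holds (default 1)

    # CPUs shown in the Mesos UI = running tasks * spark.task.cpus
    #                            + live executors * spark.mesos.mesosExecutor.cores
    # e.g. 48 * 1 + 30 * 1 = 78, which matches the gap of 30 observed above.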

On Mon, Dec 19, 2016 at 12:09 PM, Chawla,Sumit 
wrote:

> Ah thanks. looks like i skipped reading this *"Neither will executors
> terminate when they’re idle."*
>
> So in my job scenario,  I should preassume that No of executors should be
> less than number of tasks. Ideally one executor should execute 1 or more
> tasks.  But i am observing something strange instead.  I start my job with
> 48 partitions for a spark job. In mesos ui i see that number of tasks is
> 48, but no. of CPUs is 78 which is way more than 48.  Here i am assuming
> that 1 CPU is 1 executor.   I am not specifying any configuration to set
> number of cores per executor.
>
> Regards
> Sumit Chawla
>
>
> On Mon, Dec 19, 2016 at 11:35 AM, Joris Van Remoortere <
> jo...@mesosphere.io> wrote:
>
>> That makes sense. From the documentation it looks like the executors are
>> not supposed to terminate:
>> http://spark.apache.org/docs/latest/running-on-mesos.html#fi
>> ne-grained-deprecated
>>
>>> Note that while Spark tasks in fine-grained will relinquish cores as
>>> they terminate, they will not relinquish memory, as the JVM does not give
>>> memory back to the Operating System. Neither will executors terminate when
>>> they’re idle.
>>
>>
>> I suppose your task to executor CPU ratio is low enough that it looks
>> like most of the resources are not being reclaimed. If your tasks were
>> using significantly more CPU the amortized cost of the idle executors would
>> not be such a big deal.
>>
>>
>> —
>> *Joris Van Remoortere*
>> Mesosphere
>>
>> On Mon, Dec 19, 2016 at 11:26 AM, Timothy Chen  wrote:
>>
>>> Hi Chawla,
>>>
>>> One possible reason is that Mesos fine grain mode also takes up cores
>>> to run the executor per host, so if you have 20 agents running Fine
>>> grained executor it will take up 20 cores while it's still running.
>>>
>>> Tim
>>>
>>> On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit 
>>> wrote:
>>> > Hi
>>> >
>>> > I am using Spark 1.6. I have one query about Fine Grained model in
>>> Spark.
>>> > I have a simple Spark application which transforms A -> B.  Its a
>>> single
>>> > stage application.  To begin the program, It starts with 48 partitions.
>>> > When the program starts running, in mesos UI it shows 48 tasks and 48
>>> CPUs
>>> > allocated to job.  Now as the tasks get done, the number of active
>>> tasks
>>> > number starts decreasing.  How ever, the number of CPUs does not
>>> decrease
>>> > propotionally.  When the job was about to finish, there was a single
>>> > remaininig task, however CPU count was still 20.
>>> >
>>> > My questions, is why there is no one to one mapping between tasks and
>>> cpus
>>> > in Fine grained?  How can these CPUs be released when the job is done,
>>> so
>>> > that other jobs can start.
>>> >
>>> >
>>> > Regards
>>> > Sumit Chawla
>>>
>>
>>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Chawla,Sumit
Ah, thanks. Looks like I skipped reading this: *"Neither will executors
terminate when they’re idle."*

So in my job scenario, I should presume that the number of executors can be
less than the number of tasks; ideally one executor should execute one or more
tasks. But I am observing something strange instead. I start my job with
48 partitions for a Spark job. In the Mesos UI I see that the number of tasks
is 48, but the number of CPUs is 78, which is way more than 48. Here I am
assuming that 1 CPU is 1 executor. I am not specifying any configuration to
set the number of cores per executor.

Regards
Sumit Chawla


On Mon, Dec 19, 2016 at 11:35 AM, Joris Van Remoortere 
wrote:

> That makes sense. From the documentation it looks like the executors are
> not supposed to terminate:
> http://spark.apache.org/docs/latest/running-on-mesos.html#
> fine-grained-deprecated
>
>> Note that while Spark tasks in fine-grained will relinquish cores as they
>> terminate, they will not relinquish memory, as the JVM does not give memory
>> back to the Operating System. Neither will executors terminate when they’re
>> idle.
>
>
> I suppose your task to executor CPU ratio is low enough that it looks like
> most of the resources are not being reclaimed. If your tasks were using
> significantly more CPU the amortized cost of the idle executors would not
> be such a big deal.
>
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Mon, Dec 19, 2016 at 11:26 AM, Timothy Chen  wrote:
>
>> Hi Chawla,
>>
>> One possible reason is that Mesos fine grain mode also takes up cores
>> to run the executor per host, so if you have 20 agents running Fine
>> grained executor it will take up 20 cores while it's still running.
>>
>> Tim
>>
>> On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit 
>> wrote:
>> > Hi
>> >
>> > I am using Spark 1.6. I have one query about Fine Grained model in
>> Spark.
>> > I have a simple Spark application which transforms A -> B.  Its a single
>> > stage application.  To begin the program, It starts with 48 partitions.
>> > When the program starts running, in mesos UI it shows 48 tasks and 48
>> CPUs
>> > allocated to job.  Now as the tasks get done, the number of active tasks
>> > number starts decreasing.  How ever, the number of CPUs does not
>> decrease
>> > propotionally.  When the job was about to finish, there was a single
>> > remaininig task, however CPU count was still 20.
>> >
>> > My questions, is why there is no one to one mapping between tasks and
>> cpus
>> > in Fine grained?  How can these CPUs be released when the job is done,
>> so
>> > that other jobs can start.
>> >
>> >
>> > Regards
>> > Sumit Chawla
>>
>
>


Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Timothy Chen
Hi Chawla,

One possible reason is that Mesos fine-grained mode also takes up cores
to run the executor on each host, so if you have 20 agents running the
fine-grained executor, it will take up 20 cores while they are still running.

Tim

On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit  wrote:
> Hi
>
> I am using Spark 1.6 and have one question about the fine-grained model in Spark.
> I have a simple Spark application which transforms A -> B.  It's a single-
> stage application.  To begin, the program starts with 48 partitions.
> When the program starts running, the Mesos UI shows 48 tasks and 48 CPUs
> allocated to the job.  Now as the tasks get done, the number of active
> tasks starts decreasing.  However, the number of CPUs does not decrease
> proportionally.  When the job was about to finish, there was a single
> remaining task, yet the CPU count was still 20.
>
> My question is why there is no one-to-one mapping between tasks and CPUs
> in fine-grained mode, and how these CPUs can be released when the job is done, so
> that other jobs can start.
>
>
> Regards
> Sumit Chawla

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Michael Gummelt
Yea, the idea is to use dynamic allocation.  I can't speak to how well it
works with Mesos, though.

On Mon, Dec 19, 2016 at 11:01 AM, Mehdi Meziane 
wrote:

> I think that what you are looking for is Dynamic resource allocation:
> http://spark.apache.org/docs/latest/job-scheduling.html#
> dynamic-resource-allocation
>
> Spark provides a mechanism to dynamically adjust the resources your
> application occupies based on the workload. This means that your
> application may give resources back to the cluster if they are no longer
> used and request them again later when there is demand. This feature is
> particularly useful if multiple applications share resources in your Spark
> cluster.
>
> - Mail Original -
> De: "Sumit Chawla" 
> À: "Michael Gummelt" 
> Cc: u...@mesos.apache.org, "Dev" , "User" <
> user@spark.apache.org>, "dev" 
> Envoyé: Lundi 19 Décembre 2016 19h35:51 GMT +01:00 Amsterdam / Berlin /
> Berne / Rome / Stockholm / Vienne
> Objet: Re: Mesos Spark Fine Grained Execution - CPU count
>
>
> But coarse grained does the exact same thing which i am trying to avert
> here.  At the cost of lower startup, it keeps the resources reserved till
> the entire duration of the job.
>
> Regards
> Sumit Chawla
>
>
> On Mon, Dec 19, 2016 at 10:06 AM, Michael Gummelt 
> wrote:
>
>> Hi
>>
>> I don't have a lot of experience with the fine-grained scheduler.  It's
>> deprecated and fairly old now.  CPUs should be relinquished as tasks
>> complete, so I'm not sure why you're seeing what you're seeing.  There have
>> been a few discussions on the spark list regarding deprecating the
>> fine-grained scheduler, and no one seemed too dead-set on keeping it.  I'd
>> recommend you move over to coarse-grained.
>>
>> On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit 
>> wrote:
>>
>>> Hi
>>>
>>> I am using Spark 1.6. I have one query about Fine Grained model in
>>> Spark.  I have a simple Spark application which transforms A -> B.  Its a
>>> single stage application.  To begin the program, It starts with 48
>>> partitions.  When the program starts running, in mesos UI it shows 48 tasks
>>> and 48 CPUs allocated to job.  Now as the tasks get done, the number of
>>> active tasks number starts decreasing.  How ever, the number of CPUs does
>>> not decrease propotionally.  When the job was about to finish, there was a
>>> single remaininig task, however CPU count was still 20.
>>>
>>> My questions, is why there is no one to one mapping between tasks and
>>> cpus in Fine grained?  How can these CPUs be released when the job is done,
>>> so that other jobs can start.
>>>
>>>
>>> Regards
>>> Sumit Chawla
>>>
>>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: [Spark SQL] Task failed while writing rows

2016-12-19 Thread Michael Stratton
I don't think the issue is an empty partition, but it may not hurt to try a
repartition prior to writing, just to rule it out, given the premature EOF
exception.
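
A minimal sketch of that suggestion in PySpark, assuming the dataset is a
DataFrame named df; the partition count and output path are placeholders:

    # Repartition before writing, just to rule out skewed or oddly sized partitions
    (df.repartition(200)                      # example partition count
       .write
       .format("orc")
       .mode("overwrite")
       .save("hdfs:///tmp/deduped-output"))   # placeholder output path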

On Mon, Dec 19, 2016 at 1:53 PM, Joseph Naegele  wrote:

> Thanks Michael, hdfs dfsadmin -report tells me:
>
>
>
> Configured Capacity: 7999424823296 (7.28 TB)
>
> Present Capacity: 7997657774971 (7.27 TB)
>
> DFS Remaining: 7959091768187 (7.24 TB)
>
> DFS Used: 38566006784 (35.92 GB)
>
> DFS Used%: 0.48%
>
> Under replicated blocks: 0
>
> Blocks with corrupt replicas: 0
>
> Missing blocks: 0
>
> Missing blocks (with replication factor 1): 0
>
>
>
> -
>
> Live datanodes (1):
>
>
>
> Name: 127.0.0.1:50010 (localhost)
>
> Hostname: XXX.XXX.XXX
>
> Decommission Status : Normal
>
> Configured Capacity: 7999424823296 (7.28 TB)
>
> DFS Used: 38566006784 (35.92 GB)
>
> Non DFS Used: 1767048325 (1.65 GB)
>
> DFS Remaining: 7959091768187 (7.24 TB)
>
> DFS Used%: 0.48%
>
> DFS Remaining%: 99.50%
>
> Configured Cache Capacity: 0 (0 B)
>
> Cache Used: 0 (0 B)
>
> Cache Remaining: 0 (0 B)
>
> Cache Used%: 100.00%
>
> Cache Remaining%: 0.00%
>
> Xceivers: 17
>
> Last contact: Mon Dec 19 13:00:06 EST 2016
>
>
>
> The Hadoop exception occurs because it times out after 60 seconds in a
> “select” call on a java.nio.channels.SocketChannel, while waiting to read
> from the socket. This implies the client writer isn’t writing on the socket
> as expected, but shouldn’t this all be handled by the Hadoop library within
> Spark?
>
>
>
> It looks like a few similar, but rare, cases have been reported before,
> e.g. https://issues.apache.org/jira/browse/HDFS-770 which is *very* old.
>
>
>
> If you’re pretty sure Spark couldn’t be responsible for issues at this
> level I’ll stick to the Hadoop mailing list.
>
>
>
> Thanks
>
> ---
>
> Joe Naegele
>
> Grier Forensics
>
>
>
> *From:* Michael Stratton [mailto:michael.strat...@komodohealth.com]
> *Sent:* Monday, December 19, 2016 10:00 AM
> *To:* Joseph Naegele 
> *Cc:* user 
> *Subject:* Re: [Spark SQL] Task failed while writing rows
>
>
>
> It seems like an issue w/ Hadoop. What do you get when you run hdfs
> dfsadmin -report?
>
>
>
> Anecdotally(And w/o specifics as it has been a while), I've generally used
> Parquet instead of ORC as I've gotten a bunch of random problems reading
> and writing ORC w/ Spark... but given ORC performs a lot better w/ Hive it
> can be a pain.
>
>
>
> On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele <
> jnaeg...@grierforensics.com> wrote:
>
> Hi all,
>
> I'm having trouble with a relatively simple Spark SQL job. I'm using Spark
> 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record).
> It's current compressed size is around 13 GB, but my problem started when
> it was much smaller, maybe 5 GB. This dataset is generated by performing a
> query on an existing ORC dataset in HDFS, selecting a subset of the
> existing data (i.e. removing duplicates). When I write this dataset to HDFS
> using ORC I get the following exceptions in the driver:
>
> org.apache.spark.SparkException: Task failed while writing rows
> Caused by: java.lang.RuntimeException: Failed to commit task
> Suppressed: java.lang.IllegalArgumentException: Column has wrong number
> of index entries found: 0 expected: 32
>
> Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad.
> Aborting...
>
> This happens multiple times. The executors tell me the following a few
> times before the same exceptions as above:
>
>
>
> 2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output
> committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
>
> 2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor
> exception  for block BP-1695049761-192.168.2.211-
> 1479228275669:blk_1073862425_121642
>
> java.io.EOFException: Premature EOF: no length prefix available
>
> at org.apache.hadoop.hdfs.protocolPB.PBHelper.
> vintPrefixed(PBHelper.java:2203)
>
> at org.apache.hadoop.hdfs.protocol.datatransfer.
> PipelineAck.readFields(PipelineAck.java:176)
>
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$
> ResponseProcessor.run(DFSOutputStream.java:867)
>
>
> My HDFS datanode says:
>
> 2016-12-09 02:39:24,783 INFO 
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op:
> HDFS_WRITE, cliID: 
> DFSClient_attempt_201612090102__m_25_0_956624542_193,
> offset: 0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid:
> BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637,
> duration: 93026972
>
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder: 
> BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637,
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>
> 2016-12-09 02:39:49,262 

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Mehdi Meziane
I think that what you are looking for is Dynamic resource allocation: 
http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
 


Spark provides a mechanism to dynamically adjust the resources your application 
occupies based on the workload. This means that your application may give 
resources back to the cluster if they are no longer used and request them again 
later when there is demand. This feature is particularly useful if multiple 
applications share resources in your Spark cluster. 

- Mail Original - 
De: "Sumit Chawla"  
À: "Michael Gummelt"  
Cc: u...@mesos.apache.org, "Dev" , "User" 
, "dev"  
Envoyé: Lundi 19 Décembre 2016 19h35:51 GMT +01:00 Amsterdam / Berlin / Berne / 
Rome / Stockholm / Vienne 
Objet: Re: Mesos Spark Fine Grained Execution - CPU count 


But coarse grained does the exact same thing which i am trying to avert here. 
At the cost of lower startup, it keeps the resources reserved till the entire 
duration of the job. 



Regards 
Sumit Chawla 



On Mon, Dec 19, 2016 at 10:06 AM, Michael Gummelt < mgumm...@mesosphere.io > 
wrote: 




Hi 

I don't have a lot of experience with the fine-grained scheduler. It's 
deprecated and fairly old now. CPUs should be relinquished as tasks complete, 
so I'm not sure why you're seeing what you're seeing. There have been a few 
discussions on the spark list regarding deprecating the fine-grained scheduler, 
and no one seemed too dead-set on keeping it. I'd recommend you move over to 
coarse-grained. 





On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit < sumitkcha...@gmail.com > wrote: 



Hi 


I am using Spark 1.6. I have one query about Fine Grained model in Spark. I 
have a simple Spark application which transforms A -> B. Its a single stage 
application. To begin the program, It starts with 48 partitions. When the 
program starts running, in mesos UI it shows 48 tasks and 48 CPUs allocated to 
job. Now as the tasks get done, the number of active tasks number starts 
decreasing. How ever, the number of CPUs does not decrease propotionally. When 
the job was about to finish, there was a single remaininig task, however CPU 
count was still 20. 


My questions, is why there is no one to one mapping between tasks and cpus in 
Fine grained? How can these CPUs be released when the job is done, so that 
other jobs can start. 






Regards 
Sumit Chawla 




-- 







Michael Gummelt 
Software Engineer 
Mesosphere 



Re: Adding Hive support to existing SparkSession (or starting PySpark with Hive support)

2016-12-19 Thread Sergey B.
I have asked a similar question here:

http://stackoverflow.com/questions/40701518/spark-2-0-redefining-sparksession-params-through-getorcreate-and-not-seeing-cha

Please see the answer, which basically states that it's impossible to change
the session config once it has been initialized.
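
For what it's worth, a minimal sketch of the two approaches that usually come up
for this, assuming hive-site.xml is visible to Spark. The shell flag relies on
the spark.sql.catalogImplementation setting that enableHiveSupport() sets under
the hood, so verify it against your build:

    # (a) Make the session the pyspark shell auto-creates Hive-enabled at launch:
    #       pyspark --conf spark.sql.catalogImplementation=hive
    #
    # (b) In an application, build the session yourself before anything else
    #     touches SparkSession:
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-enabled-session")   # placeholder app name
             .enableHiveSupport()               # requests Hive catalog / metastore support
             .getOrCreate())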

On Mon, Dec 19, 2016 at 9:01 PM, Venkata Naidu  wrote:

> We can create a link in the spark conf directory to point hive.conf file
> of hive installation I believe.
>
> Thanks,
> Venkat.
>
> On Mon, Dec 19, 2016, 10:58 AM apu  wrote:
>
>> This is for Spark 2.0:
>>
>> If I wanted Hive support on a new SparkSession, I would build it with:
>>
>> spark = SparkSession \
>> .builder \
>> .enableHiveSupport() \
>> .getOrCreate()
>>
>> However, PySpark already creates a SparkSession for me, which appears to
>> lack HiveSupport. How can I either:
>>
>> (a) Add Hive support to an existing SparkSession,
>>
>> or
>>
>> (b) Configure PySpark so that the SparkSession it creates at startup has
>> Hive support enabled?
>>
>> Thanks!
>>
>> Apu
>>
>


RE: [Spark SQL] Task failed while writing rows

2016-12-19 Thread Joseph Naegele
Thanks Michael, hdfs dfsadmin -report tells me:

 

Configured Capacity: 7999424823296 (7.28 TB)
Present Capacity: 7997657774971 (7.27 TB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used: 38566006784 (35.92 GB)
DFS Used%: 0.48%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-
Live datanodes (1):

Name: 127.0.0.1:50010 (localhost)
Hostname: XXX.XXX.XXX
Decommission Status : Normal
Configured Capacity: 7999424823296 (7.28 TB)
DFS Used: 38566006784 (35.92 GB)
Non DFS Used: 1767048325 (1.65 GB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used%: 0.48%
DFS Remaining%: 99.50%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 17
Last contact: Mon Dec 19 13:00:06 EST 2016

 

The Hadoop exception occurs because it times out after 60 seconds in a “select” 
call on a java.nio.channels.SocketChannel, while waiting to read from the 
socket. This implies the client writer isn’t writing on the socket as expected, 
but shouldn’t this all be handled by the Hadoop library within Spark?

 

It looks like a few similar, but rare, cases have been reported before, e.g. 
https://issues.apache.org/jira/browse/HDFS-770 which is *very* old.

 

If you’re pretty sure Spark couldn’t be responsible for issues at this level 
I’ll stick to the Hadoop mailing list.

 

Thanks

---

Joe Naegele

Grier Forensics

 

From: Michael Stratton [mailto:michael.strat...@komodohealth.com] 
Sent: Monday, December 19, 2016 10:00 AM
To: Joseph Naegele 
Cc: user 
Subject: Re: [Spark SQL] Task failed while writing rows

 

It seems like an issue w/ Hadoop. What do you get when you run hdfs dfsadmin 
-report?

 

Anecdotally(And w/o specifics as it has been a while), I've generally used 
Parquet instead of ORC as I've gotten a bunch of random problems reading and 
writing ORC w/ Spark... but given ORC performs a lot better w/ Hive it can be a 
pain.

 

On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele  > wrote:

Hi all,

I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 
1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). 
It's current compressed size is around 13 GB, but my problem started when it 
was much smaller, maybe 5 GB. This dataset is generated by performing a query 
on an existing ORC dataset in HDFS, selecting a subset of the existing data 
(i.e. removing duplicates). When I write this dataset to HDFS using ORC I get 
the following exceptions in the driver:

org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.lang.RuntimeException: Failed to commit task
Suppressed: java.lang.IllegalArgumentException: Column has wrong number of 
index entries found: 0 expected: 32

Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. 
Aborting...

This happens multiple times. The executors tell me the following a few times 
before the same exceptions as above:

 

2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer 
class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor 
exception  for block 
BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642

java.io.EOFException: Premature EOF: no length prefix available

at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)

at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)

at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)


My HDFS datanode says:

2016-12-09 02:39:24,783 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE, 
cliID: DFSClient_attempt_201612090102__m_25_0_956624542_193, offset: 0, 
srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: 
BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 
93026972

2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: 
BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, 
type=LAST_IN_PIPELINE, downstreams=0:[] terminating

2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation  src: 
/127.0.0.1:57790 dst: /127.0.0.1:50010 

  java.net.SocketTimeoutException: 6 millis timeout while 
waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 
remote=/127.0.0.1:57790]


It looks like the datanode is receiving the block on 

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Chawla,Sumit
But coarse grained does the exact same thing which i am trying to avert
here.  At the cost of lower startup, it keeps the resources reserved till
the entire duration of the job.

Regards
Sumit Chawla


On Mon, Dec 19, 2016 at 10:06 AM, Michael Gummelt 
wrote:

> Hi
>
> I don't have a lot of experience with the fine-grained scheduler.  It's
> deprecated and fairly old now.  CPUs should be relinquished as tasks
> complete, so I'm not sure why you're seeing what you're seeing.  There have
> been a few discussions on the spark list regarding deprecating the
> fine-grained scheduler, and no one seemed too dead-set on keeping it.  I'd
> recommend you move over to coarse-grained.
>
> On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit 
> wrote:
>
>> Hi
>>
>> I am using Spark 1.6. I have one query about Fine Grained model in
>> Spark.  I have a simple Spark application which transforms A -> B.  Its a
>> single stage application.  To begin the program, It starts with 48
>> partitions.  When the program starts running, in mesos UI it shows 48 tasks
>> and 48 CPUs allocated to job.  Now as the tasks get done, the number of
>> active tasks number starts decreasing.  How ever, the number of CPUs does
>> not decrease propotionally.  When the job was about to finish, there was a
>> single remaininig task, however CPU count was still 20.
>>
>> My questions, is why there is no one to one mapping between tasks and
>> cpus in Fine grained?  How can these CPUs be released when the job is done,
>> so that other jobs can start.
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>


Pivot in Spark with Case and when

2016-12-19 Thread KhajaAsmath Mohammed
Hi,

I am trying to convert a sample of Hive code into Spark SQL for better
performance. Below is the part of the Hive query that needs to be converted to
Spark SQL.

All the data is grouped on a particular column (id), the max of the value
column is taken for each group, and the result is pivoted out:

max(CASE WHEN (   trim(dt_map) = 'Geotab GPS kph'
               OR trim(dt_map) = 'Teletrac (FTP feed) GPS deg')
    THEN T.value ELSE null END) AS vehicle_speed_kph_1


I have gone through the window and pivot functions, but it looks like I will
not be able to achieve the above condition with them. It would be very helpful
if someone could provide any suggestions.


*Note:* I don't want to create a temporary table and reuse the same Hive query
instead of using the functions.


Thanks,

Asmath
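
For reference, a minimal sketch of how this conditional max could be expressed
with DataFrame aggregate functions rather than a temporary table, assuming a
DataFrame named df with the id, dt_map and value columns used in the query
above (other pivoted columns would be added as further agg() expressions):

    from pyspark.sql import functions as F

    result = (df.groupBy("id")
                .agg(F.max(F.when(F.trim(F.col("dt_map"))
                                   .isin("Geotab GPS kph",
                                         "Teletrac (FTP feed) GPS deg"),
                                  F.col("value")))     # null for non-matching rows, so max() skips them
                      .alias("vehicle_speed_kph_1")))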


Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Michael Gummelt
Hi

I don't have a lot of experience with the fine-grained scheduler.  It's
deprecated and fairly old now.  CPUs should be relinquished as tasks
complete, so I'm not sure why you're seeing what you're seeing.  There have
been a few discussions on the spark list regarding deprecating the
fine-grained scheduler, and no one seemed too dead-set on keeping it.  I'd
recommend you move over to coarse-grained.

On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit 
wrote:

> Hi
>
> I am using Spark 1.6 and have one question about the fine-grained model in
> Spark.  I have a simple Spark application which transforms A -> B.  It's a
> single-stage application that starts with 48 partitions.  When the program
> starts running, the Mesos UI shows 48 tasks and 48 CPUs allocated to the
> job.  As the tasks get done, the number of active tasks starts decreasing.
> However, the number of CPUs does not decrease proportionally.  When the job
> was about to finish, there was a single remaining task, yet the CPU count
> was still 20.
>
> My questions are: why is there no one-to-one mapping between tasks and
> CPUs in fine-grained mode?  And how can these CPUs be released when the job
> is done, so that other jobs can start?
>
>
> Regards
> Sumit Chawla
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: Adding Hive support to existing SparkSession (or starting PySpark with Hive support)

2016-12-19 Thread Venkata Naidu
We can create a link in the Spark conf directory pointing to the Hive conf
file (hive-site.xml) of the Hive installation, I believe.

Thanks,
Venkat.

On Mon, Dec 19, 2016, 10:58 AM apu  wrote:

> This is for Spark 2.0:
>
> If I wanted Hive support on a new SparkSession, I would build it with:
>
> spark = SparkSession \
> .builder \
> .enableHiveSupport() \
> .getOrCreate()
>
> However, PySpark already creates a SparkSession for me, which appears to
> lack HiveSupport. How can I either:
>
> (a) Add Hive support to an existing SparkSession,
>
> or
>
> (b) Configure PySpark so that the SparkSession it creates at startup has
> Hive support enabled?
>
> Thanks!
>
> Apu
>


Adding Hive support to existing SparkSession (or starting PySpark with Hive support)

2016-12-19 Thread apu
This is for Spark 2.0:

If I wanted Hive support on a new SparkSession, I would build it with:

spark = SparkSession \
.builder \
.enableHiveSupport() \
.getOrCreate()

However, PySpark already creates a SparkSession for me, which appears to
lack HiveSupport. How can I either:

(a) Add Hive support to an existing SparkSession,

or

(b) Configure PySpark so that the SparkSession it creates at startup has
Hive support enabled?

Thanks!

Apu


Re: Reference External Variables in Map Function (Inner class)

2016-12-19 Thread mbayebabacar
Hello Marcelo,
What was the solution, finally? I am facing the same problem.
Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reference-External-Variables-in-Map-Function-Inner-class-tp11990p28237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: [Spark SQL] Task failed while writing rows

2016-12-19 Thread Michael Stratton
It seems like an issue w/ Hadoop. What do you get when you run hdfs
dfsadmin -report?

Anecdotally (and w/o specifics, as it has been a while), I've generally used
Parquet instead of ORC, as I've hit a bunch of random problems reading
and writing ORC w/ Spark... but given that ORC performs a lot better w/
Hive, that can be a pain.

On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele  wrote:

> Hi all,
>
> I'm having trouble with a relatively simple Spark SQL job. I'm using Spark
> 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record).
> Its current compressed size is around 13 GB, but my problem started when
> it was much smaller, maybe 5 GB. This dataset is generated by performing a
> query on an existing ORC dataset in HDFS, selecting a subset of the
> existing data (i.e. removing duplicates). When I write this dataset to HDFS
> using ORC I get the following exceptions in the driver:
>
> org.apache.spark.SparkException: Task failed while writing rows
> Caused by: java.lang.RuntimeException: Failed to commit task
> Suppressed: java.lang.IllegalArgumentException: Column has wrong number
> of index entries found: 0 expected: 32
> Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad.
> Aborting...
>
> This happens multiple times. The executors tell me the following a few
> times before the same exceptions as above:
>
> 2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output
> committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor
> exception  for block BP-1695049761-192.168.2.211-1479228275669
> :blk_1073862425_121642
> java.io.EOFException: Premature EOF: no length prefix available
> at org.apache.hadoop.hdfs.protocolPB.PBHelper.
> vintPrefixed(PBHelper.java:2203)
> at org.apache.hadoop.hdfs.protocol.datatransfer.
> PipelineAck.readFields(PipelineAck.java:176)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$
> ResponseProcessor.run(DFSOutputStream.java:867)
>
> My HDFS datanode says:
>
> 2016-12-09 02:39:24,783 INFO 
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op:
> HDFS_WRITE, cliID: 
> DFSClient_attempt_201612090102__m_25_0_956624542_193,
> offset: 0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: BP-
> 1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration:
> 93026972
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder: 
> BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637,
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation
> src: /127.0.0.1:57790 dst: /127.0.0.1:50010
> java.net.SocketTimeoutException: 6 millis timeout while waiting for 
> channel
> to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/127.0.0.1:50010 remote=/127.0.0.1:57790]
>
> It looks like the datanode is receiving the block on multiple ports
> (threads?) and one of the sending connections terminates early.
>
> I was originally running 6 executors with 6 cores and 24 GB RAM each
> (Total: 36 cores, 144 GB) and experienced many of these issues, where
> occasionally my job would fail altogether. Lowering the number of cores
> appears to reduce the frequency of these errors, however I'm now down to 4
> executors with 2 cores each (Total: 8 cores), which is significantly less,
> and still see approximately 1-3 task failures.
>
> Details:
> - Spark 1.6.3 - Standalone
> - RDD compression enabled
> - HDFS replication disabled
> - Everything running on the same host
> - Otherwise vanilla configs for Hadoop and Spark
>
> Does anybody have any ideas or hints? I can't imagine the problem is
> solely related to the number of executor cores.
>
> Thanks,
> Joe Naegele
>


Re: Spark SQL Syntax

2016-12-19 Thread A Shaikh
I use PySpark on Spark 2.
I used Oracle and Postgres syntax, only to get back an "unhappy response".
I do get some of it resolved after some searching, but that consumes a lot
of my time; having a platform to test my SQL syntax and its results would
be very helpful.
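
One cheap way to sanity-check expressions (a hedged sketch, assuming Spark
2.x; shown as spark-shell/Scala, but the same spark.sql calls work from
pyspark) is to run them against a throwaway one-row query:

// Quick syntax check: evaluate the expression on a dummy one-row result.
spark.sql("SELECT to_date('2016-12-19') AS d, date_add(current_date(), 7) AS next_week").show()

// Or inspect the parsed/analyzed plan without caring about real data:
spark.sql("SELECT cast('2016-12-19 10:00:00' AS timestamp) AS ts").explain(true)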


On 19 December 2016 at 14:00, Ramesh Krishnan 
wrote:

> What is the version of spark you are using . If it is less than 2.0 ,
> consider using dataset API's to validate compile time checks on syntax.
>
> Thanks,
> Ramesh
>
> On Mon, Dec 19, 2016 at 6:36 PM, A Shaikh  wrote:
>
>> HI,
>>
>> I keep getting invalid Spark SQL syntax errors, especially for
>> date/timestamp manipulation.
>> What's the best way to test whether the SQL syntax I use with a Spark
>> DataFrame is valid? Is there any online site to test or run a demo SQL?
>>
>> Thanks,
>> Afzal
>>
>
>


stratified sampling scales poorly

2016-12-19 Thread Martin Le
Hi all,

I perform sampling on a DStream by taking samples from RDDs in the DStream.
I have used two sampling mechanisms: simple random sampling and stratified
sampling.

Simple random sampling: inputStream.transform(x => x.sample(false,
fraction)).

Stratified sampling: inputStream.transform(x => x.sampleByKeyExact(false,
fractions))

where fractions = Map("key1" -> fraction, "key2" -> fraction, …, "keyn" ->
fraction).

My question is: why does stratified sampling scale poorly with different
sampling fractions in this context, while simple random sampling scales
well with different sampling fractions (I ran experiments on a 4-node
cluster)?
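
If exact per-key sample sizes are not strictly required, one thing worth
trying (a hedged sketch reusing inputStream and fractions from above) is
sampleByKey: sampleByKeyExact uses acceptance/rejection sampling and needs
additional pass(es) over the RDD to guarantee exact per-stratum counts,
which is a plausible reason it scales worse than the single-pass variants:

// Single-pass per-key Bernoulli sampling; unlike sampleByKeyExact it does
// not guarantee exact per-stratum sample sizes, but it avoids the extra
// pass(es) over the RDD.
val sampled = inputStream.transform(rdd => rdd.sampleByKey(false, fractions))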

Thank you,

Martin


Spark SQL Syntax

2016-12-19 Thread A Shaikh
HI,

I keep getting invalid Spark SQL syntax errors, especially for
date/timestamp manipulation.
What's the best way to test whether the SQL syntax I use with a Spark
DataFrame is valid? Is there any online site to test or run a demo SQL?

Thanks,
Afzal


Re: How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Rabin Banerjee
Thanks , It worked !!

On Mon, Dec 19, 2016 at 5:55 PM, Dhaval Modi  wrote:

>
> Replace the plain table name with "namespace:tablename", i.e. prefix it
> with the namespace and a ":"
>
> Regards,
> Dhaval Modi
>
> On 19 December 2016 at 13:10, Rabin Banerjee  > wrote:
>
>> HI All,
>>
>>   I am trying to save data from Spark into HBase using the
>> saveAsNewAPIHadoopDataset API. Please refer to the code below. The code is
>> working fine, but the table is getting stored in the default namespace.
>> How do I set the namespace in the code below?
>>
>>
>>
>>
>> wordCounts.foreachRDD ( rdd => {
>>   val conf = HBaseConfiguration.create()
>>   conf.set(TableOutputFormat.OUTPUT_TABLE, "stream_count")
>>   conf.set("hbase.zookeeper.quorum", "localhost:2181")
>>   conf.set("hbase.master", "localhost:6");
>>   conf.set("hbase.rootdir", "file:///tmp/hbase")
>>
>>   val jobConf = new Configuration(conf)
>>   jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
>>   jobConf.set("mapreduce.job.output.value.class", classOf[LongWritable].
>> getName)
>>   jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[
>> Text]].getName)
>>
>>   rdd.saveAsNewAPIHadoopDataset(jobConf)
>> })
>>
>> Regards,
>> R Banerjee
>>
>
>


Re: How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Dhaval Modi
Replace the plain table name with "namespace:tablename", i.e. prefix it
with the namespace and a ":"
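
In other words, the namespace goes into the table name itself (a hedged
sketch; "my_ns" is a hypothetical namespace):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat

val conf = HBaseConfiguration.create()
// HBase addresses tables as "namespace:table".
conf.set(TableOutputFormat.OUTPUT_TABLE, "my_ns:stream_count")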

Regards,
Dhaval Modi

On 19 December 2016 at 13:10, Rabin Banerjee 
wrote:

> HI All,
>
>   I am trying to save data from Spark into HBase using the
> saveAsNewAPIHadoopDataset API. Please refer to the code below. The code is
> working fine, but the table is getting stored in the default namespace.
> How do I set the namespace in the code below?
>
>
>
>
> wordCounts.foreachRDD ( rdd => {
>   val conf = HBaseConfiguration.create()
>   conf.set(TableOutputFormat.OUTPUT_TABLE, "stream_count")
>   conf.set("hbase.zookeeper.quorum", "localhost:2181")
>   conf.set("hbase.master", "localhost:6");
>   conf.set("hbase.rootdir", "file:///tmp/hbase")
>
>   val jobConf = new Configuration(conf)
>   jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
>   jobConf.set("mapreduce.job.output.value.class", classOf[LongWritable].
> getName)
>   jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[
> Text]].getName)
>
>   rdd.saveAsNewAPIHadoopDataset(jobConf)
> })
>
> Regards,
> R Banerjee
>


How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Rabin Banerjee
HI All,

  I am trying to save data from Spark into HBase using the
saveAsNewAPIHadoopDataset API. Please refer to the code below. The code is
working fine, but the table is getting stored in the default namespace.
How do I set the namespace in the code below?




wordCounts.foreachRDD ( rdd => {
  val conf = HBaseConfiguration.create()
  conf.set(TableOutputFormat.OUTPUT_TABLE, "stream_count")
  conf.set("hbase.zookeeper.quorum", "localhost:2181")
  conf.set("hbase.master", "localhost:6");
  conf.set("hbase.rootdir", "file:///tmp/hbase")

  val jobConf = new Configuration(conf)
  jobConf.set("mapreduce.job.output.key.class", classOf[Text].getName)
  jobConf.set("mapreduce.job.output.value.class", classOf[LongWritable].
getName)
  jobConf.set("mapreduce.outputformat.class", classOf[TableOutputFormat[Text
]].getName)

  rdd.saveAsNewAPIHadoopDataset(jobConf)
})

Regards,
R Banerjee


Re: What is the deployment model for Spark Streaming? A specific example.

2016-12-19 Thread Eike von Seggern
Hi,

are you using Spark 2.0.*? Then it might be related to
https://issues.apache.org/jira/browse/SPARK-18281 .
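
On the deployment-model side, one common building block is driver
checkpointing with getOrCreate, so a restarted driver resumes from the
checkpoint instead of starting cold (a hedged sketch in Scala, although the
original app is PySpark; the checkpoint path and app name are hypothetical,
and you still need something external, e.g. spark-submit --supervise or the
cluster manager, to restart the driver process itself):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("streaming-app") // placeholder name
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... build the DStream pipeline here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// On restart, the driver rebuilds the context from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()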

Best

Eike

2016-12-18 6:21 GMT+01:00 Russell Jurney :

> Anyone? This is for a book, so I need to figure this out.
>
> On Fri, Dec 16, 2016 at 12:53 AM Russell Jurney 
> wrote:
>
>> I have created a PySpark Streaming application that uses Spark ML to
>> classify flight delays into three categories: on-time, slightly late, very
>> late. After an hour or so something times out and the whole thing crashes.
>>
>> The code and error are on a gist here: https://gist.github.com/rjurney/
>> 17d471bc98fd1ec925c37d141017640d
>>
>> While I am interested in why I am getting an exception, I am more
>> interested in understanding what the correct deployment model is... because
>> long running processes will have new and varied errors and exceptions.
>> Right now with what I've built, Spark is a highly dependable distributed
>> system, but in streaming mode the entire thing depends on one Python
>> PID staying up. This can't be how apps are deployed in the wild because it
>> will never be very reliable, right? But I don't see anything about this in
>> the docs, so I am confused.
>>
>> Note that I use this to run the app, maybe that is the problem?
>>
>> ssc.start()
>> ssc.awaitTermination()
>>
>>
>> What is the actual deployment model for Spark Streaming? All I know to do
>> right now is to restart the PID. I'm new to Spark, and the docs don't
>> really explain this (that I can see).
>>
>> Thanks!
>> --
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
>>
>>
>>


-- 

*Jan Eike von Seggern*
Data Scientist

*Sevenval Technologies GmbH *

FRONT-END-EXPERTS SINCE 1999

Köpenicker Straße 154 | 10997 Berlin

office   +49 30 707 190 - 229
mail eike.segg...@sevenval.com

www.sevenval.com

Sitz: Köln, HRB 79823
Geschäftsführung: Jan Webering (CEO), Thorsten May, Sascha Langfus,
Joern-Carlos Kuntze

*We increase the return on investment of your mobile and web projects.
Contact us:* http://roi.sevenval.com/


Re: Reading xls and xlsx files

2016-12-19 Thread Jörn Franke
I am currently developing one https://github.com/ZuInnoTe/hadoopoffice

It contains working source code, but a release will likely only come at the
beginning of the year (it will include a Spark data source, but the existing
source code can already be used without issues in a Spark application).


> On 19 Dec 2016, at 11:12, Selvam Raman  wrote:
> 
> Hi,
> 
> Is there a way to read xls and xlsx files using Spark?
> 
> Is there any Hadoop InputFormat available to read xls and xlsx files which 
> could be used in Spark?
> 
> -- 
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Reading xls and xlsx files

2016-12-19 Thread Selvam Raman
Hi,

Is there a way to read xls and xlsx files using Spark?

Is there any Hadoop InputFormat available to read xls and xlsx files which
could be used in Spark?

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: How to perform Join operation using JAVARDD

2016-12-19 Thread ayan guha
What's your desired output?
On Sat., 17 Dec. 2016 at 9:50 pm, Sree Eedupuganti  wrote:

> I tried like this,
>
> *CrashData_1.csv:*
>
> *CRASH_KEY    CRASH_NUMBER   CRASH_DATE                  CRASH_MONTH*
> *2016899114   2016899114     01/02/2016 12:00:00 AM +*
>
> *CrashData_2.csv:*
>
> *CITY_NAME   ZIPCODE   CITY          STATE*
> *1945        704       PARC PARQUE   PR*
>
>
> Code:
>
>
> *JavaRDD<String> firstRDD =
> sc.textFile("/Users/apple/Desktop/CrashData_1.csv");*
>
> *JavaRDD<String> secondRDD =
> sc.textFile("/Users/apple/Desktop/CrashData_2.csv");*
>
> *JavaRDD<String> allRDD = firstRDD.union(secondRDD);*
>
>
> *Output I am getting:*
>
> *[CRASH_KEY,CRASH_NUMBER,CRASH_DATE,CRASH_MONTH,
> 2016899114,2016899114,01/02/2016 12:00:00 AM + *
>
> *CITY_NAME,ZIPCODE,CITY,STATE, **1945,704,PARC PARQUE,PR]*
>
>
>
>
> *Any suggestions please, thanks in advance*
>
>
>


Re: How to get recent value in spark dataframe

2016-12-19 Thread ayan guha
You have 2 parts to it

1. Do a sub-query where, for each primary key, you derive the latest flag=1
record. Ensure you get exactly 1 record per primary key value. Here you
can use rank() over (partition by primary key order by year desc).

2. Join your original dataset with the above on the primary key. If the
year is higher than that of the latest flag=1 record, take its value;
otherwise mark it null.

If you can have primary keys with no flag=1 records at all, they won't
show up in set 1 above, so if you still want them in the result, adjust
step 1 accordingly.
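
A rough sketch of those two steps (hedged: it assumes a DataFrame df with
the columns id, flag, price, date from the quoted question, and orders by
date where the description above says year):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// 1. Latest flag = 1 record per id.
val w = Window.partitionBy("id").orderBy(col("date").desc)
val latestFlag1 = df.filter(col("flag") === 1)
  .withColumn("rk", rank().over(w))
  .filter(col("rk") === 1)
  .select(col("id"), col("price").as("latest_price"), col("date").as("latest_date"))

// 2. Join back; keep the latest flag = 1 price only for newer flag = 0 rows.
//    Everything else stays null (when() without otherwise() yields null).
val result = df.join(latestFlag1, Seq("id"), "left")
  .withColumn("new_column",
    when(col("flag") === 0 && col("date") > col("latest_date"), col("latest_price")))
  .drop("latest_price", "latest_date")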

Best
Ayan
On Mon., 19 Dec. 2016 at 1:01 pm, Richard Xin
 wrote:

> I am not sure I understood your logic, but it seems to me that you could
> take a look at Hive's Lead/Lag functions.
>
>
> On Monday, December 19, 2016 1:41 AM, Milin korath <
> milin.kor...@impelsys.com> wrote:
>
>
> Thanks, I tried with a left outer join. My dataset has around 400M records
> and a lot of shuffling is happening. Is there any other workaround apart
> from a join? I tried using a window function but I am not getting a proper
> solution.
>
>
> Thanks
>
> On Sat, Dec 17, 2016 at 4:55 AM, Michael Armbrust 
> wrote:
>
> Oh and to get the null for missing years, you'd need to do an outer join
> with a table containing all of the years you are interested in.
>
> On Fri, Dec 16, 2016 at 3:24 PM, Michael Armbrust 
> wrote:
>
> Are you looking for argmax? Here is an example.
>
> On Wed, Dec 14, 2016 at 8:49 PM, Milin korath 
> wrote:
>
> Hi
>
> I have a spark data frame with following structure
>
>   id  flag  price  date
>   a   0     100    2015
>   a   0     50     2015
>   a   1     200    2014
>   a   1     300    2013
>   a   0     400    2012
>
> I need to create a data frame where the most recent flag=1 price is filled
> into the flag=0 rows as a new column.
>
>   id  flag  price  date  new_column
>   a   0     100    2015  200
>   a   0     50     2015  200
>   a   1     200    2014  null
>   a   1     300    2013  null
>   a   0     400    2012  null
>
> We have 2 rows with flag=0 in 2015. For the first of them, there are 2
> earlier flag=1 values (200 and 300), and I take the most recent one, 200
> (2014). For the last row there is no earlier flag=1 value, so it is set to
> null. I'm looking for a solution in Scala; any help would be
> appreciated. Thanks.
>
> Thanks
> Milin
>
>
>
>
>
>
>
>
>
>
>
>


Re: Question about Spark and filesystems

2016-12-19 Thread Calvin Jia
Hi,

If you are concerned with the performance of the alternative filesystems
(i.e. needing a caching client), you can use Alluxio on top of any of NFS,
Ceph, GlusterFS, or other/multiple storages. Especially since your working
sets will not be
huge, you most likely will be able to store all the relevant data within
Alluxio during computation, giving you flexibility to store your data in
your preferred storage without performance penalties.

Hope this helps,
Calvin

On Sun, Dec 18, 2016 at 11:23 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> I am using Gluster and I have decent performance with basic maintenance
> effort. An advantage of Gluster: you can plug Alluxio on top to improve
> perf, but I still need to validate that...
>
> On 18 Dec 2016 at 8:50 PM,  wrote:
>
>> Hello,
>>
>> We are trying out Spark for some file processing tasks.
>>
>> Since each Spark worker node needs to access the same files, we have
>> tried using Hdfs. This worked, but there were some oddities making me a
>> bit uneasy. For dependency hell reasons I compiled a modified Spark, and
>> this version exhibited the odd behaviour with Hdfs. The problem might
>> have nothing to do with Hdfs, but the situation made me curious about
>> the alternatives.
>>
>> Now I'm wondering what kind of file system would be suitable for our
>> deployment.
>>
>> - There won't be a great number of nodes. Maybe 10 or so.
>>
>> - The datasets won't be big by big-data standards(Maybe a couple of
>>   hundred gb)
>>
>> So maybe I could just use a NFS server, with a caching client?
>> Or should I try Ceph, or Glusterfs?
>>
>> Does anyone have any experiences to share?
>>
>> --
>> Joakim Verona
>> joa...@verona.se
>>
>>
>>