Re: Graceful shutdown of spark streaming on yarn

2016-05-11 Thread Deepak Sharma
This is happening because the SparkContext shuts down without shutting down
the StreamingContext (ssc) first.
This was the behavior up to Spark 1.4 and was addressed in later releases:
https://github.com/apache/spark/pull/6307
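
In other words, the StreamingContext has to be stopped (gracefully) before the
SparkContext goes away. A minimal sketch of doing that explicitly from your own
termination logic (a sketch only; exact semantics depend on your Spark version):

    // Stop streaming first and let in-flight batches finish, then stop the
    // SparkContext; reversing the order is what causes the abrupt shutdown.
    ssc.stop(stopSparkContext = true, stopGracefully = true)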

Which version of spark are you on?

Thanks
Deepak

On Thu, May 12, 2016 at 12:14 PM, Rakesh H (Marketing Platform-BLR) <
rakes...@flipkart.com> wrote:

> Yes, it seems to be the case.
> In this case executors should have continued logging values till 300, but
> they are shutdown as soon as i do "yarn kill .."
>
> On Thu, May 12, 2016 at 12:11 PM Deepak Sharma 
> wrote:
>
>> So in your case , the driver is shutting down gracefully , but the
>> executors are not.
>> IS this the problem?
>>
>> Thanks
>> Deepak
>>
>> On Thu, May 12, 2016 at 11:49 AM, Rakesh H (Marketing Platform-BLR) <
>> rakes...@flipkart.com> wrote:
>>
>>> Yes, it is set to true.
>>> Log of driver :
>>>
>>> 16/05/12 10:18:29 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
>>> 16/05/12 10:18:29 INFO streaming.StreamingContext: Invoking 
>>> stop(stopGracefully=true) from shutdown hook
>>> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Stopping JobGenerator 
>>> gracefully
>>> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waiting for all received 
>>> blocks to be consumed for job generation
>>> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waited for all received 
>>> blocks to be consumed for job generation
>>>
>>> Log of executor:
>>> 16/05/12 10:18:29 ERROR executor.CoarseGrainedExecutorBackend: Driver 
>>> xx.xx.xx.xx:x disassociated! Shutting down.
>>> 16/05/12 10:18:29 WARN remote.ReliableDeliverySupervisor: Association with 
>>> remote system [xx.xx.xx.xx:x] has failed, address is now gated for 
>>> [5000] ms. Reason: [Disassociated]
>>> 16/05/12 10:18:29 INFO storage.DiskBlockManager: Shutdown hook called
>>> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
>>> 204 //This is value i am logging
>>> 16/05/12 10:18:29 INFO util.ShutdownHookManager: Shutdown hook called
>>> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 205
>>> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 206
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, May 12, 2016 at 11:45 AM Deepak Sharma 
>>> wrote:
>>>
 Hi Rakesh
 Did you tried setting *spark.streaming.stopGracefullyOnShutdown to
 true *for your spark configuration instance?
 If not try this , and let us know if this helps.

 Thanks
 Deepak

 On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) <
 rakes...@flipkart.com> wrote:

> Issue i am having is similar to the one mentioned here :
>
> http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn
>
> I am creating a rdd from sequence of 1 to 300 and creating streaming
> RDD out of it.
>
> val rdd = ssc.sparkContext.parallelize(1 to 300)
> val dstream = new ConstantInputDStream(ssc, rdd)
> dstream.foreachRDD{ rdd =>
>   rdd.foreach{ x =>
> log(x)
> Thread.sleep(50)
>   }
> }
>
>
> When i kill this job, i expect elements 1 to 300 to be logged before
> shutting down. It is indeed the case when i run it locally. It wait for 
> the
> job to finish before shutting down.
>
> But when i launch the job in custer with "yarn-cluster" mode, it
> abruptly shuts down.
> Executor prints following log
>
> ERROR executor.CoarseGrainedExecutorBackend:
> Driver xx.xx.xx.xxx:y disassociated! Shutting down.
>
>  and then it shuts down. It is not a graceful shutdown.
>
> Anybody knows how to do it in yarn ?
>
>
>
>


 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net

>>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Graceful shutdown of spark streaming on yarn

2016-05-11 Thread Rakesh H (Marketing Platform-BLR)
Yes, it seems to be the case.
In this case the executors should have continued logging values till 300, but
they are shut down as soon as I do "yarn kill ...".

On Thu, May 12, 2016 at 12:11 PM Deepak Sharma 
wrote:

> So in your case , the driver is shutting down gracefully , but the
> executors are not.
> IS this the problem?
>
> Thanks
> Deepak
>
> On Thu, May 12, 2016 at 11:49 AM, Rakesh H (Marketing Platform-BLR) <
> rakes...@flipkart.com> wrote:
>
>> Yes, it is set to true.
>> Log of driver :
>>
>> 16/05/12 10:18:29 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
>> 16/05/12 10:18:29 INFO streaming.StreamingContext: Invoking 
>> stop(stopGracefully=true) from shutdown hook
>> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Stopping JobGenerator 
>> gracefully
>> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waiting for all received 
>> blocks to be consumed for job generation
>> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waited for all received 
>> blocks to be consumed for job generation
>>
>> Log of executor:
>> 16/05/12 10:18:29 ERROR executor.CoarseGrainedExecutorBackend: Driver 
>> xx.xx.xx.xx:x disassociated! Shutting down.
>> 16/05/12 10:18:29 WARN remote.ReliableDeliverySupervisor: Association with 
>> remote system [xx.xx.xx.xx:x] has failed, address is now gated for 
>> [5000] ms. Reason: [Disassociated]
>> 16/05/12 10:18:29 INFO storage.DiskBlockManager: Shutdown hook called
>> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 204 
>> //This is value i am logging
>> 16/05/12 10:18:29 INFO util.ShutdownHookManager: Shutdown hook called
>> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 205
>> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 206
>>
>>
>>
>>
>>
>>
>> On Thu, May 12, 2016 at 11:45 AM Deepak Sharma 
>> wrote:
>>
>>> Hi Rakesh
>>> Did you tried setting *spark.streaming.stopGracefullyOnShutdown to true
>>> *for your spark configuration instance?
>>> If not try this , and let us know if this helps.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) <
>>> rakes...@flipkart.com> wrote:
>>>
 Issue i am having is similar to the one mentioned here :

 http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn

 I am creating a rdd from sequence of 1 to 300 and creating streaming
 RDD out of it.

 val rdd = ssc.sparkContext.parallelize(1 to 300)
 val dstream = new ConstantInputDStream(ssc, rdd)
 dstream.foreachRDD{ rdd =>
   rdd.foreach{ x =>
 log(x)
 Thread.sleep(50)
   }
 }


 When i kill this job, i expect elements 1 to 300 to be logged before
 shutting down. It is indeed the case when i run it locally. It wait for the
 job to finish before shutting down.

 But when i launch the job in custer with "yarn-cluster" mode, it
 abruptly shuts down.
 Executor prints following log

 ERROR executor.CoarseGrainedExecutorBackend:
 Driver xx.xx.xx.xxx:y disassociated! Shutting down.

  and then it shuts down. It is not a graceful shutdown.

 Anybody knows how to do it in yarn ?




>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Graceful shutdown of spark streaming on yarn

2016-05-11 Thread Deepak Sharma
So in your case, the driver is shutting down gracefully, but the
executors are not.
Is this the problem?

Thanks
Deepak

On Thu, May 12, 2016 at 11:49 AM, Rakesh H (Marketing Platform-BLR) <
rakes...@flipkart.com> wrote:

> Yes, it is set to true.
> Log of driver :
>
> 16/05/12 10:18:29 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/05/12 10:18:29 INFO streaming.StreamingContext: Invoking 
> stop(stopGracefully=true) from shutdown hook
> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Stopping JobGenerator 
> gracefully
> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waiting for all received 
> blocks to be consumed for job generation
> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waited for all received blocks 
> to be consumed for job generation
>
> Log of executor:
> 16/05/12 10:18:29 ERROR executor.CoarseGrainedExecutorBackend: Driver 
> xx.xx.xx.xx:x disassociated! Shutting down.
> 16/05/12 10:18:29 WARN remote.ReliableDeliverySupervisor: Association with 
> remote system [xx.xx.xx.xx:x] has failed, address is now gated for [5000] 
> ms. Reason: [Disassociated]
> 16/05/12 10:18:29 INFO storage.DiskBlockManager: Shutdown hook called
> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 204 
> //This is value i am logging
> 16/05/12 10:18:29 INFO util.ShutdownHookManager: Shutdown hook called
> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 205
> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 206
>
>
>
>
>
>
> On Thu, May 12, 2016 at 11:45 AM Deepak Sharma 
> wrote:
>
>> Hi Rakesh
>> Did you tried setting *spark.streaming.stopGracefullyOnShutdown to true *for
>> your spark configuration instance?
>> If not try this , and let us know if this helps.
>>
>> Thanks
>> Deepak
>>
>> On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) <
>> rakes...@flipkart.com> wrote:
>>
>>> Issue i am having is similar to the one mentioned here :
>>>
>>> http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn
>>>
>>> I am creating a rdd from sequence of 1 to 300 and creating streaming RDD
>>> out of it.
>>>
>>> val rdd = ssc.sparkContext.parallelize(1 to 300)
>>> val dstream = new ConstantInputDStream(ssc, rdd)
>>> dstream.foreachRDD{ rdd =>
>>>   rdd.foreach{ x =>
>>> log(x)
>>> Thread.sleep(50)
>>>   }
>>> }
>>>
>>>
>>> When i kill this job, i expect elements 1 to 300 to be logged before
>>> shutting down. It is indeed the case when i run it locally. It wait for the
>>> job to finish before shutting down.
>>>
>>> But when i launch the job in custer with "yarn-cluster" mode, it
>>> abruptly shuts down.
>>> Executor prints following log
>>>
>>> ERROR executor.CoarseGrainedExecutorBackend:
>>> Driver xx.xx.xx.xxx:y disassociated! Shutting down.
>>>
>>>  and then it shuts down. It is not a graceful shutdown.
>>>
>>> Anybody knows how to do it in yarn ?
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Graceful shutdown of spark streaming on yarn

2016-05-11 Thread Rakesh H (Marketing Platform-BLR)
Yes, it is set to true.
Log of driver :

16/05/12 10:18:29 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
16/05/12 10:18:29 INFO streaming.StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook
16/05/12 10:18:29 INFO scheduler.JobGenerator: Stopping JobGenerator gracefully
16/05/12 10:18:29 INFO scheduler.JobGenerator: Waiting for all received blocks to be consumed for job generation
16/05/12 10:18:29 INFO scheduler.JobGenerator: Waited for all received blocks to be consumed for job generation

Log of executor:
16/05/12 10:18:29 ERROR executor.CoarseGrainedExecutorBackend: Driver xx.xx.xx.xx:x disassociated! Shutting down.
16/05/12 10:18:29 WARN remote.ReliableDeliverySupervisor: Association with remote system [xx.xx.xx.xx:x] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/05/12 10:18:29 INFO storage.DiskBlockManager: Shutdown hook called
16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 204   // This is the value I am logging
16/05/12 10:18:29 INFO util.ShutdownHookManager: Shutdown hook called
16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 205
16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 206






On Thu, May 12, 2016 at 11:45 AM Deepak Sharma 
wrote:

> Hi Rakesh
> Did you tried setting *spark.streaming.stopGracefullyOnShutdown to true *for
> your spark configuration instance?
> If not try this , and let us know if this helps.
>
> Thanks
> Deepak
>
> On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) <
> rakes...@flipkart.com> wrote:
>
>> Issue i am having is similar to the one mentioned here :
>>
>> http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn
>>
>> I am creating a rdd from sequence of 1 to 300 and creating streaming RDD
>> out of it.
>>
>> val rdd = ssc.sparkContext.parallelize(1 to 300)
>> val dstream = new ConstantInputDStream(ssc, rdd)
>> dstream.foreachRDD{ rdd =>
>>   rdd.foreach{ x =>
>> log(x)
>> Thread.sleep(50)
>>   }
>> }
>>
>>
>> When i kill this job, i expect elements 1 to 300 to be logged before
>> shutting down. It is indeed the case when i run it locally. It wait for the
>> job to finish before shutting down.
>>
>> But when i launch the job in custer with "yarn-cluster" mode, it abruptly
>> shuts down.
>> Executor prints following log
>>
>> ERROR executor.CoarseGrainedExecutorBackend:
>> Driver xx.xx.xx.xxx:y disassociated! Shutting down.
>>
>>  and then it shuts down. It is not a graceful shutdown.
>>
>> Anybody knows how to do it in yarn ?
>>
>>
>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Graceful shutdown of spark streaming on yarn

2016-05-11 Thread Deepak Sharma
Hi Rakesh,
Did you try setting spark.streaming.stopGracefullyOnShutdown to true
on your Spark configuration instance?
If not, try this and let us know if it helps.
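
For reference, a minimal sketch of setting it programmatically (the same
property can also be passed with --conf at submit time; the app name and batch
interval below are just placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("my-streaming-job")                           // placeholder app name
      .set("spark.streaming.stopGracefullyOnShutdown", "true")  // stop the ssc gracefully on shutdown
    val ssc = new StreamingContext(conf, Seconds(5))            // placeholder batch interval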

Thanks
Deepak

On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) <
rakes...@flipkart.com> wrote:

> Issue i am having is similar to the one mentioned here :
>
> http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn
>
> I am creating a rdd from sequence of 1 to 300 and creating streaming RDD
> out of it.
>
> val rdd = ssc.sparkContext.parallelize(1 to 300)
> val dstream = new ConstantInputDStream(ssc, rdd)
> dstream.foreachRDD{ rdd =>
>   rdd.foreach{ x =>
> log(x)
> Thread.sleep(50)
>   }
> }
>
>
> When i kill this job, i expect elements 1 to 300 to be logged before
> shutting down. It is indeed the case when i run it locally. It wait for the
> job to finish before shutting down.
>
> But when i launch the job in custer with "yarn-cluster" mode, it abruptly
> shuts down.
> Executor prints following log
>
> ERROR executor.CoarseGrainedExecutorBackend:
> Driver xx.xx.xx.xxx:y disassociated! Shutting down.
>
>  and then it shuts down. It is not a graceful shutdown.
>
> Anybody knows how to do it in yarn ?
>
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Graceful shutdown of spark streaming on yarn

2016-05-11 Thread Rakesh H (Marketing Platform-BLR)
The issue I am having is similar to the one mentioned here:
http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn

I am creating an RDD from the sequence 1 to 300 and creating a streaming RDD
out of it.

val rdd = ssc.sparkContext.parallelize(1 to 300)
val dstream = new ConstantInputDStream(ssc, rdd)
dstream.foreachRDD { rdd =>
  rdd.foreach { x =>
    log(x)
    Thread.sleep(50)
  }
}


When I kill this job, I expect elements 1 to 300 to be logged before
shutting down. That is indeed the case when I run it locally: it waits for the
job to finish before shutting down.

But when I launch the job on the cluster in "yarn-cluster" mode, it abruptly
shuts down.
The executor prints the following log:

ERROR executor.CoarseGrainedExecutorBackend:
Driver xx.xx.xx.xxx:y disassociated! Shutting down.

and then it shuts down. It is not a graceful shutdown.

Does anybody know how to do this on YARN?


Will this affect the result of spark?

2016-05-11 Thread sunday2000
Hi,
When running Spark in cluster mode, it occasionally gives out this warning
message; will it affect the final result?

  ERROR CoarseGrainedExecutorBackend: Driver 192.168.1.1:45725 disassociated! 
Shutting down.

When start spark-sql, postgresql gives errors.

2016-05-11 Thread Joseph

Hi all,

I use PostgreSQL to store the hive metadata. 

First, I imported a SQL script into the metastore database as follows:
psql -U postgres -d metastore -h 192.168.50.30 -f hive-schema-1.2.0.postgres.sql

Then, when I started $SPARK_HOME/bin/spark-sql, PostgreSQL gave the following errors:

ERROR:  syntax error at or near "@@" at character 5
STATEMENT:  SET @@session.sql_mode=ANSI_QUOTES
ERROR:  relation "v$instance" does not exist at character 21
STATEMENT:  SELECT version FROM v$instance
ERROR:  column "version" does not exist at character 10
STATEMENT:  SELECT @@version

This does not affect normal use, but maybe it is a bug! (I use Spark 1.6.1 and
Hive 1.2.1.)







Joseph


Re:Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
Hi Simon,

Can you describe your problem in more detail?
I suspect that my problem is caused by the window function (or maybe the
groupBy agg functions).
If yours is the same, maybe we should report a bug.






At 2016-05-11 23:46:49, "Simon Schiff [via Apache Spark User List]" 
 wrote:
I have the same Problem with Spark-2.0.0 Snapshot with Streaming. There I use 
Datasets instead of Dataframes. I hope you or someone will find a solution.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26934.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
Sorry, I have to correct myself again. It may still be a memory leak, because
in the end the memory usage goes up again...

Eventually, the streaming program crashed.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26933.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Telmo Rodrigues
I'm building Spark from the branch-1.6 source with mvn -DskipTests package and
I'm running the following code in the spark shell.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._

val df = sqlContext.read.json("persons.json")
val df2 = sqlContext.read.json("cars.json")

df.registerTempTable("t")
df2.registerTempTable("u")

val d3 = sqlContext.sql("select * from t join u on t.id = u.id where t.id = 1")

With the log4j root category level set to WARN, the last printed messages refer
to the execution of the Batch Resolution rules.

=== Result of Batch Resolution ===
!'Project [unresolvedalias(*)]               Project [id#0L,id#1L]
!+- 'Filter ('t.id = 1)                      +- Filter (id#0L = cast(1 as bigint))
!   +- 'Join Inner, Some(('t.id = 'u.id))       +- Join Inner, Some((id#0L = id#1L))
!      :- 'UnresolvedRelation `t`, None            :- Subquery t
!      +- 'UnresolvedRelation `u`, None            :  +- Relation[id#0L] JSONRelation
!                                                  +- Subquery u
!                                                     +- Relation[id#1L] JSONRelation


I think that only the analyser rules are being executed.

Shouldn't the optimiser rules also run in this case?
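
For what it's worth, a minimal sketch of inspecting the plans directly from the
shell instead of relying on log output (DataFrame API as of 1.6; the exact
formatting of the output may differ):

    // explain(true) prints the parsed, analyzed, optimized and physical plans.
    d3.explain(true)

    // Or look at the optimized logical plan object itself.
    println(d3.queryExecution.optimizedPlan)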

2016-05-11 19:24 GMT+01:00 Michael Armbrust :

>
>> logical plan after optimizer execution:
>>
>> Project [id#0L,id#1L]
>> !+- Filter (id#0L = cast(1 as bigint))
>> !   +- Join Inner, Some((id#0L = id#1L))
>> !  :- Subquery t
>> !  :  +- Relation[id#0L] JSONRelation
>> !  +- Subquery u
>> !  +- Relation[id#1L] JSONRelation
>>
>
> I think you are mistaken.  If this was the optimized plan there would be
> no subqueries.
>


Re: Not able pass 3rd party jars to mesos executors

2016-05-11 Thread Raghavendra Pandey
You have kept the 3rd party jars on HDFS. I don't think executors, as of today,
can download jars from HDFS. Can you try with a shared directory? The
application jar is downloaded by executors through the HTTP server.
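
If it helps, a minimal sketch of that workaround (the URL and path below are
hypothetical; the point is simply that every Mesos slave must be able to
resolve the jar locations itself):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mesos-third-party-jars")
      // Equivalent to --jars, but pointing at locations the slaves can reach:
      .setJars(Seq(
        "http://artifact-host.example.com/libs/thirdparty.jar",  // hypothetical HTTP location
        "file:///mnt/shared/libs/thirdparty.jar"))                // or a path mounted on every slave
    val sc = new SparkContext(conf)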

-Raghav
On 12 May 2016 00:04, "Giri P"  wrote:

> Yes..They are  reachable. Application jar which I send as argument is at
> same location as third party jar. Application jar is getting uploaded.
>
> On Wed, May 11, 2016 at 10:51 AM, lalit sharma 
> wrote:
>
>> Point to note as per docs as well :
>>
>> *Note that jars or python files that are passed to spark-submit should be
>> URIs reachable by Mesos slaves, as the Spark driver doesn’t automatically
>> upload local jars.**http://spark.apache.org/docs/latest/running-on-mesos.html
>>  *
>>
>> On Wed, May 11, 2016 at 10:05 PM, Giri P  wrote:
>>
>>> I'm not using docker
>>>
>>> On Wed, May 11, 2016 at 8:47 AM, Raghavendra Pandey <
>>> raghavendra.pan...@gmail.com> wrote:
>>>
 By any chance, are you using docker to execute?
 On 11 May 2016 21:16, "Raghavendra Pandey" <
 raghavendra.pan...@gmail.com> wrote:

> On 11 May 2016 02:13, "gpatcham"  wrote:
>
> >
>
> > Hi All,
> >
> > I'm using --jars option in spark-submit to send 3rd party jars . But
> I don't
> > see they are actually passed to mesos slaves. Getting Noclass found
> > exceptions.
> >
> > This is how I'm using --jars option
> >
> > --jars hdfs://namenode:8082/user/path/to/jar
> >
> > Am I missing something here or what's the correct  way to do ?
> >
> > Thanks
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Not-able-pass-3rd-party-jars-to-mesos-executors-tp26918.html
> > Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>

>>>
>>
>


Re: How to transform a JSON string into a Java HashMap<> java.io.NotSerializableException

2016-05-11 Thread Marcelo Vanzin
Is the class mentioned in the exception below the parent class of the
anonymous "Function" class you're creating?

If so, you may need to make it serializable. Or make your function a
proper "standalone" class (either a nested static class or a top-level
one).
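
To illustrate the same idea in Scala terms (the Java fix is analogous: a static
nested class or a top-level class avoids capturing a reference to the
non-serializable outer object; all names below are made up):

    import org.apache.spark.rdd.RDD

    class Collector {                          // not Serializable, like the class in the stack trace
      def buggy(lines: RDD[String]): RDD[Int] =
        lines.map(line => parse(line))         // the closure captures `this` -> Task not serializable
      def parse(line: String): Int = line.length
    }

    object LineParser extends Serializable {   // standalone helper: nothing from Collector is captured
      def parse(line: String): Int = line.length
    }

    class FixedCollector {
      def ok(lines: RDD[String]): RDD[Int] =
        lines.map(LineParser.parse)            // only the small function itself gets serialized
    }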

On Wed, May 11, 2016 at 3:55 PM, Andy Davidson
 wrote:
> Caused by: java.io.NotSerializableException:
> com.pws.sparkStreaming.collector.SavePowerTrackActivityStream

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to transform a JSON string into a Java HashMap<> java.io.NotSerializableException

2016-05-11 Thread Andy Davidson
I have a streaming app that receives very complicated JSON (a Twitter status).
I would like to work with it as a hash map. It would be very difficult to
define a POJO for this JSON. (I cannot use twitter4j.)
// map json string to map

JavaRDD> jsonMapRDD = powerTrackRDD.map(new Function>() {

    private static final long serialVersionUID = 1L;

    @Override
    public Hashtable call(String line) throws Exception {
        //HashMap hm = JsonUtils.jsonToHashMap(line);
        //Hashtable ret = new Hashtable(hm);
        Hashtable ret = null;
        return ret;
    }});



Using the sqlContext works, however I assume that this is going to be slow
and error prone given that many key/value pairs are likely to be missing.



DataFrame df = sqlContext.read().json(getFilePath().toString());

df.printSchema();



Any suggestions would be greatly appreciated.



Andy



org.apache.spark.SparkException: Task not serializable

at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scal
a:304)

at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$
clean(ClosureCleaner.scala:294)

at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)

at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)

at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:324)

at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:323)

at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:15
0)

at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:11
1)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)

at org.apache.spark.rdd.RDD.map(RDD.scala:323)

at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:96)

at org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:46)

at 
com.pws.sparkStreaming.collector.SavePowerTrackActivityStream.test(SavePower
TrackActivityStream.java:34)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62
)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.
java:50)

at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.j
ava:12)

at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.ja
va:47)

at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.jav
a:17)

at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)

at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.jav
a:78)

at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.jav
a:57)

at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)

at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)

at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)

at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)

at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)

at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26
)

at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)

at org.junit.runners.ParentRunner.run(ParentRunner.java:363)

at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestRef
erence.java:86)

at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:3
8)

at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRu
nner.java:459)

at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRu
nner.java:675)

at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.
java:382)

at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner
.java:192)

Caused by: java.io.NotSerializableException:
com.pws.sparkStreaming.collector.SavePowerTrackActivityStream

Serialization stack:

- object not serializable (class:
com.pws.sparkStreaming.collector.SavePowerTrackActivityStream, value:
com.pws.sparkStreaming.collector.SavePowerTrackActivityStream@f438904)

- field (class: 
com.pws.sparkStreaming.collector.SavePowerTrackActivityStream$1, name:
this$0, type: class
com.pws.sparkStreaming.collector.SavePowerTrackActivityStream)

- object (class 
com.pws.sparkStreaming.collector.SavePowerTrackActivityStream$1,
com.pws.sparkStreaming.collector.SavePowerTrackActivityStream$1@3fa7df1)

- field (class: 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name:
fun$1, type: interface org.apache.spark.api.java.function.Function)

- object (class 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1,
)

at 
org.apache.spark.serializer.SerializationDebugger$.improveException(Serializ
ationDebugger.scala:40)

at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializ
er.scala:47

Re: kryo

2016-05-11 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTtpO0qI3cp06/JodaDateTimeSerializer+spark&subj=Re+NPE+when+using+Joda+DateTime
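
In case a concrete example helps, a minimal sketch of wiring a Joda DateTime
serializer through a custom registrator. This assumes the de.javakaffee
"kryo-serializers" artifact (which provides JodaDateTimeSerializer) is on the
classpath, and the registrator's fully-qualified name below is hypothetical:

    import com.esotericsoftware.kryo.Kryo
    import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
    import org.apache.spark.serializer.KryoRegistrator

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // Teach Kryo how to (de)serialize Joda DateTime instances.
        kryo.register(classOf[org.joda.time.DateTime], new JodaDateTimeSerializer())
      }
    }

and then in spark-defaults.conf (or via --conf):

    spark.serializer       org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator com.example.MyKryoRegistrator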

On Wed, May 11, 2016 at 2:18 PM, Younes Naguib <
younes.nag...@tritondigital.com> wrote:

> Hi all,
>
> I'm trying to get to use spark.serializer.
> I set it in the spark-default.conf, but I statred getting issues with
> datetimes.
>
> As I understand, I need to disable it.
> Anyways to keep using kryo?
>
> It's seems I can use JodaDateTimeSerializer for datetimes, just not sure
> how to set it, and register it in the spark-default conf.
>
> Thanks,
>
> *Younes Naguib* 
>


kryo

2016-05-11 Thread Younes Naguib
Hi all,

I'm trying to start using spark.serializer.
I set it in spark-defaults.conf, but I started getting issues with datetimes.

As I understand it, I need to disable it.
Is there any way to keep using Kryo?

It seems I can use JodaDateTimeSerializer for datetimes; I'm just not sure how to
set it and register it in spark-defaults.conf.

Thanks,
Younes Naguib 


Re: apache spark on gitter?

2016-05-11 Thread Xinh Huynh
Hi Pawel,

I'd like to hear more about your idea. Could you explain more why you would
like to have a gitter channel? What are the advantages over a mailing list
(like this one)? Have you had good experiences using gitter on other open
source projects?

Xinh

On Wed, May 11, 2016 at 11:10 AM, Sean Owen  wrote:

> I don't know of a gitter channel and I don't use it myself, FWIW. I
> think anyone's welcome to start one.
>
> I hesitate to recommend this, simply because it's preferable to have
> one place for discussion rather than split it over several, and, we
> have to keep the @spark.apache.org mailing lists as the "forums of
> records" for project discussions.
>
> If something like gitter doesn't attract any chat, then it doesn't add
> any value. If it does though, then suddenly someone needs to subscribe
> to user@ and gitter to follow all of the conversations.
>
> I think there is a bit of a scalability problem on the user@ list at
> the moment, just because it covers all of Spark. But adding a
> different all-Spark channel doesn't help that.
>
> Anyway maybe that's "why"
>
>
> On Wed, May 11, 2016 at 6:26 PM, Paweł Szulc  wrote:
> > no answer, but maybe one more time, a gitter channel for spark users
> would
> > be a good idea!
> >
> > On Mon, May 9, 2016 at 1:45 PM, Paweł Szulc 
> wrote:
> >>
> >> Hi,
> >>
> >> I was wondering - why Spark does not have a gitter channel?
> >>
> >> --
> >> Regards,
> >> Paul Szulc
> >>
> >> twitter: @rabbitonweb
> >> blog: www.rabbitonweb.com
> >
> >
> >
> >
> > --
> > Regards,
> > Paul Szulc
> >
> > twitter: @rabbitonweb
> > blog: www.rabbitonweb.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Setting Spark Worker Memory

2016-05-11 Thread Mich Talebzadeh
Run jps like below:

 jps
19724 SparkSubmit
10612 Worker

and do ps awx | grep <PID>

for each number that represents these two descriptions, something like:

ps awx|grep 30208
30208 pts/2    Sl+    1:05 /usr/java/latest/bin/java -cp
/home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar:/usr/lib/spark-1.6.1-bin-hadoop2.6/conf/:/usr/lib/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/usr/lib/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/usr/lib/spark-1.6.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/hduser/hadoop-2.6.0/etc/hadoop/
-Xms4g -Xmx4g org.apache.spark.deploy.SparkSubmit --master
spark://50.140.197.217:7077  --conf
spark.driver.memory=4g --class CEP_streaming --num-executors 1
--executor-memory 4G --executor-cores 2 --packages
com.databricks:spark-csv_2.11:1.3.0 --jars
/home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar

Also send the output of the OS command for available memory:

free
             total       used       free     shared    buffers     cached
Mem:      24546308   24398672     147636          0     347464   17130900
-/+ buffers/cache:    6920308   17626000
Swap:      2031608     226288    1805320

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 11 May 2016 at 19:47, شجاع الرحمن بیگ  wrote:

> yes, i m running this as standalone mode.
>
> On Wed, May 11, 2016 at 6:23 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> are you running this in standalone  mode? that is one physical host, and
>> the executor will live inside the driver.
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 11 May 2016 at 16:45, شجاع الرحمن بیگ  wrote:
>>
>>> yes,
>>>
>>> On Wed, May 11, 2016 at 5:43 PM, Deepak Sharma 
>>> wrote:
>>>
 Since you are registering workers from the same node , do you have
 enough cores and RAM(In this case >=9 cores and > = 24 GB ) on this
 node(11.14.224.24)?

 Thanks
 Deepak

 On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ >>> > wrote:

> Hi All,
>
> I need to set same memory and core for each worker on same machine and
> for this purpose, I have set the following properties in conf/spark-env.sh
>
> export SPARK_EXECUTOR_INSTANCE=3
> export SPARK_WORKER_CORES=3
> export SPARK_WORKER_MEMORY=8g
>
> but only one worker is getting desired memory and cores and other two
> are not. here is the log of master.
>
> ...
> 6/05/11 17:04:40 INFO Master: I have been elected leader! New state:
> ALIVE
> 16/05/11 17:04:43 INFO Master: Registering worker 11.14.224.24:53923
> with 3 cores, 8.0 GB RAM
> 16/05/11 17:04:49 INFO Master: Registering worker 11.14.224.24:55072
> with 2 cores, 1020.7 GB RAM
> 16/05/11 17:05:07 INFO Master: Registering worker 11.14.224.24:49702
> with 2 cores, 1020.7 GB RAM
> ...
>
>
> Could you please let me know the solution
>
> Thanks
> Shuja
>
> --
> Regards
> Shuja-ur-Rehman Baig
> 
>



 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net

>>>
>>>
>>>
>>> --
>>> Regards
>>> Shuja-ur-Rehman Baig
>>> 
>>>
>>
>>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> 
>


Re: Setting Spark Worker Memory

2016-05-11 Thread شجاع الرحمن بیگ
Yes, I am running this in standalone mode.

On Wed, May 11, 2016 at 6:23 PM, Mich Talebzadeh 
wrote:

> are you running this in standalone  mode? that is one physical host, and
> the executor will live inside the driver.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 11 May 2016 at 16:45, شجاع الرحمن بیگ  wrote:
>
>> yes,
>>
>> On Wed, May 11, 2016 at 5:43 PM, Deepak Sharma 
>> wrote:
>>
>>> Since you are registering workers from the same node , do you have
>>> enough cores and RAM(In this case >=9 cores and > = 24 GB ) on this
>>> node(11.14.224.24)?
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ 
>>> wrote:
>>>
 Hi All,

 I need to set same memory and core for each worker on same machine and
 for this purpose, I have set the following properties in conf/spark-env.sh

 export SPARK_EXECUTOR_INSTANCE=3
 export SPARK_WORKER_CORES=3
 export SPARK_WORKER_MEMORY=8g

 but only one worker is getting desired memory and cores and other two
 are not. here is the log of master.

 ...
 6/05/11 17:04:40 INFO Master: I have been elected leader! New state:
 ALIVE
 16/05/11 17:04:43 INFO Master: Registering worker 11.14.224.24:53923
 with 3 cores, 8.0 GB RAM
 16/05/11 17:04:49 INFO Master: Registering worker 11.14.224.24:55072
 with 2 cores, 1020.7 GB RAM
 16/05/11 17:05:07 INFO Master: Registering worker 11.14.224.24:49702
 with 2 cores, 1020.7 GB RAM
 ...


 Could you please let me know the solution

 Thanks
 Shuja

 --
 Regards
 Shuja-ur-Rehman Baig
 

>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Regards
>> Shuja-ur-Rehman Baig
>> 
>>
>
>


-- 
Regards
Shuja-ur-Rehman Baig



How to take executor memory dump

2016-05-11 Thread Nirav Patel
Hi,

I am hitting OutOfMemoryError issues with Spark executors. It happens mainly
during shuffle: executors get killed with OutOfMemoryError. I have tried
setting spark.executor.extraJavaOptions to take a memory dump, but it's not
happening.

spark.executor.extraJavaOptions = "-XX:+UseCompressedOops
-XX:-HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -9 %p; jmap
-heap %p > /home/mycorp/npatel/jmap_%p'
-XX:HeapDumpPath=/opt/cores/spark -XX:+UseG1GC -verbose:gc
-XX:+PrintGCDetails
-Xloggc:/home/mycorp/npatel/insights-jobs/gclogs/gc_%p.log
-XX:+PrintGCTimeStamps"

The following is what I see repeatedly in the YARN application logs after the job fails.

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p
kill -9 %p; jmap -heap %p"
#   Executing /bin/sh -c "kill 30434
kill -9 30434"...

From the above logs it looks like the Spark executor by default has
'-XX:OnOutOfMemoryError=kill %p' and then incorrectly appends my custom
arguments.


The following is the Linux process info for one particular executor, which
confirms the above.

mycorp   29113 29109 99 08:56 ?04:13:46
/usr/java/jdk1.7.0_51/bin/java -Dorg.jboss.netty.epollBugWorkaround=true
-server -XX:OnOutOfMemoryError=kill %p -Xms23000m -Xmx23000m
-XX:+UseCompressedOops -XX:NewRatio=2 -XX:ConcGCThreads=2
-XX:ParallelGCThreads=2 -XX:-HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill
-9 %p; jmap -heap %p > /home/mycorp/npatel/jmap_%p
-XX:HeapDumpPath=/opt/cores/spark
-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails
-Xloggc:/home/mycorp/npatel/gclogs/gc%p.log -XX:+PrintGCTimeStamps
-Djava.io.tmpdir=/tmp/hadoop-mycorp/nm-local-dir/usercache/mycorp/appcache/application_1461196034441_24756/container_1461196034441_24756_01_12/tmp
-Dspark.driver.port=43095 -Dspark.akka.threads=32
-Dspark.yarn.app.container.log.dir=/opt/mapr/hadoop/hadoop-2.5.1/logs/userlogs/application_1461196034441_24756/container_1461196034441_24756_01_12
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
akka.tcp://sparkDriver@10.250.70.116:43095/user/CoarseGrainedScheduler
--executor-id 11 --hostname hdn1.mycorpcorporation.local --cores 6 --app-id
application_1461196034441_24756 --user-class-path
file:/tmp/hadoop-mycorp/nm-local-dir/usercache/mycorp/appcache/application_1461196034441_24756/container_1461196034441_24756_01_12/__app__.jar


I also tried taking a dump of a running executor using jmap -dump, but it fails
with an exception in the middle. It still generates some dump if I use the -F
option; however, that file seems corrupted and does not load into Eclipse MAT
or VisualVM.

So what is the correct way to set these executor opts and ultimately take an
executor memory dump?

More specifically:

1) Take a heap dump in a particular location, with the application id and
process id in the file name.
2) Put GC logs in a particular location, with the application id and process id
in the file name. Currently it does this, but with a literal %p in the file name.
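
For what it's worth, one alternative (a sketch only, using standard HotSpot
flags rather than anything Spark-specific): let the JVM write the dump itself on
OOM. The file name then contains the pid (java_pid<pid>.hprof), though the
application id still has to come from the directory layout; the path below is
illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-XX:+HeapDumpOnOutOfMemoryError " +     // the JVM writes java_pid<pid>.hprof by itself
        "-XX:HeapDumpPath=/opt/cores/spark " +   // directory must exist on every node
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")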

Thanks

-- 





Is this possible to do in spark ?

2016-05-11 Thread Pradeep Nayak
Hi -

I have a rather unique problem which I am trying to solve, and I am not sure
whether Spark would help here.

I have a directory: /X/Y/a.txt and in the same structure /X/Y/Z/b.txt.

a.txt contains a unique serial number, say:
12345

and b.txt contains key value pairs.
a,1
b,1,
c,0 etc.

Every day you receive data for a system Y, so there are multiple a.txt and
b.txt files for a serial number. The serial number doesn't change and is the
key. There are multiple systems, and the data for a whole year is available
and it's huge.

I am trying to generate a report of the unique serial numbers where the value
of option a has changed to 1 over the last few months (let's say the default
is 0), and also to figure out how many times it was toggled.


I am not sure how to read two text files in Spark at the same time and
associate them with the serial number. Is there a way of doing this in place,
given that we know the directory structure? Or should we be transforming the
data anyway to solve this?
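
Not sure whether it fits the layout exactly, but one possible approach is
sc.wholeTextFiles: key every file by the directory it came from, then join each
a.txt with the b.txt one level below it. A rough sketch, assuming an existing
SparkContext sc and glob patterns that match the layout described above (both
the globs and the parsing are assumptions):

    def parentDir(path: String): String = path.substring(0, path.lastIndexOf('/'))

    // a.txt keyed by its own directory (/X/Y), value = the serial number inside it
    val serials = sc.wholeTextFiles("/X/*/a.txt")
      .map { case (path, content) => (parentDir(path), content.trim) }

    // b.txt keyed by the directory two levels up (/X/Y), value = its key/value lines
    val options = sc.wholeTextFiles("/X/*/*/b.txt")
      .map { case (path, content) => (parentDir(parentDir(path)), content) }

    // (serial, value-of-option-a) for every b.txt belonging to that serial
    val aValues = serials.join(options).flatMap { case (_, (serial, kvText)) =>
      kvText.split("\n").map(_.trim.split(","))
        .collect { case Array("a", v) => (serial, v.trim.toInt) }
    }

    // Serials where option a was ever seen as 1; counting actual toggles would
    // additionally need a date extracted from each path to order the values.
    val everFlipped = aValues.filter(_._2 == 1).keys.distinct()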


Re: Not able pass 3rd party jars to mesos executors

2016-05-11 Thread Giri P
Yes, they are reachable. The application jar, which I send as an argument, is
at the same location as the third-party jar, and the application jar is getting
uploaded.

On Wed, May 11, 2016 at 10:51 AM, lalit sharma 
wrote:

> Point to note as per docs as well :
>
> *Note that jars or python files that are passed to spark-submit should be
> URIs reachable by Mesos slaves, as the Spark driver doesn’t automatically
> upload local jars.**http://spark.apache.org/docs/latest/running-on-mesos.html
>  *
>
> On Wed, May 11, 2016 at 10:05 PM, Giri P  wrote:
>
>> I'm not using docker
>>
>> On Wed, May 11, 2016 at 8:47 AM, Raghavendra Pandey <
>> raghavendra.pan...@gmail.com> wrote:
>>
>>> By any chance, are you using docker to execute?
>>> On 11 May 2016 21:16, "Raghavendra Pandey" 
>>> wrote:
>>>
 On 11 May 2016 02:13, "gpatcham"  wrote:

 >

 > Hi All,
 >
 > I'm using --jars option in spark-submit to send 3rd party jars . But
 I don't
 > see they are actually passed to mesos slaves. Getting Noclass found
 > exceptions.
 >
 > This is how I'm using --jars option
 >
 > --jars hdfs://namenode:8082/user/path/to/jar
 >
 > Am I missing something here or what's the correct  way to do ?
 >
 > Thanks
 >
 >
 >
 > --
 > View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Not-able-pass-3rd-party-jars-to-mesos-executors-tp26918.html
 > Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 >
 > -
 > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 > For additional commands, e-mail: user-h...@spark.apache.org
 >

>>>
>>
>


Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Michael Armbrust
>
>
> logical plan after optimizer execution:
>
> Project [id#0L,id#1L]
> !+- Filter (id#0L = cast(1 as bigint))
> !   +- Join Inner, Some((id#0L = id#1L))
> !  :- Subquery t
> !  :  +- Relation[id#0L] JSONRelation
> !  +- Subquery u
> !  +- Relation[id#1L] JSONRelation
>

I think you are mistaken.  If this was the optimized plan there would be no
subqueries.


Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-11 Thread Amit Sela
Somehow missed that ;)
Anything about the Datasets slowness?

On Wed, May 11, 2016, 21:02 Ted Yu  wrote:

> Which release are you using ?
>
> You can use the following to disable UI:
> --conf spark.ui.enabled=false
>
> On Wed, May 11, 2016 at 10:59 AM, Amit Sela  wrote:
>
>> I've ran a simple WordCount example with a very small List as
>> input lines and ran it in standalone (local[*]), and Datasets is very slow..
>> We're talking ~700 msec for RDDs while Datasets takes ~3.5 sec.
>> Is this just start-up overhead ? please note that I'm not timing the
>> context creation...
>>
>> And in general, is there a way to run with local[*] "lightweight" mode
>> for testing ? something like without the WebUI server for example (and
>> anything else that's not needed for testing purposes)
>>
>> Thanks,
>> Amit
>>
>
>


Re: apache spark on gitter?

2016-05-11 Thread Sean Owen
I don't know of a gitter channel and I don't use it myself, FWIW. I
think anyone's welcome to start one.

I hesitate to recommend this, simply because it's preferable to have
one place for discussion rather than split it over several, and, we
have to keep the @spark.apache.org mailing lists as the "forums of
records" for project discussions.

If something like gitter doesn't attract any chat, then it doesn't add
any value. If it does though, then suddenly someone needs to subscribe
to user@ and gitter to follow all of the conversations.

I think there is a bit of a scalability problem on the user@ list at
the moment, just because it covers all of Spark. But adding a
different all-Spark channel doesn't help that.

Anyway maybe that's "why"


On Wed, May 11, 2016 at 6:26 PM, Paweł Szulc  wrote:
> no answer, but maybe one more time, a gitter channel for spark users would
> be a good idea!
>
> On Mon, May 9, 2016 at 1:45 PM, Paweł Szulc  wrote:
>>
>> Hi,
>>
>> I was wondering - why Spark does not have a gitter channel?
>>
>> --
>> Regards,
>> Paul Szulc
>>
>> twitter: @rabbitonweb
>> blog: www.rabbitonweb.com
>
>
>
>
> --
> Regards,
> Paul Szulc
>
> twitter: @rabbitonweb
> blog: www.rabbitonweb.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-11 Thread Ted Yu
Which release are you using ?

You can use the following to disable UI:
--conf spark.ui.enabled=false
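
And for completeness, the same thing from code when building a context for
tests (spark.ui.enabled is the relevant switch; everything else below is a
placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("unit-tests")
      .set("spark.ui.enabled", "false")   // don't start the web UI for test runs
    val sc = new SparkContext(conf)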

On Wed, May 11, 2016 at 10:59 AM, Amit Sela  wrote:

> I've ran a simple WordCount example with a very small List as
> input lines and ran it in standalone (local[*]), and Datasets is very slow..
> We're talking ~700 msec for RDDs while Datasets takes ~3.5 sec.
> Is this just start-up overhead ? please note that I'm not timing the
> context creation...
>
> And in general, is there a way to run with local[*] "lightweight" mode for
> testing ? something like without the WebUI server for example (and anything
> else that's not needed for testing purposes)
>
> Thanks,
> Amit
>


Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-11 Thread Amit Sela
I ran a simple WordCount example with a very small List as the input lines
in standalone mode (local[*]), and Datasets is very slow: we're talking
~700 msec for RDDs while Datasets take ~3.5 sec.
Is this just start-up overhead? Please note that I'm not timing the
context creation...

And in general, is there a way to run local[*] in a "lightweight" mode for
testing? Something like running without the WebUI server, for example (and
anything else that's not needed for testing purposes).

Thanks,
Amit


Re: Not able pass 3rd party jars to mesos executors

2016-05-11 Thread lalit sharma
A point to note, as per the docs as well:

"Note that jars or python files that are passed to spark-submit should be
URIs reachable by Mesos slaves, as the Spark driver doesn't automatically
upload local jars."
http://spark.apache.org/docs/latest/running-on-mesos.html

On Wed, May 11, 2016 at 10:05 PM, Giri P  wrote:

> I'm not using docker
>
> On Wed, May 11, 2016 at 8:47 AM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> By any chance, are you using docker to execute?
>> On 11 May 2016 21:16, "Raghavendra Pandey" 
>> wrote:
>>
>>> On 11 May 2016 02:13, "gpatcham"  wrote:
>>>
>>> >
>>>
>>> > Hi All,
>>> >
>>> > I'm using --jars option in spark-submit to send 3rd party jars . But I
>>> don't
>>> > see they are actually passed to mesos slaves. Getting Noclass found
>>> > exceptions.
>>> >
>>> > This is how I'm using --jars option
>>> >
>>> > --jars hdfs://namenode:8082/user/path/to/jar
>>> >
>>> > Am I missing something here or what's the correct  way to do ?
>>> >
>>> > Thanks
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Not-able-pass-3rd-party-jars-to-mesos-executors-tp26918.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>
>


Re: apache spark on gitter?

2016-05-11 Thread Paweł Szulc
No answer, but maybe one more time: a gitter channel for Spark users would
be a good idea!

On Mon, May 9, 2016 at 1:45 PM, Paweł Szulc  wrote:

> Hi,
>
> I was wondering - why Spark does not have a gitter channel?
>
> --
> Regards,
> Paul Szulc
>
> twitter: @rabbitonweb
> blog: www.rabbitonweb.com
>



-- 
Regards,
Paul Szulc

twitter: @rabbitonweb
blog: www.rabbitonweb.com


Re: Save DataFrame to HBase

2016-05-11 Thread Ted Yu
Please note:
The name of the HBase table is specified in:

  def writeCatalog = s"""{
|"table":{"namespace":"default", "name":"table1"},

not by the:

HBaseTableCatalog.newTable -> "5"

FYI
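
For anyone following along, a rough sketch of what a fuller catalog looks like
(field names follow the DefaultSourceSuite referenced below; treat the exact
format and the column definitions as version-dependent):

    def writeCatalog = s"""{
        |"table":{"namespace":"default", "name":"table1"},
        |"rowkey":"key",
        |"columns":{
        |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
        |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
        |"col2":{"cf":"cf2", "col":"col2", "type":"double"}
        |}
        |}""".stripMargin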

On Tue, May 10, 2016 at 3:11 PM, Ted Yu  wrote:

> I think so.
>
> Please refer to the table population tests in (master branch):
>
> hbase-spark/src/test/scala/org/apache/hadoop/hbase/spark/DefaultSourceSuite.scala
>
> Cheers
>
> On Tue, May 10, 2016 at 2:53 PM, Benjamin Kim  wrote:
>
>> Ted,
>>
>> Will the hbase-spark module allow for creating tables in Spark SQL that
>> reference the hbase tables underneath? In this way, users can query using
>> just SQL.
>>
>> Thanks,
>> Ben
>>
>> On Apr 28, 2016, at 3:09 AM, Ted Yu  wrote:
>>
>> Hbase 2.0 release likely would come after Spark 2.0 release.
>>
>> There're other features being developed in hbase 2.0
>> I am not sure when hbase 2.0 would be released.
>>
>> The refguide is incomplete.
>> Zhan has assigned the doc JIRA to himself. The documentation would be
>> done after fixing bugs in hbase-spark module.
>>
>> Cheers
>>
>> On Apr 27, 2016, at 10:31 PM, Benjamin Kim  wrote:
>>
>> Hi Ted,
>>
>> Do you know when the release will be? I also see some documentation for
>> usage of the hbase-spark module at the hbase website. But, I don’t see an
>> example on how to save data. There is only one for reading/querying data.
>> Will this be added when the final version does get released?
>>
>> Thanks,
>> Ben
>>
>> On Apr 21, 2016, at 6:56 AM, Ted Yu  wrote:
>>
>> The hbase-spark module in Apache HBase (coming with hbase 2.0 release)
>> can do this.
>>
>> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim  wrote:
>>
>>> Has anyone found an easy way to save a DataFrame into HBase?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>>
>


Re: dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread James Hammerton
This may be related to: https://issues.apache.org/jira/browse/SPARK-13773

Regards,

James

On 11 May 2016 at 15:49, Ted Yu  wrote:

> In master branch, behavior is the same.
>
> Suggest opening a JIRA if you haven't done so.
>
> On Wed, May 11, 2016 at 6:55 AM, Tony Jin  wrote:
>
>> Hi guys,
>>
>> I have a problem about spark DataFrame. My spark version is 1.6.1.
>> Basically, i used udf and df.withColumn to create a "new" column, and
>> then i filter the values on this new columns and call show(action). I see
>> the udf function (which is used to by withColumn to create the new column)
>> is called twice(duplicated). And if filter on "old" column, udf only run
>> once which is expected. I attached the example codes, line 30~38 shows the
>> problem.
>>
>>  Anyone knows the internal reason? Can you give me any advices? Thank you
>> very much.
>>
>>
>> scala> import org.apache.spark.sql.functions._
>> import org.apache.spark.sql.functions._
>>
>> scala> val df = sc.parallelize(Seq(("a", "b"), ("a1", 
>> "b1"))).toDF("old","old1")
>> df: org.apache.spark.sql.DataFrame = [old: string, old1: string]
>>
>> scala> val udfFunc = udf((s: String) => {println(s"running udf($s)"); s })
>> udfFunc: org.apache.spark.sql.UserDefinedFunction = 
>> UserDefinedFunction(,StringType,List(StringType))
>>
>> scala> val newDF = df.withColumn("new", udfFunc(df("old")))
>> newDF: org.apache.spark.sql.DataFrame = [old: string, old1: string, new: 
>> string]
>>
>> scala> newDF.show
>> running udf(a)
>> running udf(a1)
>> +---++---+
>> |old|old1|new|
>> +---++---+
>> |  a|   b|  a|
>> | a1|  b1| a1|
>> +---++---+
>>
>>
>> scala> val filteredOnNewColumnDF = newDF.filter("new <> 'a1'")
>> filteredOnNewColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: 
>> string, new: string]
>>
>> scala> val filteredOnOldColumnDF = newDF.filter("old <> 'a1'")
>> filteredOnOldColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: 
>> string, new: string]
>>
>> scala> filteredOnNewColumnDF.show
>> running udf(a)
>> running udf(a)
>> running udf(a1)
>> +---++---+
>> |old|old1|new|
>> +---++---+
>> |  a|   b|  a|
>> +---++---+
>>
>>
>> scala> filteredOnOldColumnDF.show
>> running udf(a)
>> +---++---+
>> |old|old1|new|
>> +---++---+
>> |  a|   b|  a|
>> +---++---+
>>
>>
>>
>> Best wishes.
>> By Linbo
>>
>>
>


Re: Not able pass 3rd party jars to mesos executors

2016-05-11 Thread Giri P
I'm not using docker

On Wed, May 11, 2016 at 8:47 AM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> By any chance, are you using docker to execute?
> On 11 May 2016 21:16, "Raghavendra Pandey" 
> wrote:
>
>> On 11 May 2016 02:13, "gpatcham"  wrote:
>>
>> >
>>
>> > Hi All,
>> >
>> > I'm using --jars option in spark-submit to send 3rd party jars . But I
>> don't
>> > see they are actually passed to mesos slaves. Getting Noclass found
>> > exceptions.
>> >
>> > This is how I'm using --jars option
>> >
>> > --jars hdfs://namenode:8082/user/path/to/jar
>> >
>> > Am I missing something here or what's the correct  way to do ?
>> >
>> > Thanks
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Not-able-pass-3rd-party-jars-to-mesos-executors-tp26918.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>


Re: Setting Spark Worker Memory

2016-05-11 Thread Mich Talebzadeh
Are you running this in standalone mode? That is, one physical host, and
the executor will live inside the driver.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 11 May 2016 at 16:45, شجاع الرحمن بیگ  wrote:

> yes,
>
> On Wed, May 11, 2016 at 5:43 PM, Deepak Sharma 
> wrote:
>
>> Since you are registering workers from the same node , do you have enough
>> cores and RAM(In this case >=9 cores and > = 24 GB ) on this
>> node(11.14.224.24)?
>>
>> Thanks
>> Deepak
>>
>> On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ 
>> wrote:
>>
>>> Hi All,
>>>
>>> I need to set same memory and core for each worker on same machine and
>>> for this purpose, I have set the following properties in conf/spark-env.sh
>>>
>>> export SPARK_EXECUTOR_INSTANCE=3
>>> export SPARK_WORKER_CORES=3
>>> export SPARK_WORKER_MEMORY=8g
>>>
>>> but only one worker is getting desired memory and cores and other two
>>> are not. here is the log of master.
>>>
>>> ...
>>> 6/05/11 17:04:40 INFO Master: I have been elected leader! New state:
>>> ALIVE
>>> 16/05/11 17:04:43 INFO Master: Registering worker 11.14.224.24:53923
>>> with 3 cores, 8.0 GB RAM
>>> 16/05/11 17:04:49 INFO Master: Registering worker 11.14.224.24:55072
>>> with 2 cores, 1020.7 GB RAM
>>> 16/05/11 17:05:07 INFO Master: Registering worker 11.14.224.24:49702
>>> with 2 cores, 1020.7 GB RAM
>>> ...
>>>
>>>
>>> Could you please let me know the solution
>>>
>>> Thanks
>>> Shuja
>>>
>>> --
>>> Regards
>>> Shuja-ur-Rehman Baig
>>> 
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> 
>


Spark on DSE Cassandra with multiple data centers

2016-05-11 Thread Simone Franzini
I am running Spark on DSE Cassandra with multiple analytics data centers.
It is my understanding that with this setup you should have a CFS file
system for each data center. I was able to create an additional CFS file
system as described here:
http://docs.datastax.com/en/latest-dse/datastax_enterprise/ana/anaCFS.html
I verified that the additional CFS file system is created properly.

I am now following the instructions here to configure Spark on the second
data center to use its own CFS:
http://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/sparkConfHistoryServer.html
However, running:
dse hadoop fs -mkdir :/spark/events
fails with:
WARN You are going to access CFS keyspace: cfs in data center:
. It will not work because the replication
factor for this keyspace in this data center is 0.

Bad connection to FS. command aborted. exception: UnavailableException()

That is, it appears that the  in the hadoop command is
being ignored and it is trying to connect to cfs: rather than
additional_cfs.

Has anybody else run into this?


Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini


Re: Not able pass 3rd party jars to mesos executors

2016-05-11 Thread Raghavendra Pandey
By any chance, are you using docker to execute?
On 11 May 2016 21:16, "Raghavendra Pandey" 
wrote:

> On 11 May 2016 02:13, "gpatcham"  wrote:
>
> >
>
> > Hi All,
> >
> > I'm using --jars option in spark-submit to send 3rd party jars . But I
> don't
> > see they are actually passed to mesos slaves. Getting Noclass found
> > exceptions.
> >
> > This is how I'm using --jars option
> >
> > --jars hdfs://namenode:8082/user/path/to/jar
> >
> > Am I missing something here or what's the correct  way to do ?
> >
> > Thanks
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Not-able-pass-3rd-party-jars-to-mesos-executors-tp26918.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Re: Setting Spark Worker Memory

2016-05-11 Thread شجاع الرحمن بیگ
yes,

On Wed, May 11, 2016 at 5:43 PM, Deepak Sharma 
wrote:

> Since you are registering workers from the same node , do you have enough
> cores and RAM(In this case >=9 cores and > = 24 GB ) on this
> node(11.14.224.24)?
>
> Thanks
> Deepak
>
> On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ 
> wrote:
>
>> Hi All,
>>
>> I need to set same memory and core for each worker on same machine and
>> for this purpose, I have set the following properties in conf/spark-env.sh
>>
>> export SPARK_EXECUTOR_INSTANCE=3
>> export SPARK_WORKER_CORES=3
>> export SPARK_WORKER_MEMORY=8g
>>
>> but only one worker is getting desired memory and cores and other two are
>> not. here is the log of master.
>>
>> ...
>> 6/05/11 17:04:40 INFO Master: I have been elected leader! New state: ALIVE
>> 16/05/11 17:04:43 INFO Master: Registering worker 11.14.224.24:53923
>> with 3 cores, 8.0 GB RAM
>> 16/05/11 17:04:49 INFO Master: Registering worker 11.14.224.24:55072
>> with 2 cores, 1020.7 GB RAM
>> 16/05/11 17:05:07 INFO Master: Registering worker 11.14.224.24:49702
>> with 2 cores, 1020.7 GB RAM
>> ...
>>
>>
>> Could you please let me know the solution
>>
>> Thanks
>> Shuja
>>
>> --
>> Regards
>> Shuja-ur-Rehman Baig
>> 
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



-- 
Regards
Shuja-ur-Rehman Baig



Re: Not able pass 3rd party jars to mesos executors

2016-05-11 Thread Raghavendra Pandey
On 11 May 2016 02:13, "gpatcham"  wrote:

>

> Hi All,
>
> I'm using --jars option in spark-submit to send 3rd party jars . But I
don't
> see they are actually passed to mesos slaves. Getting Noclass found
> exceptions.
>
> This is how I'm using --jars option
>
> --jars hdfs://namenode:8082/user/path/to/jar
>
> Am I missing something here or what's the correct  way to do ?
>
> Thanks
>
>
>
> --
> View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Not-able-pass-3rd-party-jars-to-mesos-executors-tp26918.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Setting Spark Worker Memory

2016-05-11 Thread Deepak Sharma
Since you are registering workers from the same node, do you have enough
cores and RAM (in this case >= 9 cores and >= 24 GB) on this
node (11.14.224.24)?

Thanks
Deepak

On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ 
wrote:

> Hi All,
>
> I need to set same memory and core for each worker on same machine and for
> this purpose, I have set the following properties in conf/spark-env.sh
>
> export SPARK_EXECUTOR_INSTANCE=3
> export SPARK_WORKER_CORES=3
> export SPARK_WORKER_MEMORY=8g
>
> but only one worker is getting desired memory and cores and other two are
> not. here is the log of master.
>
> ...
> 6/05/11 17:04:40 INFO Master: I have been elected leader! New state: ALIVE
> 16/05/11 17:04:43 INFO Master: Registering worker 11.14.224.24:53923 with
> 3 cores, 8.0 GB RAM
> 16/05/11 17:04:49 INFO Master: Registering worker 11.14.224.24:55072 with
> 2 cores, 1020.7 GB RAM
> 16/05/11 17:05:07 INFO Master: Registering worker 11.14.224.24:49702 with
> 2 cores, 1020.7 GB RAM
> ...
>
>
> Could you please let me know the solution
>
> Thanks
> Shuja
>
> --
> Regards
> Shuja-ur-Rehman Baig
> 
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Setting Spark Worker Memory

2016-05-11 Thread شجاع الرحمن بیگ
Hi All,

I need to set the same memory and cores for each worker on the same machine, and
for this purpose I have set the following properties in conf/spark-env.sh:

export SPARK_EXECUTOR_INSTANCE=3
export SPARK_WORKER_CORES=3
export SPARK_WORKER_MEMORY=8g

but only one worker is getting the desired memory and cores and the other two are
not. Here is the master's log:

...
6/05/11 17:04:40 INFO Master: I have been elected leader! New state: ALIVE
16/05/11 17:04:43 INFO Master: Registering worker 11.14.224.24:53923 with 3
cores, 8.0 GB RAM
16/05/11 17:04:49 INFO Master: Registering worker 11.14.224.24:55072 with 2
cores, 1020.7 GB RAM
16/05/11 17:05:07 INFO Master: Registering worker 11.14.224.24:49702 with 2
cores, 1020.7 GB RAM
...


Could you please let me know the solution?

Thanks
Shuja

-- 
Regards
Shuja-ur-Rehman Baig
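
A note on the configuration above: in standalone mode the variable that controls how
many workers are launched per node is SPARK_WORKER_INSTANCES, not
SPARK_EXECUTOR_INSTANCE, and SPARK_WORKER_CORES / SPARK_WORKER_MEMORY apply to each
worker. A minimal conf/spark-env.sh sketch for three identical workers on one node,
given as an illustration rather than the poster's verified fix:

export SPARK_WORKER_INSTANCES=3   # number of worker processes started on this node
export SPARK_WORKER_CORES=3       # cores given to each worker
export SPARK_WORKER_MEMORY=8g     # memory given to each worker

The workers have to be restarted after editing the file (for example sbin/stop-all.sh
followed by sbin/start-all.sh), and, as Deepak points out in the thread, three such
workers need at least 9 free cores and 24 GB of RAM on the node.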



Re: Spark 1.6.0: substring on df.select

2016-05-11 Thread Raghavendra Pandey
You can create a column with the count of "/". Then take the max of it and
create that many columns for every row, with null fillers.

Raghav
On 11 May 2016 20:37, "Bharathi Raja"  wrote:

Hi,



I have a dataframe column col1 with values something like
“/client/service/version/method”. The number of “/” are not constant.

Could you please help me to extract all methods from the column col1?



In Pig i used SUBSTRING with LAST_INDEX_OF(“/”).



Thanks in advance.

Regards,

Raja


Spark 1.6.0: substring on df.select

2016-05-11 Thread Bharathi Raja
Hi,

I have a dataframe column col1 with values something like
“/client/service/version/method”. The number of “/” is not constant.
Could you please help me to extract all methods from the column col1?

In Pig I used SUBSTRING with LAST_INDEX_OF(“/”).

Thanks in advance.
Regards,
Raja
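
One way to do this with DataFrame functions in Spark 1.6 is to keep everything after
the last "/", in the spirit of Pig's SUBSTRING + LAST_INDEX_OF. A spark-shell sketch
with illustrative sample data, not code from the thread:

import org.apache.spark.sql.functions.regexp_extract

val df = sqlContext.createDataFrame(Seq(
  Tuple1("/client/service/version/method"),
  Tuple1("/a/b/c/otherMethod")
)).toDF("col1")

// "([^/]+)$" captures the trailing segment of the path, i.e. the method name
val withMethod = df.withColumn("method", regexp_extract(df("col1"), "([^/]+)$", 1))
withMethod.show(false)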


Re: dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread Ted Yu
In master branch, behavior is the same.

Suggest opening a JIRA if you haven't done so.

On Wed, May 11, 2016 at 6:55 AM, Tony Jin  wrote:

> Hi guys,
>
> I have a problem about spark DataFrame. My spark version is 1.6.1.
> Basically, i used udf and df.withColumn to create a "new" column, and then
> i filter the values on this new columns and call show(action). I see the
> udf function (which is used to by withColumn to create the new column) is
> called twice(duplicated). And if filter on "old" column, udf only run once
> which is expected. I attached the example codes, line 30~38 shows the
> problem.
>
>  Anyone knows the internal reason? Can you give me any advices? Thank you
> very much.
>
>
>
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
>
> scala> val df = sc.parallelize(Seq(("a", "b"), ("a1", 
> "b1"))).toDF("old","old1")
> df: org.apache.spark.sql.DataFrame = [old: string, old1: string]
>
> scala> val udfFunc = udf((s: String) => {println(s"running udf($s)"); s })
> udfFunc: org.apache.spark.sql.UserDefinedFunction = 
> UserDefinedFunction(,StringType,List(StringType))
>
> scala> val newDF = df.withColumn("new", udfFunc(df("old")))
> newDF: org.apache.spark.sql.DataFrame = [old: string, old1: string, new: 
> string]
>
> scala> newDF.show
> running udf(a)
> running udf(a1)
> +---++---+
> |old|old1|new|
> +---++---+
> |  a|   b|  a|
> | a1|  b1| a1|
> +---++---+
>
>
> scala> val filteredOnNewColumnDF = newDF.filter("new <> 'a1'")
> filteredOnNewColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: 
> string, new: string]
>
> scala> val filteredOnOldColumnDF = newDF.filter("old <> 'a1'")
> filteredOnOldColumnDF: org.apache.spark.sql.DataFrame = [old: string, old1: 
> string, new: string]
>
> scala> filteredOnNewColumnDF.show
> running udf(a)
> running udf(a)
> running udf(a1)
> +---++---+
> |old|old1|new|
> +---++---+
> |  a|   b|  a|
> +---++---+
>
>
> scala> filteredOnOldColumnDF.show
> running udf(a)
> +---++---+
> |old|old1|new|
> +---++---+
> |  a|   b|  a|
> +---++---+
>
>
>
> Best wishes.
> By Linbo
>
>


dataframe udf functioin will be executed twice when filter on new column created by withColumn

2016-05-11 Thread Tony Jin
Hi guys,

I have a problem with a Spark DataFrame. My Spark version is 1.6.1.
Basically, I used a udf and df.withColumn to create a "new" column, and then
I filter the values on this new column and call show (an action). I see the
udf function (which is used by withColumn to create the new column) is
called twice (duplicated). If I filter on the "old" column, the udf only runs
once, which is expected. I attached the example code; lines 30~38 show the
problem.

Does anyone know the internal reason? Can you give me any advice? Thank you
very much.



scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = sc.parallelize(Seq(("a", "b"), ("a1", "b1"))).toDF("old","old1")
df: org.apache.spark.sql.DataFrame = [old: string, old1: string]

scala> val udfFunc = udf((s: String) => {println(s"running udf($s)"); s })
udfFunc: org.apache.spark.sql.UserDefinedFunction =
UserDefinedFunction(,StringType,List(StringType))

scala> val newDF = df.withColumn("new", udfFunc(df("old")))
newDF: org.apache.spark.sql.DataFrame = [old: string, old1: string, new: string]

scala> newDF.show
running udf(a)
running udf(a1)
+---++---+
|old|old1|new|
+---++---+
|  a|   b|  a|
| a1|  b1| a1|
+---++---+


scala> val filteredOnNewColumnDF = newDF.filter("new <> 'a1'")
filteredOnNewColumnDF: org.apache.spark.sql.DataFrame = [old: string,
old1: string, new: string]

scala> val filteredOnOldColumnDF = newDF.filter("old <> 'a1'")
filteredOnOldColumnDF: org.apache.spark.sql.DataFrame = [old: string,
old1: string, new: string]

scala> filteredOnNewColumnDF.show
running udf(a)
running udf(a)
running udf(a1)
+---++---+
|old|old1|new|
+---++---+
|  a|   b|  a|
+---++---+


scala> filteredOnOldColumnDF.show
running udf(a)
+---++---+
|old|old1|new|
+---++---+
|  a|   b|  a|
+---++---+



Best wishes.
By Linbo
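
A workaround that is often suggested for this kind of duplicate UDF evaluation is to
materialize the derived column before filtering on it, for example by caching, so the
filter reads the computed values instead of re-invoking the udf. A sketch that
continues the transcript above; it trades memory for the extra invocation and has not
been verified against 1.6.1:

scala> val cachedDF = df.withColumn("new", udfFunc(df("old"))).cache()

scala> cachedDF.count()   // forces materialization; the udf runs once per row here

scala> cachedDF.filter("new <> 'a1'").show()   // reads the cached "new" column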


Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Rishi Mishra
Will try with the JSON relation, but with Spark's temp tables (Spark version
1.6) I get an optimized plan as you have mentioned. It should not be much
different, though.

Query : "select t1.col2, t1.col3 from t1, t2 where t1.col1=t2.col1 and
t1.col3=7"

Plan :

Project [COL2#1,COL3#2]
+- Join Inner, Some((COL1#0 = COL1#3))
   :- Filter (COL3#2 = 7)
   :  +- LogicalRDD [col1#0,col2#1,col3#2], MapPartitionsRDD[4] at apply at
Transformer.scala:22
   +- Project [COL1#3]
  +- LogicalRDD [col1#3,col2#4,col3#5], MapPartitionsRDD[5] at apply at
Transformer.scala:22

Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Wed, May 11, 2016 at 4:56 PM, Telmo Rodrigues <
telmo.galante.rodrig...@gmail.com> wrote:

> In this case, isn't better to perform the filter earlier as possible even
> there could be unhandled predicates?
>
> Telmo Rodrigues
>
> No dia 11/05/2016, às 09:49, Rishi Mishra 
> escreveu:
>
> It does push the predicate. But as a relations are generic and might or
> might not handle some of the predicates , it needs to apply filter of
> un-handled predicates.
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>
> On Wed, May 11, 2016 at 6:27 AM, Telmo Rodrigues <
> telmo.galante.rodrig...@gmail.com> wrote:
>
>> Hello,
>>
>> I have a question about the Catalyst optimizer in Spark 1.6.
>>
>> initial logical plan:
>>
>> !'Project [unresolvedalias(*)]
>> !+- 'Filter ('t.id = 1)
>> !   +- 'Join Inner, Some(('t.id = 'u.id))
>> !  :- 'UnresolvedRelation `t`, None
>> !  +- 'UnresolvedRelation `u`, None
>>
>>
>> logical plan after optimizer execution:
>>
>> Project [id#0L,id#1L]
>> !+- Filter (id#0L = cast(1 as bigint))
>> !   +- Join Inner, Some((id#0L = id#1L))
>> !  :- Subquery t
>> !  :  +- Relation[id#0L] JSONRelation
>> !  +- Subquery u
>> !  +- Relation[id#1L] JSONRelation
>>
>>
>> Shouldn't the optimizer push down predicates to subquery t in order to
>> the filter be executed before join?
>>
>> Thanks
>>
>>
>>
>
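
A quick way to check where the filter ends up for a concrete data source is to print
the full plan chain with explain(true): the scan node lists the predicates pushed to
the relation as PushedFilters, and any remaining Filter node above it covers whatever
the relation did not handle, which is the "filter of un-handled predicates" described
in this thread. A spark-shell sketch for Spark 1.6; the JSON paths are placeholders:

val t = sqlContext.read.json("/tmp/t.json")
val u = sqlContext.read.json("/tmp/u.json")

val q = t.join(u, t("id") === u("id")).filter(t("id") === 1)

// prints the parsed, analyzed, optimized and physical plans
q.explain(true)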


Re: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B

2016-05-11 Thread Ted Yu
Looks like the exception was thrown from this line:

  ByteBuffer.wrap(taskBinary.value),
Thread.currentThread.getContextClassLoader)

Comment for taskBinary says:

 * @param taskBinary broadcasted version of the serialized RDD and the
function to apply on each
 *   partition of the given RDD. Once deserialized, the
type should be
 *   (RDD[T], (TaskContext, Iterator[T]) => U).

Can you write a simple test which reproduces this problem ?

Thanks

On Wed, May 11, 2016 at 3:40 AM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi,
> I'm running a very simple job (textFile->map->groupby->count) and hitting
> this with Spark 1.6.0 on EMR 4.3 (Hadoop 2.7.1) and hitting this exception
> when running on yarn-client and not in local mode:
>
> 16/05/11 10:29:26 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
> 1, ip-172-31-33-97.ec2.internal, partition 0,NODE_LOCAL, 15116 bytes)
> 16/05/11 10:29:26 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1,
> ip-172-31-33-97.ec2.internal): java.lang.ClassCastException:
> org.apache.spark.util.SerializableConfiguration cannot be cast to [B
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
> If found a jira that relates to streaming and accumulators but I'm using
> neither.
>
> Any ideas ?
> Should I file a jira?
>
> Thank you,
> Daniel
>
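
For reference, a minimal job of the same shape as the one described in this thread
(textFile -> map -> groupBy -> count) that could serve as the simple test asked for
here; the input path is a placeholder, and whether it actually triggers the
ClassCastException will depend on the EMR/YARN setup:

val lines = sc.textFile("hdfs:///tmp/input.txt")                     // placeholder path
val grouped = lines.map(line => (line.take(1), line)).groupByKey()   // key by first character
println("groups: " + grouped.count())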


Error: "Answer from Java side is empty"

2016-05-11 Thread AlexModestov
I use Sparkling Water 1.6.3 and Spark 1.6, with Oracle Java 8 or OpenJDK 7.
Every time I transform a Spark DataFrame into an H2O DataFrame I get this error and
the Spark cluster dies:

ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
  File ".../Spark1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
    raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
  File ".../Spark1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
    self.socket.connect((self.address, self.port))
  File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):

My conf file:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1500mb
spark.driver.memory 65g
spark.driver.extraJavaOptions -XX:-PrintGCDetails -XX:PermSize=35480m -XX:-PrintGCTimeStamps -XX:-PrintTenuringDistribution
spark.python.worker.memory 65g
spark.local.dir /data/spark-tmp
spark.ext.h2o.client.log.dir /data/h2o
spark.logConf false
spark.master local[*]
spark.driver.maxResultSize 0
spark.eventLog.enabled True
spark.eventLog.dir /data/spark_log

In the code I persist the data (about 5.7 GB). There is nothing in the h2o log files.
I guess that there is enough memory. Could anyone help me?
Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-Answer-from-Java-side-is-empty-tp26929.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Telmo Rodrigues
In this case, isn't it better to perform the filter as early as possible even if
there could be unhandled predicates?

Telmo Rodrigues

No dia 11/05/2016, às 09:49, Rishi Mishra  escreveu:

> It does push the predicate. But as a relations are generic and might or might 
> not handle some of the predicates , it needs to apply filter of un-handled 
> predicates. 
> 
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
> 
> https://in.linkedin.com/in/rishiteshmishra
> 
>> On Wed, May 11, 2016 at 6:27 AM, Telmo Rodrigues 
>>  wrote:
>> Hello,
>> 
>> I have a question about the Catalyst optimizer in Spark 1.6.
>> 
>> initial logical plan:
>> 
>> !'Project [unresolvedalias(*)]
>> !+- 'Filter ('t.id = 1)
>> !   +- 'Join Inner, Some(('t.id = 'u.id))
>> !  :- 'UnresolvedRelation `t`, None
>> !  +- 'UnresolvedRelation `u`, None
>> 
>> 
>> logical plan after optimizer execution:
>> 
>> Project [id#0L,id#1L]
>> !+- Filter (id#0L = cast(1 as bigint))
>> !   +- Join Inner, Some((id#0L = id#1L))
>> !  :- Subquery t
>> !  :  +- Relation[id#0L] JSONRelation
>> !  +- Subquery u
>> !  +- Relation[id#1L] JSONRelation
>> 
>> 
>> Shouldn't the optimizer push down predicates to subquery t in order to the 
>> filter be executed before join?
>> 
>> Thanks
> 


java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B

2016-05-11 Thread Daniel Haviv
Hi,
I'm running a very simple job (textFile->map->groupby->count) with Spark 1.6.0 on
EMR 4.3 (Hadoop 2.7.1) and hitting this exception when running in yarn-client mode
but not in local mode:

16/05/11 10:29:26 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1, ip-172-31-33-97.ec2.internal, partition 0,NODE_LOCAL, 15116 bytes)
16/05/11 10:29:26 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1,
ip-172-31-33-97.ec2.internal): java.lang.ClassCastException:
org.apache.spark.util.SerializableConfiguration cannot be cast to [B
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


I found a jira that relates to streaming and accumulators, but I'm using
neither.

Any ideas ?
Should I file a jira?

Thank you,
Daniel


Re: Will the HiveContext cause memory leak ?

2016-05-11 Thread kramer2...@126.com
After 8 hours the memory usage becomes stable. The top command shows it at 75%,
which means about 12 GB of memory.

But it still does not make sense, because my workload is very small.

I use this Spark job to run a calculation on one CSV file every 20 seconds. The size
of the CSV file is 1.3 MB.

So Spark is using almost 10,000 times the memory of my workload. Does that mean I
need to prepare 1 TB of RAM if the workload is 100 MB?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26927.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Spark 1.5.2]Check Foreign Key constraint

2016-05-11 Thread Ankit Singhal
You can use joins as a substitute for subqueries.

On Wed, May 11, 2016 at 1:27 PM, Divya Gehlot 
wrote:

> Hi,
> I am using Spark 1.5.2  with Apache Phoenix 4.4
> As Spark 1.5.2 doesn't support subquery in where conditions .
> https://issues.apache.org/jira/browse/SPARK-4226
>
> Is there any alternative way to find foreign key constraints.
> Would really appreciate the help.
>
>
>
> Thanks,
> Divya
>
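
For illustration, "joins as a substitute for subqueries" for a foreign-key style
check can look like the sketch below in Spark 1.5: a left outer join plus an IS NULL
filter returns the child rows whose key has no match in the parent table. The
DataFrame and column names are placeholders, not taken from the thread:

import org.apache.spark.sql.DataFrame

def orphanRows(child: DataFrame, parent: DataFrame): DataFrame =
  child.join(parent, child("parent_id") === parent("id"), "left_outer")
       .filter(parent("id").isNull)
       .select(child.columns.map(child(_)): _*)

// an empty result means every child.parent_id has a matching parent.id
// orphanRows(childDF, parentDF).show()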


Re: Spark 1.6 Catalyst optimizer

2016-05-11 Thread Rishi Mishra
It does push the predicate. But as relations are generic and might or
might not handle some of the predicates, it needs to apply a filter for the
un-handled predicates.

Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Wed, May 11, 2016 at 6:27 AM, Telmo Rodrigues <
telmo.galante.rodrig...@gmail.com> wrote:

> Hello,
>
> I have a question about the Catalyst optimizer in Spark 1.6.
>
> initial logical plan:
>
> !'Project [unresolvedalias(*)]
> !+- 'Filter ('t.id = 1)
> !   +- 'Join Inner, Some(('t.id = 'u.id))
> !  :- 'UnresolvedRelation `t`, None
> !  +- 'UnresolvedRelation `u`, None
>
>
> logical plan after optimizer execution:
>
> Project [id#0L,id#1L]
> !+- Filter (id#0L = cast(1 as bigint))
> !   +- Join Inner, Some((id#0L = id#1L))
> !  :- Subquery t
> !  :  +- Relation[id#0L] JSONRelation
> !  +- Subquery u
> !  +- Relation[id#1L] JSONRelation
>
>
> Shouldn't the optimizer push down predicates to subquery t in order to the
> filter be executed before join?
>
> Thanks
>
>
>


Re: [Spark 1.5.2]Check Foreign Key constraint

2016-05-11 Thread Alonso Isidoro Roman
I think that Impala and Hive have this feature. Impala is faster than Hive;
Hive is probably better to use in batch mode.

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman


2016-05-11 9:57 GMT+02:00 Divya Gehlot :

> Hi,
> I am using Spark 1.5.2  with Apache Phoenix 4.4
> As Spark 1.5.2 doesn't support subquery in where conditions .
> https://issues.apache.org/jira/browse/SPARK-4226
>
> Is there any alternative way to find foreign key constraints.
> Would really appreciate the help.
>
>
>
> Thanks,
> Divya
>


Re: How to use pyspark streaming module "slice"?

2016-05-11 Thread sethirot
ok, thanks anyway

On Wed, May 11, 2016 at 12:15 AM, joyceye04 [via Apache Spark User List] <
ml-node+s1001560n26919...@n3.nabble.com> wrote:

> Not yet. And I turned to another way to bypass it just to finish my work.
> Still waiting for answers :(
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-pyspark-streaming-module-slice-tp26813p26919.html
>



-- 
Asaf Botovsky, CTO




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-pyspark-streaming-module-slice-tp26813p26926.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Use Collaborative Filtering and Clustering Algorithm in Spark MLIB

2016-05-11 Thread Imre Nagi
Hi All,

I'm a newbie in Spark MLlib. In my office I have a statistician who works on
improving our matrix model for our recommendation engine. However, he works
in R. He told me that it's quite possible to combine collaborative
filtering and latent Dirichlet allocation (LDA) by doing some computations.

However, in Spark MLlib, I found that both LDA and collaborative filtering
have their own model that we can use later for our needs. My question is how
to combine both models to achieve the same goal as our statistician did.
What kind of operations can I do to combine these models?

Thanks.
Imre
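
MLlib itself has no operation that merges a collaborative filtering model with an LDA
model; what it does expose are the latent representations each model produces, which
can be joined per item and then combined in whatever way the statistician's R approach
prescribes. A heavily simplified spark-shell sketch for MLlib 1.6 with toy data; the
parameters and the join-by-item step are illustrative assumptions, not a known recipe:

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
import org.apache.spark.mllib.linalg.Vectors

// toy ratings (user, item, rating) and toy item "documents" (item id, term counts)
val ratings = sc.parallelize(Seq(Rating(1, 100, 4.0), Rating(2, 100, 3.0), Rating(1, 101, 5.0)))
val docs    = sc.parallelize(Seq((100L, Vectors.dense(1.0, 0.0, 2.0)),
                                 (101L, Vectors.dense(0.0, 3.0, 1.0))))

// latent item factors from collaborative filtering
val cfModel     = ALS.train(ratings, 2, 5, 0.01)             // rank, iterations, lambda
val itemFactors = cfModel.productFeatures                    // RDD[(Int, Array[Double])]

// per-item topic distributions from LDA (the default EM optimizer gives a DistributedLDAModel)
val ldaModel   = new LDA().setK(2).run(docs).asInstanceOf[DistributedLDAModel]
val itemTopics = ldaModel.topicDistributions                 // RDD[(Long, Vector)]

// join the two latent views of each item; how to combine them (concatenate, re-weight,
// regress, ...) is the modelling question to settle with the statistician
val combined = itemFactors.map { case (id, f) => (id.toLong, f) }.join(itemTopics)
combined.collect().foreach(println)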


[Spark 1.5.2]Check Foreign Key constraint

2016-05-11 Thread Divya Gehlot
Hi,
I am using Spark 1.5.2 with Apache Phoenix 4.4.
As Spark 1.5.2 doesn't support subqueries in where conditions:
https://issues.apache.org/jira/browse/SPARK-4226

Is there any alternative way to check foreign key constraints?
I would really appreciate the help.



Thanks,
Divya


Re: Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-11 Thread Mich Talebzadeh
OK, you can see that process 10603 (Worker) is running as the worker/slave, with its
web UI on port 8081 and registered to the master at spark://ES01:7077; you can reach
that through the web UI.
You also have process 12420 running as SparkSubmit. That is the JVM for the job you
have submitted:

12420 ?Sl18:47 java -cp /opt/spark-1.6.0-bin-hadoop2.
6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-
1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/
datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.
6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-
hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
-Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit
--master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2
*--executor-memory
4G --num-executors 1 --total-executor-cores 1* /opt/flowSpark/sparkStream/
ForAsk01.py

So I don't see any issue here really

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 11 May 2016 at 06:25, 李明伟  wrote:

>
>
> [root@ES01 test]# jps
> 10409 Master
> 12578 CoarseGrainedExecutorBackend
> 24089 NameNode
> 17705 Jps
> 24184 DataNode
> 10603 Worker
> 12420 SparkSubmit
>
>
>
> [root@ES01 test]# ps -awx | grep -i spark | grep java
> 10409 ?Sl 1:52 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master
> --ip ES01 --port 7077 --webui-port 8080
> 10603 ?Sl 6:50 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker
> --webui-port 8081 spark://ES01:7077
> 12420 ?Sl18:47 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit
> --master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2
> --executor-memory 4G --num-executors 1 --total-executor-cores 1
> /opt/flowSpark/sparkStream/ForAsk01.py
> 12578 ?Sl38:18 java -cp
> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/
> -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://
> CoarseGrainedScheduler@10.79.148.184:52931 --executor-id 0 --hostname
> 10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url
> spark://Worker@10.79.148.184:52660
>
>
>
> 在 2016-05-11 13:18:10,"Mich Talebzadeh"  写道:
>
> what does jps returning?
>
> jps
> 16738 ResourceManager
> 14786 Worker
> 17059 JobHistoryServer
> 12421 QuorumPeerMain
> 9061 RunJar
> 9286 RunJar
> 5190 SparkSubmit
> 16806 NodeManager
> 16264 DataNode
> 16138 NameNode
> 16430 SecondaryNameNode
> 22036 SparkSubmit
> 9557 Jps
> 13240 Kafka
> 2522 Master
>
> and
>
> ps -awx | grep -i spark | grep java
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 11 May 2016 at 03:01, 李明伟  wrote:
>
>> Hi Mich
>>
>> From the ps command. I can find four process. 10409 is the master and
>> 10603 is the worker. 12420 is the driver program and 12578 should be the
>> executor (worker). Am I right?
>> So you mean the 12420 is actually running both the driver and the worker
>> role?
>>
>> [root@ES01 ~]# ps -awx | grep spark | grep java
>> 10409 ?Sl 1:40 java -cp
>> /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api