how to kill application

2018-03-26 Thread Shuxin Yang

Hi,

   I apologize if this question has been asked before. I tried to find the 
answer, but in vain.


   I'm running PySpark on Google Cloud Platform with Spark 2.2.0 and 
YARN resource manager.


   Let S1 be the set of application IDs collected via 
curl 'http://127.0.0.1:18080/api/v1/applications?status=running', and S2 be 
the set of application IDs collected via 'yarn application -list'.


   Sometimes I find S1 != S2. How could this take place?

   For those in the difference S2 - S1 (i.e. alive as a YARN app, dead as a 
Spark app), I can kill them with 'yarn application -kill' and the application ID.


   How can I kill the applications in S1 - S2 (i.e. alive as a Spark app, 
dead as a YARN app)? It looks like not closing the SparkContext could cause this 
problem. However, I'm not always able to close the context, for example when 
my program crashes prematurely.
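
   (For reference, a minimal sketch of how the two sets could be computed and
compared; it assumes the history-server endpoint quoted above and the default
'yarn application -list' output format, so treat it as illustrative only.)

# Illustrative sketch: S1 = apps the history server reports as running,
# S2 = apps YARN reports as running; print both differences.
import json
import re
import subprocess
from urllib.request import urlopen

with urlopen("http://127.0.0.1:18080/api/v1/applications?status=running") as resp:
    s1 = {app["id"] for app in json.load(resp)}

out = subprocess.check_output(["yarn", "application", "-list"]).decode()
s2 = set(re.findall(r"application_\d+_\d+", out))

print("S2 - S1 (alive in YARN, dead in Spark); killable via 'yarn application -kill':")
print(sorted(s2 - s1))
print("S1 - S2 (still listed as running by the history server, but gone from YARN):")
print(sorted(s1 - s2))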


   Tons of thanks in advance!

Shuxin Yang






Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Felix Cheung
If your data can be split into groups and you can call into your favorite R 
package on each group of data (in parallel):

https://spark.apache.org/docs/latest/sparkr.html#run-a-given-function-on-a-large-dataset-grouping-by-input-columns-and-using-gapply-or-gapplycollect



From: Nisha Muktewar 
Sent: Monday, March 26, 2018 2:27:52 PM
To: Josh Goldsborough
Cc: user
Subject: Re: [Spark R]: Linear Mixed-Effects Models in Spark R

Look at LinkedIn's Photon ML package: https://github.com/linkedin/photon-ml

One of the caveats is/was that the input data has to be in Avro in a specific 
format.

On Mon, Mar 26, 2018 at 1:46 PM, Josh Goldsborough 
> wrote:
The company I work for is trying to do some mixed-effects regression modeling 
in our new big data platform including SparkR.

We can run via SparkR's support of native R & use lme4.  But it runs single 
threaded.  So we're looking for tricks/techniques to process large data sets.


This was asked a couple years ago:
https://stackoverflow.com/questions/39790820/mixed-effects-models-in-spark-or-other-technology

But I wanted to ask again, in case anyone had an answer now.

Thanks,
Josh Goldsborough



Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Fawze Abujaber
Thanks for the update.

What about cores per executor?

On Tue, 27 Mar 2018 at 6:45 Rohit Karlupia  wrote:

> Thanks Fawze!
>
> On the memory front, I am currently working on GC and CPU aware task
> scheduling. I see wonderful results based on my tests so far.  Once the
> feature is complete and available, spark will work with whatever memory is
> provided (at least enough for the largest possible task). It will also
> allow you to run say 64 concurrent tasks on 8 core machine, if the nature
> of tasks doesn't leads to memory or CPU contention. Essentially why worry
> about tuning memory when you can let spark take care of it automatically
> based on memory pressure. Will post details when we are ready.  So yes we
> are working on memory, but it will not be a tool but a transparent feature.
>
> thanks,
> rohitk
>
>
>
>
> On Tue, Mar 27, 2018 at 7:53 AM, Fawze Abujaber  wrote:
>
>> Hi Rohit,
>>
>> I would like to thank you for the unlimited patience and support that you
>> are providing here and behind the scene for all of us.
>>
>> The tool is amazing and easy to use and understand most of the metrics ...
>>
>> Thinking if we need to run it in cluster mode and all the time, i think
>> we can skip it as one or few runs can give you the large picture of how the
>> job is running with different configuration and it's not too much
>> complicated to run it using spark-submit.
>>
>> I think it will be so helpful if the sparklens can also include how the
>> job is running with different configuration of cores and memory, Spark job
>> with 1 exec and 1 core will run different from spark job with 1  exec and 3
>> cores and for sure the same compare with different exec memory.
>>
>> Overall, it is so good starting point, but it will be a GAME CHANGER
>> getting these metrics on the tool.
>>
>> @Rohit , Huge THANY YOU
>>
>> On Mon, Mar 26, 2018 at 1:35 PM, Rohit Karlupia 
>> wrote:
>>
>>> Hi Shmuel,
>>>
>>> In general it is hard to pin point to exact code which is responsible
>>> for a specific stage. For example when using spark sql, depending upon the
>>> kind of joins, aggregations used in the the single line of query, we will
>>> have multiple stages in the spark application. I usually try to split the
>>> code into smaller chunks and also use the spark UI which has special
>>> section for SQL. It can also show specific backtraces, but as I explained
>>> earlier they might not be very helpful. Sparklens does help you ask the
>>> right questions, but is not mature enough to answer all of them.
>>>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Rohit Karlupia
Thanks Fawze!

On the memory front, I am currently working on GC- and CPU-aware task
scheduling. I see wonderful results based on my tests so far. Once the
feature is complete and available, Spark will work with whatever memory is
provided (at least enough for the largest possible task). It will also
allow you to run, say, 64 concurrent tasks on an 8-core machine, if the nature
of the tasks doesn't lead to memory or CPU contention. Essentially, why worry
about tuning memory when you can let Spark take care of it automatically
based on memory pressure? We will post details when we are ready. So yes, we
are working on memory, but it will not be a tool; it will be a transparent feature.

thanks,
rohitk




On Tue, Mar 27, 2018 at 7:53 AM, Fawze Abujaber  wrote:

> Hi Rohit,
>
> I would like to thank you for the unlimited patience and support that you
> are providing here and behind the scene for all of us.
>
> The tool is amazing and easy to use and understand most of the metrics ...
>
> Thinking if we need to run it in cluster mode and all the time, i think we
> can skip it as one or few runs can give you the large picture of how the
> job is running with different configuration and it's not too much
> complicated to run it using spark-submit.
>
> I think it will be so helpful if the sparklens can also include how the
> job is running with different configuration of cores and memory, Spark job
> with 1 exec and 1 core will run different from spark job with 1  exec and 3
> cores and for sure the same compare with different exec memory.
>
> Overall, it is so good starting point, but it will be a GAME CHANGER
> getting these metrics on the tool.
>
> @Rohit , Huge THANY YOU
>
> On Mon, Mar 26, 2018 at 1:35 PM, Rohit Karlupia  wrote:
>
>> Hi Shmuel,
>>
>> In general it is hard to pin point to exact code which is responsible for
>> a specific stage. For example when using spark sql, depending upon the kind
>> of joins, aggregations used in the the single line of query, we will have
>> multiple stages in the spark application. I usually try to split the code
>> into smaller chunks and also use the spark UI which has special section for
>> SQL. It can also show specific backtraces, but as I explained earlier they
>> might not be very helpful. Sparklens does help you ask the right questions,
>> but is not mature enough to answer all of them.
>>

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Fawze Abujaber
Hi Rohit,

I would like to thank you for the unlimited patience and support that you
are providing here and behind the scenes for all of us.

The tool is amazing and easy to use, and most of the metrics are easy to understand.

As for whether we need to run it in cluster mode and all the time, I think we
can skip that, as one or a few runs can give you the big picture of how the
job runs with a given configuration, and it's not too complicated to run it
using spark-submit.

I think it would be very helpful if Sparklens could also show how the job
would run with different configurations of cores and memory: a Spark job with
1 executor and 1 core will run differently from a Spark job with 1 executor
and 3 cores, and the same goes for different executor memory.

Overall, it is a very good starting point, but it would be a GAME CHANGER
to get these metrics into the tool.

@Rohit, a huge THANK YOU

On Mon, Mar 26, 2018 at 1:35 PM, Rohit Karlupia  wrote:

> Hi Shmuel,
>
> In general it is hard to pin point to exact code which is responsible for
> a specific stage. For example when using spark sql, depending upon the kind
> of joins, aggregations used in the the single line of query, we will have
> multiple stages in the spark application. I usually try to split the code
> into smaller chunks and also use the spark UI which has special section for
> SQL. It can also show specific backtraces, but as I explained earlier they
> might not be very helpful. Sparklens does help you ask the right questions,
> but is not mature enough to answer all of them.
>

Re: Class cast exception while using Data Frames

2018-03-26 Thread Nikhil Goyal
 |-- myMap: map (nullable = true)
 |    |-- key: struct
 |    |-- value: double (valueContainsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: string (nullable = true)
 |-- count: long (nullable = true)

On Mon, Mar 26, 2018 at 1:41 PM, Gauthier Feuillen 
wrote:

> Can you give the output of “printSchema” ?
>
>
> On 26 Mar 2018, at 22:39, Nikhil Goyal  wrote:
>
> Hi guys,
>
> I have a Map[(String, String), Double] as one of my columns. Using
>
> input.getAs[Map[(String, String), Double]](0)
>
>  throws exception: Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to scala.Tuple2
>
> Even the schema says that key is of type struct of (string, string).
>
> Any idea why this is happening?
>
>
> Thanks
>
> Nikhil
>
>
>


Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Nisha Muktewar
Look at LinkedIn's Photon ML package: https://github.com/linkedin/photon-ml

One of the caveats is/was that the input data has to be in Avro in a
specific format.

On Mon, Mar 26, 2018 at 1:46 PM, Josh Goldsborough <
joshgoldsboroughs...@gmail.com> wrote:

> The company I work for is trying to do some mixed-effects regression
> modeling in our new big data platform including SparkR.
>
> We can run via SparkR's support of native R & use lme4.  But it runs
> single threaded.  So we're looking for tricks/techniques to process large
> data sets.
>
>
> This was asked a couple years ago:
> https://stackoverflow.com/questions/39790820/mixed-
> effects-models-in-spark-or-other-technology
>
> But I wanted to ask again, in case anyone had an answer now.
>
> Thanks,
> Josh Goldsborough
>


Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Jörn Franke
SparkR does not mean that all R libraries are magically executed in a distributed 
fashion that scales with the data. In fact, that is similar to many other 
analytical software packages: they have the ability to run things in parallel, but 
the libraries themselves do not use it. The reason is that it is very hard to 
write ML algorithms correctly so that they scale in a distributed fashion.
What choices do you have now?
1) Work only with a random small sample from the population to train your 
model. This is recommended for large datasets anyway, because you usually 
have to evaluate many different algorithms and parameters in parallel; if you 
do this all the time on the full dataset, you push your platform to its limit. 
The golden rule is that, at most, you evaluate only good models (tested beforehand 
on a smaller dataset) on the larger dataset. Note that this approach is only 
possible if it is possible for your use case (some algorithms simply require 
a lot of data in order to work).
2) Use the R bindings for Spark MLlib and implement your model yourself by 
leveraging some of the functionality there:
https://spark.apache.org/docs/latest/ml-guide.html




> On 26. Mar 2018, at 22:46, Josh Goldsborough  
> wrote:
> 
> The company I work for is trying to do some mixed-effects regression modeling 
> in our new big data platform including SparkR.
> 
> We can run via SparkR's support of native R & use lme4.  But it runs single 
> threaded.  So we're looking for tricks/techniques to process large data sets.
> 
> 
> This was asked a couple years ago:
> https://stackoverflow.com/questions/39790820/mixed-effects-models-in-spark-or-other-technology
> 
> But I wanted to ask again, in case anyone had an answer now.
> 
> Thanks,
> Josh Goldsborough


[Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Josh Goldsborough
The company I work for is trying to do some mixed-effects regression
modeling in our new big data platform including SparkR.

We can run via SparkR's support of native R & use lme4.  But it runs single
threaded.  So we're looking for tricks/techniques to process large data
sets.


This was asked a couple years ago:
https://stackoverflow.com/questions/39790820/mixed-effects-models-in-spark-or-other-technology

But I wanted to ask again, in case anyone had an answer now.

Thanks,
Josh Goldsborough


Re: Class cast exception while using Data Frames

2018-03-26 Thread Gauthier Feuillen
Can you give the output of “printSchema” ? 

> On 26 Mar 2018, at 22:39, Nikhil Goyal  wrote:
> 
> Hi guys,
> 
> I have a Map[(String, String), Double] as one of my columns. Using
> input.getAs[Map[(String, String), Double]](0)
>  throws exception: Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to scala.Tuple2
> Even the schema says that key is of type struct of (string, string).
> Any idea why this is happening?
> 
> Thanks
> Nikhil



Class cast exception while using Data Frames

2018-03-26 Thread Nikhil Goyal
Hi guys,

I have a Map[(String, String), Double] as one of my columns. Using

input.getAs[Map[(String, String), Double]](0)

 throws exception: Caused by: java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot
be cast to scala.Tuple2

Even the schema says that key is of type struct of (string, string).

Any idea why this is happening?


Thanks

Nikhil


Re: Local dirs

2018-03-26 Thread Gauthier Feuillen
Thanks

> On 26 Mar 2018, at 22:09, Marcelo Vanzin  wrote:
> 
> On Mon, Mar 26, 2018 at 1:08 PM, Gauthier Feuillen
>  wrote:
>> Is there a way to change this value without changing yarn-site.xml ?
> 
> No. Local dirs are defined by the NodeManager, and Spark cannot override them.
> 
> -- 
> Marcelo





Re: Local dirs

2018-03-26 Thread Marcelo Vanzin
On Mon, Mar 26, 2018 at 1:08 PM, Gauthier Feuillen
 wrote:
> Is there a way to change this value without changing yarn-site.xml ?

No. Local dirs are defined by the NodeManager, and Spark cannot override them.

-- 
Marcelo




Local dirs

2018-03-26 Thread Gauthier Feuillen
Hi

I am trying to change the spark.local.dir property. I am running spark on yarn 
and have already tried the following properties:

export LOCAL_DIRS=

spark.yarn.appMasterEnv.LOCAL_DIRS=
spark.yarn.appMasterEnv.SPARK_LOCAL_DIRS=
spark.yarn.nodemanager.local-dirs=/
spark.local.dir=

But still it does not change the executors' local directory. Is there a way to 
change this value without changing yarn-site.xml?

Thanks in advance !



Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Michael Shtelma
Hi Keith,

Thanks for the suggestion!
I have solved this already.
The problem was that the YARN process was not responding to
start/stop commands and had not applied my configuration changes.
I killed it and restarted my cluster, and after that YARN
started using the yarn.nodemanager.local-dirs parameter defined in
yarn-site.xml.
After this change, -Djava.io.tmpdir for the Spark executors was set
correctly, according to the yarn.nodemanager.local-dirs parameter.

Best,
Michael


On Mon, Mar 26, 2018 at 9:15 PM, Keith Chapman  wrote:
> Hi Michael,
>
> sorry for the late reply. I guess you may have to set it through the hdfs
> core-site.xml file. The property you need to set is "hadoop.tmp.dir" which
> defaults to "/tmp/hadoop-${user.name}"
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Mon, Mar 19, 2018 at 1:05 PM, Michael Shtelma  wrote:
>>
>> Hi Keith,
>>
>> Thank you for the idea!
>> I have tried it, so now the executor command is looking in the following
>> way :
>>
>> /bin/bash -c /usr/java/latest//bin/java -server -Xmx51200m
>> '-Djava.io.tmpdir=my_prefered_path'
>>
>> -Djava.io.tmpdir=/tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_1521110306769_0041/container_1521110306769_0041_01_04/tmp
>>
>> JVM is using the second Djava.io.tmpdir parameter and writing
>> everything to the same directory as before.
>>
>> Best,
>> Michael
>> Sincerely,
>> Michael Shtelma
>>
>>
>> On Mon, Mar 19, 2018 at 6:38 PM, Keith Chapman 
>> wrote:
>> > Can you try setting spark.executor.extraJavaOptions to have
>> > -Djava.io.tmpdir=someValue
>> >
>> > Regards,
>> > Keith.
>> >
>> > http://keith-chapman.com
>> >
>> > On Mon, Mar 19, 2018 at 10:29 AM, Michael Shtelma 
>> > wrote:
>> >>
>> >> Hi Keith,
>> >>
>> >> Thank you for your answer!
>> >> I have done this, and it is working for spark driver.
>> >> I would like to make something like this for the executors as well, so
>> >> that the setting will be used on all the nodes, where I have executors
>> >> running.
>> >>
>> >> Best,
>> >> Michael
>> >>
>> >>
>> >> On Mon, Mar 19, 2018 at 6:07 PM, Keith Chapman
>> >> 
>> >> wrote:
>> >> > Hi Michael,
>> >> >
>> >> > You could either set spark.local.dir through spark conf or
>> >> > java.io.tmpdir
>> >> > system property.
>> >> >
>> >> > Regards,
>> >> > Keith.
>> >> >
>> >> > http://keith-chapman.com
>> >> >
>> >> > On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma 
>> >> > wrote:
>> >> >>
>> >> >> Hi everybody,
>> >> >>
>> >> >> I am running spark job on yarn, and my problem is that the
>> >> >> blockmgr-*
>> >> >> folders are being created under
>> >> >> /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/*
>> >> >> The size of this folder can grow to a significant size and does not
>> >> >> really fit into /tmp file system for one job, which makes a real
>> >> >> problem for my installation.
>> >> >> I have redefined hadoop.tmp.dir in core-site.xml and
>> >> >> yarn.nodemanager.local-dirs in yarn-site.xml pointing to other
>> >> >> location and expected that the block manager will create the files
>> >> >> there and not under /tmp, but this is not the case. The files are
>> >> >> created under /tmp.
>> >> >>
>> >> >> I am wondering if there is a way to make spark not use /tmp at all
>> >> >> and
>> >> >> configure it to create all the files somewhere else ?
>> >> >>
>> >> >> Any assistance would be greatly appreciated!
>> >> >>
>> >> >> Best,
>> >> >> Michael
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >
>> >
>
>




Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Keith Chapman
Hi Michael,

sorry for the late reply. I guess you may have to set it through the hdfs
core-site.xml file. The property you need to set is "hadoop.tmp.dir" which
defaults to "/tmp/hadoop-${user.name}"

Regards,
Keith.

http://keith-chapman.com

On Mon, Mar 19, 2018 at 1:05 PM, Michael Shtelma  wrote:

> Hi Keith,
>
> Thank you for the idea!
> I have tried it, so now the executor command is looking in the following
> way :
>
> /bin/bash -c /usr/java/latest//bin/java -server -Xmx51200m
> '-Djava.io.tmpdir=my_prefered_path'
> -Djava.io.tmpdir=/tmp/hadoop-msh/nm-local-dir/usercache/
> msh/appcache/application_1521110306769_0041/container_
> 1521110306769_0041_01_04/tmp
>
> JVM is using the second Djava.io.tmpdir parameter and writing
> everything to the same directory as before.
>
> Best,
> Michael
> Sincerely,
> Michael Shtelma
>
>
> On Mon, Mar 19, 2018 at 6:38 PM, Keith Chapman 
> wrote:
> > Can you try setting spark.executor.extraJavaOptions to have
> > -Djava.io.tmpdir=someValue
> >
> > Regards,
> > Keith.
> >
> > http://keith-chapman.com
> >
> > On Mon, Mar 19, 2018 at 10:29 AM, Michael Shtelma 
> > wrote:
> >>
> >> Hi Keith,
> >>
> >> Thank you for your answer!
> >> I have done this, and it is working for spark driver.
> >> I would like to make something like this for the executors as well, so
> >> that the setting will be used on all the nodes, where I have executors
> >> running.
> >>
> >> Best,
> >> Michael
> >>
> >>
> >> On Mon, Mar 19, 2018 at 6:07 PM, Keith Chapman  >
> >> wrote:
> >> > Hi Michael,
> >> >
> >> > You could either set spark.local.dir through spark conf or
> >> > java.io.tmpdir
> >> > system property.
> >> >
> >> > Regards,
> >> > Keith.
> >> >
> >> > http://keith-chapman.com
> >> >
> >> > On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma 
> >> > wrote:
> >> >>
> >> >> Hi everybody,
> >> >>
> >> >> I am running spark job on yarn, and my problem is that the blockmgr-*
> >> >> folders are being created under
> >> >> /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/*
> >> >> The size of this folder can grow to a significant size and does not
> >> >> really fit into /tmp file system for one job, which makes a real
> >> >> problem for my installation.
> >> >> I have redefined hadoop.tmp.dir in core-site.xml and
> >> >> yarn.nodemanager.local-dirs in yarn-site.xml pointing to other
> >> >> location and expected that the block manager will create the files
> >> >> there and not under /tmp, but this is not the case. The files are
> >> >> created under /tmp.
> >> >>
> >> >> I am wondering if there is a way to make spark not use /tmp at all
> and
> >> >> configure it to create all the files somewhere else ?
> >> >>
> >> >> Any assistance would be greatly appreciated!
> >> >>
> >> >> Best,
> >> >> Michael
> >> >>
> >> >>
> >> >
> >
> >
>


Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
On Mon, Mar 26, 2018 at 11:01 AM, Fawze Abujaber  wrote:
> Weird, I just ran spark-shell and it's log is comprised but  my spark jobs
> that scheduled using oozie is not getting compressed.

Ah, then it's probably a problem with how Oozie is generating the
config for the Spark job. Given your env it's potentially related to
Cloudera Manager so I'd try to ask questions in the Cloudera forums
first...

-- 
Marcelo




Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
Hi Marcelo,

Weird, I just ran spark-shell and its log is compressed, but my Spark jobs
that are scheduled using Oozie are not getting compressed.

On Mon, Mar 26, 2018 at 8:56 PM, Marcelo Vanzin  wrote:

> You're either doing something wrong, or talking about different logs.
> I just added that to my config and ran spark-shell.
>
> $ hdfs dfs -ls /user/spark/applicationHistory | grep
> application_1522085988298_0002
> -rwxrwx---   3 blah blah   9844 2018-03-26 10:54
> /user/spark/applicationHistory/application_1522085988298_0002.snappy
>
>
>
> On Mon, Mar 26, 2018 at 10:48 AM, Fawze Abujaber 
> wrote:
> > I distributed this config to all the nodes cross the cluster and with no
> > success, new spark logs still uncompressed.
> >
> > On Mon, Mar 26, 2018 at 8:12 PM, Marcelo Vanzin 
> wrote:
> >>
> >> Spark should be using the gateway's configuration. Unless you're
> >> launching the application from a different node, if the setting is
> >> there, Spark should be using it.
> >>
> >> You can also look in the UI's environment page to see the
> >> configuration that the app is using.
> >>
> >> On Mon, Mar 26, 2018 at 10:10 AM, Fawze Abujaber 
> >> wrote:
> >> > I see this configuration only on the spark gateway server, and my
> spark
> >> > is
> >> > running on Yarn, so I think I missing something ...
> >> >
> >> > I’m using cloudera manager to set this parameter, maybe I need to add
> >> > this
> >> > parameter in other configuration
> >> >
> >> > On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin 
> wrote:
> >> >>
> >> >> If the spark-defaults.conf file in the machine where you're starting
> >> >> the Spark app has that config, then that's all that should be needed.
> >> >>
> >> >> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber 
> >> >> wrote:
> >> >> > Thanks Marcelo,
> >> >> >
> >> >> > Yes I was was expecting to see the new apps compressed but I don’t
> ,
> >> >> > do
> >> >> > I
> >> >> > need to perform restart to spark or Yarn?
> >> >> >
> >> >> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin 
> >> >> > wrote:
> >> >> >>
> >> >> >> Log compression is a client setting. Doing that will make new apps
> >> >> >> write event logs in compressed format.
> >> >> >>
> >> >> >> The SHS doesn't compress existing logs.
> >> >> >>
> >> >> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber <
> fawz...@gmail.com>
> >> >> >> wrote:
> >> >> >> > Hi All,
> >> >> >> >
> >> >> >> > I'm trying to compress the logs at SPark history server, i added
> >> >> >> > spark.eventLog.compress=true to spark-defaults.conf to spark
> Spark
> >> >> >> > Client
> >> >> >> > Advanced Configuration Snippet (Safety Valve) for
> >> >> >> > spark-conf/spark-defaults.conf
> >> >> >> >
> >> >> >> > which i see applied only to the spark gateway servers spark
> conf.
> >> >> >> >
> >> >> >> > What i missing to get this working ?
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Marcelo
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Marcelo
> >>
> >>
> >>
> >> --
> >> Marcelo
> >
> >
>
>
>
> --
> Marcelo
>


Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
I distributed this config to all the nodes across the cluster with no
success; new Spark logs are still uncompressed.

On Mon, Mar 26, 2018 at 8:12 PM, Marcelo Vanzin  wrote:

> Spark should be using the gateway's configuration. Unless you're
> launching the application from a different node, if the setting is
> there, Spark should be using it.
>
> You can also look in the UI's environment page to see the
> configuration that the app is using.
>
> On Mon, Mar 26, 2018 at 10:10 AM, Fawze Abujaber 
> wrote:
> > I see this configuration only on the spark gateway server, and my spark
> is
> > running on Yarn, so I think I missing something ...
> >
> > I’m using cloudera manager to set this parameter, maybe I need to add
> this
> > parameter in other configuration
> >
> > On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin  wrote:
> >>
> >> If the spark-defaults.conf file in the machine where you're starting
> >> the Spark app has that config, then that's all that should be needed.
> >>
> >> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber 
> >> wrote:
> >> > Thanks Marcelo,
> >> >
> >> > Yes I was was expecting to see the new apps compressed but I don’t ,
> do
> >> > I
> >> > need to perform restart to spark or Yarn?
> >> >
> >> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin 
> wrote:
> >> >>
> >> >> Log compression is a client setting. Doing that will make new apps
> >> >> write event logs in compressed format.
> >> >>
> >> >> The SHS doesn't compress existing logs.
> >> >>
> >> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber 
> >> >> wrote:
> >> >> > Hi All,
> >> >> >
> >> >> > I'm trying to compress the logs at SPark history server, i added
> >> >> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
> >> >> > Client
> >> >> > Advanced Configuration Snippet (Safety Valve) for
> >> >> > spark-conf/spark-defaults.conf
> >> >> >
> >> >> > which i see applied only to the spark gateway servers spark conf.
> >> >> >
> >> >> > What i missing to get this working ?
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Marcelo
> >>
> >>
> >>
> >> --
> >> Marcelo
>
>
>
> --
> Marcelo
>


Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
You're either doing something wrong, or talking about different logs.
I just added that to my config and ran spark-shell.

$ hdfs dfs -ls /user/spark/applicationHistory | grep
application_1522085988298_0002
-rwxrwx---   3 blah blah   9844 2018-03-26 10:54
/user/spark/applicationHistory/application_1522085988298_0002.snappy



On Mon, Mar 26, 2018 at 10:48 AM, Fawze Abujaber  wrote:
> I distributed this config to all the nodes cross the cluster and with no
> success, new spark logs still uncompressed.
>
> On Mon, Mar 26, 2018 at 8:12 PM, Marcelo Vanzin  wrote:
>>
>> Spark should be using the gateway's configuration. Unless you're
>> launching the application from a different node, if the setting is
>> there, Spark should be using it.
>>
>> You can also look in the UI's environment page to see the
>> configuration that the app is using.
>>
>> On Mon, Mar 26, 2018 at 10:10 AM, Fawze Abujaber 
>> wrote:
>> > I see this configuration only on the spark gateway server, and my spark
>> > is
>> > running on Yarn, so I think I missing something ...
>> >
>> > I’m using cloudera manager to set this parameter, maybe I need to add
>> > this
>> > parameter in other configuration
>> >
>> > On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin  wrote:
>> >>
>> >> If the spark-defaults.conf file in the machine where you're starting
>> >> the Spark app has that config, then that's all that should be needed.
>> >>
>> >> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber 
>> >> wrote:
>> >> > Thanks Marcelo,
>> >> >
>> >> > Yes I was was expecting to see the new apps compressed but I don’t ,
>> >> > do
>> >> > I
>> >> > need to perform restart to spark or Yarn?
>> >> >
>> >> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin 
>> >> > wrote:
>> >> >>
>> >> >> Log compression is a client setting. Doing that will make new apps
>> >> >> write event logs in compressed format.
>> >> >>
>> >> >> The SHS doesn't compress existing logs.
>> >> >>
>> >> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber 
>> >> >> wrote:
>> >> >> > Hi All,
>> >> >> >
>> >> >> > I'm trying to compress the logs at SPark history server, i added
>> >> >> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
>> >> >> > Client
>> >> >> > Advanced Configuration Snippet (Safety Valve) for
>> >> >> > spark-conf/spark-defaults.conf
>> >> >> >
>> >> >> > which i see applied only to the spark gateway servers spark conf.
>> >> >> >
>> >> >> > What i missing to get this working ?
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Marcelo
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>>
>>
>>
>> --
>> Marcelo
>
>



-- 
Marcelo




Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
Spark should be using the gateway's configuration. Unless you're
launching the application from a different node, if the setting is
there, Spark should be using it.

You can also look in the UI's environment page to see the
configuration that the app is using.

On Mon, Mar 26, 2018 at 10:10 AM, Fawze Abujaber  wrote:
> I see this configuration only on the spark gateway server, and my spark is
> running on Yarn, so I think I missing something ...
>
> I’m using cloudera manager to set this parameter, maybe I need to add this
> parameter in other configuration
>
> On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin  wrote:
>>
>> If the spark-defaults.conf file in the machine where you're starting
>> the Spark app has that config, then that's all that should be needed.
>>
>> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber 
>> wrote:
>> > Thanks Marcelo,
>> >
>> > Yes I was was expecting to see the new apps compressed but I don’t , do
>> > I
>> > need to perform restart to spark or Yarn?
>> >
>> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin  wrote:
>> >>
>> >> Log compression is a client setting. Doing that will make new apps
>> >> write event logs in compressed format.
>> >>
>> >> The SHS doesn't compress existing logs.
>> >>
>> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber 
>> >> wrote:
>> >> > Hi All,
>> >> >
>> >> > I'm trying to compress the logs at SPark history server, i added
>> >> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
>> >> > Client
>> >> > Advanced Configuration Snippet (Safety Valve) for
>> >> > spark-conf/spark-defaults.conf
>> >> >
>> >> > which i see applied only to the spark gateway servers spark conf.
>> >> >
>> >> > What i missing to get this working ?
>> >>
>> >>
>> >>
>> >> --
>> >> Marcelo
>>
>>
>>
>> --
>> Marcelo



-- 
Marcelo




Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
I see this configuration only on the Spark gateway server, and my Spark is
running on YARN, so I think I am missing something ...

I’m using Cloudera Manager to set this parameter; maybe I need to add this
parameter in another configuration.

On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin  wrote:

> If the spark-defaults.conf file in the machine where you're starting
> the Spark app has that config, then that's all that should be needed.
>
> On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber 
> wrote:
> > Thanks Marcelo,
> >
> > Yes I was was expecting to see the new apps compressed but I don’t , do I
> > need to perform restart to spark or Yarn?
> >
> > On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin  wrote:
> >>
> >> Log compression is a client setting. Doing that will make new apps
> >> write event logs in compressed format.
> >>
> >> The SHS doesn't compress existing logs.
> >>
> >> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber 
> wrote:
> >> > Hi All,
> >> >
> >> > I'm trying to compress the logs at SPark history server, i added
> >> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
> >> > Client
> >> > Advanced Configuration Snippet (Safety Valve) for
> >> > spark-conf/spark-defaults.conf
> >> >
> >> > which i see applied only to the spark gateway servers spark conf.
> >> >
> >> > What i missing to get this working ?
> >>
> >>
> >>
> >> --
> >> Marcelo
>
>
>
> --
> Marcelo
>


Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
If the spark-defaults.conf file in the machine where you're starting
the Spark app has that config, then that's all that should be needed.

On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber  wrote:
> Thanks Marcelo,
>
> Yes I was was expecting to see the new apps compressed but I don’t , do I
> need to perform restart to spark or Yarn?
>
> On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin  wrote:
>>
>> Log compression is a client setting. Doing that will make new apps
>> write event logs in compressed format.
>>
>> The SHS doesn't compress existing logs.
>>
>> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber  wrote:
>> > Hi All,
>> >
>> > I'm trying to compress the logs at SPark history server, i added
>> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark
>> > Client
>> > Advanced Configuration Snippet (Safety Valve) for
>> > spark-conf/spark-defaults.conf
>> >
>> > which i see applied only to the spark gateway servers spark conf.
>> >
>> > What i missing to get this working ?
>>
>>
>>
>> --
>> Marcelo



-- 
Marcelo




Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
Thanks Marcelo,

Yes, I was expecting to see the new apps compressed but I don't. Do I
need to restart Spark or YARN?

On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin  wrote:

> Log compression is a client setting. Doing that will make new apps
> write event logs in compressed format.
>
> The SHS doesn't compress existing logs.
>
> On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber  wrote:
> > Hi All,
> >
> > I'm trying to compress the logs at SPark history server, i added
> > spark.eventLog.compress=true to spark-defaults.conf to spark Spark Client
> > Advanced Configuration Snippet (Safety Valve) for
> > spark-conf/spark-defaults.conf
> >
> > which i see applied only to the spark gateway servers spark conf.
> >
> > What i missing to get this working ?
>
>
>
> --
> Marcelo
>


Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
Log compression is a client setting. Doing that will make new apps
write event logs in compressed format.

The SHS doesn't compress existing logs.
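
(For illustration, a minimal sketch of setting this on the client side when
building a PySpark session; spark.eventLog.compress is the property discussed
here, while the app name and the log directory below are only placeholders.)

# Hypothetical client-side example: write event logs in compressed format.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("event-log-compression-example")    # placeholder name
    .config("spark.eventLog.enabled", "true")    # event logging must be enabled
    .config("spark.eventLog.compress", "true")   # compress the event logs
    .config("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")  # placeholder path
    .getOrCreate()
)

The same properties can equally be passed to spark-submit with --conf or put in
the client's spark-defaults.conf, which is what the rest of this thread is about.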

On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber  wrote:
> Hi All,
>
> I'm trying to compress the logs at SPark history server, i added
> spark.eventLog.compress=true to spark-defaults.conf to spark Spark Client
> Advanced Configuration Snippet (Safety Valve) for
> spark-conf/spark-defaults.conf
>
> which i see applied only to the spark gateway servers spark conf.
>
> What i missing to get this working ?



-- 
Marcelo




Spark logs compression

2018-03-26 Thread Fawze Abujaber
Hi All,

I'm trying to compress the logs at the Spark History Server. I
added spark.eventLog.compress=true to spark-defaults.conf via the Spark
Client Advanced Configuration Snippet (Safety Valve) for
spark-conf/spark-defaults.conf,

which I see applied only to the Spark gateway servers' Spark conf.

What am I missing to get this working?


What do I need to set to see the number of records and processing time for each batch in SPARK UI?

2018-03-26 Thread kant kodali
Hi All,

I am using Spark 2.3.0 and I am wondering what I need to set to see the
number of records and the processing time for each batch in the Spark UI. The
default UI doesn't seem to show this.

Thanks!
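
(As an aside, and assuming Structured Streaming is what is being run here: each
query also exposes per-micro-batch metrics programmatically through its progress
reports. A minimal, illustrative sketch with a placeholder rate source and
console sink:)

# Hypothetical sketch: print per-batch record counts and durations for a query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("progress-example").getOrCreate()

stream = spark.readStream.format("rate").load()        # placeholder source
query = stream.writeStream.format("console").start()   # placeholder sink

query.awaitTermination(15)                             # let a few micro-batches run

for progress in query.recentProgress:                  # one dict per micro-batch
    print(progress["batchId"], progress["numInputRows"], progress["durationMs"])

query.stop()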


spark 2.3 dataframe join bug

2018-03-26 Thread 李斌松
Hi Spark community,
 I'm using Spark 2.3 and have found a bug in Spark DataFrames. Here is my
code:

sc = sparkSession.sparkContext
tmp = sparkSession.createDataFrame(sc.parallelize(
    [[1, 2, 3, 4], [1, 2, 5, 6], [2, 3, 4, 5], [2, 3, 5, 6]])).toDF('a', 'b', 'c', 'd')
tmp.createOrReplaceTempView('tdl_spark_test')
sparkSession.sql('cache table tdl_spark_test')

df = sparkSession.sql('select a, b from tdl_spark_test group by a, b')
df.printSchema()

df1 = sparkSession.sql('select a, b, collect_set(array(c)) as c from tdl_spark_test group by a, b')
df1 = df1.withColumnRenamed('a', 'a1').withColumnRenamed('b', 'b1')
cond = [df.a == df1.a1, df.b == df1.b1]
df = df.join(df1, cond, 'inner').drop('a1', 'b1')

df2 = sparkSession.sql('select a, b, collect_set(array(d)) as d from tdl_spark_test group by a, b')
df2 = df2.withColumnRenamed('a', 'a1').withColumnRenamed('b', 'b1')
cond = [df.a == df2.a1, df.b == df2.b1]
df = df.join(df2, cond, 'inner').drop('a1', 'b1')

df.show()
sparkSession.sql('uncache table tdl_spark_test')


As you can see, the above code just creates a DataFrame and two
child DataFrames. The expected answer is:

+---+---+----------+----------+
|  a|  b|         c|         d|
+---+---+----------+----------+
|  2|  3|[[5], [4]]|[[5], [6]]|
|  1|  2|[[5], [3]]|[[6], [4]]|
+---+---+----------+----------+

However, we got the unexpected answer:

+---+---+----------+----------+
|  a|  b|         c|         d|
+---+---+----------+----------+
|  2|  3|[[5], [4]]|[[5], [4]]|
|  1|  2|[[5], [3]]|[[5], [3]]|
+---+---+----------+----------+

It seems that the column of the first child DataFrame has covered the
column of the second child DataFrame.
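
(A minimal sketch of a possible workaround, based on the reporter's observation
in point 4 of the list below; it assumes the same sparkSession and the cached
tdl_spark_test view created above.)

# Hypothetical workaround: join on the column names instead of renaming columns
# and comparing them, so 'a' and 'b' are resolved unambiguously and the c and d
# columns keep their own values.
df = sparkSession.sql('select a, b from tdl_spark_test group by a, b')

df1 = sparkSession.sql(
    'select a, b, collect_set(array(c)) as c from tdl_spark_test group by a, b')
df2 = sparkSession.sql(
    'select a, b, collect_set(array(d)) as d from tdl_spark_test group by a, b')

df = df.join(df1, ['a', 'b'], 'inner').join(df2, ['a', 'b'], 'inner')
df.show()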

In addition, this error occurs only when all of the following conditions hold
at the same time:
1. the first (root) table is cached;
2. "group by" is used in the child DataFrames;
3. "array" is used inside "collect_set" in the child DataFrames;
4. the join condition is "df.a==df2.a1, df.b==df2.b1" instead of "['a', 'b']"
(the form used in the sketch above).


Re: Re: the issue about the + in column,can we support the string please?

2018-03-26 Thread Shmuel Blitz
I agree.

Just pointed out the option, in case you missed it.

Cheers,
Shmuel

On Mon, Mar 26, 2018 at 10:57 AM, 1427357...@qq.com <1427357...@qq.com>
wrote:

> Hi,
>
> Using concat is one of the way.
> But the + is more intuitive and easy to understand.
>
> --
> 1427357...@qq.com
>
>
> *From:* Shmuel Blitz 
> *Date:* 2018-03-26 15:31
> *To:* 1427357...@qq.com
> *CC:* spark?users ; dev 
> *Subject:* Re: the issue about the + in column,can we support the string
> please?
> Hi,
>
> you can get the same with:
>
> import org.apache.spark.sql.functions._
> import sqlContext.implicits._
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField,
> StructType}
>
> val schema = StructType(Array(StructField("name", StringType),
> StructField("age", IntegerType) ))
>
> val lst = List(Row("Shmuel", 13), Row("Blitz", 23))
> val rdd = sc.parallelize(lst)
>
> val df = sqlContext.createDataFrame(rdd,schema)
>
> df.withColumn("newName", concat($"name" ,  lit("abc"))  ).show()
>
> On Mon, Mar 26, 2018 at 6:36 AM, 1427357...@qq.com <1427357...@qq.com>
> wrote:
>
>> Hi  all,
>>
>> I have a table like below:
>>
>> +---+-+---+
>> | id| name|sharding_id|
>> +---+-+---+
>> |  1|leader us|  1|
>> |  3|mycat|  1|
>> +---+-+---+
>>
>> My schema is :
>> root
>>  |-- id: integer (nullable = false)
>>  |-- name: string (nullable = true)
>>  |-- sharding_id: integer (nullable = false)
>>
>> I want add a new column named newName. The new column is based on "name"
>> and append "abc" after it. My code looks like:
>>
>> stud_scoreDF.withColumn("newName", stud_scoreDF.col("name") +  "abc"  
>> ).show()
>>
>> When I run the code, I got the reslult:
>> +---+-+---+---+
>> | id| name|sharding_id|newName|
>> +---+-+---+---+
>> |  1|leader us|  1|   null|
>> |  3|mycat|  1|   null|
>> +---+-+---+---+
>>
>>
>> I checked the code, the key code is  in arithmetic.scala. line 165.
>> It looks like:
>>
>> override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = 
>> dataType match {
>>   case dt: DecimalType =>
>> defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.$$plus($eval2)")
>>   case ByteType | ShortType =>
>> defineCodeGen(ctx, ev,
>>   (eval1, eval2) => s"(${ctx.javaType(dataType)})($eval1 $symbol 
>> $eval2)")
>>   case CalendarIntervalType =>
>> defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.add($eval2)")
>>   case _ =>
>> defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1 $symbol $eval2")
>> }
>>
>>
>> My issue is:
>> Can we add case StringType in this class to support string append please?
>>
>>
>>
>> --
>> 1427357...@qq.com
>>
>
>
>
> --
> Shmuel Blitz
> Big Data Developer
> Email: shmuel.bl...@similarweb.com
> www.similarweb.com
> 
> 
> 
>
>


-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.bl...@similarweb.com
www.similarweb.com

 


Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Rohit Karlupia
Hi Shmuel,

In general it is hard to pinpoint the exact code that is responsible for a
specific stage. For example, when using Spark SQL, depending upon the kinds
of joins and aggregations used in a single line of query, we will have
multiple stages in the Spark application. I usually try to split the code
into smaller chunks and also use the Spark UI, which has a special section for
SQL. It can also show specific backtraces, but as I explained earlier, they
might not be very helpful. Sparklens does help you ask the right questions,
but it is not mature enough to answer all of them.

Understanding the report:

*1) The first part shows the total aggregate metrics for the application.*

Printing application meterics.

 AggregateMetrics (Application Metrics) total measurements 1869
                 NAME                          SUM         MIN          MAX         MEAN
 diskBytesSpilled                           0.0 KB      0.0 KB       0.0 KB       0.0 KB
 executorRuntime                           15.1 hh      3.0 ms       4.0 mm      29.1 ss
 inputBytesRead                            26.1 GB      0.0 KB      43.8 MB      14.3 MB
 jvmGCTime                                 11.0 mm      0.0 ms       2.1 ss     354.0 ms
 memoryBytesSpilled                       314.2 GB      0.0 KB       1.1 GB     172.1 MB
 outputBytesWritten                         0.0 KB      0.0 KB       0.0 KB       0.0 KB
 peakExecutionMemory                        0.0 KB      0.0 KB       0.0 KB       0.0 KB
 resultSize                                12.9 MB      2.0 KB      40.9 KB       7.1 KB
 shuffleReadBytesRead                     107.7 GB      0.0 KB     276.0 MB      59.0 MB
 shuffleReadFetchWaitTime                   2.0 ms      0.0 ms       0.0 ms       0.0 ms
 shuffleReadLocalBlocks                      2,318           0           68            1
 shuffleReadRecordsRead              3,413,511,099           0    8,251,926    1,826,383
 shuffleReadRemoteBlocks                   291,126           0          824          155
 shuffleWriteBytesWritten                 107.6 GB      0.0 KB     257.6 MB      58.9 MB
 shuffleWriteRecordsWritten          3,408,133,175           0    7,959,055    1,823,506
 shuffleWriteTime                           8.7 mm      0.0 ms       1.8 ss     278.2 ms
 taskDuration                              15.4 hh     12.0 ms       4.1 mm      29.7 ss


*2) Here we show the number of hosts used and the executors per host. I have
seen users set executor memory to 33GB on a 64GB executor: a direct
waste of 31GB of memory.*

Total Hosts 135


Host server86.cluster.com startTime 02:26:21:081 executors count 3
Host server164.cluster.com startTime 02:30:12:204 executors count 1
Host server28.cluster.com startTime 02:31:09:023 executors count 1
Host server78.cluster.com startTime 02:26:08:844 executors count 5
Host server124.cluster.com startTime 02:26:10:523 executors count 3
Host server100.cluster.com startTime 02:30:24:073 executors count 1
Done printing host timeline
*3) The times at which executors were added. Not all executors are
available at the start of the application.*

Printing executors timeline
Total Hosts 135
Total Executors 250
At 02:26 executors added 52 & removed  0 currently available 52
At 02:27 executors added 10 & removed  0 currently available 62
At 02:28 executors added 13 & removed  0 currently available 75
At 02:29 executors added 81 & removed  0 currently available 156
At 02:30 executors added 48 & removed  0 currently available 204
At 02:31 executors added 45 & removed  0 currently available 249
At 02:32 executors added 1 & removed  0 currently available 250


*4) How the stages within the jobs were scheduled. This helps you
understand which stages ran in parallel and which depend on
others.*

Printing Application timeline
02:26:47:654  Stage 3 ended : maxTaskTime 3117 taskCount 1
02:26:47:708  Stage 4 started : duration 00m 02s
02:26:49:898  Stage 4 ended : maxTaskTime 226 taskCount 200
02:26:49:901 JOB 3 ended
02:26:56:234 JOB 4 started : duration 08m 28s
[  5 |||                              ]
[  6  |||                             ]
[  9                                  ]
[ 10 ||                               ]
[ 11                                  ]
[ 12 ||                               ]
[ 13                                  ]
[ 14   |||                            ]
[ 15                               || ]
02:26:58:095  Stage 5 started : duration 00m 44s
02:27:42:816  Stage 5 ended : maxTaskTime 37214 taskCount 23
02:27:03:478  Stage 6 started : duration 02m 04s
02:29:07:517  Stage 6 ended : maxTaskTime 35578 taskCount 601
02:28:56:449  Stage 9 started : duration 00m 46s
02:29:42:625  Stage 9 ended : maxTaskTime 7196 

Re: Re: the issue about the + in column,can we support the string please?

2018-03-26 Thread 1427357...@qq.com
Hi,

Using concat is one way to do it.
But + is more intuitive and easier to understand.



1427357...@qq.com
 
From: Shmuel Blitz
Date: 2018-03-26 15:31
To: 1427357...@qq.com
CC: spark?users; dev
Subject: Re: the issue about the + in column,can we support the string please?
Hi,

you can get the same with:

import org.apache.spark.sql.functions._
import sqlContext.implicits._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, 
StructType} 

val schema = StructType(Array(StructField("name", StringType),
StructField("age", IntegerType) ))

val lst = List(Row("Shmuel", 13), Row("Blitz", 23))
val rdd = sc.parallelize(lst)

val df = sqlContext.createDataFrame(rdd,schema)

df.withColumn("newName", concat($"name" ,  lit("abc"))  ).show()

On Mon, Mar 26, 2018 at 6:36 AM, 1427357...@qq.com <1427357...@qq.com> wrote:
Hi  all,

I have a table like below:

+---+---------+-----------+
| id|     name|sharding_id|
+---+---------+-----------+
|  1|leader us|          1|
|  3|    mycat|          1|
+---+---------+-----------+

My schema is :
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- sharding_id: integer (nullable = false)

I want to add a new column named newName. The new column is based on "name" with 
"abc" appended to it. My code looks like:

stud_scoreDF.withColumn("newName", stud_scoreDF.col("name") +  "abc"  ).show()

When I run the code, I get this result:
+---+---------+-----------+-------+
| id|     name|sharding_id|newName|
+---+---------+-----------+-------+
|  1|leader us|          1|   null|
|  3|    mycat|          1|   null|
+---+---------+-----------+-------+


I checked the code; the key code is in arithmetic.scala, at line 165.
It looks like:

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = dataType match {
  case dt: DecimalType =>
    defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.$$plus($eval2)")
  case ByteType | ShortType =>
    defineCodeGen(ctx, ev,
      (eval1, eval2) => s"(${ctx.javaType(dataType)})($eval1 $symbol $eval2)")
  case CalendarIntervalType =>
    defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.add($eval2)")
  case _ =>
    defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1 $symbol $eval2")
}

My question is:
can we add a case for StringType in this class to support string append, please?





1427357...@qq.com



-- 
Shmuel Blitz 
Big Data Developer 
Email: shmuel.bl...@similarweb.com 
www.similarweb.com 


Re: the issue about the + in column,can we support the string please?

2018-03-26 Thread Shmuel Blitz
Hi,

you can get the same with:

import org.apache.spark.sql.functions._
import sqlContext.implicits._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField,
StructType}

val schema = StructType(Array(StructField("name", StringType),
StructField("age", IntegerType) ))

val lst = List(Row("Shmuel", 13), Row("Blitz", 23))
val rdd = sc.parallelize(lst)

val df = sqlContext.createDataFrame(rdd,schema)

df.withColumn("newName", concat($"name" ,  lit("abc"))  ).show()
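
(For completeness, an illustrative PySpark equivalent of the same concat approach;
nothing below comes from the original thread, and the sample rows just mirror the
question's table.)

# Hypothetical PySpark version: append the literal "abc" to the "name" column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col, lit

spark = SparkSession.builder.appName("concat-example").getOrCreate()

stud_scoreDF = spark.createDataFrame(
    [(1, "leader us", 1), (3, "mycat", 1)],
    ["id", "name", "sharding_id"])

stud_scoreDF.withColumn("newName", concat(col("name"), lit("abc"))).show()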

On Mon, Mar 26, 2018 at 6:36 AM, 1427357...@qq.com <1427357...@qq.com>
wrote:

> Hi  all,
>
> I have a table like below:
>
> +---+-+---+
> | id| name|sharding_id|
> +---+-+---+
> |  1|leader us|  1|
> |  3|mycat|  1|
> +---+-+---+
>
> My schema is :
> root
>  |-- id: integer (nullable = false)
>  |-- name: string (nullable = true)
>  |-- sharding_id: integer (nullable = false)
>
> I want add a new column named newName. The new column is based on "name"
> and append "abc" after it. My code looks like:
>
> stud_scoreDF.withColumn("newName", stud_scoreDF.col("name") +  "abc"  ).show()
>
> When I run the code, I got the reslult:
> +---+-+---+---+
> | id| name|sharding_id|newName|
> +---+-+---+---+
> |  1|leader us|  1|   null|
> |  3|mycat|  1|   null|
> +---+-+---+---+
>
>
> I checked the code, the key code is  in arithmetic.scala. line 165.
> It looks like:
>
> override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = 
> dataType match {
>   case dt: DecimalType =>
> defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.$$plus($eval2)")
>   case ByteType | ShortType =>
> defineCodeGen(ctx, ev,
>   (eval1, eval2) => s"(${ctx.javaType(dataType)})($eval1 $symbol $eval2)")
>   case CalendarIntervalType =>
> defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1.add($eval2)")
>   case _ =>
> defineCodeGen(ctx, ev, (eval1, eval2) => s"$eval1 $symbol $eval2")
> }
>
>
> My issue is:
> Can we add case StringType in this class to support string append please?
>
>
>
> --
> 1427357...@qq.com
>



-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.bl...@similarweb.com
www.similarweb.com