Re: Why do I see five attempts on my Spark application

2017-12-13 Thread sanat kumar Patnaik
It should be within your yarn-site.xml config file. The parameter name is
yarn.resourcemanager.am.max-attempts.

The directory should be /usr/lib/spark/conf/yarn-conf. Try to find this
directory on your gateway node if you are using the Cloudera distribution.
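
If you want to cap this per application rather than cluster-wide, here is a
minimal PySpark sketch (the limit of 2 is only illustrative, the per-app
spark.yarn.maxAppAttempts cannot exceed the global
yarn.resourcemanager.am.max-attempts, and in cluster mode you would normally
pass it with --conf at spark-submit time instead):

# Sketch: limit YARN attempts for a single Spark application
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("attempt-limit-example")       # illustrative app name
        .set("spark.yarn.maxAppAttempts", "2"))    # illustrative per-app cap
sc = SparkContext(conf=conf)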

On Wed, Dec 13, 2017 at 2:33 PM, Subhash Sriram 
wrote:

> There are some more properties specifically for YARN here:
>
> http://spark.apache.org/docs/latest/running-on-yarn.html
>
> Thanks,
> Subhash
>
> On Wed, Dec 13, 2017 at 2:32 PM, Subhash Sriram 
> wrote:
>
>> http://spark.apache.org/docs/latest/configuration.html
>>
>> On Wed, Dec 13, 2017 at 2:31 PM, Toy  wrote:
>>
>>> Hi,
>>>
>>> Can you point me to the config for that please?
>>>
>>> On Wed, 13 Dec 2017 at 14:23 Marcelo Vanzin  wrote:
>>>
 On Wed, Dec 13, 2017 at 11:21 AM, Toy  wrote:
> I'm wondering why I'm seeing 5 attempts for my Spark application?
> Does the Spark application restart itself?

 It restarts itself if it fails (up to a limit that can be configured
 either per Spark application or globally in YARN).


 --
 Marcelo

>>>
>>
>


-- 
Regards,
Sanat Patnaik
Cell->804-882-6424


Databricks Certification Registration

2017-10-24 Thread sanat kumar Patnaik
Hello All,

Can anybody here please provide me a link to register for the Databricks Spark
developer certification (US based)? I have been googling but always end up
with this page at the end:

http://www.oreilly.com/data/sparkcert.html?cmp=ex-data-confreg-lp-na_databricks&__hssc=249029528.5.1508846982378&__hstc=249029528.4931db1bbf4dc75e6f600fda8e0dd9ed.1505493209384.1508604695490.1508846982378.5&__hsfp=1916184901=1c78b453-7a67-41f9-9140-194bd9d477c2%7C01c9252f-c93b-407b-bdf2-ed82dd18e69b

There is no link which would take me to the next step. I have tried contacting
O'Reilly and they asked me to contact Databricks. I am trying to get through to
Databricks in parallel as well.

Any help here is much appreciated.

-- 
Regards,
Sanat Patnaik
Cell->804-882-6424


Re: Executors - running out of memory

2017-01-19 Thread sanat kumar Patnaik
Please try playing with spark-defaults.conf for EMR. Dynamic allocation is
enabled by default for EMR 4.4 and above.
Which EMR version are you using?

http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#d0e20458
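
For the memoryOverhead error below, a minimal sketch of the kind of settings to
experiment with (the values are only illustrative and depend on your node
sizes; in practice these usually go into spark-defaults.conf or are passed with
--conf on spark-submit):

# Sketch: give executors more off-heap headroom and keep dynamic allocation on
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")      # EMR default on 4.4+
        .set("spark.executor.memory", "8g")                   # illustrative heap size
        .set("spark.yarn.executor.memoryOverhead", "2048"))   # illustrative, in MB
sc = SparkContext(conf=conf)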

On Thu, Jan 19, 2017 at 5:02 PM, Venkata D  wrote:

> blondowski,
>
> How big is your JSON file? Is it possible to post the Spark params or
> configurations here? That might give some idea about the issue.
>
> Thanks
>
> On Thu, Jan 19, 2017 at 4:21 PM, blondowski 
> wrote:
>
>> Please bear with me...I'm fairly new to Spark. Running pyspark 2.0.1 on AWS
>> EMR (6-node cluster with 475GB of RAM).
>>
>> We have a job that creates a dataframe from json files, then does some
>> manipulation (adds columns) and then calls a UDF.
>>
>> The job fails on the UDF call with Container killed by YARN for exceeding
>> memory limits. 6.7 GB of 6.6 GB physical memory used. Consider boosting
>> spark.yarn.executor.memoryOverhead.
>>
>> I've tried adjusting executor-memory to 48GB, but that also failed.
>>
>> What I've noticed is that during the reading of the json and creation of the
>> dataframe it uses 100+ executors, and all of the memory on the cluster is
>> being used.
>>
>> When it gets to the part where it's calling the UDF, it only allocates 3
>> executors, and they all die one by one.
>> Can somebody please explain to me how the executors get allocated?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Executors-running-out-of-memory-tp28325.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


-- 
Regards,
Sanat Patnaik
Cell->804-882-6424


Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
The performance I mentioned here is all on my local machine (my laptop).
I have tried the same thing on a cluster (Elastic MapReduce) and have seen
even worse results.

Is there a way this can be done efficiently, if any of you have tried it?
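
One alternative sketch that avoids the round trip through the RDD API, assuming
Spark 1.6's DataFrameWriter.text (the column names and the "|" delimiter below
are placeholders, and coalesce(1) still funnels everything through one task):

from pyspark.sql import functions as F

# Build a single delimited string column, then use the text writer directly.
# "col1", "col2", "col3" and "|" are placeholders for the real schema/delimiter.
out = myDF.select(F.concat_ws("|", "col1", "col2", "col3").alias("value"))
out.coalesce(1).write.text("mypath/output")

The json write in the quoted mail below is not forced onto a single partition,
which is probably a big part of why it looks so much faster; if downstream can
stitch part files together, dropping the coalesce(1) is the other obvious lever.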


On Wednesday, September 14, 2016, Jörn Franke <jornfra...@gmail.com> wrote:

> It could be that by using the rdd it converts the data from the internal
> format to Java objects (so much more memory is needed), which may lead to
> spilling over to disk. This conversion takes a lot of time. Then you need to
> transfer these Java objects over the network to one single node (repartition
> ...), which for 3 GB (about 24 Gbit) on a 1 Gbit network takes roughly 25
> seconds under optimal conditions (and since it may be transferring Java
> objects, it might even be more than 3 GB; that also assumes no other transfers
> happening at the same time, jumbo frames activated, etc.). On the destination
> node we may again have spill over to disk. Then you store it to a single disk
> (potentially multiple if you have and use HDFS), which also takes time
> (assuming that no other process uses this disk).
>
> Btw spark-csv can be used with different delimiters, not just commas.
> As said, other options are compression, avoiding repartitioning (to avoid
> network transfer), avoiding spilling to disk (provide more memory in YARN,
> etc.), and increasing network bandwidth ...
>
> On 14 Sep 2016, at 14:22, sanat kumar Patnaik <patnaik.sa...@gmail.com> wrote:
>
> These are not csv files, but utf8 files with a specific delimiter.
> I tried this out with a file (3 GB):
>
> myDF.write.json("output/myJson")
> Time taken- 60 secs approximately.
>
> myDF.rdd.repartition(1).saveAsTextFile("output/text")
> Time taken 160 secs
>
> That is where I am concerned: the time to write a text file is far higher
> than for json.
>
> On Wednesday, September 14, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> These intermediate files, what sort of files are they? Are they csv-type
>> files?
>>
>> I agree that a DF is more efficient than an RDD as it follows a tabular
>> format (I assume that is what you mean by "columnar" format). So if you
>> read these files in a batch process, you may not worry too much about
>> execution time?
>>
>> A textFile saving is simply a one to one mapping from your DF to HDFS. I
>> think it is pretty efficient.
>>
>> For myself, I would do something like below
>>
>> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 14 September 2016 at 12:46, sanat kumar Patnaik <
>> patnaik.sa...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>
>>>- I am writing a batch application using Spark SQL and Dataframes.
>>>This application has a bunch of file joins and there are intermediate
>>>points where I need to drop a file for downstream applications to consume.
>>>- The problem is that all these downstream applications are still on
>>>legacy, so they still require us to drop them a text file. As you all must
>>>know, a Dataframe stores the data in columnar format internally.
>>>
>>> The only way I found to do this, and it looks awfully slow, is this:
>>>
>>> myDF=sc.textFile("inputpath").toDF()
>>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>>
>>> Is there any better way to do this?
>>>
>>> P.S.: The other workaround would be to use RDDs for all my operations.
>>> But I am wary of using them as the documentation says Dataframes are way
>>> faster because of the Catalyst engine running behind the scenes.
>>>
>>> Please suggest if any of you might have tried something similar.
>>>
>>
>>
>
> --
> Regards,
> Sanat Patnaik
> Cell->804-882-6424
>
>

-- 
Regards,
Sanat Patnaik
Cell->804-882-6424


Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
These are not csv files, but utf8 files with a specific delimiter.
I tried this out with a file (3 GB):

myDF.write.json("output/myJson")
Time taken- 60 secs approximately.

myDF.rdd.repartition(1).saveAsTextFile("output/text")
Time taken 160 secs

That is where I am concerned: the time to write a text file is far higher
than for json.

On Wednesday, September 14, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> These intermediate files, what sort of files are they? Are they csv-type
> files?
>
> I agree that a DF is more efficient than an RDD as it follows a tabular format
> (I assume that is what you mean by "columnar" format). So if you read these
> files in a batch process, you may not worry too much about execution time?
>
> A textFile saving is simply a one to one mapping from your DF to HDFS. I
> think it is pretty efficient.
>
> For myself, I would do something like below
>
> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 September 2016 at 12:46, sanat kumar Patnaik <patnaik.sa...@gmail.com> wrote:
>
>> Hi All,
>>
>>
>>- I am writing a batch application using Spark SQL and Dataframes.
>>This application has a bunch of file joins and there are intermediate
>>points where I need to drop a file for downstream applications to consume.
>>- The problem is that all these downstream applications are still on
>>legacy, so they still require us to drop them a text file. As you all must
>>know, a Dataframe stores the data in columnar format internally.
>>
>> The only way I found to do this, and it looks awfully slow, is this:
>>
>> myDF=sc.textFile("inputpath").toDF()
>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>
>> Is there any better way to do this?
>>
>> P.S.: The other workaround would be to use RDDs for all my operations.
>> But I am wary of using them as the documentation says Dataframes are way
>> faster because of the Catalyst engine running behind the scenes.
>>
>> Please suggest if any of you might have tried something similar.
>>
>
>

-- 
Regards,
Sanat Patnaik
Cell->804-882-6424


Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
Hi All,


   - I am writing a batch application using Spark SQL and Dataframes. This
   application has a bunch of file joins and there are intermediate points
   where I need to drop a file for downstream applications to consume.
   - The problem is that all these downstream applications are still on legacy,
   so they still require us to drop them a text file. As you all must know, a
   Dataframe stores the data in columnar format internally.

The only way I found to do this, and it looks awfully slow, is this:

myDF=sc.textFile("inputpath").toDF()
myDF.rdd.repartition(1).saveAsTextFile("mypath/output")

Is there any better way to do this?

P.S.: The other workaround would be to use RDDs for all my operations. But
I am wary of using them as the documentation says Dataframes are way faster
because of the Catalyst engine running behind the scenes.

Please suggest if any of you might have tried something similar.
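
For reference, a sketch using the external spark-csv package, which also
exposes a configurable delimiter (this assumes the com.databricks:spark-csv
artifact is available to the job; the "|" delimiter and the output path are
placeholders):

# Sketch: write the DataFrame as delimited text via the spark-csv package.
# Assumes com.databricks:spark-csv is on the classpath.
(myDF.coalesce(1)
     .write
     .format("com.databricks.spark.csv")
     .option("delimiter", "|")       # placeholder delimiter
     .option("header", "false")
     .save("mypath/output_csv"))     # placeholder path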