Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-17 Thread vaquar khan
From the following link you will get all the required details:

https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/

Let me know if you require further information.
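
On the shuffle-recovery question itself: newer Spark releases on Kubernetes ship
a shuffle plugin (the KubernetesLocalDiskShuffleDataIO class mentioned further
down in this thread) that re-registers shuffle files found on local disk when an
executor starts. A minimal sketch of wiring it up, assuming a Spark version that
includes this plugin and that spark.local.dir points at the re-attached volume:

    import org.apache.spark.sql.SparkSession

    // Sketch only: the mount path is hypothetical and must match where the
    // migrated volume is attached on the new host.
    val spark = SparkSession.builder()
      .appName("shuffle-recovery-sketch")
      .config("spark.shuffle.sort.io.plugin.class",
        "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO")
      .config("spark.local.dir", "/mnt/ebs/spark-local")
      .getOrCreate()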


Regards,
Vaquar khan




On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh 
wrote:

> Couple of points
>
> Why use spot or pre-emptible instances when your application, as you stated,
> shuffles heavily?
> Have you looked at why you are having these shuffles? What is the cause of
> these large transformations ending up in a shuffle?
>
> Also on your point:
> "..then ideally we should expect that when an executor is killed/OOM'd
> and a new executor is spawned on the same host, the new executor registers
> the shuffle files to itself. Is that so?"
>
> What guarantee is that the new executor with inherited shuffle files will
> succeed?
>
> Also OOM is often associated with some form of skewed data
>
> HTH
> .
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 15 May 2023 at 13:11, Faiz Halde 
> wrote:
>
>> Hello,
>>
>> We've been in touch with a few spark specialists who suggested us a
>> potential solution to improve the reliability of our jobs that are shuffle
>> heavy
>>
>> Here is what our setup looks like
>>
>>- Spark version: 3.3.1
>>- Java version: 1.8
>>- We do not use external shuffle service
>>- We use spot instances
>>
>> We run spark jobs on clusters that use Amazon EBS volumes. The
>> spark.local.dir is mounted on this EBS volume. One of the offerings from
>> the service we use is EBS migration which basically means if a host is
>> about to get evicted, a new host is created and the EBS volume is attached
>> to it
>>
>> When Spark assigns a new executor to the newly created instance, it
>> basically can recover all the shuffle files that are already persisted in
>> the migrated EBS volume
>>
>> Is this how it works? Do executors recover / re-register the shuffle
>> files that they found?
>>
>> So far I have not come across any recovery mechanism. I can only see
>>
>> KubernetesLocalDiskShuffleDataIO
>>
>>  that has a pre-init step where it tries to register the available
>> shuffle files to itself
>>
>> A natural follow-up on this,
>>
>> If what they claim is true, then ideally we should expect that when an
>> executor is killed/OOM'd and a new executor is spawned on the same host,
>> the new executor registers the shuffle files to itself. Is that so?
>>
>> Thanks
>>
>> --
>> Confidentiality note: This e-mail may contain confidential information
>> from Nu Holdings Ltd and/or its affiliates. If you have received it by
>> mistake, please let us know by e-mail reply and delete it from your system;
>> you may not copy this message or disclose its contents to anyone; for
>> details about what personal information we collect and why, please refer to
>> our privacy policy
>> <https://api.mziq.com/mzfilemanager/v2/d/59a081d2-0d63-4bb5-b786-4c07ae26bc74/6f4939b9-5f74-a528-1835-596b481dca54>
>> .
>>
>


Re: Online classes for spark topics

2023-03-12 Thread vaquar khan
I saw you are looking for Holden's video; please find the following link.

https://www.oreilly.com/library/view/debugging-apache-spark/9781492039174/

Regards,
Vaquar khan


On Sun, Mar 12, 2023, 6:56 PM Mich Talebzadeh 
wrote:

> Hi Denny,
>
> Thanks for the offer. How do you envisage that structure to be?
>
>
> Also it would be good to have a webinar (for a given topic)  for different
> target audiences as we have a mixture of members in Spark forums. For
> example, beginners, intermediate and advanced.
>
>
> do we have a confluence page for Spark so we can use it. I guess that
> would be part of the structure you mentioned.
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 12 Mar 2023 at 22:59, Denny Lee  wrote:
>
>> Looks like we have some good topics here - I'm glad to help with setting
>> up the infrastructure to broadcast if it helps?
>>
>> On Thu, Mar 9, 2023 at 6:19 AM neeraj bhadani <
>> bhadani.neeraj...@gmail.com> wrote:
>>
>>> I am happy to be a part of this discussion as well.
>>>
>>> Regards,
>>> Neeraj
>>>
>>> On Wed, 8 Mar 2023 at 22:41, Winston Lai  wrote:
>>>
>>>> +1, any webinar on Spark related topic is appreciated 
>>>>
>>>> Thank You & Best Regards
>>>> Winston Lai
>>>> --
>>>> *From:* asma zgolli 
>>>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>>>> *To:* karan alang 
>>>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com
>>>> ; User 
>>>> *Subject:* Re: Online classes for spark topics
>>>>
>>>> +1
>>>>
>>>> Le mer. 8 mars 2023 à 21:32, karan alang  a
>>>> écrit :
>>>>
>>>> +1 .. I'm happy to be part of these discussions as well !
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I guess I can schedule this work over a course of time. I for myself
>>>> can contribute plus learn from others.
>>>>
>>>> So +1 for me.
>>>>
>>>> Let us see if anyone else is interested.
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>>>> wrote:
>>>>
>>>>
>>>> Hello Mich.
>>>>
>>>> Greetings. Would you be able to arrange for Spark Structured Streaming
>>>> learning webinar.?
>>>>
>>>> This is something I have been struggling with recently. It will be
>>>> very helpful.
>>>>
>>>> Thanks and Regard
>>>>
>>>> AK
>>>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> This might  be a worthwhile exercise on the assumption that the
>>>> contributors will find the time and bandwidth to chip in so to speak.
>>>>
>>>> I am sure there are many but on top of my head I can think of Holden
>>>> Karau for k8s, and Sean Owen for data science stuff. They are both very
>>>> experienced.
>>>>
>>>> Anyone else?
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>>>  wrote:
>>>>
>>>> Hello gurus,
>>>>
>>>> Does Spark arranges online webinars for special topics like Spark on
>>>> K8s, data science and Spark Structured Streaming?
>>>>
>>>> I would be most grateful if experts can share their experience with
>>>> learners with intermediate knowledge like myself. Hopefully we will find
>>>> the practical experiences told valuable.
>>>>
>>>> Respectively,
>>>>
>>>> AK
>>>>
>>>>
>>>>
>>>>
>>>


Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
@Gourav Sengupta, why are you sending unnecessary emails? If you think
Snowflake is good, please use it; the question here was different and you are
talking about a totally different topic.

Please respect the group guidelines.


Regards,
Vaquar khan

On Wed, Dec 28, 2022, 10:29 AM vaquar khan  wrote:

> Here you can find all the details. You just need to pass in a Spark
> DataFrame; Deequ also generates recommendations for rules, and you can write
> custom complex rules as well.
>
>
> https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/
>
> Regards,
> Vaquar khan
>
> On Wed, Dec 28, 2022, 9:40 AM rajat kumar 
> wrote:
>
>> Thanks for the input folks.
>>
>> Hi Vaquar ,
>>
>> I saw that we have various types of checks in GE and Deequ. Could you
>> please suggest what types of check did you use for Metric based columns
>>
>>
>> Regards
>> Rajat
>>
>> On Wed, Dec 28, 2022 at 12:15 PM vaquar khan 
>> wrote:
>>
>>> I would suggest Deequ; I have implemented it many times and found it easy and effective.
>>>
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Tue, Dec 27, 2022, 10:30 PM ayan guha  wrote:
>>>
>>>> The way I would approach is to evaluate GE, Deequ (there is a python
>>>> binding called pydeequ) and others like Delta Live tables with expectations
>>>> from Data Quality feature perspective. All these tools have their pros and
>>>> cons, and all of them are compatible with spark as a compute engine.
>>>>
>>>> Also, you may want to look at dbt based DQ toolsets if sql is your
>>>> thing.
>>>>
>>>> On Wed, 28 Dec 2022 at 3:14 pm, Sean Owen  wrote:
>>>>
>>>>> I think this is kind of mixed up. Data warehouses are simple SQL
>>>>> creatures; Spark is (also) a distributed compute framework. Kind of like
>>>>> comparing maybe a web server to Java.
>>>>> Are you thinking of Spark SQL? then I dunno sure you may well find it
>>>>> more complicated, but it's also just a data warehousey SQL surface.
>>>>>
>>>>> But none of that relates to the question of data quality tools. You
>>>>> could use GE with Redshift, or indeed with Spark - are you familiar with
>>>>> it? It's probably one of the most common tools people use with Spark for
>>>>> this in fact. It's just a Python lib at heart and you can apply it with
>>>>> Spark, but _not_ with a data warehouse, so I'm not sure what you're 
>>>>> getting
>>>>> at.
>>>>>
>>>>> Deequ is also commonly seen. It's actually built on Spark, so again,
>>>>> confused about this "use Redshift or Snowflake not Spark".
>>>>>
>>>>> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta <
>>>>> gourav.sengu...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> SPARK is just another querying engine with a lot of hype.
>>>>>>
>>>>>> I would highly suggest using Redshift (storage and compute decoupled
>>>>>> mode) or Snowflake without all this super complicated understanding of
>>>>>> containers/ disk-space, mind numbing variables, rocket science tuning, 
>>>>>> hair
>>>>>> splitting failure scenarios, etc. After that try to choose solutions like
>>>>>> Athena, or Trino/ Presto, and then come to SPARK.
>>>>>>
>>>>>> Try out solutions like  "great expectations" if you are looking for
>>>>>> data quality and not entirely sucked into the world of SPARK and want to
>>>>>> keep your options open.
>>>>>>
>>>>>> Dont get me wrong, SPARK used to be great in 2016-2017, but there are
>>>>>> superb alternatives now and the industry, in this recession, should focus
>>>>>> on getting more value for every single dollar they spend.
>>>>>>
>>>>>> Best of luck.
>>>>>>
>>>>>> Regards,
>>>>>> Gourav Sengupta
>>>>>>
>>>>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Well, you need to qualify your statement on data quality. Are you
>>>>>>> talking about data lineage here?
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar <
>>>>>>> kumar.rajat20...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Folks
>>>>>>>> Hoping you are doing well, I want to implement data quality to
>>>>>>>> detect issues in data in advance. I have heard about few frameworks 
>>>>>>>> like
>>>>>>>> GE/Deequ. Can anyone pls suggest which one is good and how do I get 
>>>>>>>> started
>>>>>>>> on it?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Rajat
>>>>>>>>
>>>>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>


Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
Here you can find all the details. You just need to pass in a Spark DataFrame;
Deequ also generates recommendations for rules, and you can write custom
complex rules as well.

https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/
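
A minimal Deequ sketch, assuming the com.amazon.deequ artifact is on the
classpath; df and the column names are hypothetical:

    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel}

    // Run a few declarative checks against a Spark DataFrame (df); more complex
    // custom rules can be expressed with Check.satisfies(...) in the same way.
    val result = VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "basic data quality")
          .isComplete("customer_id")     // no nulls
          .isUnique("customer_id")       // no duplicates
          .isNonNegative("amount"))      // metric-style column check
      .run()

    println(result.status)               // Success / Warning / Error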

Regards,
Vaquar khan

On Wed, Dec 28, 2022, 9:40 AM rajat kumar 
wrote:

> Thanks for the input folks.
>
> Hi Vaquar ,
>
> I saw that we have various types of checks in GE and Deequ. Could you
> please suggest what types of check did you use for Metric based columns
>
>
> Regards
> Rajat
>
> On Wed, Dec 28, 2022 at 12:15 PM vaquar khan 
> wrote:
>
>> I would suggest Deequ; I have implemented it many times and found it easy and effective.
>>
>>
>> Regards,
>> Vaquar khan
>>
>> On Tue, Dec 27, 2022, 10:30 PM ayan guha  wrote:
>>
>>> The way I would approach is to evaluate GE, Deequ (there is a python
>>> binding called pydeequ) and others like Delta Live tables with expectations
>>> from Data Quality feature perspective. All these tools have their pros and
>>> cons, and all of them are compatible with spark as a compute engine.
>>>
>>> Also, you may want to look at dbt based DQ toolsets if sql is your
>>> thing.
>>>
>>> On Wed, 28 Dec 2022 at 3:14 pm, Sean Owen  wrote:
>>>
>>>> I think this is kind of mixed up. Data warehouses are simple SQL
>>>> creatures; Spark is (also) a distributed compute framework. Kind of like
>>>> comparing maybe a web server to Java.
>>>> Are you thinking of Spark SQL? then I dunno sure you may well find it
>>>> more complicated, but it's also just a data warehousey SQL surface.
>>>>
>>>> But none of that relates to the question of data quality tools. You
>>>> could use GE with Redshift, or indeed with Spark - are you familiar with
>>>> it? It's probably one of the most common tools people use with Spark for
>>>> this in fact. It's just a Python lib at heart and you can apply it with
>>>> Spark, but _not_ with a data warehouse, so I'm not sure what you're getting
>>>> at.
>>>>
>>>> Deequ is also commonly seen. It's actually built on Spark, so again,
>>>> confused about this "use Redshift or Snowflake not Spark".
>>>>
>>>> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta <
>>>> gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> SPARK is just another querying engine with a lot of hype.
>>>>>
>>>>> I would highly suggest using Redshift (storage and compute decoupled
>>>>> mode) or Snowflake without all this super complicated understanding of
>>>>> containers/ disk-space, mind numbing variables, rocket science tuning, 
>>>>> hair
>>>>> splitting failure scenarios, etc. After that try to choose solutions like
>>>>> Athena, or Trino/ Presto, and then come to SPARK.
>>>>>
>>>>> Try out solutions like  "great expectations" if you are looking for
>>>>> data quality and not entirely sucked into the world of SPARK and want to
>>>>> keep your options open.
>>>>>
>>>>> Dont get me wrong, SPARK used to be great in 2016-2017, but there are
>>>>> superb alternatives now and the industry, in this recession, should focus
>>>>> on getting more value for every single dollar they spend.
>>>>>
>>>>> Best of luck.
>>>>>
>>>>> Regards,
>>>>> Gourav Sengupta
>>>>>
>>>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Well, you need to qualify your statement on data quality. Are you
>>>>>> talking about data lineage here?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Folks
>>>>>>> Hoping you are doing well, I want to implement data quality to
>>>>>>> detect issues in data in advance. I have heard about few frameworks like
>>>>>>> GE/Deequ. Can anyone pls suggest which one is good and how do I get 
>>>>>>> started
>>>>>>> on it?
>>>>>>>
>>>>>>> Regards
>>>>>>> Rajat
>>>>>>>
>>>>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>


Re: Profiling data quality with Spark

2022-12-27 Thread vaquar khan
I would suggest Deequ; I have implemented it many times and found it easy and
effective.
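
A quick way to get started is Deequ's constraint suggestion, sketched below
(assuming the com.amazon.deequ artifact is on the classpath; df is a
hypothetical DataFrame):

    import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

    // Profile the DataFrame and print the rules Deequ recommends per column.
    val suggestions = ConstraintSuggestionRunner()
      .onData(df)
      .addConstraintRules(Rules.DEFAULT)
      .run()

    suggestions.constraintSuggestions.foreach { case (column, colSuggestions) =>
      colSuggestions.foreach(s => println(s"$column: ${s.codeForConstraint}"))
    }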


Regards,
Vaquar khan

On Tue, Dec 27, 2022, 10:30 PM ayan guha  wrote:

> The way I would approach is to evaluate GE, Deequ (there is a python
> binding called pydeequ) and others like Delta Live tables with expectations
> from Data Quality feature perspective. All these tools have their pros and
> cons, and all of them are compatible with spark as a compute engine.
>
> Also, you may want to look at dbt based DQ toolsets if sql is your thing.
>
> On Wed, 28 Dec 2022 at 3:14 pm, Sean Owen  wrote:
>
>> I think this is kind of mixed up. Data warehouses are simple SQL
>> creatures; Spark is (also) a distributed compute framework. Kind of like
>> comparing maybe a web server to Java.
>> Are you thinking of Spark SQL? then I dunno sure you may well find it
>> more complicated, but it's also just a data warehousey SQL surface.
>>
>> But none of that relates to the question of data quality tools. You could
>> use GE with Redshift, or indeed with Spark - are you familiar with it? It's
>> probably one of the most common tools people use with Spark for this in
>> fact. It's just a Python lib at heart and you can apply it with Spark, but
>> _not_ with a data warehouse, so I'm not sure what you're getting at.
>>
>> Deequ is also commonly seen. It's actually built on Spark, so again,
>> confused about this "use Redshift or Snowflake not Spark".
>>
>> On Tue, Dec 27, 2022 at 9:55 PM Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> SPARK is just another querying engine with a lot of hype.
>>>
>>> I would highly suggest using Redshift (storage and compute decoupled
>>> mode) or Snowflake without all this super complicated understanding of
>>> containers/ disk-space, mind numbing variables, rocket science tuning, hair
>>> splitting failure scenarios, etc. After that try to choose solutions like
>>> Athena, or Trino/ Presto, and then come to SPARK.
>>>
>>> Try out solutions like  "great expectations" if you are looking for data
>>> quality and not entirely sucked into the world of SPARK and want to keep
>>> your options open.
>>>
>>> Dont get me wrong, SPARK used to be great in 2016-2017, but there are
>>> superb alternatives now and the industry, in this recession, should focus
>>> on getting more value for every single dollar they spend.
>>>
>>> Best of luck.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, Dec 27, 2022 at 7:30 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Well, you need to qualify your statement on data quality. Are you
>>>> talking about data lineage here?
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 27 Dec 2022 at 19:25, rajat kumar 
>>>> wrote:
>>>>
>>>>> Hi Folks
>>>>> Hoping you are doing well, I want to implement data quality to detect
>>>>> issues in data in advance. I have heard about few frameworks like 
>>>>> GE/Deequ.
>>>>> Can anyone pls suggest which one is good and how do I get started on it?
>>>>>
>>>>> Regards
>>>>> Rajat
>>>>>
>>>> --
> Best Regards,
> Ayan Guha
>


Re: Writing to Google Cloud Storage with v2 algorithm safe?

2021-04-03 Thread vaquar khan
Hi Jacek,

I have answered; I hope you find it useful.
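
For context, the "v2 algorithm" in the question is selected via the Hadoop
output-committer setting sketched below; whether its commit/rename behaviour is
safe on an object store such as GCS is exactly what the linked Stack Overflow
question asks.

    import org.apache.spark.sql.SparkSession

    // Sketch: enabling the v2 FileOutputCommitter algorithm for a job.
    val spark = SparkSession.builder()
      .appName("v2-committer-sketch")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()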

Regards,
Vaquar khan

On Sat, Apr 3, 2021 at 11:19 AM Jacek Laskowski  wrote:

> Hi,
>
> I've just posted a question on StackOverflow [1] about the safety of the
> v2 algorithm while writing out to Google Cloud Storage. I think I'm missing
> some fundamentals on how cloud object stores work (GCS in particular) and
> hence the question.
>
> Is this all about File.rename and how many HTTP calls are there under the
> covers? How to know it for GCS?
>
> Thank you for any help you can provide. Merci beaucoup mes amis :)
>
> [1] https://stackoverflow.com/q/66933229/1305344
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
> <https://twitter.com/jaceklaskowski>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: Coalesce vs reduce operation parameter

2021-03-20 Thread vaquar khan
Hi Pedro,

What is your use case? Why did you use coalesce()? Collapsing to a small number
of partitions with coalesce() reduces the parallelism of the stage it is fused
into, and repartition() shuffles data across the cluster, so try to minimize
repartitioning as much as possible.

Regards,
Vaquar khan


On Thu, Mar 18, 2021, 5:47 PM Pedro Tuero  wrote:

> I was reviewing a spark java application running on aws emr.
>
> The code was like:
> RDD.reduceByKey(func).coalesce(number).saveAsTextFile()
>
> That stage took hours to complete.
> I changed to:
> RDD.reduceByKey(func, number).saveAsTextFile()
> And it now takes less than 2 minutes, and the final output is the same.
>
> So, is it a bug or a feature?
> Why spark doesn't treat a coalesce after a reduce like a reduce with
> output partitions parameterized?
>
> Just for understanding,
> Thanks,
> Pedro.
>
>
>
>


Re: How to submit a job via REST API?

2020-11-24 Thread vaquar khan
Hi Yang,

Please find the following link:

https://stackoverflow.com/questions/63677736/spark-application-as-a-rest-service/63678337#63678337

Regards,
Vaquar khan

On Wed, Nov 25, 2020 at 12:40 AM Sonal Goyal  wrote:

> You should be able to supply the --conf and its values as part of appArgs
> argument
>
> Cheers,
> Sonal
> Nube Technologies <http://www.nubetech.co/>
> Join me at
> Data Con LA Oct 23 | Big Data Conference Europe. Nov 24 | GIDS AI/ML Dec 3
>
>
>
>
> On Tue, Nov 24, 2020 at 11:31 AM Dennis Suhari 
> wrote:
>
>> Hi Yang,
>>
>> I am using Livy Server for submitting jobs.
>>
>> Br,
>>
>> Dennis
>>
>>
>>
>> Von meinem iPhone gesendet
>>
>> Am 24.11.2020 um 03:34 schrieb Zhou Yang :
>>
>> 
>> Dear experts,
>>
>> I found a convenient way to submit job via Rest API at
>> https://gist.github.com/arturmkrtchyan/5d8559b2911ac951d34a#file-submit_job-sh
>> .
>> But I did not know whether can I append `—conf` parameter like what I did
>> in spark-submit. Can someone can help me with this issue?
>>
>> *Regards, Yang*
>>
>>

-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: Read text file row by row and apply conditions

2019-09-30 Thread vaquar khan
Hi Swetha,

It would be great if you could ask the same question on Stack Overflow; we have
a very active community there and monitor the site for Spark questions.

If you ask the question via Stack Overflow, other people with similar problems
will also benefit from the answer.
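
That said, a minimal sketch of the split described in the question (the output
table names are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("indicator-split").getOrCreate()
    import spark.implicits._

    // Tag each line with its indicator, i.e. the 3rd pipe-separated field.
    val tagged = spark.read.textFile("/path/to/sample.txt")
      .map(line => (line.split("\\|", -1)(2), line))
      .toDF("indicator", "line")

    // Route each indicator type to its own table.
    Seq("A" -> "table1", "D" -> "table2", "U" -> "table3").foreach { case (ind, table) =>
      tagged.filter($"indicator" === ind).select("line")
        .write.mode("append").saveAsTable(table)
    }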

Regards,
Vaquar khan

On Sun, Sep 29, 2019, 10:26 PM swetha kadiyala 
wrote:

> dear friends,
>
> I am new to spark. can you please help me to read the below text file
> using spark and scala.
>
> Sample data
>
> bret|lee|A|12345|ae545|gfddfg|86786786
> 142343345||D|ae342
> 67567|6|U|aadfsd|34k4|84304|020|sdnfsdfn|3243|benej|32432|jsfsdf|3423
> 67564|67747|U|aad434|3435|843454|203|sdn454dfn|3233|gdfg|34325|sdfsddf|7657
>
>
> I am receiving indicator type with 3 rd column of each row. if my
> indicator type=A, then i need to store that particular row data into a
> table called Table1.
> if indicator type=D then i have to store data into seperate table called
> TableB and same as indicator type=U then i have to store all rows data into
> a separate table called Table3.
>
> Can anyone help me how to read row by row and split the columns and apply
> the condition based on indicator type and store columns data into
> respective tables.
>
> Thanks,
> Swetha
>


Re: Read hdfs files in spark streaming

2019-06-09 Thread vaquar khan
Hi Deepak,

You can use textFileStream.

https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
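
For reference, a minimal DStream sketch of textFileStream, which picks up new
files as they land in a monitored directory (the directory and batch interval
are hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("hdfs-dir-stream")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Each new file appearing under this directory becomes part of the next batch.
    val lines = ssc.textFileStream("hdfs:///data/incoming")
    lines.foreachRDD(rdd => println(s"new records in batch: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()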

Please also consider asking questions on Stack Overflow so that other people
can benefit from the answers.


Regards,
Vaquar khan

On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma  wrote:

> I am using spark streaming application to read from  kafka.
> The value coming from kafka message is path to hdfs file.
> I am using spark 2.x , spark.read.stream.
> What is the best way to read this path in spark streaming and then read
> the json stored at the hdfs path , may be using spark.read.json , into a df
> inside the spark streaming app.
> Thanks a lot in advance
>
> --
> Thanks
> Deepak
>


Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread vaquar khan
Sure, let me check the JIRA.

Regards,
Vaquar khan

On Thu, Jun 21, 2018, 4:42 PM Takeshi Yamamuro 
wrote:

> In this ticket SPARK-24201, the ambiguous statement in the doc had been
> pointed out.
> can you make pr for that?
>
> On Fri, Jun 22, 2018 at 6:17 AM, vaquar khan 
> wrote:
>
>> https://spark.apache.org/docs/2.3.0/
>>
>> To avoid confusion we need to update the doc with the supported Java
>> versions; the wording "Java 8+" is confusing for users.
>>
>> Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API,
>> Spark 2.3.0 uses Scala 2.11. You will need to use a compatible Scala
>> version (2.11.x).
>>
>>
>> Regards,
>> Vaquar khan
>>
>> On Thu, Jun 21, 2018 at 11:56 AM, chriswakare <
>> chris.newski...@intellibridge.co> wrote:
>>
>>> Hi Rahul,
>>> This will work only in Java 8.
>>> Installation does not work with both version 9 and 10
>>>
>>> Thanks,
>>> Christopher
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783
>> Greater Chicago
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread vaquar khan
https://spark.apache.org/docs/2.3.0/

To avoid confusion we need to update the doc with the supported Java versions;
the wording "Java 8+" is confusing for users.

Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API,
Spark 2.3.0 uses Scala 2.11. You will need to use a compatible Scala
version (2.11.x).


Regards,
Vaquar khan

On Thu, Jun 21, 2018 at 11:56 AM, chriswakare <
chris.newski...@intellibridge.co> wrote:

> Hi Rahul,
> This will work only in Java 8.
> Installation does not work with both version 9 and 10
>
> Thanks,
> Christopher
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: G1GC vs ParallelGC

2018-06-20 Thread vaquar khan
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
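
For reference, a sketch of switching the executors to G1 together with two of
the knobs the article discusses (the values are illustrative, not
recommendations):

    import org.apache.spark.sql.SparkSession

    // GC flags only take effect when the executor JVMs are launched, so in
    // practice these are usually passed via spark-submit or spark-defaults.conf.
    val spark = SparkSession.builder()
      .appName("g1gc-tuning-sketch")
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4")
      .getOrCreate()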

Regards,
Vaquar khan

On Wed, Jun 20, 2018, 1:18 AM Aakash Basu 
wrote:

> Hi guys,
>
> I just wanted to know, why my ParallelGC (*--conf
> "spark.executor.extraJavaOptions=-XX:+UseParallelGC"*) in a very long
> Spark ML Pipeline works faster than when I set G1GC (*--conf
> "spark.executor.extraJavaOptions=-XX:+UseG1GC"*), even though the Spark
> community suggests G1GC to be much better than the ParallelGC.
>
> Any pointers?
>
> Thanks,
> Aakash.
>


Re: load hbase data using spark

2018-06-20 Thread vaquar khan
Why do you need a tool? You can connect to HBase directly from Spark.
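
A minimal sketch using HBase's TableInputFormat through newAPIHadoopRDD (the
table name is hypothetical, and hbase-site.xml is assumed to be on the
classpath):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hbase-read-sketch").getOrCreate()

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // Each record is (row key, Result); map the Result cells into whatever shape you need.
    val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"rows: ${hbaseRdd.count()}")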

Regards,
Vaquar khan

On Jun 18, 2018 4:37 PM, "Lian Jiang"  wrote:

Hi,

I am considering tools to load hbase data using spark. One choice is
https://github.com/Huawei-Spark/Spark-SQL-on-HBase. However, this seems to
be out-of-date (e.g. "This version of 1.0.0 requires Spark 1.4.0."). Which
tool should I use for this purpose? Thanks for any hint.


Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-17 Thread vaquar khan
Totally agreed with Eyal.

The problem is that when the Java code Catalyst generates for DataFrame/Dataset
programs is compiled into bytecode, no single method may be 64 KB or larger.
That is a limitation of the Java class-file format, and exceeding it is what
raises this exception.

To avoid the exception, Spark tries to split generated methods that are likely
to grow beyond 64 KB into multiple smaller methods when Catalyst emits the Java
code.

Use persist() or some other logical separation in the pipeline.
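
A sketch of both workarounds discussed in this thread (the wholeStage flag comes
from Eyal's reply quoted below; the checkpoint directory is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("codegen-64kb-workarounds")
      // Fall back to the interpreted plan instead of whole-stage codegen.
      .config("spark.sql.codegen.wholeStage", "false")
      .getOrCreate()

    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Alternatively, cut the pipeline into pieces so each generated stage stays small:
    //   val intermediate = firstHalfOfPipeline(df).persist()   // or .checkpoint()
    //   val result = secondHalfOfPipeline(intermediate)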

Regards,
Vaquar khan

On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny  wrote:

> Hi Akash,
> such errors might appear in large spark pipelines, the root cause is a
> 64kb jvm limitation.
> the reason that your job isn't failing at the end is due to spark fallback
> - if code gen is failing, spark compiler will try to create the flow
> without the code gen (less optimized)
> if you do not want to see this error, you can either disable code gen
> using the flag:  spark.sql.codegen.wholeStage= "false"
> or you can try to split your complex pipeline into several spark flows if
> possible
>
> hope that helps
>
> Eyal
>
> On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu 
> wrote:
>
>> Hi,
>>
>> I already went through it, that's one use case. I've a complex and very
>> big pipeline of multiple jobs under one spark session. Not getting, on how
>> to solve this, as it is happening over Logistic Regression and Random
>> Forest models, which I'm just using from Spark ML package rather than doing
>> anything by myself.
>>
>> Thanks,
>> Aakash.
>>
>> On Sun 17 Jun, 2018, 8:21 AM vaquar khan,  wrote:
>>
>>> Hi Akash,
>>>
>>> Please check stackoverflow.
>>>
>>> https://stackoverflow.com/questions/41098953/codegen-grows-
>>> beyond-64-kb-error-when-normalizing-large-pyspark-dataframe
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu >> > wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I'm getting an error when I'm feature engineering on 30+ columns to
>>>> create about 200+ columns. It is not failing the job, but the ERROR shows.
>>>> I want to know how can I avoid this.
>>>>
>>>> Spark - 2.3.1
>>>> Python - 3.6
>>>>
>>>> Cluster Config -
>>>> 1 Master - 32 GB RAM, 16 Cores
>>>> 4 Slaves - 16 GB RAM, 8 Cores
>>>>
>>>>
>>>> Input data - 8 partitions of parquet file with snappy compression.
>>>>
>>>> My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077
>>>> --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5
>>>> --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf
>>>> spark.driver.maxResultSize=2G --conf 
>>>> "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
>>>> --conf spark.scheduler.listenerbus.eventqueue.capacity=2 --conf
>>>> spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py
>>>> > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_
>>>> 33_col.txt
>>>>
>>>> Stack-Trace below -
>>>>
>>>> ERROR CodeGenerator:91 - failed to compile:
>>>>> org.codehaus.janino.InternalCompilerException: Compiling
>>>>> "GeneratedClass": Code of method "processNext()V" of class
>>>>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$Ge
>>>>> neratedIteratorForCodegenStage3426" grows beyond 64 KB
>>>>> org.codehaus.janino.InternalCompilerException: Compiling
>>>>> "GeneratedClass": Code of method "processNext()V" of class
>>>>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$Ge
>>>>> neratedIteratorForCodegenStage3426" grows beyond 64 KB
>>>>> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.
>>>>> java:361)
>>>>> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:
>>>>> 234)
>>>>> at org.codehaus.janino.SimpleCompiler.compileToClassLoader(Simp
>>>>> leCompiler.java:446)
>>>>> at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassB
>>>>> odyEvaluator.java:313)
>>>>> at org.codehaus.janino.ClassBodyE

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-16 Thread vaquar khan
Hi Akash,

Please check stackoverflow.

https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe

Regards,
Vaquar khan

On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu 
wrote:

> Hi guys,
>
> I'm getting an error when I'm feature engineering on 30+ columns to create
> about 200+ columns. It is not failing the job, but the ERROR shows. I want
> to know how can I avoid this.
>
> Spark - 2.3.1
> Python - 3.6
>
> Cluster Config -
> 1 Master - 32 GB RAM, 16 Cores
> 4 Slaves - 16 GB RAM, 8 Cores
>
>
> Input data - 8 partitions of parquet file with snappy compression.
>
> My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077
> --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5
> --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf
> spark.driver.maxResultSize=2G --conf "spark.executor.
> extraJavaOptions=-XX:+UseParallelGC" --conf 
> spark.scheduler.listenerbus.eventqueue.capacity=2
> --conf spark.sql.codegen=true 
> /appdata/bblite-codebase/pipeline_data_test_run.py
> > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt
>
> Stack-Trace below -
>
> ERROR CodeGenerator:91 - failed to compile: 
> org.codehaus.janino.InternalCompilerException:
>> Compiling "GeneratedClass": Code of method "processNext()V" of class
>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$
>> GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
>> org.codehaus.janino.InternalCompilerException: Compiling
>> "GeneratedClass": Code of method "processNext()V" of class
>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$
>> GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
>> at org.codehaus.janino.UnitCompiler.compileUnit(
>> UnitCompiler.java:361)
>> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
>> at org.codehaus.janino.SimpleCompiler.compileToClassLoader(
>> SimpleCompiler.java:446)
>> at org.codehaus.janino.ClassBodyEvaluator.compileToClass(
>> ClassBodyEvaluator.java:313)
>> at org.codehaus.janino.ClassBodyEvaluator.cook(
>> ClassBodyEvaluator.java:235)
>> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
>> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
>> at org.apache.spark.sql.catalyst.expressions.codegen.
>> CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$
>> CodeGenerator$$doCompile(CodeGenerator.scala:1417)
>> at org.apache.spark.sql.catalyst.expressions.codegen.
>> CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
>> at org.apache.spark.sql.catalyst.expressions.codegen.
>> CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
>> at org.spark_project.guava.cache.LocalCache$LoadingValueReference.
>> loadFuture(LocalCache.java:3599)
>> at org.spark_project.guava.cache.LocalCache$Segment.loadSync(
>> LocalCache.java:2379)
>> at org.spark_project.guava.cache.LocalCache$Segment.
>> lockedGetOrLoad(LocalCache.java:2342)
>> at org.spark_project.guava.cache.LocalCache$Segment.get(
>> LocalCache.java:2257)
>> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>> at org.spark_project.guava.cache.LocalCache.getOrLoad(
>> LocalCache.java:4004)
>> at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.
>> get(LocalCache.java:4874)
>> at org.apache.spark.sql.catalyst.expressions.codegen.
>> CodeGenerator$.compile(CodeGenerator.scala:1365)
>> at org.apache.spark.sql.execution.WholeStageCodegenExec.
>> liftedTree1$1(WholeStageCodegenExec.scala:579)
>> at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(
>> WholeStageCodegenExec.scala:578)
>> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
>> execute$1.apply(SparkPlan.scala:131)
>> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
>> execute$1.apply(SparkPlan.scala:127)
>> at org.apache.spark.sql.execution.SparkPlan$$anonfun$
>> executeQuery$1.apply(SparkPlan.scala:155)
>> at org.apache.spark.rdd.RDDOperationScope$.withScope(
>> RDDOperationScope.scala:151)
>> at org.apache.spark.sql.execution.SparkPlan.
>> executeQuery(SparkPlan.scala:152)
>> at org.apache.spark.sql.execution.SparkPlan.execute(
>> SparkPlan.scala:127)
>> at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.
>> prepareShuffleDependency(ShuffleExchangeExec.scala:92)
>> at org.apache.spark.sql.execution.exchange.
>

Re: Not able to sort out environment settings to start spark from windows

2018-06-16 Thread vaquar khan
Please check your JAVA_HOME path; there may be a special character or a space
in the path.

Regards,
Vaquar khan

On Sat, Jun 16, 2018, 1:36 PM Raymond Xie  wrote:

> I am trying to run spark-shell in Windows but receive error of:
>
> \Java\jre1.8.0_151\bin\java was unexpected at this time.
>
> Environment:
>
> System variables:
>
> SPARK_HOME:
>
> c:\spark
>
> Path:
>
> C:\Program Files (x86)\Common
> Files\Oracle\Java\javapath;C:\ProgramData\Anaconda2;C:\ProgramData\Anaconda2\Library\mingw-w64\bin;C:\ProgramData\Anaconda2\Library\usr\bin;C:\ProgramData\Anaconda2\Library\bin;C:\ProgramData\Anaconda2\Scripts;C:\ProgramData\Oracle\Java\javapath;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;I:\Anaconda2;I:\Anaconda2\Scripts;I:\Anaconda2\Library\bin;C:\Program
> Files (x86)\sbt\\bin;C:\Program Files (x86)\Microsoft SQL
> Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL
> Server\100\Tools\Binn\;C:\Program Files\Microsoft SQL
> Server\100\DTS\Binn\;C:\Program Files (x86)\Microsoft SQL
> Server\100\Tools\Binn\VSShell\Common7\IDE\;C:\Program Files (x86)\Microsoft
> Visual Studio 9.0\Common7\IDE\PrivateAssemblies\;C:\Program Files
> (x86)\Microsoft SQL
> Server\100\DTS\Binn\;%DDPATH%;%USERPROFILE%\.dnx\bin;C:\Program
> Files\Microsoft DNX\Dnvm\;C:\Program Files\Microsoft SQL
> Server\130\Tools\Binn\;C:\jre1.8.0_151\bin\server;C:\Program Files
> (x86)\OpenSSH\bin;C:\Program Files (x86)\Calibre2\;C:\Program
> Files\nodejs\;C:\Program Files (x86)\Skype\Phone\;
> %JAVA_HOME%\bin;%JAVA_HOME%\jre\bin;C:\Program Files
> (x86)\scala\bin;C:\hadoop\bin;C:\Program Files\Git\cmd;I:\Program
> Files\EmEditor; C:\RXIE\Learning\Spark\bin;C:\spark\bin
>
> JAVA_HOME:
>
> C:\jdk1.8.0_151\bin
>
> JDK_HOME:
>
> C:\jdk1.8.0_151
>
> I also copied all  C:\jdk1.8.0_151 to  C:\Java\jdk1.8.0_151, and received
> the same error.
>
> Any help is greatly appreciated.
>
> Thanks.
>
>
>
>
> **
> *Sincerely yours,*
>
>
> *Raymond*
>


Re: spark optimized pagination

2018-06-11 Thread vaquar khan
Spark is a processing engine, not a storage layer or cache. You can dump your
results back to Cassandra, and if you still see latency you can put a cache in
front of the Spark results.

In short the answer is no: Spark doesn't give you any API for cache-style
storage.

Reading millions of records directly from the dataset on every request will add
a big delay to the response.
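
A sketch of writing the pre-aggregated result back to Cassandra so the REST
layer can page over it there (assumes the DataStax spark-cassandra-connector;
the keyspace, table and aggregatedDf names are hypothetical):

    // Persist the computed result once; subsequent page requests read Cassandra directly.
    aggregatedDf.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "api_cache", "table" -> "report_results"))
      .mode("append")
      .save()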

Regards,
Vaquar khan

On Mon, Jun 11, 2018, 2:59 AM Teemu Heikkilä  wrote:

> So you are now providing the data on-demand through spark?
>
> I suggest you change your API to query from cassandra and store the
> results from Spark back there, that way you will have to process the whole
> dataset just once and cassandra is suitable for that kind of workloads.
>
> -T
>
> On 10 Jun 2018, at 8.12, onmstester onmstester 
> wrote:
>
> Hi,
> I'm using spark on top of cassandra as backend CRUD of a Restfull
> Application.
> Most of Rest API's retrieve huge amount of data from cassandra and doing a
> lot of aggregation on them  in spark which take some seconds.
>
> Problem: sometimes the output result would be a big list which make client
> browser throw stop script, so we should paginate the result at the
> server-side,
> but it would be so annoying for user to wait some seconds on each page to
> cassandra-spark processings,
>
> Current Dummy Solution: For now i was thinking about assigning a UUID to
> each request which would be sent back and forth between server-side and
> client-side,
> the first time a rest API invoked, the result would be saved in a
> temptable  and in subsequent similar requests (request for next pages) the
> result would be fetch from
> temptable (instead of common flow of retrieve from cassandra + aggregation
> in spark which would take some time). On memory limit, the old results
> would be deleted.
>
> Is there any built-in clean caching strategy in spark to handle such
> scenarios?
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>
>


Re: Process large JSON file without causing OOM

2017-11-13 Thread vaquar khan
https://stackoverflow.com/questions/26562033/how-to-set-apache-spark-executor-memory
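
As an additional note: with master=local[*] the whole job runs inside the driver
JVM, so the heap is governed by that JVM's -Xmx rather than by
spark.executor.memory. A sketch of the conversion itself (paths are
hypothetical; the customer partition column comes from the question):

    import org.apache.spark.sql.SparkSession

    // Embedded local-mode session: size the heap of the embedding JVM
    // (e.g. java -Xmx8g -jar service.jar).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-to-orc")
      .getOrCreate()

    spark.read.json("/data/input_json")
      .write
      .partitionBy("customer")
      .option("compression", "zlib")
      .orc("/data/output_orc")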

Regards,
Vaquar khan

On Mon, Nov 13, 2017 at 6:22 PM, Alec Swan <alecs...@gmail.com> wrote:

> Hello,
>
> I am using the Spark library to convert JSON/Snappy files to ORC/ZLIB
> format. Effectively, my Java service starts up an embedded Spark cluster
> (master=local[*]) and uses Spark SQL to convert JSON to ORC. However, I
> keep getting OOM errors with large (~1GB) files.
>
> I've tried different ways to reduce memory usage, e.g. by partitioning
> data with dataSet.partitionBy("customer).save(filePath), or capping
> memory usage by setting spark.executor.memory=1G, but to no vail.
>
> I am wondering if there is a way to avoid OOM besides splitting the source
> JSON file into multiple smaller ones and processing the small ones
> individually? Does Spark SQL have to read the JSON/Snappy (row-based) file
> in it's entirety before converting it to ORC (columnar)? If so, would it
> make sense to create a custom receiver that reads the Snappy file and use
> Spark streaming for ORC conversion?
>
> Thanks,
>
> Alec
>



-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: Use of Accumulators

2017-11-13 Thread vaquar khan
Confirmed, you can use accumulators for this :)
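
A minimal sketch is below; note that accumulator updates made inside
transformations are only applied when an action runs, and may be applied more
than once if a task is retried, so treat them as best-effort flags/counters.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("accumulator-sketch").getOrCreate()
    val variable2 = spark.sparkContext.longAccumulator("variable2")

    val data = spark.sparkContext.parallelize(1 to 100)
    val out = data.map { x =>
      if (x % 10 == 0) variable2.add(1)   // "toggle"/count inside a transformation
      x
    }

    out.count()                           // the action triggers the accumulator updates
    println(s"variable2 = ${variable2.value}")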

Regards,
Vaquar khan

On Mon, Nov 13, 2017 at 10:58 AM, Kedarnath Dixit <
kedarnath_di...@persistent.com> wrote:

> Hi,
>
>
> We need some way to toggle the flag of  a variable in transformation.
>
>
> We are thinking to make use of spark  Accumulators for this purpose.
>
>
> Can we use these as below:
>
>
> Variables  -> Initial Value
>
>  Variable1 -> 0
>
>  Variable2 -> 0
>
>
> In one of the transformations if we need to make Variable2's value to 1.
> Can we achieve this using Accumulators? Please confirm.
>
>
> Thanks!
>
>
> With Regards,
>
> *~Kedar Dixit*
>
> kedarnath_di...@persistent.com | @kedarsdixit | M +91 90499 15588 | T +91 (20) 6703 4783
>
> *Persistent Systems | **Partners In Innovation** | www.persistent.com
> <http://www.persistent.com>*
> DISCLAIMER
> ==
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>



-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread vaquar khan
If you're running in cluster mode, the file needs to be reachable from all the
nodes, either on a shared file system or copied to each node. You can:

1) put it into a distributed filesystem such as HDFS, or transfer it via (s)ftp
(see the sketch below), or

2) transfer/sftp the file onto each worker node before running the Spark job,
and then pass the path of the file on the worker filesystem as the argument to
the read call.
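
A sketch of option 1), reusing the SparkSession built in the quoted code and
reading the CSV from HDFS so every executor resolves the same path (the
namenode address and path are hypothetical):

    val flightDF = spark.read
      .option("header", "true")
      .csv("hdfs://namenode:8020/data/sampleflightdata")
    flightDF.printSchema()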

Regards,
Vaquar khan

On Fri, Sep 29, 2017 at 2:00 PM, JG Perrin <jper...@lumeris.com> wrote:

> On a test system, you can also use something like
> Owncloud/Nextcloud/Dropbox to insure that the files are synchronized. Would
> not do it for TB of data ;) ...
>
> -Original Message-
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: Friday, September 29, 2017 5:14 AM
> To: Gaurav1809 <gauravhpan...@gmail.com>
> Cc: user@spark.apache.org
> Subject: Re: [Spark-Submit] Where to store data files while running job in
> cluster mode?
>
> You should use a distributed filesystem such as HDFS. If you want to use
> the local filesystem then you have to copy each file to each node.
>
> > On 29. Sep 2017, at 12:05, Gaurav1809 <gauravhpan...@gmail.com> wrote:
> >
> > Hi All,
> >
> > I have multi node architecture of (1 master,2 workers) Spark cluster,
> > the job runs to read CSV file data and it works fine when run on local
> > mode (Local(*)).
> > However, when the same job is ran in cluster mode(Spark://HOST:PORT),
> > it is not able to read it.
> > I want to know how to reference the files Or where to store them?
> > Currently the CSV data file is on master(from where the job is
> submitted).
> >
> > Following code works fine in local mode but not in cluster mode.
> >
> > val spark = SparkSession
> >  .builder()
> >  .appName("SampleFlightsApp")
> >  .master("spark://masterIP:7077") // change it to
> > .master("local[*]) for local mode
> >  .getOrCreate()
> >
> >val flightDF =
> > spark.read.option("header",true).csv("/home/username/sampleflightdata")
> >flightDF.printSchema()
> >
> > Error: FileNotFoundException: File
> > file:/home/username/sampleflightdata does not exist
> >
> >
> >
> > --
> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: What are factors need to Be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-23 Thread vaquar khan
http://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide
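
Since the question mentions configuration and SparkSession changes, here is a
minimal sketch of the 2.x entry point that replaces SQLContext/HiveContext
(purely illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("migrated-etl")
      .enableHiveSupport()        // only if the 1.6 job used HiveContext
      .getOrCreate()

    val sc = spark.sparkContext   // for existing core-Spark (RDD) code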

Regards,
Vaquar khan

On Fri, Sep 22, 2017 at 4:41 PM, Gokula Krishnan D <email2...@gmail.com>
wrote:

> Thanks for the reply. Forgot to mention that, our Batch ETL Jobs are in
> Core-Spark.
>
>
> On Sep 22, 2017, at 3:13 PM, Vadim Semenov <vadim.seme...@datadoghq.com>
> wrote:
>
> 1. 40s is pretty negligible unless you run your job very frequently, there
> can be many factors that influence that.
>
> 2. Try to compare the CPU time instead of the wall-clock time
>
> 3. Check the stages that got slower and compare the DAGs
>
> 4. Test with dynamic allocation disabled
>
> On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D <email2...@gmail.com>
> wrote:
>
>> Hello All,
>>
>> Currently our Batch ETL Jobs are in Spark 1.6.0 and planning to upgrade
>> into Spark 2.1.0.
>>
>> With minor code changes (like configuration and Spark Session.sc) able to
>> execute the existing JOB into Spark 2.1.0.
>>
>> But noticed that JOB completion timings are much better in Spark 1.6.0
>> but no in Spark 2.1.0.
>>
>> For the instance, JOB A completed in 50s in Spark 1.6.0.
>>
>> And with the same input and JOB A completed in 1.5 mins in Spark 2.1.0.
>>
>> Is there any specific factor needs to be considered when switching to
>> Spark 2.1.0 from Spark 1.6.0.
>>
>>
>>
>> Thanks & Regards,
>> Gokula Krishnan* (Gokul)*
>>
>
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: Apache Spark - MLLib challenges

2017-09-23 Thread vaquar khan
MLlib's original API is the old RDD-based one; since Apache Spark 2 the
recommendation is to use the Dataset-based APIs for better performance, which
is why spark.ml was introduced.

spark.ml provides the newer API built around DataFrames and ML Pipelines, while
the RDD-based mllib API is slowly being deprecated (this has already happened
for linear regression) and has entered maintenance mode.
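
A minimal sketch of the DataFrame-based spark.ml Pipeline API referred to above
(the column names and trainingDf are hypothetical):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    // Assemble raw feature columns into a vector, then fit a model on it.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val model = new Pipeline().setStages(Array(assembler, lr)).fit(trainingDf)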


Regards,
Vaquar khan

On Sat, Sep 23, 2017 at 4:04 PM, Koert Kuipers <ko...@tresata.com> wrote:

> our main challenge has been the lack of support for missing values
> generally
>
> On Sat, Sep 23, 2017 at 3:41 AM, Irfan Kabli <irfan.kabli...@gmail.com>
> wrote:
>
>> Dear All,
>>
>> We are looking to position MLLib in our organisation for machine learning
>> tasks and are keen to understand if their are any challenges that you might
>> have seen with MLLib in production. We will be going with the pure
>> open-source approach here, rather than using one of the hadoop
>> distributions out their in the market.
>>
>> Furthemore, with a multi-tenant hadoop cluster, and data in memory, would
>> spark support encrypting the data in memory with DataFrames.
>>
>> --
>> Best Regards,
>> Irfan Kabli
>>
>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: Do we always need to go through spark-submit?

2017-08-30 Thread vaquar khan
Hi Kant,

Answer: yes.

The org.apache.spark.launcher
<https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/launcher/package-summary.html>
package
provides classes for launching Spark jobs as child processes using a simple
Java API.
*Doc:*  https://spark.apache.org/docs/latest/rdd-programming-guide.html


*Library for launching Spark applications.*

This library allows applications to launch Spark programmatically. There's
only one entry point to the library - the SparkLauncher
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html>
 class.

The SparkLauncher.startApplication(
org.apache.spark.launcher.SparkAppHandle.Listener...)
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html#startApplication-org.apache.spark.launcher.SparkAppHandle.Listener...->
can
be used to start Spark and provide a handle to monitor and control the
running application:


   import org.apache.spark.launcher.SparkAppHandle;
   import org.apache.spark.launcher.SparkLauncher;

   public class MyLauncher {
 public static void main(String[] args) throws Exception {
   SparkAppHandle handle = new SparkLauncher()
 .setAppResource("/my/app.jar")
 .setMainClass("my.spark.app.Main")
 .setMaster("local")
 .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
 .startApplication();
   // Use handle API to monitor / control application.
 }
   }


It's also possible to launch a raw child process, using the
SparkLauncher.launch()
<https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html#launch-->
 method:



   import org.apache.spark.launcher.SparkLauncher;

   public class MyLauncher {
 public static void main(String[] args) throws Exception {
   Process spark = new SparkLauncher()
 .setAppResource("/my/app.jar")
 .setMainClass("my.spark.app.Main")
 .setMaster("local")
 .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
 .launch();
   spark.waitFor();
 }
   }


*Note :*

a user application is launched using the bin/spark-submit script. This
script takes care of setting up the classpath with Spark and its
dependencies, and can support different cluster managers and deploy modes
that Spark supports:

Regards,
Vaquar khan

On Wed, Aug 30, 2017 at 3:58 PM, Irving Duran <irving.du...@gmail.com>
wrote:

> I don't know how this would work, but maybe your .jar calls spark-submit
> from within your jar if you were to compile the jar with the spark-submit
> class.
>
>
> Thank You,
>
> Irving Duran
>
> On Wed, Aug 30, 2017 at 10:57 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> Hi All,
>>
>> I understand spark-submit sets up its own class loader and other things
>> but I am wondering if it  is possible to just compile the code and run it
>> using "java -jar mysparkapp.jar" ?
>>
>> Thanks,
>> kant
>>
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread vaquar khan
Hi Alex,

I hope the following links help you understand why Spark is a good fit for your
use case; a small sketch follows the list.



   - https://www.youtube.com/watch?v=tKkneWcAIqU
   - https://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/
   - http://ampcamp.berkeley.edu/6/exercises/time-series-tutorial-taxis.html
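
A rough Structured Streaming sketch of the kind of windowed check described in
the question (ticks is a hypothetical streaming DataFrame of (timestamp, value);
the thresholds and windows are illustrative):

    import org.apache.spark.sql.functions._

    // Flag 5-minute windows in which the value swung more than 30%.
    val alerts = ticks
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "5 minutes", "1 minute"))
      .agg(max("value").as("hi"), min("value").as("lo"))
      .where(col("hi") > col("lo") * 1.3)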


Regards,
Vaquar khan

On Wed, Aug 30, 2017 at 1:21 PM, Irving Duran <irving.du...@gmail.com>
wrote:

> I think it will work.  Might want to explore spark streams.
>
>
> Thank You,
>
> Irving Duran
>
> On Wed, Aug 30, 2017 at 10:50 AM, <kanth...@gmail.com> wrote:
>
>> I don't see why not
>>
>> Sent from my iPhone
>>
>> > On Aug 24, 2017, at 1:52 PM, Alexandr Porunov <
>> alexandr.poru...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > I am new in Apache Spark. I need to process different time series data
>> (numeric values which depend on time) and react on next actions:
>> > 1. Data is changing up or down too fast.
>> > 2. Data is changing constantly up or down too long.
>> >
>> > For example, if the data have changed 30% up or down in the last five
>> minutes (or less), then I need to send a special event.
>> > If the data have changed 50% up or down in two hours (or less), then I
>> need to send a special event.
>> >
>> > Frequency of data changing is about 1000-3000 per second. And I need to
>> react as soon as possible.
>> >
>> > Does Apache Spark fit well for this scenario or I need to search for
>> another solution?
>> > Sorry for stupid question, but I am a total newbie.
>> >
>> > Regards
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: Spark 2.1.1 Error:java.lang.NoSuchMethodError: org.apache.spark.network.client.TransportClient.getChannel()Lio/netty/channel/Channel;

2017-07-17 Thread vaquar khan
The error below is occurring because of a dependency/version mismatch.

Regards,
vaquar khan

On Jul 17, 2017 3:50 AM, "zzcclp" <441586...@qq.com> wrote:

Hi guys:
  I am using spark 2.1.1 to test on CDH 5.7.1, when i run on yarn with
following command, error 'NoSuchMethodError:
org.apache.spark.network.client.TransportClient.
getChannel()Lio/netty/channel/Channel;'
appears sometimes:

  command:
  *su cloudera-scm -s "/bin/sh" -c "/opt/spark2/bin/spark-shell --master
yarn --deploy-mode client --files
/opt/spark2/conf/log4j_all.properties#log4j.properties --driver-memory 8g
--num-executors 2 --executor-memory 8g --executor-cores 5
--driver-library-path :/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
--driver-class-path /opt/spark2/libs/mysql-connector-java-5.1.36.jar --jars
/opt/spark2/libs/mysql-connector-java-5.1.36.jar " *

  error messages:
  2017-07-17 17:15:25,255 - WARN - org.apache.spark.network.server.TransportChannelHandler.exceptionCaught(TransportChannelHandler.java:78) - rpc-client-1-1 - Exception in connection from /ip:60099
  java.lang.NoSuchMethodError: org.apache.spark.network.client.TransportClient.getChannel()Lio/netty/channel/Channel;
  at org.apache.spark.rpc.netty.NettyRpcHandler.channelActive(NettyRpcEnv.scala:614)
  at org.apache.spark.network.server.TransportRequestHandler.channelActive(TransportRequestHandler.java:87)
  at org.apache.spark.network.server.TransportChannelHandler.channelActive(TransportChannelHandler.java:88)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:219)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:205)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:198)
  at org.spark_project.io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64)
  at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelActive(IdleStateHandler.java:251)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:219)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:205)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:198)
  at org.spark_project.io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:219)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:205)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:198)
  at org.spark_project.io.netty.channel.ChannelInboundHandlerAdapter.channelActive(ChannelInboundHandlerAdapter.java:64)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:219)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:205)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelActive(AbstractChannelHandlerContext.java:198)
  at org.spark_project.io.netty.channel.DefaultChannelPipeline$HeadContext.channelActive(DefaultChannelPipeline.java:1282)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:219)
  at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelActive(AbstractChannelHandlerContext.java:205)
  at org.spark_project.io.netty.channel.DefaultChannelPipeline.fireChannelActive(DefaultChannelPipeline.java:887)
  at org.spark_project.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:262)
  at org.spark_project.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:292)
  at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:640)
  at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
  at org.spark_project.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
  at org.spark_project.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
  at org.spark_project.io.netty.util.concurrent.SingleThreadEventExecutor$2.ru

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread vaquar khan
Hi Kant,

Kafka is a message broker, used by producers and consumers, and Spark Streaming is used
for the real-time processing; Kafka and Spark Streaming work together, they are not
competitors.
Spark Streaming reads data from Kafka and processes it in micro-batches: in simple terms,
it collects data for a short interval, builds an RDD, and then processes these micro-batches.
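
A minimal sketch of that micro-batch flow, assuming Spark 2.x with the
spark-streaming-kafka-0-10 integration; the broker address, topic name and batch interval
are assumptions for illustration:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(10))      // 10-second micro-batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",             // assumed broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-demo",
  "auto.offset.reset" -> "latest")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

// each micro-batch becomes an RDD of records; count words in the message values
stream.map(_.value)
  .flatMap(_.split("\\s+"))
  .map((_, 1L))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()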


Please read doc :
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Spark Streaming is an extension of the core Spark API that enables
scalable, high-throughput, fault-tolerant stream processing of live data
streams. Data can be ingested from many sources like *Kafka, Flume,
Kinesis, or TCP sockets*, and can be processed using complex algorithms
expressed with high-level functions like map, reduce, join and window.
Finally, processed data can be pushed out to filesystems, databases, and
live dashboards. In fact, you can apply Spark’s machine learning
<https://spark.apache.org/docs/latest/ml-guide.html> and graph processing
<https://spark.apache.org/docs/latest/graphx-programming-guide.html> algorithms
on data streams.

Regards,

Vaquar khan

On Sun, Jun 11, 2017 at 3:12 AM, kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> I am trying hard to figure out what is the real difference between Kafka
> Streaming vs Spark Streaming other than saying one can be used as part of
> Micro services (since Kafka streaming is just a library) and the other is a
> Standalone framework by itself.
>
> If I can accomplish same job one way or other this is a sort of a puzzling
> question for me so it would be great to know what Spark streaming can do
> that Kafka Streaming cannot do efficiently or whatever ?
>
> Thanks!
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783
Greater Chicago


Re: Read Data From NFS

2017-06-11 Thread vaquar khan
As per the Spark docs:
The textFile method also takes an optional second argument for controlling
the number of partitions of the file.* By default, Spark creates one
partition for each block of the file (blocks being 128MB by default in
HDFS)*, but you can also ask for a higher number of partitions by passing a
larger value. Note that you cannot have fewer partitions than blocks.


sc.textFile doesn't commence any reading. It simply defines a
driver-resident data structure which can be used for further processing.

It is not until an action is called on an RDD that Spark will build up a
strategy to perform all the required transforms (including the read) and
then return the result.

If there is an action called to run the sequence, and your next
transformation after the read is to map, then Spark will need to read a
small section of lines of the file (according to the partitioning strategy
based on the number of cores) and then immediately start to map it until it
needs to return a result to the driver, or shuffle before the next sequence
of transformations.

If your partitioning strategy (defaultMinPartitions) seems to be swamping
the workers because the java representation of your partition (an InputSplit in
HDFS terms) is bigger than available executor memory, then you need to
specify the number of partitions to read as the second parameter to textFile.
You can calculate the ideal number of partitions by dividing your file size
by your target partition size (allowing for memory growth). A simple check
that the file can be read would be:

sc.textFile(file, numPartitions).count()
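
A minimal sketch of that calculation, assuming the sc from spark-shell and a single large
file on a shared mount; the path and the 128 MB target are assumptions for illustration:

import org.apache.hadoop.fs.{FileSystem, Path}

val targetSplitBytes = 128L * 1024 * 1024                     // aim for ~128 MB per partition
val path = new Path("file:///my/file")                        // hypothetical NFS-mounted file
val fs = FileSystem.get(path.toUri, sc.hadoopConfiguration)
val fileBytes = fs.getFileStatus(path).getLen

val numPartitions = math.max(1, (fileBytes / targetSplitBytes).toInt)
val lines = sc.textFile(path.toString, numPartitions)
println(s"$fileBytes bytes -> ${lines.getNumPartitions} partitions, count = ${lines.count()}")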

You can get a good explanation here:
https://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs



Regards,
Vaquar khan


On Jun 11, 2017 5:28 AM, "ayan guha" <guha.a...@gmail.com> wrote:

> Hi
>
> My question is what happens if I have 1 file of say 100gb. Then how many
> partitions will be there?
>
> Best
> Ayan
> On Sun, 11 Jun 2017 at 9:36 am, vaquar khan <vaquar.k...@gmail.com> wrote:
>
>> Hi Ayan,
>>
>> If you have multiple files (example 12 files )and you are using following
>> code then you will get 12 partition.
>>
>> r = sc.textFile("file://my/file/*")
>>
>> Not sure what you want to know about file system ,please check API doc.
>>
>>
>> Regards,
>> Vaquar khan
>>
>>
>> On Jun 8, 2017 10:44 AM, "ayan guha" <guha.a...@gmail.com> wrote:
>>
>> Any one?
>>
>> On Thu, 8 Jun 2017 at 3:26 pm, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi Guys
>>>
>>> Quick one: How spark deals (ie create partitions) with large files
>>> sitting on NFS, assuming the all executors can see the file exactly same
>>> way.
>>>
>>> ie, when I run
>>>
>>> r = sc.textFile("file://my/file")
>>>
>>> what happens if the file is on NFS?
>>>
>>> is there any difference from
>>>
>>> r = sc.textFile("hdfs://my/file")
>>>
>>> Are the input formats used same in both cases?
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>> --
> Best Regards,
> Ayan Guha
>


Re: Is there a way to do conditional group by in spark 2.1.1?

2017-06-10 Thread vaquar khan
Avoid groupBy and use reduceByKey.
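
A minimal sketch of the difference, with a made-up pair RDD:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every value across the network before you aggregate
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so far less data is shuffled
val viaReduce = pairs.reduceByKey(_ + _)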

Regards,
Vaquar khan

On Jun 4, 2017 8:32 AM, "Guy Cohen" <g...@gettaxi.com> wrote:

> Try this one:
>
> df.groupBy(
>   when(expr("field1='foo'"),"field1").when(expr("field2='bar'"),"field2"))
>
>
> On Sun, Jun 4, 2017 at 3:16 AM, Bryan Jeffrey <bryan.jeff...@gmail.com>
> wrote:
>
>> You should be able to project a new column that is your group column.
>> Then you can group on the projected column.
>>
>> Get Outlook for Android <https://aka.ms/ghei36>
>>
>>
>>
>>
>> On Sat, Jun 3, 2017 at 6:26 PM -0400, "upendra 1991" <
>> upendra1...@yahoo.com.invalid> wrote:
>>
>> Use a function
>>>
>>> Sent from Yahoo Mail on Android
>>> <https://overview.mail.yahoo.com/mobile/?.src=Android>
>>>
>>> On Sat, Jun 3, 2017 at 5:01 PM, kant kodali
>>> <kanth...@gmail.com> wrote:
>>> Hi All,
>>>
>>> Is there a way to do conditional group by in spark 2.1.1? other words, I
>>> want to do something like this
>>>
>>> if (field1 == "foo") {
>>>df.groupBy(field1)
>>> } else if (field2 == "bar")
>>>   df.groupBy(field2)
>>>
>>> Thanks
>>>
>>>
>


Re: Spark Job is stuck at SUBMITTED when set Driver Memory > Executor Memory

2017-06-10 Thread vaquar khan
You can set the memory in your spark-submit command; make sure the memory you request is
actually available on your workers (and that the driver memory fits on the machine that
runs the driver):

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000


https://spark.apache.org/docs/1.1.0/submitting-applications.html

Also try to avoid functions that pull a lot of data into driver memory, like collect().


Regards,
Vaquar khan


On Jun 4, 2017 5:46 AM, "Abdulfattah Safa" <fattah.s...@gmail.com> wrote:

I'm working on Spark with Standalone Cluster mode. I need to increase the
Driver Memory as I got OOM in t he driver thread. If found that when
setting  the Driver Memory to > Executor Memory, the submitted job is stuck
at Submitted in the driver and the application never starts.


Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-10 Thread vaquar khan
Hi,
Please check your firewall/security settings. Sharing one good link:

http://belablotski.blogspot.in/2016/01/access-hive-tables-from-spark-using.html?m=1



Regards,
Vaquar khan

On Jun 8, 2017 1:53 AM, "Patrik Medvedev" <patrik.medve...@gmail.com> wrote:

> Hello guys,
>
> Can somebody help me with my problem?
> Let me know, if you need more details.
>
>
> ср, 7 июн. 2017 г. в 16:43, Patrik Medvedev <patrik.medve...@gmail.com>:
>
>> No, I don't.
>>
>> ср, 7 июн. 2017 г. в 16:42, Jean Georges Perrin <j...@jgp.net>:
>>
>>> Do you have some other security in place like Kerberos or impersonation?
>>> It may affect your access.
>>>
>>>
>>> jg
>>>
>>>
>>> On Jun 7, 2017, at 02:15, Patrik Medvedev <patrik.medve...@gmail.com>
>>> wrote:
>>>
>>> Hello guys,
>>>
>>> I need to execute hive queries on remote hive server from spark, but for
>>> some reasons i receive only column names(without data).
>>> Data available in table, i checked it via HUE and java jdbc connection.
>>>
>>> Here is my code example:
>>> val test = spark.read
>>> .option("url", "jdbc:hive2://remote.hive.
>>> server:1/work_base")
>>> .option("user", "user")
>>> .option("password", "password")
>>> .option("dbtable", "some_table_with_data")
>>> .option("driver", "org.apache.hive.jdbc.HiveDriver")
>>> .format("jdbc")
>>> .load()
>>> test.show()
>>>
>>>
>>> Scala version: 2.11
>>> Spark version: 2.1.0, i also tried 2.1.1
>>> Hive version: CDH 5.7 Hive 1.1.1
>>> Hive JDBC version: 1.1.1
>>>
>>> But this problem available on Hive with later versions, too.
>>> Could you help me with this issue, because i didn't find anything in
>>> mail group answers and StackOverflow.
>>> Or could you help me find correct solution how to query remote hive from
>>> spark?
>>>
>>> --
>>> *Cheers,*
>>> *Patrick*
>>>
>>>


Re: Scala, Python or Java for Spark programming

2017-06-10 Thread vaquar khan
It depends on programming style. I would suggest setting up a few rules to avoid overly
complex code in Scala and, where needed, asking programmers to add proper comments.


Regards,
Vaquar khan

On Jun 8, 2017 4:17 AM, "JB Data" <jbdat...@gmail.com> wrote:

> Java is Object langage borned to Data, Python is Data langage borned to
> Objects or else... Eachone has its owns uses.
>
>
>
> @JBD <http://jbigdata.fr>
>
>
> 2017-06-08 8:44 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:
>
>> A slight advantage of Java is also the tooling that exist around it -
>> better support by build tools and plugins, advanced static code analysis
>> (security, bugs, performance) etc.
>>
>> On 8. Jun 2017, at 08:20, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> What I like about Scala is that it is less ceremonial compared to Java.
>> Java users claim that Scala is built on Java so the error tracking is very
>> difficult. Also Scala sits on top of Java and that makes it virtually
>> depending on Java.
>>
>> For me the advantage of Scala is its simplicity and compactness. I can
>> write a Spark streaming code in Sala pretty fast or import massive RDBMS
>> table into Hive and table of my design equally very fast using Scala.
>>
>> I don't know may be I cannot be bothered writing 100 lines of Java for a
>> simple query from a table :)
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 8 June 2017 at 00:11, Matt Tenenbaum <matt.tenenb...@rockyou.com>
>> wrote:
>>
>>> A lot depends on your context as well. If I'm using Spark _for
>>> analysis_, I frequently use python; it's a starting point, from which I can
>>> then leverage pandas, matplotlib/seaborn, and other powerful tools
>>> available on top of python.
>>>
>>> If the Spark outputs are the ends themselves, rather than the means to
>>> further exploration, Scala still feels like the "first class"
>>> language---most thorough feature set, best debugging support, etc.
>>>
>>> More crudely: if the eventual goal is a dataset, I tend to prefer Scala;
>>> if it's a visualization or some summary values, I tend to prefer Python.
>>>
>>> Of course, I also agree that this is more theological than technical.
>>> Appropriately size your grains of salt.
>>>
>>> Cheers
>>> -mt
>>>
>>> On Wed, Jun 7, 2017 at 12:39 PM, Bryan Jeffrey <bryan.jeff...@gmail.com>
>>> wrote:
>>>
>>>> Mich,
>>>>
>>>> We use Scala for a large project.  On our team we've set a few
>>>> standards to ensure readability (we try to avoid excessive use of tuples,
>>>> use named functions, etc.)  Given these constraints, I find Scala to be
>>>> very readable, and far easier to use than Java.  The Lambda functionality
>>>> of Java provides a lot of similar features, but the amount of typing
>>>> required to set down a small function is excessive at best!
>>>>
>>>> Regards,
>>>>
>>>> Bryan Jeffrey
>>>>
>>>> On Wed, Jun 7, 2017 at 12:51 PM, Jörn Franke <jornfra...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think this is a religious question ;-)
>>>>> Java is often underestimated, because people are not aware of its
>>>>> lambda functionality which makes the code very readable. Scala - it 
>>>>> depends
>>>>> who programs it. People coming with the normal Java background write
>>>>> Java-like code in scala which might not be so good. People from a
>>>>> functional background write it more functional like - i.e. You have a lot
>>>>> of things in one line of code which can be a curse even for other
>>>>> functional programmers, especially if the application is distributed as in
>>>>> the case of Spar

Re: Read Data From NFS

2017-06-10 Thread vaquar khan
Hi Ayan,

If you have multiple files (for example, 12 files) and you use the following code, then
you will get 12 partitions.

r = sc.textFile("file://my/file/*")

Not sure what you want to know about the file system; please check the API doc.


Regards,
Vaquar khan

On Jun 8, 2017 10:44 AM, "ayan guha" <guha.a...@gmail.com> wrote:

Any one?

On Thu, 8 Jun 2017 at 3:26 pm, ayan guha <guha.a...@gmail.com> wrote:

> Hi Guys
>
> Quick one: How spark deals (ie create partitions) with large files sitting
> on NFS, assuming the all executors can see the file exactly same way.
>
> ie, when I run
>
> r = sc.textFile("file://my/file")
>
> what happens if the file is on NFS?
>
> is there any difference from
>
> r = sc.textFile("hdfs://my/file")
>
> Are the input formats used same in both cases?
>
>
> --
> Best Regards,
> Ayan Guha
>
-- 
Best Regards,
Ayan Guha


Re: Spark 2.1 - Infering schema of dataframe after reading json files not during

2017-06-02 Thread vaquar khan
You can add a filter, or replace the nulls with a default value like 0 or an empty string:

df.na.fill(0, Seq("y"))
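
A minimal sketch of the "clean first, infer again" approach, assuming Spark 2.x; the path
and the string replacement are assumptions matching the example in the question:

import spark.implicits._

val raw = spark.read.textFile("/path/to/input.json")              // hypothetical path
val cleaned = raw.map(_.replace("\"a\":\"null\"", "\"a\":{}"))    // crude repair of the bad rows
val reinferred = spark.read.json(cleaned.rdd)                     // schema of "a" inferred again
reinferred.printSchema()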

Regards,
Vaquar khan

On Jun 2, 2017 11:25 AM, "Alonso Isidoro Roman" <alons...@gmail.com> wrote:

not sure if this can help you, but you can infer programmatically the
schema providing a json schema file,

val path: Path = new Path(schema_parquet)
val fileSystem = path.getFileSystem(sc.hadoopConfiguration)

val inputStream: FSDataInputStream = fileSystem.open(path)

val schema_json = Stream.cons(inputStream.readLine(),
Stream.continually(inputStream.readLine))

logger.debug("schema_json looks like " + schema_json.head)

val mySchemaStructType =
DataType.fromJson(schema_json.head).asInstanceOf[StructType]

logger.debug("mySchemaStructType is " + mySchemaStructType)


where schema_parquet can be something like this:

{"type" : "struct","fields" : [ {"name" : "column0","type" :
"string","nullable" : false},{"name":"column1", "type":"string",
"nullable":false},{"name":"column2", "type":"string",
"nullable":true}, {"name":"column3", "type":"string",
"nullable":false}]}



Alonso Isidoro Roman
[image: https://]about.me/alonso.isidoro.roman
<https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>

2017-06-02 16:11 GMT+02:00 Aseem Bansal <asmbans...@gmail.com>:

> When we read files in spark it infers the schema. We have the option to
> not infer the schema. Is there a way to ask spark to infer the schema again
> just like when reading json?
>
> The reason we want to get this done is because we have a problem in our
> data files. We have a json file containing this
>
> {"a": NESTED_JSON_VALUE}
> {"a":"null"}
>
> It should have been empty json but due to a bug it became "null" instead.
> Now, when we read the file the column "a" is considered as a String.
> Instead what we want to do is ask spark to read the file considering "a" as
> a String, filter the "null" out/replace with empty json and then ask spark
> to infer schema of "a" after the fix so we can access the nested json
> properly.
>


Re: Spark SQL, dataframe join questions.

2017-03-29 Thread vaquar khan
Hi,

I found the following two links helpful; sharing them with you.

http://stackoverflow.com/questions/38353524/how-to-ensure-partitioning-induced-by-spark-dataframe-join

http://spark.apache.org/docs/latest/configuration.html


Regards,
Vaquar khan

On Wed, Mar 29, 2017 at 2:45 PM, Vidya Sujeet <sjayatheer...@gmail.com>
wrote:

> In repartition, every element in the partition is moved to a new
> partition..doing a full shuffle compared to shuffles done by reduceBy
> clauses. With this in mind, repartition would increase your query
> performance. ReduceBy key will also shuffle based on the aggregation.
>
> The best way to design is to check the query plan of your data frame join
> query and do RDD joins accordingly, if needed.
>
>
> On Wed, Mar 29, 2017 at 10:55 AM, Yong Zhang <java8...@hotmail.com> wrote:
>
>> You don't need to repartition your data just for join purpose. But if the
>> either parties of join is already partitioned, Spark will use this
>> advantage as part of join optimization.
>>
>> Should you reduceByKey before the join really depend on your join logic.
>> ReduceByKey will shuffle, and following join COULD cause another shuffle.
>> So I am not sure if it is a smart way.
>>
>> Yong
>>
>> --
>> *From:* shyla deshpande <deshpandesh...@gmail.com>
>> *Sent:* Wednesday, March 29, 2017 12:33 PM
>> *To:* user
>> *Subject:* Re: Spark SQL, dataframe join questions.
>>
>>
>>
>> On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Following are my questions. Thank you.
>>>
>>> 1. When joining dataframes is it a good idea to repartition on the key 
>>> column that is used in the join or
>>> the optimizer is too smart so forget it.
>>>
>>> 2. In RDD join, wherever possible we do reduceByKey before the join to 
>>> avoid a big shuffle of data. Do we need
>>> to do anything similar with dataframe joins, or the optimizer is too smart 
>>> so forget it.
>>>
>>>
>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-10 Thread vaquar khan
Please read the Spark documentation at least once before asking questions.

http://spark.apache.org/docs/latest/streaming-programming-guide.html

http://2s7gjr373w3x22jf92z99mgm5w-wpengine.netdna-ssl.com/wp-content/uploads/2015/11/spark-streaming-datanami.png


Regards,
Vaquar khan


On Fri, Mar 10, 2017 at 6:17 AM, Sean Owen <so...@cloudera.com> wrote:

> Kafka and Spark Streaming don't do the same thing. Kafka stores and
> transports data, Spark Streaming runs computations on a stream of data.
> Neither is itself a streaming platform in its entirety.
>
> It's kind of like asking whether you should build a website using just
> MySQL, or nginx.
>
>
>> On 9 Mar 2017, at 20:37, Gaurav1809 <gauravhpan...@gmail.com> wrote:
>>
>> Hi All, Would you please let me know which streaming platform is best. Be
>> it
>> server log processing, social media feeds ot any such streaming data. I
>> want
>> to know the comparison between Kafka & Spark Streaming.
>>
>>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Serialization error - sql UDF related

2017-02-17 Thread vaquar khan
Hi Darshan ,


When you get an org.apache.spark.SparkException: Task not serializable exception, it means
that you are using a reference to an instance of a non-serializable class inside a
transformation.

Hope following link will help.

https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html
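
A minimal sketch of one common fix, using a hypothetical column-name cleanup like the one
in the question: keep the function in a small serializable object (or a local function) so
the UDF's closure does not drag in a non-serializable enclosing class such as the
application object:

import org.apache.spark.sql.functions.udf

object ColumnNameUtil extends Serializable {
  def getNewColumnName(oldColName: String, appendID: Boolean): String =
    oldColName.replaceAll("\\s", "_").replaceAll("%", "_pct").replaceAllLiterally("#", "No")
}

// the closure now only references ColumnNameUtil, which serializes cleanly
val correctColNameUDF = udf((name: String) => ColumnNameUtil.getNewColumnName(name, false))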


Regards,
Vaquar khan

On Fri, Feb 17, 2017 at 9:36 PM, Darshan Pandya <darshanpan...@gmail.com>
wrote:

> Hello,
>
> I am getting the famous serialization exception on running some code as
> below,
>
> val correctColNameUDF = udf(getNewColumnName(_: String, false: Boolean): 
> String);
> val charReference: DataFrame = thinLong.select("char_name_id", 
> "char_name").withColumn("columnNameInDimTable", 
> correctColNameUDF(col("char_name"))).withColumn("applicable_dimension", 
> lit(dimension).cast(StringType)).distinct();
> val charReferenceTableName: String = s"""$TargetSFFDBName.fg_char_reference"""
> val tableName: String = charReferenceTableName.toString
> charReference.saveAsTable(tableName, saveMode)
>
> I think it has something to do with the UDF, so I am pasting the UDF
> function as well
>
> def getNewColumnName(oldColName: String, appendID: Boolean): String = {
>   var newColName = oldColName.replaceAll("\\s", "_").replaceAll("%", 
> "_pct").replaceAllLiterally("#", "No")
>   return newColName;
> }
>
>
> *​Exception *seen ​is
>
> Caused by: org.apache.spark.SparkException: Task not serializable
> at org.apache.spark.util.ClosureCleaner$.ensureSerializable(
> ClosureCleaner.scala:304)
> at org.apache.spark.util.ClosureCleaner$.org$apache$
> spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2066)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:150)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$
> doExecute$1.apply(TungstenAggregate.scala:86)
> at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$
> doExecute$1.apply(TungstenAggregate.scala:80)
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(
> package.scala:48)
> ... 73 more
> Caused by: java.io.NotSerializableException: com.nielsen.datamodel.
> converters.cip2sff.ProductDimensionSFFConverterRealApp$
> Serialization stack:
> - object not serializable (class: com.nielsen.datamodel.
> converters.cip2sff.ProductDimensionSFFConverterRealApp$, value:
> com.nielsen.datamodel.converters.cip2sff.ProductDimensionSFFConverterRe
> alApp$@247a8411)
> - field (class: com.nielsen.datamodel.converters.cip2sff.
> CommonTransformationTraits$$anonfun$1, name: $outer, type: interface
> com.nielsen.datamodel.converters.cip2sff.CommonTransformationTraits)
> - object (class com.nielsen.datamodel.converters.cip2sff.
> CommonTransformationTraits$$anonfun$1, )
> - field (class: org.apache.spark.sql.catalyst.
> expressions.ScalaUDF$$anonfun$2, name: func$2, type: interface
> scala.Function1)
> - object (class org.apache.spark.sql.catalyst.
> expressions.ScalaUDF$$anonfun$2, )
> - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF, name:
> f, type: interface scala.Function1)
> - object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF,
> UDF(char_name#3))
> - field (class: org.apache.spark.sql.catalyst.expressions.Alias, name:
> child, type: class org.apache.spark.sql.catalyst.expressions.Expression)
> - object (class org.apache.spark.sql.catalyst.expressions.Alias,
> UDF(char_name#3) AS columnNameInDimTable#304)
> - element of array (index: 2)
> - array (class [Ljava.lang.Object;, size 4)
> - field (class: scala.collection.mutable.ArrayBuffer, name: array, type:
> class [Ljava.lang.Object;)
> - object (class scala.collection.mutable.ArrayBuffer,
> ArrayBuffer(char_name_id#2, char_name#3, UDF(char_name#3) AS
> columnNameInDimTable#304, PRODUCT AS applicable_dimension#305))
> - field (class: org.apache.spark.sql.execution.Project, name:
> projectList, type: interface scala.collection.Seq)
> - object (class org.apache.spark.sql.execution.Project, Project
> [char_name_id#2,char_name#3,UDF(char_name#3) AS 
> columnNameInDimTable#304,PRODUCT
> AS applicable_dimension#305]
>
>
>
> --
> Sincerely,
> Darshan
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread vaquar khan
Did you try MSCK REPAIR TABLE?
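
A minimal sketch, with a hypothetical table name (assuming a HiveContext / Hive-enabled
session so the statement is passed through to Hive):

sqlContext.sql("MSCK REPAIR TABLE mydb.my_partitioned_table")
sqlContext.refreshTable("mydb.my_partitioned_table")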

Regards,
Vaquar Khan

On Feb 6, 2017 11:21 AM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com>
wrote:

> I dont think so, i was able to insert overwrite other created tables in
> hive using spark sql. The only problem  I am facing is, spark is not able
> to recognize hive view name. Very strange but not sure where I am doing
> wrong in this.
>
> On Mon, Feb 6, 2017 at 11:03 AM, Jon Gregg <coble...@gmail.com> wrote:
>
>> Confirming that Spark can read newly created views - I just created a
>> test view in HDFS and I was able to query it in Spark 1.5 immediately after
>> without a refresh.  Possibly an issue with your Spark-Hive connection?
>>
>> Jon
>>
>> On Sun, Feb 5, 2017 at 9:31 PM, KhajaAsmath Mohammed <
>> mdkhajaasm...@gmail.com> wrote:
>>
>>> Hi Khan,
>>>
>>> It didn't work in my case. used below code. View is already present in
>>> Hive but I cant read that in spark sql. Throwing exception that table not
>>> found
>>>
>>> sqlCtx.refreshTable("schema.hive_view")
>>>
>>>
>>> Thanks,
>>>
>>> Asmath
>>>
>>>
>>> On Sun, Feb 5, 2017 at 7:56 PM, vaquar khan <vaquar.k...@gmail.com>
>>> wrote:
>>>
>>>> Hi Ashmath,
>>>>
>>>> Try  refresh table
>>>>
>>>> // spark is an existing SparkSession
>>>> spark.catalog.refreshTable("my_table")
>>>>
>>>>
>>>> http://spark.apache.org/docs/latest/sql-programming-guide.ht
>>>> ml#metadata-refreshing
>>>>
>>>>
>>>>
>>>> Regards,
>>>> Vaquar khan
>>>>
>>>>
>>>>
>>>> On Sun, Feb 5, 2017 at 7:19 PM, KhajaAsmath Mohammed <
>>>> mdkhajaasm...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a hive view which is basically set of select statements on some
>>>>> tables. I want to read the hive view and use hive builtin functions
>>>>> available in spark sql.
>>>>>
>>>>> I am not able to read that hive view in spark sql but can retreive
>>>>> data in hive shell.
>>>>>
>>>>> can't spark access hive views?
>>>>>
>>>>> Thanks,
>>>>> Asmath
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Vaquar Khan
>>>> +1 -224-436-0783 <(224)%20436-0783>
>>>>
>>>> IT Architect / Lead Consultant
>>>> Greater Chicago
>>>>
>>>
>>>
>>
>


Re: Cannot read Hive Views in Spark SQL

2017-02-05 Thread vaquar khan
Hi Asmath,

Try  refresh table

// spark is an existing SparkSession
spark.catalog.refreshTable("my_table")


http://spark.apache.org/docs/latest/sql-programming-guide.html#metadata-refreshing



Regards,
Vaquar khan



On Sun, Feb 5, 2017 at 7:19 PM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I have a hive view which is basically set of select statements on some
> tables. I want to read the hive view and use hive builtin functions
> available in spark sql.
>
> I am not able to read that hive view in spark sql but can retreive data in
> hive shell.
>
> can't spark access hive views?
>
> Thanks,
> Asmath
>



-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Time-Series Analysis with Spark

2017-01-11 Thread vaquar khan
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/timeseries/index.html

Regards,
Vaquar khan

On Wed, Jan 11, 2017 at 10:07 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

> Hello Rishabh,
> We have done some forecasting, for time-series, using ARIMA in our
> project, it's on top of Spark and it's open source
> https://github.com/eleflow/uberdata
>
> Kind Regards,
> Dirceu
>
> 2017-01-11 8:20 GMT-02:00 Sean Owen <so...@cloudera.com>:
>
>> https://github.com/sryza/spark-timeseries ?
>>
>> On Wed, Jan 11, 2017 at 10:11 AM Rishabh Bhardwaj <rbnex...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I am exploring time-series forecasting with Spark.
>>> I have some questions regarding this:
>>>
>>> 1. Is there any library/package out there in community of *Seasonal
>>> ARIMA* implementation in Spark?
>>>
>>> 2. Is there any implementation of Dynamic Linear Model (*DLM*) on Spark?
>>>
>>> 3. What are the libraries which can be leveraged for time-series
>>> analysis with Spark (like spark-ts)?
>>>
>>> 4. Is there any roadmap of having time-series algorithms like
>>> AR,MA,ARIMA in Spark codebase itself?
>>>
>>> Thank You,
>>> Rishabh.
>>>
>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread vaquar khan
Hi Deepak,

Could you share the index information from your database:

select * from indexInfo;
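
Independent of the index question, a common pattern for the latency described above is to
open the DAO/connection once per partition instead of once per row, and to batch the
writes; a rough sketch with a hypothetical DAO:

df.foreachPartition { rows =>
  val dao = new CassandraDao()              // hypothetical DAO, opened once per partition
  try {
    rows.grouped(500).foreach(batch => dao.insertAll(batch.toSeq))
  } finally {
    dao.close()
  }
}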


Regards,
Vaquar khan

On Sat, Dec 17, 2016 at 2:45 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> How many workers are in the cluster?
>
> On Sat, Dec 17, 2016 at 12:23 PM Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Hi All,
>> I am iterating over data frame's paritions using df.foreachPartition .
>> Upon each iteration of row , i am initializing DAO to insert the row into
>> cassandra.
>> Each of these iteration takes almost 1 and half minute to finish.
>> In my workflow , this is part of an action and 100 partitions are being
>> created for the df as i can see 100 tasks being created , where the insert
>> dao operation is being performed.
>> Since each of these 100 tasks , takes around 1 and half minute to
>> complete , it takes around 2 hour for this small insert operation.
>> Is anyone facing the same scenario and is there any time efficient way to
>> handle this?
>> This latency is not good in out use case.
>> Any pointer to improve/minimise the latency will be really appreciated.
>>
>>
>> --
>> Thanks
>> Deepak
>>
>>
>>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Do we really need mesos or yarn? or is standalone sufficent?

2016-12-16 Thread vaquar khan
Hi Kant,

Hope the following information will help.

1)Cluster
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-standalone.html
http://spark.apache.org/docs/latest/hardware-provisioning.html


2) Yarn vs Mesos
https://www.linkedin.com/pulse/mesos-compare-yarn-vaquar-khan-?articleId=7126978319066345849

YARN is Hadoop-centric and Mesos is a more general-purpose cluster manager, so using them
together can be even more beneficial; in short, YARN on Mesos (Myriad).

  http://mesosphere.github.io/presentations/myriad-strata/#/


Regards,
Vaquar khan

On 16 Dec 2016 18:46, "kant kodali" <kanth...@gmail.com> wrote:

Hi Saif,

What do you mean by small cluster? Any specific size?

Also can you shine some light on how YARN takes a win over mesos?




Thanks,
kant

On Fri, Dec 16, 2016 at 10:45 AM, <saif.a.ell...@wellsfargo.com> wrote:

> In my experience, Standalone works very well in small cluster where there
> isn’t anything else running.
>
>
>
> Bigger cluster or shared resources, YARN takes a win, surpassing the
> overhead of spawning containers as opposed to a background running worker.
>
>
>
> Best is if you try both, if standalone is good enough keep it till you
> need more. Otherwise, try YARN or MESOS depending on the rest of your
> components.
>
>
>
> 2cents
>
>
>
> Saif
>
>
>
> *From:* kant kodali [mailto:kanth...@gmail.com]
> *Sent:* Friday, December 16, 2016 3:14 AM
> *To:* user @spark
> *Subject:* Do we really need mesos or yarn? or is standalone sufficent?
>
>
>
> Do we really need mesos or yarn? or is standalone sufficient for
> production systems? I understand the difference but I don't know the
> capabilities of standalone cluster. does anyone have experience deploying
> standalone in the production?
>
>
>
>
>


Re: Issue: Skew on Dataframes while Joining the dataset

2016-12-16 Thread vaquar khan
For that kind of issue, the Spark UI and DAG visualization are always helpful.


https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

Regards,
Vaquar khan

On Fri, Dec 16, 2016 at 11:10 AM, Vikas K. <vikas.re...@gmail.com> wrote:

> Unsubscribe.
>
> On Fri, Dec 16, 2016 at 9:21 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I am facing an issue with join operation on dataframe. My job is running
>> for very long time( > 2 hrs ) without any result. can someone help me on
>> how to resolve.
>>
>> I tried re-partition with 13 but no luck.
>>
>>
>> val results_dataframe = sqlContext.sql("select gt.*,ct.* from 
>> PredictTempTable pt,ClusterTempTable ct,GamificationTempTable gt where 
>> gt.vin=pt.vin and pt.cluster=ct.cluster")
>> //val results_dataframe_partitioned=results_dataframe.coalesce(numPartitions)
>> val results_dataframe_partitioned=results_dataframe.repartition(13)
>>
>> [image: Inline image 1]
>>
>> Thanks,
>> Asmath
>>
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: coalesce ending up very unbalanced - but why?

2016-12-16 Thread vaquar khan
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html
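
A minimal sketch of the trade-off, written in Scala for illustration (72 is just the count
from the thread):

// coalesce only merges existing partitions (no shuffle), so it inherits any input skew
val merged = df.coalesce(72)

// repartition does a full shuffle and spreads rows roughly evenly across the 72 outputs
val rebalanced = df.repartition(72)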

Regards,
vaquar khan

On Wed, Dec 14, 2016 at 12:15 PM, Vaibhav Sinha <mail.vsi...@gmail.com>
wrote:

> Hi,
> I see a similar behaviour in an exactly similar scenario at my deployment
> as well. I am using scala, so the behaviour is not limited to pyspark.
> In my observation 9 out of 10 partitions (as in my case) are of similar
> size ~38 GB each and final one is significantly larger ~59 GB.
> Prime number of partitions is an interesting approach I will try that out.
>
> Best,
> Vaibhav.
>
> On 14 Dec 2016, 10:18 PM +0530, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com>, wrote:
>
> Hello,
> We have done some test in here, and it seems that when we use prime number
> of partitions the data is more spread.
> This has to be with the hashpartitioning and the Java Hash algorithm.
> I don't know how your data is and how is this in python, but if you (can)
> implement a partitioner, or change it from default, you will get a better
> result.
>
> Dirceu
>
> 2016-12-14 12:41 GMT-02:00 Adrian Bridgett <adr...@opensignal.com>:
>
>> Since it's pyspark it's just using the default hash partitioning I
>> believe.  Trying a prime number (71 so that there's enough CPUs) doesn't
>> seem to change anything.  Out of curiousity why did you suggest that?
>> Googling "spark coalesce prime" doesn't give me any clue :-)
>> Adrian
>>
>>
>> On 14/12/2016 13:58, Dirceu Semighini Filho wrote:
>>
>> Hi Adrian,
>> Which kind of partitioning are you using?
>> Have you already tried to coalesce it to a prime number?
>>
>>
>> 2016-12-14 11:56 GMT-02:00 Adrian Bridgett <adr...@opensignal.com>:
>>
>>> I realise that coalesce() isn't guaranteed to be balanced and adding a
>>> repartition() does indeed fix this (at the cost of a large shuffle.
>>>
>>> I'm trying to understand _why_ it's so uneven (hopefully it helps
>>> someone else too).   This is using spark v2.0.2 (pyspark).
>>>
>>> Essentially we're just reading CSVs into a DataFrame (which we persist
>>> serialised for some calculations), then writing it back out as PRQ.  To
>>> avoid too many PRQ files I've set a coalesce of 72 (9 boxes, 8 CPUs each).
>>>
>>> The writers end up with about 700-900MB each (not bad).  Except for one
>>> which is at 6GB before I killed it.
>>>
>>> Input data is 12000 gzipped CSV files in S3 (approx 30GB), named like
>>> this, almost all about 2MB each:
>>> s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587209
>>> -i-da71c942-389.gz
>>> s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587529
>>> -i-01d3dab021b760d29-334.gz
>>>
>>> (we're aware that this isn't an ideal naming convention from an S3
>>> performance PoV).
>>>
>>> The actual CSV file format is:
>>> UUID\tINT\tINT\... . (wide rows - about 300 columns)
>>>
>>> e.g.:
>>> 17f9c2a7-ddf6-42d3-bada-63b845cb33a51481587198750   11213
>>> 1d723493-5341-450d-a506-5c96ce0697f01481587198751   11212 ...
>>> 64cec96f-732c-44b8-a02e-098d5b63ad771481587198752   11211 ...
>>>
>>> The dataframe seems to be stored evenly on all the nodes (according to
>>> the storage tab) and all the blocks are the same size.   Most of the tasks
>>> are executed at NODE_LOCAL locality (although there are a few ANY).  The
>>> oversized task is NODE_LOCAL though.
>>>
>>> The reading and calculations all seem evenly spread, confused why the
>>> writes aren't as I'd expect the input partitions to be even, what's causing
>>> and what we can do?  Maybe it's possible for coalesce() to be a bit smarter
>>> in terms of which partitions it coalesces - balancing the size of the final
>>> partitions rather than the number of source partitions in each final
>>> partition.
>>>
>>> Thanks for any light you can shine!
>>>
>>> Adrian
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> *Adrian Bridgett* |  Sysadmin Engineer, OpenSignal
>> <http://www.opensignal.com>
>> _
>> Office: 3rd Floor, The Angel Office, 2 Angel Square, London, EC1V 1NY
>> Phone #: +44 777-377-8251 <+44%20777-377-8251>
>> Skype: abridgett  |  @adrianbridgett <http://twitter.com/adrianbridgett>
>>   |  LinkedIn link  <https://uk.linkedin.com/in/abridgett>
>> _
>>
>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: How to get recent value in spark dataframe

2016-12-16 Thread vaquar khan
Not sure about your flag 0/1 logic, but you can order the data by date and take the most
recent value.
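
A minimal sketch of that idea with a window function, assuming Spark 2.x and the
(id, flag, price, date) columns from the question; treat it as a starting point rather
than the exact business logic:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// for each row, the most recent flag = 1 price seen so far (ordered by date) within the id
val w = Window.partitionBy("id").orderBy("date").rowsBetween(Long.MinValue, 0)

val result = df.withColumn(
  "new_column",
  when(col("flag") === 0,
    last(when(col("flag") === 1, col("price")), ignoreNulls = true).over(w)))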

Regards,
Vaquar khan

On Wed, Dec 14, 2016 at 10:49 PM, Milin korath <milin.kor...@impelsys.com>
wrote:

> Hi
>
> I have a spark data frame with following structure
>
>   id  flag  price  date
>   a   0     100    2015
>   a   0     50     2015
>   a   1     200    2014
>   a   1     300    2013
>   a   0     400    2012
>
> I need to create a data frame with recent value of flag 1 and updated in
> the flag 0 rows.
>
>   id  flag  price  date  new_column
>   a   0     100    2015  200
>   a   0     50     2015  200
>   a   1     200    2014  null
>   a   1     300    2013  null
>   a   0     400    2012  null
>
> We have 2 rows having flag=0. Consider the first row(flag=0),I will have 2
> values(200 and 300) and I am taking the recent one 200(2014). And the last
> row I don't have any recent value for flag 1 so it is updated with null.
>
> Looking for a solution using scala. Any help would be appreciated.Thanks
>
> Thanks
> Milin
>



-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Query in SparkSQL

2016-12-12 Thread vaquar khan
Hi Neeraj,

As per my understanding, Spark SQL doesn't support UPDATE statements.
Why do you need the UPDATE command in Spark SQL? You can run that command in Hive.

Regards,
Vaquar khan

On Mon, Dec 12, 2016 at 10:21 PM, Niraj Kumar <nku...@incedoinc.com> wrote:

> Hi
>
>
>
> I am working on SpqrkSQL using hiveContext (version 1.6.2).
>
> Can I run following queries directly in sparkSQL, if yes how
>
>
>
> update calls set sample = 'Y' where accnt_call_id in (select accnt_call_id
> from samples);
>
>
>
> insert into details (accnt_call_id, prdct_cd, prdct_id, dtl_pstn) select
> accnt_call_id, prdct_cd, prdct_id, 32 from samples where PRDCT_CD = 2114515;
>
>
>
>
>
>
>
> Thanks and Regards,
>
> *Niraj Kumar*
>
>
>
> Disclaimer :
> This email communication may contain privileged and confidential
> information and is intended for the use of the addressee only. If you are
> not an intended recipient you are requested not to reproduce, copy
> disseminate or in any manner distribute this email communication as the
> same is strictly prohibited. If you have received this email in error,
> please notify the sender immediately by return e-mail and delete the
> communication sent in error. Email communications cannot be guaranteed to
> be secure & error free and Incedo Inc. is not liable for any errors in
> the email communication or for the proper, timely and complete transmission
> thereof.
>



-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


Re: Best practises around spark-scala

2016-08-08 Thread vaquar khan
I found the following links good, as I am using the same myself.

http://spark.apache.org/docs/latest/tuning.html

https://spark-summit.org/2014/testing-spark-best-practices/

Regards,
Vaquar khan

On 8 Aug 2016 10:11, "Deepak Sharma" <deepakmc...@gmail.com> wrote:

> Hi All,
> Can anyone please give any documents that may be there around spark-scala
> best practises?
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Spark Getting data from MongoDB in JAVA

2016-06-12 Thread vaquar khan
Hi Asfandyar,

*NoSuchMethodError* in Java means you compiled against one version of code and executed
against a different version.

Please make sure the versions of the dependencies you add are consistent with each other
and with the Spark/Scala version you are running against.

regards,
vaquar khan

On Fri, Jun 10, 2016 at 4:50 AM, Asfandyar Ashraf Malik <
asfand...@kreditech.com> wrote:

> Hi,
> I did not notice that I put it twice.
> I changed that and ran my program but it still gives the same error:
>
> java.lang.NoSuchMethodError:
> org.apache.spark.sql.catalyst.ScalaReflection$.typeOfObject()Lscala/PartialFunction;
>
>
> Cheers
>
>
>
> 2016-06-10 11:47 GMT+02:00 Alonso Isidoro Roman <alons...@gmail.com>:
>
>> why *spark-mongodb_2.11 dependency is written twice in pom.xml?*
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>>
>> 2016-06-10 11:39 GMT+02:00 Asfandyar Ashraf Malik <
>> asfand...@kreditech.com>:
>>
>>> Hi,
>>> I am using Stratio library to get MongoDB to work with Spark but I get
>>> the following error:
>>>
>>> java.lang.NoSuchMethodError:
>>> org.apache.spark.sql.catalyst.ScalaReflection
>>>
>>> This is my code.
>>>
>>> ---
>>> public static void main(String[] args) {
>>>     JavaSparkContext sc = new JavaSparkContext("local[*]", "test spark-mongodb java");
>>>     SQLContext sqlContext = new SQLContext(sc);
>>>
>>>     Map<String, String> options = new HashMap<>();
>>>     options.put("host", "xyz.mongolab.com:59107");
>>>     options.put("database", "heroku_app3525385");
>>>     options.put("collection", "datalog");
>>>     options.put("credentials", "*,,");
>>>
>>>     DataFrame df = sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();
>>>     df.registerTempTable("datalog");
>>>     df.show();
>>> }
>>>
>>> ---
>>> My pom file is as follows:
>>>
>>>  **
>>> **
>>> *org.apache.spark*
>>> *spark-core_2.11*
>>> *${spark.version}*
>>> **
>>> **
>>> *    org.apache.spark*
>>> *spark-catalyst_2.11 *
>>> *${spark.version}*
>>> **
>>> **
>>> *org.apache.spark*
>>> *spark-sql_2.11*
>>> *${spark.version}*
>>> * *
>>> **
>>> *com.stratio.datasource*
>>> *spark-mongodb_2.11*
>>> *0.10.3*
>>> **
>>> **
>>> *com.stratio.datasource*
>>> *spark-mongodb_2.11*
>>> *0.10.3*
>>> *jar*
>>> **
>>> **
>>>
>>>
>>> Regards
>>>
>>
>>
>


-- 
Regards,
Vaquar Khan
+91 830-851-1500


Re: Questions about Spark Worker

2016-06-12 Thread vaquar khan
Agreed with Mich

The spark driver is the program that declares the transformations and
actions on RDDs of data and submits such requests to the master.

*spark.driver.host :* Hostname or IP address for the driver to listen on.
This is used for communicating with the executors and the standalone Master.


hostname -f returns the FQDN; add this to the /etc/hosts file
(http://www.tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap9sec95.html).
You can also force the address a node binds to by exporting SPARK_LOCAL_IP=<ethernet-ip>
in conf/spark-env.sh on each node.

*Doc:*
http://spark.apache.org/docs/latest/configuration.html

regards,
Vaquar Khan


On Sun, Jun 12, 2016 at 8:27 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> You basically want to use wired/Ethernet connections as opposed to
> wireless?
>
> in Your Spark Web UI under environment table what do you get for "
> spark.driver.host".
>
> Also can you cat /etc/hosts and send the output please  and  the output
> from ifconfig -a
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 12 June 2016 at 14:12, East Evil <super_big_h...@sina.com> wrote:
>
>> Hi, guys
>>
>> My question is about Spark Worker IP address.
>>
>> I have four nodes, four nodes have Wireless module and Ethernet module,
>> so all nodes have two IP addresses.
>>
>> When I vist the webUI, information is always displayed in the  Wireless
>> IP address but my Spark computing cluster based on Ethernet.
>>
>> I have tried “ifconfig wlan0 down” and then “start-all.sh”, the Worker IP
>> address become 127.0.0.1, and then I tried “ifconfig l0 down” and the
>> Worker IP address become 127.0.1.1.
>>
>> What should I do to make IP use the IP address of the Ethernet instead of
>> the address of the wireless?
>>
>> Thanks
>>
>> Jay
>>
>>
>>
>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>> Windows 10
>>
>>
>>
>
>


-- 
Regards,
Vaquar Khan
+91 830-851-1500


Re: OutOfMemoryError - When saving Word2Vec

2016-06-12 Thread vaquar khan
Hi Sharad.

The array size that you (or the serializer) are trying to allocate is just too big
for the JVM.

You can also split your input further by increasing parallelism.

Following is good explanintion

https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit

regards,
Vaquar khan

On Sun, Jun 12, 2016 at 5:08 AM, sharad82 <khandelwal.gem...@gmail.com>
wrote:

> When trying to save the word2vec model trained over 10G of data leads to
> below OOM error.
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> Spark Version: 1.6
> spark.dynamicAllocation.enable  false
> spark.executor.memory   75g
> spark.driver.memory 150g
> spark.driver.cores  10
>
> Full Stack Trace:
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at
>
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at
>
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> at
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuilder.append(StringBuilder.java:136)
> at java.lang.StringBuilder.append(StringBuilder.java:131)
> at
> scala.StringContext.standardInterpolator(StringContext.scala:122)
> at scala.StringContext.s(StringContext.scala:90)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
> at
>
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> at
>
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
> at
>
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at
>
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at
>
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
> at
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
> at
>
> org.apache.spark.ml.feature.Word2VecModel$Word2VecModelWriter.saveImpl(Word2Vec.scala:271)
> at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:91)
> at
> org.apache.spark.ml.util.MLWritable$class.save(ReadWrite.scala:131)
> at
> org.apache.spark.ml.feature.Word2VecModel.save(Word2Vec.scala:172)
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-When-saving-Word2Vec-tp27142.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Regards,
Vaquar Khan
+91 830-851-1500


Re: oozie and spark on yarn

2016-06-08 Thread vaquar khan
Hi Karthi,

Hope following information will help you.

Doc:
https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html

Example :
https://developer.ibm.com/hadoop/2015/11/05/run-spark-job-yarn-oozie/

Code :

https://github.com/apache/oozie/blob/master/client/src/main/resources/spark-action-0.1.xsd

regards,

Vaquar Khan



On Wed, Jun 8, 2016 at 5:26 AM, karthi keyan <karthi93.san...@gmail.com>
wrote:

> Hi ,
>
> Make sure you have oozie 4.2.0 and configured with either yarn / mesos
> mode.
>
> Well, you just parse your scala / Jar file in the below syntax,
>
> <spark xmlns="uri:oozie:spark-action:0.1">
>     <job-tracker>${jobTracker}</job-tracker>
>     <name-node>${nameNode}</name-node>
>     <master>${master}</master>
>     <name>Wordcount</name>
>     <class>${Classname}</class>
>     <jar>${nameNode}/WordCount.jar</jar>
> </spark>
> above sample will have the required file in HDFS and made change according
> to your use case.
>
> Regards,
> Karthik
>
>
> On Wed, Jun 8, 2016 at 1:10 PM, pseudo oduesp <pseudo20...@gmail.com>
> wrote:
>
>> hi ,
>>
>> i want ask if somone used oozie with spark ?
>>
>> if you can give me example:
>> how ? we can configure  on yarn
>> thanks
>>
>
>


-- 
Regards,
Vaquar Khan
+91 830-851-1500


Re: Spark_Usecase

2016-06-07 Thread vaquar khan
Deepak, Spark does support incremental loads if users want to schedule their batch jobs
frequently and load their data from databases incrementally.

You will not get good performance updating Spark SQL tables backed by files. Instead, you
can use message queues with Spark Streaming, or do an incremental select, to make sure
your Spark SQL tables stay up to date with your production databases.
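
A rough sketch of such an incremental select, with hypothetical connection details, table
and watermark column; the "last loaded" value would come from wherever you checkpoint it
(a file, a control table, etc.):

val lastLoadedAt = "2016-06-01 00:00:00"            // read from your own checkpoint store

val increment = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:mysql://dbhost:3306/sales",
  "driver"   -> "com.mysql.jdbc.Driver",
  "dbtable"  -> s"(select * from orders where updated_at > '$lastLoadedAt') as incr",
  "user"     -> "etl",
  "password" -> "secret")).load()

increment.write.mode("append").saveAsTable("warehouse.orders")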

Regards,
Vaquar khan
On 7 Jun 2016 10:29, "Deepak Sharma" <deepakmc...@gmail.com> wrote:

I am not sure if Spark provides any support for incremental extracts
inherently.
But you can maintain a file e.g. extractRange.conf in hdfs , to read from
it the end range and update it with new end range from  spark job before it
finishes with the new relevant ranges to be used next time.

On Tue, Jun 7, 2016 at 8:49 PM, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Mich, thanks for your inputs. I used sqoop to get the data from MySQL.
> Now I am using spark to do the same. Right now, I am trying
> to implement incremental updates while loading from MySQL through spark.
> Can you suggest any best practices for this ? Thank you.
>
>
> On Tuesday, June 7, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> I use Spark rather that Sqoop to import data from an Oracle table into a
>> Hive ORC table.
>>
>> It used JDBC for this purpose. All inclusive in Scala itself.
>>
>> Also Hive runs on Spark engine. Order of magnitude faster with Inde on
>> map-reduce/.
>>
>> pretty simple.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 7 June 2016 at 15:38, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> bq. load the data from edge node to hdfs
>>>
>>> Does the loading involve accessing sqlserver ?
>>>
>>> Please take a look at
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html
>>>
>>> On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni <mmistr...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>> how about
>>>>
>>>> 1.  have a process that read the data from your sqlserver and dumps it
>>>> as a file into a directory on your hd
>>>> 2. use spark-streanming to read data from that directory  and store it
>>>> into hdfs
>>>>
>>>> perhaps there is some sort of spark 'connectors' that allows you to
>>>> read data from a db directly so you dont need to go via spk streaming?
>>>>
>>>>
>>>> hth
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander <itsche...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Spark users,
>>>>>
>>>>> Right now we are using spark for everything(loading the data from
>>>>> sqlserver, apply transformations, save it as permanent tables in
>>>>> hive) in our environment. Everything is being done in one spark 
>>>>> application.
>>>>>
>>>>> The only thing we do before we launch our spark application through
>>>>> oozie is, to load the data from edge node to hdfs(it is being triggered
>>>>> through a ssh action from oozie to run shell script on edge node).
>>>>>
>>>>> My question is,  there's any way we can accomplish edge-to-hdfs copy
>>>>> through spark ? So that everything is done in one spark DAG and lineage
>>>>> graph?
>>>>>
>>>>> Any pointers are highly appreciated. Thanks
>>>>>
>>>>> Regards,
>>>>> Aj
>>>>>
>>>>
>>>>
>>>
>>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Spark Interview Questions

2015-07-29 Thread vaquar khan
Hi Abhishek,

Please learn Spark; there are no shortcuts to success.

Regards,
Vaquar khan
On 29 Jul 2015 11:32, Mishra, Abhishek abhishek.mis...@xerox.com wrote:

 Hello,

 Please help me with links or documents for Apache Spark interview
 questions and answers, and also for the related tools on which questions
 could be asked.

 Thanking you all.

 Sincerely,
 Abhishek





Re: Java 8 vs Scala

2015-07-15 Thread vaquar khan
My choice is Java 8.
On 15 Jul 2015 18:03, Alan Burlison alan.burli...@oracle.com wrote:

 On 15/07/2015 08:31, Ignacio Blasco wrote:

  The main advantage of using Scala vs Java 8 is being able to use a console.


 https://bugs.openjdk.java.net/browse/JDK-8043364

 --
 Alan Burlison
 --





Re: Research ideas using spark

2015-07-15 Thread vaquar khan
I would suggest studying Spark, Flink, and Storm, and based on your
understanding and findings, preparing your research paper.

Maybe you will invent a new Spark ☺

Regards,
Vaquar khan
On 16 Jul 2015 00:47, Michael Segel msegel_had...@hotmail.com wrote:

 Silly question…

 When thinking about a PhD thesis… do you want to tie it to a specific
 technology, or do you want to investigate an idea and then use a specific
 technology?
 Or is this an outdated way of thinking?

 “I am doing my PhD thesis on large scale machine learning, e.g. online
 learning, batch and mini-batch learning.”

 So before we look at technologies like Spark… could the OP break down a
 more specific concept or idea that he wants to pursue?

 Looking at what Jorn said…

 Using machine learning to better predict workloads in terms of managing
 clusters… This could be interesting… but is it enough for a PhD thesis, or
 of interest to the OP?


 On Jul 15, 2015, at 9:43 AM, Jörn Franke jornfra...@gmail.com wrote:

 Well, one of the strengths of Spark is standardized, general distributed
 processing that allows many different types of processing, such as graph
 processing, stream processing, etc. The limitation is that it is less
 performant than a system focusing on only one type of processing (e.g.
 graph processing). I miss - and this may not be Spark specific - some
 artificial intelligence to manage a cluster, e.g. predicting workloads, or
 how long a job may run based on previously executed similar jobs.
 Furthermore, many optimizations you have to do manually, e.g. Bloom
 filters, partitioning, etc. - if you could find here some intelligence that
 does this automatically based on previously executed jobs, taking into
 account that the optimizations themselves change over time, that would be
 great... You may also explore feature interaction.

 On Tue, 14 Jul 2015 at 7:19, Shashidhar Rao raoshashidhar...@gmail.com
 wrote:

 Hi,

 I am doing my PhD thesis on large scale machine learning, e.g. online
 learning, batch and mini-batch learning.

 Could somebody help me with ideas, especially in the context of Spark as
 applied to the above learning methods?

 Some ideas would be improvements to existing algorithms, or implementing new
 features, especially for the above learning methods and for algorithms that
 have not been implemented yet.

 If somebody could help me with some ideas it would really accelerate my
 work.

 A few ideas on research papers regarding Spark or Mahout would also help.

 Thanks in advance.

 Regards






Re: Spark Intro

2015-07-15 Thread vaquar khan
Totally agreed with Hafsa: you need to identify your requirements and
needs before choosing Spark.

If you want to handle data with fast access, go with a NoSQL store (Mongo,
Aerospike, etc.); if you need data analytics, then Spark is best.

Regards,
Vaquar khan
On 14 Jul 2015 20:39, Hafsa Asif hafsa.a...@matchinguu.com wrote:

 Hi,
 I was also in the same situation, as we were using MySQL. Let me give some
 clarifications:
 1. Spark provides a great methodology for big data analysis. So, if you
 want to make your system more analytical and want well-prepared analytical
 methods to analyze your data, then it is a very good option.
 2. If you want to get rid of the old behavior of MS SQL and want fast
 responses from the database on huge datasets, then you can take any NoSQL
 database.

 In my case I selected Aerospike for data storage and applied the Spark
 analytical engine on it. It gives me a really good response and I have a
 plan to go to real production with this combination.

 Best,
 Hafsa

 2015-07-14 11:49 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com:

 It might take some time to understand the ecosystem. I'm not sure about
 what kind of environment you have (like number of cores, memory, etc.). To
 start with, you can basically use a JDBC connector, or dump your data as CSV
 and load it into Spark and query it. You get the advantage of caching if
 you have more memory; also, if you have enough cores, 4 records are
 nothing.
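
 As a small sketch of the CSV route (assuming Spark 2.x or later, where CSV
 support is built in; the file path and table name are placeholders):

 import org.apache.spark.sql.SparkSession

 object CsvQuery {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession.builder().appName("CsvQuery").getOrCreate()

     // Load the dumped CSV with a header row and inferred column types.
     val df = spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("hdfs:///dumps/orders.csv")

     // Cache so repeated filter/sort/aggregate queries are served from memory.
     df.cache()
     df.createOrReplaceTempView("orders")

     spark.sql(
       "SELECT status, COUNT(*) AS cnt FROM orders GROUP BY status ORDER BY cnt DESC"
     ).show()

     spark.stop()
   }
 }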

 Thanks
 Best Regards

 On Tue, Jul 14, 2015 at 3:09 PM, vinod kumar vinodsachin...@gmail.com
 wrote:

 Hi Akhil

 Is my choice to switch to Spark a good one? I ask because I don't have
 enough information regarding the limitations and working environment of
 Spark. I tried Spark SQL but it seems to return data slower compared to
 MS SQL. (I have tested with data which has 4 records.)



 On Tue, Jul 14, 2015 at 3:50 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 This is where you can get started
 https://spark.apache.org/docs/latest/sql-programming-guide.html

 Thanks
 Best Regards

 On Mon, Jul 13, 2015 at 3:54 PM, vinod kumar vinodsachin...@gmail.com
 wrote:


 Hi Everyone,

 I am developing an application which handles bulk data of around millions
 of records (this may vary as per the user's requirements). As of now I am
 using MS SQL Server as the back end and it works fine, but when I perform
 some operations on large data I get overflow exceptions. I heard that Spark
 is a faster computation engine than SQL (correct me if I am wrong), so I
 thought of switching my application to Spark. Is my decision right?
 My user environment is:
 #. Windows 8
 #. Data in the millions.
 #. Need to perform filtering and sorting operations with aggregations
 frequently (for analytics).

 Thanks in advance,

 Vinod








Re: Eclipse on spark

2015-01-26 Thread vaquar khan
I am using SBT
On 26 Jan 2015 15:54, Luke Wilson-Mawer lukewilsonma...@gmail.com wrote:

 I use this: http://scala-ide.org/

 I also use Maven with this archetype:
 https://github.com/davidB/scala-archetype-simple. To be frank though, you
 should be fine using SBT.
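
 For the SBT route, a minimal build.sbt along these lines is usually enough
 to compile and package a Spark project outside the shell (the Spark and
 Scala versions below are only illustrative):

 name := "my-spark-app"

 version := "0.1.0"

 scalaVersion := "2.11.12"

 // "provided" keeps the Spark jars out of your package, since the cluster supplies them.
 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "2.4.8" % "provided",
   "org.apache.spark" %% "spark-sql"  % "2.4.8" % "provided"
 )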

 On Sat, Jan 24, 2015 at 6:33 PM, riginos samarasrigi...@gmail.com wrote:

 How do I compile a Spark project in Scala IDE for Eclipse? I have many
 Scala scripts and I no longer want to load them from the scala-shell; what
 can I do?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Eclipse-on-spark-tp21350.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
