Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
+1 .
I can contribute to it as well .

On Tue, 19 Mar 2024 at 9:19 AM, Code Tutelage 
wrote:

> +1
>
> Thanks for proposing
>
> On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud
>  wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> d...@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that the Databricks community (Home | Databricks)
>> has just launched a knowledge sharing hub. I thought it would be a good
>> idea for the Apache Spark user group to have the same, especially for
>> repeat questions on Spark core, Spark SQL, Spark Structured Streaming,
>> Spark MLlib and so forth.
>>
>> Apache Spark user and dev groups have been around for a good while and
>> are serving their purpose. We went through creating a Slack community
>> that managed to create more heat than light. This is what the Databricks
>> community came up with, and I quote:
>>
>> "Knowledge Sharing Hub
>> Dive into a collaborative space where members like YOU can exchange
>> knowledge, tips, and best practices. Join the conversation today and
>> unlock a wealth of collective wisdom to enhance your experience and
>> drive success."
>>
>> I don't know the logistics of setting it up, but I am sure that should
>> not be that difficult. If anyone is supportive of this proposal, let the
>> usual +1, 0, -1 decide.
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one thousand
>> expert opinions" (Wernher von Braun).
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: Online classes for spark topics

2023-03-08 Thread Deepak Sharma
I can prepare some topics and present as well, if we have a prioritised
list of topics already.

On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee  wrote:

> We used to run Spark webinars on the Apache Spark LinkedIn group, but
> honestly the turnout was pretty low. We had dived into various features.
> If there are particular topics that you would like to discuss during a
> live session, please let me know and we can try to restart them. HTH!
>
> On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World  wrote:
>
>> +1
>>
>> On Wed, Mar 8, 2023 at 10:40 PM Winston Lai 
>> wrote:
>>
>>> +1, any webinar on a Spark-related topic is appreciated
>>>
>>> Thank You & Best Regards
>>> Winston Lai
>>> --
>>> *From:* asma zgolli 
>>> *Sent:* Thursday, March 9, 2023 5:43:06 AM
>>> *To:* karan alang 
>>> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com
>>> ; User 
>>> *Subject:* Re: Online classes for spark topics
>>>
>>> +1
>>>
>>> On Wed, 8 Mar 2023 at 21:32, karan alang wrote:
>>>
>>> +1 .. I'm happy to be part of these discussions as well !
>>>
>>>
>>>
>>>
>>> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I guess I can schedule this work over a period of time. I myself can
>>> contribute, plus learn from others.
>>>
>>> So +1 for me.
>>>
>>> Let us see if anyone else is interested.
>>>
>>> HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
>>> wrote:
>>>
>>>
>>> Hello Mich.
>>>
>>> Greetings. Would you be able to arrange a Spark Structured Streaming
>>> learning webinar?
>>>
>>> This is something I have been struggling with recently. It will be very
>>> helpful.
>>>
>>> Thanks and Regards
>>>
>>> AK
>>> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>> Hi,
>>>
>>> This might be a worthwhile exercise on the assumption that the
>>> contributors will find the time and bandwidth to chip in, so to speak.
>>>
>>> I am sure there are many, but off the top of my head I can think of Holden
>>> Karau for k8s and Sean Owen for data science stuff. They are both very
>>> experienced.
>>>
>>> Anyone else?
>>>
>>> HTH
>>>
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>>>  wrote:
>>>
>>> Hello gurus,
>>>
>>> Does the Spark community arrange online webinars for special topics like
>>> Spark on K8s, data science and Spark Structured Streaming?
>>>
>>> I would be most grateful if experts could share their experience with
>>> learners of intermediate knowledge like myself. Hopefully we will find the
>>> practical experiences shared valuable.
>>>
>>> Respectfully,
>>>
>>> AK
>>>
>>>
>>>
>>>
>>


Re: Spark Issue with Istio in Distributed Mode

2022-09-12 Thread Deepak Sharma
I was able to resolve the idle connections being terminated issue using an
EnvoyFilter.

On Sat, 3 Sept 2022 at 18:14, Ilan Filonenko  wrote:

> Must be set in envoy (maybe could passthrough via istio)
>
> https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/protocol.proto#envoy-v3-api-field-config-core-v3-httpprotocoloptions-idle-timeout
>
>
> On Sat, Sep 3, 2022 at 4:23 AM Deepak Sharma 
> wrote:
>
>> Thank for the reply IIan .
>> Can we set this in spark conf or does it need to goto istio / envoy conf?
>>
>>
>>
>> On Sat, 3 Sept 2022 at 10:28, Ilan Filonenko  wrote:
>>
>>> This might be a result of the idle_timeout that is configured in envoy.
>>> The default is an hour.
>>>
>>> On Sat, Sep 3, 2022 at 12:17 AM Deepak Sharma 
>>> wrote:
>>>
>>>> Hi All,
>>>> In 1 of our cluster , we enabled Istio where spark is running in
>>>> distributed mode.
>>>> Spark works fine when we run it with Istio in standalone mode.
>>>> In spark distributed mode , we are seeing that every 1 hour or so the
>>>> workers are getting disassociated from master and then master is not able
>>>> to spawn any jobs on these workers , until we restart spark rest server.
>>>>
>>>> Here is the error we see in the worker logs:
>>>>
>>>>
>>>> *ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to :
>>>> Driver spark-rest-service:44463 disassociated! Shutting down.*
>>>>
>>>> For 1 hour or so (until this issue happens) , spark distributed mode
>>>> works just fine.
>>>>
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>


Re: Spark Issue with Istio in Distributed Mode

2022-09-03 Thread Deepak Sharma
Thanks for the reply, Ilan.
Can we set this in the Spark conf or does it need to go into the Istio / Envoy conf?



On Sat, 3 Sept 2022 at 10:28, Ilan Filonenko  wrote:

> This might be a result of the idle_timeout that is configured in envoy.
> The default is an hour.
>
> On Sat, Sep 3, 2022 at 12:17 AM Deepak Sharma 
> wrote:
>
>> Hi All,
>> In 1 of our cluster , we enabled Istio where spark is running in
>> distributed mode.
>> Spark works fine when we run it with Istio in standalone mode.
>> In spark distributed mode , we are seeing that every 1 hour or so the
>> workers are getting disassociated from master and then master is not able
>> to spawn any jobs on these workers , until we restart spark rest server.
>>
>> Here is the error we see in the worker logs:
>>
>>
>> *ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to :
>> Driver spark-rest-service:44463 disassociated! Shutting down.*
>>
>> For 1 hour or so (until this issue happens) , spark distributed mode
>> works just fine.
>>
>>
>> Thanks
>> Deepak
>>
>


Spark Issue with Istio in Distributed Mode

2022-09-02 Thread Deepak Sharma
Hi All,
In one of our clusters, we enabled Istio where Spark is running in
distributed mode.
Spark works fine when we run it with Istio in standalone mode.
In Spark distributed mode, we are seeing that every hour or so the
workers get disassociated from the master, and then the master is not able
to spawn any jobs on these workers until we restart the Spark REST server.

Here is the error we see in the worker logs:


*ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver
spark-rest-service:44463 disassociated! Shutting down.*

For an hour or so (until this issue happens), Spark distributed mode works
just fine.


Thanks
Deepak


Re: Will it lead to OOM error?

2022-06-22 Thread Deepak Sharma
It will spill to disk if everything can't be loaded in memory.
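
A minimal Scala sketch of that behaviour (paths and names are illustrative,
not from this thread). Note that show() on its own only scans enough
partitions to produce its first 20 rows, so it will not try to hold the whole
file in memory; spilling matters once the full dataset is persisted or shuffled:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("csv-read-sketch").getOrCreate()

// A plain read + show only materialises a handful of rows.
val df = spark.read.option("header", "true").csv("/data/big.csv")
df.show(false)

// If the whole dataset is reused, pick a storage level that spills partitions
// to disk when they do not fit in executor memory.
df.persist(StorageLevel.MEMORY_AND_DISK)
println(df.count())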


On Wed, 22 Jun 2022 at 5:58 PM, Sid  wrote:

> I have a 150TB CSV file.
>
> I have a total of 100 TB RAM and 100TB disk. So If I do something like this
>
> spark.read.option("header","true").csv(filepath).show(false)
>
> Will it lead to an OOM error since it doesn't have enough memory? or it
> will spill data onto the disk and process it?
>
> Thanks,
> Sid
>
-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: spark as data warehouse?

2022-03-25 Thread Deepak Sharma
It can be used as a warehouse, but then you have to keep long-running Spark
jobs.
This is possible using cached DataFrames or Datasets.
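
A minimal Scala sketch of the cached-DataFrame idea inside a long-running job
(paths, view and column names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("warehouse-sketch").getOrCreate()

// Load and cache the hot data once; because the job stays up, the cache
// survives across queries.
val sales = spark.read.parquet("/warehouse/sales")
sales.cache()
sales.createOrReplaceTempView("sales")

// Later queries hit the cached data instead of re-reading the source.
spark.sql("SELECT sale_date, SUM(amount) AS total FROM sales GROUP BY sale_date").show()

Exposing such cached views over JDBC (for example via the Spark Thrift
Server) is one common way to let BI-style clients query them.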

Thanks
Deepak

On Sat, 26 Mar 2022 at 5:56 AM,  wrote:

> In the past time we have been using hive for building the data
> warehouse.
> Do you think if spark can used for this purpose? it's even more realtime
> than hive.
>
> Thanks.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Deepak Sharma
coalesce returns a new dataset.
That will cause the recomputation.

Thanks
Deepak

On Sun, 30 Jan 2022 at 14:06, Benjamin Du  wrote:

> I have some PySpark code like below. Basically, I persist a DataFrame
> (which is time-consuming to compute) to disk, call the method
> DataFrame.count to trigger the caching/persist immediately, and then I
> coalesce the DataFrame to reduce the number of partitions (the original
> DataFrame has 30,000 partitions) and output it to HDFS. Based on the
> execution time of job stages and the execution plan, it seems to me that
> the DataFrame is recomputed at df.coalesce(300). Does anyone know why
> this happens?
>
> df = spark.read.parquet("/input/hdfs/path") \
> .filter(...) \
> .withColumn("new_col", my_pandas_udf("col0", "col1")) \
> .persist(StorageLevel.DISK_ONLY)
> df.count()
> df.coalesce(300).write.mode("overwrite").parquet(output_mod)
>
>
> BTW, it works well if I manually write the DataFrame to HDFS, read it
> back, coalesce it and write it back to HDFS.
> Originally post at
> https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice.
> 
>
> Best,
>
> 
>
> Ben Du
>
> Personal Blog  | GitHub
>  | Bitbucket 
> | Docker Hub 
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Profiling spark application

2022-01-19 Thread Deepak Sharma
You can take a look at the JVM profiler that was open-sourced by Uber:
https://github.com/uber-common/jvm-profiler
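
For the listener-based route mentioned further down this thread, a minimal
Scala sketch (what you record, and where you send it, is up to you):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import org.apache.spark.sql.SparkSession

class TimingListener extends SparkListener {
  // Called once per completed stage; report how long the stage took.
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    val millis = for {
      submitted <- info.submissionTime
      completed <- info.completionTime
    } yield completed - submitted
    println(s"Stage ${info.stageId} '${info.name}' took ${millis.getOrElse(-1L)} ms")
  }
}

val spark = SparkSession.builder().appName("listener-sketch").getOrCreate()
spark.sparkContext.addSparkListener(new TimingListener)

The same listener class can also be attached with
--conf spark.extraListeners=<fully qualified class name> instead of calling
addSparkListener in the job code.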



On Thu, Jan 20, 2022 at 11:20 AM Prasad Bhalerao <
prasadbhalerao1...@gmail.com> wrote:

> Hi,
>
> It will require code changes and I am looking at some third-party code. I
> am looking for something which I can just hook into the JVM and get the stats.
>
> Thanks,
> Prasad
>
> On Thu, Jan 20, 2022 at 11:00 AM Sonal Goyal 
> wrote:
>
>> Hi Prasad,
>>
>> Have you checked the SparkListener -
>> https://mallikarjuna_g.gitbooks.io/spark/content/spark-SparkListener.html
>> ?
>>
>> Cheers,
>> Sonal
>> https://github.com/zinggAI/zingg
>>
>>
>>
>> On Thu, Jan 20, 2022 at 10:49 AM Prasad Bhalerao <
>> prasadbhalerao1...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Is there any way we can profile spark applications which will show no.
>>> of invocations of spark api and their execution time etc etc just the way
>>> jprofiler shows all the details?
>>>
>>>
>>> Thanks,
>>> Prasad
>>>
>>

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Edge AI with Spark

2020-09-24 Thread Deepak Sharma
Near-edge would work in this case.
On-edge doesn't make much sense, especially for a distributed processing
framework such as Spark.

On Thu, Sep 24, 2020 at 3:12 PM Gourav Sengupta 
wrote:

> hi,
>
> it's better to use lighter frameworks at the edge. Some of the edge devices I
> work on run at over 40 to 50 degrees Celsius, therefore using lighter
> frameworks will be useful for the health of the device.
>
> Regards,
> Gourav
>
> On Thu, Sep 24, 2020 at 8:42 AM ayan guha  wrote:
>
>> Too broad a question; the short answer is yes and the long answer is it
>> depends.
>>
>> Essentially Spark is a compute engine, so it can be wrapped into any
>> containerized model and deployed at the edge. I believe there are various
>> implementations available.
>>
>>
>>
>> On Thu, 24 Sep 2020 at 5:19 pm, Marco Sassarini <
>> marco.sassar...@overit.it> wrote:
>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>> I'd like to know if Spark supports edge AI: can Spark
>>>
>>> run on physical device such as mobile devices running Android/iOS?
>>>
>>>
>>>
>>>
>>>
>>> Best regards,
>>>
>>>
>>> Marco Sassarini
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *Marco SassariniArtificial Intelligence Department*
>>>
>>>
>>>
>>>
>>>
>>>
>>> office: +39 0434 562 978
>>>
>>>
>>>
>>> www.overit.it
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>> Best Regards,
>> Ayan Guha
>>
>

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Write to same hdfs dir from multiple spark jobs

2020-07-29 Thread Deepak Sharma
Hi
Is there any design pattern around writing to the same hdfs directory from
multiple spark jobs?

-- 
Thanks
Deepak
www.bigdatabig.com


GroupBy issue while running K-Means - Dataframe

2020-06-16 Thread Deepak Sharma
Hi All,
I have a custom implementation of K-Means where it needs the data to be
grouped by a key in a DataFrame.
Now there is a big data skew for some of the keys, where it exceeds the
BufferHolder limit:
 Cannot grow BufferHolder by size 17112 because the size after growing
exceeds size limitation 2147483632

I tried solving it by converting the DataFrame to an RDD, using
reduceByKey on the RDD, and converting it back.
This gives a Java heap out-of-memory error.
Since it looks like a common issue, I was wondering how anyone else has been
solving this problem?
-- 
Thanks
Deepak


Re: On spam messages

2020-04-29 Thread Deepak Sharma
Much appreciated Sean.
Thanks.


On Wed, 29 Apr 2020 at 6:48 PM, Sean Owen  wrote:

> I am subscribed to this list to watch for a certain person's new
> accounts, which are posting obviously off-topic and inappropriate
> messages. It goes without saying that this is unacceptable and a CoC
> violation, and anyone posting that will be immediately removed and
> blocked.
>
> In the meantime, please don't prolong and expand these threads by
> engaging the very off-topic discussion on the list. You can email me
> privately to ensure I've caught any such messages.
>
> Yes, the original account was removed for this behavior and the new
> one will be too immediately.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Unsubscribe

2020-04-29 Thread Deepak Sharma
-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: OFFICIAL USA REPORT TODAY India Most Dangerous : USA Religious Freedom Report out TODAY

2020-04-29 Thread Deepak Sharma
I am unsubscribing until these hatemongers like Zahid Amin are removed or
blocked .
FYI Zahid Amin , Indian govt rejected the false report already .



On Wed, 29 Apr 2020 at 11:58 AM, Gaurav Agarwal 
wrote:

> Spark moderator supress this user please. Unnecessary Spam or apache spark
> account is hacked ?
>
> On Wed, Apr 29, 2020, 11:56 AM Zahid Amin  wrote:
>
>> How can it be rumours   ?
>> Of course you want  to suppress me.
>> Suppress USA official Report out TODAY .
>>
>> > Sent: Wednesday, April 29, 2020 at 8:17 AM
>> > From: "Deepak Sharma" 
>> > To: "Zahid Amin" 
>> > Cc: user@spark.apache.org
>> > Subject: Re: India Most Dangerous : USA Religious Freedom Report
>> >
>> > Can someone block this email ?
>> > He is spreading rumours and spamming.
>> >
>> > On Wed, 29 Apr 2020 at 11:46 AM, Zahid Amin 
>> wrote:
>> >
>> > > USA report states that India is now the most dangerous country for
>> Ethnic
>> > > Minorities.
>> > >
>> > > Remember Martin Luther King.
>> > >
>> > >
>> > >
>> https://www.mail.com/int/news/us/9880960-religious-freedom-watchdog-pitches-adding-india-to.html#.1258-stage-set1-3
>> > >
>> > > It began with Kasmir and still in locked down Since August 2019.
>> > >
>> > > The Hindutwa  want to eradicate all minorities .
>> > > The Apache foundation is infested with these Hindutwa purists and
>> their
>> > > sympathisers.
>> > > Making Sure all Muslims are kept away from IT industry. Using you to
>> help
>> > > them.
>> > >
>> > > Those people in IT you deal with are purists yet you are not welcome
>> India.
>> > >
>> > > The recognition of  Hindutwa led to the creation of Pakistan in 1947.
>> > >
>> > > Evil propers when good men do nothing.
>> > > The genocide is not coming . It is Here.
>> > > I ask you please think and act.
>> > > Protect the Muslims from Indian Continent.
>> > >
>> > > -
>> > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> > >
>> > > --
>> > Thanks
>> > Deepak
>> > www.bigdatabig.com
>> > www.keosha.net
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: India Most Dangerous : USA Religious Freedom Report

2020-04-29 Thread Deepak Sharma
Can someone block this email ?
He is spreading rumours and spamming.

On Wed, 29 Apr 2020 at 11:46 AM, Zahid Amin  wrote:

> USA report states that India is now the most dangerous country for Ethnic
> Minorities.
>
> Remember Martin Luther King.
>
>
> https://www.mail.com/int/news/us/9880960-religious-freedom-watchdog-pitches-adding-india-to.html#.1258-stage-set1-3
>
> It began with Kasmir and still in locked down Since August 2019.
>
> The Hindutwa  want to eradicate all minorities .
> The Apache foundation is infested with these Hindutwa purists and their
> sympathisers.
> Making Sure all Muslims are kept away from IT industry. Using you to help
> them.
>
> Those people in IT you deal with are purists yet you are not welcome India.
>
> The recognition of  Hindutwa led to the creation of Pakistan in 1947.
>
> Evil propers when good men do nothing.
> The genocide is not coming . It is Here.
> I ask you please think and act.
> Protect the Muslims from Indian Continent.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


unsubscribe

2019-12-07 Thread Deepak Sharma



Re: PGP Encrypt using spark Scala

2019-08-26 Thread Deepak Sharma
Hi Sachit
PGP encryption is something that is not built into Spark.
I would suggest writing a shell script that does the PGP encryption and
calling it from the Spark Scala program, which would run from the driver.
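
A rough sketch (Scala, run on the driver) of that shell-out approach; the
wrapper script name and all paths here are hypothetical:

import scala.sys.process._

val localDir = "/tmp/pgp_work"
val files    = Seq("part-00000.csv", "part-00001.csv")   // illustrative file names

files.foreach { f =>
  // Pull the file out of HDFS, encrypt it with the wrapper script
  // (which would call gpg), then push the encrypted copy back.
  Seq("hdfs", "dfs", "-get", s"/data/out/$f", s"$localDir/$f").!
  Seq("/opt/scripts/encrypt.sh", s"$localDir/$f").!
  Seq("hdfs", "dfs", "-put", s"$localDir/$f.gpg", "/data/encrypted/").!
}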

Thanks
Deepak

On Mon, Aug 26, 2019 at 8:10 PM Sachit Murarka 
wrote:

> Hi All,
>
> I want to encrypt my files available at HDFS location using PGP Encryption
> How can I do it in spark. I saw Apache Camel  but it seems camel is used
> when source files are in Local location rather than HDFS.
>
> Kind Regards,
> Sachit Murarka
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: A basic question

2019-06-17 Thread Deepak Sharma
You can follow this example:
https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-spark.html


On Mon, Jun 17, 2019 at 12:27 PM Shyam P  wrote:

> I am developing a spark job using java1.8v.
>
> Is it possible to write a spark app using spring-boot technology?
> Did anyone tried it ? if so how it should be done?
>
>
> Regards,
> Shyam
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Read hdfs files in spark streaming

2019-06-10 Thread Deepak Sharma
Thanks All.
I managed to get this working.
Marking this thread as closed.

On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma  wrote:

> This is the project requirement , where paths are being streamed in kafka
> topic.
> Seems it's not possible using spark structured streaming.
>
>
> On Mon, Jun 10, 2019 at 3:59 PM Shyam P  wrote:
>
>> Hi Deepak,
>>  Why are you getting paths from kafka topic? any specific reason to do so
>> ?
>>
>> Regards,
>> Shyam
>>
>> On Mon, Jun 10, 2019 at 10:44 AM Deepak Sharma 
>> wrote:
>>
>>> The context is different here.
>>> The file path are coming as messages in kafka topic.
>>> Spark streaming (structured) consumes form this topic.
>>> Now it have to get the value from the message , thus the path to file.
>>> read the json stored at the file location into another df.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Sun, Jun 9, 2019 at 11:03 PM vaquar khan 
>>> wrote:
>>>
>>>> Hi Deepak,
>>>>
>>>> You can use textFileStream.
>>>>
>>>> https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
>>>>
>>>> Plz start using stackoverflow to ask question to other ppl so get
>>>> benefits of answer
>>>>
>>>>
>>>> Regards,
>>>> Vaquar khan
>>>>
>>>> On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma 
>>>> wrote:
>>>>
>>>>> I am using spark streaming application to read from  kafka.
>>>>> The value coming from kafka message is path to hdfs file.
>>>>> I am using spark 2.x , spark.read.stream.
>>>>> What is the best way to read this path in spark streaming and then
>>>>> read the json stored at the hdfs path , may be using spark.read.json , 
>>>>> into
>>>>> a df inside the spark streaming app.
>>>>> Thanks a lot in advance
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
The context is different here.
The file paths are coming as messages in a Kafka topic.
Spark Streaming (Structured) consumes from this topic.
Now it has to get the value from the message, i.e. the path to the file, and
read the JSON stored at that file location into another DataFrame.
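
For what it is worth, a minimal Scala sketch of one way to wire this up with
foreachBatch (Spark 2.4+); broker, topic, column and path names are
illustrative:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("path-stream-sketch").getOrCreate()
import spark.implicits._

// Each Kafka record's value is an HDFS path.
val paths = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "hdfs-paths")
  .load()
  .selectExpr("CAST(value AS STRING) AS path")

val query = paths.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // A micro-batch carries a handful of paths; read the JSON behind each one
    // with a normal batch read and process/write it as needed.
    batch.as[String].collect().foreach { p =>
      spark.read.json(p).write.mode("append").parquet("/data/landing")
    }
  }
  .start()

query.awaitTermination()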

Thanks
Deepak

On Sun, Jun 9, 2019 at 11:03 PM vaquar khan  wrote:

> Hi Deepak,
>
> You can use textFileStream.
>
> https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
>
> Plz start using stackoverflow to ask question to other ppl so get benefits
> of answer
>
>
> Regards,
> Vaquar khan
>
> On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma  wrote:
>
>> I am using spark streaming application to read from  kafka.
>> The value coming from kafka message is path to hdfs file.
>> I am using spark 2.x , spark.read.stream.
>> What is the best way to read this path in spark streaming and then read
>> the json stored at the hdfs path , may be using spark.read.json , into a df
>> inside the spark streaming app.
>> Thanks a lot in advance
>>
>> --
>> Thanks
>> Deepak
>>
>

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
I am using a Spark Streaming application to read from Kafka.
The value coming in the Kafka message is the path to an HDFS file.
I am using Spark 2.x with spark.readStream.
What is the best way to read this path in Spark Streaming and then read the
JSON stored at the HDFS path, maybe using spark.read.json, into a DataFrame
inside the Spark Streaming app?
Thanks a lot in advance

-- 
Thanks
Deepak


Re: dynamic allocation in spark-shell

2019-05-31 Thread Deepak Sharma
You can start spark-shell with these properties:
--conf spark.dynamicAllocation.enabled=true --conf
spark.dynamicAllocation.initialExecutors=2 --conf
spark.dynamicAllocation.minExecutors=2 --conf
spark.dynamicAllocation.maxExecutors=5
(On YARN with Spark 2.x, dynamic allocation also needs the external shuffle
service, i.e. spark.shuffle.service.enabled=true with the shuffle service
running on the NodeManagers.)

On Fri, May 31, 2019 at 5:30 AM Qian He  wrote:

> Sometimes it's convenient to start a spark-shell on cluster, like
> ./spark/bin/spark-shell --master yarn --deploy-mode client --num-executors
> 100 --executor-memory 15g --executor-cores 4 --driver-memory 10g --queue
> myqueue
> However, with command like this, those allocated resources will be
> occupied until the console exits.
>
> Just wandering if it is possible to start a spark-shell with
> dynamicAllocation enabled? If it is, how to specify the configs? Can anyone
> give an quick example? Thanks!
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Getting EOFFileException while reading from sequence file in spark

2019-04-29 Thread Deepak Sharma
This can happen if the file size is 0

On Mon, Apr 29, 2019 at 2:28 PM Prateek Rajput
 wrote:

> Hi guys,
> I am getting this strange error again and again while reading from from a
> sequence file in spark.
> User class threw exception: org.apache.spark.SparkException: Job aborted.
> at
> org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
> at
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
> at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
> at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
> at
> org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)
> at
> org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)
> at
> com.flipkart.prognos.spark.UniqueDroppedFSN.main(UniqueDroppedFSN.java:42)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 186 in stage 0.0 failed 4 times, most recent failure: Lost
> task 186.3 in stage 0.0 (TID 179, prod-fdphadoop-krios-dn-1039, executor
> 1): java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at
> org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
> at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:120)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2436)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2568)
> at
> org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:293)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:224)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
> at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at
> 

Spark streaming filling the disk with logs

2019-02-13 Thread Deepak Sharma
Hi All
I am running a Spark Streaming job with the below configuration:

--conf "spark.executor.extraJavaOptions=-Droot.logger=WARN,console"

But it's still filling the disk with INFO logs.
If the logging level is set to WARN at the cluster level, then only the WARN
logs get written, but that affects all the jobs.

Is there any way to get rid of INFO-level logging at the Spark Streaming job
level?

Thanks
Deepak

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Error while upserting ElasticSearch from Spark 2.2

2018-10-08 Thread Deepak Sharma
Hi All,
I am facing this weird issue while upserting Elasticsearch using a Spark
DataFrame:
org.elasticsearch.hadoop.rest.EsHadoopRemoteException:
version_conflict_engine_exception:

After it fails, if I rerun it 2-3 times it finally succeeds.
I thought to check if anyone has faced this issue and what needs to be done
to get rid of it?

-- 
Thanks
Deepak


Re: getting error: value toDF is not a member of Seq[columns]

2018-09-05 Thread Deepak Sharma
Try this:

import spark.implicits._

df.toDF()
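
Spelled out end to end, a minimal sketch using the case class from this
thread (sample values shortened); the key point is that the implicits must
come from the SparkSession instance:

import org.apache.spark.sql.SparkSession

case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE: Float)

val spark = SparkSession.builder().appName("toDF-sketch").getOrCreate()
import spark.implicits._   // toDF on a Seq comes from these implicits

val df = Seq(columns("ac11a78d", "MKS", "2018-09-05T10:10:15", 676.5f)).toDF()
df.show()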


On Wed, Sep 5, 2018 at 2:31 PM Mich Talebzadeh 
wrote:

> With the following
>
> case class columns(KEY: String, TICKER: String, TIMEISSUED: String, PRICE:
> Float)
>
>  var key = line._2.split(',').view(0).toString
>  var ticker =  line._2.split(',').view(1).toString
>  var timeissued = line._2.split(',').view(2).toString
>  var price = line._2.split(',').view(3).toFloat
>
>   var df = Seq(columns(key, ticker, timeissued, price))
>  println(df)
>
> I get
>
>
> List(columns(ac11a78d-82df-4b37-bf58-7e3388aa64cd,MKS,2018-09-05T10:10:15,676.5))
>
> So just need to convert that list to DF
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 5 Sep 2018 at 09:49, Mich Talebzadeh 
> wrote:
>
>> Thanks!
>>
>> The spark  is version 2.3.0
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 5 Sep 2018 at 09:41, Jungtaek Lim  wrote:
>>
>>> You may also find below link useful (though it looks far old), since
>>> case class is the thing which Encoder is available, so there may be another
>>> reason which prevent implicit conversion.
>>>
>>>
>>> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Spark-Scala-Error-value-toDF-is-not-a-member-of-org-apache/m-p/29994/highlight/true#M973
>>>
>>> And which Spark version do you use?
>>>
>>>
>>> On Wed, 5 Sep 2018 at 17:32, Jungtaek Lim wrote:
>>>
 Sorry I guess I pasted another method. the code is...

 implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): 
 DatasetHolder[T] = {
   DatasetHolder(_sqlContext.createDataset(s))
 }


 On Wed, 5 Sep 2018 at 17:30, Jungtaek Lim wrote:

> I guess you need to have encoder for the type of result for columns().
>
>
> https://github.com/apache/spark/blob/2119e518d31331e65415e0f817a6f28ff18d2b42/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala#L227-L229
>
> implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): 
> DatasetHolder[T] = {
>   DatasetHolder(_sqlContext.createDataset(rdd))
> }
>
> You can see lots of Encoder implementations in the scala code. If your
> type doesn't match anything it may not work and you need to provide custom
> Encoder.
>
> -Jungtaek Lim (HeartSaVioR)
>
> On Wed, 5 Sep 2018 at 17:24, Mich Talebzadeh wrote:
>
>> Thanks
>>
>> I already do that as below
>>
>> val sqlContext= new org.apache.spark.sql.SQLContext(sparkContext)
>>   import sqlContext.implicits._
>>
>> but still getting the error!
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>> for any loss, damage or destruction of data or any other property which 
>> may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 5 Sep 2018 at 09:17, Jungtaek Lim  wrote:
>>
>>> You may need to import implicits from your spark session like below:
>>> (Below code is borrowed from
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html)
>>>
>>> import org.apache.spark.sql.SparkSession
>>> val spark = SparkSession
>>>   .builder()
>>>   .appName("Spark SQL basic example")
>>>   .config("spark.some.config.option", "some-value")
>>>   .getOrCreate()
>>> // For implicit conversions like converting RDDs 

java.lang.IndexOutOfBoundsException: len is negative - when data size increases

2018-08-16 Thread Deepak Sharma
Hi All,
I am running a Spark-based ETL on Spark 1.6 and facing this weird issue.
The same code with the same properties/configuration runs fine in another
environment, e.g. PROD, but never completes in CAT.
The only change would be the size of the data it is processing, and that only
by 1-2 GB.
This is the stack trace: java.lang.IndexOutOfBoundsException: len is negative
at org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:895)
at
org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:76)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$SpillableIterator.loadNext(UnsafeExternalSorter.java:509)
at
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136)
at
org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123)
at
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84)
at
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoin.scala:272)
at
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoin.scala:233)
at
org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeOuterJoin.scala:250)
at
org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeOuterJoin.scala:283)
at
org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

Did anyone faced this issue?
If yes , what can i do to resolve this?

Thanks
Deepak


Re: Big data visualization

2018-05-27 Thread Deepak Sharma
Yes Amin
Spark is primarily being used for ETL.
Once you transform , you can store it in any nosql DBs that support use
case.
The BI dashboard app can further connect to the nosql DB for reports and
visualization.

HTH

Deepak.

On Mon, May 28, 2018, 05:47 amin mohebbi 
wrote:

>  I am working on analytic application using Apache Spark to store and
> analyze data. Spark might be used as a ETL application to aggregate
> different metrics and then join with the aggregated metrics. The data
> sources are flat files that are coming from two different sources(interval
> meter data and customer information) on a daily basis(65Gb per day - time
> series data). The end users are BI users, so we cannot provide them
> notebook visualization. They only can use  Power BI , Tableua or Excel to
> do self service filters for run time analytics, graphing the data and
> reporting.
>
> So, my question is that what is the best tools to implement this pipeline?
> I do not think storing parquet or orc in file system is a good choice in
> production, and I think we have to deposit the data somewhere (time series
> or standard db) , please correct me if  I am wrong.
>
> 1- where to store the data? files system/time series db/azure cosmos /
> standard db?
> 2- Is it right way to do to use spark as to  etl and aggregation
> application , store it somewhere and use power bi for reporting and
> dashboard purposes?
> Best Regards ... Amin
> Mohebbi PhD candidate in Software Engineering   at university of Malaysia
> Tel : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my
> amin_...@me.com
>


Re: Help Required - Unable to run spark-submit on YARN client mode

2018-05-08 Thread Deepak Sharma
Can you try increasing the number of partitions for the base RDD/DataFrame
that you are working on?


On Tue, May 8, 2018 at 5:05 PM, Debabrata Ghosh 
wrote:

> Hi Everyone,
> I have been trying to run spark-shell in YARN client mode, but am getting
> lot of ClosedChannelException errors, however the program works fine on
> local mode.  I am using spark 2.2.0 build for Hadoop 2.7.3.  If you are
> familiar with this error, please can you help with the possible resolution.
>
> Any help would be greatly appreciated!
>
> Here is the error message:
>
> 18/05/08 00:01:18 ERROR TransportClient: Failed to send RPC
> 7905321254854295784 to /9.30.94.43:60220: java.nio.channels.
> ClosedChannelException
> java.nio.channels.ClosedChannelException
> at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown
> Source)
> 18/05/08 00:01:18 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint:
> Sending RequestExecutors(5,0,Map(),Set()) to AM was unsuccessful
> java.io.IOException: Failed to send RPC 7905321254854295784 to /
> 9.30.94.43:60220: java.nio.channels.ClosedChannelException
> at org.apache.spark.network.client.TransportClient.lambda$
> sendRpc$2(TransportClient.java:237)
> at io.netty.util.concurrent.DefaultPromise.notifyListener0(
> DefaultPromise.java:507)
> at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(
> DefaultPromise.java:481)
> at io.netty.util.concurrent.DefaultPromise.access$000(
> DefaultPromise.java:34)
> at io.netty.util.concurrent.DefaultPromise$1.run(
> DefaultPromise.java:431)
> at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(
> SingleThreadEventExecutor.java:399)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.
> run(SingleThreadEventExecutor.java:131)
> at io.netty.util.concurrent.DefaultThreadFactory$
> DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.nio.channels.ClosedChannelException
>
> Cheers,
>
> Debu
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Deepak Sharma
Yes Nicolas.
It would be a great help if you can push the code to GitHub and share the URL.

Thanks
Deepak

On Mon, Apr 23, 2018, 23:00 unk1102  wrote:

> Hi Nicolas thanks much for guidance it was very useful information if you
> can
> push that code to github and share url it would be a great help. Looking
> forward. If you can find time to push early it would be even greater help
> as
> I have to finish POC on this use case ASAP.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Deepak Sharma
Is there any open source code base to refer to for this kind of use case?

Thanks
Deepak
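
A minimal Scala sketch of the approach Nicolas describes below (PDF bytes
stored as a binary column, text extracted with PDFBox). The table layout,
column names and formats are illustrative, and it assumes spark-avro and
pdfbox are on the classpath:

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("pdf-to-text-sketch").getOrCreate()
import spark.implicits._

// Turn one PDF's raw bytes into plain text on the executors.
val pdfToText = udf { bytes: Array[Byte] =>
  val doc = PDDocument.load(bytes)
  try new PDFTextStripper().getText(doc) finally doc.close()
}

// Assumed layout: one row per PDF with columns (name: string, content: binary).
val pdfs = spark.read.format("avro").load("/data/pdf_table")
pdfs.select($"name", pdfToText($"content").as("text"))
  .write.mode("overwrite").parquet("/data/pdf_text")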

On Mon, Apr 23, 2018, 22:13 Nicolas Paris  wrote:

> Hi
>
> The problem is the number of files on Hadoop.
>
> I deal with 50M PDF files. What I did is to put them in an Avro table on
> HDFS, as a binary column.
>
> Then I read it with Spark and push that into PDFBox.
>
> Transforming 50M PDFs into text took 2 hours on a 5-computer cluster.
>
> About colors and formatting, I guess PDFBox is able to get that information
> and then maybe you could add HTML tags in your txt output.
> That's some extra work indeed.
>
>
>
>
> 2018-04-23 18:25 GMT+02:00 unk1102 :
>
>> Hi I need guidance on dealing with large no of pdf files when using Hadoop
>> and Spark. Can I store as binaryFiles using sc.binaryFiles and then
>> convert
>> it to text using pdf parsers like Apache Tika or PDFBox etc or I convert
>> it
>> into text using these parsers and store it as text files but in doing so I
>> am loosing colors, formatting etc Please guide.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Merge query using spark sql

2018-04-02 Thread Deepak Sharma
I am using Spark to run a merge query in Postgres.
The way it's being done now is to save the data to be merged in Postgres as
temp tables.
Then the merge queries are run in Postgres using a Java SQL connection and
statement.
So basically this query runs in Postgres.
The queries insert into the source table if the row doesn't exist in the
source but exists in the temp table, else update.
The problem is both tables have got 400K records, and thus this whole query
takes 20 hours to run.
Is there any way to do it in Spark itself and not run the query in PG, so
this can complete in a reasonable time?

-- 
Thanks
Deepak


Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-18 Thread Deepak Sharma
The other approach would be to write to a temp table and then merge the data.
But this may be an expensive solution.
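
A rough Scala sketch of the cast-before-write idea, as an alternative to the
temp-table route above; the table, column list, date format and connection
details are hypothetical and would be resolved at runtime in the scenario
described below:

import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder().appName("hive-to-oracle-sketch").enableHiveSupport().getOrCreate()

val hiveDF = spark.table("some_db.some_table")   // table name resolved dynamically in practice
val dateCols = Seq("load_dt")                    // varchar columns that are DATE in Oracle

// Cast the varchar date columns to real dates before handing the rows to JDBC.
val casted = dateCols.foldLeft(hiveDF) { (df, c) =>
  df.withColumn(c, to_date(col(c), "yyyy-MM-dd"))
}

val props = new Properties()
props.setProperty("user", "oracle_user")
props.setProperty("password", "****")
casted.write.mode("append")
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "TARGET_TABLE", props)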

Thanks
Deepak

On Mon, Mar 19, 2018, 08:04 Gurusamy Thirupathy  wrote:

> Hi,
>
> I am trying to read data from Hive as DataFrame, then trying to write the
> DF into the Oracle data base. In this case, the date field/column in hive
> is with Type Varchar(20)
> but the corresponding column type in Oracle is Date. While reading from
> hive , the hive table names are dynamically decided(read from another
> table) based on some job condition(ex. Job1). There are multiple tables
> like this, so column and the table names are decided only run time. So I
> can't do type conversion explicitly when read from Hive.
>
> So is there any utility/api available in Spark to achieve this conversion
> issue?
>
>
> Thanks,
> Guru
>


Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Deepak Sharma
I would suggest repartitioning it to a reasonable number of partitions, maybe
500, and saving it to some intermediate working directory.
Finally, read all the files from this working dir, then coalesce to 1 and
save to the final location.
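
A minimal Scala sketch of that two-step write (partition count and paths are
illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-step-write-sketch").getOrCreate()
val myDF = spark.read.parquet("/data/input")   // stand-in for the expensive DataFrame

// Step 1: do the heavy work with plenty of parallelism and persist the result.
myDF.repartition(500)
  .write.mode("overwrite")
  .csv("/data/_work/file_parts")

// Step 2: a cheap re-read, then coalesce(1) only for the final single-file output.
spark.read.csv("/data/_work/file_parts")
  .coalesce(1)
  .write.mode("overwrite")
  .csv("/data/file.csv")

This keeps coalesce(1) out of the stage that does the expensive work, which
is the same point Vadim makes below about breaking the DAG.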

Thanks
Deepak

On Fri, Mar 9, 2018, 20:12 Vadim Semenov  wrote:

> because `coalesce` gets propagated further up in the DAG in the last
> stage, so your last stage only has one task.
>
> You need to break your DAG so your expensive operations would be in a
> previous stage before the stage with `.coalesce(1)`
>
> On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim <
> rezaul.ka...@insight-centre.org> wrote:
>
>> Dear All,
>>
>> I have a tiny CSV file, which is around 250MB. There are only 30 columns
>> in the DataFrame. Now I'm trying to save the pre-processed DataFrame as an
>> another CSV file on disk for later usage.
>>
>> However, I'm getting pissed off as writing the resultant DataFrame is
>> taking too long, which is about 4 to 5 hours. Nevertheless, the size of the
>> file written on the disk is about 58GB!
>>
>> Here's the sample code that I tried:
>>
>> # Using repartition()
>>
>> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>
>> # Using coalesce()
>> myDF.
>> coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>
>>
>> Any better suggestion?
>>
>>
>>
>> 
>> Md. Rezaul Karim, BSc, MSc
>> Research Scientist, Fraunhofer FIT, Germany
>>
>> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
>>
>> eMail: rezaul.ka...@fit.fraunhofer.de 
>> Tel: +49 241 80-21527 <+49%20241%208021527>
>>
>
>
>
> --
> Sent from my iPhone
>


Re: HBase connector does not read ZK configuration from Spark session

2018-02-23 Thread Deepak Sharma
Hi Dharmin
With the 1st approach, you will have to read the properties from the
--files distribution using the below:
SparkFiles.get("file.txt")

Or else, you can copy the file to HDFS, read it using sc.textFile, and use
the properties within it.

If you add files using --files, they get copied to the executors' working
directory, but you still have to read the file and set the properties in the
conf yourself.
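
A minimal Scala sketch of reading a properties file shipped with --files, as
described above (the file name and property keys are illustrative):

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.SparkFiles

// spark-submit ... --files /path/to/app.properties ...

val props = new Properties()
val in = new FileInputStream(SparkFiles.get("app.properties"))
try props.load(in) finally in.close()

val zkQuorum = props.getProperty("hbase.zookeeper.quorum")
// ...use zkQuorum (and friends) when building the HBase configuration...
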
Thanks
Deepak

On Fri, Feb 23, 2018 at 10:25 AM, Dharmin Siddesh J <
siddeshjdhar...@gmail.com> wrote:

> I am trying to write a Spark program that reads data from HBase and store
> it in DataFrame.
>
> I am able to run it perfectly with hbase-site.xml in the $SPARK_HOME/conf
> folder, but I am facing few issues here.
>
> Issue 1
>
> The first issue is passing hbase-site.xml location with the --files
> parameter submitted through client mode (it works in cluster mode).
>
>
> When I removed hbase-site.xml from $SPARK_HOME/conf and tried to execute
> it in client mode by passing with the --files parameter over YARN I keep
> getting the an exception (which I think means it is not taking the
> ZooKeeper configuration from hbase-site.xml.
>
> spark-submit \
>
>   --master yarn \
>
>   --deploy-mode client \
>
>   --files /home/siddesh/hbase-site.xml \
>
>   --class com.orzota.rs.json.HbaseConnector \
>
>   --packages com.hortonworks:shc:1.0.0-2.0-s_2.11 \
>
>   --repositories http://repo.hortonworks.com/content/groups/public/ \
>
>   target/scala-2.11/test-0.1-SNAPSHOT.jar
>
> at org.apache.zookeeper.ClientCnxn$SendThread.run(
> ClientCnxn.java:1125)
>
> 18/02/22 01:43:09 INFO ClientCnxn: Opening socket connection to server
> localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using
> SASL (unknown error)
>
> 18/02/22 01:43:09 WARN ClientCnxn: Session 0x0 for server null, unexpected
> error, closing socket connection and attempting reconnect
>
> java.net.ConnectException: Connection refused
>
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>
> at sun.nio.ch.SocketChannelImpl.finishConnect(
> SocketChannelImpl.java:717)
>
> at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(
> ClientCnxnSocketNIO.java:361)
>
> at org.apache.zookeeper.ClientCnxn$SendThread.run(
> ClientCnxn.java:1125)
>
> However it works good when I run it in cluster mode.
>
>
> Issue 2
>
> Passing the HBase configuration details through the Spark session, which I
> can't get to work in both client and cluster mode.
>
>
> Instead of passing the entire hbase-site.xml I am trying to add the
> configuration directly in the code by adding it as a configuration
> parameter in the SparkSession, e.g.:
>
>
> val spark = SparkSession
>
>   .builder()
>
>   .appName(name)
>
>   .config("hbase.zookeeper.property.clientPort", "2181")
>
>   .config("hbase.zookeeper.quorum", "ip1,ip2,ip3")
>
>   .config("spark.hbase.host","zookeeperquorum")
>
>   .getOrCreate()
>
>
> val json_df =
>
>   spark.read.option("catalog",catalog_read).
>
>   format("org.apache.spark.sql.execution.datasources.hbase").
>
>   load()
>
> This is not working in cluster mode either.
>
>
> Can anyone help me with a solution or explanation why this is happening
> are there any workarounds?
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Spark Dataframe and HIVE

2018-02-11 Thread Deepak Sharma
egatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in
> java.library.path
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
> at java.lang.Runtime.loadLibrary0(Runtime.java:870)
> at java.lang.System.loadLibrary(System.java:1122)
> at org.xerial.snappy.SnappyNativeLoader.loadLibrary(
> SnappyNativeLoader.java:52)
> ... 52 more
> Exception in thread "main" org.xerial.snappy.SnappyError:
> [FAILED_TO_LOAD_NATIVE_LIBRARY] null
> at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
> at org.xerial.snappy.Snappy.(Snappy.java:44)
> at parquet.hadoop.codec.SnappyDecompressor.decompress(
> SnappyDecompressor.java:62)
> at parquet.hadoop.codec.NonBlockedDecompressorStream.read(
> NonBlockedDecompressorStream.java:51)
> at java.io.DataInputStream.readFully(DataInputStream.java:195)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
> at parquet.bytes.BytesInput$StreamBytesInput.toByteArray(
> BytesInput.java:204)
> at parquet.column.impl.ColumnReaderImpl.readPageV1(
> ColumnReaderImpl.java:557)
> at parquet.column.impl.ColumnReaderImpl.access$300(
> ColumnReaderImpl.java:57)
> at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:516)
> at parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:513)
> at parquet.column.page.DataPageV1.accept(DataPageV1.java:96)
> at parquet.column.impl.ColumnReaderImpl.readPage(
> ColumnReaderImpl.java:513)
> at parquet.column.impl.ColumnReaderImpl.checkRead(
> ColumnReaderImpl.java:505)
> at parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:607)
> at parquet.column.impl.ColumnReaderImpl.(ColumnReaderImpl.java:351)
> at parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(
> ColumnReadStoreImpl.java:66)
> at parquet.column.impl.ColumnReadStoreImpl.getColumnReader(
> ColumnReadStoreImpl.java:61)
> at parquet.io.RecordReaderImplementation.(
> RecordReaderImplementation.java:270)
> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
> at parquet.filter2.compat.FilterCompat$NoOpFilter.
> accept(FilterCompat.java:154)
> at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
> at parquet.hadoop.InternalParquetRecordReader.checkRead(
> InternalParquetRecordReader.java:137)
> at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(
> InternalParquetRecordReader.java:208)
> at parquet.hadoop.ParquetRecordReader.nextKeyValue(
> ParquetRecordReader.java:201)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<
> init>(ParquetRecordReaderWrapper.java:122)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<
> init>(ParquetRecordReaderWrapper.java:85)
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.
> getRecordReader(MapredParquetInputFormat.java:72)
> at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.
> getRecordReader(FetchOperator.java:673)
> at org.apache.hadoop.hive.ql.exec.FetchOperator.
> getRecordReader(FetchOperator.java:323)
> at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(
> FetchOperator.java:445)
> at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(
> FetchOperator.java:414)
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:140)
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1670)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
> CliDriver.java:233)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:736)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Feb 11, 2018 3:14:06 AM WARNING: parquet.hadoop.ParquetRecordReader: Can
> not initialize counter due to context is not a instance of
> TaskInputOutputContext, but is org.apache.hadoop.mapreduce.
> task.TaskAttemptContextImpl
> Feb 11, 2018 3:14:06 AM INFO: parquet.hadoop.InternalPar

Re: Spark Dataframe and HIVE

2018-02-11 Thread Deepak Sharma
There was a typo:
Instead of :
alter table mine set locations "hdfs://localhost:8020/user/
hive/warehouse/mine";

Use :
alter table mine set location "hdfs://localhost:8020/user/
hive/warehouse/mine";

On Sun, Feb 11, 2018 at 1:38 PM, Deepak Sharma <deepakmc...@gmail.com>
wrote:

> Try this in hive:
> alter table mine set locations "hdfs://localhost:8020/user/
> hive/warehouse/mine";
>
> Thanks
> Deepak
>
> On Sun, Feb 11, 2018 at 1:24 PM, ☼ R Nair (रविशंकर नायर) <
> ravishankar.n...@gmail.com> wrote:
>
>> Hi,
>> Here you go:
>>
>> hive> show create table mine;
>> OK
>> CREATE TABLE `mine`(
>>   `policyid` int,
>>   `statecode` string,
>>   `socialid` string,
>>   `county` string,
>>   `eq_site_limit` decimal(10,2),
>>   `hu_site_limit` decimal(10,2),
>>   `fl_site_limit` decimal(10,2),
>>   `fr_site_limit` decimal(10,2),
>>   `tiv_2014` decimal(10,2),
>>   `tiv_2015` decimal(10,2),
>>   `eq_site_deductible` int,
>>   `hu_site_deductible` int,
>>   `fl_site_deductible` int,
>>   `fr_site_deductible` int,
>>   `latitude` decimal(6,6),
>>   `longitude` decimal(6,6),
>>   `line` string,
>>   `construction` string,
>>   `point_granularity` int)
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>> WITH SERDEPROPERTIES (
>>   'path'='hdfs://localhost:8020/user/hive/warehouse/mine')
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>> LOCATION
>>   'file:/Users/ravishankarnair/spark-warehouse/mine'
>> TBLPROPERTIES (
>>   'spark.sql.sources.provider'='parquet',
>>   'spark.sql.sources.schema.numParts'='1',
>>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"
>> fields\":[{\"name\":\"policyid\",\"type\":\"integer\",\"
>> nullable\":true,\"metadata\":{\"name\":\"policyid\",\"scale\
>> ":0}},{\"name\":\"statecode\",\"type\":\"string\",\"
>> nullable\":true,\"metadata\":{\"name\":\"statecode\",\"
>> scale\":0}},{\"name\":\"Socialid\",\"type\":\"string\"
>> ,\"nullable\":true,\"metadata\":{\"name\":\"Socialid\",\"
>> scale\":0}},{\"name\":\"county\",\"type\":\"string\",\
>> "nullable\":true,\"metadata\":{\"name\":\"county\",\"scale\"
>> :0}},{\"name\":\"eq_site_limit\",\"type\":\"decimal(10,
>> 2)\",\"nullable\":true,\"metadata\":{\"name\":\"eq_
>> site_limit\",\"scale\":2}},{\"name\":\"hu_site_limit\",\"typ
>> e\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"
>> name\":\"hu_site_limit\",\"scale\":2}},{\"name\":\"fl_site_
>> limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"
>> metadata\":{\"name\":\"fl_site_limit\",\"scale\":2}},{\"
>> name\":\"fr_site_limit\",\"type\":\"decimal(10,2)\",\"nullab
>> le\":true,\"metadata\":{\"name\":\"fr_site_limit\",\"sca
>> le\":2}},{\"name\":\"tiv_2014\",\"type\":\"decimal(10,2)\",\
>> "nullable\":true,\"metadata\":{\"name\":\"tiv_2014\",\"
>> scale\":2}},{\"name\":\"tiv_2015\",\"type\":\"decimal(10,
>> 2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_
>> 2015\",\"scale\":2}},{\"name\":\"eq_site_deductible\",\"
>> type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":
>> \"eq_site_deductible\",\"scale\":0}},{\"name\":\"hu_
>> site_deductible\",\"type\":\"integer\",\"nullable\":true,\"
>> metadata\":{\"name\":\"hu_site_deductible\",\"scale\":0}},{\
>> "name\":\"fl_site_deductible\",\"type\":\"

Re: Spark Dataframe and HIVE

2018-02-11 Thread Deepak Sharma
ot;metadata\":{\"name\":\"eq_
>> site_limit\",\"scale\":2}},{\"name\":\"hu_site_limit\",\"typ
>> e\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{\"
>> name\":\"hu_site_limit\",\"scale\":2}},{\"name\":\"fl_site_
>> limit\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"
>> metadata\":{\"name\":\"fl_site_limit\",\"scale\":2}},{\"
>> name\":\"fr_site_limit\",\"type\":\"decimal(10,2)\",\"nullab
>> le\":true,\"metadata\":{\"name\":\"fr_site_limit\",\"sca
>> le\":2}},{\"name\":\"tiv_2014\",\"type\":\"decimal(10,2)\",\
>> "nullable\":true,\"metadata\":{\"name\":\"tiv_2014\",\"
>> scale\":2}},{\"name\":\"tiv_2015\",\"type\":\"decimal(10,
>> 2)\",\"nullable\":true,\"metadata\":{\"name\":\"tiv_
>> 2015\",\"scale\":2}},{\"name\":\"eq_site_deductible\",\"
>> type\":\"integer\",\"nullable\":true,\"metadata\":{\"name\":
>> \"eq_site_deductible\",\"scale\":0}},{\"name\":\"hu_
>> site_deductible\",\"type\":\"integer\",\"nullable\":true,\"
>> metadata\":{\"name\":\"hu_site_deductible\",\"scale\":0}},{\
>> "name\":\"fl_site_deductible\",\"type\":\"integer\",\"
>> nullable\":true,\"metadata\":{\"name\":\"fl_site_deductible\
>> ",\"scale\":0}},{\"name\":\"fr_site_deductible\",\"type\":
>> \"integer\",\"nullable\":true,\"metadata\":{\"name\":\"fr_si
>> te_deductible\",\"scale\":0}},{\"name\":\"latitude\",\"type\
>> ":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":
>> \"latitude\",\"scale\":6}},{\"name\":\"longitude\",\"type\":
>> \"decimal(6,6)\",\"nullable\":true,\"metadata\":{\"name\":\"
>> longitude\",\"scale\":6}},{\"name\":\"line\",\"type\":\"
>> string\",\"nullable\":true,\"metadata\":{\"name\":\"line\",
>> \"scale\":0}},{\"name\":\"construction\",\"type\":\"
>> string\",\"nullable\":true,\"metadata\":{\"name\":\"
>> construction\",\"scale\":0}},{\"name\":\"point_granularity\"
>> ,\"type\":\"integer\",\"nullable\":true,\"metadata\":{
>> \"name\":\"point_granularity\",\"scale\":0}}]}',
>>   'transient_lastDdlTime'='1518335598')
>> Time taken: 0.13 seconds, Fetched: 35 row(s)
>>
>> On Sun, Feb 11, 2018 at 2:36 AM, Shmuel Blitz <
>> shmuel.bl...@similarweb.com> wrote:
>>
>>> Please run the following command, and paste the result:
>>> SHOW CREATE TABLE <>
>>>
>>> On Sun, Feb 11, 2018 at 7:56 AM, ☼ R Nair (रविशंकर नायर) <
>>> ravishankar.n...@gmail.com> wrote:
>>>
>>>> No, No luck.
>>>>
>>>> Thanks
>>>>
>>>> On Sun, Feb 11, 2018 at 12:48 AM, Deepak Sharma <deepakmc...@gmail.com>
>>>> wrote:
>>>>
>>>>> In hive cli:
>>>>> msck repair table 《table_name》;
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Feb 11, 2018 11:14, "☼ R Nair (रविशंकर नायर)" <
>>>>> ravishankar.n...@gmail.com> wrote:
>>>>>
>>>>>> NO, can you pease explain the command ? Let me try now.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> On Sun, Feb 11, 2018 at 12:40 AM, Deepak Sharma <
>>>>>> deepakmc...@gmail.com> wrote:
>>>>>>
>>>>>>> I am not sure about the exact issue bjt i see you are partioning
>>>>>>> while writing from spark.
>>>>>>> Did you tried msck repair on the table before

Re: Spark Dataframe and HIVE

2018-02-11 Thread Deepak Sharma
Try this in hive:
alter table mine set locations "hdfs://localhost:8020/
user/hive/warehouse/mine";

Thanks
Deepak

On Sun, Feb 11, 2018 at 1:24 PM, ☼ R Nair (रविशंकर नायर) <
ravishankar.n...@gmail.com> wrote:

> Hi,
> Here you go:
>
> hive> show create table mine;
> OK
> CREATE TABLE `mine`(
>   `policyid` int,
>   `statecode` string,
>   `socialid` string,
>   `county` string,
>   `eq_site_limit` decimal(10,2),
>   `hu_site_limit` decimal(10,2),
>   `fl_site_limit` decimal(10,2),
>   `fr_site_limit` decimal(10,2),
>   `tiv_2014` decimal(10,2),
>   `tiv_2015` decimal(10,2),
>   `eq_site_deductible` int,
>   `hu_site_deductible` int,
>   `fl_site_deductible` int,
>   `fr_site_deductible` int,
>   `latitude` decimal(6,6),
>   `longitude` decimal(6,6),
>   `line` string,
>   `construction` string,
>   `point_granularity` int)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'path'='hdfs://localhost:8020/user/hive/warehouse/mine')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/Users/ravishankarnair/spark-warehouse/mine'
> TBLPROPERTIES (
>   'spark.sql.sources.provider'='parquet',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",
> \"fields\":[{\"name\":\"policyid\",\"type\":\"integer\
> ",\"nullable\":true,\"metadata\":{\"name\":\"policyid\",\"scale\":0}},{\"
> name\":\"statecode\",\"type\":\"string\",\"nullable\":true,\
> "metadata\":{\"name\":\"statecode\",\"scale\":0}},{\"
> name\":\"Socialid\",\"type\":\"string\",\"nullable\":true,\"
> metadata\":{\"name\":\"Socialid\",\"scale\":0}},{\"
> name\":\"county\",\"type\":\"string\",\"nullable\":true,\"
> metadata\":{\"name\":\"county\",\"scale\":0}},{\"name\":\"
> eq_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":
> true,\"metadata\":{\"name\":\"eq_site_limit\",\"scale\":2}},
> {\"name\":\"hu_site_limit\",\"type\":\"decimal(10,2)\",\"
> nullable\":true,\"metadata\":{\"name\":\"hu_site_limit\",\"
> scale\":2}},{\"name\":\"fl_site_limit\",\"type\":\"
> decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"
> fl_site_limit\",\"scale\":2}},{\"name\":\"fr_site_limit\",\"
> type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{
> \"name\":\"fr_site_limit\",\"scale\":2}},{\"name\":\"tiv_
> 2014\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"
> metadata\":{\"name\":\"tiv_2014\",\"scale\":2}},{\"name\"
> :\"tiv_2015\",\"type\":\"decimal(10,2)\",\"nullable\":
> true,\"metadata\":{\"name\":\"tiv_2015\",\"scale\":2}},{\"
> name\":\"eq_site_deductible\",\"type\":\"integer\",\"
> nullable\":true,\"metadata\":{\"name\":\"eq_site_deductible\
> ",\"scale\":0}},{\"name\":\"hu_site_deductible\",\"type\":
> \"integer\",\"nullable\":true,\"metadata\":{\"name\":\"hu_
> site_deductible\",\"scale\":0}},{\"name\":\"fl_site_
> deductible\",\"type\":\"integer\",\"nullable\":true,\"
> metadata\":{\"name\":\"fl_site_deductible\",\"scale\":0}
> },{\"name\":\"fr_site_deductible\",\"type\":\"
> integer\",\"nullable\":true,\"metadata\":{\"name\":\"fr_
> site_deductible\",\"scale\":0}},{\"name\":\"latitude\",\"
> type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{
> \"name\":\"latitude\",\"scale\":6}},{\"name\&q

Re: Spark Dataframe and HIVE

2018-02-10 Thread Deepak Sharma
In hive cli:
msck repair table 《table_name》;

Thanks
Deepak

On Feb 11, 2018 11:14, "☼ R Nair (रविशंकर नायर)" <ravishankar.n...@gmail.com>
wrote:

> NO, can you pease explain the command ? Let me try now.
>
> Best,
>
> On Sun, Feb 11, 2018 at 12:40 AM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> I am not sure about the exact issue bjt i see you are partioning while
>> writing from spark.
>> Did you tried msck repair on the table before reading it in hive ?
>>
>> Thanks
>> Deepak
>>
>> On Feb 11, 2018 11:06, "☼ R Nair (रविशंकर नायर)" <
>> ravishankar.n...@gmail.com> wrote:
>>
>>> All,
>>>
>>> Thanks for the inputs. Again I am not successful. I think, we need to
>>> resolve this, as this is a very common requirement.
>>>
>>> Please go through my complete code:
>>>
>>> STEP 1:  Started Spark shell as spark-shell --master yarn
>>>
>>> STEP 2: Flowing code is being given as inout to shark shell
>>>
>>> import org.apache.spark.sql.Row
>>> import org.apache.spark.sql.SparkSession
>>> val warehouseLocation ="/user/hive/warehouse"
>>>
>>> val spark = SparkSession.builder().appName("Spark Hive
>>> Example").config("spark.sql.warehouse.dir",
>>> warehouseLocation).enableHiveSupport().getOrCreate()
>>>
>>> import org.apache.spark.sql._
>>> var passion_df = spark.read.
>>> format("jdbc").
>>> option("url", "jdbc:mysql://localhost:3307/policies").
>>> option("driver" ,"com.mysql.jdbc.Driver").
>>> option("user", "root").
>>> option("password", "root").
>>> option("dbtable", "insurancedetails").
>>> option("partitionColumn", "policyid").
>>> option("lowerBound", "1").
>>> option("upperBound", "10").
>>> option("numPartitions", "4").
>>> load()
>>> //Made sure that passion_df is created, as passion_df.show(5) shows me
>>> correct data.
>>> passion_df.write.saveAsTable("default.mine") //Default parquet
>>>
>>> STEP 3: Went to HIVE. Started HIVE prompt.
>>>
>>> hive> show tables;
>>> OK
>>> callcentervoicelogs
>>> mine
>>> Time taken: 0.035 seconds, Fetched: 2 row(s)
>>> //As you can see HIVE is showing the table "mine" in default schema.
>>>
>>> STEP 4: HERE IS THE PROBLEM.
>>>
>>> hive> select * from mine;
>>> OK
>>> Time taken: 0.354 seconds
>>> hive>
>>> //Where is the data ???
>>>
>>> STEP 5:
>>>
>>> See the below command on HIVE
>>>
>>> describe formatted mine;
>>> OK
>>> # col_name data_type   comment
>>>
>>> policyid int
>>> statecode   string
>>> socialid string
>>> county   string
>>> eq_site_limit   decimal(10,2)
>>> hu_site_limit   decimal(10,2)
>>> fl_site_limit   decimal(10,2)
>>> fr_site_limit   decimal(10,2)
>>> tiv_2014 decimal(10,2)
>>> tiv_2015 decimal(10,2)
>>> eq_site_deductible   int
>>> hu_site_deductible   int
>>> fl_site_deductible   int
>>> fr_site_deductible   int
>>> latitude decimal(6,6)
>>> longitude   decimal(6,6)
>>> line string
>>> construction string
>>> point_granularity   int
>>>
>>> # Detailed Table Information
>>> Database:   default
>>> Owner:   ravishankarnair
>>> CreateTime: Sun Feb 11 00:26:40 EST 2018
>>> LastAccessTime: UNKNOWN
>>> Protect Mode:   None
>>> Retention:   0
>>> Location:   file:/Users/ravishankarnair/spark-warehouse/mine
>>> Table Type: MANAGED_TABLE
>>> Table Parameters:
>>> spark.sql.sources.provider parquet
>>> spark.sql.sources.schema.numParts 1
>>> spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\
>>> ":[{\"name\":\"policyid\",\"type\":\"integer\",\"nullable\":
>>> true,\"metadata\":{\"name\":\"policyid\",

Re: Spark Dataframe and HIVE

2018-02-10 Thread Deepak Sharma
I am not sure about the exact issue, but I see you are partitioning while
writing from Spark.
Did you try msck repair on the table before reading it in Hive?

Thanks
Deepak

On Feb 11, 2018 11:06, "☼ R Nair (रविशंकर नायर)" 
wrote:

> All,
>
> Thanks for the inputs. Again I am not successful. I think, we need to
> resolve this, as this is a very common requirement.
>
> Please go through my complete code:
>
> STEP 1:  Started Spark shell as spark-shell --master yarn
>
> STEP 2: Flowing code is being given as inout to shark shell
>
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> val warehouseLocation ="/user/hive/warehouse"
>
> val spark = SparkSession.builder().appName("Spark Hive
> Example").config("spark.sql.warehouse.dir", warehouseLocation).
> enableHiveSupport().getOrCreate()
>
> import org.apache.spark.sql._
> var passion_df = spark.read.
> format("jdbc").
> option("url", "jdbc:mysql://localhost:3307/policies").
> option("driver" ,"com.mysql.jdbc.Driver").
> option("user", "root").
> option("password", "root").
> option("dbtable", "insurancedetails").
> option("partitionColumn", "policyid").
> option("lowerBound", "1").
> option("upperBound", "10").
> option("numPartitions", "4").
> load()
> //Made sure that passion_df is created, as passion_df.show(5) shows me
> correct data.
> passion_df.write.saveAsTable("default.mine") //Default parquet
>
> STEP 3: Went to HIVE. Started HIVE prompt.
>
> hive> show tables;
> OK
> callcentervoicelogs
> mine
> Time taken: 0.035 seconds, Fetched: 2 row(s)
> //As you can see HIVE is showing the table "mine" in default schema.
>
> STEP 4: HERE IS THE PROBLEM.
>
> hive> select * from mine;
> OK
> Time taken: 0.354 seconds
> hive>
> //Where is the data ???
>
> STEP 5:
>
> See the below command on HIVE
>
> describe formatted mine;
> OK
> # col_name data_type   comment
>
> policyid int
> statecode   string
> socialid string
> county   string
> eq_site_limit   decimal(10,2)
> hu_site_limit   decimal(10,2)
> fl_site_limit   decimal(10,2)
> fr_site_limit   decimal(10,2)
> tiv_2014 decimal(10,2)
> tiv_2015 decimal(10,2)
> eq_site_deductible   int
> hu_site_deductible   int
> fl_site_deductible   int
> fr_site_deductible   int
> latitude decimal(6,6)
> longitude   decimal(6,6)
> line string
> construction string
> point_granularity   int
>
> # Detailed Table Information
> Database:   default
> Owner:   ravishankarnair
> CreateTime: Sun Feb 11 00:26:40 EST 2018
> LastAccessTime: UNKNOWN
> Protect Mode:   None
> Retention:   0
> Location:   file:/Users/ravishankarnair/spark-warehouse/mine
> Table Type: MANAGED_TABLE
> Table Parameters:
> spark.sql.sources.provider parquet
> spark.sql.sources.schema.numParts 1
> spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\
> ":[{\"name\":\"policyid\",\"type\":\"integer\",\"nullable\
> ":true,\"metadata\":{\"name\":\"policyid\",\"scale\":0}},{\"
> name\":\"statecode\",\"type\":\"string\",\"nullable\":true,\
> "metadata\":{\"name\":\"statecode\",\"scale\":0}},{\"
> name\":\"Socialid\",\"type\":\"string\",\"nullable\":true,\"
> metadata\":{\"name\":\"Socialid\",\"scale\":0}},{\"
> name\":\"county\",\"type\":\"string\",\"nullable\":true,\"
> metadata\":{\"name\":\"county\",\"scale\":0}},{\"name\":\"
> eq_site_limit\",\"type\":\"decimal(10,2)\",\"nullable\":
> true,\"metadata\":{\"name\":\"eq_site_limit\",\"scale\":2}},
> {\"name\":\"hu_site_limit\",\"type\":\"decimal(10,2)\",\"
> nullable\":true,\"metadata\":{\"name\":\"hu_site_limit\",\"
> scale\":2}},{\"name\":\"fl_site_limit\",\"type\":\"
> decimal(10,2)\",\"nullable\":true,\"metadata\":{\"name\":\"
> fl_site_limit\",\"scale\":2}},{\"name\":\"fr_site_limit\",\"
> type\":\"decimal(10,2)\",\"nullable\":true,\"metadata\":{
> \"name\":\"fr_site_limit\",\"scale\":2}},{\"name\":\"tiv_
> 2014\",\"type\":\"decimal(10,2)\",\"nullable\":true,\"
> metadata\":{\"name\":\"tiv_2014\",\"scale\":2}},{\"name\"
> :\"tiv_2015\",\"type\":\"decimal(10,2)\",\"nullable\":
> true,\"metadata\":{\"name\":\"tiv_2015\",\"scale\":2}},{\"
> name\":\"eq_site_deductible\",\"type\":\"integer\",\"
> nullable\":true,\"metadata\":{\"name\":\"eq_site_deductible\
> ",\"scale\":0}},{\"name\":\"hu_site_deductible\",\"type\":
> \"integer\",\"nullable\":true,\"metadata\":{\"name\":\"hu_
> site_deductible\",\"scale\":0}},{\"name\":\"fl_site_
> deductible\",\"type\":\"integer\",\"nullable\":true,\"
> metadata\":{\"name\":\"fl_site_deductible\",\"scale\":0}
> },{\"name\":\"fr_site_deductible\",\"type\":\"
> integer\",\"nullable\":true,\"metadata\":{\"name\":\"fr_
> site_deductible\",\"scale\":0}},{\"name\":\"latitude\",\"
> type\":\"decimal(6,6)\",\"nullable\":true,\"metadata\":{
> \"name\":\"latitude\",\"scale\":6}},{\"name\":\"longitude\",
> 

CI/CD for spark and scala

2018-01-24 Thread Deepak Sharma
Hi All,
I just wanted to check if there are any best practices around using CI/CD
for Spark/Scala projects running on AWS Hadoop clusters.
If there are any specific tools, please do let me know.

-- 
Thanks
Deepak


Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Deepak Sharma
I am not sure about Java, but in Scala it would be something like
df.rdd.map{ x => MyClass(x.getString(0),.)}
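
A slightly fuller sketch of the same idea, assuming Person is a simple case
class and someFunction is a hypothetical stand-in for the function to be
called per row:

case class Person(id: Long, name: String)
val people = person_dataframe.rdd.map { row =>
  Person(row.getDecimal(0).longValue(), row.getString(1))
}
// runs on the executors, so nothing is collected to the driver
people.foreach(p => someFunction(p.id))   // someFunction is hypothetical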

HTH

--Deepak

On Dec 19, 2017 09:25, "Sunitha Chennareddy" 
wrote:

Hi All,

I am new to Spark, I want to convert DataFrame to List with out
using collect().

Main requirement is I need to iterate through the rows of dataframe and
call another function by passing column value of each row (person.getId())

Here is the snippet I have tried, Kindly help me to resolve the issue,
personLst is returning 0:

List personLst= new ArrayList();
JavaRDD personRDD = person_dataframe.toJavaRDD().map(new
Function() {
  public Person call(Row row)  throws Exception{
  Person person = new Person();
  person.setId(row.getDecimal(0).longValue());
  person.setName(row.getString(1));

personLst.add(person);
// here I tried to call another function but control never passed
return person;
  }
});
logger.info("personLst size =="+personLst.size());
logger.info("personRDD count ==="+personRDD.count());

//output is
personLst size == 0
personRDD count === 3


Re: Spark based Data Warehouse

2017-11-13 Thread Deepak Sharma
If you have only one user, it is still possible to execute non-blocking,
long-running queries.
The best way is to have different users, with pre-assigned resources, run
their queries.
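
One way to do that inside a single application is the FAIR scheduler with
named pools; a minimal sketch (Spark 2.x API, pool name and query are
placeholders):

// in spark-defaults.conf or SparkConf: spark.scheduler.mode=FAIR
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "analyst_pool")
val result = spark.sql("SELECT count(*) FROM some_table")

Queries submitted from other threads or users can be assigned to different
pools with their own weights and minimum shares.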

HTH

Thanks
Deepak

On Nov 13, 2017 23:56, "ashish rawat" <dceash...@gmail.com> wrote:

> Thanks Everyone. I am still not clear on what is the right way to execute
> support multiple users, running concurrent queries with Spark. Is it
> through multiple spark contexts or through Livy (which creates a single
> spark context only).
>
> Also, what kind of isolation is possible with Spark SQL? If one user fires
> a big query, then would that choke all other queries in the cluster?
>
> Regards,
> Ashish
>
> On Mon, Nov 13, 2017 at 3:10 AM, Patrick Alwell <palw...@hortonworks.com>
> wrote:
>
>> Alcon,
>>
>>
>>
>> You can most certainly do this. I’ve done benchmarking with Spark SQL and
>> the TPCDS queries using S3 as the filesystem.
>>
>>
>>
>> Zeppelin and Livy server work well for the dash boarding and concurrent
>> query issues:  https://hortonworks.com/blog/
>> livy-a-rest-interface-for-apache-spark/
>>
>>
>>
>> Livy Server will allow you to create multiple spark contexts via REST:
>> https://livy.incubator.apache.org/
>>
>>
>>
>> If you are looking for broad SQL functionality I’d recommend
>> instantiating a Hive context. And Spark is able to spill to disk à
>> https://spark.apache.org/faq.html
>>
>>
>>
>> There are multiple companies running spark within their data warehouse
>> solutions: https://ibmdatawarehousing.wordpress.com/2016/10/12/steinbac
>> h_dashdb_local_spark/
>>
>>
>>
>> Edmunds used Spark to allow business analysts to point Spark to files in
>> S3 and infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0
>>
>>
>>
>> Recommend running some benchmarks and testing query scenarios for your
>> end users; but it sounds like you’ll be using it for exploratory analysis.
>> Spark is great for this ☺
>>
>>
>>
>> -Pat
>>
>>
>>
>>
>>
>> *From: *Vadim Semenov <vadim.seme...@datadoghq.com>
>> *Date: *Sunday, November 12, 2017 at 1:06 PM
>> *To: *Gourav Sengupta <gourav.sengu...@gmail.com>
>> *Cc: *Phillip Henry <londonjava...@gmail.com>, ashish rawat <
>> dceash...@gmail.com>, Jörn Franke <jornfra...@gmail.com>, Deepak Sharma <
>> deepakmc...@gmail.com>, spark users <user@spark.apache.org>
>> *Subject: *Re: Spark based Data Warehouse
>>
>>
>>
>> It's actually quite simple to answer
>>
>>
>>
>> > 1. Is Spark SQL and UDF, able to handle all the workloads?
>>
>> Yes
>>
>>
>>
>> > 2. What user interface did you provide for data scientist, data
>> engineers and analysts
>>
>> Home-grown platform, EMR, Zeppelin
>>
>>
>>
>> > What are the challenges in running concurrent queries, by many users,
>> over Spark SQL? Considering Spark still does not provide spill to disk, in
>> many scenarios, are there frequent query failures when executing concurrent
>> queries
>>
>> You can run separate Spark Contexts, so jobs will be isolated
>>
>>
>>
>> > Are there any open source implementations, which provide something
>> similar?
>>
>> Yes, many.
>>
>>
>>
>>
>>
>> On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>> Dear Ashish,
>>
>> what you are asking for involves at least a few weeks of dedicated
>> understanding of your used case and then it takes at least 3 to 4 months to
>> even propose a solution. You can even build a fantastic data warehouse just
>> using C++. The matter depends on lots of conditions. I just think that your
>> approach and question needs a lot of modification.
>>
>>
>>
>> Regards,
>>
>> Gourav
>>
>>
>>
>> On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <londonjava...@gmail.com>
>> wrote:
>>
>> Hi, Ashish.
>>
>> You are correct in saying that not *all* functionality of Spark is
>> spill-to-disk but I am not sure how this pertains to a "concurrent user
>> scenario". Each executor will run in its own JVM and is therefore isolated
>> from others. That is, if the JVM of one user dies, this should not effect
>> another user who is running their own jobs in their own JVMs. The amount of
>> resources used by a user 

Re: Spark based Data Warehouse

2017-11-11 Thread Deepak Sharma
I am looking for a similar solution, more aligned to the data scientist group.
The concern I have is about supporting complex aggregations at runtime.

Thanks
Deepak

On Nov 12, 2017 12:51, "ashish rawat"  wrote:

> Hello Everyone,
>
> I was trying to understand if anyone here has tried a data warehouse
> solution using S3 and Spark SQL. Out of multiple possible options
> (redshift, presto, hive etc), we were planning to go with Spark SQL, for
> our aggregates and processing requirements.
>
> If anyone has tried it out, would like to understand the following:
>
>1. Is Spark SQL and UDF, able to handle all the workloads?
>2. What user interface did you provide for data scientist, data
>engineers and analysts
>3. What are the challenges in running concurrent queries, by many
>users, over Spark SQL? Considering Spark still does not provide spill to
>disk, in many scenarios, are there frequent query failures when executing
>concurrent queries
>4. Are there any open source implementations, which provide something
>similar?
>
>
> Regards,
> Ashish
>


Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Deepak Sharma
I guess the issue is that spark.default.parallelism is ignored when you are
working with DataFrames. It is supposed to work only with raw RDDs.
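
For DataFrames the corresponding knob is spark.sql.shuffle.partitions, which
only affects shuffles (joins, aggregations); for an exact partition count on
an existing DataFrame, an explicit repartition is the direct route. A minimal
sketch (df is the DataFrame from the quoted code; Spark 2.x API assumed):

spark.conf.set("spark.sql.shuffle.partitions", "4")   // partitions produced by shuffles
val df4 = df.repartition(4)                           // force an exact count on this DataFrame
df4.rdd.getNumPartitions                              // should report 4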

Thanks
Deepak

On Thu, Oct 26, 2017 at 10:05 PM, Noorul Islam Kamal Malmiyoda <
noo...@noorul.com> wrote:

> Hi all,
>
> I have the following spark configuration
>
> spark.app.name=Test
> spark.cassandra.connection.host=127.0.0.1
> spark.cassandra.connection.keep_alive_ms=5000
> spark.cassandra.connection.port=1
> spark.cassandra.connection.timeout_ms=3
> spark.cleaner.ttl=3600
> spark.default.parallelism=4
> spark.master=local[2]
> spark.ui.enabled=false
> spark.ui.showConsoleProgress=false
>
> Because I am setting spark.default.parallelism to 4, I was expecting
> only 4 spark partitions. But it looks like it is not the case
>
> When I do the following
>
> df.foreachPartition { partition =>
>   val groupedPartition = partition.toList.grouped(3).toList
>   println("Grouped partition " + groupedPartition)
> }
>
> There are too many print statements with empty list at the top. Only
> the relevant partitions are at the bottom. Is there a way to control
> number of partitions?
>
> Regards,
> Noorul
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: What is the equivalent of forearchRDD in DataFrames?

2017-10-26 Thread Deepak Sharma
df.rdd.foreach
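
A minimal sketch of the partition-wise variant described in the question below
(postToRestApi is a hypothetical helper for the REST call):

df.rdd.foreachPartition { rows =>
  // runs on the executors; send the rows in batches of 100
  rows.grouped(100).foreach(batch => postToRestApi(batch))
}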

Thanks
Deepak

On Oct 26, 2017 18:07, "Noorul Islam Kamal Malmiyoda" 
wrote:

> Hi all,
>
> I have a Dataframe with 1000 records. I want to split them into 100
> each and post to rest API.
>
> If it was RDD, I could use something like this
>
> myRDD.foreachRDD {
>   rdd =>
> rdd.foreachPartition {
>   partition => {
>
> This will ensure that code is executed on executors and not on driver.
>
> Is there any similar approach that we can take for Dataframes? I see
> examples on stackoverflow with collect() which will bring whole data
> to driver.
>
> Thanks and Regards
> Noorul
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Write to HDFS

2017-10-20 Thread Deepak Sharma
Better to use coalesce instead of repartition.
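
A minimal sketch for the word-count example quoted below, writing a single
output file without collecting to the driver (coalesce(1) avoids the full
shuffle that repartition(1) would trigger):

counts.coalesce(1).saveAsTextFile("hdfs://master:8020/user/abc")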

On Fri, Oct 20, 2017 at 9:47 PM, Marco Mistroni  wrote:

> Use  counts.repartition(1).save..
> Hth
>
>
> On Oct 20, 2017 3:01 PM, "Uğur Sopaoğlu"  wrote:
>
> Actually, when I run following code,
>
>   val textFile = sc.textFile("Sample.txt")
>   val counts = textFile.flatMap(line => line.split(" "))
>  .map(word => (word, 1))
>  .reduceByKey(_ + _)
>
>
> It save the results into more than one partition like part-0,
> part-1. I want to collect all of them into one file.
>
>
> 2017-10-20 16:43 GMT+03:00 Marco Mistroni :
>
>> Hi
>>  Could you just create an rdd/df out of what you want to save and store
>> it in hdfs?
>> Hth
>>
>> On Oct 20, 2017 9:44 AM, "Uğur Sopaoğlu"  wrote:
>>
>>> Hi all,
>>>
>>> In word count example,
>>>
>>>   val textFile = sc.textFile("Sample.txt")
>>>   val counts = textFile.flatMap(line => line.split(" "))
>>>  .map(word => (word, 1))
>>>  .reduceByKey(_ + _)
>>>  counts.saveAsTextFile("hdfs://master:8020/user/abc")
>>>
>>> I want to write collection of "*counts" *which is used in code above to
>>> HDFS, so
>>>
>>> val x = counts.collect()
>>>
>>> Actually I want to write *x *to HDFS. But spark wants to RDD to write
>>> sometihng to HDFS
>>>
>>> How can I write Array[(String,Int)] to HDFS
>>>
>>>
>>> --
>>> Uğur
>>>
>>
>
>
> --
> Uğur Sopaoğlu
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: How can i split dataset to multi dataset

2017-08-06 Thread Deepak Sharma
This can be mapped as below:
dataset.map(x => ((x(0), x(1), x(2)), x))

This works with a DataFrame of Rows, but I haven't tried it with Dataset.
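
A fuller sketch of splitting by the composite key, building one filtered view
per key (assumes the number of distinct keys is small; column names from the
quoted schema):

import org.apache.spark.sql.functions.col
val keys = dataset.select("app", "server", "file").distinct().collect()
val byKey = keys.map { k =>
  (k.getString(0), k.getString(1), k.getString(2)) ->
    dataset.filter(col("app") === k(0) && col("server") === k(1) && col("file") === k(2))
}.toMap
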
Thanks
Deepak

On Mon, Aug 7, 2017 at 8:21 AM, Jone Zhang  wrote:

> val schema = StructType(
>   Seq(
>   StructField("app", StringType, nullable = true),
>   StructField("server", StringType, nullable = true),
>   StructField("file", StringType, nullable = true),
>   StructField("...", StringType, nullable = true)
>   )
> )
> val row = ...
> val dataset = session.createDataFrame(row, schema)
>
> How can i split dataset to dataset array by composite key(app,
> server,file) as follow
> mapdataset>
>
>
> Thanks.
>
>
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Spark ES Connector -- AWS Managed ElasticSearch Services

2017-08-01 Thread Deepak Sharma
I am trying to connect to the AWS managed ES service using the Spark ES
Connector, but am not able to.

I am passing es.nodes and es.port along with es.nodes.wan.only set to true.
But it fails with below error:

34 ERROR NetworkClient: Node [x.x.x.x:443] failed (The server x.x.x.x
failed to respond); no other nodes left - aborting...

org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES
version - typically this happens if the network/Elasticsearch cluster is
not accessible or when targeting a WAN/Cloud instance without the proper
setting 'es.nodes.wan.only'

  at
org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:294)

  at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:103)

  at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:79)

  at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:76)

  ... 50 elided

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[x.x.x.x:443]]

I just wanted to check if anyone has already connected to the managed ES
service of AWS from Spark, and how it can be done?
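
For reference, the settings described above look roughly like this in code;
the endpoint and index are placeholders, and es.net.ssl is typically also
needed when the domain is reached over HTTPS on port 443:

import org.elasticsearch.spark._
val esConf = Map(
  "es.nodes"          -> "vpc-mydomain.us-west-2.es.amazonaws.com",
  "es.port"           -> "443",
  "es.nodes.wan.only" -> "true",
  "es.net.ssl"        -> "true"
)
sc.makeRDD(Seq(Map("message" -> "test"))).saveToEs("myindex/mytype", esConf)
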
-- 
Thanks
Deepak


Hive Context and SQL Context interoperability

2017-04-13 Thread Deepak Sharma
Hi All,
I have registered temp tables using both the Hive context and the SQL context.
Now when I try to join these 2 temp tables, one of the tables complains about
not being found.
Is there any setting or option so that the tables in these 2 different contexts
are visible to each other?
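
Temporary tables are scoped to the context instance that registered them, so
the usual workaround is to create and register both through the same
HiveContext; a minimal sketch (paths and join key are placeholders):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val dfA = hc.read.parquet("/path/a")
val dfB = hc.read.parquet("/path/b")
dfA.registerTempTable("table_a")
dfB.registerTempTable("table_b")
hc.sql("SELECT * FROM table_a a JOIN table_b b ON a.id = b.id").show()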

-- 
Thanks
Deepak


Re: Check if dataframe is empty

2017-03-07 Thread Deepak Sharma
On Tue, Mar 7, 2017 at 2:37 PM, Nick Pentreath 
wrote:

> df.take(1).isEmpty should work


My bad.
It will return empty array:
 emptydf.take(1)
res0: Array[org.apache.spark.sql.Row] = Array()

and applying isEmpty would return boolean
 emptydf.take(1).isEmpty
res2: Boolean = true




-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Check if dataframe is empty

2017-03-06 Thread Deepak Sharma
If the df is empty , the .take would return
java.util.NoSuchElementException.
This can be done as below:
df.rdd.isEmpty


On Tue, Mar 7, 2017 at 9:33 AM,  wrote:

> Dataframe.take(1) is faster.
>
>
>
> *From:* ashaita...@nz.imshealth.com [mailto:ashaita...@nz.imshealth.com]
> *Sent:* Tuesday, March 07, 2017 9:22 AM
> *To:* user@spark.apache.org
> *Subject:* Check if dataframe is empty
>
>
>
> Hello!
>
>
>
> I am pretty sure that I am asking something which has been already asked
> lots of times. However, I cannot find the question in the mailing list
> archive.
>
>
>
> The question is – I need to check whether dataframe is empty or not. I
> receive a dataframe from 3rd party library and this dataframe can be
> potentially empty, but also can be really huge – millions of rows. Thus, I
> want to avoid of doing some logic in case the dataframe is empty. How can I
> efficiently check it?
>
>
>
> Right now I am doing it in the following way:
>
>
>
> *private def *isEmpty(df: Option[DataFrame]): Boolean = {
>   df.isEmpty || (df.isDefined && df.get.limit(1).*rdd*.isEmpty())
> }
>
>
>
> But the performance is really slow for big dataframes. I would be grateful
> for any suggestions.
>
>
>
> Thank you in advance.
>
>
>
>
> Best regards,
>
>
>
> Artem
>
>
> --
>
> ** IMPORTANT--PLEASE READ 
> This electronic message, including its attachments, is CONFIDENTIAL and may
> contain PROPRIETARY or LEGALLY PRIVILEGED or PROTECTED information and is
> intended for the authorized recipient of the sender. If you are not the
> intended recipient, you are hereby notified that any use, disclosure,
> copying, or distribution of this message or any of the information included
> in it is unauthorized and strictly prohibited. If you have received this
> message in error, please immediately notify the sender by reply e-mail and
> permanently delete this message and its attachments, along with any copies
> thereof, from all locations received (e.g., computer, mobile device, etc.).
> Thank you. 
> 
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
> 
> __
>
> www.accenture.com
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: how to compare two avro format hive tables

2017-01-30 Thread Deepak Sharma
You can use spark-testing-base's RDD comparators.
Create 2 different DataFrames from these 2 Hive tables.
Convert them to RDDs and use spark-testing-base's compareRDD.

Here is an example for rdd comparison:
https://github.com/holdenk/spark-testing-base/wiki/RDDComparisons
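
A plain Spark alternative that needs no extra library is a symmetric
difference with except; a minimal sketch (hc is a HiveContext, table names are
placeholders; note that except compares distinct rows):

val df1 = hc.sql("SELECT * FROM db_name.table_a")
val df2 = hc.sql("SELECT * FROM db_name.table_b")
val onlyInA = df1.except(df2).count()   // rows in A but not in B
val onlyInB = df2.except(df1).count()   // rows in B but not in A
// both counts being zero means the two tables hold the same (distinct) rows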


On Mon, Jan 30, 2017 at 9:07 PM, Alex  wrote:

> Hi Team,
>
> how to compare two avro format hive tables if there is same data in it
>
> if i give limit 5 its giving different results
>
>
>
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: spark architecture question -- Pleas Read

2017-01-29 Thread Deepak Sharma
The better way is to read the data directly into Spark using the Spark SQL
JDBC read.
Apply the UDFs locally.
Then save the DataFrame back to Oracle using the DataFrame's JDBC write.
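
A minimal sketch of that round trip (connection details, table and column
names are placeholders, and the UDF body just stands in for the real
transformation logic):

import org.apache.spark.sql.functions.{col, udf}

val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
val props = new java.util.Properties()
props.setProperty("user", "scott")
props.setProperty("password", "tiger")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

val upperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)

val src = spark.read.jdbc(jdbcUrl, "SOURCE_TABLE", props)      // read from Oracle
val out = src.withColumn("NAME_UC", upperUdf(col("NAME")))     // apply the UDF in Spark
out.write.mode("append").jdbc(jdbcUrl, "TARGET_TABLE", props)  // write back to Oracle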

Thanks
Deepak

On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:

> One alternative could be the oracle Hadoop loader and other Oracle
> products, but you have to invest some money and probably buy their Hadoop
> Appliance, which you have to evaluate if it make sense (can get expensive
> with large clusters etc).
>
> Another alternative would be to get rid of Oracle alltogether and use
> other databases.
>
> However, can you elaborate a little bit on your use case and the business
> logic as well as SLA requires. Otherwise all recommendations are right
> because the requirements you presented are very generic.
>
> About get rid of Hadoop - this depends! You will need some resource
> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
> file system. Spark supports through the Hadoop apis a wide range of file
> systems, but does not need HDFS for persistence. You can have local
> filesystem (ie any file system mounted to a node, so also distributed ones,
> such as zfs), cloud file systems (s3, azure blob etc).
>
>
>
> On 29 Jan 2017, at 11:18, Alex  wrote:
>
> Hi All,
>
> Thanks for your response .. Please find below flow diagram
>
> Please help me out simplifying this architecture using Spark
>
> 1) Can i skip step 1 to step 4 and directly store it in spark
> if I am storing it in spark where actually it is getting stored
> Do i need to retain HAdoop to store data
> or can i directly store it in spark and remove hadoop also?
>
> I want to remove informatica for preprocessing and directly load the files
> data coming from server to Hadoop/Spark
>
> So My Question is Can i directly load files data to spark ? Then where
> exactly the data will get stored.. Do I need to have Spark installed on Top
> of HDFS?
>
> 2) if I am retaining below architecture Can I store back output from spark
> directly to oracle from step 5 to step 7
>
> and will spark way of storing it back to oracle will be better than using
> sqoop performance wise
> 3)Can I use SPark scala UDF to process data from hive and retain entire
> architecture
>
> which among the above would be optimal
>
> [image: Inline image 1]
>
> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
> wrote:
>
>> I strongly agree with Jorn and Russell. There are different solutions for
>> data movement depending upon your needs frequency, bi-directional drivers.
>> workflow, handling duplicate records. This is a space is known as " Change
>> Data Capture - CDC" for short. If you need more information, I would be
>> happy to chat with you.  I built some products in this space that
>> extensively used connection pooling over ODBC/JDBC.
>>
>> Happy to chat if you need more information.
>>
>> -Sachin Naik
>>
>> >>Hard to tell. Can you give more insights >>on what you try to achieve
>> and what the data is about?
>> >>For example, depending on your use case sqoop can make sense or not.
>> Sent from my iPhone
>>
>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>> wrote:
>>
>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>> way back out (see the same link) and write directly to Oracle. I'll leave
>> the performance questions for someone else.
>>
>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha Cheruvu 
>> wrote:
>>
>>>
>>> On Sat, Jan 28, 2017 at 6:44 AM, Sirisha Cheruvu 
>>> wrote:
>>>
>>> Hi Team,
>>>
>>> RIght now our existing flow is
>>>
>>> Oracle-->Sqoop --> Hive--> Hive Queries on Spark-sql (Hive
>>> Context)-->Destination Hive table -->sqoop export to Oracle
>>>
>>> Half of the Hive UDFS required is developed in Java UDF..
>>>
>>> SO Now I want to know if I run the native scala UDF's than runninng hive
>>> java udfs in spark-sql will there be any performance difference
>>>
>>>
>>> Can we skip the Sqoop Import and export part and
>>>
>>> Instead directly load data from oracle to spark and code Scala UDF's for
>>> transformations and export output data back to oracle?
>>>
>>> RIght now the architecture we are using is
>>>
>>> oracle-->Sqoop (Import)-->Hive Tables--> Hive Queries --> Spark-SQL-->
>>> Hive --> Oracle
>>> what would be optimal architecture to process data from oracle using
>>> spark ?? can i anyway better this process ?
>>>
>>>
>>>
>>>
>>> Regards,
>>> Sirisha
>>>
>>>
>>>
>


Examples in graphx

2017-01-29 Thread Deepak Sharma
Hi There,
Are there any examples of using GraphX along with any graph DB?
I am looking to persist the graph in a graph-based DB and then read it back
into Spark and process it using GraphX.

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: help!!!----issue with spark-sql type cast form long to longwritable

2017-01-24 Thread Deepak Sharma
Can you try writing the UDF directly in Spark and registering it with the
Spark SQL or Hive context?
Or do you want to reuse the existing Hive UDF jar in Spark?
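
A minimal sketch of the first option: a UDF written against plain Scala types
and registered on the Hive context, so no Writable casting is involved (the
function body, column and table names are placeholders for the real logic):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
// Spark passes plain Scala/Java types (Long, Double, String), not Hadoop Writables
hc.udf.register("my_udf", (value: Long) => value * 2)
hc.sql("SELECT my_udf(some_bigint_col) FROM some_table").show()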

Thanks
Deepak

On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu"  wrote:

> Hi Team,
>
> I am trying to keep below code in get method and calling that get mthod in
> another hive UDF
> and running the hive UDF using Hive Context.sql procedure..
>
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((LongWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Suprisingly only LongWritable and Text convrsions are throwing error but
> DoubleWritable is working
> So I tried changing below code to
>
> switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((DoubleWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
> Still its throws error saying Java.Lang.Long cant be convrted
> to org.apache.hadoop.hive.serde2.io.DoubleWritable
>
>
>
> its working fine on hive but throwing error on spark-sql
>
> I am importing the below packages.
> import java.util.*;
> import org.apache.hadoop.hive.serde2.objectinspector.*;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.hive.serde2.io.DoubleWritable;
>
> .Please let me know why it is making issue in spark when perfectly running
> fine on hive
>


Re: anyone from bangalore wants to work on spark projects along with me

2017-01-19 Thread Deepak Sharma
Yes.
I will be there before 4 PM.
What's your contact number?
Thanks
Deepak

On Thu, Jan 19, 2017 at 2:38 PM, Sirisha Cheruvu  wrote:

> Are we meeting today?!
>
> On Jan 18, 2017 8:32 AM, "Sirisha Cheruvu"  wrote:
>
>> Hi ,
>>
>> Just thought of keeping my intention of working together with spark
>> developers who are also from bangalore so that we can brainstorm
>> togetherand work out solutions on our projects?
>>
>>
>> what say?
>>
>> expecting a reply
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Spark ANSI SQL Support

2017-01-17 Thread Deepak Sharma
From the Spark documentation page:
Spark SQL can now run all 99 TPC-DS queries.

On Jan 18, 2017 9:39 AM, "Rishabh Bhardwaj"  wrote:

> Hi All,
>
> Does Spark 2.0 Sql support full ANSI SQL query standards?
>
> Thanks,
> Rishabh.
>


Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
Did you try this with spark-shell?
Please try this.
$spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar

On the spark shell:
val hc = new org.apache.spark.sql.hive.HiveContext(sc) ;
hc.sql("create temporary function nexr_nvl2 as '
com.nexr.platform.hive.udf.GenericUDFNVL2'");
hc.sql("select nexr_nvl2(name,let,ret) from testtab5").show;

This should work.

So basically in your spark program , you can specify the jars while
submitting the job using spark-submit.

There is no need to have add jars statement in the spark program itself.
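
If the Hive GenericUDF keeps failing under Spark's UDF handler, an alternative
worth trying is a native Spark UDF with the same NVL2 semantics (string
arguments only in this sketch), which bypasses the ObjectInspector path
entirely:

hc.udf.register("nexr_nvl2_native",
  (check: String, ifNotNull: String, ifNull: String) =>
    if (check != null) ifNotNull else ifNull)   // NVL2: 2nd arg when 1st is non-null, else 3rd
hc.sql("select nexr_nvl2_native(name, let, ret) from testtab5").show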

Thanks
Deepak


On Wed, Jan 18, 2017 at 8:58 AM, Sirisha Cheruvu  wrote:

> This error
> org.apache.spark.sql.AnalysisException: No handler for Hive udf class
> com.nexr.platform.hive.udf.GenericUDFNVL2 because:
> com.nexr.platform.hive.udf.GenericUDFNVL2.; line 1 pos 26
> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$look
> upFunction$2.apply(hiveUDFs.scala:105)
> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$look
> upFunction$2.apply(hiveUDFs.scala:64)
> at scala.util.Try.getOrElse(Try.scala:77)
>
>
>
>
>
> My script is this
>
>
> import org.apache.spark.sql.hive.HiveContext
> val hc = new org.apache.spark.sql.hive.HiveContext(sc) ;
> hc.sql("add jar /home/cloudera/Downloads/genudnvl2.jar");
> hc.sql("create temporary function nexr_nvl2 as '
> com.nexr.platform.hive.udf.GenericUDFNVL2'");
> hc.sql("select nexr_nvl2(name,let,ret) from testtab5").show;
> System.exit(0);
>
>
> On Jan 17, 2017 2:01 PM, "Sirisha Cheruvu"  wrote:
>
>> Hi
>>
>> Anybody has a test and tried generic udf with object inspector
>> implementaion which sucessfully ran on both hive and spark-sql
>>
>> please share me the git hub link or source code file
>>
>> Thanks in advance
>> Sirisha
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
On the sqlcontext or hivesqlcontext , you can register the function as udf
below:
*hiveSqlContext.udf.register("func_name",func(_:String))*

Thanks
Deepak

On Wed, Jan 18, 2017 at 8:45 AM, Sirisha Cheruvu  wrote:

> Hey
>
> Can yu send me the source code of hive java udf which worked in spark sql
> and how yu registered the function on spark
>
>
> On Jan 17, 2017 2:01 PM, "Sirisha Cheruvu"  wrote:
>
> Hi
>
> Anybody has a test and tried generic udf with object inspector
> implementaion which sucessfully ran on both hive and spark-sql
>
> please share me the git hub link or source code file
>
> Thanks in advance
> Sirisha
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
How about this:
ADD_JARS="/home/hduser/jars/ojdbc6.jar" spark-shell

Thanks
Deepak

On Tue, Dec 27, 2016 at 5:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Ok I tried this but no luck
>
> spark-shell --jars /home/hduser/jars/ojdbc6.jar
> Spark context Web UI available at http://50.140.197.217:4041
> Spark context available as 'sc' (master = local[*], app id =
> local-1482838526271).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_77)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> warning: there was one deprecation warning; re-run with -deprecation for
> details
> HiveContext: org.apache.spark.sql.hive.HiveContext =
> org.apache.spark.sql.hive.HiveContext@ad0bb4e
> scala> //val sqlContext = new HiveContext(sc)
> scala> println ("\nStarted at"); spark.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
> Started at
> [27/12/2016 11:36:26.26]
> scala> //
> scala> var _ORACLEserver= "jdbc:oracle:thin:@rhes564:1521:mydb12"
> _ORACLEserver: String = jdbc:oracle:thin:@rhes564:1521:mydb12
> scala> var _username = "scratchpad"
> _username: String = scratchpad
> scala> var _password = "oracle"
> _password: String = oracle
> scala> //
> scala> val s = HiveContext.read.format("jdbc").options(
>  | Map("url" -> _ORACLEserver,
>  | "dbtable" -> "(SELECT ID, CLUSTERED, SCATTERED, RANDOMISED,
> RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>  | "partitionColumn" -> "ID",
>  | "lowerBound" -> "1",
>  | "upperBound" -> "1",
>  | "numPartitions" -> "10",
>  | "user" -> _username,
>  | "password" -> _password)).load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at org.apache.spark.sql.execution.datasources.jdbc.
> JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
>   at org.apache.spark.sql.execution.datasources.jdbc.
> JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.
> createConnectionFactory(JdbcUtils.scala:53)
>   at org.apache.spark.sql.execution.datasources.jdbc.
> JDBCRDD$.resolveTable(JDBCRDD.scala:123)
>   at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(
> JDBCRelation.scala:117)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.
> createRelation(JdbcRelationProvider.scala:53)
>   at org.apache.spark.sql.execution.datasources.
> DataSource.resolveRelation(DataSource.scala:315)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   ... 56 elided
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 December 2016 at 11:23, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
>> I meant ADD_JARS as you said --jars is not working for you with
>> spark-shell.
>>
>> Thanks
>> Deepak
>>
>> On Tue, Dec 27, 2016 at 4:51 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Ok just to be clear do you mean
>>>
>>> ADD_JARS="~/jars/ojdbc6.jar" spark-shell
>>>
>>> or
>>>
>>> spark-shell --jars $ADD_JARS
>>>
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJ

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
I meant ADD_JARS as you said --jars is not working for you with spark-shell.

Thanks
Deepak

On Tue, Dec 27, 2016 at 4:51 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Ok just to be clear do you mean
>
> ADD_JARS="~/jars/ojdbc6.jar" spark-shell
>
> or
>
> spark-shell --jars $ADD_JARS
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 December 2016 at 10:30, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
>> It works for me with spark 1.6 (--jars)
>> Please try this:
>> ADD_JARS="<>" spark-shell
>>
>> Thanks
>> Deepak
>>
>> On Tue, Dec 27, 2016 at 3:49 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks.
>>>
>>> The problem is that with spark-shell --jars does not work! This is Spark
>>> 2 accessing Oracle 12c
>>>
>>> spark-shell --jars /home/hduser/jars/ojdbc6.jar
>>>
>>> It comes back with
>>>
>>> java.sql.SQLException: No suitable driver
>>>
>>> unfortunately
>>>
>>> and spark-shell uses spark-submit under the bonnet if you look at the
>>> shell file
>>>
>>> "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main
>>> --name "Spark shell" "$@"
>>>
>>>
>>> hm
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 December 2016 at 09:52, Deepak Sharma <deepakmc...@gmail.com>
>>> wrote:
>>>
>>>> Hi Mich
>>>> You can copy the jar to shared location and use --jars command line
>>>> argument of spark-submit.
>>>> Who so ever needs  access to this jar , can refer to the shared path
>>>> and access it using --jars argument.
>>>>
>>>> Thanks
>>>> Deepak
>>>>
>>>> On Tue, Dec 27, 2016 at 3:03 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> When one runs in Local mode (one JVM) on an edge host (the host user
>>>>> accesses the cluster), it is possible to put additional jar file say
>>>>> accessing Oracle RDBMS tables in $SPARK_CLASSPATH. This works
>>>>>
>>>>> export SPARK_CLASSPATH=~/user_jars/ojdbc6.jar
>>>>>
>>>>> Normally a group of users can have read access to a shared directory
>>>>> like above and once they log in their shell will invoke an environment 
>>>>> file
>>>>> that will have the above classpath plus additional parameters like
>>>>> $JAVA_HOME etc are set up for them.
>>>>>
>>>>> However, if the user chooses to run spark through spark-submit with
>>>>> yarn, then the only way this will work in my research is to add the jar
>>>>> file as follows on every node of Spark cluster
>>>>>
>>>>> in $SPARK_HOME/conf/spark-defaults.conf
>>>>>
>>>>> Add the jar path to the following:
>>>>>
>>>>> spark.executor.extraClassPath   /user_jars/ojdbc6.jar
>>>>>
>>>>> Note that setting both spark.executor.extraClassPath and
>>>>> SPARK_CLASSPATH
>>>>> will cause initialisation error
>>>>>
>>>>> ERROR SparkContext: Error initializing SparkContext.
>>>>> org.apache.spark.SparkException: Found both
>>>>> spark.executor.extraClassPath and SPARK_CLASSPATH. Use only the former.
>>>>>
>>>>> I was wondering if there are other ways of making this work in YARN
>>>>> mode, where every node of cluster will require this JAR file?
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Thanks
>>>> Deepak
>>>> www.bigdatabig.com
>>>> www.keosha.net
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
It works for me with spark 1.6 (--jars)
Please try this:
ADD_JARS="<>" spark-shell

Thanks
Deepak

On Tue, Dec 27, 2016 at 3:49 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks.
>
> The problem is that with spark-shell --jars does not work! This is Spark 2
> accessing Oracle 12c
>
> spark-shell --jars /home/hduser/jars/ojdbc6.jar
>
> It comes back with
>
> java.sql.SQLException: No suitable driver
>
> unfortunately
>
> and spark-shell uses spark-submit under the bonnet if you look at the
> shell file
>
> "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main
> --name "Spark shell" "$@"
>
>
> hm
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 December 2016 at 09:52, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
>> Hi Mich
>> You can copy the jar to shared location and use --jars command line
>> argument of spark-submit.
>> Who so ever needs  access to this jar , can refer to the shared path and
>> access it using --jars argument.
>>
>> Thanks
>> Deepak
>>
>> On Tue, Dec 27, 2016 at 3:03 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> When one runs in Local mode (one JVM) on an edge host (the host user
>>> accesses the cluster), it is possible to put additional jar file say
>>> accessing Oracle RDBMS tables in $SPARK_CLASSPATH. This works
>>>
>>> export SPARK_CLASSPATH=~/user_jars/ojdbc6.jar
>>>
>>> Normally a group of users can have read access to a shared directory
>>> like above and once they log in their shell will invoke an environment file
>>> that will have the above classpath plus additional parameters like
>>> $JAVA_HOME etc are set up for them.
>>>
>>> However, if the user chooses to run spark through spark-submit with
>>> yarn, then the only way this will work in my research is to add the jar
>>> file as follows on every node of Spark cluster
>>>
>>> in $SPARK_HOME/conf/spark-defaults.conf
>>>
>>> Add the jar path to the following:
>>>
>>> spark.executor.extraClassPath   /user_jars/ojdbc6.jar
>>>
>>> Note that setting both spark.executor.extraClassPath and SPARK_CLASSPATH
>>> will cause initialisation error
>>>
>>> ERROR SparkContext: Error initializing SparkContext.
>>> org.apache.spark.SparkException: Found both
>>> spark.executor.extraClassPath and SPARK_CLASSPATH. Use only the former.
>>>
>>> I was wondering if there are other ways of making this work in YARN
>>> mode, where every node of cluster will require this JAR file?
>>>
>>> Thanks
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
Hi Mich
You can copy the jar to a shared location and use the --jars command-line
argument of spark-submit.
Whoever needs access to this jar can refer to the shared path and
access it using the --jars argument.
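
A minimal sketch of that invocation in YARN mode (paths, class and application
jar are placeholders); --jars ships the jar to the executors, while
--driver-class-path covers JDBC DriverManager lookups on the driver:

spark-submit \
  --master yarn \
  --jars /shared/jars/ojdbc6.jar \
  --driver-class-path /shared/jars/ojdbc6.jar \
  --class com.example.MyApp my-app.jar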

Thanks
Deepak

On Tue, Dec 27, 2016 at 3:03 PM, Mich Talebzadeh 
wrote:

> When one runs in Local mode (one JVM) on an edge host (the host user
> accesses the cluster), it is possible to put additional jar file say
> accessing Oracle RDBMS tables in $SPARK_CLASSPATH. This works
>
> export SPARK_CLASSPATH=~/user_jars/ojdbc6.jar
>
> Normally a group of users can have read access to a shared directory like
> above and once they log in their shell will invoke an environment file that
> will have the above classpath plus additional parameters like $JAVA_HOME
> etc are set up for them.
>
> However, if the user chooses to run spark through spark-submit with yarn,
> then the only way this will work in my research is to add the jar file as
> follows on every node of Spark cluster
>
> in $SPARK_HOME/conf/spark-defaults.conf
>
> Add the jar path to the following:
>
> spark.executor.extraClassPath   /user_jars/ojdbc6.jar
>
> Note that setting both spark.executor.extraClassPath and SPARK_CLASSPATH
> will cause initialisation error
>
> ERROR SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Found both spark.executor.extraClassPath
> and SPARK_CLASSPATH. Use only the former.
>
> I was wondering if there are other ways of making this work in YARN mode,
> where every node of cluster will require this JAR file?
>
> Thanks
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Deepak Sharma
You can read the source into a data frame.
Then map over all the rows and convert each value, using something like below:
df.map(x => x(0).toString.toDouble)

Thanks
Deepak

On Tue, Dec 20, 2016 at 3:05 PM, big data  wrote:

> our source data are string-based data, like this:
> col1   col2   col3 ...
> aaa   bbbccc
> aa2   bb2cc2
> aa3   bb3cc3
> ... ...   ...
>
> How to convert all of these data to double to apply for mlib's algorithm?
>
> thanks.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
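
A small sketch of the DataFrame-based version of this (column names are just
placeholders). Note that cast("double") only works when the strings are actually
numeric; for categorical values like aaa/bbb in the example, a StringIndexer from
spark.ml is the usual way to get numeric features:

  import org.apache.spark.sql.functions.col
  import org.apache.spark.ml.feature.StringIndexer

  // df is the source DataFrame with string columns col1, col2, col3
  val asDoubles = df.select(df.columns.map(c => col(c).cast("double").as(c)): _*)

  // for categorical strings, index them instead of casting
  val indexed = new StringIndexer()
    .setInputCol("col1")
    .setOutputCol("col1_idx")
    .fit(df)
    .transform(df)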


Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
On Sun, Dec 18, 2016 at 2:26 AM, vaquar khan  wrote:

> select * from indexInfo;
>

Hi Vaquar
I do not see a CF with the name indexInfo in any of the Cassandra databases.

Thanks
Deepak


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
There are 8 worker nodes in the cluster .

Thanks
Deepak

On Dec 18, 2016 2:15 AM, "Holden Karau" <hol...@pigscanfly.ca> wrote:

> How many workers are in the cluster?
>
> On Sat, Dec 17, 2016 at 12:23 PM Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Hi All,
>> I am iterating over a data frame's partitions using df.foreachPartition.
>> On each iteration of a row, I am initializing a DAO to insert the row into
>> Cassandra.
>> Each of these iterations takes almost a minute and a half to finish.
>> In my workflow, this is part of an action and 100 partitions are being
>> created for the df, as I can see 100 tasks being created, where the insert
>> DAO operation is being performed.
>> Since each of these 100 tasks takes around a minute and a half to
>> complete, it takes around 2 hours for this small insert operation.
>> Is anyone facing the same scenario and is there any time-efficient way to
>> handle this?
>> This latency is not good in our use case.
>> Any pointer to improve/minimise the latency will be really appreciated.
>>
>>
>> --
>> Thanks
>> Deepak
>>
>>
>>
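
A sketch of the pattern usually applied in this situation: create the DAO (and its
Cassandra connection) once per partition and reuse it for all rows in that
partition, instead of once per row. CassandraDao, insert() and close() are
placeholders for whatever DAO is being used here:

  df.foreachPartition { rows =>
    val dao = new CassandraDao()        // heavyweight initialisation once per partition
    rows.foreach(row => dao.insert(row))
    dao.close()
  }

Batching the inserts inside the partition (or using the spark-cassandra-connector's
saveToCassandra) usually reduces the per-row latency further.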


foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
Hi All,
I am iterating over a data frame's partitions using df.foreachPartition.
On each iteration of a row, I am initializing a DAO to insert the row into
Cassandra.
Each of these iterations takes almost a minute and a half to finish.
In my workflow, this is part of an action and 100 partitions are being
created for the df, as I can see 100 tasks being created, where the insert
DAO operation is being performed.
Since each of these 100 tasks takes around a minute and a half to complete,
it takes around 2 hours for this small insert operation.
Is anyone facing the same scenario and is there any time-efficient way to
handle this?
This latency is not good in our use case.
Any pointer to improve/minimise the latency will be really appreciated.


-- 
Thanks
Deepak


Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
Another simpler approach will be:
scala> val findf = sqlContext.sql("select
client_id,from_unixtime(ts/1000,'yyyy-MM-dd') ts from ts")
findf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string]

scala> findf.show
+--------------------+----------+
|           client_id|        ts|
+--------------------+----------+
|cd646551-fceb-416...|2016-11-01|
|3bc61951-0f49-43b...|2016-11-01|
|688acc61-753f-4a3...|2016-11-23|
|5ff1eb6c-14ec-471...|2016-11-23|
+--------------------+----------+

I registered temp table out of the original DF
Thanks
Deepak

On Mon, Dec 5, 2016 at 1:49 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:

> This is the correct way to do it.The timestamp that you mentioned was not
> correct:
>
> scala> val ts1 = from_unixtime($"ts"/1000, "yyyy-MM-dd")
> ts1: org.apache.spark.sql.Column = fromunixtime((ts / 1000),yyyy-MM-dd)
>
> scala> val finaldf = df.withColumn("ts1",ts1)
> finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string,
> ts1: string]
>
> scala> finaldf.show
> +--------------------+-------------+----------+
> |           client_id|           ts|       ts1|
> +--------------------+-------------+----------+
> |cd646551-fceb-416...|1477989416803|2016-11-01|
> |3bc61951-0f49-43b...|1477983725292|2016-11-01|
> |688acc61-753f-4a3...|1479899459947|2016-11-23|
> |5ff1eb6c-14ec-471...|1479901374026|2016-11-23|
> +--------------------+-------------+----------+
>
>
> Thanks
> Deepak
>
> On Mon, Dec 5, 2016 at 1:46 PM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> This is how you can do it in scala:
>> scala> val ts1 = from_unixtime($"ts", "yyyy-MM-dd")
>> ts1: org.apache.spark.sql.Column = fromunixtime(ts,yyyy-MM-dd)
>>
>> scala> val finaldf = df.withColumn("ts1",ts1)
>> finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string,
>> ts1: string]
>>
>> scala> finaldf.show
>> +--------------------+-------------+-----------+
>> |           client_id|           ts|        ts1|
>> +--------------------+-------------+-----------+
>> |cd646551-fceb-416...|1477989416803|48805-08-14|
>> |3bc61951-0f49-43b...|1477983725292|48805-06-09|
>> |688acc61-753f-4a3...|1479899459947|48866-02-22|
>> |5ff1eb6c-14ec-471...|1479901374026|48866-03-16|
>> +--------------------+-------------+-----------+
>>
>> The year is returning wrong here.May be the input timestamp is not
>> correct .Not sure.
>>
>> Thanks
>> Deepak
>>
>> On Mon, Dec 5, 2016 at 1:34 PM, Devi P.V <devip2...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for replying to my question.
>>> I am using scala
>>>
>>> On Mon, Dec 5, 2016 at 1:20 PM, Marco Mistroni <mmistr...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>  In python you can use date time.fromtimestamp(..).str
>>>> ftime('%Y%m%d')
>>>> Which spark API are you using?
>>>> Kr
>>>>
>>>> On 5 Dec 2016 7:38 am, "Devi P.V" <devip2...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a dataframe like following,
>>>>>
>>>>> +--------------------------+---------------+
>>>>> |client_id                 |timestamp      |
>>>>> +--------------------------+---------------+
>>>>> |cd646551-fceb-4166-acbc-b9|1477989416803  |
>>>>> |3bc61951-0f49-43bf-9848-b2|1477983725292  |
>>>>> |688acc61-753f-4a33-a034-bc|1479899459947  |
>>>>> |5ff1eb6c-14ec-4716-9798-00|1479901374026  |
>>>>> +--------------------------+---------------+
>>>>>
>>>>>  I want to convert timestamp column into yyyy-MM-dd format.
>>>>> How to do this?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
This is the correct way to do it.The timestamp that you mentioned was not
correct:

scala> val ts1 = from_unixtime($"ts"/1000, "yyyy-MM-dd")
ts1: org.apache.spark.sql.Column = fromunixtime((ts / 1000),yyyy-MM-dd)

scala> val finaldf = df.withColumn("ts1",ts1)
finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string,
ts1: string]

scala> finaldf.show
+--------------------+-------------+----------+
|           client_id|           ts|       ts1|
+--------------------+-------------+----------+
|cd646551-fceb-416...|1477989416803|2016-11-01|
|3bc61951-0f49-43b...|1477983725292|2016-11-01|
|688acc61-753f-4a3...|1479899459947|2016-11-23|
|5ff1eb6c-14ec-471...|1479901374026|2016-11-23|
+--------------------+-------------+----------+


Thanks
Deepak

On Mon, Dec 5, 2016 at 1:46 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:

> This is how you can do it in scala:
> scala> val ts1 = from_unixtime($"ts", "yyyy-MM-dd")
> ts1: org.apache.spark.sql.Column = fromunixtime(ts,yyyy-MM-dd)
>
> scala> val finaldf = df.withColumn("ts1",ts1)
> finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string,
> ts1: string]
>
> scala> finaldf.show
> +--------------------+-------------+-----------+
> |           client_id|           ts|        ts1|
> +--------------------+-------------+-----------+
> |cd646551-fceb-416...|1477989416803|48805-08-14|
> |3bc61951-0f49-43b...|1477983725292|48805-06-09|
> |688acc61-753f-4a3...|1479899459947|48866-02-22|
> |5ff1eb6c-14ec-471...|1479901374026|48866-03-16|
> +--------------------+-------------+-----------+
>
> The year is returning wrong here.May be the input timestamp is not correct
> .Not sure.
>
> Thanks
> Deepak
>
> On Mon, Dec 5, 2016 at 1:34 PM, Devi P.V <devip2...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks for replying to my question.
>> I am using scala
>>
>> On Mon, Dec 5, 2016 at 1:20 PM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Hi
>>>  In python you can use date time.fromtimestamp(..).str
>>> ftime('%Y%m%d')
>>> Which spark API are you using?
>>> Kr
>>>
>>> On 5 Dec 2016 7:38 am, "Devi P.V" <devip2...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a dataframe like following,
>>>>
>>>> +--------------------------+---------------+
>>>> |client_id                 |timestamp      |
>>>> +--------------------------+---------------+
>>>> |cd646551-fceb-4166-acbc-b9|1477989416803  |
>>>> |3bc61951-0f49-43bf-9848-b2|1477983725292  |
>>>> |688acc61-753f-4a33-a034-bc|1479899459947  |
>>>> |5ff1eb6c-14ec-4716-9798-00|1479901374026  |
>>>> +--------------------------+---------------+
>>>>
>>>>  I want to convert timestamp column into yyyy-MM-dd format.
>>>> How to do this?
>>>>
>>>>
>>>> Thanks
>>>>
>>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
This is how you can do it in scala:
scala> val ts1 = from_unixtime($"ts", "yyyy-MM-dd")
ts1: org.apache.spark.sql.Column = fromunixtime(ts,yyyy-MM-dd)

scala> val finaldf = df.withColumn("ts1",ts1)
finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string,
ts1: string]

scala> finaldf.show
+--------------------+-------------+-----------+
|           client_id|           ts|        ts1|
+--------------------+-------------+-----------+
|cd646551-fceb-416...|1477989416803|48805-08-14|
|3bc61951-0f49-43b...|1477983725292|48805-06-09|
|688acc61-753f-4a3...|1479899459947|48866-02-22|
|5ff1eb6c-14ec-471...|1479901374026|48866-03-16|
+--------------------+-------------+-----------+

The year is coming out wrong here. Maybe the input timestamp is not correct.
Not sure.

Thanks
Deepak

On Mon, Dec 5, 2016 at 1:34 PM, Devi P.V  wrote:

> Hi,
>
> Thanks for replying to my question.
> I am using scala
>
> On Mon, Dec 5, 2016 at 1:20 PM, Marco Mistroni 
> wrote:
>
>> Hi
>>  In python you can use date time.fromtimestamp(..).str
>> ftime('%Y%m%d')
>> Which spark API are you using?
>> Kr
>>
>> On 5 Dec 2016 7:38 am, "Devi P.V"  wrote:
>>
>>> Hi all,
>>>
>>> I have a dataframe like following,
>>>
>>> +--------------------------+---------------+
>>> |client_id                 |timestamp      |
>>> +--------------------------+---------------+
>>> |cd646551-fceb-4166-acbc-b9|1477989416803  |
>>> |3bc61951-0f49-43bf-9848-b2|1477983725292  |
>>> |688acc61-753f-4a33-a034-bc|1479899459947  |
>>> |5ff1eb6c-14ec-4716-9798-00|1479901374026  |
>>> +--------------------------+---------------+
>>>
>>>  I want to convert timestamp column into yyyy-MM-dd format.
>>> How to do this?
>>>
>>>
>>> Thanks
>>>
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Spark 2.0.2 , using DStreams in Spark Streaming . How do I create SQLContext? Please help

2016-11-30 Thread Deepak Sharma
In Spark 2.0 and later, SparkSession was introduced, which you can use to query
Hive as well.
Just make sure you create the SparkSession with the enableHiveSupport() option.

Thanks
Deepak

On Thu, Dec 1, 2016 at 12:27 PM, shyla deshpande 
wrote:

> I am Spark 2.0.2 , using DStreams because I need Cassandra Sink.
>
> How do I create SQLContext? I get the error SQLContext deprecated.
>
>
> *[image: Inline image 1]*
>
> *Thanks*
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
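
A minimal sketch of that (assuming Spark 2.0.x):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("dstream-with-cassandra-sink")
    .enableHiveSupport()          // only needed if Hive tables are queried
    .getOrCreate()

  // the older contexts are still reachable from the session if an API needs them
  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext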


Re: what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-16 Thread Deepak Sharma
Can you try caching the individual dataframes and then union them?
It may save you time.

Thanks
Deepak

On Wed, Nov 16, 2016 at 12:35 PM, Devi P.V  wrote:

> Hi all,
>
> I have 4 data frames with three columns,
>
> client_id,product_id,interest
>
> I want to combine these 4 dataframes into one dataframe.I used union like
> following
>
> df1.union(df2).union(df3).union(df4)
>
> But it is time consuming for bigdata.what is the optimized way for doing
> this using spark 2.0 & scala
>
>
> Thanks
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
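
A small sketch of the suggestion (Spark 2.0, Scala). union itself is lazy, so most
of the time goes into whatever action follows; caching only pays off if the
individual frames are reused elsewhere:

  val parts = Seq(df1, df2, df3, df4).map(_.cache())
  val combined = parts.reduce(_ union _)
  combined.count()   // the action that actually materialises the union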


Re: Possible DR solution

2016-11-11 Thread Deepak Sharma
Reason being you can set up HDFS replication on your own to some other
cluster.

On Nov 11, 2016 22:42, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

> reason being ?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 November 2016 at 17:11, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
>> This is a waste of money, I guess.
>>
>> On Nov 11, 2016 22:41, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>> wrote:
>>
>>> starts at $4,000 per node per year all inclusive.
>>>
>>> With discount it can be halved but we are talking a node itself so if
>>> you have 5 nodes in primary and 5 nodes in DR we are talking about $40K
>>> already.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 11 November 2016 at 16:43, Mudit Kumar <mkumar...@sapient.com> wrote:
>>>
>>>> Is it feasible cost wise?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mudit
>>>>
>>>>
>>>>
>>>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>>>> *Sent:* Friday, November 11, 2016 2:56 PM
>>>> *To:* user @spark
>>>> *Subject:* Possible DR solution
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> Has anyone had experience of using WanDisco <https://www.wandisco.com/>
>>>> block replication to create a fault tolerant solution to DR in Hadoop?
>>>>
>>>>
>>>>
>>>> The product claims that it starts replicating as soon as the first data
>>>> block lands on HDFS and takes the block and sends it to DR/replicate site.
>>>> The idea is that is faster than doing it through traditional HDFS copy
>>>> tools which are normally batch oriented.
>>>>
>>>>
>>>>
>>>> It also claims to replicate Hive metadata as well.
>>>>
>>>>
>>>>
>>>> I wanted to gauge if anyone has used it or a competitor product. The
>>>> claim is that they do not have competitors!
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn  
>>>> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>
>>>
>


Re: Possible DR solution

2016-11-11 Thread Deepak Sharma
This is a waste of money, I guess.

On Nov 11, 2016 22:41, "Mich Talebzadeh"  wrote:

> starts at $4,000 per node per year all inclusive.
>
> With discount it can be halved but we are talking a node itself so if you
> have 5 nodes in primary and 5 nodes in DR we are talking about $40K already.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 November 2016 at 16:43, Mudit Kumar  wrote:
>
>> Is it feasible cost wise?
>>
>>
>>
>> Thanks,
>>
>> Mudit
>>
>>
>>
>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>> *Sent:* Friday, November 11, 2016 2:56 PM
>> *To:* user @spark
>> *Subject:* Possible DR solution
>>
>>
>>
>> Hi,
>>
>>
>>
>> Has anyone had experience of using WanDisco 
>> block replication to create a fault tolerant solution to DR in Hadoop?
>>
>>
>>
>> The product claims that it starts replicating as soon as the first data
>> block lands on HDFS and takes the block and sends it to DR/replicate site.
>> The idea is that is faster than doing it through traditional HDFS copy
>> tools which are normally batch oriented.
>>
>>
>>
>> It also claims to replicate Hive metadata as well.
>>
>>
>>
>> I wanted to gauge if anyone has used it or a competitor product. The
>> claim is that they do not have competitors!
>>
>>
>>
>> Thanks
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn  
>> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: Optimized way to use spark as db to hdfs etl

2016-11-05 Thread Deepak Sharma
Hi Rohit
You can use an accumulator and increment it as every record is processed.
At the end you can get the value of the accumulator on the driver, which will
give you the count.

HTH
Deepak

On Nov 5, 2016 20:09, "Rohit Verma"  wrote:

> I am using spark to read from database and write in hdfs as parquet file.
> Here is code snippet.
>
> private long etlFunction(SparkSession spark){
> spark.sqlContext().setConf("spark.sql.parquet.compression.codec",
> “SNAPPY");
> Properties properties = new Properties();
> properties.put("driver”,”oracle.jdbc.driver");
> properties.put("fetchSize”,”5000");
> Dataset dataset = spark.read().jdbc(jdbcUrl, query, properties);
> dataset.write.format(“parquet”).save(“pdfs-path”);
> return dataset.count();
> }
>
> When I look at spark ui, during write I have stats of records written,
> visible in sql tab under query plan.
>
> While the count itself is a heavy task.
>
> Can someone suggest best way to get count in most optimized way.
>
> Thanks all..
>


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
If you use Spark direct streams, it ensures end-to-end guarantees for
messages.


On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts, otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as
> possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> I wouldn't give up the flexibility and maturity of a relational
>> database, unless you have a very specific use case.  I'm not trashing
>> cassandra, I've used cassandra, but if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought.  If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular.  One of the current best from what I've worked with is
>> Citus.
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com>
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why postgres and not cassandra?
>> > Is there anything specific here that i may not be aware?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>> >>
>> >> How are you going to handle etl failures?  Do you care about lost /
>> >> duplicated data?  Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >> feeding postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com>
>> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> Nothing
>> >> > is
>> >> > being done to the data except some light ETL and then storing it in
>> >> > Cassandra
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <
>> deepakmc...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Its better you use spark's direct stream to ingest from kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need a different speed storage and batch storage.
>> Just
>> >> >>> taking in raw data from Kafka, standardizing, and storing it
>> somewhere
>> >> >>> where
>> >> >>> the web UI can query it, seems like it will be enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use
>> Spark
>> >> >>> Streaming to read Kafka, or use Kafka consumers directly.
>> >> >>>
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>> <mich.talebza...@gmail.com> wrote:
>> >> >>>>
>> >> >>>> - Spark Streaming to read data from Kafka
>> >> >>>> - Storing the data on HDFS using Flume
>> >> >>>>
>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>> on
>> >> >>>> HDFS. It is a waste of resources.
>> >> >>>>
>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >> >>>>
>> >> >>>> KafkaAgent.sources = kafka-sources
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >> >>>>
>> >> >>>> That will be for your batch layer. To analyse you can directly
>> read
&

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Hi Cody
Spark direct stream is just fine for this use case.
But why Postgres and not Cassandra?
Is there anything specific here that I may not be aware of?

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote:

> How are you going to handle etl failures?  Do you care about lost /
> duplicated data?  Are your writes idempotent?
>
> Absent any other information about the problem, I'd stay away from
> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
> feeding postgres.
>
> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> > Is there an advantage to that vs directly consuming from Kafka? Nothing
> is
> > being done to the data except some light ETL and then storing it in
> > Cassandra
> >
> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com>
> > wrote:
> >>
> >> Its better you use spark's direct stream to ingest from kafka.
> >>
> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com>
> wrote:
> >>>
> >>> I don't think I need a different speed storage and batch storage. Just
> >>> taking in raw data from Kafka, standardizing, and storing it somewhere
> where
> >>> the web UI can query it, seems like it will be enough.
> >>>
> >>> I'm thinking about:
> >>>
> >>> - Reading data from Kafka via Spark Streaming
> >>> - Standardizing, then storing it in Cassandra
> >>> - Querying Cassandra from the web ui
> >>>
> >>> That seems like it will work. My question now is whether to use Spark
> >>> Streaming to read Kafka, or use Kafka consumers directly.
> >>>
> >>>
> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
> >>> <mich.talebza...@gmail.com> wrote:
> >>>>
> >>>> - Spark Streaming to read data from Kafka
> >>>> - Storing the data on HDFS using Flume
> >>>>
> >>>> You don't need Spark streaming to read data from Kafka and store on
> >>>> HDFS. It is a waste of resources.
> >>>>
> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
> >>>>
> >>>> KafkaAgent.sources = kafka-sources
> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
> >>>>
> >>>> That will be for your batch layer. To analyse you can directly read
> from
> >>>> hdfs files with Spark or simply store data in a database of your
> choice via
> >>>> cron or something. Do not mix your batch layer with speed layer.
> >>>>
> >>>> Your speed layer will ingest the same data directly from Kafka into
> >>>> spark streaming and that will be  online or near real time (defined
> by your
> >>>> window).
> >>>>
> >>>> Then you have a a serving layer to present data from both speed  (the
> >>>> one from SS) and batch layer.
> >>>>
> >>>> HTH
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> Dr Mich Talebzadeh
> >>>>
> >>>>
> >>>>
> >>>> LinkedIn
> >>>> https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>>
> >>>>
> >>>>
> >>>> http://talebzadehmich.wordpress.com
> >>>>
> >>>>
> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for
> any
> >>>> loss, damage or destruction of data or any other property which may
> arise
> >>>> from relying on this email's technical content is explicitly
> disclaimed. The
> >>>> author will in no case be liable for any monetary damages arising
> from such
> >>>> loss, damage or destruction.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com>
> wrote:
> >>>>>
> >>>>> The web UI is actually the speed layer, it needs to be able to query
> >>>>> the data online, and show the results in real-time.
> >>>>>
> >>>>> It also needs a custom front-end, so a system like Tableau can't be
> >>>>> used, it must have a custom backend + front-end.
> >>>>>
> >>>>> Thanks for the recommendation of 

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
It's better you use Spark's direct stream to ingest from Kafka.

On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> I don't think I need a different speed storage and batch storage. Just
> taking in raw data from Kafka, standardizing, and storing it somewhere
> where the web UI can query it, seems like it will be enough.
>
> I'm thinking about:
>
> - Reading data from Kafka via Spark Streaming
> - Standardizing, then storing it in Cassandra
> - Querying Cassandra from the web ui
>
> That seems like it will work. My question now is whether to use Spark
> Streaming to read Kafka, or use Kafka consumers directly.
>
>
> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>>
>> You don't need Spark streaming to read data from Kafka and store on HDFS.
>> It is a waste of resources.
>>
>> Couple Flume to use Kafka as source and HDFS as sink directly
>>
>> KafkaAgent.sources = kafka-sources
>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>
>> That will be for your batch layer. To analyse you can directly read from
>> hdfs files with Spark or simply store data in a database of your choice via
>> cron or something. Do not mix your batch layer with speed layer.
>>
>> Your speed layer will ingest the same data directly from Kafka into spark
>> streaming and that will be  online or near real time (defined by your
>> window).
>>
>> Then you have a a serving layer to present data from both speed  (the one
>> from SS) and batch layer.
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>>> The web UI is actually the speed layer, it needs to be able to query the
>>> data online, and show the results in real-time.
>>>
>>> It also needs a custom front-end, so a system like Tableau can't be
>>> used, it must have a custom backend + front-end.
>>>
>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>
>>> - Spark Streaming to read data from Kafka
>>> - Storing the data on HDFS using Flume
>>> - Using Spark to query the data in the backend of the web UI?
>>>
>>>
>>>
>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>>> on HDFS using flume.
>>>>
>>>> -  Query this data to generate reports / analytics (There will be a web
>>>> UI which will be the front-end to the data, and will show the reports)
>>>>
>>>> This is basically batch layer and you need something like Tableau or
>>>> Zeppelin to query data
>>>>
>>>> You will also need spark streaming to query data online for speed
>>>> layer. That data could be stored in some transient fabric like ignite or
>>>> even druid.
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this ema

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Since the inflow is huge, Flume would also need to be run with multiple
channels in a distributed fashion.
In that case, the resource utilization will be high as well.

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:11 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume
>
> You don't need Spark streaming to read data from Kafka and store on HDFS.
> It is a waste of resources.
>
> Couple Flume to use Kafka as source and HDFS as sink directly
>
> KafkaAgent.sources = kafka-sources
> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>
> That will be for your batch layer. To analyse you can directly read from
> hdfs files with Spark or simply store data in a database of your choice via
> cron or something. Do not mix your batch layer with speed layer.
>
> Your speed layer will ingest the same data directly from Kafka into spark
> streaming and that will be  online or near real time (defined by your
> window).
>
> Then you have a a serving layer to present data from both speed  (the one
> from SS) and batch layer.
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>
>> The web UI is actually the speed layer, it needs to be able to query the
>> data online, and show the results in real-time.
>>
>> It also needs a custom front-end, so a system like Tableau can't be used,
>> it must have a custom backend + front-end.
>>
>> Thanks for the recommendation of Flume. Do you think this will work:
>>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>> - Using Spark to query the data in the backend of the web UI?
>>
>>
>>
>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>> on HDFS using flume.
>>>
>>> -  Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> This is basically batch layer and you need something like Tableau or
>>> Zeppelin to query data
>>>
>>> You will also need spark streaming to query data online for speed layer.
>>> That data could be stored in some transient fabric like ignite or even
>>> druid.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>
>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>
>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com>
>>>> wrote:
>>>>
>>>>> What is the message inflow ?
>>>>> If it's really high , definitely spark will be of great use .
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>
>>>>>

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
For the UI, you need a DB such as Cassandra that is designed for such
queries.
Ingest the data into Spark Streaming (speed layer) and write to HDFS (for the
batch layer).
Now you have data at rest as well as in motion (real time).
From Spark Streaming itself, do further processing and write the final
result to Cassandra/a NoSQL DB.
The UI can pick the data up from the DB now.

Thanks
Deepak

On Thu, Sep 29, 2016 at 8:00 PM, Alonso Isidoro Roman <alons...@gmail.com>
wrote:

> "Using Spark to query the data in the backend of the web UI?"
>
> Dont do that. I would recommend that spark streaming process stores data
> into some nosql or sql database and the web ui to query data from that
> database.
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> <https://about.me/alonso.isidoro.roman?promo=email_sig_source=email_sig_medium=email_sig_campaign=external_links>
>
> 2016-09-29 16:15 GMT+02:00 Ali Akhtar <ali.rac...@gmail.com>:
>
>> The web UI is actually the speed layer, it needs to be able to query the
>> data online, and show the results in real-time.
>>
>> It also needs a custom front-end, so a system like Tableau can't be used,
>> it must have a custom backend + front-end.
>>
>> Thanks for the recommendation of Flume. Do you think this will work:
>>
>> - Spark Streaming to read data from Kafka
>> - Storing the data on HDFS using Flume
>> - Using Spark to query the data in the backend of the web UI?
>>
>>
>>
>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> You need a batch layer and a speed layer. Data from Kafka can be stored
>>> on HDFS using flume.
>>>
>>> -  Query this data to generate reports / analytics (There will be a web
>>> UI which will be the front-end to the data, and will show the reports)
>>>
>>> This is basically batch layer and you need something like Tableau or
>>> Zeppelin to query data
>>>
>>> You will also need spark streaming to query data online for speed layer.
>>> That data could be stored in some transient fabric like ignite or even
>>> druid.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>
>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>
>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com>
>>>> wrote:
>>>>
>>>>> What is the message inflow ?
>>>>> If it's really high , definitely spark will be of great use .
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>
>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>
>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their
>>>>>> raw data into Kafka.
>>>>>>
>>>>>> I need to:
>>>>>>
>>>>>> - Do ETL on the data, and standardize it.
>>>>>>
>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS
>>>>>> / ElasticSearch / Postgres)
>>>>>>
>>>>>> - Query this data to generate reports / analytics (There will be a
>>>>>> web UI which will be the front-end to the data, and will show the 
>>>>>> reports)
>>>>>>
>>>>>> Java is being used as the backend language for everything (backend of
>>>>>> the web UI, as well as the ETL layer)
>>>>>>
>>>>>> I'm considering:
>>>>>>
>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>> (receive raw data from Kafka, standardize & store it)
>>>>>>
>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized
>>>>>> data, and to allow queries
>>>>>>
>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>> queries across the data (mostly filters), or directly run queries against
>>>>>> Cassandra / HBase
>>>>>>
>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>> alternatives I should go with (e.g, using raw Kafka consumers vs Spark 
>>>>>> for
>>>>>> ETL, which persistent data store to use, and how to query that data store
>>>>>> in the backend of the web UI, for displaying the reports).
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
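
One possible shape of that speed layer in Scala, assuming the
spark-streaming-kafka-0-10 integration and the DataStax spark-cassandra-connector
are on the classpath; the broker, topic, keyspace/table names and the
parseAndStandardize helper (returning a case class that matches the Cassandra
table) are all placeholders, not from the thread:

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010.KafkaUtils
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
  import com.datastax.spark.connector._

  val ssc = new StreamingContext(sc, Seconds(30))
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker1:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "etl-speed-layer",
    "auto.offset.reset" -> "latest")

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("raw-events"), kafkaParams))

  stream.map(record => parseAndStandardize(record.value()))
        .foreachRDD(rdd => rdd.saveToCassandra("analytics", "events"))

  ssc.start()
  ssc.awaitTermination()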


Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow?
If it's really high, definitely Spark will be of great use.

Thanks
Deepak

On Sep 29, 2016 19:24, "Ali Akhtar"  wrote:

> I have a somewhat tricky use case, and I'm looking for ideas.
>
> I have 5-6 Kafka producers, reading various APIs, and writing their raw
> data into Kafka.
>
> I need to:
>
> - Do ETL on the data, and standardize it.
>
> - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS /
> ElasticSearch / Postgres)
>
> - Query this data to generate reports / analytics (There will be a web UI
> which will be the front-end to the data, and will show the reports)
>
> Java is being used as the backend language for everything (backend of the
> web UI, as well as the ETL layer)
>
> I'm considering:
>
> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive
> raw data from Kafka, standardize & store it)
>
> - Using Cassandra, HBase, or raw HDFS, for storing the standardized data,
> and to allow queries
>
> - In the backend of the web UI, I could either use Spark to run queries
> across the data (mostly filters), or directly run queries against Cassandra
> / HBase
>
> I'd appreciate some thoughts / suggestions on which of these alternatives
> I should go with (e.g, using raw Kafka consumers vs Spark for ETL, which
> persistent data store to use, and how to query that data store in the
> backend of the web UI, for displaying the reports).
>
>
> Thanks.
>


Re: Convert RDD to JSON Rdd and append more information

2016-09-20 Thread Deepak Sharma
Enrich the RDD first with the additional information and then map it to some
case class, if you are using Scala.
You can then use the Play API's
(play.api.libs.json.Writes/play.api.libs.json.Json) classes to convert the
mapped case class to JSON.

Thanks
Deepak

On Tue, Sep 20, 2016 at 6:42 PM, sujeet jog  wrote:

> Hi,
>
> I have a Rdd of n rows,  i want to transform this to a Json RDD, and also
> add some more information , any idea how to accomplish this .
>
>
> ex : -
>
> i have rdd with n rows with data like below ,  ,
>
>  16.9527493170273,20.1989561393151,15.7065424947394
>  17.9527493170273,21.1989561393151,15.7065424947394
>  18.9527493170273,22.1989561393151,15.7065424947394
>
>
> would like to add few rows highlited to the beginning of RDD like below,
> is there a way to
> do this and transform it to JSON,  the reason being i intend to push this
> as input  to some application via pipeRDD for some processing, and want to
> enforce a JSON structure on the input.
>
> *{*
> *TimeSeriesID : 1234*
> *NumOfInputSamples : 1008 *
> *Request Type : Fcast*
>  16.9527493170273,20.1989561393151,15.7065424947394
>  17.9527493170273,21.1989561393151,15.7065424947394
>  18.9527493170273,22.1989561393151,15.7065424947394
> }
>
>
> Thanks,
> Sujeet
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
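
A small sketch of that approach with play-json (the field names and values are
taken from the example above; assumes play-json is on the classpath):

  import play.api.libs.json.{Json, OWrites}

  case class SeriesRow(TimeSeriesID: Int,
                       NumOfInputSamples: Int,
                       RequestType: String,
                       values: String)

  implicit val seriesRowWrites: OWrites[SeriesRow] = Json.writes[SeriesRow]

  // rdd is the original RDD[String] of comma-separated rows
  val jsonRdd = rdd.map(line =>
    Json.stringify(Json.toJson(SeriesRow(1234, 1008, "Fcast", line))))

  // jsonRdd now carries one JSON document per record and can be fed to pipe()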


Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Deepak Sharma
Hi Anupama

To me it looks like an issue with the SPN with which you are trying to connect
to HiveServer2, i.e. hive@hostname.

Are you able to connect to Hive from spark-shell?

Try getting the ticket using any other user keytab (but not the Hadoop services
keytab) and then try running the spark-submit.


Thanks

Deepak

On 17 Sep 2016 12:23 am,  wrote:

> Hi,
>
>
>
> I am trying to connect to Hive from Spark application in Kerborized
> cluster and get the following exception.  Spark version is 1.4.1 and Hive
> is 1.2.1. Outside of spark the connection goes through fine.
>
> Am I missing any configuration parameters?
>
>
>
> java.sql.SQLException: Could not open connection to
> jdbc:hive2://10001/default;principal=hive/ server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
>
>at org.apache.hive.jdbc.HiveConne
> ction.openTransport(HiveConnection.java:206)
>
>at org.apache.hive.jdbc.HiveConne
> ction.(HiveConnection.java:178)
>
>at org.apache.hive.jdbc.HiveDrive
> r.connect(HiveDriver.java:105)
>
>at java.sql.DriverManager.getConn
> ection(DriverManager.java:571)
>
>at java.sql.DriverManager.getConn
> ection(DriverManager.java:215)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
>
>at org.apache.spark.api.java.Java
> PairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
>
>at scala.collection.Iterator$$ano
> n$11.next(Iterator.scala:328)
>
>at scala.collection.Iterator$$ano
> n$11.next(Iterator.scala:328)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$
> apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(
> PairRDDFunctions.scala:1108)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(
> PairRDDFunctions.scala:1108)
>
>at org.apache.spark.util.Utils$.t
> ryWithSafeFinally(Utils.scala:1285)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(Pai
> rRDDFunctions.scala:1116)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(Pai
> rRDDFunctions.scala:1095)
>
>at org.apache.spark.scheduler.Res
> ultTask.runTask(ResultTask.scala:63)
>
>at org.apache.spark.scheduler.Task.run(Task.scala:70)
>
>at org.apache.spark.executor.Exec
> utor$TaskRunner.run(Executor.scala:213)
>
>at java.util.concurrent.ThreadPoo
> lExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>at java.util.concurrent.ThreadPoo
> lExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.thrift.transport.TTransportException
>
>at org.apache.thrift.transport.TI
> OStreamTransport.read(TIOStreamTransport.java:132)
>
>at org.apache.thrift.transport.TT
> ransport.readAll(TTransport.java:84)
>
>at org.apache.thrift.transport.TS
> aslTransport.receiveSaslMessage(TSaslTransport.java:182)
>
>at org.apache.thrift.transport.TS
> aslTransport.open(TSaslTransport.java:258)
>
>at org.apache.thrift.transport.TS
> aslClientTransport.open(TSaslClientTransport.java:37)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at javax.security.auth.Subject.doAs(Subject.java:415)
>
>at org.apache.hadoop.security.Use
> rGroupInformation.doAs(UserGroupInformation.java:1657)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>
>at org.apache.hive.jdbc.HiveConne
> ction.openTransport(HiveConnection.java:203)
>
>... 21 more
>
>
>
> In spark conf directory hive-site.xml has the following properties
>
>
>
> 
>
>
>
> <property>
>   <name>hive.metastore.kerberos.keytab.file</name>
>   <value>/etc/security/keytabs/hive.service.keytab</value>
> </property>
>
> <property>
>   <name>hive.metastore.kerberos.principal</name>
>   <value>hive/_HOST@</value>
> </property>
>
> <property>
>   <name>hive.metastore.sasl.enabled</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://:9083</value>
> </property>
>
> <property>
>   <name>hive.server2.authentication</name>
>   <value>KERBEROS</value>
> </property>
>
>
>
> 
>
>   

Re: how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread Deepak Sharma
I am not sure about EMR, but it seems multi-tenancy is not enabled in your
case.
Multi-tenancy means all the applications have to be submitted to different
queues.

Thanks
Deepak

On Wed, Sep 14, 2016 at 11:37 AM, Divya Gehlot 
wrote:

> Hi,
>
> I am on EMR cluster and My cluster configuration is as below:
> Number of nodes including master node - 3
> Memory:22.50 GB
> VCores Total : 16
> Active Nodes : 2
> Spark version- 1.6.1
>
> Parameter set in spark-default.conf
>
> spark.executor.instances 2
>> spark.executor.cores 8
>> spark.driver.memory  10473M
>> spark.executor.memory9658M
>> spark.default.parallelism32
>
>
> Would let me know if need any other info regarding the cluster .
>
> The current configuration for spark-submit is
> --driver-memory 5G \
> --executor-memory 2G \
> --executor-cores 5 \
> --num-executors 10 \
>
>
> Currently  with the above job configuration if I try to run another spark
> job it will be in accepted state till the first one finishes .
> How do I optimize or update the above spark-submit configurations to run
> some more spark jobs simultaneously
>
> Would really appreciate the help.
>
> Thanks,
> Divya
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Ways to check Spark submit running

2016-09-13 Thread Deepak Sharma
Use yarn-client mode and you can see the logs on the console after you submit.

On Tue, Sep 13, 2016 at 11:47 AM, Divya Gehlot 
wrote:

> Hi,
>
> Some how for time being  I am unable to view Spark Web UI and Hadoop Web
> UI.
> Looking for other ways ,I can check my job is running fine apart from keep
> checking current yarn logs .
>
>
> Thanks,
> Divya
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Assign values to existing column in SparkR

2016-09-09 Thread Deepak Sharma
Data frames are immutable in nature, so I don't think you can directly
assign or change values on a column.

Thanks
Deepak

On Fri, Sep 9, 2016 at 10:59 PM, xingye  wrote:

> I have some questions about assign values to a spark dataframe. I want to
> assign values to an existing column of a spark dataframe but if I assign
> the value directly, I got the following error.
>
>
>1. df$c_mon<-0
>2. Error: class(value) == "Column" || is.null(value) is not TRUE
>
> Is there a way to solve this?
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Calling udf in Spark

2016-09-08 Thread Deepak Sharma
No, it's not required for a UDF.
It's required when you convert from an RDD to a DF.
Thanks
Deepak

On 8 Sep 2016 2:25 pm, "Divya Gehlot"  wrote:

> Hi,
>
> Is it necessary to import sqlContext.implicits._ whenever define and
> call UDF in Spark.
>
>
> Thanks,
> Divya
>
>
>
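
A quick sketch illustrating the point (column names are placeholders):

  import org.apache.spark.sql.functions.udf

  val toUpperUdf = udf((s: String) => if (s == null) null else s.toUpperCase)

  // no implicits import needed for this form
  val out = df.withColumn("name_upper", toUpperUdf(df("name")))

  // sqlContext.implicits._ (spark.implicits._ in 2.x) is what enables the
  // $"col" syntax and Seq/RDD .toDF(), not the udf itself
  import sqlContext.implicits._
  val out2 = df.withColumn("name_upper", toUpperUdf($"name"))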


Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Deepak Sharma
Is it possible to execute any query using SQLContext even if the DB is
secured using roles or tools such as Sentry?

Thanks
Deepak

On Tue, Aug 30, 2016 at 7:52 PM, Rajani, Arpan 
wrote:

> Hi All,
>
> In our YARN cluster, we have setup spark 1.6.1 , we plan to give access to
> all the end users/developers/BI users, etc. But we learnt any valid user
> after getting their own user kerb TGT, can get hold of sqlContext (in
> program or in shell) and can run any query against any secure databases.
>
> This puts us in a critical condition as we do not want to give blanket
> permission to everyone.
>
>
>
> We are looking forward to:
>
> 1)  A *solution or a work around, by which we can give secure access
> only to the selected users to sensitive tables/database.*
>
> 2)  *Failing to do so, we would like to remove/disable the SparkSQL
> context/feature for everyone.  *
>
>
>
> Any pointers in this direction will be very valuable.
>
> Thank you,
>
> Arpan
>
>
> This e-mail and any attachments are confidential, intended only for the 
> addressee and may be privileged. If you have received this e-mail in error, 
> please notify the sender immediately and delete it. Any content that does not 
> relate to the business of Worldpay is personal to the sender and not 
> authorised or endorsed by Worldpay. Worldpay does not accept responsibility 
> for viruses or any loss or damage arising from transmission or access.
>
> Worldpay (UK) Limited (Company No: 07316500/ Financial Conduct Authority No: 
> 530923), Worldpay Limited (Company No:03424752 / Financial Conduct Authority 
> No: 504504), Worldpay AP Limited (Company No: 05593466 / Financial Conduct 
> Authority No: 502597). Registered Office: The Walbrook Building, 25 Walbrook, 
> London EC4N 8AF and authorised by the Financial Conduct Authority under the 
> Payment Service Regulations 2009 for the provision of payment services. 
> Worldpay (UK) Limited is authorised and regulated by the Financial Conduct 
> Authority for consumer credit activities. Worldpay B.V. (WPBV) has its 
> registered office in Amsterdam, the Netherlands (Handelsregister KvK no. 
> 60494344). WPBV holds a licence from and is included in the register kept by 
> De Nederlandsche Bank, which registration can be consulted through 
> www.dnb.nl. Worldpay, the logo and any associated brand names are trade marks 
> of the Worldpay group.
>
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Spark 2.0 - Join statement compile error

2016-08-23 Thread Deepak Sharma
On Tue, Aug 23, 2016 at 10:32 AM, Deepak Sharma <deepakmc...@gmail.com>
wrote:

> val df =
> sales_demand.join(product_master,sales_demand.$"INVENTORY_ITEM_ID"
> === product_master.$"INVENTORY_ITEM_ID","inner")


Ignore the last statement.
It should look something like this:
val df =
sales_demand.join(product_master,$"sales_demand.INVENTORY_ITEM_ID"
=== $"product_master.INVENTORY_ITEM_ID","inner")


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
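
For reference, the $"sales_demand.INVENTORY_ITEM_ID" form only resolves if the
frames are registered or aliased under those names; a variant that avoids that
(assuming Spark 2.0) references the columns through the DataFrames themselves:

  val df = sales_demand.join(
    product_master,
    sales_demand("INVENTORY_ITEM_ID") === product_master("INVENTORY_ITEM_ID"),
    "inner")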


Re: Spark 2.0 - Join statement compile error

2016-08-22 Thread Deepak Sharma
Hi Subhajit
Try this in your join:
val df =
sales_demand.join(product_master,sales_demand.$"INVENTORY_ITEM_ID"
=== product_master.$"INVENTORY_ITEM_ID","inner")

On Tue, Aug 23, 2016 at 2:30 AM, Subhajit Purkayastha 
wrote:

> All,
>
> I have the following dataFrames and the temp table.
>
> I am trying to create a new DF , the following statement is not compiling
>
> val df = sales_demand.join(product_master,(
> sales_demand.INVENTORY_ITEM_ID==product_master
> .INVENTORY_ITEM_ID),joinType="inner")
>
> What am I doing wrong?
>
> ==Code===
>
> var sales_order_sql_stmt = s"""SELECT ORDER_NUMBER , INVENTORY_ITEM_ID,
> ORGANIZATION_ID,
>   from_unixtime(unix_timestamp(SCHEDULE_SHIP_DATE,'yyyy-MM-dd'),
> 'yyyy-MM-dd') AS schedule_date
>   FROM sales_order_demand
>   WHERE unix_timestamp(SCHEDULE_SHIP_DATE,'yyyy-MM-dd') >= $
> planning_start_date  limit 10"""
>
> val sales_demand = spark.sql (sales_order_sql_stmt)
>
> //print the data
> sales_demand.collect().foreach { println }
>
> val product_sql_stmt = "select SEGMENT1,INVENTORY_ITEM_ID,ORGANIZATION_ID
> from product limit 10"
> val product_master = spark.sql (product_sql_stmt)
>
> //print the data
> product_master.collect().foreach { println }
>
> val df = sales_demand.join(product_master,(
> sales_demand.INVENTORY_ITEM_ID==product_master
> .INVENTORY_ITEM_ID),joinType="inner")
>
>
>
>
>
>
>
>spark.stop()
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Apache Spark toDebugString producing different output for python and scala repl

2016-08-15 Thread DEEPAK SHARMA
Hi All,


Below is a small piece of code run in the Scala and Python REPLs in Apache
Spark. However, I am getting different output in the two languages when I execute
toDebugString. I am using the Cloudera QuickStart VM.

PYTHON

rdd2 = 
sc.textFile('file:/home/training/training_materials/data/frostroad.txt').map(lambda
 x:x.upper()).filter(lambda x : 'THE' in x)

print rdd2.toDebugString()
(1) PythonRDD[56] at RDD at PythonRDD.scala:42 []
 |  file:/home/training/training_materials/data/frostroad.txt 
MapPartitionsRDD[55] at textFile at NativeMethodAccessorImpl.java:-2 []
 |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[54] at 
textFile at ..

SCALA

 val rdd2 = 
sc.textFile("file:/home/training/training_materials/data/frostroad.txt").map(x 
=> x.toUpperCase()).filter(x => x.contains("THE"))



rdd2.toDebugString
res1: String =
(1) MapPartitionsRDD[3] at filter at <console>:21 []
 |  MapPartitionsRDD[2] at map at <console>:21 []
 |  file:/home/training/training_materials/data/frostroad.txt 
MapPartitionsRDD[1] at textFile at <console>:21 []
 |  file:/home/training/training_materials/data/frostroad.txt HadoopRDD[0] at 
textFile at <


Also, one of the Cloudera slides says that the default number of partitions is 2; however, it is 1 here 
(looking at the output of toDebugString).
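
Two things may help explain this, and both are easy to check in the same shells. The Python lineage showing a single PythonRDD on top is expected: PySpark pipelines the Python lambdas for map and filter into one PythonRDD, while the Scala plan keeps a MapPartitionsRDD per transformation, so the two toDebugString outputs will not match line for line. On the partition count, textFile uses sc.defaultMinPartitions, which is min(defaultParallelism, 2), so a single-core local master can legitimately report 1. A quick check you can paste into the Scala shell (the commented values are illustrative, not taken from the VM):

println(sc.defaultParallelism)    // e.g. 1 on a single-core local master
println(sc.defaultMinPartitions)  // min(defaultParallelism, 2)

// Request a minimum number of partitions explicitly to rule the default out.
val rdd3 = sc.textFile(
  "file:/home/training/training_materials/data/frostroad.txt", 2)
println(rdd3.partitions.size)     // typically 2 for this call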


Appreciate any help.


Thanks

Deepak Sharma


Use cases around image/video processing in spark

2016-08-10 Thread Deepak Sharma
Hi
Does anyone use, or know of, a GitHub repo that can help me get started
with image and video processing using Spark?
The images/videos will be stored in S3, and I am planning to use S3 with
Spark.
In this case, how will Spark achieve distributed processing?
Any code base or references are really appreciated.
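
One minimal starting sketch (not a vetted repo): Spark parallelizes over the S3 object keys and reads each file's bytes on whichever executor owns that partition, which is where the distribution comes from. The s3a bucket/path and the processImage stub below are made-up placeholders, and the job assumes the hadoop-aws / AWS SDK jars plus your S3 credentials are configured:

import org.apache.spark.{SparkConf, SparkContext}

object ImageJob {
  // Hypothetical per-image function; swap in a real decoder / CV library here.
  def processImage(bytes: Array[Byte]): Int = bytes.length

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ImageJob"))

    // binaryFiles returns an RDD[(path, PortableDataStream)]; each file is
    // materialized lazily on the executor that processes its partition.
    val images = sc.binaryFiles("s3a://my-bucket/images/")

    val results = images.mapValues(stream => processImage(stream.toArray()))
    results.take(10).foreach(println)

    sc.stop()
  }
}

For video, the same pattern applies, but very large objects may need to be chunked or pre-split so a single file does not end up as one task on one executor.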

-- 
Thanks
Deepak


Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread Deepak Sharma
Can you please post the code snippet and the error you are getting ?

-Deepak

On 9 Aug 2016 12:18 am, "manish jaiswal"  wrote:

> Hi,
>
> I am not able to read data from a Hive transactional table using Spark SQL.
> (I don't want to read via Hive JDBC.)
>
>
>
> Please help.
>


Re: Spark join and large temp files

2016-08-08 Thread Deepak Sharma
Register your dataframes as temp tables and then try the join on the temp
tables.
This should resolve your issue.
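
A minimal sketch of that suggestion, assuming a Spark 1.6-era sqlContext and the two frames a and b from the question (on Spark 2.x, createOrReplaceTempView replaces registerTempTable; the table names are made up):

a.registerTempTable("a_tbl")
b.registerTempTable("b_tbl")

// Same right-outer semantics as a.join(b, Seq("id"), "right_outer"):
// keep every row of b, pulling Name from a where the id matches.
val joined = sqlContext.sql(
  """SELECT b.id, b.Number, a.Name
    |FROM a_tbl a
    |RIGHT OUTER JOIN b_tbl b ON a.id = b.id""".stripMargin)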

Thanks
Deepak

On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab  wrote:

> Hello,
> We have two parquet inputs of the following form:
>
> a: id:String, Name:String  (1.5TB)
> b: id:String, Number:Int  (1.3GB)
>
> We need to join these two to get (id, Number, Name). We've tried two
> approaches:
>
> a.join(b, Seq("id"), "right_outer")
>
> where a and b are dataframes. We also tried taking the rdds, mapping them
> to pair rdds with id as the key, and then joining. What we're seeing is
> that temp file usage is increasing on the join stage, and filling up our
> disks, causing the job to crash. Is there a way to join these two data sets
> without well...crashing?
>
> Note, the ids are unique, and there's a one to one mapping between the two
> datasets.
>
> Any help would be appreciated.
>
> -Ashic.
>
>
>
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Best practises around spark-scala

2016-08-08 Thread Deepak Sharma
Thanks Vaquar.
My intention is to find something that can help stress-test the code in
Spark, measure its performance, and suggest some improvements.
Is there any such framework or tool I can use here?

Thanks
Deepak

On 8 Aug 2016 9:14 pm, "vaquar khan" <vaquar.k...@gmail.com> wrote:

> I found following links are good as I am using same.
>
> http://spark.apache.org/docs/latest/tuning.html
>
> https://spark-summit.org/2014/testing-spark-best-practices/
>
> Regards,
> Vaquar khan
>
> On 8 Aug 2016 10:11, "Deepak Sharma" <deepakmc...@gmail.com> wrote:
>
>> Hi All,
>> Can anyone please share any documents that exist around spark-scala
>> best practices?
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>


Best practises around spark-scala

2016-08-08 Thread Deepak Sharma
Hi All,
Can anyone please share any documents that exist around spark-scala
best practices?

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: What are the configurations needs to connect spark and ms-sql server?

2016-08-08 Thread Deepak Sharma
Hi Devi
Please make sure the JDBC jar is on the Spark classpath.
With spark-submit, you can use the --jars option to specify the SQL Server
JDBC jar.
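
For example (the paths and jar name below are placeholders; use the Microsoft JDBC driver jar you actually downloaded):

spark-submit \
  --jars /path/to/sqljdbc42.jar \
  --driver-class-path /path/to/sqljdbc42.jar \
  your-app.jar

--jars ships the driver jar to the executors, while --driver-class-path makes it visible to the driver, which is where the ClassNotFoundException in your trace is thrown.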

Thanks
Deepak

On Mon, Aug 8, 2016 at 1:14 PM, Devi P.V  wrote:

> Hi all,
>
> I am trying to write a Spark dataframe into MS SQL Server. I have tried
> using the following code:
>
> val sqlprop = new java.util.Properties
> sqlprop.setProperty("user", "uname")
> sqlprop.setProperty("password", "pwd")
> sqlprop.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
> val url = "jdbc:sqlserver://samplesql.amazonaws.com:1433/dbName"
> val dfWriter = df.write
> dfWriter.jdbc(url, "tableName", sqlprop)
>
> But I got following error
>
> Exception in thread "main" java.lang.ClassNotFoundException:
> com.microsoft.sqlserver.jdbc.SQLServerDriver
>
> What configuration is needed to connect to MS SQL Server? I have not found
> any library dependencies for connecting Spark and MS SQL.
>
> Thanks
>



-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net

