Re: can we all help use our expertise to create an IT solution for Covid-19

2020-03-26 Thread hemant singh
Hello Mich,

I will be more than happy to contribute to this.

Thanks,
Hemant


On Thu, Mar 26, 2020 at 7:11 PM Mich Talebzadeh 
wrote:

> Hi all,
>
> Do you think we can create a global solution in the cloud using
> volunteers like us and third-party employees? What I have in mind is a
> comprehensive real-time solution that takes data from various countries and
> universities, pushes it into a fast database through Kafka and Spark, and
> uses it downstream for greater analytics. I am sure the likes of Google will
> provide free storage, and many vendors will likely grab the opportunity.
>
> We can then donate this to WHO or others, and we can make it very modular
> through microservices etc.
>
> I hope this does not sound futuristic.
>
> Regards,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: RESTful Operations

2020-01-20 Thread hemant singh
Livy supports both statement-based execution (Scala) and batch processing (a
packaged code jar). I think the statement-based approach is what you might
want to look at first.
The data has to reside in some source.
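
A rough sketch of the statement-based flow against Livy's REST API; the host,
port, and the Scala snippet submitted are placeholders, and in practice you
would poll the session until it is idle before posting statements:

import json
import urllib.request

LIVY = "http://livy-host:8998"   # placeholder Livy endpoint

def post(path, payload):
    req = urllib.request.Request(
        LIVY + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())

# Create an interactive Scala session
session = post("/sessions", {"kind": "spark"})

# Once the session state is 'idle' (poll GET /sessions/<id>), submit a statement
statement = post("/sessions/%d/statements" % session["id"],
                 {"code": 'spark.read.parquet("/data/input").count()'})
print(statement)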

Thanks,
Hemant

On Mon, 20 Jan 2020 at 2:04 PM,  wrote:

> Sorry, I didn't explain it well. Livy seems to me similar to the job server:
> interacting through a RESTful API for job submission. The task I want to
> achieve is more like a general RESTful API where the user would provide some
> parameters for business operations (a bit like readText or a custom data
> source with embedded HTTP). So when Spark receives those requests, e.g. in a
> custom data source, the Spark job can then perform normal operations such as
> aggregation, window functions, and so on.
>
> Thank you for the suggestions and help.
>
>
>
>
> Jan 20, 2020, 00:26 by chris.t...@gmail.com:
>
> Maybe something like Livy, otherwise roll your own REST API and have it
> start a Spark job.
>
> On Mon, 20 Jan 2020 at 06:55,  wrote:
>
> I am new to Spark. The task I want to accomplish is to let a client send HTTP
> requests, then have Spark process those requests for further operations.
> However, searching Spark's website docs
>
>
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
> 
>
> https://spark.apache.org/docs/latest/
>
> I do not find any place mentioning this. Also, most of the internet
> results are more related to the Spark job server.
>
> Any places I should start if I want to use Spark for such purpose?
>
> Thanks
>
>
>
> --
> Chris
>
>
>


Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-05 Thread hemant singh
You can try repartitioning the data; if the data is skewed then you may
need to salt the keys for better partitioning.
Are you using coalesce or any other function which brings the data onto fewer
nodes? Window functions also incur shuffling, which could be an issue.
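
A rough PySpark sketch of the salting idea; the table, column names ("key",
"amount"), and the salt range are made up for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")        # placeholder for your skewed DataFrame

N = 20   # number of salt buckets; tune to the degree of skew

# Spread a hot key across N sub-keys before the wide operation
salted = df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"), (F.rand() * N).cast("int").cast("string")))

# Aggregate on the salted key first, then re-aggregate on the original key,
# so no single task holds the whole hot key
partial = salted.groupBy("key", "salted_key").agg(F.sum("amount").alias("partial_sum"))
final = partial.groupBy("key").agg(F.sum("partial_sum").alias("total"))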

On Mon, 6 Jan 2020 at 9:49 AM, Rishi Shah  wrote:

> Thanks Hemant, the underlying data volume increased from 550GB to 690GB and
> now the same job doesn't succeed. I tried increasing executor memory to
> 20G as well; it still fails. I am running this in Databricks and start the
> cluster with 20G assigned to the spark.executor.memory property.
>
> Also some more information on the job, I have about 4 window functions on
> this dataset before it gets written out.
>
> Any other ideas?
>
> Thanks,
> -Shraddha
>
> On Sun, Jan 5, 2020 at 11:06 PM hemant singh  wrote:
>
>> You can try increasing the executor memory; generally this error comes
>> when there is not enough memory in individual executors.
>> The job may be getting completed because when tasks are re-scheduled they
>> go through.
>>
>> Thanks.
>>
>> On Mon, 6 Jan 2020 at 5:47 AM, Rishi Shah 
>> wrote:
>>
>>> Hello All,
>>>
>>> One of my jobs keeps getting into this situation where 100s of tasks
>>> keep failing with the below error, but the job eventually completes.
>>>
>>> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384
>>> bytes of memory
>>>
>>> Could someone advise?
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>
> --
> Regards,
>
> Rishi Shah
>


Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-05 Thread hemant singh
You can try increasing the executor memory; generally this error comes when
there is not enough memory in individual executors.
The job may be getting completed because when tasks are re-scheduled they go
through.

Thanks.

On Mon, 6 Jan 2020 at 5:47 AM, Rishi Shah  wrote:

> Hello All,
>
> One of my jobs keeps getting into this situation where 100s of tasks keep
> failing with the below error, but the job eventually completes.
>
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384
> bytes of memory
>
> Could someone advise?
>
> --
> Regards,
>
> Rishi Shah
>


Re: Spark onApplicationEnd run multiple times during the application failure

2019-11-21 Thread hemant singh
This is how it works: it is an application-level retry. After 4 task attempts
fail, the whole application fails and the application is then retried.

Thanks,
Hemant

On Thu, 21 Nov 2019 at 7:24 PM, Jiang, Yi J (CWM-NR) 
wrote:

> Hello,
>
> Thank you for replying. It is a retry, but why does the retry happen at the
> whole-application level?
>
> To my understanding, the retry could be done at the job level.
>
> Jacky
>
>
>
>
>
> *From:* hemant singh [mailto:hemant2...@gmail.com]
> *Sent:* November 21, 19 3:12 AM
> *To:* Jiang, Yi J (CWM-NR) 
> *Cc:* Martin, Phil ; user@spark.apache.org
> *Subject:* Re: Spark onApplicationEnd run multiple times during the
> application failure
>
>
>
> Could it be because of a retry?
>
>
>
> Thanks
>
>
>
> On Thu, 21 Nov 2019 at 3:35 AM, Jiang, Yi J (CWM-NR) <
> yi.j.ji...@rbc.com.invalid> wrote:
>
> Hello
>
> We are running into an issue.
>
> We have customized the SparkListener class and added it to the Spark
> context. But when the Spark job fails, we find the “onApplicationEnd”
> function is triggered twice.
>
> Is it supposed to be triggered just once when the Spark job fails?
> Because the Spark application is only launched once, how can it be
> triggered twice when it fails?
>
> Please let us know
>
> Thank you
>
>
>
>
>
>


Re: Spark onApplicationEnd run multiple times during the application failure

2019-11-21 Thread hemant singh
Could it be because of a retry?

Thanks

On Thu, 21 Nov 2019 at 3:35 AM, Jiang, Yi J (CWM-NR)
 wrote:

> Hello
>
> We are running into an issue.
>
> We have customized the SparkListener class and added it to the Spark
> context. But when the Spark job fails, we find the “onApplicationEnd”
> function is triggered twice.
>
> Is it supposed to be triggered just once when the Spark job fails?
> Because the Spark application is only launched once, how can it be
> triggered twice when it fails?
>
> Please let us know
>
> Thank you
>
>
>
>
>
>


Re: Spark - configuration setting doesn't work

2019-10-27 Thread hemant singh
You should add the configurations while creating the session; I don’t think
you can override them once the session is created. A few can be, though.

Thanks,
Hemant

On Sun, 27 Oct 2019 at 11:02 AM, Chetan Khatri 
wrote:

> Could someone please help me.
>
> On Thu, Oct 17, 2019 at 7:29 PM Chetan Khatri 
> wrote:
>
>> Hi Users,
>>
>> I am setting the Spark configuration in the below way:
>>
>> val spark = SparkSession.builder().appName(APP_NAME).getOrCreate()
>>
>> spark.conf.set("spark.speculation", "false")
>> spark.conf.set("spark.broadcast.compress", "true")
>> spark.conf.set("spark.sql.broadcastTimeout", "36000")
>> spark.conf.set("spark.network.timeout", "2500s")
>> spark.conf.set("spark.serializer", 
>> "org.apache.spark.serializer.KryoSerializer")
>> spark.conf.set("spark.driver.memory", "10g")
>> spark.conf.set("spark.executor.memory", "10g")
>>
>> import spark.implicits._
>>
>>
>> and submitting the Spark job with spark-submit, but none of the above
>> configurations are getting reflected in the job; I have checked the Spark UI.
>>
>> I know that setting these up while creating the Spark session works well:
>>
>>
>> val spark = SparkSession.builder().appName(APP_NAME)
>>   .config("spark.network.timeout", "1500s")
>>   .config("spark.broadcast.compress", "true")
>>   .config("spark.sql.broadcastTimeout", "36000")
>>   .getOrCreate()
>>
>> import spark.implicits._
>>
>>
>> Can someone please throw light?
>>
>>


Re: Read text file row by row and apply conditions

2019-09-30 Thread hemant singh
You can use the CSV reader with '|' as the delimiter to split the data and
create a dataframe on top of the file data.
As a second step, filter the dataframe on the column value (indicator type = A,
D, etc.) and put each slice into its table. For saving to tables you can use
the DataFrameWriter (not sure what your destination DB type is here).

You can keep the part where you filter data based on the column value generic:
read the configuration, say "col value | tableName", from a config file or a
metadata table, and write the data to the destination.
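
A rough PySpark sketch of that flow; the file path, the JDBC destination, and
the credentials are placeholders, since the actual destination DB is unknown:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pipe-delimited file; columns come out as _c0, _c1, _c2, ...
df = spark.read.option("delimiter", "|").csv("/path/to/input.txt")

jdbc_url = "jdbc:postgresql://dbhost:5432/mydb"        # placeholder destination
props = {"user": "etl_user", "password": "secret"}     # placeholder credentials

# The 3rd column (_c2) carries the indicator type; route each slice to its table
for indicator, table in [("A", "Table1"), ("D", "TableB"), ("U", "Table3")]:
    (df.filter(df["_c2"] == indicator)
       .write.mode("append")
       .jdbc(jdbc_url, table, properties=props))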

Thanks.

On Mon, Sep 30, 2019 at 8:56 AM swetha kadiyala 
wrote:

> dear friends,
>
> I am new to spark. can you please help me to read the below text file
> using spark and scala.
>
> Sample data
>
> bret|lee|A|12345|ae545|gfddfg|86786786
> 142343345||D|ae342
> 67567|6|U|aadfsd|34k4|84304|020|sdnfsdfn|3243|benej|32432|jsfsdf|3423
> 67564|67747|U|aad434|3435|843454|203|sdn454dfn|3233|gdfg|34325|sdfsddf|7657
>
>
> I receive the indicator type in the 3rd column of each row. If the
> indicator type is A, then I need to store that particular row's data in a
> table called Table1.
> If the indicator type is D then I have to store the data in a separate table
> called TableB, and likewise if the indicator type is U then I have to store
> all such rows' data in a separate table called Table3.
>
> Can anyone help me read the file row by row, split the columns, apply the
> condition based on the indicator type, and store the columns' data into the
> respective tables?
>
> Thanks,
> Swetha
>


Spark foreach retry

2019-05-09 Thread hemant singh
Hi,

I want to know what happens if foreach fails for some record. Does foreach
retry like any general task, i.e. does it retry 4 times?
Say I am pushing some payload to an API: if it fails for some record, will that
record get retried, or is it bypassed while the rest of the records are processed?

Thanks,
Hemant


Re: write files of a specific size

2019-05-05 Thread hemant singh
Based on the size of the output data you can do the math on how many files you
will need to produce 100MB files. Once you have the number of files you can do
a coalesce or repartition, depending on whether your job writes more or fewer
output partitions.
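
A rough PySpark sketch of that math; the source table, the estimated total
output size, and the output path are assumptions for illustration:

import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("my_output")                   # placeholder for the job's result

target_bytes = 100 * 1024 * 1024                # ~100 MB per file
total_bytes = 55 * 1024 * 1024 * 1024           # assumed/estimated total output size

num_files = max(1, math.ceil(total_bytes / target_bytes))

# coalesce when reducing the partition count; repartition if you need to increase it
df.coalesce(num_files).write.mode("overwrite").parquet("/path/to/output")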

On Sun, 5 May 2019 at 2:21 PM, rajat kumar 
wrote:

> Hi All,
> My Spark SQL job produces output as per the default partitioning and creates
> N files.
> I want each file in the final result to be about 100MB in size.
>
> how can I do it ?
>
> thanks
> rajat
>
>


Re: Spark Kafka Batch Write guarantees

2019-04-01 Thread hemant singh
Thanks Shixiong, I read in the documentation as well that duplicates might
exist because of task retries.

On Mon, 1 Apr 2019 at 9:43 PM, Shixiong(Ryan) Zhu 
wrote:

> The Kafka source doesn’t support transaction. You may see partial data or
> duplicated data if a Spark task fails.
>
> On Wed, Mar 27, 2019 at 1:15 AM hemant singh  wrote:
>
>> We are using Spark batch to write a DataFrame to a Kafka topic, via the
>> write function with write.format("kafka").
>> Does Spark provide a guarantee similar to the one it provides when saving a
>> dataframe to disk, i.e. that partial data is not written to Kafka: either the
>> full dataframe is saved or, if the job fails, no data is written to the topic?
>>
>> Thanks.
>>
> --
>
> Best Regards,
> Ryan
>


Spark Kafka Batch Write guarantees

2019-03-27 Thread hemant singh
We are using Spark batch to write a DataFrame to a Kafka topic, via the write
function with write.format("kafka").
Does Spark provide a guarantee similar to the one it provides when saving a
dataframe to disk, i.e. that partial data is not written to Kafka: either the
full dataframe is saved or, if the job fails, no data is written to the topic?
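
For reference, the batch write pattern in question looks roughly like this in
PySpark; the source table, broker list, and topic name are placeholders, and
the Spark Kafka connector package must be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("payload")          # placeholder for the DataFrame to publish

(df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
   .option("topic", "my_topic")                         # placeholder topic
   .save())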

Thanks.


Re: Spark DataFrame/DataSet Wide Transformations

2019-02-06 Thread hemant singh
The same concept applies to DataFrames as to RDDs with respect to
transformations; both are distributed datasets.
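
One way to see this in practice, as a small PySpark sketch on a trivial
DataFrame: Exchange operators in the physical plan mark shuffles (wide
transformations), and dropping to the underlying RDD exposes the lineage, as
you mentioned.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)              # trivial example DataFrame with an "id" column

# Look for Exchange nodes in the physical plan: they mark wide transformations
df.orderBy("id").explain()

# The RDD lineage shows the parent dependencies and shuffle boundaries
print(df.orderBy("id").rdd.toDebugString())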

Thanks

On Thu, Feb 7, 2019 at 8:51 AM Faiz Chachiya  wrote:

> Hello Team,
>
> With RDDs it is pretty clear which operations result in wide
> transformations, and there are also options available to find out the parent
> dependencies.
>
> I have been struggling to do the same with DataFrame/Dataset. I need your
> help in finding out which operations may lead to wide transformations
> (like orderBy) and whether there is a way to find out the parent dependencies.
>
> There is one way to find out parent dependencies by converting the DF/DS
> to RDD and invoke the dependencies.
>
> I hope my question is clear and would request your help with it.
>
> Thanks,
> Faiz
>


Re: Create all the combinations of a groupBy

2019-01-23 Thread hemant singh
Check the rollup and cube functions in Spark SQL.
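
A quick PySpark sketch of what rollup and cube produce, using the example data
from your question (the column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 2)],
                           ["key", "value"])

# rollup/cube aggregate over the grouping-set combinations of the grouping columns
df.rollup("key", "value").count().orderBy("key", "value").show()
df.cube("key", "value").count().orderBy("key", "value").show()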

On Wed, 23 Jan 2019 at 10:47 PM, Pierremalliard <
pierre.de-malli...@capgemini.com> wrote:

> Hi,
>
> I am trying to generate a dataframe of all combinations that share the same
> key, using PySpark.
>
> example:
>
> (a,1)
> (a,2)
> (a,3)
> (b,1)
> (b,2)
>
> should return:
>
> (a, 1 , 2)
> (a, 1 , 3)
> (a, 2, 3)
> (b, 1 ,2)
>
>
> i want to do something like df.groupBy('key').combinations().apply(...)
>
> any suggestions are welcome !!!
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: use spark cluster in java web service

2018-11-01 Thread hemant singh
Why don't you explore Livy? You can use its REST API to submit the jobs:
https://community.hortonworks.com/articles/151164/how-to-submit-spark-application-through-livy-rest.html
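
A minimal sketch of a batch submission through Livy's /batches endpoint; the
host, jar path, class name, and arguments are placeholders:

import json
import urllib.request

payload = {
    "file": "hdfs:///jobs/my-spark-job.jar",   # application jar (or a .py file)
    "className": "com.example.MySparkJob",     # omit for a PySpark file
    "args": ["2018-11-01"],
}

req = urllib.request.Request(
    "http://livy-host:8998/batches",           # placeholder Livy endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"})

# The response carries the batch id and state, which you can then poll
print(json.loads(urllib.request.urlopen(req).read()))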



On Thu, Nov 1, 2018 at 12:52 PM 崔苗 (Data & AI Product Development Dept.) <0049003...@znv.com> wrote:

> Hi,
> we want to use Spark in our Java web service, computing data on the Spark
> cluster according to requests. Now we have two problems:
> 1. how to get a SparkSession for a remote Spark cluster (Spark on YARN mode);
> we want to keep one SparkSession to execute all data computation;
> 2. how to submit to a remote Spark cluster in Java code instead of
> spark-submit, as we want to execute Spark code in the response server.
>
> Thanks for any replies
> 0049003208
> 0049003...@znv.com
>
>


Re: How to do efficient self join with Spark-SQL and Scala

2018-09-21 Thread hemant singh
You can use the Spark dataframe 'when'/'otherwise' clause to replace the SQL
case statement.

This piece will need to be calculated beforehand:

'select student_id from tbl_student where candidate_id = c.candidate_id and
approval_id = 2 and academic_start_date is null'

Take the count of the above DF after joining the tbl_student and tbl_network
DFs based on the condition above.

Overall, you can join all three tables first and run the rest of the query on
the same dataframe.
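
A rough sketch of the joins plus the when/otherwise flag, shown in PySpark for
brevity (the Scala when/otherwise API has the same shape); the not-exists
sub-query would become an additional left-anti join and is left out here:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# assumed: the three tables are already available as registered tables/DataFrames
student = spark.table("tbl_student")
teacher = spark.table("tbl_teacher")
network = spark.table("tbl_network")

joined = (student.alias("a")
          .join(teacher.alias("b"), F.col("a.candidate_id") == F.col("b.candidate_id"))
          .join(network.alias("c"), F.col("c.candidate_id") == F.col("a.candidate_id")))

# when/otherwise replaces the SQL case expression
result = joined.withColumn(
    "is_current",
    F.when((F.col("a.approval_id") == 2) & F.col("a.academic_start_date").isNull(), "Yes")
     .otherwise("No"))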


On Sat, Sep 22, 2018 at 1:08 AM Chetan Khatri 
wrote:

> Dear Spark Users,
>
> I came across a slightly weird MSSQL query to replace with Spark, and I have
> no clue how to do it in an efficient way with Scala + Spark SQL. Can someone
> please throw light? I can create a view of the DataFrame and do it as
> *spark.sql*(query), but I would like to do it the Scala + Spark way.
>
> Sample:
>
> select a.student_id, a.candidate_id, a.student_name, a.student_standard,
>        a.student_city, b.teacher_name, a.student_status, a.approval_id,
>        case when a.approval_id = 2 and (a.academic_start_date is null
>                 and not exists (select student_id from tbl_student
>                                 where candidate_id = c.candidate_id
>                                   and approval_id = 2
>                                   and academic_start_date is null))
>             then 'Yes' else 'No' end as is_current
> from tbl_student a
> inner join tbl_teacher b on a.candidate_id = b.candidate_id
> inner join tbl_network c on c.candidate_id = a.candidate_id
>
> Thank you.
>
>


Re: Launch a pyspark Job From UI

2018-06-11 Thread hemant singh
You can explore Livy https://dzone.com/articles/quick-start-with-apache-livy

On Mon, Jun 11, 2018 at 3:35 PM, srungarapu vamsi 
wrote:

> Hi,
>
> I am looking for applications where we can trigger spark jobs from UI.
> Are there any such applications available?
>
> I have checked Spark-jobserver using which we can expose an api to submit
> a spark application.
>
> Are there any other alternatives using which i can submit pyspark jobs
> from UI ?
>
> Thanks,
> Vamsi
>


Re: Aggregation of Streaming UI Statistics for multiple jobs

2018-05-27 Thread hemant singh
You can explore the REST API:
https://spark.apache.org/docs/2.0.2/monitoring.html#rest-api
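
A small sketch of pulling those numbers in code instead of opening 100 UI
windows; the host/port is a placeholder (driver UI or history server), the
endpoint paths are the ones listed on the monitoring page, and the exact field
names in the streaming statistics payload may vary by Spark version:

import json
import urllib.request

BASE = "http://driver-or-history-host:4040/api/v1"      # placeholder host/port

def get(path):
    return json.loads(urllib.request.urlopen(BASE + path).read())

# One entry per application; aggregate across all of them in your script
for app in get("/applications"):
    stats = get("/applications/%s/streaming/statistics" % app["id"])
    print(app["name"],
          stats.get("avgInputRate"),        # throughput
          stats.get("avgProcessingTime"),   # latency component
          stats.get("avgSchedulingDelay"))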

On Sun, May 27, 2018 at 10:18 AM, skmishra 
wrote:

> Hi,
>
> I am working on a streaming use case where I need to run multiple spark
> streaming applications at the same time and measure the throughput and
> latencies. The spark UI provides all the statistics, but if I want to run
> more than 100 applications at the same time then I do not have any clue on
> how to aggregate these statistics. Opening 100 windows and collecting all
> the data does not seem to be an easy job. Hence, if you could provide any
> help on how to collect these statistics from code, then I can write a
> script
> to run my experiment. Any help is greatly appreciated. Thanks in advance.
>
> Regards,
> Sitakanta Mishra
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark 2.3 Tree Error

2018-05-26 Thread hemant singh
Per the SQL plan, this is where it is failing:

Attribute(s) with the same name appear in the operation:
fnlwgt_bucketed. Please check if the right attribute(s) are used.;



On Sat, May 26, 2018 at 6:16 PM, Aakash Basu 
wrote:

> Hi,
>
> This query goes one step further than the query in this link
> .
> In this scenario, when I add 1 or 2 more columns to be processed, Spark
> throws an ERROR, printing the physical plan of the queries.
>
> It says *Resolved attribute(s) fnlwgt_bucketed#152530 missing*, which seems
> untrue: if I run the same code on fewer than 3 columns, where this is one of
> the columns, it works like a charm, so I can reasonably assume it is not a bug
> in my query or code.
>
> Is it then an out-of-memory error? My assumption is that, internally, since
> there are many registered tables in memory, they are getting deleted due to
> the overflow of data; this is totally my assumption. Any insight on this?
> Did any of you face an issue like this?
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.: 
> org.apache.spark.sql.AnalysisException: Resolved attribute(s) 
> fnlwgt_bucketed#152530 missing from 
> occupation#17,high_income#25,fnlwgt#13,education#14,marital-status#16,relationship#18,workclass#12,sex#20,id_num#10,native_country#24,race#19,education-num#15,hours-per-week#23,age_bucketed#152432,capital-loss#22,age#11,capital-gain#21,fnlwgt_bucketed#99009
>  in operator !Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
> education#14, education-num#15, marital-status#16, occupation#17, 
> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
> hours-per-week#23, native_country#24, high_income#25, age_bucketed#152432, 
> fnlwgt_bucketed#152530, if (isnull(cast(hours-per-week#23 as double))) null 
> else if (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else 
> UDF:bucketizer_0(cast(hours-per-week#23 as double)) AS 
> hours-per-week_bucketed#152299]. Attribute(s) with the same name appear in 
> the operation: fnlwgt_bucketed. Please check if the right attribute(s) are 
> used.;;Project [id_num#10, age#11, workclass#12, fnlwgt#13, education#14, 
> education-num#15, marital-status#16, occupation#17, relationship#18, race#19, 
> sex#20, capital-gain#21, capital-loss#22, hours-per-week#23, 
> native_country#24, high_income#25, age_bucketed#48257, fnlwgt_bucketed#99009, 
> hours-per-week_bucketed#152299, age_bucketed_WoE#152431, WoE#152524 AS 
> fnlwgt_bucketed_WoE#152529]+- Join Inner, (fnlwgt_bucketed#99009 = 
> fnlwgt_bucketed#152530)
>:- SubqueryAlias bucketed
>:  +- SubqueryAlias a
>: +- Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
> education#14, education-num#15, marital-status#16, occupation#17, 
> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
> hours-per-week#23, native_country#24, high_income#25, age_bucketed#48257, 
> fnlwgt_bucketed#99009, hours-per-week_bucketed#152299, WoE#152426 AS 
> age_bucketed_WoE#152431]
>:+- Join Inner, (age_bucketed#48257 = age_bucketed#152432)
>:   :- SubqueryAlias bucketed
>:   :  +- SubqueryAlias a
>:   : +- Project [id_num#10, age#11, workclass#12, fnlwgt#13, 
> education#14, education-num#15, marital-status#16, occupation#17, 
> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
> hours-per-week#23, native_country#24, high_income#25, age_bucketed#48257, 
> fnlwgt_bucketed#99009, if (isnull(cast(hours-per-week#23 as double))) null 
> else if (isnull(cast(hours-per-week#23 as double))) null else if 
> (isnull(cast(hours-per-week#23 as double))) null else 
> UDF:bucketizer_0(cast(hours-per-week#23 as double)) AS 
> hours-per-week_bucketed#152299]
>:   :+- Project [id_num#10, age#11, workclass#12, 
> fnlwgt#13, education#14, education-num#15, marital-status#16, occupation#17, 
> relationship#18, race#19, sex#20, capital-gain#21, capital-loss#22, 
> hours-per-week#23, native_country#24, high_income#25, age_bucketed#48257, if 
> (isnull(cast(fnlwgt#13 as double))) null else if (isnull(cast(fnlwgt#13 as 
> double))) null else if (isnull(cast(fnlwgt#13 as double))) null else 
> UDF:bucketizer_0(cast(fnlwgt#13 as double)) 

Re: pyspark execution

2018-04-17 Thread hemant singh
If it contains only SQL then you can use a function as below (the hivevar
names must match the placeholders referenced inside your .sql file):

import subprocess

def run_sql(sql_file_path, db_name, location):
    # Pass values for the --hivevar placeholders used inside the .sql file
    subprocess.call(["spark-sql", "-S",
                     "--hivevar", "DB_NAME=" + db_name,
                     "--hivevar", "LOCATION=" + location,
                     "-f", sql_file_path])

If you have other pieces like Spark code, and not only SQL, in that file:

Write a parse function which parses your SQL and replaces the placeholders
(like the DB name) in your SQL, and then execute the newly formed SQL.
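
A rough sketch of that parse-and-run approach; the ${DB_NAME}-style placeholder
convention and the semicolon-separated file layout are assumptions:

def run_sql_file(spark, sql_file_path, params):
    with open(sql_file_path) as f:
        text = f.read()
    # Substitute placeholders such as ${DB_NAME} before execution
    for name, value in params.items():
        text = text.replace("${%s}" % name, value)
    # Run each statement through the session's SQL interface
    return [spark.sql(stmt) for stmt in text.split(";") if stmt.strip()]

# e.g. run_sql_file(spark, "etl.sql", {"DB_NAME": "analytics_db"})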

Maintaining your SQL in a separate file does, though, decouple the code and the
SQL and make it easier from a maintenance perspective.

On Tue, Apr 17, 2018 at 8:11 AM, anudeep  wrote:

> Hi All,
>
> I have a python file which I am executing directly with spark-submit
> command.
>
> Inside the python file, I have sql written using hive context.I created a
> generic variable for the  database name inside sql
>
> The problem is : How can I pass the value for this variable dynamically
> just as we give in hive like --hivevar parameter.
>
> Thanks!
> Anudeep
>
>
>
>
>
>
>


Re: Access Table with Spark Dataframe

2018-03-20 Thread hemant singh
See if this helps:
https://stackoverflow.com/questions/42852659/makiing-sql-request-on-columns-containing-dot
i.e. enclosing the column names in backticks ("`").
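
For example, a quick PySpark sketch (the table and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Backticks stop the dot from being parsed as struct-field access
spark.sql("SELECT `customer.name` FROM my_table").show()
# or, on a DataFrame:
# df.select("`customer.name`")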

On Tue, Mar 20, 2018 at 6:47 PM, SNEHASISH DUTTA 
wrote:

> Hi,
>
> I am using Spark 2.2; a table fetched from a database contains a dot (.) in
> one of the column names.
> Whenever I try to select that particular column I get a query analysis
> exception.
>
>
> I have tried creating a temporary table using createOrReplaceTempView()
> and fetching the column's data, but the outcome was the same.
>
> How can this ('.') be escaped while querying?
>
>
> Thanks and Regards,
> Snehasish
>