Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
7)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
>   at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
>   at org.apache.spark.scheduler.Task.run(Task.scala:141)
>   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>   at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>   at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   ... 1 more
>
>
> On Mon, Jul 29, 2024 at 4:34 PM Sadha Chilukoori wrote:
>
>> Hi Mike,
>>
>> I'm not sure about the minimum machine requirements for running Spark,
>> but to run some PySpark scripts (and Jupyter notebooks) on a local
>> machine, I found the following steps to be the easiest.
>>
>>
>> I installed Amazon Corretto and updated the JAVA_HOME variable as
>> instructed here:
>> https://docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list.html
>> (Any other Java works too; I'm just used to Corretto from work.)
>>
>> Then I installed the PySpark module using pip, which enabled me to run
>> PySpark on my machine.
>>
>> -Sadha
>>
>> On Mon, Jul 29, 2024, 12:51 PM mike Jadoo  wrote:
>>
>>> Hello,
>>>
>>> I am trying to run PySpark on my computer without success. I have followed
>>> several different sets of directions from online sources, and it appears
>>> that I need to get a faster computer.
>>>
>>> I wanted to ask: what are some recommended computer specifications for
>>> running PySpark (Apache Spark)?
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thank you,
>>>
>>> Mike
>>>
>>


Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
Hi Mike,

I'm not sure about the minimum machine requirements for running Spark,
but to run some PySpark scripts (and Jupyter notebooks) on a local
machine, I found the following steps to be the easiest.


I installed Amazon Corretto and updated the JAVA_HOME variable as
instructed here:
https://docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list.html
(Any other Java works too; I'm just used to Corretto from work.)

Then I installed the PySpark module using pip, which enabled me to run
PySpark on my machine.
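
As a quick sanity check after those two steps, a small script along these
lines should confirm the setup (the JAVA_HOME path below is only an example;
point it at wherever your JDK actually lives):

import os
from pyspark.sql import SparkSession

# Only needed if JAVA_HOME isn't already set in your shell; example path.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-amazon-corretto")

# Run Spark entirely inside the local JVM -- no cluster required.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("pyspark-smoke-test")
         .getOrCreate())

spark.range(5).show()   # should print ids 0 through 4
spark.stop()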

-Sadha

On Mon, Jul 29, 2024, 12:51 PM mike Jadoo  wrote:

> Hello,
>
> I am trying to run PySpark on my computer without success. I have followed
> several different sets of directions from online sources, and it appears
> that I need to get a faster computer.
>
> I wanted to ask: what are some recommended computer specifications for
> running PySpark (Apache Spark)?
>
> Any help would be greatly appreciated.
>
> Thank you,
>
> Mike
>


Re: 7368396 - Apache Spark 3.5.1 (Support)

2024-06-07 Thread Sadha Chilukoori
Hi Alex,

Spark is open-source software available under the Apache License 2.0
(https://www.apache.org/licenses/); further details can be found on the
FAQ page (https://spark.apache.org/faq.html).

Hope this helps.


Thanks,

Sadha

On Thu, Jun 6, 2024, 1:32 PM SANTOS SOUZA, ALEX wrote:

> Hey guys!
>
>
>
> I am part of the team responsible for software approval at EMBRAER S.A.
> We are currently in the process of approving the Apache Spark 3.5.1
> software and are verifying the licensing of the application.
> Therefore, I would like to kindly request you to answer the questions
> below.
>
> -What type of software? (Commercial, Freeware, Component, etc...)
>  A:
>
> -What is the licensing model for commercial use? (Subscription, Perpetual,
> GPL, etc...)
> A:
>
> -What type of license? (By user, Concurrent, Device, Server or others)?
> A:
>
> -Number of installations allowed per license/subscription?
> A:
>
> -Can it be used in the defense and aerospace industry? (Company that
> manufactures products for national defense)
> A:
>
> -Does the license allow use in any location regardless of the origin of
> the purchase (tax restriction)?
> A:
>
> -Where can I find the End User License Agreement (EULA) for the version in
> question?
> A:
>
>
>
> Thank you very much in advance, and I am at your disposal if you have any
> questions.
>
>
> Att,
>
>
> Alex Santos Souza
>
> Software Asset Management - Embraer
>
> WhatsApp: +55 12 99731-7579
>
> E-mail: alex.santosso...@dxc.com
>
> DXC Technology
>
> São José dos Campos, SP - Brazil
>
>


Re: Spark join produce duplicate rows in resultset

2023-10-22 Thread Sadha Chilukoori
Hi Meena,

Just to clarify: did you intentionally leave out the *on* and *and* keywords
in the join conditions?

Please try this snippet and see if it helps:

select rev.* from rev
inner join customer c
on rev.custumer_id = c.id
inner join product p
on rev.sys = p.sys
and rev.prin = p.prin
and rev.scode = p.bcode

left join item I
on rev.sys = I.sys
and rev.custumer_id = I.custumer_id
and rev.scode = I.scode;
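
If the corrected query still multiplies rows, it is worth checking whether
the item table carries more than one row per join key. A rough sketch,
assuming an active SparkSession named spark and the column names used above:

# Any key that shows up here more than once will fan out the left join.
dupes = spark.sql("""
    SELECT sys, custumer_id, scode, COUNT(*) AS cnt
    FROM item
    GROUP BY sys, custumer_id, scode
    HAVING COUNT(*) > 1
""")
dupes.show()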

Thanks,
Sadha

On Sat, Oct 21, 2023 at 3:21 PM Meena Rajani  wrote:

> Hello all:
>
> I am using Spark SQL to join two tables. To my surprise, I am getting
> redundant rows. What could be the cause?
>
>
> select rev.* from rev
> inner join customer c
> on rev.custumer_id =c.id
> inner join product p
> rev.sys = p.sys
> rev.prin = p.prin
> rev.scode= p.bcode
>
> left join item I
> on rev.sys = i.sys
> rev.custumer_id = I.custumer_id
> rev. scode = I.scode
>
> where rev.custumer_id = '123456789'
>
> The first part of the query returns one row:
>
> select rev.* from rev
> inner join customer c
> on rev.custumer_id =c.id
> inner join product p
> rev.sys = p.sys
> rev.prin = p.prin
> rev.scode= p.bcode
>
>
> The item table has two rows with common join attributes, so the *final join
> should result in 2 rows. But I am seeing 4 rows instead.*
>
> left join item I
> on rev.sys = i.sys
> rev.custumer_id = I.custumer_id
> rev. scode = I.scode
>
>
>
> Regards,
> Meena
>
>
>


Re: Why the same INSERT OVERWRITE sql, final table file produced by spark sql is larger than hive sql?

2022-10-11 Thread Sadha Chilukoori
I have faced the same problem, where both the Hive and Spark ORC outputs
were using Snappy compression.

Hive 2.1
Spark 2.4.8

I'm curious to learn what could be the root cause of this.
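
One thing that might be worth comparing, besides the codec itself, is which
ORC writer each engine uses, since different writers can produce different
stripe and encoding layouts even with the same compression. A small sketch
for checking the relevant settings on the Spark side (these are standard
Spark SQL options; nothing below is specific to this table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Which ORC implementation Spark uses ('native' or 'hive') and its codec.
print(spark.conf.get("spark.sql.orc.impl"))
print(spark.conf.get("spark.sql.orc.compression.codec"))

# The default codec Hive would apply, for comparison.
spark.sql("SET hive.exec.orc.default.compress").show(truncate=False)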

-S

On Tue, Oct 11, 2022, 2:18 AM Chartist <13289341...@163.com> wrote:

>
> Hi,All
>
> I encountered a problem as described in the e-mail subject. The details
> are as follows:
>
> *SQL:*
> insert overwrite table mytable partition(pt='20220518')
> select guid, user_new_id, sum_credit_score, sum_credit_score_change,
> platform_credit_score_change, bike_credit_score_change,
> evbike_credit_score_change, car_credit_score_change, slogan_type, bz_date
> from mytable where pt = '20220518';
>
> *mytable DDL:*
> CREATE TABLE `mytable`(
>  `guid` string COMMENT 'xxx',
>  `user_new_id` bigint COMMENT 'xxx',
>  `sum_credit_score` bigint COMMENT 'xxx',
>  `sum_credit_score_change` bigint COMMENT 'xxx',
>  `platform_credit_score_change` bigint COMMENT 'xxx',
>  `bike_credit_score_change` bigint COMMENT 'xxx',
>  `evbike_credit_score_change` bigint COMMENT 'xxx',
>  `car_credit_score_change` bigint COMMENT 'xxx',
>  `slogan_type` bigint COMMENT 'xxx',
>  `bz_date` string COMMENT 'xxx')
> PARTITIONED BY (
>  `pt` string COMMENT 'increment_partition')
> ROW FORMAT SERDE
>  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION
>  'hdfs://flashHadoopUAT/user/hive/warehouse/mytable'
> TBLPROPERTIES (
>  'spark.sql.create.version'='2.2 or prior',
>  'spark.sql.sources.schema.numPartCols'='1',
>  'spark.sql.sources.schema.numParts'='1',
>  'spark.sql.sources.schema.part.0'='xxx SOME OMITTED CONTENT xxx',
>  'spark.sql.sources.schema.partCol.0'='pt',
>  'transient_lastDdlTime'='1653484849')
>
> *ENV:*
> hive version 2.1.1
> spark version 2.4.4
>
> *hadoop fs -du -h Result:*
> *[hive sql]: *
> *735.2 M  /user/hive/warehouse/mytable/pt=20220518*
> *[spark sql]: *
> *1.1 G  /user/hive/warehouse/mytable/pt=20220518*
>
> How could this happen? Could it be caused by the different ORC versions?
> Any replies appreciated.
>
> 13289341606
> 13289341...@163.com
>
> 
>
>