Efficiently updating running sums only on new data

2022-10-11 Thread Greg Kopff
I'm new to Spark and would like to seek some advice on how to approach a 
problem.

I have a large dataset that has dated observations. There are also columns that
are running sums of some of the other columns.

    date    | thing | foo | bar | foo_sum | bar_sum
 -----------+-------+-----+-----+---------+---------
 2020-01-01 |  101  |  1  |  3  |    1    |    3
 2020-01-01 |  202  |  0  |  2  |    0    |    2
 2020-01-01 |  303  |  1  |  1  |    1    |    1
 -----------+-------+-----+-----+---------+---------
 2020-01-02 |  101  |  1  |  2  |    2    |    5
 2020-01-02 |  202  |  0  |  0  |    0    |    2
 2020-01-02 |  303  |  4  |  1  |    5    |    2
 2020-01-02 |  404  |  2  |  2  |    2    |    2
Currently I generate the running sums using a WindowSpec:

final WindowSpec w =
    Window.partitionBy(col("thing"))
        .orderBy(col("date"), col("thing"))
        .rowsBetween(Window.unboundedPreceding(), Window.currentRow());

return df
    .withColumn("foo_sum", sum("foo").over(w))
    .withColumn("bar_sum", sum("bar").over(w));
Once these extra sum columns are computed, they are written back to storage.

Periodically this dataset is appended to with new observations. These new 
observations are all chronologically later than any of the previous 
observations.

I need to "continue" the previous running sums for the new observations -- but 
I want to avoid having to recompute the running sums completely from scratch.

    date    | thing | foo | bar | foo_sum | bar_sum
 -----------+-------+-----+-----+---------+---------
 2020-01-01 |  101  |  1  |  3  |    1    |    3
 2020-01-01 |  202  |  0  |  2  |    0    |    2
 2020-01-01 |  303  |  1  |  1  |    1    |    1
 -----------+-------+-----+-----+---------+---------
 2020-01-02 |  101  |  1  |  2  |    2    |    5
 2020-01-02 |  202  |  0  |  0  |    0    |    2
 2020-01-02 |  303  |  4  |  1  |    5    |    2
 2020-01-02 |  404  |  2  |  2  |    2    |    2
 -----------+-------+-----+-----+---------+---------   <-- new data below
 2020-01-03 |  101  |  2  |  2  |    .    |    .
 2020-01-03 |  303  |  1  |  1  |    .    |    .
 2020-01-02 |  404  |  2  |  1  |    .    |    .
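
One idea I have been toying with is to compute the running sums over just the new
rows and then add on the last stored totals per thing. A rough sketch follows (the
"history" and "newRows" variables are illustrative names for the previously written
data and the freshly appended observations, not anything in my real code):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Last stored totals per thing: max over a struct whose first field is the date
// picks, for each thing, the most recent row and carries its sums along.
Dataset<Row> offsets = history
    .groupBy(col("thing"))
    .agg(max(struct(col("date"), col("foo_sum"), col("bar_sum"))).alias("last"))
    .select(
        col("thing"),
        col("last.foo_sum").alias("foo_offset"),
        col("last.bar_sum").alias("bar_offset"));

// Running sums over only the new observations, using the same window as before.
final WindowSpec wNew =
    Window.partitionBy(col("thing"))
        .orderBy(col("date"))
        .rowsBetween(Window.unboundedPreceding(), Window.currentRow());

Dataset<Row> newSums = newRows
    .withColumn("foo_sum", sum("foo").over(wNew))
    .withColumn("bar_sum", sum("bar").over(wNew));

// Carry the old totals forward; things never seen before have no offset row,
// so coalesce treats the missing offset as zero.
Dataset<Row> updated = newSums
    .join(offsets, newSums.col("thing").equalTo(offsets.col("thing")), "left_outer")
    .withColumn("foo_sum", col("foo_sum").plus(coalesce(col("foo_offset"), lit(0L))))
    .withColumn("bar_sum", col("bar_sum").plus(coalesce(col("bar_offset"), lit(0L))))
    .drop(offsets.col("thing"))
    .drop("foo_offset", "bar_offset");

That way only the new rows are windowed, plus one small aggregate over the existing
data to recover the per-thing offsets (which could presumably be avoided altogether
by persisting the latest totals separately), and only "updated" needs to be appended
to storage.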
I would appreciate any pointers anyone could share about how to approach this
sort of problem.

Kind regards,

—
Greg

Re: Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql?

2022-10-11 Thread Sadha Chilukoori
I have faced the same problem, where both the Hive and Spark ORC outputs were
using Snappy compression.

Hive 2.1
Spark 2.4.8

I'm curious to learn what could be the root cause of this.

-S

On Tue, Oct 11, 2022, 2:18 AM Chartist <13289341...@163.com> wrote:

>
> Hi, all
>
> I encountered a problem as described in the e-mail subject. The
> details are as follows:
>
> *SQL:*
> insert overwrite table mytable partition(pt='20220518')
> select guid, user_new_id, sum_credit_score, sum_credit_score_change,
> platform_credit_score_change, bike_credit_score_change,
> evbike_credit_score_change, car_credit_score_change, slogan_type, bz_date
> from mytable where pt = '20220518';
>
> *mytable DDL:*
> CREATE TABLE `mytable`(
>  `guid` string COMMENT 'xxx',
>  `user_new_id` bigint COMMENT 'xxx',
>  `sum_credit_score` bigint COMMENT 'xxx',
>  `sum_credit_score_change` bigint COMMENT 'xxx',
>  `platform_credit_score_change` bigint COMMENT 'xxx',
>  `bike_credit_score_change` bigint COMMENT 'xxx',
>  `evbike_credit_score_change` bigint COMMENT 'xxx',
>  `car_credit_score_change` bigint COMMENT 'xxx',
>  `slogan_type` bigint COMMENT 'xxx',
>  `bz_date` string COMMENT 'xxx')
> PARTITIONED BY (
>  `pt` string COMMENT 'increment_partition')
> ROW FORMAT SERDE
>  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> LOCATION
>  'hdfs://flashHadoopUAT/user/hive/warehouse/mytable'
> TBLPROPERTIES (
>  'spark.sql.create.version'='2.2 or prior',
>  'spark.sql.sources.schema.numPartCols'='1',
>  'spark.sql.sources.schema.numParts'='1',
>  'spark.sql.sources.schema.part.0'='xxx SOME OMITTED CONTENT xxx',
>  'spark.sql.sources.schema.partCol.0'='pt',
>  'transient_lastDdlTime'='1653484849')
>
> *ENV:*
> hive version 2.1.1
> spark version 2.4.4
>
> *hadoop fs -du -h Result:*
> *[hive sql]: *
> *735.2 M  /user/hive/warehouse/mytable/pt=20220518*
> *[spark sql]: *
> *1.1 G  /user/hive/warehouse/mytable/pt=20220518*
>
> How could this happen? Could it be caused by the different ORC versions used
> by Hive and Spark? Any replies appreciated.
>
> 13289341606
> 13289341...@163.com
>


Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Sean Owen
See the pom.xml file
https://github.com/apache/spark/blob/master/pom.xml#L3590
2.13.8 at the moment; IIRC there was some Scala issue that prevented
updating to 2.13.9. Search issues/PRs.
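
If the question is what a given Spark installation was actually built against, a
quick runtime check works too (a sketch; submit it to the cluster in question):

import org.apache.spark.sql.SparkSession;

public class VersionCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("version-check").getOrCreate();
    // Spark's own version string, e.g. "3.3.0"
    System.out.println("Spark: " + spark.version());
    // Version of the Scala library on the driver classpath, e.g. "2.12.15"
    System.out.println("Scala: " + scala.util.Properties.versionNumberString());
    spark.stop();
  }
}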

On Tue, Oct 11, 2022 at 6:11 PM Henrik Park  wrote:

> Scala 2.13.9 was released. Do you know which Spark version would have it
> built in?
>
> thanks
>
> Sean Owen wrote:
> > I would imagine that Scala 2.12 support goes away, and Scala 3 support
> > is added, for maybe Spark 4.0, and maybe that happens in a year or so.
>
> --
> Simple Mail
> https://simplemail.co.in/


Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Henrik Park
Scala 2.13.9 was released. Do you know which Spark version would have it
built in?


thanks

Sean Owen wrote:
I would imagine that Scala 2.12 support goes away, and Scala 3 support 
is added, for maybe Spark 4.0, and maybe that happens in a year or so.


--
Simple Mail
https://simplemail.co.in/




Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Sean Owen
For Spark, the issue is maintaining simultaneous support for multiple Scala
versions, which have historically been mutually incompatible across minor
releases.
Until Scala 2.12 support is reasonable to remove, it's hard to also support
Scala 3, as it would mean maintaining three versions of code.
I would imagine that Scala 2.12 support goes away, and Scala 3 support is
added, for maybe Spark 4.0, and maybe that happens in a year or so.

For end users, I don't think there are big differences, so I don't think
learning one or the other matters a lot. Scala 3 is a lot like 2.13.
But I don't think you'll be able to write Scala 3 Spark apps anytime soon.

On Tue, Oct 11, 2022 at 7:57 AM Никита Романов 
wrote:

> No one knows for sure except Apache, but I’d learn Scala 2 if I were you.
> Even if Spark one day migrates to Scala 3 (which is not a given), it’ll take
> a while for the industry to adjust. It even takes a while to move from
> Spark 2 to Spark 3 (Scala 2.11 to Scala 2.12). I don’t think your knowledge
> of Scala 2 will be outdated any time soon.
>
> You can also compare it with Python 2 vs 3: although Python 3 dominates
> these days (almost 15 years after the release!), Python 2 is still used.
>
>
> Monday, 10 October 2022, 10:24 +03:00 from Oliver Plohmann <
> oli...@objectscape.org >:
>
> Hello,
>
> I was lucky and will be joining a project where Spark is being used in
> conjunction with Python. Scala will not be used at all. Everything will
> be Python. This means that I have free choice whether to start diving
> into Scala 2 or Scala 3. For future Spark jobs knowledge of Scala will
> be very precious (the job ads here for Spark always mention Java, Python
> and Scala).
>
> I was always interested in Scala and because it is a plus when applying
> for Spark jobs I will start learning and develop some spare time project
> with it. Question is now whether first to learn Scala 2 or start right
> away with learning Scala 3. That also boils down to the question whether
> Spark will ever be migrated to Scala 3. I have way too little
> understanding of Spark and Scala to be able to make some reasonable
> guess here.
>
> So that's why I'm asking here: Does anyone have some idea whether Spark
> will ever be migrated to Scala 3 or have some idea how long it will take
> till any migration work might be started?
>
> Thank you.
>
>
>
> --
> Никита Романов
> Sent from Mail.ru Mail
>


Re: As a Scala newbie starting to work with Spark does it make more sense to learn Scala 2 or Scala 3?

2022-10-11 Thread Никита Романов

No one knows for sure except Apache, but I’d learn Scala 2 if I were you. Even
if Spark one day migrates to Scala 3 (which is not a given), it’ll take a while
for the industry to adjust. It even takes a while to move from Spark 2 to Spark 
3 (Scala 2.11 to Scala 2.12). I don’t think your knowledge of Scala 2 will be 
outdated any time soon.
 
You can also compare it with Python 2 vs 3: although Python 3 dominates these 
days (almost 15 years after the release!), Python 2 is still used. 
  
>Monday, 10 October 2022, 10:24 +03:00 from Oliver Plohmann <
>oli...@objectscape.org >:
> 
>Hello,
>
>I was lucky and will be joining a project where Spark is being used in
>conjunction with Python. Scala will not be used at all. Everything will
>be Python. This means that I have free choice whether to start diving
>into Scala 2 or Scala 3. For future Spark jobs knowledge of Scala will
>be very precious (the job ads here for Spark always mention Java, Python
>and Scala).
>
>I was always interested in Scala and because it is a plus when applying
>for Spark jobs I will start learning and develop some spare time project
>with it. Question is now whether first to learn Scala 2 or start right
>away with learning Scala 3. That also boils down to the question whether
>Spark will ever be migrated to Scala 3. I have way too little
>understanding of Spark and Scala to be able to make some reasonable
>guess here.
>
>So that's why I'm asking here: Does anyone have some idea whether Spark
>will ever be migrated to Scala 3 or have some idea how long it will take
>till any migration work might be started?
>
>Thank you.
>
>
--
Никита Романов
Sent from Mail.ru Mail


Why the same INSERT OVERWRITE sql , final table file produced by spark sql is larger than hive sql?

2022-10-11 Thread Chartist


Hi, all


I encountered a problem as described in the e-mail subject. The details are as
follows:


SQL:
insert overwrite table mytable partition(pt='20220518') 

select guid, user_new_id, sum_credit_score, sum_credit_score_change, 
platform_credit_score_change, bike_credit_score_change, 
evbike_credit_score_change, car_credit_score_change, slogan_type, bz_date
from mytable where pt = '20220518';


mytable DDL:
CREATE TABLE `mytable`(
 `guid` string COMMENT 'xxx',
 `user_new_id` bigint COMMENT 'xxx',
 `sum_credit_score` bigint COMMENT 'xxx',
 `sum_credit_score_change` bigint COMMENT 'xxx',
 `platform_credit_score_change` bigint COMMENT 'xxx',
 `bike_credit_score_change` bigint COMMENT 'xxx',
 `evbike_credit_score_change` bigint COMMENT 'xxx',
 `car_credit_score_change` bigint COMMENT 'xxx',
 `slogan_type` bigint COMMENT 'xxx',
 `bz_date` string COMMENT 'xxx')
PARTITIONED BY (
 `pt` string COMMENT 'increment_partition')
ROW FORMAT SERDE
 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
 'hdfs://flashHadoopUAT/user/hive/warehouse/mytable'
TBLPROPERTIES (
 'spark.sql.create.version'='2.2 or prior',
 'spark.sql.sources.schema.numPartCols'='1',
 'spark.sql.sources.schema.numParts'='1',
 'spark.sql.sources.schema.part.0'='xxx SOME OMITTED CONTENT xxx',
 'spark.sql.sources.schema.partCol.0'='pt',
 'transient_lastDdlTime'='1653484849')


ENV:
hive version 2.1.1
spark version 2.4.4


hadoop fs -du -h Result:
[hive sql]: 
735.2 M  /user/hive/warehouse/mytable/pt=20220518
[spark sql]: 
1.1 G  /user/hive/warehouse/mytable/pt=20220518


How could this happen? Could it be caused by the different ORC versions used by
Hive and Spark? Any replies appreciated.
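
One way to narrow this down might be to check which ORC writer and codec Spark
actually uses for the INSERT OVERWRITE. A sketch (the defaults passed to
conf().get() are assumptions for Spark 2.4 and worth verifying against your build):

import org.apache.spark.sql.SparkSession;

public class OrcWriterCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("orc-writer-check")
        .enableHiveSupport()
        .getOrCreate();

    // "native" uses the newer Apache ORC writer, "hive" the ORC code bundled with
    // Hive 1.2.x -- the two can lay out and compress files quite differently.
    System.out.println(spark.conf().get("spark.sql.orc.impl", "native"));

    // Codec Spark applies when it writes ORC through its own datasource path.
    System.out.println(spark.conf().get("spark.sql.orc.compression.codec", "snappy"));

    // Whether an INSERT OVERWRITE into a Hive ORC table is converted to Spark's
    // native writer or handed to Hive's SerDe.
    System.out.println(spark.conf().get("spark.sql.hive.convertMetastoreOrc", "true"));

    spark.stop();
  }
}

Comparing these with the table's own properties (for example, any orc.compress
setting in TBLPROPERTIES) should show whether the two engines really write with the
same codec and writer.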


13289341606
13289341...@163.com