Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-24 Thread Mich Talebzadeh
LOL,

Hindsight is a wonderful thing, and one often learns these lessons through
experience. Once you have been told off because strict ordering was not
maintained, the lesson is never forgotten!

HTH


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.






Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-23 Thread Steve Loughran
Now, if you were ruthless, it would make sense to randomise the order of
results whenever someone leaves out the ORDER BY, to stop complacency.

Like that time Sun changed the order in which methods were returned by a
Class.getDeclaredMethods() call, and everyone's JUnit test cases failed if
they had assumed the ordering was that of the source file, which it was until
then, even though the language spec said "no guarantees".

People code for what works, not what is documented in places they don't
read. (This is also why anyone writing network code should really have a
flaky network connection to keep themselves honest.)
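
A minimal sketch of the "randomise when no ordering was requested" idea, in
plain Python rather than Spark. The function name fetch_rows and its
parameters are hypothetical, made up for illustration; nothing here is an
existing Spark or JDBC API.

```python
import random

def fetch_rows(rows, order_key=None, rng=None):
    """Return rows sorted by order_key, or deliberately shuffled if none is given.

    Hypothetical helper: shuffling the unordered case makes any caller that
    silently depends on "natural" order fail early and visibly.
    """
    result = list(rows)
    if order_key is not None:
        return sorted(result, key=order_key)
    rng = rng or random.Random()
    rng.shuffle(result)  # no ordering requested, so no ordering promised
    return result

people = [("Tom", 14), ("Alice", 23), ("Bob", 16)]

# Explicit ordering: the result is the same on every call.
print(fetch_rows(people, order_key=lambda row: row[1]))

# No ordering: the caller gets the same rows in an arbitrary order.
print(fetch_rows(people))
```

A test that asserts on the position of a row in the second call would fail
intermittently, which is the point of the exercise.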



Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-23 Thread beliefer
AFAIK, the order is unspecified whether it's SQL without a specified ORDER BY
clause or a DataFrame without a sort. The behavior is consistent between them.










Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Mich Talebzadeh
These are good points. In traditional RDBMSs, SQL query results without an
explicit *ORDER BY* clause may vary in order due to optimization,
especially when no clustered index is defined. In contrast, systems like
Hive and Spark SQL, which are based on distributed file storage, do not
rely on physical data order (co-location of data blocks). They deploy
techniques like columnar storage and predicate pushdown instead of
traditional indexing due to the distributed nature of their storage
systems.

HTH




Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Mich Talebzadeh
Hi Nicholas,

Your point

"In SQL, the result order of any query is implementation-dependent without
an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
table;` 10 times in a row and get 10 different orderings."

Yes, I concur; my understanding is the same.

Basically, this means that the database engine is free to return the results
in any order it sees fit, because SQL does not guarantee a specific order for
results unless an ORDER BY clause is used.
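
To make this concrete, here is a small sketch using Python's built-in sqlite3
module. The table name people, the ages, and the row for Alice are
assumptions made up for illustration; only the names Tom and Bob come from
this thread. SQLite is used purely because it ships with Python, not because
it behaves like Spark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(14, "Tom"), (23, "Alice"), (16, "Bob")],
)

# No ORDER BY: the engine may return rows in any order, so "skip the first
# row" is not well defined, even if it looks stable on a given build.
unordered = conn.execute(
    "SELECT name FROM people LIMIT -1 OFFSET 1"
).fetchall()

# With ORDER BY, "first" has a meaning: the youngest person is skipped.
ordered = conn.execute(
    "SELECT name FROM people ORDER BY age LIMIT -1 OFFSET 1"
).fetchall()

print(unordered)  # two rows; which two is up to the engine
print(ordered)    # [('Bob',), ('Alice',)]
```

Only the second query has a result that can safely be asserted on in a test
or shown in a docstring.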

HTH







Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Reynold Xin
It should be the same as SQL. Otherwise it takes away a lot of potential future 
optimization opportunities.

On Mon, Sep 18 2023 at 8:47 AM, Nicholas Chammas < nicholas.cham...@gmail.com > 
wrote:

> 
> I’ve always considered DataFrames to be logically equivalent to SQL tables
> or queries.
> 
> 
> In SQL, the result order of any query is implementation-dependent without
> an explicit ORDER BY clause. Technically, you could run `SELECT * FROM
> table;` 10 times in a row and get 10 different orderings.
> 
> 
> I thought the same applied to DataFrames, but the docstring for the
> recently added method DataFrame.offset (
> https://github.com/apache/spark/pull/40873/files#diff-4ff57282598a3b9721b8d6f8c2fea23a62e4bc3c0f1aa5444527549d1daa38baR1293-R1301
> ) implies otherwise.
> 
> 
> This example will work fine in practice, of course. But if DataFrames are
> technically unordered without an explicit ordering clause, then in theory
> a future implementation change may result in “Bob” being the “first” row
> in the DataFrame, rather than “Tom”. That would make the example
> incorrect.
> 
> 
> Is that not the case?
> 
> 
> Nick
>



Re: Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Sean Owen
I think it's the same, and always has been: you don't have a guaranteed
ordering unless an operation produces a specific ordering. That could be the
result of an order by, yes. I believe you are also guaranteed that reading
input files yields data in the order it appears in the file, and 1:1
operations like map() don't change ordering. The result of a shuffle,
however, carries no such guarantee. So anything like limit or head might give
different results in the future (or simply on different cluster setups with
different parallelism, etc.). The existence of operations like offset
doesn't contradict that. Maybe that's totally fine in some situations (e.g.
I just want to display some sample rows), but otherwise you've always
had to state your ordering for "first" or "nth" to have a guaranteed result.
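
The distinction above can be sketched with a toy model in plain Python. This
is not Spark code and does not reflect how Spark is implemented: partitions
are modelled as lists of rows, and the helper names (map_rows,
shuffle_by_key, head, sorted_head) as well as the sample rows are
hypothetical, chosen for illustration only.

```python
import random

def map_rows(partitions, fn):
    """1:1 operation: transforms each row where it sits, so order is kept."""
    return [[fn(row) for row in part] for part in partitions]

def shuffle_by_key(partitions, key, num_partitions, rng):
    """Toy shuffle: route rows to new partitions by an integer key.

    Rows arrive at their target partition in an arbitrary order, modelled
    here by shuffling them with the supplied random.Random instance.
    """
    rows = [row for part in partitions for row in part]
    rng.shuffle(rows)
    out = [[] for _ in range(num_partitions)]
    for row in rows:
        out[key(row) % num_partitions].append(row)
    return out

def head(partitions, n):
    """First n rows in partition order, with no sort applied."""
    return [row for part in partitions for row in part][:n]

def sorted_head(partitions, n, key):
    """First n rows after an explicit sort: the only well-defined 'first'."""
    return sorted((row for part in partitions for row in part), key=key)[:n]

parts = [[(14, "Tom"), (23, "Alice")], [(16, "Bob")]]
age = lambda row: row[0]

for seed in range(5):
    shuffled = shuffle_by_key(parts, age, 2, random.Random(seed))
    # head() may report Tom or Bob; sorted_head() always reports Tom.
    print(head(shuffled, 1), sorted_head(shuffled, 1, age))
```

In this model the "first" row after a shuffle depends on arrival order,
which is exactly the Tom-versus-Bob concern raised in the original question,
while the explicitly sorted variant is stable.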



Are DataFrame rows ordered without an explicit ordering clause?

2023-09-18 Thread Nicholas Chammas
I’ve always considered DataFrames to be logically equivalent to SQL tables or 
queries.

In SQL, the result order of any query is implementation-dependent without an 
explicit ORDER BY clause. Technically, you could run `SELECT * FROM table;` 10 
times in a row and get 10 different orderings.

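I thought the same applied to DataFrames, but the docstring for the recently
added method DataFrame.offset (
https://github.com/apache/spark/pull/40873/files#diff-4ff57282598a3b9721b8d6f8c2fea23a62e4bc3c0f1aa5444527549d1daa38baR1293-R1301
) implies otherwise.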

This example will work fine in practice, of course. But if DataFrames are
technically unordered without an explicit ordering clause, then in theory a
future implementation change may result in “Bob” being the “first” row in the
DataFrame, rather than “Tom”. That would make the example incorrect.

Is that not the case?

Nick