Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Sorry, I thought I gave an explanation.

The issue you are encountering with incorrect record numbers in the
"ShuffleWrite Size/Records" column in the Spark DAG UI when data is read
from cache/persist is a known limitation. This discrepancy arises due to
the way Spark handles and reports shuffle data when caching is involved.

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed . It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions
(Werner Von Braun)".




On Sun, 26 May 2024 at 21:16, Prem Sahoo  wrote:

> Can anyone please assist me ?
>
> On Fri, May 24, 2024 at 12:29 AM Prem Sahoo  wrote:
>
>> Does anyone have a clue ?
>>
>> On Thu, May 23, 2024 at 11:40 AM Prem Sahoo  wrote:
>>
>>> Hello Team,
>>> in spark DAG UI , we have Stages tab. Once you click on each stage you
>>> can view the tasks.
>>>
>>> In each task we have a column "ShuffleWrite Size/Records " that column
>>> prints wrong data when it gets the data from cache/persist . it
>>> typically will show the wrong record number though the data size is correct
>>> for e.g  3.2G/ 7400 which is wrong .
>>>
>>> please advise.
>>>
>>


Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Just to further clarify: the "Shuffle Write Size/Records" column in
the Spark UI can be misleading when working with cached/persisted data
because it reflects the shuffled data size and record count, not the
entire cached/persisted data. So it is fair to say that this is a
limitation of the UI's display, not necessarily a bug in the Spark
framework itself.

HTH

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".




On Sun, 26 May 2024 at 16:45, Mich Talebzadeh  wrote:
>
> Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show 
> incorrect record counts when data is retrieved from cache or persisted data. 
> This happens because the record count reflects the number of records written 
> to disk for shuffling, and not the actual number of records in the cached or 
> persisted data itself. In addition, because of lazy evaluation, Spark may only 
> materialize a portion of the cached or persisted data when a task needs it. 
> The "Shuffle Write Size/Records" might only reflect the materialized portion, 
> not the total number of records in the cache/persistence. While the "Shuffle 
> Write Size/Records" might be inaccurate for cached/persisted data, the 
> "Shuffle Read Size/Records" column can be more reliable. This metric shows 
> the number of records read from shuffle by the following stage, which should 
> be closer to the actual number of records processed.
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my knowledge 
> but of course cannot be guaranteed . It is essential to note that, as with 
> any advice, quote "one test result is worth one-thousand expert opinions 
> (Werner Von Braun)".
>
>
>
> On Thu, 23 May 2024 at 17:45, Prem Sahoo  wrote:
>>
>> Hello Team,
>> in spark DAG UI , we have Stages tab. Once you click on each stage you can 
>> view the tasks.
>>
>> In each task we have a column "ShuffleWrite Size/Records " that column 
>> prints wrong data when it gets the data from cache/persist . it typically 
>> will show the wrong record number though the data size is correct for e.g  
>> 3.2G/ 7400 which is wrong .
>>
>> please advise.




Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show
incorrect record counts *when data is retrieved from cache or persisted
data*. This happens because the record count reflects the number of records
written to disk for shuffling, not the actual number of records in the
cached or persisted data itself. In addition, because of lazy evaluation,
Spark may only materialize a portion of the cached or persisted data when a
task needs it. The "Shuffle Write Size/Records" might then only reflect the
materialized portion, not the total number of records in the
cache/persistence. While "Shuffle Write Size/Records" might be
inaccurate for cached/persisted data, the "Shuffle Read Size/Records"
column can be more reliable. This metric shows the number of records read
from the shuffle by the following stage, which should be closer to the actual
number of records processed.
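
For anyone who wants to see the effect locally, here is a minimal, self-contained PySpark sketch (the sizes, column names and key count are my own illustration, not from the original job): cache a DataFrame, trigger a shuffle over it, and then compare the numbers in the Stages tab.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-metrics-demo").getOrCreate()

# 1,000,000 rows cached in memory
df = spark.range(1_000_000).withColumn("k", F.col("id") % 10).cache()
df.count()                      # materialises the cache

# The groupBy triggers a shuffle over the cached data
agg = df.groupBy("k").count()
agg.collect()

# In the Stages tab, "Shuffle Write Size / Records" for the map-side stage counts
# only the records written to the shuffle (a handful per task after partial
# aggregation), not the 1,000,000 cached records that were scanned, which is
# exactly the gap described above.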

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Thu, 23 May 2024 at 17:45, Prem Sahoo  wrote:

> Hello Team,
> in spark DAG UI , we have Stages tab. Once you click on each stage you can
> view the tasks.
>
> In each task we have a column "ShuffleWrite Size/Records " that column
> prints wrong data when it gets the data from cache/persist . it
> typically will show the wrong record number though the data size is correct
> for e.g  3.2G/ 7400 which is wrong .
>
> please advise.
>


Re: BUG :: UI Spark

2024-05-26 Thread Sathi Chowdhury
Can you please explain how you realized it’s wrong? Did you check CloudWatch
for the same metrics and compare? Also, are you using df.cache() and expecting
the shuffle read/write to go away?


Sent from Yahoo Mail for iPhone


On Sunday, May 26, 2024, 7:53 AM, Prem Sahoo  wrote:

Can anyone please assist me ?
On Fri, May 24, 2024 at 12:29 AM Prem Sahoo  wrote:

Does anyone have a clue ?
On Thu, May 23, 2024 at 11:40 AM Prem Sahoo  wrote:

Hello Team,
in spark DAG UI , we have Stages tab. Once you click on each stage you can view the tasks.
In each task we have a column "ShuffleWrite Size/Records " that column prints wrong data when it gets the data from cache/persist . it typically will show the wrong record number though the data size is correct for e.g 3.2G/ 7400 which is wrong .
please advise.






Re: BUG :: UI Spark

2024-05-26 Thread Prem Sahoo
Can anyone please assist me ?

On Fri, May 24, 2024 at 12:29 AM Prem Sahoo  wrote:

> Does anyone have a clue ?
>
> On Thu, May 23, 2024 at 11:40 AM Prem Sahoo  wrote:
>
>> Hello Team,
>> in spark DAG UI , we have Stages tab. Once you click on each stage you
>> can view the tasks.
>>
>> In each task we have a column "ShuffleWrite Size/Records " that column
>> prints wrong data when it gets the data from cache/persist . it
>> typically will show the wrong record number though the data size is correct
>> for e.g  3.2G/ 7400 which is wrong .
>>
>> please advise.
>>
>


Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue ?

On Thu, May 23, 2024 at 11:40 AM Prem Sahoo  wrote:

> Hello Team,
> in spark DAG UI , we have Stages tab. Once you click on each stage you can
> view the tasks.
>
> In each task we have a column "ShuffleWrite Size/Records " that column
> prints wrong data when it gets the data from cache/persist . it
> typically will show the wrong record number though the data size is correct
> for e.g  3.2G/ 7400 which is wrong .
>
> please advise.
>


BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Hello Team,
in the Spark DAG UI we have a Stages tab. Once you click on each stage you can
view the tasks.

In each task there is a column "Shuffle Write Size/Records". That column
prints wrong data when the data comes from cache/persist: it
typically shows the wrong record number even though the data size is correct,
e.g. 3.2G / 7400, which is wrong.

Please advise.


Bug in org.apache.spark.util.sketch.BloomFilter

2024-03-21 Thread Nathan Conroy
Hi All,

I believe that there is a bug that affects the Spark BloomFilter implementation 
when creating a bloom filter with large n. Since this implementation uses 
integer hash functions, it doesn’t work properly when the number of bits 
exceeds MAX_INT.
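
A quick back-of-the-envelope check (my own sketch, not from the original post) of where the bit count crosses Integer.MAX_VALUE, using the standard sizing formula m = -n * ln(p) / (ln 2)^2 that, as far as I can tell, Spark's BloomFilter.optimalNumOfBits is also based on:

import math

INT_MAX = 2**31 - 1

def optimal_num_bits(n: int, fpp: float) -> int:
    # Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2
    return int(-n * math.log(fpp) / (math.log(2) ** 2))

for n in (100_000_000, 500_000_000, 1_000_000_000):
    m = optimal_num_bits(n, 0.03)
    print(f"n={n:>13,}  bits={m:>14,}  exceeds MAX_INT: {m > INT_MAX}")

At a 3% false-positive rate the bit count passes 2^31 - 1 somewhere below 500 million items, which is the regime where 32-bit hashing/indexing assumptions start to matter.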

I asked a question about this on stackoverflow, but didn’t get a satisfactory 
answer. I believe I know what is causing the bug and have documented my 
reasoning there as well:

https://stackoverflow.com/questions/78162973/why-is-observed-false-positive-rate-in-spark-bloom-filter-higher-than-expected

I would just go ahead and create a Jira ticket on the spark jira board, but I’m 
still waiting to hear back regarding getting my account set up.

Huge thanks if anyone can help!

-N


Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
---+-+---++
>>
>> onQueryProgress
>> ---
>> Batch: 2
>> ---
>> ++-+---++
>> | key|doubled_value|op_type| op_time|
>> ++-+---++
>> |a960f663-d13a-49c...|2|  1|2024-03-11 12:17:...|
>> ++-+---++
>>
>> I am afraid it is not working. Not even printing anything
>>
>> Cheers
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my kno
>> wledge but of course cannot be guaranteed . It is essential to note that,
>> as with any advice, quote "one test result is worth one-thousand expert
>> opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
>> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 11 Mar 2024 at 05:07, 刘唯  wrote:
>>
>>> *now -> not
>>>
>>> 刘唯  于2024年3月10日周日 22:04写道:
>>>
>>>> Have you tried using microbatch_data.get("processedRowsPerSecond")?
>>>> Camel case now snake case
>>>>
>>>> Mich Talebzadeh  于2024年3月10日周日 11:46写道:
>>>>
>>>>>
>>>>> There is a paper from Databricks on this subject
>>>>>
>>>>>
>>>>> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
>>>>>
>>>>> But having tested it, there seems to be a bug there that I reported to
>>>>> Databricks forum as well (in answer to a user question)
>>>>>
>>>>> I have come to a conclusion that this is a bug. In general there is a
>>>>> bug in obtaining individual values from the dictionary. For example, a
>>>>> bug in the way Spark Streaming is populating the processe
>>>>> d_rows_per_second key within the microbatch_data -> microbatch_data =
>>>>> event.progres dictionary or any other key. I have explored various deb
>>>>> ugging steps, and even though the key seems to exist, the value might
>>>>> not be getting set. Note that the dictionary itself prints the el
>>>>> ements correctly. This is with regard to method onQueryProgress(self,
>>>>> event) in class MyListener(StreamingQueryListener):
>>>>>
>>>>> For example with print(microbatch_data), you get all printed as below
>>>>>
>>>>> onQueryProgress
>>>>> microbatch_data received
>>>>> {
>>>>> "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
>>>>> "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
>>>>> "name" : null,
>>>>> "timestamp" : "2024-03-10T09:21:27.233Z",
>>>>> "batchId" : 21,
>>>>> "numInputRows" : 1,
>>>>> "inputRowsPerSecond" : 100.0,
>>>>> "processedRowsPerSecond" : 5.347593582887701,
>>>>> "durationMs" : {
>>>>> "addBatch" : 37,
>>>>> "commitOffsets" : 41,
>>>>> "getBatch" : 0,
>>>>> "latestOffset" : 0,
>>>>> "queryPlanning" : 5,
>>>>> "triggerExecution" : 187,
>>>>> "walCommit" : 104
>>>>> },
>>>>> "stateOperators" : [ ],
>>>>> "sources" : [ {
>>>>> "description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0,
>>>>> numPartitions=default",
>>>>> "startOffset" : 20,
>>>>> "endOffset" : 21,
>>>>> "latestOffset" : 21,
>>>>> "numInputRows" : 1,
>>>>> "inputRowsPerSecond" : 100.0,
>>>>> "processedRowsPerSecond" : 5.347593582887701
>>>>> } ],
>>>>> "sink" : {
>&

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread 刘唯
Oh, I see why the confusion.

microbatch_data = event.progress

means that microbatch_data is a StreamingQueryProgress instance, not a
dictionary, so you should use `microbatch_data.processedRowsPerSecond`
instead of the `get` method, which is used for dictionaries.

But weirdly, query.lastProgress and query.recentProgress should
return StreamingQueryProgress but instead they return a dict. So the
`get` method works there.

I think PySpark should improve on this part.
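
A minimal sketch of the distinction (listener wiring as in the earlier example in this thread; the field name is the camel-case one from the progress JSON):

from pyspark.sql.streaming import StreamingQueryListener

class RateListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        progress = event.progress            # a StreamingQueryProgress object
        # attribute access, not dict-style .get()
        print("rows/s:", progress.processedRowsPerSecond)

    def onQueryTerminated(self, event):
        pass

# query.lastProgress, by contrast, currently comes back as a plain dict, so
# query.lastProgress.get("processedRowsPerSecond") is the form that works there.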

Mich Talebzadeh wrote on Mon, 11 Mar 2024 at 05:51:

> Hi,
>
> Thank you for your advice
>
> This is the amended code
>
>def onQueryProgress(self, event):
> print("onQueryProgress")
> # Access micro-batch data
> microbatch_data = event.progress
> #print("microbatch_data received")  # Check if data is received
> #print(microbatch_data)
> #processed_rows_per_second =
> microbatch_data.get("processed_rows_per_second")
> processed_rows_per_second =
> microbatch_data.get("processedRowsPerSecond")
> print("CPC", processed_rows_per_second)
> if processed_rows_per_second is not None:  # Check if value exists
>print("ocessed_rows_per_second retrieved")
>print(f"Processed rows per second: {processed_rows_per_second}")
> else:
>print("processed_rows_per_second not retrieved!")
>
> This is the output
>
> onQueryStarted
> 'None' [c1a910e6-41bb-493f-b15b-7863d07ff3fe] got started!
> SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
> SLF4J: Defaulting to no-operation MDCAdapter implementation.
> SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for
> further details.
> ---
> Batch: 0
> ---
> +---+-+---+---+
> |key|doubled_value|op_type|op_time|
> +---+-+---+---+
> +---+-+---+---+
>
> onQueryProgress
> ---
> Batch: 1
> ---
> ++-+---++
> | key|doubled_value|op_type| op_time|
> ++-+---++
> |a960f663-d13a-49c...|0|  1|2024-03-11 12:17:...|
> ++-+---++
>
> onQueryProgress
> ---
> Batch: 2
> ---
> ++-+---++
> | key|doubled_value|op_type| op_time|
> ++-+---++
> |a960f663-d13a-49c...|2|  1|2024-03-11 12:17:...|
> ++-+---++
>
> I am afraid it is not working. Not even printing anything
>
> Cheers
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my kno
> wledge but of course cannot be guaranteed . It is essential to note that,
> as with any advice, quote "one test result is worth one-thousand expert op
> inions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 11 Mar 2024 at 05:07, 刘唯  wrote:
>
>> *now -> not
>>
>> 刘唯  于2024年3月10日周日 22:04写道:
>>
>>> Have you tried using microbatch_data.get("processedRowsPerSecond")?
>>> Camel case now snake case
>>>
>>> Mich Talebzadeh  于2024年3月10日周日 11:46写道:
>>>
>>>>
>>>> There is a paper from Databricks on this subject
>>>>
>>>>
>>>> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
>>>>
>>>> But having tested it, there seems to be a bug there that I reported to
>>>> Databricks forum as well (in answer to a user question)
>>>>
>>>> I have come to a conclusion that this is a bug. In general there is a b
>>>> ug in obtaining individual values from the dictionary. For example, a b
>>>> ug in the way Spark Streaming is populating the processe
>>>> d_rows_per_second key within 

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
Hi,

Thank you for your advice

This is the amended code

   def onQueryProgress(self, event):
print("onQueryProgress")
# Access micro-batch data
microbatch_data = event.progress
#print("microbatch_data received")  # Check if data is received
#print(microbatch_data)
#processed_rows_per_second =
microbatch_data.get("processed_rows_per_second")
processed_rows_per_second =
microbatch_data.get("processedRowsPerSecond")
print("CPC", processed_rows_per_second)
if processed_rows_per_second is not None:  # Check if value exists
   print("ocessed_rows_per_second retrieved")
   print(f"Processed rows per second: {processed_rows_per_second}")
else:
   print("processed_rows_per_second not retrieved!")

This is the output

onQueryStarted
'None' [c1a910e6-41bb-493f-b15b-7863d07ff3fe] got started!
SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further
details.
---
Batch: 0
---
+---+-+---+---+
|key|doubled_value|op_type|op_time|
+---+-+---+---+
+---+-+---+---+

onQueryProgress
---
Batch: 1
---
++-+---++
| key|doubled_value|op_type| op_time|
++-+---++
|a960f663-d13a-49c...|0|  1|2024-03-11 12:17:...|
++-+---++

onQueryProgress
---
Batch: 2
---
++-+---++
| key|doubled_value|op_type| op_time|
++-+---++
|a960f663-d13a-49c...|2|  1|2024-03-11 12:17:...|
++-+---++

I am afraid it is not working. Not even printing anything

Cheers

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 11 Mar 2024 at 05:07, 刘唯  wrote:

> *now -> not
>
> 刘唯  于2024年3月10日周日 22:04写道:
>
>> Have you tried using microbatch_data.get("processedRowsPerSecond")?
>> Camel case now snake case
>>
>> Mich Talebzadeh  于2024年3月10日周日 11:46写道:
>>
>>>
>>> There is a paper from Databricks on this subject
>>>
>>>
>>> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
>>>
>>> But having tested it, there seems to be a bug there that I reported to
>>> Databricks forum as well (in answer to a user question)
>>>
>>> I have come to a conclusion that this is a bug. In general there is a
>>> bug in obtaining individual values from the dictionary. For example, a bug
>>> in the way Spark Streaming is populating the processed_rows_per_second key
>>> within the microbatch_data -> microbatch_data = event.progres dictionary or
>>> any other key. I have explored various debugging steps, and even though the
>>> key seems to exist, the value might not be getting set. Note that the
>>> dictionary itself prints the elements correctly. This is with regard to
>>> method onQueryProgress(self, event) in class
>>> MyListener(StreamingQueryListener):
>>>
>>> For example with print(microbatch_data), you get all printed as below
>>>
>>> onQueryProgress
>>> microbatch_data received
>>> {
>>> "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
>>> "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
>>> "name" : null,
>>> "timestamp" : "2024-03-10T09:21:27.233Z",
>>> "batchId" : 21,
>>> "numInputRows" : 1,
>>> "inputRowsPerSecond" : 10

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
*now -> not

刘唯 wrote on Sun, 10 Mar 2024 at 22:04:

> Have you tried using microbatch_data.get("processedRowsPerSecond")?
> Camel case now snake case
>
> Mich Talebzadeh  于2024年3月10日周日 11:46写道:
>
>>
>> There is a paper from Databricks on this subject
>>
>>
>> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
>>
>> But having tested it, there seems to be a bug there that I reported to
>> Databricks forum as well (in answer to a user question)
>>
>> I have come to a conclusion that this is a bug. In general there is a bug
>> in obtaining individual values from the dictionary. For example, a bug in
>> the way Spark Streaming is populating the processed_rows_per_second key
>> within the microbatch_data -> microbatch_data = event.progres dictionary or
>> any other key. I have explored various debugging steps, and even though the
>> key seems to exist, the value might not be getting set. Note that the
>> dictionary itself prints the elements correctly. This is with regard to
>> method onQueryProgress(self, event) in class
>> MyListener(StreamingQueryListener):
>>
>> For example with print(microbatch_data), you get all printed as below
>>
>> onQueryProgress
>> microbatch_data received
>> {
>> "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
>> "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
>> "name" : null,
>> "timestamp" : "2024-03-10T09:21:27.233Z",
>> "batchId" : 21,
>> "numInputRows" : 1,
>> "inputRowsPerSecond" : 100.0,
>> "processedRowsPerSecond" : 5.347593582887701,
>> "durationMs" : {
>> "addBatch" : 37,
>> "commitOffsets" : 41,
>> "getBatch" : 0,
>> "latestOffset" : 0,
>> "queryPlanning" : 5,
>> "triggerExecution" : 187,
>> "walCommit" : 104
>> },
>> "stateOperators" : [ ],
>> "sources" : [ {
>> "description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0,
>> numPartitions=default",
>> "startOffset" : 20,
>> "endOffset" : 21,
>> "latestOffset" : 21,
>> "numInputRows" : 1,
>> "inputRowsPerSecond" : 100.0,
>> "processedRowsPerSecond" : 5.347593582887701
>> } ],
>> "sink" : {
>> "description" :
>> "org.apache.spark.sql.execution.streaming.ConsoleTable$@430a977c",
>> "numOutputRows" : 1
>> }
>> }
>> However, the observed behaviour (i.e. processed_rows_per_second is either
>> None or not being updated correctly).
>>
>> The spark version I used for my test is 3.4
>>
>> Sample code uses format=rate for simulating a streaming process. You can
>> test the code yourself, all in one
>> from pyspark.sql import SparkSession
>> from pyspark.sql.functions import col
>> from pyspark.sql.streaming import DataStreamWriter, StreamingQueryListener
>> from pyspark.sql.functions import col, round, current_timestamp, lit
>> import uuid
>>
>> def process_data(df):
>>
>> processed_df = df.withColumn("key", lit(str(uuid.uuid4(.\
>>   withColumn("doubled_value", col("value") * 2). \
>>   withColumn("op_type", lit(1)). \
>>   withColumn("op_time", current_timestamp())
>>
>> return processed_df
>>
>> # Create a Spark session
>> appName = "testListener"
>> spark = SparkSession.builder.appName(appName).getOrCreate()
>>
>> # Define the schema for the streaming data
>> schema = "key string timestamp timestamp, value long"
>>
>> # Define my listener.
>> class MyListener(StreamingQueryListener):
>> def onQueryStarted(self, event):
>> print("onQueryStarted")
>> print(f"'{event.name}' [{event.id}] got started!")
>> def onQueryProgress(self, event):
>> print("onQueryProgress")
>> # Access micro-batch data
>> microbatch_data = event.progress
>> print("microbatch_data received")  # Check if data is received
>> print(microbatch_data)
>> processed_rows_per_second =
>> microbatch_data.get("processed_rows_per_second")
>> if processed_rows_per_second is not None:  # Check if value exists
>&g

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
Have you tried using microbatch_data.get("processedRowsPerSecond")?
Camel case now snake case

Mich Talebzadeh wrote on Sun, 10 Mar 2024 at 11:46:

>
> There is a paper from Databricks on this subject
>
>
> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html
>
> But having tested it, there seems to be a bug there that I reported to
> Databricks forum as well (in answer to a user question)
>
> I have come to a conclusion that this is a bug. In general there is a bug
> in obtaining individual values from the dictionary. For example, a bug in
> the way Spark Streaming is populating the processed_rows_per_second key
> within the microbatch_data -> microbatch_data = event.progres dictionary or
> any other key. I have explored various debugging steps, and even though the
> key seems to exist, the value might not be getting set. Note that the
> dictionary itself prints the elements correctly. This is with regard to
> method onQueryProgress(self, event) in class
> MyListener(StreamingQueryListener):
>
> For example with print(microbatch_data), you get all printed as below
>
> onQueryProgress
> microbatch_data received
> {
> "id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
> "runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
> "name" : null,
> "timestamp" : "2024-03-10T09:21:27.233Z",
> "batchId" : 21,
> "numInputRows" : 1,
> "inputRowsPerSecond" : 100.0,
> "processedRowsPerSecond" : 5.347593582887701,
> "durationMs" : {
> "addBatch" : 37,
> "commitOffsets" : 41,
> "getBatch" : 0,
> "latestOffset" : 0,
> "queryPlanning" : 5,
> "triggerExecution" : 187,
> "walCommit" : 104
> },
> "stateOperators" : [ ],
> "sources" : [ {
> "description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0,
> numPartitions=default",
> "startOffset" : 20,
> "endOffset" : 21,
> "latestOffset" : 21,
> "numInputRows" : 1,
> "inputRowsPerSecond" : 100.0,
> "processedRowsPerSecond" : 5.347593582887701
> } ],
> "sink" : {
> "description" :
> "org.apache.spark.sql.execution.streaming.ConsoleTable$@430a977c",
> "numOutputRows" : 1
> }
> }
> However, the observed behaviour (i.e. processed_rows_per_second is either
> None or not being updated correctly).
>
> The spark version I used for my test is 3.4
>
> Sample code uses format=rate for simulating a streaming process. You can
> test the code yourself, all in one
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col
> from pyspark.sql.streaming import DataStreamWriter, StreamingQueryListener
> from pyspark.sql.functions import col, round, current_timestamp, lit
> import uuid
>
> def process_data(df):
>
> processed_df = df.withColumn("key", lit(str(uuid.uuid4(.\
>   withColumn("doubled_value", col("value") * 2). \
>   withColumn("op_type", lit(1)). \
>   withColumn("op_time", current_timestamp())
>
> return processed_df
>
> # Create a Spark session
> appName = "testListener"
> spark = SparkSession.builder.appName(appName).getOrCreate()
>
> # Define the schema for the streaming data
> schema = "key string timestamp timestamp, value long"
>
> # Define my listener.
> class MyListener(StreamingQueryListener):
> def onQueryStarted(self, event):
> print("onQueryStarted")
> print(f"'{event.name}' [{event.id}] got started!")
> def onQueryProgress(self, event):
> print("onQueryProgress")
> # Access micro-batch data
> microbatch_data = event.progress
> print("microbatch_data received")  # Check if data is received
> print(microbatch_data)
> processed_rows_per_second =
> microbatch_data.get("processed_rows_per_second")
> if processed_rows_per_second is not None:  # Check if value exists
>print("processed_rows_per_second retrieved")
>print(f"Processed rows per second: {processed_rows_per_second}")
> else:
>print("processed_rows_per_second not retrieved!")
> def onQueryTerminated(self, event):
> print("onQueryTerminated")
> if event.exception:
> print(f"Query terminated with exception: {event

Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread Mich Talebzadeh
There is a paper from Databricks on this subject

https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html

But having tested it, there seems to be a bug there that I reported to
Databricks forum as well (in answer to a user question)

I have come to a conclusion that this is a bug. In general there is a bug
in obtaining individual values from the dictionary. For example, a bug in
the way Spark Streaming is populating the processed_rows_per_second key
within the microbatch_data -> microbatch_data = event.progres dictionary or
any other key. I have explored various debugging steps, and even though the
key seems to exist, the value might not be getting set. Note that the
dictionary itself prints the elements correctly. This is with regard to
method onQueryProgress(self, event) in class
MyListener(StreamingQueryListener):

For example with print(microbatch_data), you get all printed as below

onQueryProgress
microbatch_data received
{
"id" : "941e4cb6-f4ee-41f8-b662-af6dda61dc66",
"runId" : "691d5eb2-140e-48c0-949a-7efbe0fa0967",
"name" : null,
"timestamp" : "2024-03-10T09:21:27.233Z",
"batchId" : 21,
"numInputRows" : 1,
"inputRowsPerSecond" : 100.0,
"processedRowsPerSecond" : 5.347593582887701,
"durationMs" : {
"addBatch" : 37,
"commitOffsets" : 41,
"getBatch" : 0,
"latestOffset" : 0,
"queryPlanning" : 5,
"triggerExecution" : 187,
"walCommit" : 104
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "RateStreamV2[rowsPerSecond=1, rampUpTimeSeconds=0,
numPartitions=default",
"startOffset" : 20,
"endOffset" : 21,
"latestOffset" : 21,
"numInputRows" : 1,
"inputRowsPerSecond" : 100.0,
"processedRowsPerSecond" : 5.347593582887701
} ],
"sink" : {
"description" :
"org.apache.spark.sql.execution.streaming.ConsoleTable$@430a977c",
"numOutputRows" : 1
}
}
However, the observed behaviour (i.e. processed_rows_per_second is either
None or not being updated correctly).

The spark version I used for my test is 3.4

Sample code uses format=rate for simulating a streaming process. You can
test the code yourself, all in one
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.streaming import DataStreamWriter, StreamingQueryListener
from pyspark.sql.functions import col, round, current_timestamp, lit
import uuid

def process_data(df):

processed_df = df.withColumn("key", lit(str(uuid.uuid4()))). \
  withColumn("doubled_value", col("value") * 2). \
  withColumn("op_type", lit(1)). \
  withColumn("op_time", current_timestamp())

return processed_df

# Create a Spark session
appName = "testListener"
spark = SparkSession.builder.appName(appName).getOrCreate()

# Define the schema for the streaming data
schema = "key string timestamp timestamp, value long"

# Define my listener.
class MyListener(StreamingQueryListener):
def onQueryStarted(self, event):
print("onQueryStarted")
print(f"'{event.name}' [{event.id}] got started!")
def onQueryProgress(self, event):
print("onQueryProgress")
# Access micro-batch data
microbatch_data = event.progress
print("microbatch_data received")  # Check if data is received
print(microbatch_data)
processed_rows_per_second = microbatch_data.get("processed_rows_per_second")
if processed_rows_per_second is not None:  # Check if value exists
   print("processed_rows_per_second retrieved")
   print(f"Processed rows per second: {processed_rows_per_second}")
else:
   print("processed_rows_per_second not retrieved!")
def onQueryTerminated(self, event):
print("onQueryTerminated")
if event.exception:
print(f"Query terminated with exception: {event.exception}")
else:
print("Query successfully terminated.")
# Add my listener.

listener_instance = MyListener()
spark.streams.addListener(listener_instance)


# Create a streaming DataFrame with the rate source
streaming_df = (
spark.readStream
.format("rate")
.option("rowsPerSecond", 1)
.load()
)

# Apply processing function to the streaming DataFrame
processed_streaming_df = process_data(streaming_df)

# Define the output sink (for example, console sink)
query = (
processed_streaming_df.select( \
  col("key").alias("key") \
, col("doubled_va

Re: [Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Mich Talebzadeh
Hi,

Quick observations from what you have provided:

- The observed discrepancy between rdd.count() and
rdd.map(Item::getType).countByValue() in distributed mode suggests a
potential aggregation issue with countByValue(). The correct results in
local mode give credence to this theory.
- Workarounds using mapToPair() and reduceByKey() produce identical
results, indicating a broader pattern rather than method-specific behaviour.
- Dataset.groupBy().count() yields accurate results, but this method incurs
overhead for the RDD-to-Dataset conversion.

Your expected total count of 75187 is around 7 times larger than the
observed count of 10519, which matches your number of executors (7). This
suggests incorrect or only partial aggregation across executors.

Now, before raising a red flag, these could be the culprits:

- Data skew: an uneven distribution of data across executors could cause
partial aggregation if a single executor processes most items of a
particular type.
- Partial aggregations: Spark might be combining partial counts from
executors incorrectly, leading to inaccuracies.
- Finally, a bug in 3.5 is possible.
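
To make the aggregation theory testable, here is a rough PySpark analogue of the cross-check (the real job is Java and the type column here is synthetic, so treat this purely as a sketch): if countByValue() aggregates correctly, all three totals below must agree.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("countByValueCheck").getOrCreate()
sc = spark.sparkContext

# Synthetic stand-in for the Item RDD: (id, type) pairs spread over several partitions
rdd = sc.parallelize(
    [(i, "TypeA" if i % 5 == 0 else "TypeB") for i in range(100_000)],
    numSlices=7,
)

total = rdd.count()
by_value = rdd.map(lambda kv: kv[1]).countByValue()       # driver-side dict: type -> count
df_counts = spark.createDataFrame(rdd, ["id", "type"]).groupBy("type").count().collect()

# Every record must be accounted for in all three views of the data
assert sum(by_value.values()) == total
assert sum(row["count"] for row in df_counts) == total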

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 27 Feb 2024 at 19:02, Stuart Fehr  wrote:

> Hello, I recently encountered a bug with the results from
> JavaRDD#countByValue that does not reproduce when running locally. For
> background, we are running a Spark 3.5.0 job on AWS EMR 7.0.0.
>
> The code in question is something like this:
>
> JavaRDD rdd = // ...
>> rdd.count();  // 75187
>
>
>
> // Get the count broken down by type
>> rdd.map(Item::getType).countByValue();
>
>
> Which gives these results from the resulting Map:
>
> TypeA: 556
> TypeB: 9168
> TypeC: 590
> TypeD: 205
> (total: 10519)
>
> These values are incorrect, since every item has a type defined, so the
> total of all the types should be 75187. When I inspected this stage in the
> Spark UI, I found that it was using 7 executors. Since the value here is
> about 1/7th of the actual expected value, I suspect that there is some
> issue with the way that the executors report their results back to the
> driver. These results for the same code are correct when I run the job in
> local mode ("local[4]"), so it may also have something to do with how data
> is shared across processes.
>
> For workarounds, I have also tried:
>
> rdd.mapToPair(item -> Tuple2.apply(item.getType(), 1)).countByKey();
>> rdd.mapToPair(item -> Tuple2.apply(item.getType(),
>> 1L)).reduceByKey(Long::sum).collectAsMap();
>
>
> These yielded the same (incorrect) result.
>
> I did find that using Dataset.groupBy().count() did yield the correct
> results:
>
> TypeA: 3996
> TypeB: 65490
> TypeC: 4224
> TypeD: 1477
>
> So, I have an immediate workaround, but it is somewhat awkward since I
> have to create a Dataframe from a JavaRDD each time.
>
> Am I doing something wrong? Do these methods not work the way that I
> expected them to from reading the documentation? Is this a legitimate bug?
>
> I would be happy to provide more details if that would help in debugging
> this scenario.
>
> Thank you for your time,
> ~Stuart Fehr
>


[Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Stuart Fehr
Hello, I recently encountered a bug with the results from
JavaRDD#countByValue that does not reproduce when running locally. For
background, we are running a Spark 3.5.0 job on AWS EMR 7.0.0.

The code in question is something like this:

JavaRDD rdd = // ...
> rdd.count();  // 75187



// Get the count broken down by type
> rdd.map(Item::getType).countByValue();


Which gives these results from the resulting Map:

TypeA: 556
TypeB: 9168
TypeC: 590
TypeD: 205
(total: 10519)

These values are incorrect, since every item has a type defined, so the
total of all the types should be 75187. When I inspected this stage in the
Spark UI, I found that it was using 7 executors. Since the value here is
about 1/7th of the actual expected value, I suspect that there is some
issue with the way that the executors report their results back to the
driver. These results for the same code are correct when I run the job in
local mode ("local[4]"), so it may also have something to do with how data
is shared across processes.

For workarounds, I have also tried:

rdd.mapToPair(item -> Tuple2.apply(item.getType(), 1)).countByKey();
> rdd.mapToPair(item -> Tuple2.apply(item.getType(),
> 1L)).reduceByKey(Long::sum).collectAsMap();


These yielded the same (incorrect) result.

I did find that using Dataset.groupBy().count() did yield the correct
results:

TypeA: 3996
TypeB: 65490
TypeC: 4224
TypeD: 1477

So, I have an immediate workaround, but it is somewhat awkward since I have
to create a Dataframe from a JavaRDD each time.

Am I doing something wrong? Do these methods not work the way that I
expected them to from reading the documentation? Is this a legitimate bug?

I would be happy to provide more details if that would help in debugging
this scenario.

Thank you for your time,
~Stuart Fehr


Re: Spark 4.0 Query Analyzer Bug Report

2024-02-21 Thread Mich Talebzadeh
Indeed, valid points raised, including the potential typo in the new Spark
version. In the meantime, I suggest you look at the alternative debugging
methods below:

   - Simpler explain(): try a basic explain() or explain("extended"). This
   might provide a less detailed, but potentially functional, explanation.
   - Manual analysis: analyse the query structure and logical steps
   yourself.
   - Spark UI: review the Spark UI (accessible through your Spark
   application on port 4040) for delving into query execution and potential
   bottlenecks.
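
A minimal sketch of the first suggestion, using a throwaway DataFrame (substitute the real query for it):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).selectExpr("id % 10 AS k", "id AS v").groupBy("k").sum("v")

df.explain()                   # physical plan only
df.explain(extended=True)      # parsed, analyzed, optimized and physical plans
df.explain(mode="formatted")   # Spark 3.x: a more readable, sectioned layout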


HTH



Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Wed, 21 Feb 2024 at 08:37, Holden Karau  wrote:

> Do you mean Spark 3.4? 4.0 is very much not released yet.
>
> Also it would help if you could share your query & more of the logs
> leading up to the error.
>
> On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup 
> wrote:
>
>> Hi Spark team,
>>
>>
>>
>> We ran into a dataframe issue after upgrading from spark 3.1 to 4.
>>
>>
>>
>> query_result.explain(extended=True)\n  File
>> \"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\"
>>
>> raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while 
>> calling z:org.apache.spark.sql.api.python.PythonSQLUtils.explainString.\n: 
>> java.lang.IllegalStateException: You hit a query analyzer bug. Please report 
>> your query to Spark user mailing list.\n\tat 
>> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:516)\n\tat
>>  
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)\n\tat
>>  scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)\n\tat 
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)\n\tat 
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\n\tat 
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)\n\tat
>>  
>> org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:72)\n\tat
>>  
>> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)\n\tat
>>  
>> scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)\n\tat
>>  
>> scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)\n\tat
>>  scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat 
>> scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat 
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n\tat 
>> scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)\n\tat 
>> scala.collect...
>>
>>
>>
>>
>>
>> Could you please let us know if this is already being looked at?
>>
>>
>>
>> Thanks,
>>
>> Anup
>>
>
>
> --
> Cell : 425-233-8271
>


Re: Spark 3.3 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
Apologies. The issue is seen after we upgraded from Spark 3.1 to Spark 3.3. The
same query runs fine on Spark 3.1.

Please disregard the Spark version mentioned in the earlier email subject.

Anup

Error trace:
query_result.explain(extended=True)\n  File 
\"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\"

raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while 
calling z:org.apache.spark.sql.api.python.PythonSQLUtils.explainString.\n: 
java.lang.IllegalStateException: You hit a query analyzer bug. Please report 
your query to Spark user mailing list.\n\tat 
org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:516)\n\tat
 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)\n\tat
 scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)\n\tat 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)\n\tat 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\n\tat 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)\n\tat
 
org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:72)\n\tat
 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)\n\tat
 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)\n\tat
 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)\n\tat
 scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat 
scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat 
scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n\tat 
scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)\n\tat 
scala.collect...


From: "Sharma, Anup" 
Date: Tuesday, February 20, 2024 at 4:58 PM
To: "user@spark.apache.org" 
Cc: "Thinderu, Shalini" 
Subject: Spark 4.0 Query Analyzer Bug Report

Hi Spark team,

We ran into a dataframe issue after upgrading from spark 3.1 to 4.

query_result.explain(extended=True)\n  File 
\"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\"

raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while 
calling z:org.apache.spark.sql.api.python.PythonSQLUtils.explainString.\n: 
java.lang.IllegalStateException: You hit a query analyzer bug. Please report 
your query to Spark user mailing list.\n\tat 
org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:516)\n\tat
 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)\n\tat
 scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)\n\tat 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)\n\tat 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\n\tat 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)\n\tat
 
org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:72)\n\tat
 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)\n\tat
 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)\n\tat
 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)\n\tat
 scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat 
scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat 
scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n\tat 
scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)\n\tat 
scala.collect...


Could you please let us know if this is already being looked at?

Thanks,
Anup


Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
Do you mean Spark 3.4? 4.0 is very much not released yet.

Also it would help if you could share your query & more of the logs leading
up to the error.

On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup 
wrote:

> Hi Spark team,
>
>
>
> We ran into a dataframe issue after upgrading from spark 3.1 to 4.
>
>
>
> query_result.explain(extended=True)\n  File
> \"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\"
>
> raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while 
> calling z:org.apache.spark.sql.api.python.PythonSQLUtils.explainString.\n: 
> java.lang.IllegalStateException: You hit a query analyzer bug. Please report 
> your query to Spark user mailing list.\n\tat 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:516)\n\tat
>  
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)\n\tat
>  scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)\n\tat 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)\n\tat 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\n\tat 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)\n\tat
>  
> org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:72)\n\tat
>  
> org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)\n\tat
>  
> scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)\n\tat
>  
> scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)\n\tat
>  scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat 
> scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n\tat 
> scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)\n\tat 
> scala.collect...
>
>
>
>
>
> Could you please let us know if this is already being looked at?
>
>
>
> Thanks,
>
> Anup
>


-- 
Cell : 425-233-8271


Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
Hi Spark team,

We ran into a dataframe issue after upgrading from spark 3.1 to 4.

query_result.explain(extended=True)\n  File 
\"…/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py\"

raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while 
calling z:org.apache.spark.sql.api.python.PythonSQLUtils.explainString.\n: 
java.lang.IllegalStateException: You hit a query analyzer bug. Please report 
your query to Spark user mailing list.\n\tat 
org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:516)\n\tat
 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$1(QueryPlanner.scala:63)\n\tat
 scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)\n\tat 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)\n\tat 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)\n\tat 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)\n\tat
 
org.apache.spark.sql.execution.SparkStrategies.plan(SparkStrategies.scala:72)\n\tat
 
org.apache.spark.sql.catalyst.planning.QueryPlanner.$anonfun$plan$3(QueryPlanner.scala:78)\n\tat
 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:196)\n\tat
 
scala.collection.TraversableOnce$folder$1.apply(TraversableOnce.scala:194)\n\tat
 scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat 
scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat 
scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\n\tat 
scala.collection.TraversableOnce.foldLeft(TraversableOnce.scala:199)\n\tat 
scala.collect...


Could you please let us know if this is already being looked at?

Thanks,
Anup


[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all,

Wondering if anyone has run into this as I can't find any similar issues in
JIRA, mailing list archives, Stack Overflow, etc. I had a query that was
running successfully, but the query planning time was extremely long (4+
hours). To fix this I added `checkpoint()` calls earlier in the code to
truncate the query plan. This worked to improve the performance, but now I
am getting the error "A column or function parameter with name
`B`.`JOIN_KEY` cannot be resolved." Nothing else in the query changed
besides the `checkpoint()` calls. The only thing I can surmise is that this
is related to a very complex nested query plan where the same table is used
multiple times upstream. The general flow is something like this:

```py
df = spark.sql("...")
df = df.checkpoint()
df.createOrReplaceTempView("df")

df2 = spark.sql("SELECT  JOIN df ...")
df2.createOrReplaceTempView("df2")

# Error happens here: A column or function parameter with name
# `a`.`join_key` cannot be resolved. Did you mean one of the following?
# [`b`.`join_key`, `a`.`col1`, `b`.`col2`]
spark.sql(""'
SELECT *
FROM  (
SELECT
a.join_key,
a.col1,
b.col2
FROM df2 b
LEFT JOIN df a ON b.join_key = a.join_key
)
""")
```

In the actual code df and df2 are very complex multi-level nested views
built upon other views. If I checkpoint all of the dataframes in the query
right before I run it the error goes away. Unfortunately I have not been
able to put together a minimal reproducible example.
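
For what it's worth, here is a runnable toy version of that workaround (synthetic data and a hypothetical checkpoint directory, so only the shape of the fix is meaningful): every dataframe feeding the final query is checkpointed, not just the first one.

```py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   # hypothetical location

df = spark.range(10).select(
    F.col("id").alias("join_key"),
    (F.col("id") * 2).alias("col1"),
)
df = df.checkpoint()                  # truncate df's lineage
df.createOrReplaceTempView("df")

df2 = spark.sql("SELECT join_key, join_key + 100 AS col2 FROM df")
df2 = df2.checkpoint()                # checkpoint df2 as well, per the workaround
df2.createOrReplaceTempView("df2")

spark.sql("""
    SELECT a.join_key, a.col1, b.col2
    FROM df2 b
    LEFT JOIN df a ON b.join_key = a.join_key
""").show()
```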

Any ideas?

Thanks,
Robin


Re: Data Duplication Bug Found - Structured Streaming Versions 3..4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
Hi Craig,

Thank you for sending us more information. Can you answer my previous
question, which I don't think the document addresses: how did you determine
there were duplicates in the output, and how was the output data read? The
FileStreamSink provides exactly-once writes ONLY if you read the output with
the FileStreamSource or the file source (batch). A log is used to determine
what data is committed or not, and those sources know how to use that log to
read the data exactly once. So there may be duplicated data written on disk;
if you simply read the data files written to disk directly, you may see
duplicates when there are failures. However, if you read the output location
with Spark you should get exactly-once results (unless there is a bug), since
Spark will know how to use the commit log to see which data files are
committed and which are not.
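
As an illustration of that read path (the output location is hypothetical, and the format should match whatever the sink wrote): reading the sink directory back through Spark picks up the _spark_metadata commit log and skips files from uncommitted batches, whereas listing the directory by hand does not.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch read of a directory written by the streaming file sink: only files
# recorded as committed in _spark_metadata are included in the result
committed = spark.read.parquet("/tmp/streaming-output")   # hypothetical path
print("committed row count:", committed.count())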

Best,

Jerry

On Mon, Sep 18, 2023 at 1:18 PM Craig Alfieri 
wrote:

> Hi Russell/Jerry/Mich,
>
>
>
> Appreciate your patience on this.
>
>
>
> Attached are more details on how this duplication “error” was found.
>
> Since we’re still unsure I am using “error” in quotes.
>
>
>
> We’d love the opportunity to work with any of you directly and/or the
> wider Spark community to triage this or get a better understanding of the
> nature of what we’re experiencing.
>
>
>
> Our platform provides the ability to fully reproduce this.
>
>
>
> Once you have had the chance to review the attached draft, let us know if
> there are any questions in the meantime. Again, we welcome the opportunity
> to work with the teams on this.
>
>
>
> Best-
>
> Craig
>
>
>
>
>
>
>
> *From: *Craig Alfieri 
> *Date: *Thursday, September 14, 2023 at 8:45 PM
> *To: *russell.spit...@gmail.com 
> *Cc: *Jerry Peng , Mich Talebzadeh <
> mich.talebza...@gmail.com>, user@spark.apache.org ,
> connor.mc...@antithesis.com 
> *Subject: *Re: Data Duplication Bug Found - Structured Streaming Versions
> 3..4.1, 3.2.4, and 3.3.2
>
> Hi Russell et al,
>
>
>
> Acknowledging receipt; we’ll get these answers back to the group.
>
>
>
> Follow-up forthcoming.
>
>
>
> Craig
>
>
>
>
>
>
>
> On Sep 14, 2023, at 6:38 PM, russell.spit...@gmail.com wrote:
>
> Exactly once should be output sink dependent, what sink was being used?
>
> Sent from my iPhone
>
>
>
> On Sep 14, 2023, at 4:52 PM, Jerry Peng 
> wrote:
>
> 
>
> Craig,
>
>
>
> Thanks! Please let us know the result!
>
>
>
> Best,
>
>
>
> Jerry
>
>
>
> On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
> Hi Craig,
>
>
>
> Can you please clarify what this bug is and provide sample code causing
> this issue?
>
>
>
> HTH
>
>
> Mich Talebzadeh,
>
> Distinguished Technologist, Solutions Architect & Engineer
>
> London
>
> United Kingdom
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 14 Sept 2023 at 17:48, Craig Alfieri 
> wrote:
>
> Hello Spark Community-
>
>
>
> As part of a research effort, our team here at Antithesis tests for
> correctness/fault tolerance of major OSS projects.
>
> Our team recently was testing Spark’s Structured Streaming, and we came
> across a data duplication bug we’d like to work with the teams on to
> resolve.
>
>
>
> Our intention is to utilize this as a future case study for our platform,
> but prior to doing so we like to have a resolution in place so that an
> announcement isn’t alarming to the user base.
>
>
>
> Attached is a high level .pdf that reviews the High Availability set-up
> put under test.
>
> This was also tested across the three latest versions, and the same
> behavior was observed.
>
>
>
> We can reproduce this error readily, since our environment is fully
> deterministic, we are just not Spark experts and would like to work with
> someone in the community to resolve this.
>
>
>
> Please let us know at your earliest convenience.
>
>
>
> Best
>
>
>
>

Re: Data Duplication Bug Found - Structured Streaming Versions 3..4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Russell et al,

Acknowledging receipt; we’ll get these answers back to the group.

Follow-up forthcoming.

Craig

On Sep 14, 2023, at 6:38 PM, russell.spit...@gmail.com wrote:

> Exactly once should be output sink dependent, what sink was being used?
>
> Sent from my iPhone
>
> On Sep 14, 2023, at 4:52 PM, Jerry Peng  wrote:
>
>> Craig,
>>
>> Thanks! Please let us know the result!
>>
>> Best,
>> Jerry
>>
>> On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Craig,
>>>
>>> Can you please clarify what this bug is and provide sample code causing this issue?
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Distinguished Technologist, Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>    view my Linkedin profile
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
>>> of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
>>> The author will in no case be liable for any monetary damages arising from such
>>> loss, damage or destruction.
>>>
>>> On Thu, 14 Sept 2023 at 17:48, Craig Alfieri <craig.alfi...@antithesis.com> wrote:
>>>
>>>> Hello Spark Community-
>>>>
>>>> As part of a research effort, our team here at Antithesis tests for correctness/fault tolerance of major OSS projects.
>>>> Our team recently was testing Spark’s Structured Streaming, and we came across a data duplication bug we’d like to work with the teams on to resolve.
>>>>
>>>> Our intention is to utilize this as a future case study for our platform, but prior to doing so we like to have a resolution in place so that an announcement isn’t alarming to the user base.
>>>>
>>>> Attached is a high level .pdf that reviews the High Availability set-up put under test.
>>>> This was also tested across the three latest versions, and the same behavior was observed.
>>>>
>>>> We can reproduce this error readily, since our environment is fully deterministic, we are just not Spark experts and would like to work with someone in the community to resolve this.
>>>>
>>>> Please let us know at your earliest convenience.
>>>>
>>>> Best
>>>>
>>>> Craig Alfieri
>>>> c: 917.841.1652
>>>> craig.alfi...@antithesis.com
>>>> New York, NY.
>>>> Antithesis.com
>>>>
>>>> We can't talk about most of the bugs that we've found for our customers,
>>>> but some customers like to speak about their work with us:
>>>> https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis


Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread russell . spitzer
Exactly once should be output sink dependent, what sink was being used?

Sent from my iPhone

On Sep 14, 2023, at 4:52 PM, Jerry Peng wrote:

Craig,

Thanks! Please let us know the result!

Best,

Jerry

On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi Craig,

Can you please clarify what this bug is and provide sample code causing this issue?

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

   view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh

 Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  

On Thu, 14 Sept 2023 at 17:48, Craig Alfieri <craig.alfi...@antithesis.com> wrote:







Hello Spark Community-
 
As part of a research effort, our team here at Antithesis tests for correctness/fault tolerance of major OSS projects.
Our team recently was testing Spark’s Structured Streaming, and we came across a data duplication bug we’d like to work with the teams on to resolve.
 
Our intention is to utilize this as a future case study for our platform, but prior to doing so we like to have a resolution in place so that an announcement isn’t alarming to the user base.
 
Attached is a high level .pdf that reviews the High Availability set-up put under test.
This was also tested across the three latest versions, and the same behavior was observed.
 
We can reproduce this error readily, since our environment is fully deterministic, we are just not Spark experts and would like to work with someone in the community to resolve this.
 
Please let us know at your earliest convenience.
 
Best


 







Craig Alfieri



c: 917.841.1652

craig.alfi...@antithesis.com



New York, NY.

Antithesis.com




 
We can't talk about most of the bugs that we've found for our customers,

but some customers like to speak about their work with us:
https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis


 
 





-This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity for whom they are addressed. If you received this message in error, please notify the sender and remove it from your system.
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Jerry Peng
Craig,

Thanks! Please let us know the result!

Best,

Jerry

On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh 
wrote:

>
> Hi Craig,
>
> Can you please clarify what this bug is and provide sample code causing
> this issue?
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 14 Sept 2023 at 17:48, Craig Alfieri 
> wrote:
>
>> Hello Spark Community-
>>
>>
>>
>> As part of a research effort, our team here at Antithesis tests for
>> correctness/fault tolerance of major OSS projects.
>>
>> Our team recently was testing Spark’s Structured Streaming, and we came
>> across a data duplication bug we’d like to work with the teams on to
>> resolve.
>>
>>
>>
>> Our intention is to utilize this as a future case study for our platform,
>> but prior to doing so we like to have a resolution in place so that an
>> announcement isn’t alarming to the user base.
>>
>>
>>
>> Attached is a high level .pdf that reviews the High Availability set-up
>> put under test.
>>
>> This was also tested across the three latest versions, and the same
>> behavior was observed.
>>
>>
>>
>> We can reproduce this error readily, since our environment is fully
>> deterministic, we are just not Spark experts and would like to work with
>> someone in the community to resolve this.
>>
>>
>>
>> Please let us know at your earliest convenience.
>>
>>
>>
>> Best
>>
>>
>>
>>
>> *Craig Alfieri*
>>
>> c: 917.841.1652
>>
>> craig.alfi...@antithesis.com
>>
>> New York, NY.
>>
>> Antithesis.com
>> <http://www.antithesis.com/>
>>
>>
>>
>> We can't talk about most of the bugs that we've found for our customers,
>>
>> but some customers like to speak about their work with us:
>>
>> https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis
>>
>>
>>
>>
>>
>>
>> *-*
>> *This email and any files transmitted with it are confidential and
>> intended solely for the use of the individual or entity for whom they are
>> addressed. If you received this message in error, please notify the sender
>> and remove it from your system.*
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Jerry- This is exactly the type of help we're seeking, to confirm the 
FilestreamSink was not utilized on our test runs.

Our team is going to work towards implementing this and re-running our 
experiments across the versions.

If everything comes back with similar results, we will reach back out to share 
more artifacts with this thread.

Thank you Jerry.


From: Jerry Peng 
Date: Thursday, September 14, 2023 at 1:10 PM
To: Craig Alfieri 
Cc: user@spark.apache.org 
Subject: Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2
Hi Craig,

Thank you for bringing this to the community's attention! Do you have any 
example code you can share that we can use to reproduce this issue?  By the 
way, how did you determine duplicates in the output?  The FileStreamSink 
provides exactly-once writes ONLY if you read the output with the 
FileStreamSource or the FileSource (batch).  A log is used to determine what 
data is committed or not and those aforementioned sources know how to use that 
log to read the data "exactly-once".

Best,

Jerry
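
For concreteness, here is a minimal sketch of the distinction above (the output directory and the parquet format are placeholders, not taken from Craig's setup): reading the sink's output through Spark's file sources consults the _spark_metadata commit log, while listing the files yourself does not.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-sink-output").getOrCreate()

# Hypothetical output directory of the FileStreamSink under test.
out_dir = "/tmp/stream-output"

# Batch FileSource: detects out_dir/_spark_metadata and reads only the files
# recorded in that commit log.
committed_batch = spark.read.format("parquet").load(out_dir)
print("committed rows:", committed_batch.count())

# FileStreamSource: the streaming reader consults the same log. (The streaming
# file source needs an explicit schema.)
committed_stream = (
    spark.readStream.format("parquet")
    .schema(committed_batch.schema)
    .load(out_dir)
)

# By contrast, listing part files directly (hadoop fs -ls, object-store APIs,
# etc.) bypasses the log, so output of failed or retried micro-batches can
# look like duplicates there.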

On Thu, Sep 14, 2023 at 9:48 AM Craig Alfieri 
mailto:craig.alfi...@antithesis.com>> wrote:
Hello Spark Community-

As part of a research effort, our team here at Antithesis tests for 
correctness/fault tolerance of major OSS projects.
Our team recently was testing Spark’s Structured Streaming, and we came across 
a data duplication bug we’d like to work with the teams on to resolve.

Our intention is to utilize this as a future case study for our platform, but 
prior to doing so we like to have a resolution in place so that an announcement 
isn’t alarming to the user base.

Attached is a high level .pdf that reviews the High Availability set-up put 
under test.
This was also tested across the three latest versions, and the same behavior 
was observed.

We can reproduce this error readily, since our environment is fully 
deterministic, we are just not Spark experts and would like to work with 
someone in the community to resolve this.

Please let us know at your earliest convenience.

Best

Craig Alfieri
c: 917.841.1652
craig.alfi...@antithesis.com<mailto:craig.alfi...@antithesis.com>
New York, NY.
Antithesis.com <http://www.antithesis.com/>

We can't talk about most of the bugs that we've found for our customers,
but some customers like to speak about their work with us:
https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis



-
This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity for whom they are addressed. If 
you received this message in error, please notify the sender and remove it from 
your system.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>

-- 

*-*
*This email and any files transmitted with 
it are confidential and intended solely for the use of the individual or 
entity for whom they are addressed. If you received this message in error, 
please notify the sender and remove it from your system.*



Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-20 Thread Dipayan Dev
Hi Mich,

It's not specific to ORC, and looks like a bug from Hadoop Common project.
I have raised a bug and am happy to contribute to Hadoop 3.3.0 version. Do
you know if anyone could help me to set the Assignee?
https://issues.apache.org/jira/browse/HADOOP-18856


With Best Regards,

Dipayan Dev



On Sun, Aug 20, 2023 at 2:47 AM Mich Talebzadeh 
wrote:

> Under gs directory
>
> "gs://test_dd1/abc/"
>
> What do you see?
>
> gsutil ls gs://test_dd1/abc
>
> and the same
>
> gs://test_dd1/
>
> gsutil ls gs://test_dd1
>
> I suspect you need a folder for multiple ORC slices!
>
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 19 Aug 2023 at 21:36, Dipayan Dev  wrote:
>
>> Hi Everyone,
>>
>> I'm stuck with one problem, where I need to provide a custom GCS location
>> for the Hive table from Spark. The code fails while doing an *'insert
>> into'* whenever my Hive table has a flat GCS location like
>> gs://bucket_name/, but works for nested locations like
>> gs://bucket_name/blob_name.
>>
>> Is anyone aware if it's an issue from Spark side or any config I need to
>> pass for it?
>>
>> *The issue is happening in 2.x and 3.x both.*
>>
>> Config using:
>>
>> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
>> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition", true)
>> spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
>> spark.conf.set("hive.exec.dynamic.partition", true)
>>
>>
>> *Case 1 : FAILS*
>>
>> val DF = Seq(("test1", 123)).toDF("name", "num")
>>  val partKey = List("num").map(x => x)
>>
>> DF.write.option("path", 
>> "gs://test_dd1/").mode(SaveMode.Overwrite).partitionBy(partKey: 
>> _*).format("orc").saveAsTable("us_wm_supply_chain_otif_stg.test_tb1")
>>
>> val DF1 = Seq(("test2", 125)).toDF("name", "num")
>> DF.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb1")
>>
>>
>>
>>
>>
>> *java.lang.NullPointerException  at 
>> org.apache.hadoop.fs.Path.<init>(Path.java:141)  at 
>> org.apache.hadoop.fs.Path.<init>(Path.java:120)  at 
>> org.apache.hadoop.fs.Path.suffix(Path.java:441)  at 
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:254)*
>>
>>
>> *Case 2: Succeeds  *
>>
>> val DF = Seq(("test1", 123)).toDF("name", "num")
>>  val partKey = List("num").map(x => x)
>>
>> DF.write.option("path", 
>> "gs://test_dd1/abc/").mode(SaveMode.Overwrite).partitionBy(partKey: 
>> _*).format("orc").saveAsTable("us_wm_supply_chain_otif_stg.test_tb2")
>>
>> val DF1 = Seq(("test2", 125)).toDF("name", "num")
>>
>> DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb2")
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>


Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Mich Talebzadeh
Under gs directory

"gs://test_dd1/abc/"

What do you see?

gsutil ls gs://test_dd1/abc

and the same

gs://test_dd1/

gsutil ls gs://test_dd1

I suspect you need a folder for multiple ORC slices!



Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 19 Aug 2023 at 21:36, Dipayan Dev  wrote:

> Hi Everyone,
>
> I'm stuck with one problem, where I need to provide a custom GCS location
> for the Hive table from Spark. The code fails while doing an *'insert
> into'* whenever my Hive table has a flat GCS location like
> gs://bucket_name/, but works for nested locations like
> gs://bucket_name/blob_name.
>
> Is anyone aware if it's an issue from Spark side or any config I need to
> pass for it?
>
> *The issue is happening in 2.x and 3.x both.*
>
> Config using:
>
> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition", true)
> spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
> spark.conf.set("hive.exec.dynamic.partition", true)
>
>
> *Case 1 : FAILS*
>
> val DF = Seq(("test1", 123)).toDF("name", "num")
>  val partKey = List("num").map(x => x)
>
> DF.write.option("path", 
> "gs://test_dd1/").mode(SaveMode.Overwrite).partitionBy(partKey: 
> _*).format("orc").saveAsTable("us_wm_supply_chain_otif_stg.test_tb1")
>
> val DF1 = Seq(("test2", 125)).toDF("name", "num")
> DF.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb1")
>
>
>
>
>
> *java.lang.NullPointerException  at 
> org.apache.hadoop.fs.Path.<init>(Path.java:141)  at 
> org.apache.hadoop.fs.Path.<init>(Path.java:120)  at 
> org.apache.hadoop.fs.Path.suffix(Path.java:441)  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:254)*
>
>
> *Case 2: Succeeds  *
>
> val DF = Seq(("test1", 123)).toDF("name", "num")
>  val partKey = List("num").map(x => x)
>
> DF.write.option("path", 
> "gs://test_dd1/abc/").mode(SaveMode.Overwrite).partitionBy(partKey: 
> _*).format("orc").saveAsTable("us_wm_supply_chain_otif_stg.test_tb2")
>
> val DF1 = Seq(("test2", 125)).toDF("name", "num")
>
> DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb2")
>
>
> With Best Regards,
>
> Dipayan Dev
>


Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Dipayan Dev
Hi Everyone,

I'm stuck with one problem, where I need to provide a custom GCS location
for the Hive table from Spark. The code fails while doing an *'insert into'*
whenever my Hive table has a flat GCS location like gs://bucket_name/, but
works for nested locations like gs://bucket_name/blob_name.

Is anyone aware if it's an issue from Spark side or any config I need to
pass for it?

*The issue is happening in 2.x and 3.x both.*

Config using:

spark.conf.set("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.hadoop.hive.exec.dynamic.partition", true)
spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition", true)


*Case 1 : FAILS*

val DF = Seq(("test1", 123)).toDF("name", "num")
 val partKey = List("num").map(x => x)

DF.write.option("path",
"gs://test_dd1/").mode(SaveMode.Overwrite).partitionBy(partKey:
_*).format("orc").saveAsTable("us_wm_supply_chain_otif_stg.test_tb1")

val DF1 = Seq(("test2", 125)).toDF("name", "num")
DF.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb1")





*java.lang.NullPointerException  at
org.apache.hadoop.fs.Path.<init>(Path.java:141)  at
org.apache.hadoop.fs.Path.<init>(Path.java:120)  at
org.apache.hadoop.fs.Path.suffix(Path.java:441)  at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:254)*


*Case 2: Succeeds  *

val DF = Seq(("test1", 123)).toDF("name", "num")
 val partKey = List("num").map(x => x)

DF.write.option("path",
"gs://test_dd1/abc/").mode(SaveMode.Overwrite).partitionBy(partKey:
_*).format("orc").saveAsTable("us_wm_supply_chain_otif_stg.test_tb2")

val DF1 = Seq(("test2", 125)).toDF("name", "num")

DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb2")


With Best Regards,

Dipayan Dev


Spark UI - Bug Executors tab when using proxy port

2023-07-06 Thread Bruno Pistone
Hello everyone,

I’m really sorry to use this mailing list, but seems impossible to notify a 
strange behaviour that is happening with the Spark UI. I’m sending also the 
link to the stackoverflow question here 
https://stackoverflow.com/questions/76632692/spark-ui-executors-tab-its-empty

I’m trying to run the Spark UI on a web server. I need to configure a specific 
port for running the UI and a redirect URL. I’m setting up the following OPTS:

```
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=${LOCAL_PATH_LOGS}
-Dspark.history.ui.port=18080 
-Dspark.eventLog.enabled=true 
-Dspark.ui.proxyRedirectUri=${SERVER_URL}"

./start-history-server.sh
```

What is happening: The UI is accessible through the url 
https://${SERVER_URL}/proxy/18080 

When I’m selecting an application and I’m clicking on the tab “Executors”, it 
remains empty. By looking at the API calls done by the UI, I see there is the 
"/allexecutors” which returns 404.

Instead of calling 
https://${SERVER_URL}/proxy/18080/api/v1/applications/${APP_ID}/allexecutors 

I see that the URL called is 
https://${SERVER_URL}/proxy/18080/api/v1/applications/18080/allexecutors 


Seems that the appId is not correctly identified. Can you please provide a 
solution for this, or an estimated date for fixing the error?

Thank you,
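
In the meantime, a small diagnostic sketch (SERVER_URL is a placeholder, and the Python requests package is assumed to be available) that calls the history server's REST API directly with the real application id should confirm that the data is there and that only the UI's URL construction is at fault:

import requests

base = "https://SERVER_URL/proxy/18080/api/v1"  # placeholder, same proxy as above

# List the applications the history server knows about and take one id.
apps = requests.get(f"{base}/applications").json()
app_id = apps[0]["id"]  # a real application id, not the port "18080"

# This is the call the Executors tab should be issuing.
executors = requests.get(f"{base}/applications/{app_id}/allexecutors").json()
print(app_id, "->", len(executors), "executors")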

[JDBC] [PySpark] Possible bug when comparing incoming data frame from mssql and empty delta table

2023-02-26 Thread lennart

Hello,

I have been working on a small ETL framework for
pyspark/delta/databricks in my spare time.


It looks like I might have encountered a bug, however I'm not totally
sure it's actually caused by spark itself and not one of the other
technologies.


The error shows up when using spark sql to compare an incoming data frame
from jdbc/mssql with an empty delta table.
Spark sends a query to mssql ending in 'WHERE (1)', which apparently is
invalid syntax and causes an exception to be thrown.
Unless reading in parallel, no WHERE clause should be needed at all, as
the code is reading all rows from the source.


The error does not happen on databricks 10.4 LTS with spark 3.2.1, but
from databricks 11.3 LTS with spark 3.3.0 and beyond it shows up.
The error also does not happen with postgresql or mysql, so either the
resulting sql is valid there or it does not contain the extra 'WHERE (1)'.


I have provided some sample pyspark code below that can be used to
reproduce the error on databricks community edition and a mssql server.
I have written 2 different versions of the sql statement. Both versions
result in the same error.


If there is some option or other trick that can be used to circumvent
the error on newer releases I would be grateful to learn about it.
However, being able to use a single sql statement for this is preferable,
to keep it short, idempotent and atomic.
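
One knob that might be relevant, though I have not verified that it helps in this case, is the JDBC source's pushDownPredicate option, which disables predicate push-down so no WHERE clause should be rendered into the SQL sent to mssql at all. A sketch, reusing the connection settings defined in the code below:

# Sketch only: same connection details as in the code below, read via the
# generic "jdbc" format so that pushDownPredicate can be set. With the option
# set to "false", filters stay on the Spark side instead of being pushed into
# the SQL sent to the source.
df = (
    spark.read.format("jdbc")
    .option("url", f"jdbc:sqlserver://{host}:{port};databaseName={database}")
    .option("dbtable", table)
    .option("user", username)
    .option("password", password)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("pushDownPredicate", "false")
    .load()
)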


Best regards
Lennart


Code:

# Add _checksum column to beginning of data frame.
def add_checksum_column(df):
    from pyspark.sql.functions import concat_ws, md5
    return df.select([md5(concat_ws("<|^|>", *sorted(df.columns))).alias("_checksum"), "*"])


# Add _key column to beginning of data frame.
def add_key_column(df, key):
    if isinstance(key, str):
        df = df.select([df[key].alias("_key"), "*"])
    elif isinstance(key, list):
        from pyspark.sql.functions import concat_ws
        df = df.select([concat_ws("-", *key).alias("_key"), "*"])
    else:
        raise Exception("Invalid key")
    return df


# Create Delta table.
def create_table(df, key, target):
    if spark.catalog._jcatalog.tableExists(target):
        return
    from pyspark.sql.functions import current_timestamp, lit
    df = add_checksum_column(df)                                     # (4) _checksum
    df = add_key_column(df, key)                                     # (3) _key
    df = df.select([current_timestamp().alias("_timestamp"), "*"])   # (2) _timestamp
    df = df.select([lit("I").alias("_operation"), "*"])              # (1) _operation

    df.filter("1=0").write.format("delta") \
        .option("delta.autoOptimize.optimizeWrite", "true") \
        .option("delta.autoOptimize.autoCompact", "true") \
        .saveAsTable(target)


# Capture inserted and updated records from full or partial source data frame.
def insert_update(df, key, target, query):
    # Prepare source view.
    df = add_checksum_column(df)
    df = add_key_column(df, key)
    df.createOrReplaceTempView("s")
    # Prepare target view.
    spark.table(target).createOrReplaceTempView("t")
    # Insert records.
    return spark.sql(query)


query1 = """
INSERT INTO t
SELECT CASE WHEN a._key IS NULL THEN "I" ELSE "U" END AS _operation, CURRENT_TIMESTAMP AS _timestamp, s.*
FROM s
LEFT JOIN
(
    SELECT t._key, t._checksum
    FROM t
    INNER JOIN (SELECT _key, MAX(_timestamp) AS m FROM t GROUP BY _key) AS m
        ON t._key = m._key AND t._timestamp = m.m
    WHERE t._operation <> "D"
)
AS a
ON s._key = a._key
WHERE (a._key IS NULL)          -- Insert
OR (s._checksum <> a._checksum) -- Update
"""

query2 = """
INSERT INTO t
SELECT CASE WHEN a._key IS NULL THEN "I" ELSE "U" END AS _operation, CURRENT_TIMESTAMP AS _timestamp, s.*
FROM s
LEFT JOIN
(
    SELECT _key, _checksum, ROW_NUMBER() OVER (PARTITION BY _key ORDER BY _timestamp DESC) AS rn
    FROM t
    WHERE _operation <> "D"
)
AS a
ON s._key = a._key AND a.rn = 1
WHERE (a._key IS NULL)          -- Insert
OR (s._checksum <> a._checksum) -- Update
"""

host     = "mssql.test.com"
port     = "1433"
database = "test"
username = "test"
password = "test"
table    = "test"
key      = "test_id"
target   = "archive.test"

df = spark.read.jdbc(
    properties = {"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
                  "user": username, "password": password},
    url = f"jdbc:sqlserver://{host}:{port};databaseName={database}",
    table = table
)

create_table(df=df, key=key, target=target)
insert_update(df=df, key=key, target=target, query=query1)



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[BUG?] How to handle with special characters or scape them on spark version 3.3.0?

2023-01-04 Thread Vieira, Thiago
Hello everyone,

I’ve already raised this question on stack overflow, but to be honest I truly 
believe this is a bug at new spark version, so I am also sending this email.

Previously I was using spark version 3.2.1 to read data from SAP database by 
JDBC connector, I had no issues to perform the following steps:

df_1 = spark.read.format("jdbc") \
.option("url", "URL_LINK") \
.option("dbtable", 'DATABASE."/ABC/TABLE"') \
.option("user", "USER_HERE") \
.option("password", "PW_HERE") \
.option("driver", "com.sap.db.jdbc.Driver") \
.load()

display(df_1)

df_2 = df_1.filter("`/ABC/COLUMN` = 'ID_HERE'")

display(df_2)


This code above runs as it should, returning expected rows.

Since I updated my spark version to 3.3.0, because I need the new
'availableNow' trigger (a streaming trigger), the process above started to
fail; it does not run at all.

Please find the error message below.


---------------------------------------------------------------------------
ParseException                            Traceback (most recent call last)
<command> in <module>()
      1 df_2 = df_1.filter("`/ABC/COLUMN` = 'ID_HERE'")
      2
----> 3 display(df_2)

/databricks/python_shell/dbruntime/display.py in display(self, input, *args, **kwargs)
     81                     raise Exception('Triggers can only be set for streaming queries.')
     82
---> 83             self.add_custom_display_data("table", input._jdf)
     84
     85         elif isinstance(input, list):

/databricks/python_shell/dbruntime/display.py in add_custom_display_data(self, data_type, data)
     34     def add_custom_display_data(self, data_type, data):
     35         custom_display_key = str(uuid.uuid4())
---> 36         return_code = self.entry_point.addCustomDisplayData(custom_display_key, data_type, data)
     37         ip_display({
     38             "application/vnd.databricks.v1+display": custom_display_key,

/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1319
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    200                 # Hide where the exception came from that shows a non-Pythonic
    201                 # JVM exception message.
--> 202                 raise converted from None
    203             else:
    204                 raise

ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near '/': extra input '/'(line 1, pos 0)

== SQL ==
/ABC/COLUMN
^^^

I've already tried to format it in many different ways, following the
instructions at https://spark.apache.org/docs/latest/sql-ref-literals.html.
I've tried building the filter string up front with a formatted string, and
also tried a raw string, but nothing seems to work as expected.

Another important piece of information: I tried to create dummy code for you to
be able to replicate the issue, but when I create tables with slashes
('/ABC/TABLE') containing columns with slashes ('/ABC/COLUMN') directly in
pyspark, instead of through the JDBC connector, it actually works and I am able
to filter. So I believe this error is related to SQL / JDBC; I am no longer
able to escape special characters in spark 3.3.0.
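
One variation that might be worth trying (I have not confirmed it against this exact setup) is to build the filter through the Column API instead of a SQL string, so the slash-prefixed name is never parsed as a standalone SQL snippet:

# Same df_1 as above; only the filter changes. df_1["/ABC/COLUMN"] builds the
# Column object directly, so the slash-prefixed name never goes through the
# SQL string parser on its own.
df_2 = df_1.filter(df_1["/ABC/COLUMN"] == "ID_HERE")
display(df_2)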


Regards,

Thiago Vieira
Data Engineer

This e-mail and any attachments contain privileged and confidential information 
intended only for the use of the addressee(s). If you are not an intended 
recipient of this e-mail, you are hereby notified that any dissemination, 
copying or use of information within it is strictly prohibited. If you received 
this e-mail in error or without authorization, please notify us immediately by 
reply e-mail and delete the e-mail from your system. Thank you in advance.



[PySpark] [applyInPandas] Regression Bug: Cogroup in pandas drops columns from the first dataframe

2022-11-25 Thread Michael Bílý
Hello there,

I ran into this problem on pyspark:
when using the groupby.cogroup functionality on the same dataframe, it
silently drops columns from the first instance, minimal example:
spark = (
SparkSession.builder
.getOrCreate()
)

df = spark.createDataFrame([["2017-08-17", 1,]], schema=["day",
"value"]).cache()

def in_pandas(df1, df2):
assert "value" in df1.columns
return df1

df = (
df
.groupby("day")
.cogroup(df.groupby("day"))
.applyInPandas(
in_pandas,
schema=df.schema,
)
)

df.show(20, False)

Fails on assertion error

My versions:
import pyspark.version
import pandas as pd
import pyarrow

print(sys.version)
# 3.8.10 (default, Jun 22 2022, 20:18:18)
# [GCC 9.4.0]
print(pyspark.version.__version__)
# 3.3.1
print(pd.__version__)
# 1.5.2
print(pyarrow.__version__)
# 10.0.1

It works on an AWS Glue session (the versions used there were shown in an
attached screenshot). It prints:
+--+-+
|day   |value|
+--+-+
|2017-08-17|1|
+--+-+

as expected.

Thank you,
Michael
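
A possible diagnostic (not a fix) would be to cogroup two independently constructed DataFrames with the same rows and schema, to see whether the column dropping is specific to the self-cogroup case; a sketch, assuming the in_pandas function defined above:

# Diagnostic sketch: build two independent DataFrames with the same data and
# cogroup those, to see whether the column dropping only happens when both
# sides come from the very same cached DataFrame object.
df_a = spark.createDataFrame([["2017-08-17", 1]], schema=["day", "value"])
df_b = spark.createDataFrame([["2017-08-17", 1]], schema=["day", "value"])

result = (
    df_a.groupby("day")
    .cogroup(df_b.groupby("day"))
    .applyInPandas(in_pandas, schema=df_a.schema)
)
result.show(20, False)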


[PySpark, Spark Streaming] Bug in timestamp handling in Structured Streaming?

2022-10-21 Thread kai-michael.roes...@sap.com.INVALID
Hi,

I suspect I may have come across a bug in the handling of data containing 
timestamps in PySpark "Structured Streaming" using the foreach option. I'm 
"just" a user of PySpark, no Spark community member, so I don't know how to 
properly address the issue. I have posted a 
question<https://stackoverflow.com/questions/74113270/how-to-handle-timestamp-data-in-pyspark-streaming-by-row>
 about this on StackOverflow but that didn't get any attention, yet. Could 
someone please have a look at it to check whether it is really a bug? In case a 
Jira ticket is created could you please send me the link?
Thanks and best regards
Kai Roesner.
Dr. Kai-Michael Roesner
Development Architect
Technology & Innovation, Common Data Services
SAP SE
Robert-Bosch-Strasse 30/34
69190 Walldorf, Germany
T +49 6227 7-64216
F +49 6227 78-28459
E kai-michael.roes...@sap.com<mailto:kai-michael.roes...@sap.com>
www.sap.com<http://www.sap.com/>

Please consider the impact on the environment before printing this e-mail.

Pflichtangaben/Mandatory Disclosure Statements:
www.sap.com/corporate-en/impressum<http://www.sap.com/corporate-en/impressum>


This e-mail may contain trade secrets or privileged, undisclosed, or otherwise 
confidential information. If you have received this e-mail in error, you are 
hereby notified that any review, copying, or distribution of it is strictly 
prohibited. Please inform us immediately and destroy the original transmittal. 
Thank you for your cooperation.



Unusual bug,please help me,i can do nothing!!!

2022-03-30 Thread spark User
Hello, I am a spark user. I use the "spark-shell.cmd" startup command in
windows cmd. The first startup is normal, but when I use "ctrl+c" to force-close
the spark window, it can't start normally again. The error message is as
follows: "Failed to initialize Spark session.
org.apache.spark.SparkException: Invalid Spark URL:
spark://HeartbeatReceiver@x.168.137.41:49963".
When I try to add "x.168.137.41" to 'etc/hosts' it works fine, until I use
"ctrl+c" again.
The result is that it cannot start normally. Please help me

error bug,please help me!!!

2022-03-20 Thread spark User
Hello, I am a spark user. I use the "spark-shell.cmd" startup command in
windows cmd. The first startup is normal, but when I use "ctrl+c" to force-close
the spark window, it can't start normally again. The error message is as
follows: "Failed to initialize Spark session.
org.apache.spark.SparkException: Invalid Spark URL:
spark://HeartbeatReceiver@x.168.137.41:49963".
When I try to add "x.168.137.41" to 'etc/hosts' it works fine, until I use
"ctrl+c" again.
The result is that it cannot start normally. Please help me

Fwd: metastore bug when hive update spark table ?

2022-01-06 Thread Mich Talebzadeh
From my experience this is a Spark issue (more code base divergence of
spark-sql from Hive), but of course there is the workaround as below



-- Forwarded message -
From: Mich Talebzadeh 
Date: Thu, 6 Jan 2022 at 17:29
Subject: Re: metastore bug when hive update spark table ?
To: user 


Well I have seen this type of error before.

I tend to create the table in hive first and alter it in spark if needed.
This is spark 3.1.1 with Hive (version 3.1.1)

0: jdbc:hive2://rhes75:10099/default> create table my_table2 (col1 int,
col2 int)
0: jdbc:hive2://rhes75:10099/default> describe my_table2;
+---++--+
| col_name  | data_type  | comment  |
+---++--+
| col1  | int|  |
| col2  | int|  |
+---++--+
2 rows selected (0.17 seconds)

in Spark

>>> spark.sql("""ALTER TABLE my_table2 ADD column col3 string""")
DataFrame[]
>>> for c in spark.sql("""describe formatted my_table2 """).collect():
...   print(c)
...
*Row(col_name='col1', data_type='int', comment=None)*
*Row(col_name='col2', data_type='int', comment=None)*
*Row(col_name='col3', data_type='string', comment=None)*
Row(col_name='', data_type='', comment='')
Row(col_name='# Detailed Table Information', data_type='', comment='')
Row(col_name='Database', data_type='default', comment='')
Row(col_name='Table', data_type='my_table2', comment='')
Row(col_name='Owner', data_type='hduser', comment='')
Row(col_name='Created Time', data_type='Thu Jan 06 17:16:37 GMT 2022',
comment='')
Row(col_name='Last Access', data_type='UNKNOWN', comment='')
Row(col_name='Created By', data_type='Spark 2.2 or prior', comment='')
Row(col_name='Type', data_type='MANAGED', comment='')
Row(col_name='Provider', data_type='hive', comment='')
Row(col_name='Table Properties', data_type='[bucketing_version=2,
transient_lastDdlTime=1641489641]', comment='')
Row(col_name='Location',
data_type='hdfs://rhes75:9000/user/hive/warehouse/my_table2', comment='')
Row(col_name='Serde Library',
data_type='org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe', comment='')
Row(col_name='InputFormat',
data_type='org.apache.hadoop.mapred.TextInputFormat', comment='')
Row(col_name='OutputFormat',
data_type='org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
comment='')
Row(col_name='Storage Properties', data_type='[serialization.format=1]',
comment='')
Row(col_name='Partition Provider', data_type='Catalog', comment='')


This is my work around

HTH

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 6 Jan 2022 at 16:17, Nicolas Paris  wrote:

> Hi there.
>
> I also posted this problem in the spark list. I am no sure this is a
> spark or a hive metastore problem. Or if there is some metastore tunning
> configuration as workaround.
>
>
> Spark can't see hive schema updates partly because it stores the schema
> in a weird way in hive metastore.
>
>
> 1. FROM SPARK: create a table
> 
> >>> spark.sql("select 1 col1, 2
> col2").write.format("parquet").saveAsTable("my_table")
> >>> spark.table("my_table").printSchema()
> root
> |-- col1: integer (nullable = true)
> |-- col2: integer (nullable = true)
>
>
> 2. FROM HIVE: alter the schema
> ==
> 0: jdbc:hive2://localhost:1> ALTER TABLE my_table REPLACE
> COLUMNS(`col1` int, `col2` int, `col3` string);
> 0: jdbc:hive2://localhost:1> describe my_table;
> +---++--+
> | col_name | data_type | comment |
> +---++--+
> | col1 | int | |
> | col2 | int | |
> | col3 | string | |
> +---++--+
>
>
> 3. FROM SPARK: problem, column does not appear
> ==
> >>> spark.table("my_table").printSchema()
> root
> |-- col1: integer (nullable = true)
> |-- col2: integer (nullable = true)
>
>
> 4. FROM METASTORE DB: two ways of storing the columns
> ==
> metastore=# select * from "COLUMNS_V2";
> CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX
> ---+-+-+---+-
> 2 | | col1 | int | 0
> 2 | | col2 | int | 1
> 2 | | col3 | string | 2
>
>
> metastore=# select * from "TABLE_PARAMS";
> TBL_ID | PARAM_KEY | PARAM_VALUE
>
>
> --

spark metadata metastore bug ?

2022-01-06 Thread Nicolas Paris
Spark can't see hive schema updates partly because it stores the schema
in a weird way in hive metastore.


1. FROM SPARK: create a table

>>> spark.sql("select 1 col1, 2 
>>> col2").write.format("parquet").saveAsTable("my_table")
>>> spark.table("my_table").printSchema()
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)


2. FROM HIVE: alter the schema
==
0: jdbc:hive2://localhost:1> ALTER TABLE my_table REPLACE COLUMNS(`col1` 
int, `col2` int, `col3` string);
0: jdbc:hive2://localhost:1> describe my_table;
+---++--+
| col_name  | data_type  | comment  |
+---++--+
| col1  | int|  |
| col2  | int|  |
| col3  | string |  |
+---++--+


3. FROM SPARK: problem, column does not appear
==
>>> spark.table("my_table").printSchema()
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)


4. FROM METASTORE DB: two ways of storing the columns
==
metastore=# select * from "COLUMNS_V2";
 CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX
---+-+-+---+-
 2 | | col1| int   |   0
 2 | | col2| int   |   1
 2 | | col3| string|   2


metastore=# select * from "TABLE_PARAMS";
 TBL_ID | PARAM_KEY |   
 PARAM_VALUE

+---+-
---
  1 | spark.sql.sources.provider| parquet
  1 | spark.sql.sources.schema.part.0   | 
{"type":"struct","fields":[{"name":"col1","type":"integer","nullable":true,"metadata":{}},{"name":"col2","type":"integer","n
ullable":true,"metadata":{}}]}
  1 | spark.sql.create.version  | 2.4.8
  1 | spark.sql.sources.schema.numParts | 1
  1 | last_modified_time| 1641483180
  1 | transient_lastDdlTime | 1641483180
  1 | last_modified_by  | anonymous

metastore=# truncate "TABLE_PARAMS";
TRUNCATE TABLE


5. FROM SPARK: now the column magically appears
==
>>> spark.table("my_table").printSchema()
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)
 |-- col3: string (nullable = true)


Then is it necessary to store that stuff in the TABLE_PARAMS ?
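
For reference, the workaround suggested elsewhere in this thread (issuing the DDL from Spark so that both the Hive columns and the Spark-written schema property stay in sync) would look roughly like this, using the same table as above; this is only a sketch of what the thread already demonstrates:

# Sketch of the Spark-side workaround discussed in this thread: run the DDL
# through Spark SQL so the schema copy kept in TABLE_PARAMS is rewritten too.
spark.sql("ALTER TABLE my_table ADD COLUMNS (col3 string)")
spark.table("my_table").printSchema()  # col3 should now be visible to Spark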


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: possible bug

2021-04-09 Thread Mich Talebzadeh
Spark 3.1.1



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 9 Apr 2021 at 17:36, Mich Talebzadeh 
wrote:

> I ran this one on RHES 7.6 with 64GB of memory and it hit OOM
>
> >>> data=list(range(rows))
> >>> rdd=sc.parallelize(data,rows)
> >>> assert rdd.getNumPartitions()==rows
> >>> rdd0=rdd.filter(lambda x:False)
> >>> assert rdd0.getNumPartitions()==rows
> >>> rdd00=rdd0.coalesce(1)
> >>> data=rdd00.collect()
> 2021-04-09 17:19:01,452 WARN scheduler.TaskSetManager: Stage 1 contains a
> task of very large size (4729 KiB). The maximum recommended task size is
> 1000 KiB.
> 2021-04-09 17:25:14,249 ERROR executor.Executor: Exception in task 0.0 in
> stage 1.0 (TID 1)
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
>
>
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 9 Apr 2021 at 17:33, Sean Owen  wrote:
>
>> OK so it's '7 threads overwhelming off heap mem in the JVM' kind of
>> thing. Or running afoul of ulimits in the OS.
>>
>> On Fri, Apr 9, 2021 at 11:19 AM Attila Zsolt Piros <
>> piros.attila.zs...@gmail.com> wrote:
>>
>>> Hi Sean!
>>>
>>> So the "coalesce" without shuffle will create a CoalescedRDD which
>>> during its computation delegates to the parent RDD partitions.
>>> As the CoalescedRDD contains only 1 partition so we talk about 1 task
>>> and 1 task context.
>>>
>>> The next stop is PythonRunner.
>>>
>>> Here the python workers at least are reused (when
>>> "spark.python.worker.reuse" is true, and true is the default) but the
>>> MonitorThreads are not reused and what is worse all the MonitorThreads are
>>> created for the same worker and same TaskContext.
>>> This means the CoalescedRDD's 1 tasks should be completed to stop the
>>> first monitor thread, relevant code:
>>>
>>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L570
>>>
>>> So this will lead to creating 7 extra threads when 1 would be enough.
>>>
>>> The jira is: https://issues.apache.org/jira/browse/SPARK-35009
>>> The PR will next week maybe (I am a bit uncertain as I have many other
>>> things to do right now).
>>>
>>> Best Regards,
>>> Attila
>>>

>


Re: possible bug

2021-04-09 Thread Mich Talebzadeh
I ran this one on RHES 7.6 with 64GB of memory and it hit OOM

>>> data=list(range(rows))
>>> rdd=sc.parallelize(data,rows)
>>> assert rdd.getNumPartitions()==rows
>>> rdd0=rdd.filter(lambda x:False)
>>> assert rdd0.getNumPartitions()==rows
>>> rdd00=rdd0.coalesce(1)
>>> data=rdd00.collect()
2021-04-09 17:19:01,452 WARN scheduler.TaskSetManager: Stage 1 contains a
task of very large size (4729 KiB). The maximum recommended task size is
1000 KiB.
2021-04-09 17:25:14,249 ERROR executor.Executor: Exception in task 0.0 in
stage 1.0 (TID 1)
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)





   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 9 Apr 2021 at 17:33, Sean Owen  wrote:

> OK so it's '7 threads overwhelming off heap mem in the JVM' kind of
> thing. Or running afoul of ulimits in the OS.
>
> On Fri, Apr 9, 2021 at 11:19 AM Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> Hi Sean!
>>
>> So the "coalesce" without shuffle will create a CoalescedRDD which during
>> its computation delegates to the parent RDD partitions.
>> As the CoalescedRDD contains only 1 partition so we talk about 1 task and
>> 1 task context.
>>
>> The next stop is PythonRunner.
>>
>> Here the python workers at least are reused (when
>> "spark.python.worker.reuse" is true, and true is the default) but the
>> MonitorThreads are not reused and what is worse all the MonitorThreads are
>> created for the same worker and same TaskContext.
>> This means the CoalescedRDD's 1 tasks should be completed to stop the
>> first monitor thread, relevant code:
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L570
>>
>> So this will lead to creating 7 extra threads when 1 would be enough.
>>
>> The jira is: https://issues.apache.org/jira/browse/SPARK-35009
>> The PR will next week maybe (I am a bit uncertain as I have many other
>> things to do right now).
>>
>> Best Regards,
>> Attila
>>
>>>



Re: possible bug

2021-04-09 Thread Sean Owen
OK so it's '7 threads overwhelming off heap mem in the JVM' kind of
thing. Or running afoul of ulimits in the OS.

On Fri, Apr 9, 2021 at 11:19 AM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi Sean!
>
> So the "coalesce" without shuffle will create a CoalescedRDD which during
> its computation delegates to the parent RDD partitions.
> As the CoalescedRDD contains only 1 partition so we talk about 1 task and
> 1 task context.
>
> The next stop is PythonRunner.
>
> Here the python workers at least are reused (when
> "spark.python.worker.reuse" is true, and true is the default) but the
> MonitorThreads are not reused and what is worse all the MonitorThreads are
> created for the same worker and same TaskContext.
> This means the CoalescedRDD's 1 tasks should be completed to stop the
> first monitor thread, relevant code:
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L570
>
> So this will lead to creating 7 extra threads when 1 would be enough.
>
> The jira is: https://issues.apache.org/jira/browse/SPARK-35009
> The PR will next week maybe (I am a bit uncertain as I have many other
> things to do right now).
>
> Best Regards,
> Attila
>
>>
>>>


Re: possible bug

2021-04-09 Thread Attila Zsolt Piros
Hi Sean!

So the "coalesce" without shuffle will create a CoalescedRDD which during
its computation delegates to the parent RDD partitions.
As the CoalescedRDD contains only 1 partition so we talk about 1 task and 1
task context.

The next stop is PythonRunner.

Here the python workers at least are reused (when
"spark.python.worker.reuse" is true, and true is the default) but the
MonitorThreads are not reused and what is worse all the MonitorThreads are
created for the same worker and same TaskContext.
This means the CoalescedRDD's 1 tasks should be completed to stop the first
monitor thread, relevant code:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala#L570

So this will lead to creating 7 extra threads when 1 would be enough.

The jira is: https://issues.apache.org/jira/browse/SPARK-35009
The PR will next week maybe (I am a bit uncertain as I have many other
things to do right now).

Best Regards,
Attila
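
As a stop-gap until that improvement lands, the shuffle variant of coalesce, which is reported elsewhere in this thread as working, can be sketched like this (the exact row/partition count is an assumption based on the ~70k partitions discussed here):

import pyspark

# Same shape as the repro in this thread, but coalescing with shuffle=True,
# which the thread reports as working (at the cost of a shuffle).
conf = pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
sc = pyspark.SparkContext.getOrCreate(conf)

rows = 70000  # assuming the ~70k partitions discussed in this thread
rdd0 = sc.parallelize(list(range(rows)), rows).filter(lambda x: False)

rdd00 = rdd0.coalesce(1, shuffle=True)  # shuffle variant instead of coalesce(1)
assert rdd00.collect() == []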

On Fri, Apr 9, 2021 at 5:54 PM Sean Owen  wrote:

> Yeah I figured it's not something fundamental to the task or Spark. The
> error is very odd, never seen that. Do you have a theory on what's going on
> there? I don't!
>
> On Fri, Apr 9, 2021 at 10:43 AM Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> Hi!
>>
>> I looked into the code and find a way to improve it.
>>
>> With the improvement your test runs just fine:
>>
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>>       /_/
>>   /_/
>>
>> Using Python version 3.8.1 (default, Dec 30 2020 22:53:18)
>> Spark context Web UI available at http://192.168.0.199:4040
>> Spark context available as 'sc' (master = local, app id =
>> local-1617982367872).
>> SparkSession available as 'spark'.
>>
>> In [1]: import pyspark
>>
>> In [2]:
>> conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
>>
>> In [3]: sc=pyspark.SparkContext.getOrCreate(conf)
>>
>> In [4]: rows=7
>>
>> In [5]: data=list(range(rows))
>>
>> In [6]: rdd=sc.parallelize(data,rows)
>>
>> In [7]: assert rdd.getNumPartitions()==rows
>>
>> In [8]: rdd0=rdd.filter(lambda x:False)
>>
>> In [9]: assert rdd0.getNumPartitions()==rows
>>
>> In [10]: rdd00=rdd0.coalesce(1)
>>
>> In [11]: data=rdd00.collect()
>> 21/04/09 17:32:54 WARN TaskSetManager: Stage 0 contains a task of very
>> large siz
>> e (4729 KiB). The maximum recommended task size is 1000 KiB.
>>
>> In [12]: assert data==[]
>>
>> In [13]:
>>
>>
>> I will create a jira and need to add some unittest before opening the PR.
>>
>> Best Regards,
>> Attila
>>
>>>


Re: possible bug

2021-04-09 Thread Mich Talebzadeh
Interesting unitest not pytest :)

What is data in [11] reused compared to 5  -- list()?

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 9 Apr 2021 at 16:44, Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi!
>
> I looked into the code and find a way to improve it.
>
> With the improvement your test runs just fine:
>
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>   /_/
>
> Using Python version 3.8.1 (default, Dec 30 2020 22:53:18)
> Spark context Web UI available at http://192.168.0.199:4040
> Spark context available as 'sc' (master = local, app id =
> local-1617982367872).
> SparkSession available as 'spark'.
>
> In [1]: import pyspark
>
> In [2]:
> conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
>
> In [3]: sc=pyspark.SparkContext.getOrCreate(conf)
>
> In [4]: rows=7
>
> In [5]: data=list(range(rows))
>
> In [6]: rdd=sc.parallelize(data,rows)
>
> In [7]: assert rdd.getNumPartitions()==rows
>
> In [8]: rdd0=rdd.filter(lambda x:False)
>
> In [9]: assert rdd0.getNumPartitions()==rows
>
> In [10]: rdd00=rdd0.coalesce(1)
>
> In [11]: data=rdd00.collect()
> 21/04/09 17:32:54 WARN TaskSetManager: Stage 0 contains a task of very
> large siz
> e (4729 KiB). The maximum recommended task size is 1000 KiB.
>
> In [12]: assert data==[]
>
> In [13]:
>
>
> I will create a jira and need to add some unittest before opening the PR.
>
> Best Regards,
> Attila
>
> On Fri, Apr 9, 2021 at 7:04 AM Weiand, Markus, NMA-CFD <
> markus.wei...@bertelsmann.de> wrote:
>
>> I’ve changed the code to set driver memory to 100g, changed python code:
>>
>> import pyspark
>>
>>
>> conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1").set(key="spark.driver.memory",
>> value="100g")
>>
>> sc=pyspark.SparkContext.getOrCreate(conf)
>>
>> rows=7
>>
>> data=list(range(rows))
>>
>> rdd=sc.parallelize(data,rows)
>>
>> assert rdd.getNumPartitions()==rows
>>
>> rdd0=rdd.filter(lambda x:False)
>>
>> assert rdd0.getNumPartitions()==rows
>>
>> rdd00=rdd0.coalesce(1)
>>
>> data=rdd00.collect()
>>
>> assert data==[]
>>
>>
>>
>> Still the same error happens:
>>
>>
>>
>> 21/04/09 04:48:38 WARN TaskSetManager: Stage 0 contains a task of very
>> large size (4732 KiB). The maximum recommended task size is 1000 KiB.
>>
>> OpenJDK 64-Bit Server VM warning: INFO:
>> os::commit_memory(0x7f464355, 16384, 0) failed; error='Not enough
>> space' (errno=12)
>>
>> [423.701s][warning][os,thread] Attempt to protect stack guard pages
>> failed (0x7f4640d28000-0x7f4640d2c000).
>>
>> [423.701s][warning][os,thread] Attempt to deallocate stack guard pages
>> failed.
>>
>> [423.704s][warning][os,thread] Failed to start thread - pthread_create
>> failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
>>
>> #
>>
>> # There is insufficient memory for the Java Runtime Environment to
>> continue.
>>
>> # Native memory allocation (mmap) failed to map 16384 bytes for
>> committing reserved memory.
>>
>>
>>
>> A function which needs 423 seconds to crash with excessive memory
>> consumption when trying to coalesce 7 empty partitions is not very
>> practical. As I do not know the limits in which coalesce without shuffling
>> can be used safely and with performance, I will now always use coalesce
>> with shuffling, even though in theory this will come with quite a
>> performance decrease.
>>
>>
>>
>> Markus
>>
>>
>>
>> *From:* Russell Spitzer
>> *Sent:* Thursday, 8 April 2021 15:24
>> *To:* Weiand, Markus, NMA-CFD
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: possible bug
>>
>>
>>
>> Could be that the driver JVM cannot handle the metadata required to store
>> the p

Re: possible bug

2021-04-09 Thread Sean Owen
Yeah I figured it's not something fundamental to the task or Spark. The
error is very odd, never seen that. Do you have a theory on what's going on
there? I don't!

On Fri, Apr 9, 2021 at 10:43 AM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi!
>
> I looked into the code and find a way to improve it.
>
> With the improvement your test runs just fine:
>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>       /_/
>
> Using Python version 3.8.1 (default, Dec 30 2020 22:53:18)
> Spark context Web UI available at http://192.168.0.199:4040
> Spark context available as 'sc' (master = local, app id =
> local-1617982367872).
> SparkSession available as 'spark'.
>
> In [1]: import pyspark
>
> In [2]:
> conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
>
> In [3]: sc=pyspark.SparkContext.getOrCreate(conf)
>
> In [4]: rows=7
>
> In [5]: data=list(range(rows))
>
> In [6]: rdd=sc.parallelize(data,rows)
>
> In [7]: assert rdd.getNumPartitions()==rows
>
> In [8]: rdd0=rdd.filter(lambda x:False)
>
> In [9]: assert rdd0.getNumPartitions()==rows
>
> In [10]: rdd00=rdd0.coalesce(1)
>
> In [11]: data=rdd00.collect()
> 21/04/09 17:32:54 WARN TaskSetManager: Stage 0 contains a task of very
> large siz
> e (4729 KiB). The maximum recommended task size is 1000 KiB.
>
> In [12]: assert data==[]
>
> In [13]:
>
>
> I will create a jira and need to add some unittest before opening the PR.
>
> Best Regards,
> Attila
>
>>


Re: possible bug

2021-04-09 Thread Attila Zsolt Piros
Hi!

I looked into the code and find a way to improve it.

With the improvement your test runs just fine:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
      /_/

Using Python version 3.8.1 (default, Dec 30 2020 22:53:18)
Spark context Web UI available at http://192.168.0.199:4040
Spark context available as 'sc' (master = local, app id =
local-1617982367872).
SparkSession available as 'spark'.

In [1]: import pyspark

In [2]:
conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")

In [3]: sc=pyspark.SparkContext.getOrCreate(conf)

In [4]: rows=7

In [5]: data=list(range(rows))

In [6]: rdd=sc.parallelize(data,rows)

In [7]: assert rdd.getNumPartitions()==rows

In [8]: rdd0=rdd.filter(lambda x:False)

In [9]: assert rdd0.getNumPartitions()==rows

In [10]: rdd00=rdd0.coalesce(1)

In [11]: data=rdd00.collect()
21/04/09 17:32:54 WARN TaskSetManager: Stage 0 contains a task of very
large siz
e (4729 KiB). The maximum recommended task size is 1000 KiB.

In [12]: assert data==[]

In [13]:


I will create a jira and need to add some unittest before opening the PR.

Best Regards,
Attila

On Fri, Apr 9, 2021 at 7:04 AM Weiand, Markus, NMA-CFD <
markus.wei...@bertelsmann.de> wrote:

> I’ve changed the code to set driver memory to 100g, changed python code:
>
> import pyspark
>
>
> conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1").set(key="spark.driver.memory",
> value="100g")
>
> sc=pyspark.SparkContext.getOrCreate(conf)
>
> rows=7
>
> data=list(range(rows))
>
> rdd=sc.parallelize(data,rows)
>
> assert rdd.getNumPartitions()==rows
>
> rdd0=rdd.filter(lambda x:False)
>
> assert rdd0.getNumPartitions()==rows
>
> rdd00=rdd0.coalesce(1)
>
> data=rdd00.collect()
>
> assert data==[]
>
>
>
> Still the same error happens:
>
>
>
> 21/04/09 04:48:38 WARN TaskSetManager: Stage 0 contains a task of very
> large size (4732 KiB). The maximum recommended task size is 1000 KiB.
>
> OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x7f464355, 16384, 0) failed; error='Not enough
> space' (errno=12)
>
> [423.701s][warning][os,thread] Attempt to protect stack guard pages failed
> (0x7f4640d28000-0x7f4640d2c000).
>
> [423.701s][warning][os,thread] Attempt to deallocate stack guard pages
> failed.
>
> [423.704s][warning][os,thread] Failed to start thread - pthread_create
> failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
>
> #
>
> # There is insufficient memory for the Java Runtime Environment to
> continue.
>
> # Native memory allocation (mmap) failed to map 16384 bytes for committing
> reserved memory.
>
>
>
> A function which needs 423 seconds to crash with excessive memory
> consumption when trying to coalesce 7 empty partitions is not very
> practical. As I do not know the limits in which coalesce without shuffling
> can be used safely and with performance, I will now always use coalesce
> with shuffling, even though in theory this will come with quite a
> performance decrease.
>
>
>
> Markus
>
>
>
> *From:* Russell Spitzer 
> *Sent:* Thursday, 8 April 2021 15:24
> *To:* Weiand, Markus, NMA-CFD 
> *Cc:* user@spark.apache.org
> *Subject:* Re: possible bug
>
>
>
> Could be that the driver JVM cannot handle the metadata required to store
> the partition information of a 70k partition RDD. I see you say you have a
> 100GB driver but i'm not sure where you configured that?
>
> Did you set --driver-memory 100G ?
>
>
>
> On Thu, Apr 8, 2021 at 8:08 AM Weiand, Markus, NMA-CFD <
> markus.wei...@bertelsmann.de> wrote:
>
> This is the reduction of an error in a complex program where allocated 100
> GB driver (=worker=executor as local mode) memory. In the example I used
> the default size, as the puny example shouldn’t need more anyway.
>
> And without the coalesce or with coalesce(1,True) everything works fine.
>
> I’m trying to coalesce an empty rdd with 7 partitions in an empty rdd
> with 1 partition, why is this a problem without shuffling?
>
>
>
> *From:* Sean Owen 
> *Sent:* Thursday, 8 April 2021 15:00
> *To:* Weiand, Markus, NMA-CFD 
> *Cc:* user@spark.apache.org
> *Subject:* Re: possible bug
>
>
>
> That's a very low level error from the JVM. Any chance you are
> misconfiguring the executor size? like to 10MB instead of 10GB, that kind
> of thing. Trying to think of why the JVM would have very little memory to
> operate.

AW: possible bug

2021-04-08 Thread Weiand, Markus, NMA-CFD
I've changed the code to set driver memory to 100g, changed python code:
import pyspark

conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1").set(key="spark.driver.memory",
 value="100g")
sc=pyspark.SparkContext.getOrCreate(conf)
rows=7
data=list(range(rows))
rdd=sc.parallelize(data,rows)
assert rdd.getNumPartitions()==rows
rdd0=rdd.filter(lambda x:False)
assert rdd0.getNumPartitions()==rows
rdd00=rdd0.coalesce(1)
data=rdd00.collect()
assert data==[]

Still the same error happens:

21/04/09 04:48:38 WARN TaskSetManager: Stage 0 contains a task of very large 
size (4732 KiB). The maximum recommended task size is 1000 KiB.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x7f464355, 
16384, 0) failed; error='Not enough space' (errno=12)
[423.701s][warning][os,thread] Attempt to protect stack guard pages failed 
(0x7f4640d28000-0x7f4640d2c000).
[423.701s][warning][os,thread] Attempt to deallocate stack guard pages failed.
[423.704s][warning][os,thread] Failed to start thread - pthread_create failed 
(EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16384 bytes for committing 
reserved memory.

A function which needs 423 seconds to crash with excessive memory consumption 
when trying to coalesce 7 empty partitions is not very practical. As I do 
not know the limits within which coalesce without shuffling can be used safely and 
with good performance, I will now always use coalesce with shuffling, even though in 
theory this will come with quite a performance decrease.
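
For illustration, a minimal PySpark sketch of the two variants discussed above. The tiny RDD here is hypothetical and only meant to show the API difference, not to reproduce the failure:

```python
import pyspark

conf = pyspark.SparkConf().setMaster("local[4]").setAppName("coalesce-sketch")
sc = pyspark.SparkContext.getOrCreate(conf)

rdd = sc.parallelize(range(1000), 100).filter(lambda x: False)

# Narrow coalesce: no shuffle; the single output partition is computed by one
# task whose lineage spans all 100 parent partitions.
empty_narrow = rdd.coalesce(1).collect()

# Shuffle-based coalesce (what coalesce(1, True) does): parent partitions are
# still processed by separate upstream tasks, and only the shuffled output is
# merged into one partition.
empty_shuffled = rdd.coalesce(1, shuffle=True).collect()

assert empty_narrow == [] and empty_shuffled == []
```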

Markus

From: Russell Spitzer 
Sent: Thursday, 8 April 2021 15:24
To: Weiand, Markus, NMA-CFD 
Cc: user@spark.apache.org
Subject: Re: possible bug

Could be that the driver JVM cannot handle the metadata required to store the 
partition information of a 70k partition RDD. I see you say you have a 100GB 
driver but i'm not sure where you configured that?

Did you set --driver-memory 100G ?

On Thu, Apr 8, 2021 at 8:08 AM Weiand, Markus, NMA-CFD 
mailto:markus.wei...@bertelsmann.de>> wrote:
This is the reduction of an error in a complex program where I allocated 100 GB of 
driver (=worker=executor, as local mode) memory. In the example I used the 
default size, as the puny example shouldn't need more anyway.
And without the coalesce, or with coalesce(1,True), everything works fine.
I'm trying to coalesce an empty rdd with 7 partitions into an empty rdd with 
1 partition; why is this a problem without shuffling?

From: Sean Owen mailto:sro...@gmail.com>>
Sent: Thursday, 8 April 2021 15:00
To: Weiand, Markus, NMA-CFD 
mailto:markus.wei...@bertelsmann.de>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: possible bug

That's a very low level error from the JVM. Any chance you are misconfiguring 
the executor size? like to 10MB instead of 10GB, that kind of thing. Trying to 
think of why the JVM would have very little memory to operate.
An app running out of mem would not look like this.

On Thu, Apr 8, 2021 at 7:53 AM Weiand, Markus, NMA-CFD 
mailto:markus.wei...@bertelsmann.de>> wrote:
Hi all,

I'm using spark on a c5a.16xlarge machine in amazon cloud (so having  64 cores 
and 128 GB RAM). I'm using Spark 3.0.1.

The following python code leads to an exception, is this a bug or is my 
understanding of the API incorrect?

import pyspark
conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
sc=pyspark.SparkContext.getOrCreate(conf)
rows=7
data=list(range(rows))
rdd=sc.parallelize(data,rows)
assert rdd.getNumPartitions()==rows
rdd0=rdd.filter(lambda x:False)
assert rdd0.getNumPartitions()==rows
rdd00=rdd0.coalesce(1)
data=rdd00.collect()
assert data==[]

output when starting from PyCharm:

/home/ubuntu/PycharmProjects//venv/bin/python 
/opt/pycharm-2020.2.3/plugins/python/helpers/pydev/pydevconsole.py 
--mode=client --port=41185
import sys; print('Python %s on %s' % (sys.version, sys.platform))
sys.path.extend(['/home/ubuntu/PycharmProjects/'])
PyDev console: starting.
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
import os
os.environ['PYTHONHASHSEED'] = '0'
runfile('/home/ubuntu/PycharmProjects//tests/test.py', 
wdir='/home/ubuntu/PycharmProjects//tests')
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor 
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be

Re: possible bug

2021-04-08 Thread Russell Spitzer
Could be that the driver JVM cannot handle the metadata required to store
the partition information of a 70k partition RDD. I see you say you have a
100GB driver but i'm not sure where you configured that?

Did you set --driver-memory 100G ?
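
For reference, a minimal sketch of one way to make sure the value reaches the driver JVM at launch time (the 100g figure is just the number discussed in this thread). When an application is started through spark-submit, setting spark.driver.memory in SparkConf from inside the application is generally too late, because the driver JVM has already been started:

```python
import os

# Must be set before the SparkContext is created; "pyspark-shell" terminates
# the launcher arguments that PySpark's gateway passes to spark-submit.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 100g pyspark-shell"

import pyspark

sc = pyspark.SparkContext.getOrCreate(
    pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
)

# Sanity check: see what the driver JVM actually picked up.
print(sc.getConf().get("spark.driver.memory", "not set"))

# Equivalent when submitting a script directly:
#   spark-submit --driver-memory 100g test.py
```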

On Thu, Apr 8, 2021 at 8:08 AM Weiand, Markus, NMA-CFD <
markus.wei...@bertelsmann.de> wrote:

> This is the reduction of an error in a complex program where allocated 100
> GB driver (=worker=executor as local mode) memory. In the example I used
> the default size, as the puny example shouldn’t need more anyway.
>
> And without the coalesce or with coalesce(1,True) everything works fine.
>
> I’m trying to coalesce an empty rdd with 7 partitions in an empty rdd
> with 1 partition, why is this a problem without shuffling?
>
>
>
> *From:* Sean Owen 
> *Sent:* Thursday, 8 April 2021 15:00
> *To:* Weiand, Markus, NMA-CFD 
> *Cc:* user@spark.apache.org
> *Subject:* Re: possible bug
>
>
>
> That's a very low level error from the JVM. Any chance you are
> misconfiguring the executor size? like to 10MB instead of 10GB, that kind
> of thing. Trying to think of why the JVM would have very little memory to
> operate.
>
> An app running out of mem would not look like this.
>
>
>
> On Thu, Apr 8, 2021 at 7:53 AM Weiand, Markus, NMA-CFD <
> markus.wei...@bertelsmann.de> wrote:
>
> Hi all,
>
>
>
> I'm using spark on a c5a.16xlarge machine in amazon cloud (so having  64
> cores and 128 GB RAM). I'm using Spark 3.0.1.
>
>
>
> The following python code leads to an exception, is this a bug or is my
> understanding of the API incorrect?
>
>
>
> import pyspark
>
> conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
>
> sc=pyspark.SparkContext.getOrCreate(conf)
>
> rows=7
>
> data=list(range(rows))
>
> rdd=sc.parallelize(data,rows)
>
> assert rdd.getNumPartitions()==rows
>
> rdd0=rdd.filter(lambda x:False)
>
> assert rdd0.getNumPartitions()==rows
>
> rdd00=rdd0.coalesce(1)
>
> data=rdd00.collect()
>
> assert data==[]
>
>
>
> output when starting from PyCharm:
>
>
>
> /home/ubuntu/PycharmProjects//venv/bin/python
> /opt/pycharm-2020.2.3/plugins/python/helpers/pydev/pydevconsole.py
> --mode=client --port=41185
>
> import sys; print('Python %s on %s' % (sys.version, sys.platform))
>
> sys.path.extend(['/home/ubuntu/PycharmProjects/'])
>
> PyDev console: starting.
>
> Python 3.8.5 (default, Jan 27 2021, 15:41:15)
>
> [GCC 9.3.0] on linux
>
> import os
>
> os.environ['PYTHONHASHSEED'] = '0'
>
> runfile('/home/ubuntu/PycharmProjects//tests/test.py',
> wdir='/home/ubuntu/PycharmProjects//tests')
>
> WARNING: An illegal reflective access operation has occurred
>
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
> (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor
> java.nio.DirectByteBuffer(long,int)
>
> WARNING: Please consider reporting this to the maintainers of
> org.apache.spark.unsafe.Platform
>
> WARNING: Use --illegal-access=warn to enable warnings of further illegal
> reflective access operations
>
> WARNING: All illegal access operations will be denied in a future release
>
> 21/04/08 12:12:26 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 21/04/08 12:12:29 WARN TaskSetManager: Stage 0 contains a task of very
> large size (4732 KiB). The maximum recommended task size is 1000 KiB.
>
> [Stage 0:>  (0 +
> 1) / 1][423.190s][warning][os,thread] Attempt to protect stack guard pages
> failed (0x7f43d23ff000-0x7f43d2403000).
>
> [423.190s][warning][os,thread] Attempt to deallocate stack guard pages
> failed.
>
> OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x7f43d300b000, 16384, 0) failed; error='Not enough
> space' (errno=12)
>
> [423.231s][warning][os,thread] Failed to start thread - pthread_create
> failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
>
> #
>
> # There is insufficient memory for the Java Runtime Environment to
> continue.
>
> # Native memory allocation (mmap) failed to map 16384 bytes for committing
> reserved memory.
>
> # An error report file with more information is saved as:
>
> # /home/ubuntu/PycharmProjects//tests/hs_err_pid17755.log
>
> [thread 17966 also had an error]
>
> OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x7f4b7bd81000, 262144, 0) failed; error='Not enough
> space' (err

AW: possible bug

2021-04-08 Thread Weiand, Markus, NMA-CFD
This is the reduction of an error in a complex program where I allocated 100 GB of 
driver (=worker=executor, as local mode) memory. In the example I used the 
default size, as the puny example shouldn't need more anyway.
And without the coalesce, or with coalesce(1,True), everything works fine.
I'm trying to coalesce an empty rdd with 7 partitions into an empty rdd with 
1 partition; why is this a problem without shuffling?

From: Sean Owen 
Sent: Thursday, 8 April 2021 15:00
To: Weiand, Markus, NMA-CFD 
Cc: user@spark.apache.org
Subject: Re: possible bug

That's a very low level error from the JVM. Any chance you are misconfiguring 
the executor size? like to 10MB instead of 10GB, that kind of thing. Trying to 
think of why the JVM would have very little memory to operate.
An app running out of mem would not look like this.

On Thu, Apr 8, 2021 at 7:53 AM Weiand, Markus, NMA-CFD 
mailto:markus.wei...@bertelsmann.de>> wrote:
Hi all,

I'm using spark on a c5a.16xlarge machine in amazon cloud (so having  64 cores 
and 128 GB RAM). I'm using Spark 3.0.1.

The following python code leads to an exception, is this a bug or is my 
understanding of the API incorrect?

import pyspark
conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
sc=pyspark.SparkContext.getOrCreate(conf)
rows=7
data=list(range(rows))
rdd=sc.parallelize(data,rows)
assert rdd.getNumPartitions()==rows
rdd0=rdd.filter(lambda x:False)
assert rdd0.getNumPartitions()==rows
rdd00=rdd0.coalesce(1)
data=rdd00.collect()
assert data==[]

output when starting from PyCharm:

/home/ubuntu/PycharmProjects//venv/bin/python 
/opt/pycharm-2020.2.3/plugins/python/helpers/pydev/pydevconsole.py 
--mode=client --port=41185
import sys; print('Python %s on %s' % (sys.version, sys.platform))
sys.path.extend(['/home/ubuntu/PycharmProjects/'])
PyDev console: starting.
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
import os
os.environ['PYTHONHASHSEED'] = '0'
runfile('/home/ubuntu/PycharmProjects//tests/test.py', 
wdir='/home/ubuntu/PycharmProjects//tests')
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor 
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/04/08 12:12:26 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
21/04/08 12:12:29 WARN TaskSetManager: Stage 0 contains a task of very large 
size (4732 KiB). The maximum recommended task size is 1000 KiB.
[Stage 0:>  (0 + 1) / 
1][423.190s][warning][os,thread] Attempt to protect stack guard pages failed 
(0x7f43d23ff000-0x7f43d2403000).
[423.190s][warning][os,thread] Attempt to deallocate stack guard pages failed.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x7f43d300b000, 
16384, 0) failed; error='Not enough space' (errno=12)
[423.231s][warning][os,thread] Failed to start thread - pthread_create failed 
(EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16384 bytes for committing 
reserved memory.
# An error report file with more information is saved as:
# /home/ubuntu/PycharmProjects//tests/hs_err_pid17755.log
[thread 17966 also had an error]
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x7f4b7bd81000, 
262144, 0) failed; error='Not enough space' (errno=12)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1207, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1033, in send_command
response = connection.send_command(command)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1211, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
server 
(127.0.0.1:42439)

Re: possible bug

2021-04-08 Thread Sean Owen
That's a very low level error from the JVM. Any chance you are
misconfiguring the executor size? like to 10MB instead of 10GB, that kind
of thing. Trying to think of why the JVM would have very little memory to
operate.
An app running out of mem would not look like this.
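
For illustration, a quick way to check which memory settings the running context actually picked up (a misconfigured value such as "10m" instead of "10g" would show up here):

```python
import pyspark

sc = pyspark.SparkContext.getOrCreate(
    pyspark.SparkConf().setMaster("local[4]").setAppName("conf-check")
)

# Print every explicitly set memory-related property; unset ones fall back to
# Spark's defaults and will not appear in getAll().
for key, value in sorted(sc.getConf().getAll()):
    if "memory" in key:
        print(key, "=", value)
```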

On Thu, Apr 8, 2021 at 7:53 AM Weiand, Markus, NMA-CFD <
markus.wei...@bertelsmann.de> wrote:

> Hi all,
>
>
>
> I'm using spark on a c5a.16xlarge machine in amazon cloud (so having  64
> cores and 128 GB RAM). I'm using Spark 3.0.1.
>
>
>
> The following python code leads to an exception, is this a bug or is my
> understanding of the API incorrect?
>
>
>
> import pyspark
>
> conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
>
> sc=pyspark.SparkContext.getOrCreate(conf)
>
> rows=7
>
> data=list(range(rows))
>
> rdd=sc.parallelize(data,rows)
>
> assert rdd.getNumPartitions()==rows
>
> rdd0=rdd.filter(lambda x:False)
>
> assert rdd0.getNumPartitions()==rows
>
> rdd00=rdd0.coalesce(1)
>
> data=rdd00.collect()
>
> assert data==[]
>
>
>
> output when starting from PyCharm:
>
>
>
> /home/ubuntu/PycharmProjects//venv/bin/python
> /opt/pycharm-2020.2.3/plugins/python/helpers/pydev/pydevconsole.py
> --mode=client --port=41185
>
> import sys; print('Python %s on %s' % (sys.version, sys.platform))
>
> sys.path.extend(['/home/ubuntu/PycharmProjects/'])
>
> PyDev console: starting.
>
> Python 3.8.5 (default, Jan 27 2021, 15:41:15)
>
> [GCC 9.3.0] on linux
>
> import os
>
> os.environ['PYTHONHASHSEED'] = '0'
>
> runfile('/home/ubuntu/PycharmProjects//tests/test.py',
> wdir='/home/ubuntu/PycharmProjects//tests')
>
> WARNING: An illegal reflective access operation has occurred
>
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
> (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor
> java.nio.DirectByteBuffer(long,int)
>
> WARNING: Please consider reporting this to the maintainers of
> org.apache.spark.unsafe.Platform
>
> WARNING: Use --illegal-access=warn to enable warnings of further illegal
> reflective access operations
>
> WARNING: All illegal access operations will be denied in a future release
>
> 21/04/08 12:12:26 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
>
> 21/04/08 12:12:29 WARN TaskSetManager: Stage 0 contains a task of very
> large size (4732 KiB). The maximum recommended task size is 1000 KiB.
>
> [Stage 0:>  (0 +
> 1) / 1][423.190s][warning][os,thread] Attempt to protect stack guard pages
> failed (0x7f43d23ff000-0x7f43d2403000).
>
> [423.190s][warning][os,thread] Attempt to deallocate stack guard pages
> failed.
>
> OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x7f43d300b000, 16384, 0) failed; error='Not enough
> space' (errno=12)
>
> [423.231s][warning][os,thread] Failed to start thread - pthread_create
> failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
>
> #
>
> # There is insufficient memory for the Java Runtime Environment to
> continue.
>
> # Native memory allocation (mmap) failed to map 16384 bytes for committing
> reserved memory.
>
> # An error report file with more information is saved as:
>
> # /home/ubuntu/PycharmProjects//tests/hs_err_pid17755.log
>
> [thread 17966 also had an error]
>
> OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x7f4b7bd81000, 262144, 0) failed; error='Not enough
> space' (errno=12)
>
> ERROR:root:Exception while sending command.
>
> Traceback (most recent call last):
>
>   File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
> line 1207, in send_command
>
> raise Py4JNetworkError("Answer from Java side is empty")
>
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>
>   File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
> line 1033, in send_command
>
> response = connection.send_command(command)
>
>   File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
> line 1211, in send_command
>
> raise Py4JNetworkError(
>
> py4j.protocol.Py4JNetworkError: Error while receiving
>
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the
> Java server (127.0.0.1:42439)
>
> T

possible bug

2021-04-08 Thread Weiand, Markus, NMA-CFD
Hi all,

I'm using spark on a c5a.16xlarge machine in amazon cloud (so having  64 cores 
and 128 GB RAM). I'm using Spark 3.0.1.

The following python code leads to an exception, is this a bug or is my 
understanding of the API incorrect?

import pyspark
conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
sc=pyspark.SparkContext.getOrCreate(conf)
rows=7
data=list(range(rows))
rdd=sc.parallelize(data,rows)
assert rdd.getNumPartitions()==rows
rdd0=rdd.filter(lambda x:False)
assert rdd0.getNumPartitions()==rows
rdd00=rdd0.coalesce(1)
data=rdd00.collect()
assert data==[]

output when starting from PyCharm:

/home/ubuntu/PycharmProjects//venv/bin/python 
/opt/pycharm-2020.2.3/plugins/python/helpers/pydev/pydevconsole.py 
--mode=client --port=41185
import sys; print('Python %s on %s' % (sys.version, sys.platform))
sys.path.extend(['/home/ubuntu/PycharmProjects/'])
PyDev console: starting.
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
import os
os.environ['PYTHONHASHSEED'] = '0'
runfile('/home/ubuntu/PycharmProjects//tests/test.py', 
wdir='/home/ubuntu/PycharmProjects//tests')
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor 
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/04/08 12:12:26 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
21/04/08 12:12:29 WARN TaskSetManager: Stage 0 contains a task of very large 
size (4732 KiB). The maximum recommended task size is 1000 KiB.
[Stage 0:>  (0 + 1) / 
1][423.190s][warning][os,thread] Attempt to protect stack guard pages failed 
(0x7f43d23ff000-0x7f43d2403000).
[423.190s][warning][os,thread] Attempt to deallocate stack guard pages failed.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x7f43d300b000, 
16384, 0) failed; error='Not enough space' (errno=12)
[423.231s][warning][os,thread] Failed to start thread - pthread_create failed 
(EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16384 bytes for committing 
reserved memory.
# An error report file with more information is saved as:
# /home/ubuntu/PycharmProjects//tests/hs_err_pid17755.log
[thread 17966 also had an error]
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x7f4b7bd81000, 
262144, 0) failed; error='Not enough space' (errno=12)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1207, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1033, in send_command
response = connection.send_command(command)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1211, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
server (127.0.0.1:42439)
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/rdd.py", line 889, in collect
sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1304, in __call__
return_value = get_return_value(
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 334, 
in get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
977, in _get_connection
connection = self.deque.pop()
IndexError: pop from an empty deque
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1115, in start
self.socket.connect((self.address, s

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Oldrich Vlasic
That certainly is a solution if you know about the issue and we've used it in 
the end.

I'm trying to find out if there is a solution that would prevent users who 
don't know about it from accidentally corrupting data. Something like "enable 
strict schema matching".

From: Jeff Evans 
Sent: Thursday, March 4, 2021 2:55 PM
To: Oldrich Vlasic 
Cc: Russell Spitzer ; Sean Owen ; 
user ; Ondřej Havlíček 
Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of 
insertInto

Why not perform a df.select(...) before the final write to ensure a consistent 
ordering.

On Thu, Mar 4, 2021, 7:39 AM Oldrich Vlasic 
mailto:oldrich.vla...@datasentics.com>> wrote:
Thanks for reply! Is there something to be done, setting a config property for 
example? I'd like to prevent users (mainly data scientists) from falling victim 
to this.

From: Russell Spitzer 
mailto:russell.spit...@gmail.com>>
Sent: Wednesday, March 3, 2021 3:31 PM
To: Sean Owen mailto:sro...@gmail.com>>
Cc: Oldrich Vlasic 
mailto:oldrich.vla...@datasentics.com>>; user 
mailto:user@spark.apache.org>>; Ondřej Havlíček 
mailto:ondrej.havli...@datasentics.com>>
Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of 
insertInto

Yep this is the behavior for Insert Into, using the other write apis does 
schema matching I believe.

On Mar 3, 2021, at 8:29 AM, Sean Owen 
mailto:sro...@gmail.com>> wrote:

I don't have any good answer here, but, I seem to recall that this is because 
of SQL semantics, which follows column ordering not naming when performing 
operations like this. It may well be as intended.

On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic 
mailto:oldrich.vla...@datasentics.com>> wrote:
Hi,

I have encountered a weird and potentially dangerous behaviour of Spark 
concerning
partial overwrites of partitioned data. Not sure if this is a bug or just 
abstraction
leak. I have checked Spark section of Stack Overflow and haven't found any 
relevant
questions or answers.

Full minimal working example provided as attachment. Tested on Databricks 
runtime 7.3 LTS
ML (Spark 3.0.1). Short summary:

Write dataframe using partitioning by a column using saveAsTable. Filter out 
part of the
dataframe, change some values (simulates new increment of data) and write again,
overwriting a subset of partitions using insertInto. This operation will either 
fail on
schema mismatch or cause data corruption.

Reason: on the first write, the ordering of the columns is changed (partition 
column is
placed at the end). On the second write this is not taken into consideration 
and Spark
tries to insert values into the columns based on their order and not on their 
name. If
they have different types this will fail. If not, values will be written to 
incorrect
columns causing data corruption.

My question: is this a bug or intended behaviour? Can something be done about 
it to prevent
it? This issue can be avoided by doing a select with schema loaded from the 
target table.
However, when user is not aware this could cause hard to track down errors in 
data.

Best regards,
Oldřich Vlašic

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>



Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Jeff Evans
Why not perform a df.select(...) before the final write to ensure a
consistent ordering.
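
For illustration, a minimal sketch of that approach with the (hypothetical) names from the attached notebook, assuming a running SparkSession: read the target table's column order and select in that order, so insertInto's positional matching lines up with the column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table_name = "insert_into_mve_301"       # table created in the attached notebook
df_increment = spark.table(table_name)   # stand-in for the notebook's increment DataFrame

# insertInto matches columns by position, so reorder the DataFrame's columns
# to the target table's order before writing.
target_columns = spark.table(table_name).columns

(
    df_increment
    .select(*target_columns)
    .write
    .mode("overwrite")
    .insertInto(table_name)
)
```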

On Thu, Mar 4, 2021, 7:39 AM Oldrich Vlasic 
wrote:

> Thanks for reply! Is there something to be done, setting a config property
> for example? I'd like to prevent users (mainly data scientists) from
> falling victim to this.
> --
> *From:* Russell Spitzer 
> *Sent:* Wednesday, March 3, 2021 3:31 PM
> *To:* Sean Owen 
> *Cc:* Oldrich Vlasic ; user <
> user@spark.apache.org>; Ondřej Havlíček 
> *Subject:* Re: [Spark SQL, intermediate+] possible bug or weird behavior
> of insertInto
>
> Yep this is the behavior for Insert Into, using the other write apis does
> schema matching I believe.
>
> On Mar 3, 2021, at 8:29 AM, Sean Owen  wrote:
>
> I don't have any good answer here, but, I seem to recall that this is
> because of SQL semantics, which follows column ordering not naming when
> performing operations like this. It may well be as intended.
>
> On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic <
> oldrich.vla...@datasentics.com> wrote:
>
> Hi,
>
> I have encountered a weird and potentially dangerous behaviour of Spark
> concerning
> partial overwrites of partitioned data. Not sure if this is a bug or just
> abstraction
> leak. I have checked Spark section of Stack Overflow and haven't found any
> relevant
> questions or answers.
>
> Full minimal working example provided as attachment. Tested on Databricks
> runtime 7.3 LTS
> ML (Spark 3.0.1). Short summary:
>
> Write dataframe using partitioning by a column using saveAsTable. Filter
> out part of the
> dataframe, change some values (simulates new increment of data) and write
> again,
> overwriting a subset of partitions using insertInto. This operation will
> either fail on
> schema mismatch or cause data corruption.
>
> Reason: on the first write, the ordering of the columns is changed
> (partition column is
> placed at the end). On the second write this is not taken into
> consideration and Spark
> tries to insert values into the columns based on their order and not on
> their name. If
> they have different types this will fail. If not, values will be written
> to incorrect
> columns causing data corruption.
>
> My question: is this a bug or intended behaviour? Can something be done
> about it to prevent
> it? This issue can be avoided by doing a select with schema loaded from
> the target table.
> However, when user is not aware this could cause hard to track down errors
> in data.
>
> Best regards,
> Oldřich Vlašic
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Oldrich Vlasic
Thanks for reply! Is there something to be done, setting a config property for 
example? I'd like to prevent users (mainly data scientists) from falling victim 
to this.

From: Russell Spitzer 
Sent: Wednesday, March 3, 2021 3:31 PM
To: Sean Owen 
Cc: Oldrich Vlasic ; user 
; Ondřej Havlíček 
Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of 
insertInto

Yep this is the behavior for Insert Into, using the other write apis does 
schema matching I believe.

On Mar 3, 2021, at 8:29 AM, Sean Owen 
mailto:sro...@gmail.com>> wrote:

I don't have any good answer here, but, I seem to recall that this is because 
of SQL semantics, which follows column ordering not naming when performing 
operations like this. It may well be as intended.

On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic 
mailto:oldrich.vla...@datasentics.com>> wrote:
Hi,

I have encountered a weird and potentially dangerous behaviour of Spark 
concerning
partial overwrites of partitioned data. Not sure if this is a bug or just 
abstraction
leak. I have checked Spark section of Stack Overflow and haven't found any 
relevant
questions or answers.

Full minimal working example provided as attachment. Tested on Databricks 
runtime 7.3 LTS
ML (Spark 3.0.1). Short summary:

Write dataframe using partitioning by a column using saveAsTable. Filter out 
part of the
dataframe, change some values (simulates new increment of data) and write again,
overwriting a subset of partitions using insertInto. This operation will either 
fail on
schema mismatch or cause data corruption.

Reason: on the first write, the ordering of the columns is changed (partition 
column is
placed at the end). On the second write this is not taken into consideration 
and Spark
tries to insert values into the columns based on their order and not on their 
name. If
they have different types this will fail. If not, values will be written to 
incorrect
columns causing data corruption.

My question: is this a bug or intended behaviour? Can something be done about 
it to prevent
it? This issue can be avoided by doing a select with schema loaded from the 
target table.
However, when user is not aware this could cause hard to track down errors in 
data.

Best regards,
Oldřich Vlašic

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>



Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-03 Thread Russell Spitzer
Yep this is the behavior for Insert Into, using the other write apis does 
schema matching I believe.

> On Mar 3, 2021, at 8:29 AM, Sean Owen  wrote:
> 
> I don't have any good answer here, but, I seem to recall that this is because 
> of SQL semantics, which follows column ordering not naming when performing 
> operations like this. It may well be as intended.
> 
> On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic  <mailto:oldrich.vla...@datasentics.com>> wrote:
> Hi,
> 
> I have encountered a weird and potentially dangerous behaviour of Spark 
> concerning
> partial overwrites of partitioned data. Not sure if this is a bug or just 
> abstraction
> leak. I have checked Spark section of Stack Overflow and haven't found any 
> relevant
> questions or answers.
> 
> Full minimal working example provided as attachment. Tested on Databricks 
> runtime 7.3 LTS
> ML (Spark 3.0.1). Short summary:
> 
> Write dataframe using partitioning by a column using saveAsTable. Filter out 
> part of the
> dataframe, change some values (simulates new increment of data) and write 
> again,
> overwriting a subset of partitions using insertInto. This operation will 
> either fail on
> schema mismatch or cause data corruption.
> 
> Reason: on the first write, the ordering of the columns is changed (partition 
> column is
> placed at the end). On the second write this is not taken into consideration 
> and Spark
> tries to insert values into the columns based on their order and not on their 
> name. If
> they have different types this will fail. If not, values will be written to 
> incorrect
> columns causing data corruption.
> 
> My question: is this a bug or intended behaviour? Can something be done about 
> it to prevent
> it? This issue can be avoided by doing a select with schema loaded from the 
> target table.
> However, when user is not aware this could cause hard to track down errors in 
> data.
> 
> Best regards,
> Oldřich Vlašic
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>


Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-03 Thread Sean Owen
I don't have any good answer here, but, I seem to recall that this is
because of SQL semantics, which follows column ordering not naming when
performing operations like this. It may well be as intended.

On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic <
oldrich.vla...@datasentics.com> wrote:

> Hi,
>
> I have encountered a weird and potentially dangerous behaviour of Spark
> concerning
> partial overwrites of partitioned data. Not sure if this is a bug or just
> abstraction
> leak. I have checked Spark section of Stack Overflow and haven't found any
> relevant
> questions or answers.
>
> Full minimal working example provided as attachment. Tested on Databricks
> runtime 7.3 LTS
> ML (Spark 3.0.1). Short summary:
>
> Write dataframe using partitioning by a column using saveAsTable. Filter
> out part of the
> dataframe, change some values (simulates new increment of data) and write
> again,
> overwriting a subset of partitions using insertInto. This operation will
> either fail on
> schema mismatch or cause data corruption.
>
> Reason: on the first write, the ordering of the columns is changed
> (partition column is
> placed at the end). On the second write this is not taken into
> consideration and Spark
> tries to insert values into the columns based on their order and not on
> their name. If
> they have different types this will fail. If not, values will be written
> to incorrect
> columns causing data corruption.
>
> My question: is this a bug or intended behaviour? Can something be done
> about it to prevent
> it? This issue can be avoided by doing a select with schema loaded from
> the target table.
> However, when user is not aware this could cause hard to track down errors
> in data.
>
> Best regards,
> Oldřich Vlašic
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


[Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-02 Thread Oldrich Vlasic
Hi,

I have encountered a weird and potentially dangerous behaviour of Spark 
concerning
partial overwrites of partitioned data. Not sure if this is a bug or just 
abstraction
leak. I have checked Spark section of Stack Overflow and haven't found any 
relevant
questions or answers.

Full minimal working example provided as attachment. Tested on Databricks 
runtime 7.3 LTS
ML (Spark 3.0.1). Short summary:

Write dataframe using partitioning by a column using saveAsTable. Filter out 
part of the
dataframe, change some values (simulates new increment of data) and write again,
overwriting a subset of partitions using insertInto. This operation will either 
fail on
schema mismatch or cause data corruption.

Reason: on the first write, the ordering of the columns is changed (partition 
column is
placed at the end). On the second write this is not taken into consideration 
and Spark
tries to insert values into the columns based on their order and not on their 
name. If
they have different types this will fail. If not, values will be written to 
incorrect
columns causing data corruption.

My question: is this a bug or intended behaviour? Can something be done about 
it to prevent
it? This issue can be avoided by doing a select with schema loaded from the 
target table.
However, when user is not aware this could cause hard to track down errors in 
data.

Best regards,
Oldřich Vlašic
# Databricks notebook source
import pyspark.sql.functions as F

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
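# "dynamic" overwrites only the partitions present in the written data,
# instead of the whole table (the default mode is "static")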

# COMMAND --

print(spark.version)
# 3.0.1

# COMMAND --

table_name = "insert_into_mve_301"
spark.sql("DROP TABLE IF EXISTS insert_into_mve_301")

# COMMAND --

df_data = (
    spark.range(10_000)
    .withColumnRenamed("id", "x")
    .crossJoin(
        spark.range(5).withColumnRenamed("id", "y")
    )
    .withColumn("z", (F.rand() > 0.5).cast("integer"))
    .repartitionByRange(5, "y")
)

# COMMAND --

print(df_data.filter(F.col("x") < 5).show())

"""
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  0|  0|  0|
|  1|  0|  0|
|  2|  0|  1|
|  3|  0|  1|
|  4|  0|  0|
|  0|  1|  1|
|  1|  1|  1|
|  2|  1|  1|
|  3|  1|  0|
|  4|  1|  1|
|  0|  2|  1|
|  1|  2|  0|
|  2|  2|  0|
|  3|  2|  0|
|  4|  2|  1|
|  0|  3|  0|
|  1|  3|  1|
|  2|  3|  0|
|  3|  3|  0|
|  4|  3|  0|
+---+---+---+
only showing top 20 rows
"""

# COMMAND --

(
    df_data
    .write
    .mode("overwrite")
    .partitionBy("y")
    .format("parquet")
    # .option("path", "dbfs:/ov_test/foo_01")
    .saveAsTable(table_name)
)

# COMMAND --

df_increment = (
    df_data.filter(F.col("y") < 2)  # rewrite only some partitions
    .withColumn("z", F.lit(42))  # change value so that we can see it
)

print(df_increment.filter(F.col("x") < 5).show())

"""
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  0|  0| 42|
|  1|  0| 42|
|  2|  0| 42|
|  3|  0| 42|
|  4|  0| 42|
|  0|  1| 42|
|  1|  1| 42|
|  2|  1| 42|
|  3|  1| 42|
|  4|  1| 42|
+---+---+---+
"""

# COMMAND --

(
    df_increment
    .write
    .mode("overwrite")
    .insertInto(table_name)
)

# COMMAND --

print(
    spark
    .table(table_name)
    .filter(F.col("y") == 42)  # note that we inserted value 42 to column "z"
    .limit(10)
    .show()
)

"""
+---+---+---+
|  x|  z|  y|
+---+---+---+
|  0|  1| 42|
|  1|  1| 42|
|  2|  1| 42|
|  3|  1| 42|
|  4|  1| 42|
|  5|  1| 42|
|  6|  1| 42|
|  7|  1| 42|
|  8|  1| 42|
|  9|  1| 42|
+---+---+---+
"""

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: A serious bug in the fitting of a binary logistic regression.

2021-02-22 Thread Sean Owen
I'll take a look. At a glance: is it converging? You might turn down the
tolerance to check.
Also, what does scikit-learn say on the same data? We can continue on the
JIRA.
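
For illustration, a hedged sketch of both checks on placeholder data (not the data from the JIRA gist): tighten Spark ML's iteration budget and tolerance, then fit the same data with scikit-learn. Note that the defaults differ: Spark's LogisticRegression is unregularized by default, while scikit-learn applies L2 with C=1.0, so a comparison should align those settings.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Small synthetic binary-classification set standing in for the gist's data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=200) > 0).astype(int)

rows = [(float(a), float(b), float(c), int(label)) for (a, b, c), label in zip(X, y)]
df = spark.createDataFrame(rows, ["f1", "f2", "f3", "label"])
train_df = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features").transform(df)

lr = LogisticRegression(
    featuresCol="features", labelCol="label",
    maxIter=1000,   # default is 100; a low cap can stop before convergence
    tol=1e-8,       # default is 1e-6; tighten to see whether coefficients keep moving
    regParam=0.0,   # Spark's default: unregularized
)
model = lr.fit(train_df)
print(model.summary.totalIterations, model.coefficients)

# Same data through scikit-learn; a large C approximates an unregularized fit,
# since sklearn applies L2 regularization with C=1.0 by default.
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
sk = SkLogisticRegression(C=1e12, tol=1e-8, max_iter=1000).fit(X, y)
print(sk.coef_)
```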

On Mon, Feb 22, 2021 at 5:42 PM Yakov Kerzhner  wrote:

> I have written up a JIRA, and there is a gist attached that has code that
> reproduces the issue.  This is a fairly serious issue as it probably
> affects everyone who uses spark to fit binary logistic regressions.
> https://issues.apache.org/jira/browse/SPARK-34448
> It would be great if someone who understands binary logistic regression and
> the implementation in Scala could take a look.
>


A serious bug in the fitting of a binary logistic regression.

2021-02-22 Thread Yakov Kerzhner
I have written up a JIRA, and there is a gist attached that has code that 
reproduces the issue.  This is a fairly serious issue as it probably affects 
everyone who uses spark to fit binary logistic regressions.
https://issues.apache.org/jira/browse/SPARK-34448
It would be great if someone who understands binary logistic regression and the 
implementation in Scala could take a look.


Re: Correctness bug on Shuffle+Repartition scenario

2021-01-18 Thread 王长春
Hi Shiao-An Yuan
I have also hit this correctness problem in my production environment.
My Spark version is 2.3.1, so I originally thought it was caused by SPARK-23243.
But you said you also see this problem in your environment, and your version is
2.4.4, which already contains the fix for SPARK-23243, so maybe this problem is
not caused by SPARK-23243.

As you said, if it is caused by 'first' before 'repartition', how can this problem
be solved fundamentally? And is there any workaround?
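
(For illustration, the workaround mentioned elsewhere in this thread, as a minimal PySpark sketch with hypothetical names: replace the round-robin repartition(n) with a hash repartition on the key column, so that a retried stage sends every row to the same partition deterministically.)

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the snapshot DataFrame discussed in this thread.
new_snapshot = spark.range(1000).select(
    F.col("id").cast("string").alias("pkey"),
    F.rand().alias("a"),
)

# Round-robin repartition(n) relies on sorting rows to stay reproducible across
# stage retries (SPARK-23207); if upstream values are non-deterministic (e.g.
# produced by `first`), a retried stage can shuffle rows differently.
# Hash-partitioning by the key column is deterministic per row:
(
    new_snapshot
    .repartition(60, "pkey")
    .write
    .mode("overwrite")
    .parquet("/tmp/new_snapshot_sketch")
)
```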


> On 18 Jan 2021, at 10:35, Shiao-An Yuan  wrote:
> 
> Hi, 
> I am using Spark 2.4.4 standalone mode.
> 
> On Mon, Jan 18, 2021 at 4:26 AM Sean Owen  <mailto:sro...@gmail.com>> wrote:
> Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using?
> 
> On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan  <mailto:shiao.an.y...@gmail.com>> wrote:
> Hi folks,
> 
> I finally found the root cause of this issue.
> It can be easily reproduced by the following code.
> We ran it on a standalone mode 4 cores * 4 instances (total 16 cores) 
> environment.
> 
> ```
> import org.apache.spark.TaskContext
> import scala.sys.process._
> import org.apache.spark.sql.functions._
> import com.google.common.hash.Hashing
> val murmur3 = Hashing.murmur3_32()
> 
> // create a Dataset with the cardinality of the second element equals 5.
> val ds = spark.range(0, 10, 1, 130).map(i => 
> (murmur3.hashLong(i).asInt(), i/2))
> 
> ds.groupByKey(_._2)
>   .agg(first($"_1").as[Long])
>   .repartition(200)
>   .map { x =>
> if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId == 
> 100 && TaskContext.get.stageAttemptNumber == 0) {
>   throw new Exception("pkill -f CoarseGrainedExecutorBackend".!!)
> }
> x
>   }
>   .map(_._2).distinct().count()   // the correct result is 5, but we 
> always got fewer number
> ```
> 
> The problem here is SPARK-23207 use sorting to make RoundRobinPartitioning 
> always generate the same distribution,
> but the UDAF `first` may return non-deterministic results and caused the 
> sorting result non-deterministic.
> Therefore, the first stage and the retry stage might have different 
> distribution and cause duplications and loss.
> 
> Thanks,
> Shiao-An Yuan
> 
> On Tue, Dec 29, 2020 at 10:00 PM Shiao-An Yuan  <mailto:shiao.an.y...@gmail.com>> wrote:
> Hi folks,
> 
> We recently identified a data correctness issue in our pipeline.
> 
> The data processing flow is as follows:
> 1. read the current snapshot (provide empty if it doesn't exist yet)
> 2. read unprocessed new data
> 3. union them and do a `reduceByKey` operation
> 4. output a new version of the snapshot
> 5. repeat step 1~4
> 
> The simplified version of code:
> ```
> // schema
> case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)
> 
> // function for reduce
> def merge(left: Log, right: Log): Log = {
>   Log(pkey = left.pkey
>   a= if (left.a!=null) left.a else right.a,
>   b= if (left.a!=null) left.b else right.b,
>   ...
>   )
> }
> 
> // a very large parquet file (>10G, 200 partitions)
> val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]   
> 
> // multiple small parquet files
> val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]
> 
> val newSnapshot = currentSnapshot.union(newAddedLog)
>   .groupByKey(new String(pkey))  // generate key
>   .reduceGroups(_.merge(_))// 
> spark.sql.shuffle.partitions=200
>   .map(_._2) // drop key
> 
> newSnapshot
>   .repartition(60)  // (1)
>   .write.parquet(newPath)
> ```
> 
> The issue we have is that some data were duplicated or lost, and the amount of
> duplicated and loss data are similar.
> 
> We also noticed that this situation only happens if some instances got
> preempted. Spark will retry the stage, so some of the partitioned files are
> generated at the 1st time, and other files are generated at the 2nd(retry) 
> time.
> Moreover, those duplicated logs will be duplicated exactly twice and located 
> in
> both batches (one in the first batch; and one in the second batch).
> 
> The input/output files are parquet on GCS. The Spark version is 2.4.4 with
> standalone deployment. Workers running on GCP preemptible instances and they
> being preempted very frequently.
> 
> The pipeline is running in a single long-running process with multi-threads,
> each snapshot represent an "hour" of data, and we do the "read-reduce-write" 
> operations
> on multiple snapshots(hours) simultaneously. We pretty sure th

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
Hi,
I am using Spark 2.4.4 standalone mode.

On Mon, Jan 18, 2021 at 4:26 AM Sean Owen  wrote:

> Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using?
>
> On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan 
> wrote:
>
>> Hi folks,
>>
>> I finally found the root cause of this issue.
>> It can be easily reproduced by the following code.
>> We ran it on a standalone mode 4 cores * 4 instances (total 16 cores)
>> environment.
>>
>> ```
>> import org.apache.spark.TaskContext
>> import scala.sys.process._
>> import org.apache.spark.sql.functions._
>> import com.google.common.hash.Hashing
>> val murmur3 = Hashing.murmur3_32()
>>
>> // create a Dataset with the cardinality of the second element equals
>> 5.
>> val ds = spark.range(0, 10, 1, 130).map(i =>
>> (murmur3.hashLong(i).asInt(), i/2))
>>
>> ds.groupByKey(_._2)
>>   .agg(first($"_1").as[Long])
>>   .repartition(200)
>>   .map { x =>
>> if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId
>> == 100 && TaskContext.get.stageAttemptNumber == 0) {
>>   throw new Exception("pkill -f CoarseGrainedExecutorBackend".!!)
>> }
>> x
>>   }
>>   .map(_._2).distinct().count()   // the correct result is 5, but we
>> always got fewer number
>> ```
>>
>> The problem here is SPARK-23207 use sorting to make
>> RoundRobinPartitioning always generate the same distribution,
>> but the UDAF `first` may return non-deterministic results and caused the
>> sorting result non-deterministic.
>> Therefore, the first stage and the retry stage might have different
>> distribution and cause duplications and loss.
>>
>> Thanks,
>> Shiao-An Yuan
>>
>> On Tue, Dec 29, 2020 at 10:00 PM Shiao-An Yuan 
>> wrote:
>>
>>> Hi folks,
>>>
>>> We recently identified a data correctness issue in our pipeline.
>>>
>>> The data processing flow is as follows:
>>> 1. read the current snapshot (provide empty if it doesn't exist yet)
>>> 2. read unprocessed new data
>>> 3. union them and do a `reduceByKey` operation
>>> 4. output a new version of the snapshot
>>> 5. repeat step 1~4
>>>
>>> The simplified version of code:
>>> ```
>>> // schema
>>> case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)
>>>
>>> // function for reduce
>>> def merge(left: Log, right: Log): Log = {
>>>   Log(pkey = left.pkey
>>>   a= if (left.a!=null) left.a else right.a,
>>>   b= if (left.a!=null) left.b else right.b,
>>>   ...
>>>   )
>>> }
>>>
>>> // a very large parquet file (>10G, 200 partitions)
>>> val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]
>>>
>>> // multiple small parquet files
>>> val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]
>>>
>>> val newSnapshot = currentSnapshot.union(newAddedLog)
>>>   .groupByKey(new String(pkey))  // generate key
>>>   .reduceGroups(_.merge(_))//
>>> spark.sql.shuffle.partitions=200
>>>   .map(_._2) // drop key
>>>
>>> newSnapshot
>>>   .repartition(60)  // (1)
>>>   .write.parquet(newPath)
>>> ```
>>>
>>> The issue we have is that some data were duplicated or lost, and the
>>> amount of
>>> duplicated and loss data are similar.
>>>
>>> We also noticed that this situation only happens if some instances got
>>> preempted. Spark will retry the stage, so some of the partitioned files
>>> are
>>> generated at the 1st time, and other files are generated at the
>>> 2nd(retry) time.
>>> Moreover, those duplicated logs will be duplicated exactly twice and
>>> located in
>>> both batches (one in the first batch; and one in the second batch).
>>>
>>> The input/output files are parquet on GCS. The Spark version is 2.4.4
>>> with
>>> standalone deployment. Workers running on GCP preemptible instances and
>>> they
>>> being preempted very frequently.
>>>
>>> The pipeline is running in a single long-running process with
>>> multi-threads,
>>> each snapshot represent an "hour" of data, and we do the
>>> "read-reduce-write" operations
>>> on multiple snapshots(hours) simultaneously. We pretty sure the same
>>> snapshot(hour) never process parallelly and the output path always
>>> generated with a timestamp, so those jobs shouldn't affect each other.
>>>
>>> After changing the line (1) to `coalesce` or `repartition(100, $"pkey")`
>>> the issue
>>> was gone, but I believe there is still a correctness bug that hasn't
>>> been reported yet.
>>>
>>> We have tried to reproduce this bug on a smaller scale but haven't
>>> succeeded yet. I
>>> have read SPARK-23207 and SPARK-28699, but couldn't found the bug.
>>> Since this case is DataSet, I believe it is unrelated to SPARK-24243.
>>>
>>> Can anyone give me some advice about the following tasks?
>>> Thanks in advance.
>>>
>>> Shiao-An Yuan
>>>
>>


Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Mich Talebzadeh
Hi Shiao-An,

With regard to your set-up below and I quote:

"The input/output files are parquet on GCS. The Spark version is 2.4.4
with standalone deployment. Workers running on GCP preemptible instances
and they being preempted very frequently."

Am I correct that you have foregone deploying Dataproc clusters on GCP in
favour of selecting some VM boxes and installing your own Spark cluster
running in standalone mode (presumably to save costs)? What is the
rationale behind this choice? According to the GCP documentation
<https://cloud.google.com/compute/docs/instances/preemptible>, "Compute
Engine might stop (preempt) these instances if it requires access to those
resources for other tasks. Preemptible instances are excess Compute Engine
capacity, so their availability varies with usage." So what causes some VM
instances to be preempted? I have not touched standalone mode for a couple of
years myself. So your ETL process reads the raw snapshots, does some joins
and creates new hourly processed snapshots. There seems to be no
intermediate stage to verify the sanity of the data (data lineage).
Personally I would deploy a database to do this ETL. That would give you an
option to look at your data more easily and store everything in a staging area
before the final push to the analytics layer.

HTH

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 29 Dec 2020 at 14:01, Shiao-An Yuan  wrote:

> Hi folks,
>
> We recently identified a data correctness issue in our pipeline.
>
> The data processing flow is as follows:
> 1. read the current snapshot (provide empty if it doesn't exist yet)
> 2. read unprocessed new data
> 3. union them and do a `reduceByKey` operation
> 4. output a new version of the snapshot
> 5. repeat step 1~4
>
> The simplified version of code:
> ```
> // schema
> case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)
>
> // function for reduce
> def merge(left: Log, right: Log): Log = {
>   Log(pkey = left.pkey
>   a= if (left.a!=null) left.a else right.a,
>   b= if (left.a!=null) left.b else right.b,
>   ...
>   )
> }
>
> // a very large parquet file (>10G, 200 partitions)
> val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]
>
> // multiple small parquet files
> val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]
>
> val newSnapshot = currentSnapshot.union(newAddedLog)
>   .groupByKey(new String(pkey))  // generate key
>   .reduceGroups(_.merge(_))//
> spark.sql.shuffle.partitions=200
>   .map(_._2) // drop key
>
> newSnapshot
>   .repartition(60)  // (1)
>   .write.parquet(newPath)
> ```
>
> The issue we have is that some data were duplicated or lost, and the
> amount of
> duplicated and loss data are similar.
>
> We also noticed that this situation only happens if some instances got
> preempted. Spark will retry the stage, so some of the partitioned files are
> generated at the 1st time, and other files are generated at the 2nd(retry)
> time.
> Moreover, those duplicated logs will be duplicated exactly twice and
> located in
> both batches (one in the first batch; and one in the second batch).
>
> The input/output files are parquet on GCS. The Spark version is 2.4.4 with
> standalone deployment. Workers running on GCP preemptible instances and
> they
> being preempted very frequently.
>
> The pipeline is running in a single long-running process with
> multi-threads,
> each snapshot represent an "hour" of data, and we do the
> "read-reduce-write" operations
> on multiple snapshots(hours) simultaneously. We pretty sure the same
> snapshot(hour) never process parallelly and the output path always
> generated with a timestamp, so those jobs shouldn't affect each other.
>
> After changing the line (1) to `coalesce` or `repartition(100, $"pkey")`
> the issue
> was gone, but I believe there is still a correctness bug that hasn't been
> reported yet.
>
> We have tried to reproduce this bug on a smaller scale but haven't
> succeeded yet. I have read SPARK-23207 and SPARK-28699, but couldn't find
> the bug.
> Since this case is DataSet, I believe it is unrelated to SPARK-24243.
>
> Can anyone give me some advice on this issue?
> Thanks in advance.
>
> Shiao-An Yuan
>


Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Gourav Sengupta
Hi,

I may be wrong, but this looks like a massively complicated solution for
what could have been a simple SQL.

It always seems okay to me to first reduce the complexity and then solve
it, rather than solve a problem which should not even exist in the first
instance.

Regards,
Gourav

On Sun, Jan 17, 2021 at 12:22 PM Shiao-An Yuan 
wrote:

> Hi folks,
>
> I finally found the root cause of this issue.
> It can be easily reproduced by the following code.
> We ran it on a standalone mode 4 cores * 4 instances (total 16 cores)
> environment.
>
> ```
> import org.apache.spark.TaskContext
> import scala.sys.process._
> import org.apache.spark.sql.functions._
> import com.google.common.hash.Hashing
> val murmur3 = Hashing.murmur3_32()
>
> // create a Dataset with the cardinality of the second element equals
> 5.
> val ds = spark.range(0, 10, 1, 130).map(i =>
> (murmur3.hashLong(i).asInt(), i/2))
>
> ds.groupByKey(_._2)
>   .agg(first($"_1").as[Long])
>   .repartition(200)
>   .map { x =>
> if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId
> == 100 && TaskContext.get.stageAttemptNumber == 0) {
>   throw new Exception("pkill -f CoarseGrainedExecutorBackend".!!)
> }
> x
>   }
>   .map(_._2).distinct().count()   // the correct result is 5, but we
> always got fewer number
> ```
>
> The problem here is SPARK-23207 use sorting to make RoundRobinPartitioning
> always generate the same distribution,
> but the UDAF `first` may return non-deterministic results and caused the
> sorting result non-deterministic.
> Therefore, the first stage and the retry stage might have different
> distribution and cause duplications and loss.
>
> Thanks,
> Shiao-An Yuan
>
> On Tue, Dec 29, 2020 at 10:00 PM Shiao-An Yuan 
> wrote:
>
>> Hi folks,
>>
>> We recently identified a data correctness issue in our pipeline.
>>
>> The data processing flow is as follows:
>> 1. read the current snapshot (provide empty if it doesn't exist yet)
>> 2. read unprocessed new data
>> 3. union them and do a `reduceByKey` operation
>> 4. output a new version of the snapshot
>> 5. repeat step 1~4
>>
>> The simplified version of code:
>> ```
>> // schema
>> case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)
>>
>> // function for reduce
>> def merge(left: Log, right: Log): Log = {
>>   Log(pkey = left.pkey
>>   a= if (left.a!=null) left.a else right.a,
>>   b= if (left.a!=null) left.b else right.b,
>>   ...
>>   )
>> }
>>
>> // a very large parquet file (>10G, 200 partitions)
>> val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]
>>
>> // multiple small parquet files
>> val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]
>>
>> val newSnapshot = currentSnapshot.union(newAddedLog)
>>   .groupByKey(new String(pkey))  // generate key
>>   .reduceGroups(_.merge(_))//
>> spark.sql.shuffle.partitions=200
>>   .map(_._2) // drop key
>>
>> newSnapshot
>>   .repartition(60)  // (1)
>>   .write.parquet(newPath)
>> ```
>>
>> The issue we have is that some data were duplicated or lost, and the
>> amount of
>> duplicated and loss data are similar.
>>
>> We also noticed that this situation only happens if some instances got
>> preempted. Spark will retry the stage, so some of the partitioned files
>> are
>> generated at the 1st time, and other files are generated at the
>> 2nd(retry) time.
>> Moreover, those duplicated logs will be duplicated exactly twice and
>> located in
>> both batches (one in the first batch; and one in the second batch).
>>
>> The input/output files are parquet on GCS. The Spark version is 2.4.4 with
>> standalone deployment. Workers running on GCP preemptible instances and
>> they
>> being preempted very frequently.
>>
>> The pipeline is running in a single long-running process with
>> multi-threads,
>> each snapshot represent an "hour" of data, and we do the
>> "read-reduce-write" operations
>> on multiple snapshots(hours) simultaneously. We pretty sure the same
>> snapshot(hour) never process parallelly and the output path always
>> generated with a timestamp, so those jobs shouldn't affect each other.
>>
>> After changing the line (1) to `coalesce` or `repartition(100, $"pkey")`
>> the issue
>> was gone, but I believe there is still a correctness bug that hasn't been
>> reported yet.
>>
>> We have tried to reproduce this bug on a smaller scale but haven't
>> succeeded yet. I
>> have read SPARK-23207 and SPARK-28699, but couldn't found the bug.
>> Since this case is DataSet, I believe it is unrelated to SPARK-24243.
>>
>> Can anyone give me some advice about the following tasks?
>> Thanks in advance.
>>
>> Shiao-An Yuan
>>
>


Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Sean Owen
Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using?

On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan 
wrote:

> Hi folks,
>
> I finally found the root cause of this issue.
> It can be easily reproduced by the following code.
> We ran it on a standalone mode 4 cores * 4 instances (total 16 cores)
> environment.
>
> ```
> import org.apache.spark.TaskContext
> import scala.sys.process._
> import org.apache.spark.sql.functions._
> import com.google.common.hash.Hashing
> val murmur3 = Hashing.murmur3_32()
>
> // create a Dataset with the cardinality of the second element equals
> 5.
> val ds = spark.range(0, 10, 1, 130).map(i =>
> (murmur3.hashLong(i).asInt(), i/2))
>
> ds.groupByKey(_._2)
>   .agg(first($"_1").as[Long])
>   .repartition(200)
>   .map { x =>
> if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId
> == 100 && TaskContext.get.stageAttemptNumber == 0) {
>   throw new Exception("pkill -f CoarseGrainedExecutorBackend".!!)
> }
> x
>   }
>   .map(_._2).distinct().count()   // the correct result is 5, but we
> always got fewer number
> ```
>
> The problem here is SPARK-23207 use sorting to make RoundRobinPartitioning
> always generate the same distribution,
> but the UDAF `first` may return non-deterministic results and caused the
> sorting result non-deterministic.
> Therefore, the first stage and the retry stage might have different
> distribution and cause duplications and loss.
>
> Thanks,
> Shiao-An Yuan
>
> On Tue, Dec 29, 2020 at 10:00 PM Shiao-An Yuan 
> wrote:
>
>> Hi folks,
>>
>> We recently identified a data correctness issue in our pipeline.
>>
>> The data processing flow is as follows:
>> 1. read the current snapshot (provide empty if it doesn't exist yet)
>> 2. read unprocessed new data
>> 3. union them and do a `reduceByKey` operation
>> 4. output a new version of the snapshot
>> 5. repeat step 1~4
>>
>> The simplified version of code:
>> ```
>> // schema
>> case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)
>>
>> // function for reduce
>> def merge(left: Log, right: Log): Log = {
>>   Log(pkey = left.pkey
>>   a= if (left.a!=null) left.a else right.a,
>>   b= if (left.a!=null) left.b else right.b,
>>   ...
>>   )
>> }
>>
>> // a very large parquet file (>10G, 200 partitions)
>> val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]
>>
>> // multiple small parquet files
>> val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]
>>
>> val newSnapshot = currentSnapshot.union(newAddedLog)
>>   .groupByKey(new String(pkey))  // generate key
>>   .reduceGroups(_.merge(_))//
>> spark.sql.shuffle.partitions=200
>>   .map(_._2) // drop key
>>
>> newSnapshot
>>   .repartition(60)  // (1)
>>   .write.parquet(newPath)
>> ```
>>
>> The issue we have is that some data were duplicated or lost, and the
>> amount of
>> duplicated and loss data are similar.
>>
>> We also noticed that this situation only happens if some instances got
>> preempted. Spark will retry the stage, so some of the partitioned files
>> are
>> generated at the 1st time, and other files are generated at the
>> 2nd(retry) time.
>> Moreover, those duplicated logs will be duplicated exactly twice and
>> located in
>> both batches (one in the first batch; and one in the second batch).
>>
>> The input/output files are parquet on GCS. The Spark version is 2.4.4 with
>> standalone deployment. Workers running on GCP preemptible instances and
>> they
>> being preempted very frequently.
>>
>> The pipeline is running in a single long-running process with
>> multi-threads,
>> each snapshot represent an "hour" of data, and we do the
>> "read-reduce-write" operations
>> on multiple snapshots(hours) simultaneously. We pretty sure the same
>> snapshot(hour) never process parallelly and the output path always
>> generated with a timestamp, so those jobs shouldn't affect each other.
>>
>> After changing the line (1) to `coalesce` or `repartition(100, $"pkey")`
>> the issue
>> was gone, but I believe there is still a correctness bug that hasn't been
>> reported yet.
>>
>> We have tried to reproduce this bug on a smaller scale but haven't
>> succeeded yet. I
>> have read SPARK-23207 and SPARK-28699, but couldn't found the bug.
>> Since this case is DataSet, I believe it is unrelated to SPARK-24243.
>>
>> Can anyone give me some advice about the following tasks?
>> Thanks in advance.
>>
>> Shiao-An Yuan
>>
>


Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
Hi folks,

I finally found the root cause of this issue.
It can be easily reproduced by the following code.
We ran it in a standalone-mode environment with 4 cores * 4 instances
(16 cores in total).

```
import org.apache.spark.TaskContext
import scala.sys.process._
import org.apache.spark.sql.functions._
import com.google.common.hash.Hashing
val murmur3 = Hashing.murmur3_32()

// create a Dataset where the cardinality of the second element equals 5.
val ds = spark.range(0, 10, 1, 130).map(i =>
(murmur3.hashLong(i).asInt(), i/2))

ds.groupByKey(_._2)
  .agg(first($"_1").as[Long])
  .repartition(200)
  .map { x =>
if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId
== 100 && TaskContext.get.stageAttemptNumber == 0) {
  throw new Exception("pkill -f CoarseGrainedExecutorBackend".!!)
}
x
  }
  .map(_._2).distinct().count()   // the correct result is 5, but we
                                  // always get a smaller number
```

The problem here is that SPARK-23207 uses sorting to make RoundRobinPartitioning
always generate the same distribution,
but the aggregate function `first` may return non-deterministic results, which
makes the sort result non-deterministic.
Therefore, the first stage attempt and the retry attempt might have different
distributions and cause duplication and loss.

Thanks,
Shiao-An Yuan
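Two hedged ways to remove this non-determinism (a sketch, not part of the original
message; it reuses ds, newSnapshot, newPath and pkey from the code above and assumes
spark.implicits._ is in scope). Part (a) is exactly the workaround already reported on
this thread; part (b) only applies to the minimal reproduction:

```
import org.apache.spark.sql.functions.min

// (a) Hash-partition by the primary key instead of round-robin: the row-to-partition
//     mapping then depends only on the key value, not on any sort order, so a retried
//     stage places rows in the same partitions.
newSnapshot
  .repartition(100, $"pkey")
  .write.parquet(newPath)

// (b) In the reproduction, `first` picks whichever row happens to arrive first in each
//     group; an order-independent aggregate such as `min` gives the same result on
//     every attempt, so the sort used by RoundRobinPartitioning becomes stable.
val aggregated = ds.groupByKey(_._2).agg(min($"_1").as[Long])
```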

On Tue, Dec 29, 2020 at 10:00 PM Shiao-An Yuan 
wrote:

> Hi folks,
>
> We recently identified a data correctness issue in our pipeline.
>
> The data processing flow is as follows:
> 1. read the current snapshot (provide empty if it doesn't exist yet)
> 2. read unprocessed new data
> 3. union them and do a `reduceByKey` operation
> 4. output a new version of the snapshot
> 5. repeat step 1~4
>
> The simplified version of code:
> ```
> // schema
> case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)
>
> // function for reduce
> def merge(left: Log, right: Log): Log = {
>   Log(pkey = left.pkey
>   a= if (left.a!=null) left.a else right.a,
>   b= if (left.a!=null) left.b else right.b,
>   ...
>   )
> }
>
> // a very large parquet file (>10G, 200 partitions)
> val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]
>
> // multiple small parquet files
> val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]
>
> val newSnapshot = currentSnapshot.union(newAddedLog)
>   .groupByKey(new String(pkey))  // generate key
>   .reduceGroups(_.merge(_))//
> spark.sql.shuffle.partitions=200
>   .map(_._2) // drop key
>
> newSnapshot
>   .repartition(60)  // (1)
>   .write.parquet(newPath)
> ```
>
> The issue we have is that some data were duplicated or lost, and the
> amount of
> duplicated and loss data are similar.
>
> We also noticed that this situation only happens if some instances got
> preempted. Spark will retry the stage, so some of the partitioned files are
> generated at the 1st time, and other files are generated at the 2nd(retry)
> time.
> Moreover, those duplicated logs will be duplicated exactly twice and
> located in
> both batches (one in the first batch; and one in the second batch).
>
> The input/output files are parquet on GCS. The Spark version is 2.4.4 with
> standalone deployment. Workers running on GCP preemptible instances and
> they
> being preempted very frequently.
>
> The pipeline is running in a single long-running process with
> multi-threads,
> each snapshot represent an "hour" of data, and we do the
> "read-reduce-write" operations
> on multiple snapshots(hours) simultaneously. We pretty sure the same
> snapshot(hour) never process parallelly and the output path always
> generated with a timestamp, so those jobs shouldn't affect each other.
>
> After changing the line (1) to `coalesce` or `repartition(100, $"pkey")`
> the issue
> was gone, but I believe there is still a correctness bug that hasn't been
> reported yet.
>
> We have tried to reproduce this bug on a smaller scale but haven't
> succeeded yet. I
> have read SPARK-23207 and SPARK-28699, but couldn't found the bug.
> Since this case is DataSet, I believe it is unrelated to SPARK-24243.
>
> Can anyone give me some advice about the following tasks?
> Thanks in advance.
>
> Shiao-An Yuan
>


Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
I don't think this addresses my comment at all. Please try correctly
implementing equals and hashCode for your key class first.

On Tue, Dec 29, 2020 at 8:31 PM Shiao-An Yuan 
wrote:

> Hi Sean,
>
> Sorry, I didn't describe it clearly. The column "pkey" is like a "Primary
> Key" and I do "reduce by key" on this column, so the "amount of rows"
> should always equal to the "cardinality of pkey".
> When I said data get duplicated & lost, I mean duplicated "pkey" exists in
> the output file (after "reduce by key") and some "pkey" missing.
> Since it only happens when executors being preempted, I believe this is a
> bug (nondeterministic shuffle) that SPARK-23207 trying to solve.
>
> Thanks,
>
> Shiao-An Yuan
>
> On Tue, Dec 29, 2020 at 10:53 PM Sean Owen  wrote:
>
>> Total guess here, but your key is a case class. It does define hashCode
>> and equals for you, but, you have an array as one of the members. Array
>> equality is by reference, so, two arrays of the same elements are not
>> equal. You may have to define hashCode and equals manually to make them
>> correct.
>>
>> On Tue, Dec 29, 2020 at 8:01 AM Shiao-An Yuan 
>> wrote:
>>
>>> Hi folks,
>>>
>>> We recently identified a data correctness issue in our pipeline.
>>>
>>> The data processing flow is as follows:
>>> 1. read the current snapshot (provide empty if it doesn't exist yet)
>>> 2. read unprocessed new data
>>> 3. union them and do a `reduceByKey` operation
>>> 4. output a new version of the snapshot
>>> 5. repeat step 1~4
>>>
>>> The simplified version of code:
>>> ```
>>> // schema
>>> case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)
>>>
>>> // function for reduce
>>> def merge(left: Log, right: Log): Log = {
>>>   Log(pkey = left.pkey
>>>   a= if (left.a!=null) left.a else right.a,
>>>   b= if (left.a!=null) left.b else right.b,
>>>   ...
>>>   )
>>> }
>>>
>>> // a very large parquet file (>10G, 200 partitions)
>>> val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]
>>>
>>> // multiple small parquet files
>>> val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]
>>>
>>> val newSnapshot = currentSnapshot.union(newAddedLog)
>>>   .groupByKey(new String(pkey))  // generate key
>>>   .reduceGroups(_.merge(_))//
>>> spark.sql.shuffle.partitions=200
>>>   .map(_._2) // drop key
>>>
>>> newSnapshot
>>>   .repartition(60)  // (1)
>>>   .write.parquet(newPath)
>>> ```
>>>
>>> The issue we have is that some data were duplicated or lost, and the
>>> amount of
>>> duplicated and loss data are similar.
>>>
>>> We also noticed that this situation only happens if some instances got
>>> preempted. Spark will retry the stage, so some of the partitioned files
>>> are
>>> generated at the 1st time, and other files are generated at the
>>> 2nd(retry) time.
>>> Moreover, those duplicated logs will be duplicated exactly twice and
>>> located in
>>> both batches (one in the first batch; and one in the second batch).
>>>
>>> The input/output files are parquet on GCS. The Spark version is 2.4.4
>>> with
>>> standalone deployment. Workers running on GCP preemptible instances and
>>> they
>>> being preempted very frequently.
>>>
>>> The pipeline is running in a single long-running process with
>>> multi-threads,
>>> each snapshot represent an "hour" of data, and we do the
>>> "read-reduce-write" operations
>>> on multiple snapshots(hours) simultaneously. We pretty sure the same
>>> snapshot(hour) never process parallelly and the output path always
>>> generated with a timestamp, so those jobs shouldn't affect each other.
>>>
>>> After changing the line (1) to `coalesce` or `repartition(100, $"pkey")`
>>> the issue
>>> was gone, but I believe there is still a correctness bug that hasn't
>>> been reported yet.
>>>
>>> We have tried to reproduce this bug on a smaller scale but haven't
>>> succeeded yet. I
>>> have read SPARK-23207 and SPARK-28699, but couldn't found the bug.
>>> Since this case is DataSet, I believe it is unrelated to SPARK-24243.
>>>
>>> Can anyone give me some advice about the following tasks?
>>> Thanks in advance.
>>>
>>> Shiao-An Yuan
>>>
>>


Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
Hi Sean,

Sorry, I didn't describe it clearly. The column "pkey" is like a "Primary
Key", and I "reduce by key" on this column, so the "amount of rows" should
always equal the "cardinality of pkey".
When I said data got duplicated and lost, I mean that duplicated "pkey" values
exist in the output file (after the "reduce by key") and some "pkey" values are
missing.
Since it only happens when executors are being preempted, I believe this is a
bug (non-deterministic shuffle) of the kind SPARK-23207 is trying to solve.

Thanks,

Shiao-An Yuan

On Tue, Dec 29, 2020 at 10:53 PM Sean Owen  wrote:

> Total guess here, but your key is a case class. It does define hashCode
> and equals for you, but, you have an array as one of the members. Array
> equality is by reference, so, two arrays of the same elements are not
> equal. You may have to define hashCode and equals manually to make them
> correct.
>
> On Tue, Dec 29, 2020 at 8:01 AM Shiao-An Yuan 
> wrote:
>
>> Hi folks,
>>
>> We recently identified a data correctness issue in our pipeline.
>>
>> The data processing flow is as follows:
>> 1. read the current snapshot (provide empty if it doesn't exist yet)
>> 2. read unprocessed new data
>> 3. union them and do a `reduceByKey` operation
>> 4. output a new version of the snapshot
>> 5. repeat step 1~4
>>
>> The simplified version of code:
>> ```
>> // schema
>> case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)
>>
>> // function for reduce
>> def merge(left: Log, right: Log): Log = {
>>   Log(pkey = left.pkey
>>   a= if (left.a!=null) left.a else right.a,
>>   b= if (left.a!=null) left.b else right.b,
>>   ...
>>   )
>> }
>>
>> // a very large parquet file (>10G, 200 partitions)
>> val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]
>>
>> // multiple small parquet files
>> val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]
>>
>> val newSnapshot = currentSnapshot.union(newAddedLog)
>>   .groupByKey(new String(pkey))  // generate key
>>   .reduceGroups(_.merge(_))//
>> spark.sql.shuffle.partitions=200
>>   .map(_._2) // drop key
>>
>> newSnapshot
>>   .repartition(60)  // (1)
>>   .write.parquet(newPath)
>> ```
>>
>> The issue we have is that some data were duplicated or lost, and the
>> amount of
>> duplicated and loss data are similar.
>>
>> We also noticed that this situation only happens if some instances got
>> preempted. Spark will retry the stage, so some of the partitioned files
>> are
>> generated at the 1st time, and other files are generated at the
>> 2nd(retry) time.
>> Moreover, those duplicated logs will be duplicated exactly twice and
>> located in
>> both batches (one in the first batch; and one in the second batch).
>>
>> The input/output files are parquet on GCS. The Spark version is 2.4.4 with
>> standalone deployment. Workers running on GCP preemptible instances and
>> they
>> being preempted very frequently.
>>
>> The pipeline is running in a single long-running process with
>> multi-threads,
>> each snapshot represent an "hour" of data, and we do the
>> "read-reduce-write" operations
>> on multiple snapshots(hours) simultaneously. We pretty sure the same
>> snapshot(hour) never process parallelly and the output path always
>> generated with a timestamp, so those jobs shouldn't affect each other.
>>
>> After changing the line (1) to `coalesce` or `repartition(100, $"pkey")`
>> the issue
>> was gone, but I believe there is still a correctness bug that hasn't been
>> reported yet.
>>
>> We have tried to reproduce this bug on a smaller scale but haven't
>> succeeded yet. I
>> have read SPARK-23207 and SPARK-28699, but couldn't found the bug.
>> Since this case is DataSet, I believe it is unrelated to SPARK-24243.
>>
>> Can anyone give me some advice about the following tasks?
>> Thanks in advance.
>>
>> Shiao-An Yuan
>>
>


Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
Total guess here, but your key is a case class. It does define hashCode and
equals for you, but you have an array as one of the members. Array
equality is by reference, so two arrays of the same elements are not
equal. You may have to define hashCode and equals manually to make them
correct.
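A minimal sketch of that suggestion (not part of the original message), applied to the
Log class from this thread; the real class has 100+ columns, elided here, and whether
this changes the observed behaviour is not confirmed on the thread:

```
import java.util.Arrays

case class Log(pkey: Array[Byte], a: String, b: Int /*, 100+ columns */) {
  // Value-based equality for the Array[Byte] member; the default case-class
  // equals/hashCode would compare the array by reference.
  override def equals(other: Any): Boolean = other match {
    case that: Log =>
      Arrays.equals(this.pkey, that.pkey) && this.a == that.a && this.b == that.b
    case _ => false
  }

  override def hashCode(): Int = {
    var h = Arrays.hashCode(pkey)
    h = 31 * h + (if (a == null) 0 else a.hashCode)
    31 * h + b
  }
}
```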

On Tue, Dec 29, 2020 at 8:01 AM Shiao-An Yuan 
wrote:

> Hi folks,
>
> We recently identified a data correctness issue in our pipeline.
>
> The data processing flow is as follows:
> 1. read the current snapshot (provide empty if it doesn't exist yet)
> 2. read unprocessed new data
> 3. union them and do a `reduceByKey` operation
> 4. output a new version of the snapshot
> 5. repeat step 1~4
>
> The simplified version of code:
> ```
> // schema
> case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)
>
> // function for reduce
> def merge(left: Log, right: Log): Log = {
>   Log(pkey = left.pkey
>   a= if (left.a!=null) left.a else right.a,
>   b= if (left.a!=null) left.b else right.b,
>   ...
>   )
> }
>
> // a very large parquet file (>10G, 200 partitions)
> val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]
>
> // multiple small parquet files
> val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]
>
> val newSnapshot = currentSnapshot.union(newAddedLog)
>   .groupByKey(new String(pkey))  // generate key
>   .reduceGroups(_.merge(_))//
> spark.sql.shuffle.partitions=200
>   .map(_._2) // drop key
>
> newSnapshot
>   .repartition(60)  // (1)
>   .write.parquet(newPath)
> ```
>
> The issue we have is that some data were duplicated or lost, and the
> amount of
> duplicated and loss data are similar.
>
> We also noticed that this situation only happens if some instances got
> preempted. Spark will retry the stage, so some of the partitioned files are
> generated at the 1st time, and other files are generated at the 2nd(retry)
> time.
> Moreover, those duplicated logs will be duplicated exactly twice and
> located in
> both batches (one in the first batch; and one in the second batch).
>
> The input/output files are parquet on GCS. The Spark version is 2.4.4 with
> standalone deployment. Workers running on GCP preemptible instances and
> they
> being preempted very frequently.
>
> The pipeline is running in a single long-running process with
> multi-threads,
> each snapshot represent an "hour" of data, and we do the
> "read-reduce-write" operations
> on multiple snapshots(hours) simultaneously. We pretty sure the same
> snapshot(hour) never process parallelly and the output path always
> generated with a timestamp, so those jobs shouldn't affect each other.
>
> After changing the line (1) to `coalesce` or `repartition(100, $"pkey")`
> the issue
> was gone, but I believe there is still a correctness bug that hasn't been
> reported yet.
>
> We have tried to reproduce this bug on a smaller scale but haven't
> succeeded yet. I
> have read SPARK-23207 and SPARK-28699, but couldn't found the bug.
> Since this case is DataSet, I believe it is unrelated to SPARK-24243.
>
> Can anyone give me some advice about the following tasks?
> Thanks in advance.
>
> Shiao-An Yuan
>


Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
Hi folks,

We recently identified a data correctness issue in our pipeline.

The data processing flow is as follows:
1. read the current snapshot (provide empty if it doesn't exist yet)
2. read unprocessed new data
3. union them and do a `reduceByKey` operation
4. output a new version of the snapshot
5. repeat step 1~4

The simplified version of code:
```
// schema
case class Log(pkey: Array[Byte], a: String, b: Int, /* 100+ columns */)

// function for reduce
def merge(left: Log, right: Log): Log = {
  Log(pkey = left.pkey,
  a = if (left.a != null) left.a else right.a,
  b = if (left.a != null) left.b else right.b,
  ...
  )
}

// a very large parquet file (>10G, 200 partitions)
val currentSnapshot = spark.read.schema(schema).parquet(...).as[Log]

// multiple small parquet files
val newAddedLogs = spark.read.schema(schema).parquet(...).as[Log]

val newSnapshot = currentSnapshot.union(newAddedLogs)
  .groupByKey(log => new String(log.pkey))  // generate key
  .reduceGroups(merge _)   // spark.sql.shuffle.partitions=200
  .map(_._2)               // drop key

newSnapshot
  .repartition(60)  // (1)
  .write.parquet(newPath)
```

The issue we have is that some data were duplicated or lost, and the amounts
of duplicated and lost data are similar.

We also noticed that this situation only happens if some instances got
preempted. Spark will retry the stage, so some of the partitioned files are
generated at the 1st time, and other files are generated at the 2nd(retry)
time.
Moreover, those duplicated logs will be duplicated exactly twice and
located in
both batches (one in the first batch; and one in the second batch).

The input/output files are parquet on GCS. The Spark version is 2.4.4 with
standalone deployment. Workers run on GCP preemptible instances and they are
preempted very frequently.

The pipeline is running in a single long-running, multi-threaded process;
each snapshot represents an "hour" of data, and we do the
"read-reduce-write" operations
on multiple snapshots (hours) simultaneously. We are pretty sure the same
snapshot (hour) is never processed in parallel, and the output path is always
generated with a timestamp, so those jobs shouldn't affect each other.

After changing the line (1) to `coalesce` or `repartition(100, $"pkey")`
the issue
was gone, but I believe there is still a correctness bug that hasn't been
reported yet.

We have tried to reproduce this bug on a smaller scale but haven't
succeeded yet. I have read SPARK-23207 and SPARK-28699, but couldn't find the
bug.
Since this case is DataSet, I believe it is unrelated to SPARK-24243.

Can anyone give me some advice on this issue?
Thanks in advance.

Shiao-An Yuan


[bug] Scala reflection "assertion failed: class Byte" in Dataset.toJSON

2020-05-30 Thread Brandon Vincent
Hi all,

I have a job that executes a query and collects the results as JSON using
Dataset.toJSON. For the most part it is stable, but sometimes it fails
randomly with a scala assertion error. Here is the stack trace:

org.apache.spark.sql.Dataset.toJSON (Dataset.scala:3222)
org.apache.spark.sql.Encoders$.STRING (Encoders.scala:96)
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply (ExpressionEncoder.scala:72)
org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor (ScalaReflection.scala:161)
org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor (ScalaReflection.scala:173)
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects (ScalaReflection.scala:49)
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$ (ScalaReflection.scala:925)
org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects (ScalaReflection.scala:926)
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo (TypeConstraints.scala:68)
org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$deserializerFor$1 (ScalaReflection.scala:260)
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf (ScalaReflection.scala:49)
org.apache.spark.sql.catalyst.ScalaReflection.localTypeOf$ (ScalaReflection.scala:939)
org.apache.spark.sql.catalyst.ScalaReflection.localTypeOf (ScalaReflection.scala:941)
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe (TypeTags.scala:237)
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute (TypeTags.scala:237)
org.apache.spark.sql.catalyst.ScalaReflection$$typecreator7$1.apply (ScalaReflection.scala:260)
scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor (Symbols.scala:3081)
scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor (Symbols.scala:194)
scala.reflect.internal.Symbols$TypeSymbol.typeConstructor (Symbols.scala:3154)
scala.reflect.internal.Symbols$TypeSymbol.setTyconCache (Symbols.scala:3163)
scala.reflect.internal.SymbolTable.throwAssertionError (SymbolTable.scala:183)
java.lang.AssertionError: assertion failed: class Byte

It can also come up with "class Boolean" with the same stack trace.
Any clues on this? I wasn't able to find information about this specific
assertion error.

The spark version is 2.4.4 compiled with scala 2.12.

Thank you.
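Not a confirmed fix, but a way to sidestep the code path in the trace above while the
underlying issue is investigated: the frames go through the driver-side Scala reflection
that Dataset.toJSON triggers when it resolves a String encoder. Building the JSON with
the built-in to_json expression avoids that driver-side resolution (sketch assumes a
DataFrame named df):

```
import org.apache.spark.sql.functions.{col, struct, to_json}

// to_json is evaluated by Catalyst on the executors, so no String encoder
// (and no ScalaReflection call) is needed on the driver.
val jsonRows: Array[String] = df
  .select(to_json(struct(df.columns.map(col): _*)).alias("json"))
  .collect()
  .map(_.getString(0))
```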


Have you paid your bug bounty or did you log him off without paying

2020-05-01 Thread Nelson Mandela



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: BUG: spark.readStream .schema(staticSchema) not receiving schema information

2020-03-28 Thread Zahid Rahman
Thanks for the tip!

But if the first thing you come across
is somebody using the trim function to strip away spaces in /etc/hostname,
turning:

127.0.0.1 hostname local

into:

127.0.0.1hostnamelocal

then there is a log error message showing the outcome of unnecessarily
using the trim function.

Especially when one of Spark's core functions is to read lines from
files separated by a space or a comma.

Also, have you seen the log4j.properties
setting set to ERROR, and in one case FATAL,
to suppress discrepancies?

May I please draw your attention, and the attention of all in the community, to
this page, which recommends turning on compiler WARNINGS before releasing
software, among other software best practices:

“The Power of 10 — NASA’s Rules for Coding” by Riccardo Giorato
https://link.medium.com/PUz88PIql3

What impression would you have?



On Sat, 28 Mar 2020, 15:50 Jeff Evans, 
wrote:

> Dude, you really need to chill. Have you ever worked with a large open
> source project before? It seems not. Even so, insinuating there are tons of
> bugs that were left uncovered until you came along (despite the fact that
> the project is used by millions across many different organizations) is
> ludicrous. Learn a little bit of humility
>
> If you're new to something, assume you have made a mistake rather than
> that there is a bug. Lurk a bit more, or even do a simple Google search,
> and you will realize Sean is a very senior committer (i.e. expert) in
> Spark, and has been for many years. He, and everyone else participating in
> these lists, is doing it voluntarily on their own time. They're not being
> paid to handhold you and quickly answer to your every whim.
>
> On Sat, Mar 28, 2020, 10:46 AM Zahid Rahman  wrote:
>
>> So the schema is limited to holding only the DEFINITION of schema. For
>> example as you say  the columns, I.e. first column User:Int 2nd column
>> String:password.
>>
>> Not location of source I.e. csv file with or without header.  SQL DB
>> tables.
>>
>> I am pleased for once I am wrong about being another bug, and it was a
>> design decision adding flexibility.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sat, 28 Mar 2020, 15:24 Russell Spitzer, 
>> wrote:
>>
>>> This is probably more of a question for the user support list, but I
>>> believe I understand the issue.
>>>
>>> Schema inside of spark refers to the structure of the output rows, for
>>> example the schema for a particular dataframe could be
>>> (User: Int, Password: String) - Two Columns the first is User of type
>>> int and the second is Password of Type String.
>>>
>>> When you pass the schema from one reader to another, you are only
>>> copyting this structure, not all of the other options associated with the
>>> dataframe.
>>> This is usually useful when you are reading from sources with different
>>> options but data that needs to be read into the same structure.
>>>
>>> The other properties such as "format" and "options" exist independently
>>> of Schema. This is helpful if I was reading from both MySQL and
>>> a comma separated file for example. While the Schema is the same, the
>>> options like ("inferSchema") do not apply to both MySql and CSV and
>>> format actually picks whether to us "JDBC" or "CSV" so copying that
>>> wouldn't be helpful either.
>>>
>>> I hope this clears things up,
>>> Russ
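A minimal sketch of the pattern Russell describes (not part of the original thread):
the statically inferred schema is reused, while format, options and path still have to
be supplied to the streaming reader. The paths and option values below are placeholders:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema reuse sketch").getOrCreate()

// Batch read once, only to obtain the structure (column names and types).
val staticFrame = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail/*.csv")          // placeholder path

val staticSchema = staticFrame.schema

// .schema(...) copies only the row structure; the source configuration
// (format, options, path) must be given again for the stream.
val streamingFrame = spark.readStream
  .schema(staticSchema)
  .format("csv")
  .option("header", "true")
  .option("maxFilesPerTrigger", "1")
  .load("/data/retail/*.csv")          // placeholder path
```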
>>>
>>> On Sat, Mar 28, 2020, 12:33 AM Zahid Rahman 
>>> wrote:
>>>
>>>> Hi,
>>>> version: spark-3.0.0-preview2-bin-hadoop2.7
>>>>
>>>> As you can see from the code :
>>>>
>>>> STEP 1:  I  create a object of type static frame which holds all the
>>>> information to the datasource (csv files).
>>>>
>>>> STEP 2: Then I create a variable  called staticSchema  assigning the
>>>> information of the schema from the original static data frame.
>>>>
>>>> STEP 3: then I create another variable called val streamingDataFrame of
>>>> type spark.readStream.
>>>> and Into the .schema function parameters I pass the object staticSchema
>>>> which is meant to hold the information to the  csv files including the
>>>> .load(path) function etc.
>>>>
>>>> So then when I am creating val StreamingDataFrame and passing it
>>>> .schema(staticSchema)
>>>> the variable StreamingDataFrame  should have all the information.
>>>> I should only have to call .optio

Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Zahid Rahman
~/spark-3.0.0-preview2-bin-hadoop2.7$ sbin/start-slave.sh spark://192.168.0.38:7077
~/spark-3.0.0-preview2-bin-hadoop2.7$ sbin/start-master.sh

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



On Fri, 27 Mar 2020 at 06:12, Zahid Rahman  wrote:

> sbin/start-master.sh
> sbin/start-slave.sh spark://192.168.0.38:7077
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
> On Fri, 27 Mar 2020 at 05:59, Wenchen Fan  wrote:
>
>> Your Spark cluster, spark://192.168.0.38:7077, how is it deployed if you
>> just include Spark dependency in IntelliJ?
>>
>> On Fri, Mar 27, 2020 at 1:54 PM Zahid Rahman 
>> wrote:
>>
>>> I have configured  in IntelliJ as external jars
>>> spark-3.0.0-preview2-bin-hadoop2.7/jar
>>>
>>> not pulling anything from maven.
>>>
>>> Backbutton.co.uk
>>> ¯\_(ツ)_/¯
>>> ♡۶Java♡۶RMI ♡۶
>>> Make Use Method {MUM}
>>> makeuse.org
>>> 
>>>
>>>
>>> On Fri, 27 Mar 2020 at 05:45, Wenchen Fan  wrote:
>>>
 Which Spark/Scala version do you use?

 On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman 
 wrote:

>
> with the following sparksession configuration
>
> val spark = SparkSession.builder().master("local[*]").appName("Spark 
> Session take").getOrCreate();
>
> this line works
>
> flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
> "Canada").map(flight_row => flight_row).take(5)
>
>
> however if change the master url like so, with the ip address then the
> following error is produced by the position of .take(5)
>
> val spark = 
> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
> Session take").getOrCreate();
>
>
> 20/03/27 05:15:20 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
> 1, 192.168.0.38, executor 0): java.lang.ClassCastException: cannot assign
> instance of java.lang.invoke.SerializedLambda to field
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in 
> instance
> of org.apache.spark.rdd.MapPartitionsRDD
>
> BUT if I  remove take(5) or change the position of take(5) or insert
> an extra take(5) as illustrated in code then it works. I don't see why the
> position of take(5) should cause such an error or be caused by changing 
> the
> master url
>
> flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
> "Canada").map(flight_row => flight_row).take(5)
>
>   flights.take(5)
>
>   flights
>   .take(5)
>   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
>   .map(fr => flight(fr.DEST_COUNTRY_NAME, fr.ORIGIN_COUNTRY_NAME,fr.count 
> + 5))
>flights.show(5)
>
>
> complete code if you wish to replicate it.
>
> import org.apache.spark.sql.SparkSession
>
> object sessiontest {
>
>   // define specific  data type class then manipulate it using the filter 
> and map functions
>   // this is also known as an Encoder
>   case class flight (DEST_COUNTRY_NAME: String,
>  ORIGIN_COUNTRY_NAME:String,
>  count: BigInt)
>
>
>   def main(args:Array[String]): Unit ={
>
> val spark = 
> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
> Session take").getOrCreate();
>
> import spark.implicits._
> val flightDf = 
> spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
> val flights = flightDf.as[flight]
>
> flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME 
> != "Canada").map(flight_row => flight_row).take(5)
>
>   flights.take(5)
>
>   flights
>   .take(5)
>   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
>   .map(fr => flight(fr.DEST_COUNTRY_NAME, 
> fr.ORIGIN_COUNTRY_NAME,fr.count + 5))
>flights.show(5)
>
>   } // main
> }
>
>
>
>
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>



Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Zahid Rahman
sbin/start-master.sh
sbin/start-slave.sh spark://192.168.0.38:7077

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



On Fri, 27 Mar 2020 at 05:59, Wenchen Fan  wrote:

> Your Spark cluster, spark://192.168.0.38:7077, how is it deployed if you
> just include Spark dependency in IntelliJ?
>
> On Fri, Mar 27, 2020 at 1:54 PM Zahid Rahman  wrote:
>
>> I have configured  in IntelliJ as external jars
>> spark-3.0.0-preview2-bin-hadoop2.7/jar
>>
>> not pulling anything from maven.
>>
>> Backbutton.co.uk
>> ¯\_(ツ)_/¯
>> ♡۶Java♡۶RMI ♡۶
>> Make Use Method {MUM}
>> makeuse.org
>> 
>>
>>
>> On Fri, 27 Mar 2020 at 05:45, Wenchen Fan  wrote:
>>
>>> Which Spark/Scala version do you use?
>>>
>>> On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman 
>>> wrote:
>>>

 with the following sparksession configuration

 val spark = SparkSession.builder().master("local[*]").appName("Spark 
 Session take").getOrCreate();

 this line works

 flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
 "Canada").map(flight_row => flight_row).take(5)


 however if change the master url like so, with the ip address then the
 following error is produced by the position of .take(5)

 val spark = 
 SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
 Session take").getOrCreate();


 20/03/27 05:15:20 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
 1, 192.168.0.38, executor 0): java.lang.ClassCastException: cannot assign
 instance of java.lang.invoke.SerializedLambda to field
 org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance
 of org.apache.spark.rdd.MapPartitionsRDD

 BUT if I  remove take(5) or change the position of take(5) or insert an
 extra take(5) as illustrated in code then it works. I don't see why the
 position of take(5) should cause such an error or be caused by changing the
 master url

 flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
 "Canada").map(flight_row => flight_row).take(5)

   flights.take(5)

   flights
   .take(5)
   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
   .map(fr => flight(fr.DEST_COUNTRY_NAME, fr.ORIGIN_COUNTRY_NAME,fr.count 
 + 5))
flights.show(5)


 complete code if you wish to replicate it.

 import org.apache.spark.sql.SparkSession

 object sessiontest {

   // define specific  data type class then manipulate it using the filter 
 and map functions
   // this is also known as an Encoder
   case class flight (DEST_COUNTRY_NAME: String,
  ORIGIN_COUNTRY_NAME:String,
  count: BigInt)


   def main(args:Array[String]): Unit ={

 val spark = 
 SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
 Session take").getOrCreate();

 import spark.implicits._
 val flightDf = 
 spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
 val flights = flightDf.as[flight]

 flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
 "Canada").map(flight_row => flight_row).take(5)

   flights.take(5)

   flights
   .take(5)
   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
   .map(fr => flight(fr.DEST_COUNTRY_NAME, 
 fr.ORIGIN_COUNTRY_NAME,fr.count + 5))
flights.show(5)

   } // main
 }





 Backbutton.co.uk
 ¯\_(ツ)_/¯
 ♡۶Java♡۶RMI ♡۶
 Make Use Method {MUM}
 makeuse.org
 

>>>


Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Wenchen Fan
Your Spark cluster, spark://192.168.0.38:7077, how is it deployed if you
just include Spark dependency in IntelliJ?

On Fri, Mar 27, 2020 at 1:54 PM Zahid Rahman  wrote:

> I have configured  in IntelliJ as external jars
> spark-3.0.0-preview2-bin-hadoop2.7/jar
>
> not pulling anything from maven.
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
> On Fri, 27 Mar 2020 at 05:45, Wenchen Fan  wrote:
>
>> Which Spark/Scala version do you use?
>>
>> On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman 
>> wrote:
>>
>>>
>>> with the following sparksession configuration
>>>
>>> val spark = SparkSession.builder().master("local[*]").appName("Spark 
>>> Session take").getOrCreate();
>>>
>>> this line works
>>>
>>> flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
>>> "Canada").map(flight_row => flight_row).take(5)
>>>
>>>
>>> however if change the master url like so, with the ip address then the
>>> following error is produced by the position of .take(5)
>>>
>>> val spark = 
>>> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
>>> Session take").getOrCreate();
>>>
>>>
>>> 20/03/27 05:15:20 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
>>> 1, 192.168.0.38, executor 0): java.lang.ClassCastException: cannot assign
>>> instance of java.lang.invoke.SerializedLambda to field
>>> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance
>>> of org.apache.spark.rdd.MapPartitionsRDD
>>>
>>> BUT if I  remove take(5) or change the position of take(5) or insert an
>>> extra take(5) as illustrated in code then it works. I don't see why the
>>> position of take(5) should cause such an error or be caused by changing the
>>> master url
>>>
>>> flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
>>> "Canada").map(flight_row => flight_row).take(5)
>>>
>>>   flights.take(5)
>>>
>>>   flights
>>>   .take(5)
>>>   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
>>>   .map(fr => flight(fr.DEST_COUNTRY_NAME, fr.ORIGIN_COUNTRY_NAME,fr.count + 
>>> 5))
>>>flights.show(5)
>>>
>>>
>>> complete code if you wish to replicate it.
>>>
>>> import org.apache.spark.sql.SparkSession
>>>
>>> object sessiontest {
>>>
>>>   // define specific  data type class then manipulate it using the filter 
>>> and map functions
>>>   // this is also known as an Encoder
>>>   case class flight (DEST_COUNTRY_NAME: String,
>>>  ORIGIN_COUNTRY_NAME:String,
>>>  count: BigInt)
>>>
>>>
>>>   def main(args:Array[String]): Unit ={
>>>
>>> val spark = 
>>> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
>>> Session take").getOrCreate();
>>>
>>> import spark.implicits._
>>> val flightDf = 
>>> spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
>>> val flights = flightDf.as[flight]
>>>
>>> flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
>>> "Canada").map(flight_row => flight_row).take(5)
>>>
>>>   flights.take(5)
>>>
>>>   flights
>>>   .take(5)
>>>   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
>>>   .map(fr => flight(fr.DEST_COUNTRY_NAME, 
>>> fr.ORIGIN_COUNTRY_NAME,fr.count + 5))
>>>flights.show(5)
>>>
>>>   } // main
>>> }
>>>
>>>
>>>
>>>
>>>
>>> Backbutton.co.uk
>>> ¯\_(ツ)_/¯
>>> ♡۶Java♡۶RMI ♡۶
>>> Make Use Method {MUM}
>>> makeuse.org
>>> 
>>>
>>


Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Zahid Rahman
I have configured spark-3.0.0-preview2-bin-hadoop2.7/jar as external jars
in IntelliJ.

I am not pulling anything from Maven.

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



On Fri, 27 Mar 2020 at 05:45, Wenchen Fan  wrote:

> Which Spark/Scala version do you use?
>
> On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman  wrote:
>
>>
>> with the following sparksession configuration
>>
>> val spark = SparkSession.builder().master("local[*]").appName("Spark Session 
>> take").getOrCreate();
>>
>> this line works
>>
>> flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
>> "Canada").map(flight_row => flight_row).take(5)
>>
>>
>> however if change the master url like so, with the ip address then the
>> following error is produced by the position of .take(5)
>>
>> val spark = 
>> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
>> Session take").getOrCreate();
>>
>>
>> 20/03/27 05:15:20 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
>> 192.168.0.38, executor 0): java.lang.ClassCastException: cannot assign
>> instance of java.lang.invoke.SerializedLambda to field
>> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance
>> of org.apache.spark.rdd.MapPartitionsRDD
>>
>> BUT if I  remove take(5) or change the position of take(5) or insert an
>> extra take(5) as illustrated in code then it works. I don't see why the
>> position of take(5) should cause such an error or be caused by changing the
>> master url
>>
>> flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
>> "Canada").map(flight_row => flight_row).take(5)
>>
>>   flights.take(5)
>>
>>   flights
>>   .take(5)
>>   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
>>   .map(fr => flight(fr.DEST_COUNTRY_NAME, fr.ORIGIN_COUNTRY_NAME,fr.count + 
>> 5))
>>flights.show(5)
>>
>>
>> complete code if you wish to replicate it.
>>
>> import org.apache.spark.sql.SparkSession
>>
>> object sessiontest {
>>
>>   // define specific  data type class then manipulate it using the filter 
>> and map functions
>>   // this is also known as an Encoder
>>   case class flight (DEST_COUNTRY_NAME: String,
>>  ORIGIN_COUNTRY_NAME:String,
>>  count: BigInt)
>>
>>
>>   def main(args:Array[String]): Unit ={
>>
>> val spark = 
>> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
>> Session take").getOrCreate();
>>
>> import spark.implicits._
>> val flightDf = 
>> spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
>> val flights = flightDf.as[flight]
>>
>> flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
>> "Canada").map(flight_row => flight_row).take(5)
>>
>>   flights.take(5)
>>
>>   flights
>>   .take(5)
>>   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
>>   .map(fr => flight(fr.DEST_COUNTRY_NAME, 
>> fr.ORIGIN_COUNTRY_NAME,fr.count + 5))
>>flights.show(5)
>>
>>   } // main
>> }
>>
>>
>>
>>
>>
>> Backbutton.co.uk
>> ¯\_(ツ)_/¯
>> ♡۶Java♡۶RMI ♡۶
>> Make Use Method {MUM}
>> makeuse.org
>> 
>>
>


Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Wenchen Fan
Which Spark/Scala version do you use?

On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman  wrote:

>
> with the following sparksession configuration
>
> val spark = SparkSession.builder().master("local[*]").appName("Spark Session 
> take").getOrCreate();
>
> this line works
>
> flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
> "Canada").map(flight_row => flight_row).take(5)
>
>
> however if change the master url like so, with the ip address then the
> following error is produced by the position of .take(5)
>
> val spark = 
> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
> Session take").getOrCreate();
>
>
> 20/03/27 05:15:20 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
> 192.168.0.38, executor 0): java.lang.ClassCastException: cannot assign
> instance of java.lang.invoke.SerializedLambda to field
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance
> of org.apache.spark.rdd.MapPartitionsRDD
>
> BUT if I  remove take(5) or change the position of take(5) or insert an
> extra take(5) as illustrated in code then it works. I don't see why the
> position of take(5) should cause such an error or be caused by changing the
> master url
>
> flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
> "Canada").map(flight_row => flight_row).take(5)
>
>   flights.take(5)
>
>   flights
>   .take(5)
>   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
>   .map(fr => flight(fr.DEST_COUNTRY_NAME, fr.ORIGIN_COUNTRY_NAME,fr.count + 
> 5))
>flights.show(5)
>
>
> complete code if you wish to replicate it.
>
> import org.apache.spark.sql.SparkSession
>
> object sessiontest {
>
>   // define specific  data type class then manipulate it using the filter and 
> map functions
>   // this is also known as an Encoder
>   case class flight (DEST_COUNTRY_NAME: String,
>  ORIGIN_COUNTRY_NAME:String,
>  count: BigInt)
>
>
>   def main(args:Array[String]): Unit ={
>
> val spark = 
> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark 
> Session take").getOrCreate();
>
> import spark.implicits._
> val flightDf = 
> spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
> val flights = flightDf.as[flight]
>
> flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != 
> "Canada").map(flight_row => flight_row).take(5)
>
>   flights.take(5)
>
>   flights
>   .take(5)
>   .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
>   .map(fr => flight(fr.DEST_COUNTRY_NAME, fr.ORIGIN_COUNTRY_NAME,fr.count 
> + 5))
>flights.show(5)
>
>   } // main
> }
>
>
>
>
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>


BUG: take with SparkSession.master[url]

2020-03-26 Thread Zahid Rahman
With the following SparkSession configuration

val spark = SparkSession.builder().master("local[*]").appName("Spark
Session take").getOrCreate();

this line works

flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME !=
"Canada").map(flight_row => flight_row).take(5)


However, if I change the master URL like so, to use the IP address, then the
following error is produced, depending on the position of .take(5):

val spark = 
SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark
Session take").getOrCreate();


20/03/27 05:15:20 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
192.168.0.38, executor 0): java.lang.ClassCastException: cannot assign
instance of java.lang.invoke.SerializedLambda to field
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance
of org.apache.spark.rdd.MapPartitionsRDD

BUT if I remove take(5), change the position of take(5), or insert an
extra take(5) as illustrated in the code, then it works. I don't see why the
position of take(5) should cause such an error, or why the error should be
caused by changing the master URL.

flights.take(5).filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME !=
"Canada").map(flight_row => flight_row).take(5)

  flights.take(5)

  flights
  .take(5)
  .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
  .map(fr => flight(fr.DEST_COUNTRY_NAME, fr.ORIGIN_COUNTRY_NAME,fr.count + 5))
   flights.show(5)


complete code if you wish to replicate it.

import org.apache.spark.sql.SparkSession

object sessiontest {

  // define specific  data type class then manipulate it using the
filter and map functions
  // this is also known as an Encoder
  case class flight (DEST_COUNTRY_NAME: String,
 ORIGIN_COUNTRY_NAME:String,
 count: BigInt)


  def main(args:Array[String]): Unit ={

val spark =
SparkSession.builder().master("spark://192.168.0.38:7077").appName("Spark
Session take").getOrCreate();

import spark.implicits._
val flightDf =
spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightDf.as[flight]

flights.take(5).filter(flight_row =>
flight_row.ORIGIN_COUNTRY_NAME != "Canada").map(flight_row =>
flight_row).take(5)

  flights.take(5)

  flights
  .take(5)
  .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
  .map(fr => flight(fr.DEST_COUNTRY_NAME,
fr.ORIGIN_COUNTRY_NAME,fr.count + 5))
   flights.show(5)

  } // main
}
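A hedged observation rather than a confirmed diagnosis: this ClassCastException on
SerializedLambda usually appears when the driver talks to a standalone cluster but the
application's own compiled classes (including the lambdas passed to filter/map) are
never shipped to the executors, which is what spark-submit normally does. One way to
test that from the IDE is to package the project as a jar and register it with the
session; the jar path below is a placeholder:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://192.168.0.38:7077")
  .appName("Spark Session take")
  // Ship the jar containing `sessiontest` (and its lambdas) to the executors.
  .config("spark.jars", "target/sessiontest-1.0-SNAPSHOT.jar")   // placeholder path
  .getOrCreate()
```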





Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



Spark 2.2.1 Dataframes multiple joins bug?

2020-03-23 Thread Dipl.-Inf. Rico Bergmann

Hi all!

Is it possible that, under certain circumstances, Spark creates duplicate
rows when doing multiple joins?


What I did:

buse.count

res0: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").count

res1: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), 
$"buse._c4"===$"bdef._c4").join(crnb.alias("crnb"), 
$"bdef._c9"===$"crnb._c4").count


res2: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), 
$"buse._c4"===$"bdef._c4").join(crnb.alias("crnb"), 
$"bdef._c9"===$"crnb._c4").join(wreg.alias("wreg"), 
$"crnb._c1"===$"wreg._c5").count


res3: Long = 21633023

For explanation: buse and crnb are 1:1 relationship tables.

In the last join I expected 20554365 again, but suddenly duplicate rows
exist. "wreg._c5" is a unique key, so it should not create more records:


wreg.groupBy($"_c5").agg(count($"_c2") as "cnt").filter($"cnt">1).show
+---+---+
|_c5|cnt|
+---+---+
+---+---+

When doing a distinct on the 4-way join I get the expected number of 
records:


buse.alias("buse").join(bdef.alias("bdef"), 
$"buse._c4"===$"bdef._c4").join(crnb.alias("crnb"), 
$"bdef._c9"===$"crnb._c4").join(wreg.alias("wreg"), 
$"crnb._c1"===$"wreg._c5").distinct.count

res10: Long = 20554365

This (in my opinion) means that Spark is creating duplicate rows, although
it shouldn't. Or am I missing something?
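One more check worth running (a sketch, not part of the original message): count($"_c2")
only counts rows where _c2 is non-null, so duplicate _c5 values whose _c2 happens to be
null would not show up in the uniqueness check above. Counting rows per key rules that
out:

```
// Counts every row per _c5 value, regardless of nulls in other columns.
// Assumes spark.implicits._ is in scope, as in the queries above.
wreg.groupBy($"_c5").count().filter($"count" > 1).show()
```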



Best, Rico.




Re: Hostname :BUG

2020-03-12 Thread Zahid Rahman
Hey Dodgy Bob, Linux & C programmers, conscientious non-objector,

I have a great idea I want share with you.
In Linux I am familiar with wc {wc = word count} (Linux users don't like
long-winded typing).
The wc flags are:
   -c, --bytes   print the byte counts
   -m, --chars   print the character counts
   -l, --lines   print the newline counts


zahid@192:~/Downloads> wc -w /etc/hostname
55 /etc/hostname

The first programme I was tasked to write in C was to replicate the Linux
wc utility.
I called it wordcount.c, used as wordcount -c -l -m or wordcount -c -l /etc.
Anyway, on this page https://spark.apache.org/examples.html
there are examples of word count in Scala, Python and Java.
I kinda feel left out because I know a little C and a little Linux.
I think it is a great idea for the sake of "familiarity for the client"
(application developer).
I was thinking of raising a JIRA but I thought I would consult with fellow
developers first. :)
developers first. :)

Please be kind.

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



On Mon, 9 Mar 2020 at 08:57, Zahid Rahman  wrote:

> Hey floyd ,
>
> I just realised something:
> You need to practice using the adduser command to create users
> or in your case useradd  because that's  less painless for you to create a
> user.
> Instead of working in root.
> Trust me it is good for you.
> Then you will realise this bit of code new SparkConf() is reading from the
> etc/hostname and not etc/host file for ip_address.
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
> On Wed, 4 Mar 2020 at 21:14, Andrew Melo  wrote:
>
>> Hello Zabid,
>>
>> On Wed, Mar 4, 2020 at 1:47 PM Zahid Rahman  wrote:
>>
>>> Hi,
>>>
>>> I found the problem was because on my  Linux   Operating System the
>>> /etc/hostname was blank.
>>>
>>> *STEP 1*
>>> I searched  on google the error message and there was an answer
>>> suggesting
>>> I should add to /etc/hostname
>>>
>>> 127.0.0.1  [hostname] localhost.
>>>
>>
>> I believe you've confused /etc/hostname and /etc/hosts --
>>
>>
>>>
>>> I did that but there was still  an error,  this time the spark  log in
>>> standard output was concatenating the text content
>>> of etc/hostname  like so ,   127.0.0.1[hostname]localhost.
>>>
>>> *STEP 2*
>>> My second attempt was to change the /etc/hostname to 127.0.0.1
>>> This time I was getting a warning with information about "using loop
>>> back"  rather than an error.
>>>
>>> *STEP 3*
>>> I wasn't happy with that so then I changed the /etc/hostname to (see
>>> below) ,
>>> then the warning message disappeared. my guess is that it is the act of
>>> creating spark session as to the cause of error,
>>> in SparkConf() API.
>>>
>>>  SparkConf sparkConf = new SparkConf()
>>>  .setAppName("Simple Application")
>>>  .setMaster("local")
>>>  .set("spark.executor.memory","2g");
>>>
>>> $ cat /etc/hostname
>>> # hosts This file describes a number of hostname-to-address
>>> #   mappings for the TCP/IP subsystem.  It is mostly
>>> #   used at boot time, when no name servers are running.
>>> #   On small systems, this file can be used instead of a
>>> #   "named" name server.
>>> # Syntax:
>>> #
>>> # IP-Address  Full-Qualified-Hostname  Short-Hostname
>>> #
>>>
>>> 192.168.0.42
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> zahid@localhost
>>> :~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
>>> mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD
>>> -Dexec.args="input.txt"
>>> [INFO] Scanning for projects...
>>> [WARNING]
>>> [WARNING] Some problems were encountered while building the effective
>>> model for javacodegeek:examples:jar:1.0-SNAPSHOT
>>> [WARNING] 'build.plugins.plugin.version' for
>>> org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
>>> column 21
>>> [WARNING]
>>> [WARNING] It is highly recommended to fix these problems because they
>>> threaten the stability of your build.
>>> [WARNING]
>>> [WARNING] For this reason, future Maven versions might no longer support
>>> building such malformed projects.
>>> [WARNING]
>>> [INFO]
>>> [INFO] ---< javacodegeek:examples
>>> >
>>> [INFO] Building examples 1.0-SNAPSHOT
>>> [INFO] [ jar
>>> ]-
>>> [INFO]
>>> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
>>> WARNING: An illegal reflective access operation has occurred
>>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>>> (file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
>>> to method java.nio.Bits.unaligned()
>>> WARNING: Please consider reporting this to the maintainers of
>>> org.apache.spark.unsafe.Platform
>>> 

Re: Hostname :BUG

2020-03-09 Thread Zahid Rahman
Hey floyd ,

I just realised something:
You need to practice using the adduser command to create users,
or in your case useradd, because that's less painful for you when creating a
user, instead of working as root.
Trust me, it is good for you.
Then you will realise that this bit of code, new SparkConf(), is reading the
IP address from /etc/hostname and not from the /etc/hosts file.

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



On Wed, 4 Mar 2020 at 21:14, Andrew Melo  wrote:

> Hello Zabid,
>
> On Wed, Mar 4, 2020 at 1:47 PM Zahid Rahman  wrote:
>
>> Hi,
>>
>> I found the problem was because on my  Linux   Operating System the
>> /etc/hostname was blank.
>>
>> *STEP 1*
>> I searched  on google the error message and there was an answer suggesting
>> I should add to /etc/hostname
>>
>> 127.0.0.1  [hostname] localhost.
>>
>
> I believe you've confused /etc/hostname and /etc/hosts --
>
>
>>
>> I did that but there was still  an error,  this time the spark  log in
>> standard output was concatenating the text content
>> of etc/hostname  like so ,   127.0.0.1[hostname]localhost.
>>
>> *STEP 2*
>> My second attempt was to change the /etc/hostname to 127.0.0.1
>> This time I was getting a warning with information about "using loop
>> back"  rather than an error.
>>
>> *STEP 3*
>> I wasn't happy with that so then I changed the /etc/hostname to (see
>> below) ,
>> then the warning message disappeared. my guess is that it is the act of
>> creating spark session as to the cause of error,
>> in SparkConf() API.
>>
>>  SparkConf sparkConf = new SparkConf()
>>  .setAppName("Simple Application")
>>  .setMaster("local")
>>  .set("spark.executor.memory","2g");
>>
>> $ cat /etc/hostname
>> # hosts This file describes a number of hostname-to-address
>> #   mappings for the TCP/IP subsystem.  It is mostly
>> #   used at boot time, when no name servers are running.
>> #   On small systems, this file can be used instead of a
>> #   "named" name server.
>> # Syntax:
>> #
>> # IP-Address  Full-Qualified-Hostname  Short-Hostname
>> #
>>
>> 192.168.0.42
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> zahid@localhost
>> :~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
>> mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD
>> -Dexec.args="input.txt"
>> [INFO] Scanning for projects...
>> [WARNING]
>> [WARNING] Some problems were encountered while building the effective
>> model for javacodegeek:examples:jar:1.0-SNAPSHOT
>> [WARNING] 'build.plugins.plugin.version' for
>> org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
>> column 21
>> [WARNING]
>> [WARNING] It is highly recommended to fix these problems because they
>> threaten the stability of your build.
>> [WARNING]
>> [WARNING] For this reason, future Maven versions might no longer support
>> building such malformed projects.
>> [WARNING]
>> [INFO]
>> [INFO] ---< javacodegeek:examples
>> >
>> [INFO] Building examples 1.0-SNAPSHOT
>> [INFO] [ jar
>> ]-
>> [INFO]
>> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
>> WARNING: An illegal reflective access operation has occurred
>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>> (file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
>> to method java.nio.Bits.unaligned()
>> WARNING: Please consider reporting this to the maintainers of
>> org.apache.spark.unsafe.Platform
>> WARNING: Use --illegal-access=warn to enable warnings of further illegal
>> reflective access operations
>> WARNING: All illegal access operations will be denied in a future release
>> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 20/02/29 17:20:40 INFO SparkContext: Running Spark version 2.4.5
>> 20/02/29 17:20:40 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 20/02/29 17:20:41 INFO SparkContext: Submitted application: Word Count
>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls to: zahid
>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls to: zahid
>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls groups to:
>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls groups to:
>> 20/02/29 17:20:41 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users  with view permissions: Set(zahid);
>> groups with view permissions: Set(); users  with modify permissions:
>> Set(zahid); groups with modify permissions: Set()
>> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
>> random free port. You may check whether configuring an appropriate binding
>> address.
>> 20/02/29 

Re: Hostname :BUG

2020-03-05 Thread Zahid Rahman
Talking about copy and paste
Larry Tesler The *inventor* of *cut*/*copy* & *paste*, find & replace
past away last week age 74.

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



On Thu, 5 Mar 2020 at 07:01, Zahid Rahman  wrote:

> Please explain why you think that if there is a different reason from this
> : -
>
> If you think that, because the header of /etc/hostname says hosts then
> that is because I copied the file header from /etc/hosts to  /etc/hostname.
>
>
>
>
> On Wed, 4 Mar 2020, 21:14 Andrew Melo,  wrote:
>
>> Hello Zabid,
>>
>> On Wed, Mar 4, 2020 at 1:47 PM Zahid Rahman  wrote:
>>
>>> Hi,
>>>
>>> I found the problem was because on my  Linux   Operating System the
>>> /etc/hostname was blank.
>>>
>>> *STEP 1*
>>> I searched  on google the error message and there was an answer
>>> suggesting
>>> I should add to /etc/hostname
>>>
>>> 127.0.0.1  [hostname] localhost.
>>>
>>
>> I believe you've confused /etc/hostname and /etc/hosts --
>>
>>
>>>
>>> I did that but there was still  an error,  this time the spark  log in
>>> standard output was concatenating the text content
>>> of etc/hostname  like so ,   127.0.0.1[hostname]localhost.
>>>
>>> *STEP 2*
>>> My second attempt was to change the /etc/hostname to 127.0.0.1
>>> This time I was getting a warning with information about "using loop
>>> back"  rather than an error.
>>>
>>> *STEP 3*
>>> I wasn't happy with that so then I changed the /etc/hostname to (see
>>> below) ,
>>> then the warning message disappeared. my guess is that it is the act of
>>> creating spark session as to the cause of error,
>>> in SparkConf() API.
>>>
>>>  SparkConf sparkConf = new SparkConf()
>>>  .setAppName("Simple Application")
>>>  .setMaster("local")
>>>  .set("spark.executor.memory","2g");
>>>
>>> $ cat /etc/hostname
>>> # hosts This file describes a number of hostname-to-address
>>> #   mappings for the TCP/IP subsystem.  It is mostly
>>> #   used at boot time, when no name servers are running.
>>> #   On small systems, this file can be used instead of a
>>> #   "named" name server.
>>> # Syntax:
>>> #
>>> # IP-Address  Full-Qualified-Hostname  Short-Hostname
>>> #
>>>
>>> 192.168.0.42
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> zahid@localhost
>>> :~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
>>> mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD
>>> -Dexec.args="input.txt"
>>> [INFO] Scanning for projects...
>>> [WARNING]
>>> [WARNING] Some problems were encountered while building the effective
>>> model for javacodegeek:examples:jar:1.0-SNAPSHOT
>>> [WARNING] 'build.plugins.plugin.version' for
>>> org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
>>> column 21
>>> [WARNING]
>>> [WARNING] It is highly recommended to fix these problems because they
>>> threaten the stability of your build.
>>> [WARNING]
>>> [WARNING] For this reason, future Maven versions might no longer support
>>> building such malformed projects.
>>> [WARNING]
>>> [INFO]
>>> [INFO] ---< javacodegeek:examples
>>> >
>>> [INFO] Building examples 1.0-SNAPSHOT
>>> [INFO] [ jar
>>> ]-
>>> [INFO]
>>> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
>>> WARNING: An illegal reflective access operation has occurred
>>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>>> (file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
>>> to method java.nio.Bits.unaligned()
>>> WARNING: Please consider reporting this to the maintainers of
>>> org.apache.spark.unsafe.Platform
>>> WARNING: Use --illegal-access=warn to enable warnings of further illegal
>>> reflective access operations
>>> WARNING: All illegal access operations will be denied in a future release
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>> 20/02/29 17:20:40 INFO SparkContext: Running Spark version 2.4.5
>>> 20/02/29 17:20:40 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>> 20/02/29 17:20:41 INFO SparkContext: Submitted application: Word Count
>>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls to: zahid
>>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls to: zahid
>>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls groups to:
>>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls groups to:
>>> 20/02/29 17:20:41 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users  with view permissions: Set(zahid);
>>> groups with view permissions: Set(); users  with modify permissions:
>>> Set(zahid); groups with modify permissions: Set()
>>> 

Re: Hostname :BUG

2020-03-04 Thread Zahid Rahman
Please explain why you think that if there is a different reason from this
: -

If you think that, because the header of /etc/hostname says hosts then that
is because I copied the file header from /etc/hosts to  /etc/hostname.




On Wed, 4 Mar 2020, 21:14 Andrew Melo,  wrote:

> Hello Zabid,
>
> On Wed, Mar 4, 2020 at 1:47 PM Zahid Rahman  wrote:
>
>> Hi,
>>
>> I found the problem was because on my  Linux   Operating System the
>> /etc/hostname was blank.
>>
>> *STEP 1*
>> I searched  on google the error message and there was an answer suggesting
>> I should add to /etc/hostname
>>
>> 127.0.0.1  [hostname] localhost.
>>
>
> I believe you've confused /etc/hostname and /etc/hosts --
>
>
>>
>> I did that but there was still  an error,  this time the spark  log in
>> standard output was concatenating the text content
>> of etc/hostname  like so ,   127.0.0.1[hostname]localhost.
>>
>> *STEP 2*
>> My second attempt was to change the /etc/hostname to 127.0.0.1
>> This time I was getting a warning with information about "using loop
>> back"  rather than an error.
>>
>> *STEP 3*
>> I wasn't happy with that so then I changed the /etc/hostname to (see
>> below) ,
>> then the warning message disappeared. my guess is that it is the act of
>> creating spark session as to the cause of error,
>> in SparkConf() API.
>>
>>  SparkConf sparkConf = new SparkConf()
>>  .setAppName("Simple Application")
>>  .setMaster("local")
>>  .set("spark.executor.memory","2g");
>>
>> $ cat /etc/hostname
>> # hosts This file describes a number of hostname-to-address
>> #   mappings for the TCP/IP subsystem.  It is mostly
>> #   used at boot time, when no name servers are running.
>> #   On small systems, this file can be used instead of a
>> #   "named" name server.
>> # Syntax:
>> #
>> # IP-Address  Full-Qualified-Hostname  Short-Hostname
>> #
>>
>> 192.168.0.42
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> zahid@localhost
>> :~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
>> mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD
>> -Dexec.args="input.txt"
>> [INFO] Scanning for projects...
>> [WARNING]
>> [WARNING] Some problems were encountered while building the effective
>> model for javacodegeek:examples:jar:1.0-SNAPSHOT
>> [WARNING] 'build.plugins.plugin.version' for
>> org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
>> column 21
>> [WARNING]
>> [WARNING] It is highly recommended to fix these problems because they
>> threaten the stability of your build.
>> [WARNING]
>> [WARNING] For this reason, future Maven versions might no longer support
>> building such malformed projects.
>> [WARNING]
>> [INFO]
>> [INFO] ---< javacodegeek:examples
>> >
>> [INFO] Building examples 1.0-SNAPSHOT
>> [INFO] [ jar
>> ]-
>> [INFO]
>> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
>> WARNING: An illegal reflective access operation has occurred
>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>> (file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
>> to method java.nio.Bits.unaligned()
>> WARNING: Please consider reporting this to the maintainers of
>> org.apache.spark.unsafe.Platform
>> WARNING: Use --illegal-access=warn to enable warnings of further illegal
>> reflective access operations
>> WARNING: All illegal access operations will be denied in a future release
>> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 20/02/29 17:20:40 INFO SparkContext: Running Spark version 2.4.5
>> 20/02/29 17:20:40 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 20/02/29 17:20:41 INFO SparkContext: Submitted application: Word Count
>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls to: zahid
>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls to: zahid
>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls groups to:
>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls groups to:
>> 20/02/29 17:20:41 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users  with view permissions: Set(zahid);
>> groups with view permissions: Set(); users  with modify permissions:
>> Set(zahid); groups with modify permissions: Set()
>> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
>> random free port. You may check whether configuring an appropriate binding
>> address.
>> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
>> random free port. You may check whether configuring an appropriate binding
>> address.
>> 20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
>> random free port. 

Hostname :BUG

2020-03-04 Thread Zahid Rahman
Hi,

I found that the problem was because, on my Linux operating system,
/etc/hostname was blank.

*STEP 1*
I searched on Google for the error message and there was an answer suggesting
I should add the following to /etc/hostname:

127.0.0.1  [hostname] localhost.

I did that, but there was still an error; this time the Spark log on standard
output was concatenating the text content of /etc/hostname,
like so: 127.0.0.1[hostname]localhost.

*STEP 2*
My second attempt was to change /etc/hostname to 127.0.0.1.
This time I was getting a warning about "using loopback"
rather than an error.

*STEP 3*
I wasn't happy with that, so I then changed /etc/hostname to the content shown
below, and the warning message disappeared. My guess is that the error is
caused by the act of creating the Spark session,
in the SparkConf() API.

 SparkConf sparkConf = new SparkConf()
 .setAppName("Simple Application")
 .setMaster("local")
 .set("spark.executor.memory","2g");

$ cat /etc/hostname
# hosts This file describes a number of hostname-to-address
#   mappings for the TCP/IP subsystem.  It is mostly
#   used at boot time, when no name servers are running.
#   On small systems, this file can be used instead of a
#   "named" name server.
# Syntax:
#
# IP-Address  Full-Qualified-Hostname  Short-Hostname
#

192.168.0.42












zahid@localhost:~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD
-Dexec.args="input.txt"
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model
for javacodegeek:examples:jar:1.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for
org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
column 21
[WARNING]
[WARNING] It is highly recommended to fix these problems because they
threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support
building such malformed projects.
[WARNING]
[INFO]
[INFO] ---< javacodegeek:examples
>
[INFO] Building examples 1.0-SNAPSHOT
[INFO] [ jar
]-
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
(file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
20/02/29 17:20:40 INFO SparkContext: Running Spark version 2.4.5
20/02/29 17:20:40 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
20/02/29 17:20:41 INFO SparkContext: Submitted application: Word Count
20/02/29 17:20:41 INFO SecurityManager: Changing view acls to: zahid
20/02/29 17:20:41 INFO SecurityManager: Changing modify acls to: zahid
20/02/29 17:20:41 INFO SecurityManager: Changing view acls groups to:
20/02/29 17:20:41 INFO SecurityManager: Changing modify acls groups to:
20/02/29 17:20:41 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(zahid);
groups with view permissions: Set(); users  with modify permissions:
Set(zahid); groups with modify permissions: Set()
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' could not bind on a
random free port. You may check whether configuring an appropriate binding
address.
20/02/29 17:20:41 WARN Utils: Service 'sparkDriver' 
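
An editorial aside for readers who hit the same repeated "Service 'sparkDriver'
could not bind on a random free port" warnings: the usual remedy (an assumption
offered here, not something confirmed in this thread) is to keep /etc/hostname
as a plain host name, map that name to an address in /etc/hosts, and, if binding
still fails, point the driver at an explicit bind address rather than putting an
IP address into /etc/hostname. A minimal Scala sketch:

import org.apache.spark.SparkConf

// Assumes /etc/hostname holds only a host name and /etc/hosts maps it to an IP;
// the bindAddress/host settings below are the explicit fallback
val sparkConf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[*]")
  .set("spark.executor.memory", "2g")
  .set("spark.driver.bindAddress", "127.0.0.1") // address the driver binds its services to
  .set("spark.driver.host", "127.0.0.1")        // address advertised to executors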

[Spark SQL]: Dataframe group by potential bug (Scala)

2019-10-31 Thread ludwiggj
This is using Spark Scala 2.4.4. I'm getting some very strange behaviour
after reading in a dataframe from a json file, using sparkSession.read in
permissive mode. I've included the error column when reading in the data, as
I want to log details of any errors in the input json file.

My suspicion is that I've found a bug in Spark, though I'm happy to be
wrong. I can't find any reference to this issue online.

*Given this schema:*

val salesSchema = StructType(Seq(
  StructField("shopId", LongType, nullable = false),
  StructField("game", StringType, nullable = false),
  StructField("sales", LongType, nullable = false),
  StructField("_corrupt_record", StringType)
))

*I'm reading in this file:*

{"shopId": 1, "game":  "Monopoly", "sales": 60}
{"shopId": 1, "game":  "Cleudo", "sales": 25}
{"shopId": 2, "game":  "Monopoly", "sales": 40}
{"shopId": "err", "game":  "Cleudo", "sales": 75}

Note that the last line has a deliberate error on the shopId field.

*I read in the data:*

val inputDataDF = sparkSession.read
  .schema(salesSchema)
  .option("mode", "PERMISSIVE")
  .json(filePath)

*On displaying it:*

+------+--------+-----+--------------------------------------------------+
|shopId|game    |sales|_corrupt_record                                   |
+------+--------+-----+--------------------------------------------------+
|1     |Monopoly|60   |null                                              |
|1     |Cleudo  |25   |null                                              |
|2     |Monopoly|40   |null                                              |
|null  |null    |null |{"shopId": "err", "game":  "Cleudo", "sales": 75} |
+------+--------+-----+--------------------------------------------------+

*I then filter out the failures:*

val validSales = inputDataDF.filter(col("_corrupt_record").isNull)

*I use a group by to sum the sales per game:*

val incorrectReportDF = validSales.groupBy("game")
  .agg(
count(col("game")),
sum(col("sales")) as "salesTotal"
  ).sort("game")

*The result is incorrect:*

+--------+-----------+----------+
|game    |count(game)|salesTotal|
+--------+-----------+----------+
|Cleudo  |2          |100       |
|Monopoly|2          |100       |
+--------+-----------+----------+

The Cleudo sales should only be 25, but the count column shows that the
erroneous record has been counted too. Since the sales of the error record
are 75, the incorrect total is 100.

*If I change the groupBy statement to collect all the records contributing
to each group, I get a different result:*

 val reportDF = validSales.groupBy("game")
  .agg(
count(col("game")),
sum(col("sales")) as "salesTotal",
collect_list(struct("*")).as("allRecords")
  ).sort("game")

+--------+-----------+----------+----------------------------------------+
|game    |count(game)|salesTotal|allRecords                              |
+--------+-----------+----------+----------------------------------------+
|Cleudo  |1          |25        |[[1, Cleudo, 25,]]                      |
|Monopoly|2          |100       |[[1, Monopoly, 60,], [2, Monopoly, 40,]]|
+--------+-----------+----------+----------------------------------------+

The salesTotal is now correct. However, if I then process this dataframe
further, for example by dropping the allRecords column, or converting it to
a DataSet based on a simple case class, the salesTotals revert to the
incorrect values.

The only reliable way I've found to handle this is to process the allRecords
column via an explode, and then group the resulting records again.

*In a single statement:*

val allInOneReport = validSales.groupBy("game")
  .agg(
collect_list(struct("*")).as("allRecords")
  )
  .select(explode($"allRecords"))
  .select($"col.game", $"col.sales")
  .groupBy("game")
  .agg(
sum(col("sales")) as "salesTotal"
  )
  .sort("game")

+--------+----------+
|game    |salesTotal|
+--------+----------+
|Cleudo  |25        |
|Monopoly|100       |
+--------+----------+

I've created a gist
<https://gist.github.com/ludwiggj/1fc3ac09ca698e22143e824c683e2394> with
all the code and the output.

Thanks,

Graeme.
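
An editorial note on a possible workaround (an assumption based on how
permissive-mode JSON parsing defers work on the _corrupt_record column; not
verified against this exact data): caching the parsed DataFrame before filtering
on _corrupt_record tends to make later aggregations see a consistent set of
rows. A minimal Scala sketch reusing the names from the mail above:

// Cache the parsed result so _corrupt_record is materialised once,
// then filter out the corrupt rows before aggregating
val inputDataDF = sparkSession.read
  .schema(salesSchema)
  .option("mode", "PERMISSIVE")
  .json(filePath)
  .cache()

val validSales = inputDataDF.filter(col("_corrupt_record").isNull)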






Dataset schema incompatibility bug when reading column partitioned data

2019-03-29 Thread Dávid Szakállas
We observed the following bug on Spark 2.4.0:

scala> 
spark.createDataset(Seq((1,2))).write.partitionBy("_1").parquet("foo.parquet")

scala> val schema = StructType(Seq(StructField("_1", 
IntegerType),StructField("_2", IntegerType)))

scala> spark.read.schema(schema).parquet("foo.parquet").as[(Int, Int)].show
+---+---+
| _2| _1|
+---+---+
|  2|  1|
+---+---+

That is, when reading column-partitioned Parquet files, the explicitly specified 
schema is not adhered to; instead, the partitioning columns are appended to the end 
of the column list. This is quite a severe issue, as some operations, such as 
union, fail if columns are in a different order in the two datasets. Thus we have 
to work around the issue with a select:

val columnNames = schema.fields.map(_.name)
ds.select(columnNames.head, columnNames.tail: _*)
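
A small helper along the same lines (a sketch added here, not from the original
mail) that realigns any DataFrame to the column order of a target schema:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Reorder the columns of df to match the field order of the given schema
def alignTo(schema: StructType)(df: DataFrame): DataFrame =
  df.select(schema.fieldNames.map(col): _*)

// e.g. alignTo(schema)(spark.read.schema(schema).parquet("foo.parquet"))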


Thanks, 
David Szakallas
Data Engineer | Whitepages, Inc.

Re: Bug in Window Function

2018-07-25 Thread Jacek Laskowski
Hi Elior,

Could you show the query that led to the exception?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Wed, Jul 25, 2018 at 10:04 AM, Elior Malul  wrote:

> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> collect_set(named_struct(value, country#123 AS value#346, count,
> (cast(count(country#123) windowspecdefinit ion(campaign_id#104,
> app_id#93, country#123, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
> FOLLOWING) as double) / cast(count(1) windowspecdefinition(campaign_id#104,
> app_id #93, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
> as double)) AS count#349) AS histogram_country#350, 0, 0)
> windowspecdefinition(campaign_id#104, app_id#93, ROWS  BETWEEN
> UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS 
> collect_set(named_struct(NamePlaceholder(),
> country AS `value`, NamePlaceholder(), (CAST(count(country) OVER (PARTITI
>   ON BY campaign_id, app_id, country UnspecifiedFrame) AS DOUBLE) /
> CAST(count(1) OVER (PARTITION BY campaign_id, app_id UnspecifiedFrame) AS
> DOUBLE)) AS `count`) AS `histogram _country`) OVER (PARTITION BY
> campaign_id, app_id UnspecifiedFrame)#352 has multiple Window
> Specifications (ArrayBuffer(windowspecdefinition(campaign_id#104,
> app_id#93, ROWS  BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
> windowspecdefinition(campaign_id#104, app_id#93, country#123, ROWS
> BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)) ).Please file a
> bug report with this error message, stack trace, and the query.;
>


Bug in Window Function

2018-07-25 Thread Elior Malul
Exception in thread "main" org.apache.spark.sql.AnalysisException:
collect_set(named_struct(value, country#123 AS value#346, count,
(cast(count(country#123) windowspecdefinition(campaign_id#104, app_id#93,
country#123, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as
double) / cast(count(1) windowspecdefinition(campaign_id#104, app_id#93,
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as double)) AS
count#349) AS histogram_country#350, 0, 0)
windowspecdefinition(campaign_id#104, app_id#93, ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING) AS
collect_set(named_struct(NamePlaceholder(), country AS `value`,
NamePlaceholder(), (CAST(count(country) OVER (PARTITION BY campaign_id,
app_id, country UnspecifiedFrame) AS DOUBLE) / CAST(count(1) OVER
(PARTITION BY campaign_id, app_id UnspecifiedFrame) AS DOUBLE)) AS `count`)
AS `histogram_country`) OVER (PARTITION BY campaign_id, app_id
UnspecifiedFrame)#352 has multiple Window Specifications
(ArrayBuffer(windowspecdefinition(campaign_id#104, app_id#93, ROWS BETWEEN
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
windowspecdefinition(campaign_id#104, app_id#93, country#123, ROWS BETWEEN
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))). Please file a bug report
with this error message, stack trace, and the query.;
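
Since the query itself is not shown in the thread, only a general sketch can be
offered here (the DataFrame name df and the column layout are assumptions read
off the error message): materialising the inner windowed ratio as its own column
first, and only then applying the outer collect_set window over that column,
avoids nesting one window specification inside another, which is what the
"multiple Window Specifications" analysis error usually points to.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Two explicit window specs, matching the partitions in the error message
val byCampaignAppCountry = Window.partitionBy("campaign_id", "app_id", "country")
val byCampaignApp        = Window.partitionBy("campaign_id", "app_id")

// Step 1: materialise the per-country share as an ordinary column
val withShare = df.withColumn(
  "country_share",
  count("country").over(byCampaignAppCountry).cast("double") /
    count(lit(1)).over(byCampaignApp).cast("double"))

// Step 2: aggregate the materialised column over the outer window
val result = withShare.withColumn(
  "histogram_country",
  collect_set(struct(col("country").as("value"), col("country_share").as("count")))
    .over(byCampaignApp))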

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
Here it is :
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2991198123660769/823198936734135/866038034322120/latest.html


On Wed, Apr 11, 2018 at 10:55 AM, Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hi Shiyuan,
> can you show us the output of ¨explain¨ over df (as a last step)?
>
> On 11 April 2018 at 19:47, Shiyuan <gshy2...@gmail.com> wrote:
>
>> Variable name binding is a python thing, and Spark should not care how
>> the variable is named. What matters is the dependency graph. Spark fails to
>> handle this dependency graph correctly for which I am quite surprised: this
>> is just a simple combination of three very common sql operations.
>>
>>
>> On Tue, Apr 10, 2018 at 9:03 PM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi Shiyuan,
>>>
>>> I do not know whether I am right, but I would prefer to avoid
>>> expressions in Spark as:
>>>
>>> df = <>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, Apr 10, 2018 at 10:42 PM, Shiyuan <gshy2...@gmail.com> wrote:
>>>
>>>> Here is the pretty print of the physical plan which reveals some
>>>> details about what causes the bug (see the lines highlighted in bold):
>>>> WithColumnRenamed() fails to update the dependency graph correctly:
>>>>
>>>>
>>>> 'Resolved attribute(s) kk#144L missing from
>>>> ID#118,LABEL#119,kk#96L,score#121 in operator !Project [ID#118,
>>>> score#121, LABEL#119, kk#144L]. Attribute(s) with the same name appear in
>>>> the operation: kk. Please check if the right attribute(s) are used
>>>>
>>>> Project [ID#64, kk#73L, score#67, LABEL#65, cnt1#123L]
>>>> +- Join Inner, ((ID#64 = ID#135) && (kk#73L = kk#128L))
>>>>:- Project [ID#64, score#67, LABEL#65, kk#73L]
>>>>:  +- Join Inner, (ID#64 = ID#99)
>>>>: :- Project [ID#64, score#67, LABEL#65, kk#73L]
>>>>: :  +- Project [ID#64, LABEL#65, k#66L AS kk#73L, score#67]
>>>>: : +- LogicalRDD [ID#64, LABEL#65, k#66L, score#67]
>>>>: +- Project [ID#99]
>>>>:+- Filter (nL#90L > cast(1 as bigint))
>>>>:   +- Aggregate [ID#99], [ID#99, count(distinct LABEL#100)
>>>> AS nL#90L]
>>>>:  +- Project [ID#99, score#102, LABEL#100, kk#73L]
>>>>: +- Project [ID#99, LABEL#100, k#101L AS kk#73L,
>>>> score#102]
>>>>:+- LogicalRDD [ID#99, LABEL#100, k#101L,
>>>> score#102]
>>>>+- Project [ID#135, kk#128L, count#118L AS cnt1#123L]
>>>>   +- Aggregate [ID#135, kk#128L], [ID#135, kk#128L, count(1) AS
>>>> count#118L]
>>>>  +- Project [ID#135, score#138, LABEL#136, kk#128L]
>>>> +- Join Inner, (ID#135 = ID#99)
>>>>:- Project [ID#135, score#138, LABEL#136, kk#128L]
>>>>:  +- *Project [ID#135, LABEL#136, k#137L AS kk#128L,
>>>> score#138]*
>>>>: +- LogicalRDD [ID#135, LABEL#136, k#137L,
>>>> score#138]
>>>>+- Project [ID#99]
>>>>   +- Filter (nL#90L > cast(1 as bigint))
>>>>  +- Aggregate [ID#99], [ID#99, count(distinct
>>>> LABEL#100) AS nL#90L]
>>>> +- *!Project [ID#99, score#102, LABEL#100,
>>>> kk#128L]*
>>>>+-* Project [ID#99, LABEL#100, k#101L AS
>>>> kk#73L, score#102]*
>>>>   +- LogicalRDD [ID#99, LABEL#100, k#101L,
>>>> score#102]
>>>>
>>>> Here is the code which generates the error:
>>>>
>>>> import pyspark.sql.functions as F
>>>> from pyspark.sql import Row
>>>> df = spark.createDataFrame([Row(score=1.0,ID='abc',LABEL=True,k=2
>>>> ),Row(score=1.0,ID='abc',LABEL=False,k=3)]).withColumnRename
>>>> d("k","kk").select("ID","score","LABEL","kk")
>>>> df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).f
>>>> ilter(F.col("nL")>1)
>>>> df = df.join(df_t.select("ID"),["ID"])
>>>> df_sw = df.groupby(["ID","kk"]).count().wi

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Alessandro Solimando
Hi Shiyuan,
can you show us the output of ¨explain¨ over df (as a last step)?

On 11 April 2018 at 19:47, Shiyuan <gshy2...@gmail.com> wrote:

> Variable name binding is a python thing, and Spark should not care how the
> variable is named. What matters is the dependency graph. Spark fails to
> handle this dependency graph correctly for which I am quite surprised: this
> is just a simple combination of three very common sql operations.
>
>
> On Tue, Apr 10, 2018 at 9:03 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi Shiyuan,
>>
>> I do not know whether I am right, but I would prefer to avoid expressions
>> in Spark as:
>>
>> df = <>
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Apr 10, 2018 at 10:42 PM, Shiyuan <gshy2...@gmail.com> wrote:
>>
>>> Here is the pretty print of the physical plan which reveals some details
>>> about what causes the bug (see the lines highlighted in bold):
>>> WithColumnRenamed() fails to update the dependency graph correctly:
>>>
>>>
>>> 'Resolved attribute(s) kk#144L missing from
>>> ID#118,LABEL#119,kk#96L,score#121 in operator !Project [ID#118,
>>> score#121, LABEL#119, kk#144L]. Attribute(s) with the same name appear in
>>> the operation: kk. Please check if the right attribute(s) are used
>>>
>>> Project [ID#64, kk#73L, score#67, LABEL#65, cnt1#123L]
>>> +- Join Inner, ((ID#64 = ID#135) && (kk#73L = kk#128L))
>>>:- Project [ID#64, score#67, LABEL#65, kk#73L]
>>>:  +- Join Inner, (ID#64 = ID#99)
>>>: :- Project [ID#64, score#67, LABEL#65, kk#73L]
>>>: :  +- Project [ID#64, LABEL#65, k#66L AS kk#73L, score#67]
>>>: : +- LogicalRDD [ID#64, LABEL#65, k#66L, score#67]
>>>: +- Project [ID#99]
>>>:+- Filter (nL#90L > cast(1 as bigint))
>>>:   +- Aggregate [ID#99], [ID#99, count(distinct LABEL#100)
>>> AS nL#90L]
>>>:  +- Project [ID#99, score#102, LABEL#100, kk#73L]
>>>: +- Project [ID#99, LABEL#100, k#101L AS kk#73L,
>>> score#102]
>>>:+- LogicalRDD [ID#99, LABEL#100, k#101L,
>>> score#102]
>>>+- Project [ID#135, kk#128L, count#118L AS cnt1#123L]
>>>   +- Aggregate [ID#135, kk#128L], [ID#135, kk#128L, count(1) AS
>>> count#118L]
>>>  +- Project [ID#135, score#138, LABEL#136, kk#128L]
>>> +- Join Inner, (ID#135 = ID#99)
>>>:- Project [ID#135, score#138, LABEL#136, kk#128L]
>>>:  +- *Project [ID#135, LABEL#136, k#137L AS kk#128L,
>>> score#138]*
>>>: +- LogicalRDD [ID#135, LABEL#136, k#137L, score#138]
>>>+- Project [ID#99]
>>>   +- Filter (nL#90L > cast(1 as bigint))
>>>  +- Aggregate [ID#99], [ID#99, count(distinct
>>> LABEL#100) AS nL#90L]
>>> +- *!Project [ID#99, score#102, LABEL#100,
>>> kk#128L]*
>>>+-* Project [ID#99, LABEL#100, k#101L AS
>>> kk#73L, score#102]*
>>>   +- LogicalRDD [ID#99, LABEL#100, k#101L,
>>> score#102]
>>>
>>> Here is the code which generates the error:
>>>
>>> import pyspark.sql.functions as F
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([Row(score=1.0,ID='abc',LABEL=True,k=2
>>> ),Row(score=1.0,ID='abc',LABEL=False,k=3)]).withColumnRename
>>> d("k","kk").select("ID","score","LABEL","kk")
>>> df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).f
>>> ilter(F.col("nL")>1)
>>> df = df.join(df_t.select("ID"),["ID"])
>>> df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count",
>>> "cnt1")
>>> df = df.join(df_sw, ["ID","kk"])
>>>
>>>
>>> On Tue, Apr 10, 2018 at 1:37 PM, Shiyuan <gshy2...@gmail.com> wrote:
>>>
>>>> The spark warning about Row instead of Dict is not the culprit. The
>>>> problem still persists after I use Row instead of Dict to generate the
>>>> dataframe.
>>>>
>>>> Here is the expain() output regarding the reassignment of df as Gourav
>>>> suggests to run, They look the same except that  the

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
Variable name binding is a python thing, and Spark should not care how the
variable is named. What matters is the dependency graph. Spark fails to
handle this dependency graph correctly for which I am quite surprised: this
is just a simple combination of three very common sql operations.


On Tue, Apr 10, 2018 at 9:03 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi Shiyuan,
>
> I do not know whether I am right, but I would prefer to avoid expressions
> in Spark as:
>
> df = <>
>
>
> Regards,
> Gourav Sengupta
>
> On Tue, Apr 10, 2018 at 10:42 PM, Shiyuan <gshy2...@gmail.com> wrote:
>
>> Here is the pretty print of the physical plan which reveals some details
>> about what causes the bug (see the lines highlighted in bold):
>> WithColumnRenamed() fails to update the dependency graph correctly:
>>
>>
>> 'Resolved attribute(s) kk#144L missing from ID#118,LABEL#119,kk#96L,score#121
>> in operator !Project [ID#118, score#121, LABEL#119, kk#144L]. Attribute(s)
>> with the same name appear in the operation: kk. Please check if the right
>> attribute(s) are used
>>
>> Project [ID#64, kk#73L, score#67, LABEL#65, cnt1#123L]
>> +- Join Inner, ((ID#64 = ID#135) && (kk#73L = kk#128L))
>>:- Project [ID#64, score#67, LABEL#65, kk#73L]
>>:  +- Join Inner, (ID#64 = ID#99)
>>: :- Project [ID#64, score#67, LABEL#65, kk#73L]
>>: :  +- Project [ID#64, LABEL#65, k#66L AS kk#73L, score#67]
>>: : +- LogicalRDD [ID#64, LABEL#65, k#66L, score#67]
>>: +- Project [ID#99]
>>:+- Filter (nL#90L > cast(1 as bigint))
>>:   +- Aggregate [ID#99], [ID#99, count(distinct LABEL#100) AS
>> nL#90L]
>>:  +- Project [ID#99, score#102, LABEL#100, kk#73L]
>>: +- Project [ID#99, LABEL#100, k#101L AS kk#73L,
>> score#102]
>>:+- LogicalRDD [ID#99, LABEL#100, k#101L,
>> score#102]
>>+- Project [ID#135, kk#128L, count#118L AS cnt1#123L]
>>   +- Aggregate [ID#135, kk#128L], [ID#135, kk#128L, count(1) AS
>> count#118L]
>>  +- Project [ID#135, score#138, LABEL#136, kk#128L]
>> +- Join Inner, (ID#135 = ID#99)
>>:- Project [ID#135, score#138, LABEL#136, kk#128L]
>>:  +- *Project [ID#135, LABEL#136, k#137L AS kk#128L,
>> score#138]*
>>: +- LogicalRDD [ID#135, LABEL#136, k#137L, score#138]
>>+- Project [ID#99]
>>   +- Filter (nL#90L > cast(1 as bigint))
>>  +- Aggregate [ID#99], [ID#99, count(distinct
>> LABEL#100) AS nL#90L]
>> +- *!Project [ID#99, score#102, LABEL#100,
>> kk#128L]*
>>+-* Project [ID#99, LABEL#100, k#101L AS
>> kk#73L, score#102]*
>>   +- LogicalRDD [ID#99, LABEL#100, k#101L,
>> score#102]
>>
>> Here is the code which generates the error:
>>
>> import pyspark.sql.functions as F
>> from pyspark.sql import Row
>> df = spark.createDataFrame([Row(score=1.0,ID='abc',LABEL=True,k=
>> 2),Row(score=1.0,ID='abc',LABEL=False,k=3)]).withColumnRenam
>> ed("k","kk").select("ID","score","LABEL","kk")
>> df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).
>> filter(F.col("nL")>1)
>> df = df.join(df_t.select("ID"),["ID"])
>> df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count",
>> "cnt1")
>> df = df.join(df_sw, ["ID","kk"])
>>
>>
>> On Tue, Apr 10, 2018 at 1:37 PM, Shiyuan <gshy2...@gmail.com> wrote:
>>
>>> The spark warning about Row instead of Dict is not the culprit. The
>>> problem still persists after I use Row instead of Dict to generate the
>>> dataframe.
>>>
>>> Here is the expain() output regarding the reassignment of df as Gourav
>>> suggests to run, They look the same except that  the serial numbers
>>> following the columns are different(eg. ID#7273 vs. ID#7344).
>>>
>>> this is the output of df.explain() after df =
>>> df.join(df_t.select("ID"),["ID"])
>>> == Physical Plan == *(6) Project [ID#7273, score#7276, LABEL#7274,
>>> kk#7281L] +- *(6) SortMergeJoin [ID#7273], [ID#7303], Inner :- *(2) Sort
>>> [ID#7273 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(ID#7273,
>>> 200) : +- *(1) Proj

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Gourav Sengupta
Hi Shiyuan,

I do not know whether I am right, but I would prefer to avoid expressions
in Spark as:

df = <>


Regards,
Gourav Sengupta

On Tue, Apr 10, 2018 at 10:42 PM, Shiyuan <gshy2...@gmail.com> wrote:

> Here is the pretty print of the physical plan which reveals some details
> about what causes the bug (see the lines highlighted in bold):
> WithColumnRenamed() fails to update the dependency graph correctly:
>
>
> 'Resolved attribute(s) kk#144L missing from ID#118,LABEL#119,kk#96L,score#121
> in operator !Project [ID#118, score#121, LABEL#119, kk#144L]. Attribute(s)
> with the same name appear in the operation: kk. Please check if the right
> attribute(s) are used
>
> Project [ID#64, kk#73L, score#67, LABEL#65, cnt1#123L]
> +- Join Inner, ((ID#64 = ID#135) && (kk#73L = kk#128L))
>:- Project [ID#64, score#67, LABEL#65, kk#73L]
>:  +- Join Inner, (ID#64 = ID#99)
>: :- Project [ID#64, score#67, LABEL#65, kk#73L]
>: :  +- Project [ID#64, LABEL#65, k#66L AS kk#73L, score#67]
>: : +- LogicalRDD [ID#64, LABEL#65, k#66L, score#67]
>: +- Project [ID#99]
>:+- Filter (nL#90L > cast(1 as bigint))
>:   +- Aggregate [ID#99], [ID#99, count(distinct LABEL#100) AS
> nL#90L]
>:  +- Project [ID#99, score#102, LABEL#100, kk#73L]
>: +- Project [ID#99, LABEL#100, k#101L AS kk#73L,
> score#102]
>:+- LogicalRDD [ID#99, LABEL#100, k#101L, score#102]
>+- Project [ID#135, kk#128L, count#118L AS cnt1#123L]
>   +- Aggregate [ID#135, kk#128L], [ID#135, kk#128L, count(1) AS
> count#118L]
>  +- Project [ID#135, score#138, LABEL#136, kk#128L]
> +- Join Inner, (ID#135 = ID#99)
>:- Project [ID#135, score#138, LABEL#136, kk#128L]
>:  +- *Project [ID#135, LABEL#136, k#137L AS kk#128L,
> score#138]*
>: +- LogicalRDD [ID#135, LABEL#136, k#137L, score#138]
>+- Project [ID#99]
>   +- Filter (nL#90L > cast(1 as bigint))
>  +- Aggregate [ID#99], [ID#99, count(distinct
> LABEL#100) AS nL#90L]
> +- *!Project [ID#99, score#102, LABEL#100,
> kk#128L]*
>+-* Project [ID#99, LABEL#100, k#101L AS
> kk#73L, score#102]*
>   +- LogicalRDD [ID#99, LABEL#100, k#101L,
> score#102]
>
> Here is the code which generates the error:
>
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> df = spark.createDataFrame([Row(score=1.0,ID='abc',LABEL=True,
> k=2),Row(score=1.0,ID='abc',LABEL=False,k=3)]).
> withColumnRenamed("k","kk").select("ID","score","LABEL","kk")
> df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("
> nL")).filter(F.col("nL")>1)
> df = df.join(df_t.select("ID"),["ID"])
> df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count", "cnt1")
> df = df.join(df_sw, ["ID","kk"])
>
>
> On Tue, Apr 10, 2018 at 1:37 PM, Shiyuan <gshy2...@gmail.com> wrote:
>
>> The spark warning about Row instead of Dict is not the culprit. The
>> problem still persists after I use Row instead of Dict to generate the
>> dataframe.
>>
>> Here is the expain() output regarding the reassignment of df as Gourav
>> suggests to run, They look the same except that  the serial numbers
>> following the columns are different(eg. ID#7273 vs. ID#7344).
>>
>> this is the output of df.explain() after df =
>> df.join(df_t.select("ID"),["ID"])
>> == Physical Plan == *(6) Project [ID#7273, score#7276, LABEL#7274,
>> kk#7281L] +- *(6) SortMergeJoin [ID#7273], [ID#7303], Inner :- *(2) Sort
>> [ID#7273 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(ID#7273,
>> 200) : +- *(1) Project [ID#7273, score#7276, LABEL#7274, k#7275L AS
>> kk#7281L] : +- *(1) Filter isnotnull(ID#7273) : +- *(1) Scan
>> ExistingRDD[ID#7273,LABEL#7274,k#7275L,score#7276] +- *(5) Sort [ID#7303
>> ASC NULLS FIRST], false, 0 +- *(5) Project [ID#7303] +- *(5) Filter
>> (nL#7295L > 1) +- *(5) HashAggregate(keys=[ID#7303],
>> functions=[finalmerge_count(distinct merge count#7314L) AS
>> count(LABEL#7304)#7294L]) +- Exchange hashpartitioning(ID#7303, 200) +-
>> *(4) HashAggregate(keys=[ID#7303], functions=[partial_count(distinct
>> LABEL#7304) AS count#7314L]) +- *(4) HashAggregate(keys=[ID#7303,
>> LABEL#7304], functions=[]) +- Exchange hashpartitioning(ID#7303,
>> LABEL#7304, 200) +- *(3) HashAgg

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
Here is the pretty print of the physical plan which reveals some details
about what causes the bug (see the lines highlighted in bold):
WithColumnRenamed() fails to update the dependency graph correctly:


'Resolved attribute(s) kk#144L missing from ID#118,LABEL#119,kk#96L,score#121
in operator !Project [ID#118, score#121, LABEL#119, kk#144L]. Attribute(s)
with the same name appear in the operation: kk. Please check if the right
attribute(s) are used

Project [ID#64, kk#73L, score#67, LABEL#65, cnt1#123L]
+- Join Inner, ((ID#64 = ID#135) && (kk#73L = kk#128L))
   :- Project [ID#64, score#67, LABEL#65, kk#73L]
   :  +- Join Inner, (ID#64 = ID#99)
   : :- Project [ID#64, score#67, LABEL#65, kk#73L]
   : :  +- Project [ID#64, LABEL#65, k#66L AS kk#73L, score#67]
   : : +- LogicalRDD [ID#64, LABEL#65, k#66L, score#67]
   : +- Project [ID#99]
   :+- Filter (nL#90L > cast(1 as bigint))
   :   +- Aggregate [ID#99], [ID#99, count(distinct LABEL#100) AS
nL#90L]
   :  +- Project [ID#99, score#102, LABEL#100, kk#73L]
   : +- Project [ID#99, LABEL#100, k#101L AS kk#73L,
score#102]
   :+- LogicalRDD [ID#99, LABEL#100, k#101L, score#102]
   +- Project [ID#135, kk#128L, count#118L AS cnt1#123L]
  +- Aggregate [ID#135, kk#128L], [ID#135, kk#128L, count(1) AS
count#118L]
 +- Project [ID#135, score#138, LABEL#136, kk#128L]
+- Join Inner, (ID#135 = ID#99)
   :- Project [ID#135, score#138, LABEL#136, kk#128L]
   :  +- *Project [ID#135, LABEL#136, k#137L AS kk#128L,
score#138]*
   : +- LogicalRDD [ID#135, LABEL#136, k#137L, score#138]
   +- Project [ID#99]
  +- Filter (nL#90L > cast(1 as bigint))
 +- Aggregate [ID#99], [ID#99, count(distinct
LABEL#100) AS nL#90L]
+- *!Project [ID#99, score#102, LABEL#100, kk#128L]*
   +-* Project [ID#99, LABEL#100, k#101L AS kk#73L,
score#102]*
  +- LogicalRDD [ID#99, LABEL#100, k#101L,
score#102]

Here is the code which generates the error:

import pyspark.sql.functions as F
from pyspark.sql import Row
df =
spark.createDataFrame([Row(score=1.0,ID='abc',LABEL=True,k=2),Row(score=1.0,ID='abc',LABEL=False,k=3)]).withColumnRenamed("k","kk").select("ID","score","LABEL","kk")
df_t =
df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).filter(F.col("nL")>1)
df = df.join(df_t.select("ID"),["ID"])
df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count", "cnt1")
df = df.join(df_sw, ["ID","kk"])


On Tue, Apr 10, 2018 at 1:37 PM, Shiyuan <gshy2...@gmail.com> wrote:

> The spark warning about Row instead of Dict is not the culprit. The
> problem still persists after I use Row instead of Dict to generate the
> dataframe.
>
> Here is the expain() output regarding the reassignment of df as Gourav
> suggests to run, They look the same except that  the serial numbers
> following the columns are different(eg. ID#7273 vs. ID#7344).
>
> this is the output of df.explain() after df = df.join(df_t.select("ID"),["
> ID"])
> == Physical Plan == *(6) Project [ID#7273, score#7276, LABEL#7274,
> kk#7281L] +- *(6) SortMergeJoin [ID#7273], [ID#7303], Inner :- *(2) Sort
> [ID#7273 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(ID#7273,
> 200) : +- *(1) Project [ID#7273, score#7276, LABEL#7274, k#7275L AS
> kk#7281L] : +- *(1) Filter isnotnull(ID#7273) : +- *(1) Scan
> ExistingRDD[ID#7273,LABEL#7274,k#7275L,score#7276] +- *(5) Sort [ID#7303
> ASC NULLS FIRST], false, 0 +- *(5) Project [ID#7303] +- *(5) Filter
> (nL#7295L > 1) +- *(5) HashAggregate(keys=[ID#7303],
> functions=[finalmerge_count(distinct merge count#7314L) AS
> count(LABEL#7304)#7294L]) +- Exchange hashpartitioning(ID#7303, 200) +-
> *(4) HashAggregate(keys=[ID#7303], functions=[partial_count(distinct
> LABEL#7304) AS count#7314L]) +- *(4) HashAggregate(keys=[ID#7303,
> LABEL#7304], functions=[]) +- Exchange hashpartitioning(ID#7303,
> LABEL#7304, 200) +- *(3) HashAggregate(keys=[ID#7303, LABEL#7304],
> functions=[]) +- *(3) Project [ID#7303, LABEL#7304] +- *(3) Filter
> isnotnull(ID#7303) +- *(3) Scan 
> ExistingRDD[ID#7303,LABEL#7304,k#7305L,score#7306]
>
>
> In comparison, this is the output of df1.explain() after  df1 =
> df.join(df_t.select("ID"),["ID"])?
> == Physical Plan == *(6) Project [ID#7344, score#7347, LABEL#7345,
> kk#7352L] +- *(6) SortMergeJoin [ID#7344], [ID#7374], Inner :- *(2) Sort
> [ID#7344 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(ID#7344,
> 200) : +- *(1) Project [ID#7344, sco

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
EL#151, k#152L,
score#153], false\n +- Project [ID#118]\n +- Filter (nL#110L > cast(1 as
bigint))\n +- Aggregate [ID#118], [ID#118, count(distinct LABEL#119) AS
nL#110L]\n +- !Project [ID#118, score#121, LABEL#119, kk#144L]\n +- Project
[ID#118, LABEL#119, k#120L AS kk#96L, score#121]\n +- LogicalRDD [ID#118,
LABEL#119, k#120L, score#121], false\n'




On Mon, Apr 9, 2018 at 3:21 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
>
> what I am curious about is the reassignment of df.
>
> Can you please look into the explain plan of df after the statement df =
> df.join(df_t.select("ID"),["ID"])? And then compare with the explain plan
> of df1 after the statement df1 = df.join(df_t.select("ID"),["ID"])?
>
> Its late here, but I am yet to go through this completely.  But I think
> that SPARK does throw a warning mentioning us to use Row instead of
> Dictionary.
>
> It will be of help if you could kindly try using the below statement and
> go through your used case once again (I am yet to go through all the lines):
>
>
>
> from pyspark.sql import Row
>
> df = spark.createDataFrame([Row(score = 1.0,ID="abc",LABEL=True,k=2),
> Row(score = 1.0,ID="abc",LABEL=True,k=3)])
>
> Regards,
> Gourav Sengupta
>
>
> On Mon, Apr 9, 2018 at 6:50 PM, Shiyuan <gshy2...@gmail.com> wrote:
>
>> Hi Spark Users,
>> The following code snippet has an "attribute missing" error while the
>> attribute exists.  This bug is  triggered by a particular sequence of of
>> "select", "groupby" and "join".  Note that if I take away the "select"  in
>> #line B,  the code runs without error.   However, the "select" in #line B
>> includes all columns in the dataframe and hence should  not affect the
>> final result.
>>
>>
>> import pyspark.sql.functions as F
>> df = spark.createDataFrame([{'score':1.0,'ID':'abc','LABEL':True,
>> 'k':2},{'score':1.0,'ID':'abc','LABEL':False,'k':3}])
>>
>> df = df.withColumnRenamed("k","kk")\
>>   .select("ID","score","LABEL","kk")#line B
>>
>> df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).
>> filter(F.col("nL")>1)
>> df = df.join(df_t.select("ID"),["ID"])
>> df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count",
>> "cnt1")
>> df = df.join(df_sw, ["ID","kk"])
>>
>
>


Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-09 Thread Gourav Sengupta
Hi,

what I am curious about is the reassignment of df.

Can you please look into the explain plan of df after the statement df =
df.join(df_t.select("ID"),["ID"])? And then compare with the explain plan
of df1 after the statement df1 = df.join(df_t.select("ID"),["ID"])?

Its late here, but I am yet to go through this completely.  But I think
that SPARK does throw a warning mentioning us to use Row instead of
Dictionary.

It will be of help if you could kindly try using the below statement and go
through your used case once again (I am yet to go through all the lines):



from pyspark.sql import Row

df = spark.createDataFrame([Row(score = 1.0,ID="abc",LABEL=True,k=2),
Row(score = 1.0,ID="abc",LABEL=True,k=3)])

Regards,
Gourav Sengupta


On Mon, Apr 9, 2018 at 6:50 PM, Shiyuan <gshy2...@gmail.com> wrote:

> Hi Spark Users,
> The following code snippet has an "attribute missing" error while the
> attribute exists.  This bug is  triggered by a particular sequence of of
> "select", "groupby" and "join".  Note that if I take away the "select"  in
> #line B,  the code runs without error.   However, the "select" in #line B
> includes all columns in the dataframe and hence should  not affect the
> final result.
>
>
> import pyspark.sql.functions as F
> df = spark.createDataFrame([{'score':1.0,'ID':'abc','LABEL':
> True,'k':2},{'score':1.0,'ID':'abc','LABEL':False,'k':3}])
>
> df = df.withColumnRenamed("k","kk")\
>   .select("ID","score","LABEL","kk")#line B
>
> df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("
> nL")).filter(F.col("nL")>1)
> df = df.join(df_t.select("ID"),["ID"])
> df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count", "cnt1")
> df = df.join(df_sw, ["ID","kk"])
>


A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-09 Thread Shiyuan
Hi Spark Users,
The following code snippet hits an "attribute missing" error while the
attribute exists. This bug is triggered by a particular sequence of
"select", "groupby" and "join". Note that if I take away the "select" in
#line B, the code runs without error. However, the "select" in #line B
includes all columns in the dataframe and hence should not affect the
final result.


import pyspark.sql.functions as F
df =
spark.createDataFrame([{'score':1.0,'ID':'abc','LABEL':True,'k':2},{'score':1.0,'ID':'abc','LABEL':False,'k':3}])

df = df.withColumnRenamed("k","kk")\
  .select("ID","score","LABEL","kk")#line B

df_t =
df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).filter(F.col("nL")>1)
df = df.join(df_t.select("ID"),["ID"])
df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count", "cnt1")
df = df.join(df_sw, ["ID","kk"])


spark 2.3 dataframe join bug

2018-03-26 Thread 李斌松
Hi Spark folks,
 I'm using Spark 2.3 and have found a bug in Spark DataFrame; here is my
code:

sc = sparkSession.sparkContext
tmp = sparkSession.createDataFrame(sc.parallelize([[1, 2, 3, 4],
[1, 2, 5, 6], [2, 3, 4, 5], [2, 3, 5, 6]])).toDF('a', 'b', 'c', 'd')
tmp.createOrReplaceTempView('tdl_spark_test')
sparkSession.sql('cache table tdl_spark_test')

df = sparkSession.sql('select a, b from tdl_spark_test group by a,
b')
df.printSchema()

df1 = sparkSession.sql('select a, b, collect_set(array(c)) as c
from tdl_spark_test group by a, b')
df1 = df1.withColumnRenamed('a', 'a1').withColumnRenamed('b', 'b1')
cond = [df.a==df1.a1, df.b==df1.b1]
df = df.join(df1, cond, 'inner').drop('a1', 'b1')

df2 = sparkSession.sql('select a, b, collect_set(array(d)) as d
from tdl_spark_test group by a, b')
df2 = df2.withColumnRenamed('a', 'a1').withColumnRenamed('b', 'b1')
cond = [df.a==df2.a1, df.b==df2.b1]
df = df.join(df2, cond, 'inner').drop('a1', 'b1')

df.show()
sparkSession.sql('uncache table tdl_spark_test')


as you can see, the above code just creates a dataframe and two
child dataframes; the expected answer is:

   +---+---+----------+----------+
   |  a|  b|         c|         d|
   +---+---+----------+----------+
   |  2|  3|[[5], [4]]|[[5], [6]]|
   |  1|  2|[[5], [3]]|[[6], [4]]|
   +---+---+----------+----------+

however, we got the unexpected answer:

   +---+---+----------+----------+
   |  a|  b|         c|         d|
   +---+---+----------+----------+
   |  2|  3|[[5], [4]]|[[5], [4]]|
   |  1|  2|[[5], [3]]|[[5], [3]]|
   +---+---+----------+----------+

 It seems that the columns of the first child DataFrame have overwritten the
columns of the second child DataFrame.

 In addition, this error occurs only when all of the following conditions
hold at the same time (a workaround based on condition 4 is sketched below):
 1. the root table is cached;
 2. "group by" is used in the child DataFrames;
 3. "array" is used inside "collect_set" in the child DataFrames;
 4. the join condition is "df.a==df2.a1, df.b==df2.b1" instead of
"['a', 'b']".


A possible bug? Must call persist to make code run

2017-12-06 Thread kwunlyou
I prepared a simple example (Python) as follows to illustrate what I found:

- The code works under all Spark versions if persist is called beforehand

- Without calling persist, the code works under Spark 2.2.0 but does not
work under Spark 2.1.1 and Spark 2.1.2

- It really looks like a bug in Spark. Does anyone know which resolved Spark
issues are related?


== CODE ==
from __future__ import absolute_import, division, print_function
import pyspark.sql.types as T
import pyspark.sql.functions as F

# 2.1.1, 2.1.2 doesn't work
# 2.2.0 works
print(spark.version)

df = spark.createDataFrame(
    [{'name': 'a', 'scores': ['1', '2']}, {'name': 'b', 'scores': None}],
    T.StructType(
        [T.StructField('name', T.StringType(), True),
         T.StructField('scores', T.ArrayType(T.StringType()), True)]
    )
)

print(df.collect())
df.printSchema()

def loop_array(l):
    for e in l:
        pass
    return "pass"


# should work with persist
# tmp = df.filter(F.col('scores').isNotNull()).withColumn(
#     'new_col',
#     F.udf(loop_array)('scores')
# ).persist()

# won't work
tmp = df.filter(F.col('scores').isNotNull()).withColumn(
    'new_col',
    F.udf(loop_array)('scores')
)

print(tmp.collect())
tmp.filter(F.col('new_col').isNotNull()).count()
== CODE END ==

== ERROR MESSAGE ==
---
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 tmp.filter(F.col('new_col').isNotNull()).count()

/databricks/spark/python/pyspark/sql/dataframe.py in count(self)
378 2
379 """
--> 380 return int(self._jdf.count())
381
382 @ignore_unicode_prefix

/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in
__call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o235.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3
in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0
(TID 14, 10.179.231.249, executor 0):
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File "/databricks/spark/python/pyspark/worker.py", line 171, in main
process()
  File "/databricks/spark/python/pyspark/worker.py", line 166, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/worker.py", line 103, in 
func = lambda _, it: map(mapper, it)
  File "", line 1, in 
  File "/databricks/spark/python/pyspark/worker.py", line 70, in 
return lambda *a: f(*a)
  File "", line 2, in loop_array
TypeError: 'NoneType' object is not iterable

at
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
a
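
The TypeError above comes from loop_array receiving None, which suggests (an
assumption, not a confirmed diagnosis) that in the affected versions the UDF
can be evaluated on rows the isNotNull filter would otherwise have removed. A
minimal null-safe sketch of the UDF avoids the crash regardless of evaluation
order:

import pyspark.sql.functions as F

def loop_array(l):
    # Guard against None in case the UDF runs before the isNotNull filter.
    if l is None:
        return None
    for e in l:
        pass
    return "pass"

tmp = df.filter(F.col('scores').isNotNull()).withColumn(
    'new_col',
    F.udf(loop_array)('scores')
)
print(tmp.collect())
tmp.filter(F.col('new_col').isNotNull()).count()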

Bug Report: Spark Config Defaults not Loading with python code/spark-submit

2017-10-13 Thread Nathan McLean
Here is an example pyspark program which illustrates this problem. If run
using spark-submit, the default configurations for Spark do not seem to be
loaded when a new SparkConf class is instantiated (contrary to what the
loadDefaults=True keyword arg implies). When using the interactive shell
for pyspark, the default configurations are loaded as expected.

I also tested this behaviour with a simple Scala program, but the Scala
code behaved correctly.


#!/usr/bin/python2.7
from pyspark.conf import SparkConf
from pyspark import SparkContext


# this prints nothing
print 'SPARK DEFAULTS'
conf = SparkConf()
print conf.toDebugString()

# this prints configuration options
print 'SPARK DEFAULTS'
spark_context = SparkContext()
conf = spark_context.getConf()
print conf.toDebugString()


This bug does not seem to exist in Spark 1.6.x
I have reproduced it in Spark 2.1.1 and Spark 2.2.0
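
As a stopgap, a sketch only (it assumes spark-defaults.conf sits under
$SPARK_HOME/conf and uses whitespace-separated key/value lines, which may not
match every deployment), the defaults can be read into SparkConf by hand
before the context is created:

#!/usr/bin/python2.7
import os
from pyspark.conf import SparkConf
from pyspark import SparkContext

conf = SparkConf()
spark_home = os.environ.get('SPARK_HOME')
defaults_path = os.path.join(spark_home, 'conf', 'spark-defaults.conf') if spark_home else None

if defaults_path and os.path.exists(defaults_path):
    with open(defaults_path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('#'):
                # entries are "key value" separated by whitespace
                parts = line.split(None, 1)
                if len(parts) == 2:
                    conf.set(parts[0], parts[1].strip())

print conf.toDebugString()
spark_context = SparkContext(conf=conf)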


How do I create a JIRA issue and associate it with a PR that I created for a bug in master?

2017-09-12 Thread Mikhailau, Alex
How do I create a JIRA issue and associate it with a PR that I created for a 
bug in master?

https://github.com/apache/spark/pull/19210


RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
Thanks for the reply. I filed an issue in JIRA:
https://issues.apache.org/jira/browse/SPARK-21819

I submitted the job from the Java API, not via the spark-submit command line,
as we want to offer Spark processing as a service.

Configuration hc = new Configuration(false);
String yarnxml = String.format("%s/%s", ConfigLocation, "yarn-site.xml");
String corexml = String.format("%s/%s", ConfigLocation, "core-site.xml");
String hdfsxml = String.format("%s/%s", ConfigLocation, "hdfs-site.xml");
String hivexml = String.format("%s/%s", ConfigLocation, "hive-site.xml");

hc.addResource(yarnxml);
hc.addResource(corexml);
hc.addResource(hdfsxml);
hc.addResource(hivexml);

// manually set all the Hadoop config in SparkConf
SparkConf sc = new SparkConf(true);
hc.forEach(entry -> {
    if (entry.getKey().startsWith("hive")) {
        sc.set(entry.getKey(), entry.getValue());
    } else {
        sc.set("spark.hadoop." + entry.getKey(), entry.getValue());
    }
});

UserGroupInformation.setConfiguration(hc);
UserGroupInformation.loginUserFromKeytab(Principal, Keytab);

SparkSession sparkSession = SparkSession
        .builder()
        .master("yarn-client")  // "yarn-client", "local"
        .config(sc)
        .appName(SparkEAZDebug.class.getName())
        .enableHiveSupport()
        .getOrCreate();


Thanks very much.
Keith

From: 周康 [mailto:zhoukang199...@gmail.com]
Sent: 22 August 2017 20:22
To: Sun, Keith <ai...@ebay.com>
Cc: user@spark.apache.org
Subject: Re: A bug in spark or hadoop RPC with kerberos authentication?

You can check out the Hadoop credential-related classes in the Spark YARN
module. During spark-submit, it will use the config on the classpath.
I wonder how you reference your own config?
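
For anyone doing the same from PySpark, the "prefix with spark.hadoop." pattern
used in the Java snippet above translates roughly as follows (a sketch; the XML
file locations are placeholders, not paths from the original report):

import xml.etree.ElementTree as ET
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
for xml_file in ['/path/to/conf/yarn-site.xml', '/path/to/conf/core-site.xml',
                 '/path/to/conf/hdfs-site.xml', '/path/to/conf/hive-site.xml']:
    for prop in ET.parse(xml_file).getroot().findall('property'):
        name, value = prop.findtext('name'), prop.findtext('value')
        if name is None or value is None:
            continue
        # hive.* keys go through unchanged; everything else is prefixed with
        # spark.hadoop. so Spark forwards it into the Hadoop Configuration.
        conf.set(name if name.startswith('hive') else 'spark.hadoop.' + name, value)

spark = (SparkSession.builder
         .master('yarn')  # the Java snippet uses the older 'yarn-client' form
         .config(conf=conf)
         .enableHiveSupport()
         .getOrCreate())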

