How to set a config for a single query?

2023-01-03 Thread Felipe Pessoto
Hi,

In Scala, is it possible to set a config value for a single query?

I could set and then restore the value, but that won't work in multithreaded scenarios.

Example:

spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
queryA_df.collect()
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", originalValue)
queryB_df.collect()
queryC_df.collect()
queryD_df.collect()


If I execute that block of code multiple times from multiple threads, I can end
up executing query A with coalescePartitions.enabled=true, and queries B, C and
D with the config set to false, because another thread can change the setting
between the executions.

Is there any good alternative to this?

Thanks.
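
A possible alternative (a sketch, not from the thread, assuming each thread can
build its DataFrames from its own session): SparkSession.newSession() returns a
session that shares the same SparkContext but keeps its own isolated SQL conf,
so a per-query setting in one thread does not leak into the others. The query
text below is a hypothetical stand-in.

// Per-thread session with isolated SQL conf (sketch)
val isolated = spark.newSession()
isolated.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
val queryA_df = isolated.sql("SELECT ...")  // hypothetical query
queryA_df.collect()
// other threads keep using `spark`, whose conf is unchanged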


[SparkR] Compare datetime with Sys.time() throws error in R (>= 4.2.0)

2023-01-03 Thread Vivek Atal
Hi,
Base R 4.2.0 introduced a change ([Rd] R 4.2.0 is released): "Calling if() or
while() with a condition of length greater than one gives an error rather than
a warning."
The code below is a reproducible example of the issue. Executed in R >= 4.2.0
it generates an error; in earlier versions it only generates a warning.
Sys.time() returns a multi-class object in R, and throughout the SparkR
repository 'if' statements are written as 'if (class(x) == "Column")', which
causes the error in the latest R versions. Note that R allows an object to have
multiple 'class' names as a character vector (R: Object Classes); hence this
type of check was fragile in the first place.
t <- Sys.time()   # class(t) is c("POSIXct", "POSIXt"), a length-2 class vector
sdf <- SparkR::createDataFrame(data.frame(xx = t + c(-1, 1, -1, 1, -1)))
SparkR::collect(SparkR::filter(sdf, SparkR::column("xx") > t))

The suggested change is to wrap the check in 'all' when testing whether
class(.) is Column: 'if (all(class(x) == "Column"))'.
The process for creating an issue in JIRA is not clear to me, hence I am
mailing the 'user' list.
Vivek


Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Sean Owen
Why does the data even need cleaning? That output is perfectly correct. The
error was setting the escape character to the quote character.
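
A minimal sketch of the fix (written in Scala here; the PySpark option names
are the same): drop the escape override and let the default quote handling
parse the file. This assumes the same /tmp/test.csv from the report.

val df = spark.read
  .format("csv")
  .option("multiLine", "true")
  .option("header", "true")
  .load("/tmp/test.csv")
df.select("c").show()  // two rows: "," and "abc"
println(df.count())    // 2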

On Tue, Jan 3, 2023, 2:32 PM Mich Talebzadeh wrote:

> if you take your source CSV as below
>
> "a","b","c"
> "1","",","
> "2","","abc"
>
>
> and define your code as below
>
>
> csv_file = "hdfs://rhes75:9000/data/stg/test/testcsv.csv"
> # read the csv file into a DataFrame
> listing_df = spark.read.format("com.databricks.spark.csv") \
>     .option("inferSchema", "true") \
>     .option("header", "true") \
>     .load(csv_file)
> listing_df.printSchema()
> print(f"""\n Reading from Hive table {csv_file}\n""")
> listing_df.show(100, False)
> listing_df.select("c").show()
>
>
> results in
>
>
>  Reading from Hive table hdfs://rhes75:9000/data/stg/test/testcsv.csv
>
> +---+----+---+
> |a  |b   |c  |
> +---+----+---+
> |1  |null|,  |
> |2  |null|abc|
> +---+----+---+
>
> +---+
> |  c|
> +---+
> |  ,|
> |abc|
> +---+
>
>
> which treats "," as the value of column c in row 1
>
>
> This interpretation is correct. You ought to cleanse the data beforehand.
>
>
> HTH
>
>
>
> On Tue, 3 Jan 2023 at 17:03, Sean Owen wrote:
>
>> No, you've set the escape character to double-quote, when it looks like
>> you mean for it to be the quote character (which it already is). Remove
>> this setting, as it's incorrect.
>>
>> On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati wrote:
>>
>>> Hello,
>>> We are seeing a case where Spark parses csv data incorrectly.
>>> The issue can be replicated with the csv data below
>>>
>>> "a","b","c"
>>> "1","",","
>>> "2","","abc"
>>>
>>> and using the spark csv read command.
>>>
>>> df = spark.read.format("csv")\
>>> .option("multiLine", True)\
>>> .option("escape", '"')\
>>> .option("enforceSchema", False) \
>>> .option("header", True)\
>>> .load(f"/tmp/test.csv")
>>>
>>> df.show(100, False) # prints both rows
>>> |a  |b   |c  |
>>> +---+----+---+
>>> |1  |null|,  |
>>> |2  |null|abc|
>>>
>>> df.select("c").show() # merges last column of first row and first
>>> column of second row
>>> +------+
>>> |     c|
>>> +------+
>>> |"\n"2"|
>>>
>>> print(df.count()) # prints 1, should be 2
>>>
>>>
>>> It feels like a bug, and I thought I would ask the community before
>>> creating a bug on JIRA.
>>>
>>> Mvg/Regards
>>> Saurabh
>>>
>>>


Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Mich Talebzadeh
if you take your source CSV as below

"a","b","c"
"1","",","
"2","","abc"


and define your code as below


csv_file = "hdfs://rhes75:9000/data/stg/test/testcsv.csv"
# read the csv file into a DataFrame
listing_df = spark.read.format("com.databricks.spark.csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load(csv_file)
listing_df.printSchema()
print(f"""\n Reading from Hive table {csv_file}\n""")
listing_df.show(100, False)
listing_df.select("c").show()


results in


 Reading from Hive table hdfs://rhes75:9000/data/stg/test/testcsv.csv

+---+----+---+
|a  |b   |c  |
+---+----+---+
|1  |null|,  |
|2  |null|abc|
+---+----+---+

+---+
|  c|
+---+
|  ,|
|abc|
+---+


which treats "," as the value of column c in row 1


This interpretation is correct. You ought to cleanse the data beforehand.


HTH


On Tue, 3 Jan 2023 at 17:03, Sean Owen wrote:

> No, you've set the escape character to double-quote, when it looks like
> you mean for it to be the quote character (which it already is). Remove
> this setting, as it's incorrect.
>
> On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati wrote:
>
>> Hello,
>> We are seeing a case where Spark parses csv data incorrectly.
>> The issue can be replicated with the csv data below
>>
>> "a","b","c"
>> "1","",","
>> "2","","abc"
>>
>> and using the spark csv read command.
>>
>> df = spark.read.format("csv")\
>> .option("multiLine", True)\
>> .option("escape", '"')\
>> .option("enforceSchema", False) \
>> .option("header", True)\
>> .load(f"/tmp/test.csv")
>>
>> df.show(100, False) # prints both rows
>> |a  |b   |c  |
>> +---+----+---+
>> |1  |null|,  |
>> |2  |null|abc|
>>
>> df.select("c").show() # merges last column of first row and first column
>> of second row
>> +------+
>> |     c|
>> +------+
>> |"\n"2"|
>>
>> print(df.count()) # prints 1, should be 2
>>
>>
>> It feels like a bug, and I thought I would ask the community before creating
>> a bug on JIRA.
>>
>> Mvg/Regards
>> Saurabh
>>
>>


Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Sean Owen
No, you've set the escape character to double-quote, when it looks like you
mean for it to be the quote character (which it already is). Remove this
setting, as it's incorrect.

On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati wrote:

> Hello,
> We are seeing a case where Spark parses csv data incorrectly.
> The issue can be replicated with the csv data below
>
> "a","b","c"
> "1","",","
> "2","","abc"
>
> and using the spark csv read command.
>
> df = spark.read.format("csv")\
> .option("multiLine", True)\
> .option("escape", '"')\
> .option("enforceSchema", False) \
> .option("header", True)\
> .load(f"/tmp/test.csv")
>
> df.show(100, False) # prints both rows
> |a  |b   |c  |
> +---+----+---+
> |1  |null|,  |
> |2  |null|abc|
>
> df.select("c").show() # merges last column of first row and first column
> of second row
> +------+
> |     c|
> +------+
> |"\n"2"|
>
> print(df.count()) # prints 1, should be 2
>
>
> It feels like a bug, and I thought I would ask the community before creating
> a bug on JIRA.
>
> Mvg/Regards
> Saurabh
>
>


Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Saurabh Gulati
Hello,
We are seeing a case where Spark parses csv data incorrectly.
The issue can be replicated with the csv data below

"a","b","c"
"1","",","
"2","","abc"
and using the spark csv read command.
df = spark.read.format("csv")\
.option("multiLine", True)\
.option("escape", '"')\
.option("enforceSchema", False) \
.option("header", True)\
.load(f"/tmp/test.csv")

df.show(100, False) # prints both rows
|a  |b   |c  |
+---+----+---+
|1  |null|,  |
|2  |null|abc|

df.select("c").show() # merges last column of first row and first column of 
second row
+------+
|     c|
+------+
|"\n"2"|

print(df.count()) # prints 1, should be 2

It feels like a bug, and I thought I would ask the community before creating a
bug on JIRA.

Mvg/Regards
Saurabh