[jira] [Comment Edited] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up

2021-08-27 Thread Taran Saini (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405606#comment-17405606
 ] 

Taran Saini edited comment on SPARK-24156 at 8/27/21, 6:05 AM:
---

[~kabhwan] We have a Kafka broker from which we continuously read via a Spark 
Structured Streaming query; we perform some aggregations and then write the 
results out to a file sink on S3 in append mode.

Watermarking is used to accommodate late events (15 minutes; we have also tried 
smaller values):
{code:java}
.withWatermark("localTimeStamp", config.getString("spark.watermarkInterval"))
{code}
The groupBy clause defines the window (batch) and slide intervals (15 minutes in 
our case):
{code:java}
.groupBy(window($"localTimeStamp", batchInterval, 
config.getString("spark.slideInterval")),..,..)
{code}
After the aggregation(s), here is the snippet that writes the stream results to 
the file sink:
{code:java}
.repartition(1)
.writeStream
.partitionBy("date", "hour", "windowMinute")
.format("parquet")
.option("checkpointLocation", config.getString("spark.checkpointLocation"))
.trigger(Trigger.ProcessingTime(config.getString("spark.triggerInterval")))
.outputMode("append")
.option("path", s"${s3SinkLocation}/parquet/")
.start()
{code}
Here are the issues we observe:
 1. The stream does not write output to the sink unless there is new data; if no 
events arrive in the current window, the previous window never gets flushed out.
 2. Even with a continuous inflow of events, the partitioned output directories 
are not consistently created every trigger interval (15 minutes); it works most 
of the time but not always. We tried setting `.option("parquet.block.size", 1024)`, 
thinking it might flush events every window once the size exceeds 1024 bytes, 
but that did not produce the desired results either.

To summarise, `watermarking + append mode + file sink` is not working as 
expected (as it should per the Spark documentation). We are using Spark 3.0.x.
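
For reference, here is a minimal, self-contained sketch of how the snippets 
above fit together end to end. The Kafka settings, topic name, partition-column 
derivation, and the literal interval/path values are illustrative assumptions 
standing in for the actual config.getString(...) entries, not the real job:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format, window}
import org.apache.spark.sql.streaming.Trigger

object WindowedAggToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WindowedAggToParquet").getOrCreate()
    import spark.implicits._

    // Illustrative stand-ins for the config.getString(...) entries used above.
    val watermarkInterval  = "15 minutes"
    val batchInterval      = "15 minutes"
    val slideInterval      = "15 minutes"
    val triggerInterval    = "15 minutes"
    val checkpointLocation = "s3a://bucket/checkpoints/agg"   // assumption
    val s3SinkLocation     = "s3a://bucket/output"            // assumption

    // Assumed Kafka source; the real job derives localTimeStamp from the payload.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")       // assumption
      .option("subscribe", "events")                           // assumption
      .load()
      .select(col("timestamp").as("localTimeStamp"),
              col("value").cast("string").as("payload"))

    val aggregated = events
      .withWatermark("localTimeStamp", watermarkInterval)
      .groupBy(window($"localTimeStamp", batchInterval, slideInterval), $"payload")
      .count()
      // Derive the partition columns expected by partitionBy below.
      .withColumn("date", date_format($"window.start", "yyyy-MM-dd"))
      .withColumn("hour", date_format($"window.start", "HH"))
      .withColumn("windowMinute", date_format($"window.start", "mm"))

    aggregated
      .repartition(1)
      .writeStream
      .partitionBy("date", "hour", "windowMinute")
      .format("parquet")
      .option("checkpointLocation", checkpointLocation)
      .trigger(Trigger.ProcessingTime(triggerInterval))
      .outputMode("append")
      .option("path", s"$s3SinkLocation/parquet/")
      .start()
      .awaitTermination()
  }
}
{code}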


was (Author: taransaini43):
[~kabhwan] We have a Kafka broker from which we continuously read via a Spark 
Structured Streaming query; we perform some aggregations and then write the 
results out to a file sink on S3 in append mode.

Watermarking is used to accommodate late events (15 minutes; we have also tried 
smaller values):
{code:java}
.withWatermark("localTimeStamp", config.getString("spark.watermarkInterval"))
{code}

The groupBy clause defines the window (batch) and slide intervals (15 minutes in 
our case):
{code:java}
.groupBy(window($"localTimeStamp", batchInterval, 
config.getString("spark.slideInterval")),..,..)
{code}

After the aggregation(s), here is the snippet that writes the stream results to 
the file sink:
{code:java}
.repartition(1)
.writeStream
.partitionBy("date", "hour", "windowMinute")
.format("parquet")
.option("checkpointLocation", config.getString("spark.checkpointLocation"))
.trigger(Trigger.ProcessingTime(config.getString("spark.triggerInterval")))
.outputMode("append")
.option("path", s"${s3SinkLocation}/parquet/")
.start()
{code}


Here are the issues we observe:
1. The stream does not write output to the sink unless there is new data; if no 
events arrive in the current window, the previous window never gets flushed out.
2. Even with a continuous inflow of events, the partitioned output directories 
are not consistently created every window (15 mins); it works most of the time 
but not always. We tried setting `.option("parquet.block.size", 1024)`, thinking 
it might flush events every window once the size exceeds 1024 bytes, but that 
did not produce the desired results either.

To summarise, `watermarking + append mode + file sink` is not working as 
expected (as it should per the Spark documentation). We are using Spark 3.0.x.

> Enable no-data micro batches for more eager streaming state clean up 
> -
>
> Key: SPARK-24156
> URL: https://issues.apache.org/jira/browse/SPARK-24156
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, MicroBatchExecution in Structured Streaming runs batches only when 
> there is new data to process. This is sensible in most cases as we dont want 
> to unnecessarily use resources when there is nothing new to process. However, 
> in some cases of stateful streaming queries, this delays state clean up as 
> well as clean-up based output. For example, consider a streaming aggregation 
> query with watermark-based state cleanup. The watermark is updated after 
> every batch with new data completes. The updated value is used in the next 
> batch to clean up state, and output finalized aggregates in append mode. 
> However, if there is no data, then the next batch does not occur, and 
> cleanup/output gets delayed unnecessarily. This is true for all stateful 
> streaming operators - aggregation, deduplication, joins, mapGroupsWithState

[jira] [Commented] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up

2021-08-27 Thread Taran Saini (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405606#comment-17405606
 ] 

Taran Saini commented on SPARK-24156:
-

[~kabhwan] We have a Kafka broker from which we continuously read via a Spark 
Structured Streaming query; we perform some aggregations and then write the 
results out to a file sink on S3 in append mode.

Watermarking is used to accommodate late events (15 minutes; we have also tried 
smaller values):
{code:java}
.withWatermark("localTimeStamp", config.getString("spark.watermarkInterval"))
{code}

The groupBy clause defines the window (batch) and slide intervals (15 minutes in 
our case):
{code:java}
.groupBy(window($"localTimeStamp", batchInterval, 
config.getString("spark.slideInterval")),..,..)
{code}

After the aggregation(s), here is the snippet that writes the stream results to 
the file sink:
{code:java}
.repartition(1)
.writeStream
.partitionBy("date", "hour", "windowMinute")
.format("parquet")
.option("checkpointLocation", config.getString("spark.checkpointLocation"))
.trigger(Trigger.ProcessingTime(config.getString("spark.triggerInterval")))
.outputMode("append")
.option("path", s"${s3SinkLocation}/parquet/")
.start()
{code}


Here are the issues we observe:
1. The stream does not write output to the sink unless there is new data; if no 
events arrive in the current window, the previous window never gets flushed out.
2. Even with a continuous inflow of events, the partitioned output directories 
are not consistently created every window (15 mins); it works most of the time 
but not always. We tried setting `.option("parquet.block.size", 1024)`, thinking 
it might flush events every window once the size exceeds 1024 bytes, but that 
did not produce the desired results either.

To summarise, `watermarking + append mode + file sink` is not working as 
expected (as it should per the Spark documentation). We are using Spark 3.0.x.

> Enable no-data micro batches for more eager streaming state clean up 
> -
>
> Key: SPARK-24156
> URL: https://issues.apache.org/jira/browse/SPARK-24156
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, MicroBatchExecution in Structured Streaming runs batches only when 
> there is new data to process. This is sensible in most cases as we dont want 
> to unnecessarily use resources when there is nothing new to process. However, 
> in some cases of stateful streaming queries, this delays state clean up as 
> well as clean-up based output. For example, consider a streaming aggregation 
> query with watermark-based state cleanup. The watermark is updated after 
> every batch with new data completes. The updated value is used in the next 
> batch to clean up state, and output finalized aggregates in append mode. 
> However, if there is no data, then the next batch does not occur, and 
> cleanup/output gets delayed unnecessarily. This is true for all stateful 
> streaming operators - aggregation, deduplication, joins, mapGroupsWithState
> This issue tracks the work to enable no-data batches in MicroBatchExecution. 
> The major challenge is that all the tests of relevant stateful operations add 
> dummy data to force another batch for testing the state cleanup. So a lot of 
> the tests are going to be changed. So my plan is to enable no-data batches 
> for different stateful operators one at a time.
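
For reference, a minimal sketch of the SQL conf that controls this behaviour in 
releases that include the change; the conf name and its default are assumptions 
to verify against your Spark version:
{code:java}
import org.apache.spark.sql.SparkSession

// Minimal sketch: explicitly enabling no-data micro-batches (assumed conf name,
// assumed default of true; verify against your Spark version).
val spark = SparkSession.builder().appName("NoDataBatchesConf").getOrCreate()
spark.conf.set("spark.sql.streaming.noDataMicroBatches.enabled", "true")
{code}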



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up

2021-08-26 Thread Taran Saini (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405599#comment-17405599
 ] 

Taran Saini commented on SPARK-24156:
-

[~thebluephantom] Did you find any resolution for this? We are facing the same 
problem and see non-uniform delays in writing to the S3/file sink while using 
both watermarking and append mode (many people seem to be raising the same 
issue). This is a major bug which should be re-tested and at least mentioned in 
the documentation.
[~tdas] [~cloud_fan]

> Enable no-data micro batches for more eager streaming state clean up 
> -
>
> Key: SPARK-24156
> URL: https://issues.apache.org/jira/browse/SPARK-24156
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, MicroBatchExecution in Structured Streaming runs batches only when 
> there is new data to process. This is sensible in most cases as we dont want 
> to unnecessarily use resources when there is nothing new to process. However, 
> in some cases of stateful streaming queries, this delays state clean up as 
> well as clean-up based output. For example, consider a streaming aggregation 
> query with watermark-based state cleanup. The watermark is updated after 
> every batch with new data completes. The updated value is used in the next 
> batch to clean up state, and output finalized aggregates in append mode. 
> However, if there is no data, then the next batch does not occur, and 
> cleanup/output gets delayed unnecessarily. This is true for all stateful 
> streaming operators - aggregation, deduplication, joins, mapGroupsWithState
> This issue tracks the work to enable no-data batches in MicroBatchExecution. 
> The major challenge is that all the tests of relevant stateful operations add 
> dummy data to force another batch for testing the state cleanup. So a lot of 
> the tests are going to be changed. So my plan is to enable no-data batches 
> for different stateful operators one at a time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21684) df.write double escaping all the already escaped characters except the first one

2017-08-10 Thread Taran Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taran Saini updated SPARK-21684:

Attachment: SparkQuotesTest2.scala

Please find it attached.

> df.write double escaping all the already escaped characters except the first 
> one
> 
>
> Key: SPARK-21684
> URL: https://issues.apache.org/jira/browse/SPARK-21684
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
> Attachments: SparkQuotesTest2.scala
>
>
> Hi,
> If we have a dataframe with the column value as {noformat} ab\,cd\,ef\,gh 
> {noformat}
> Then when writing, it comes out as 
> {noformat} "ab\,cd\\,ef\\,gh" {noformat}
> i.e. it double-escapes all the already escaped commas/delimiters except the 
> first one.
> This is weird behaviour; it should either do this for all of them or for none.
> If I set df.option("escape", "") to an empty string, that solves the problem, 
> but then any double quotes inside the same value are preceded by a special 
> char, i.e. '\u00'. Why does it do so when the escape character is set to "" 
> (empty)?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21684) df.write double escaping all the already escaped characters except the first one

2017-08-09 Thread Taran Saini (JIRA)
Taran Saini created SPARK-21684:
---

 Summary: df.write double escaping all the already escaped 
characters except the first one
 Key: SPARK-21684
 URL: https://issues.apache.org/jira/browse/SPARK-21684
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Taran Saini


Hi,

If we have a dataframe with the column value as {noformat} ab\,cd\,ef\,gh 
{noformat}
Then when writing, it comes out as 
{noformat} "ab\,cd\\,ef\\,gh" {noformat}
i.e. it double-escapes all the already escaped commas/delimiters except the 
first one.
This is weird behaviour; it should either do this for all of them or for none.
If I set df.option("escape", "") to an empty string, that solves the problem, but 
then any double quotes inside the same value are preceded by a special char, 
i.e. '\u00'. Why does it do so when the escape character is set to "" (empty)?
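
For reference, here is a minimal sketch of how the behaviour described above 
might be reproduced; the local master and output paths are illustrative 
assumptions:
{code:java}
import org.apache.spark.sql.SparkSession

object CsvEscapeRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvEscapeRepro")
      .master("local[1]")          // assumption: run locally for the repro
      .getOrCreate()
    import spark.implicits._

    // Column value that already contains escaped commas: ab\,cd\,ef\,gh
    val df = Seq("""ab\,cd\,ef\,gh""").toDF("value")

    // With default writer options the already escaped delimiters are
    // reportedly escaped a second time (except the first one).
    df.write.mode("overwrite").format("csv").save("/tmp/csv-default")

    // Setting the escape character to an empty string avoids the double
    // escaping, but reportedly prefixes embedded double quotes with '\u00..'.
    df.write.mode("overwrite")
      .option("escape", "")
      .format("csv")
      .save("/tmp/csv-empty-escape")

    spark.stop()
  }
}
{code}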



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21678) Disabling quotes while writing a dataframe

2017-08-09 Thread Taran Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taran Saini reopened SPARK-21678:
-

> Disabling quotes while writing a dataframe
> --
>
> Key: SPARK-21678
> URL: https://issues.apache.org/jira/browse/SPARK-21678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
>
> Hi,
> I have my dataframe column values, which can contain commas, double quotes, etc.
> I am transforming the dataframes in order to ensure that all the required 
> values are escaped.
> However, on doing df.write.format("csv"), it again wraps the values in double 
> quotes. How do I disable that?
> And even if the double quotes are there to stay, why does it do the following:
> {noformat}
> L"\, p' Y a\, C G
> {noformat}
>  is written as 
> {noformat}
> "L\"\\, p' Y a\\, C G\\, H"
> {noformat}
>  i.e. it double-escapes the already escaped values, 
> and if I escape it myself like:
> {noformat}
> L\"\, p' Y a\, C G
> {noformat}
>  then that is written as 
> {noformat}
>  "L\\"\\, p' Y a\\, C G\\, H"
> {noformat}
> How do we just disable this automatic escaping of characters?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21678) Disabling quotes while writing a dataframe

2017-08-09 Thread Taran Saini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119898#comment-16119898
 ] 

Taran Saini edited comment on SPARK-21678 at 8/9/17 1:30 PM:
-

This is not a question; this is a bug! 
If only somebody would read this and let me know whether it is a bug or a question.


was (Author: taransaini43):
This is not a question; this is a bug! 
If only somebody would read this and let me know whether it is a bug or a question.

> Disabling quotes while writing a dataframe
> --
>
> Key: SPARK-21678
> URL: https://issues.apache.org/jira/browse/SPARK-21678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
>
> Hi,
> I have my dataframe column values, which can contain commas, double quotes, etc.
> I am transforming the dataframes in order to ensure that all the required 
> values are escaped.
> However, on doing df.write.format("csv"), it again wraps the values in double 
> quotes. How do I disable that?
> And even if the double quotes are there to stay, why does it do the following:
> {noformat}
> L"\, p' Y a\, C G
> {noformat}
>  is written as 
> {noformat}
> "L\"\\, p' Y a\\, C G\\, H"
> {noformat}
>  i.e. it double-escapes the already escaped values, 
> and if I escape it myself like:
> {noformat}
> L\"\, p' Y a\, C G
> {noformat}
>  then that is written as 
> {noformat}
>  "L\\"\\, p' Y a\\, C G\\, H"
> {noformat}
> How do we just disable this automatic escaping of characters?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21678) Disabling quotes while writing a dataframe

2017-08-09 Thread Taran Saini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119898#comment-16119898
 ] 

Taran Saini commented on SPARK-21678:
-

This is not a question; this is a bug! 
If only somebody would read this and let me know whether it is a bug or a question.

> Disabling quotes while writing a dataframe
> --
>
> Key: SPARK-21678
> URL: https://issues.apache.org/jira/browse/SPARK-21678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
>
> Hi,
> I have my dataframe column values, which can contain commas, double quotes, etc.
> I am transforming the dataframes in order to ensure that all the required 
> values are escaped.
> However, on doing df.write.format("csv"), it again wraps the values in double 
> quotes. How do I disable that?
> And even if the double quotes are there to stay, why does it do the following:
> {noformat}
> L"\, p' Y a\, C G
> {noformat}
>  is written as 
> {noformat}
> "L\"\\, p' Y a\\, C G\\, H"
> {noformat}
>  i.e. it double-escapes the already escaped values, 
> and if I escape it myself like:
> {noformat}
> L\"\, p' Y a\, C G
> {noformat}
>  then that is written as 
> {noformat}
>  "L\\"\\, p' Y a\\, C G\\, H"
> {noformat}
> How do we just disable this automatic escaping of characters?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21678) Disabling quotes while writing a dataframe

2017-08-09 Thread Taran Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taran Saini updated SPARK-21678:

Description: 
Hi,

I have my dataframe column values, which can contain commas, double quotes, etc.
I am transforming the dataframes in order to ensure that all the required 
values are escaped.

However, on doing df.write.format("csv"), it again wraps the values in double 
quotes. How do I disable that?
And even if the double quotes are there to stay, why does it do the following:
{noformat}
L"\, p' Y a\, C G
{noformat}
 is written as 
{noformat}
"L\"\\, p' Y a\\, C G\\, H"
{noformat}
 i.e. it double-escapes the already escaped values, 
and if I escape it myself like:
{noformat}
L\"\, p' Y a\, C G
{noformat}
 then that is written as 
{noformat}
 "L\\"\\, p' Y a\\, C G\\, H"
{noformat}

How do we just disable this automatic escaping of characters?




  was:
Hi,

I have my dataframe column values, which can contain commas, double quotes, etc.
I am transforming the dataframes in order to ensure that all the required 
values are escaped.

However, on doing df.write.format("csv"), it again wraps the values in double 
quotes. How do I disable that?
And even if the double quotes are there to stay, why does it do the following:
{quote}
L"\, p' Y a\, C G
{quote}
 is written as 
{quote}
"L\"\\, p' Y a\\, C G\\, H"
{quote}
 i.e. it double-escapes the already escaped values, 
and if I escape it myself like:
{quote}
L\"\, p' Y a\, C G
{quote}
 then that is written as 
{quote}
 "L\\"\\, p' Y a\\, C G\\, H"
{quote}

How do we just disable this automatic escaping of characters?





> Disabling quotes while writing a dataframe
> --
>
> Key: SPARK-21678
> URL: https://issues.apache.org/jira/browse/SPARK-21678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
>
> Hi,
> I have my dataframe column values, which can contain commas, double quotes, etc.
> I am transforming the dataframes in order to ensure that all the required 
> values are escaped.
> However, on doing df.write.format("csv"), it again wraps the values in double 
> quotes. How do I disable that?
> And even if the double quotes are there to stay, why does it do the following:
> {noformat}
> L"\, p' Y a\, C G
> {noformat}
>  is written as 
> {noformat}
> "L\"\\, p' Y a\\, C G\\, H"
> {noformat}
>  i.e. it double-escapes the already escaped values, 
> and if I escape it myself like:
> {noformat}
> L\"\, p' Y a\, C G
> {noformat}
>  then that is written as 
> {noformat}
>  "L\\"\\, p' Y a\\, C G\\, H"
> {noformat}
> How do we just disable this automatic escaping of characters?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21678) Disabling quotes while writing a dataframe

2017-08-09 Thread Taran Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taran Saini updated SPARK-21678:

Description: 
Hi,

I have my dataframe column values, which can contain commas, double quotes, etc.
I am transforming the dataframes in order to ensure that all the required 
values are escaped.

However, on doing df.write.format("csv"), it again wraps the values in double 
quotes. How do I disable that?
And even if the double quotes are there to stay, why does it do the following:
{quote}
L"\, p' Y a\, C G
{quote}
 is written as 
{quote}
"L\"\\, p' Y a\\, C G\\, H"
{quote}
 i.e. it double-escapes the already escaped values, 
and if I escape it myself like:
{quote}
L\"\, p' Y a\, C G
{quote}
 then that is written as 
{quote}
 "L\\"\\, p' Y a\\, C G\\, H"
{quote}

How do we just disable this automatic escaping of characters?




  was:
Hi,

I have my dataframe column values, which can contain commas, double quotes, etc.
I am transforming the dataframes in order to ensure that all the required 
values are escaped.

However, on doing df.write.format("csv"), it again wraps the values in double 
quotes. How do I disable that?
And even if the double quotes are there to stay, why does it do the following:
L"\, p' Y a\, C G is written as "L\"\\, p' Y a\\, C G\\, H", i.e. it 
double-escapes the already escaped values, 
and if I escape it myself like 
L\"\, p' Y a\, C G then that is written as "L\\"\\, p' Y a\\, C G\\, H"

How do we just disable this automatic escaping of characters?





> Disabling quotes while writing a dataframe
> --
>
> Key: SPARK-21678
> URL: https://issues.apache.org/jira/browse/SPARK-21678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
>
> Hi,
> I have my dataframe column values, which can contain commas, double quotes, etc.
> I am transforming the dataframes in order to ensure that all the required 
> values are escaped.
> However, on doing df.write.format("csv"), it again wraps the values in double 
> quotes. How do I disable that?
> And even if the double quotes are there to stay, why does it do the following:
> {quote}
> L"\, p' Y a\, C G
> {quote}
>  is written as 
> {quote}
> "L\"\\, p' Y a\\, C G\\, H"
> {quote}
>  i.e. it double-escapes the already escaped values, 
> and if I escape it myself like:
> {quote}
> L\"\, p' Y a\, C G
> {quote}
>  then that is written as 
> {quote}
>  "L\\"\\, p' Y a\\, C G\\, H"
> {quote}
> How do we just disable this automatic escaping of characters?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21678) Disabling quotes while writing a dataframe

2017-08-09 Thread Taran Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taran Saini updated SPARK-21678:

Description: 
Hi,

I have my dataframe column values, which can contain commas, double quotes, etc.
I am transforming the dataframes in order to ensure that all the required 
values are escaped.

However, on doing df.write.format("csv"), it again wraps the values in double 
quotes. How do I disable that?
And even if the double quotes are there to stay, why does it do the following:
L"\, p' Y a\, C G is written as "L\"\\, p' Y a\\, C G\\, H", i.e. it 
double-escapes the already escaped values, 
and if I escape it myself like 
L\"\, p' Y a\, C G then that is written as "L\\"\\, p' Y a\\, C G\\, H"

How do we just disable this automatic escaping of characters?




  was:
Hi,

I have my dataframe column values, which can contain commas, double quotes, etc.
I am transforming the dataframes in order to ensure that all the required 
values are escaped.

However, on doing df.write.format("csv"), it again wraps the values in double 
quotes. How do I disable that?
And even if the double quotes are there to stay, why does it do the following:
L"\, p' Y a\, C G is written as "L\"\\, p' Y a\\, C G\\, H", i.e. it 
double-escapes the already escaped values, 
and if I escape it myself like 
L\"\, p' Y a\, C G then that is written as "L\\"\\, p' Y a\\, C G\\, H"

How do we just disable this automatic escaping of characters?





> Disabling quotes while writing a dataframe
> --
>
> Key: SPARK-21678
> URL: https://issues.apache.org/jira/browse/SPARK-21678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
>
> Hi,
> I have my dataframe column values, which can contain commas, double quotes, etc.
> I am transforming the dataframes in order to ensure that all the required 
> values are escaped.
> However, on doing df.write.format("csv"), it again wraps the values in double 
> quotes. How do I disable that?
> And even if the double quotes are there to stay, why does it do the following:
> L"\, p' Y a\, C G is written as "L\"\\, p' Y a\\, C G\\, H", i.e. it 
> double-escapes the already escaped values, 
> and if I escape it myself like 
> L\"\, p' Y a\, C G then that is written as "L\\"\\, p' Y a\\, C G\\, H"
> How do we just disable this automatic escaping of characters?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21678) Disabling quotes while writing a dataframe

2017-08-09 Thread Taran Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taran Saini updated SPARK-21678:

Description: 
Hi,

I have my dataframe column values, which can contain commas, double quotes, etc.
I am transforming the dataframes in order to ensure that all the required 
values are escaped.

However, on doing df.write.format("csv"), it again wraps the values in double 
quotes. How do I disable that?
And even if the double quotes are there to stay, why does it do the following:
L"\, p' Y a\, C G is written as "L\"\\, p' Y a\\, C G\\, H", i.e. it 
double-escapes the already escaped values, 
and if I escape it myself like 
L\"\, p' Y a\, C G then that is written as "L\\"\\, p' Y a\\, C G\\, H"

How do we just disable this automatic escaping of characters?




  was:
Hi,

I have my dataframe column values, which can contain commas, double quotes, etc.
I am transforming the dataframes in order to ensure that all the required 
values are escaped.

However, on doing df.write.format("csv"), it again wraps the values in double 
quotes. How do I disable that?
And even if the double quotes are there to stay, why does it do the following:
L"\, p' Y a\, C G is written as "L\"\\, p' Y a\\, C G\\, H", i.e. it 
double-escapes the already escaped values, 
and if I escape it myself like 
L\"\, p' Y a\, C G then that is written as "L\\"\\, p' Y a\\, C G\\, H"

How do we just disable this automatic escaping of characters?





> Disabling quotes while writing a dataframe
> --
>
> Key: SPARK-21678
> URL: https://issues.apache.org/jira/browse/SPARK-21678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Taran Saini
>
> Hi,
> I have my dataframe column values, which can contain commas, double quotes, etc.
> I am transforming the dataframes in order to ensure that all the required 
> values are escaped.
> However, on doing df.write.format("csv"), it again wraps the values in double 
> quotes. How do I disable that?
> And even if the double quotes are there to stay, why does it do the following:
> L"\, p' Y a\, C G is written as "L\"\\, p' Y a\\, C G\\, H", i.e. it 
> double-escapes the already escaped values, 
> and if I escape it myself like 
> L\"\, p' Y a\, C G then that is written as "L\\"\\, p' Y a\\, C G\\, H"
> How do we just disable this automatic escaping of characters?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21678) Disabling quotes while writing a dataframe

2017-08-09 Thread Taran Saini (JIRA)
Taran Saini created SPARK-21678:
---

 Summary: Disabling quotes while writing a dataframe
 Key: SPARK-21678
 URL: https://issues.apache.org/jira/browse/SPARK-21678
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Taran Saini


Hi,

I have my dataframe column values, which can contain commas, double quotes, etc.
I am transforming the dataframes in order to ensure that all the required 
values are escaped.

However, on doing df.write.format("csv"), it again wraps the values in double 
quotes. How do I disable that?
And even if the double quotes are there to stay, why does it do the following:
L"\, p' Y a\, C G is written as "L\"\\, p' Y a\\, C G\\, H", i.e. it 
double-escapes the already escaped values, 
and if I escape it myself like 
L\"\, p' Y a\, C G then that is written as "L\\"\\, p' Y a\\, C G\\, H"

How do we just disable this automatic escaping of characters?
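
For reference, here is a minimal sketch of the DataFrameWriter CSV options that 
influence quoting and escaping (quote, escape, quoteAll, escapeQuotes). The 
\u0000 quote character is a commonly suggested workaround to suppress quoting, 
shown here as an illustrative assumption rather than a confirmed fix; the paths 
and master setting are placeholders:
{code:java}
import org.apache.spark.sql.SparkSession

object CsvQuotingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvQuotingSketch")
      .master("local[1]")          // assumption: run locally
      .getOrCreate()
    import spark.implicits._

    // Hypothetical value with pre-escaped commas and an embedded double quote.
    val df = Seq("""L"\, p' Y a\, C G""").toDF("value")

    // Default behaviour: values containing the quote/escape/delimiter
    // characters are wrapped in double quotes and escaped on write.
    df.write.mode("overwrite").format("csv").save("/tmp/csv-default")

    // Workaround sketch: use a quote character that never occurs in the data
    // and turn off quote escaping, so values are written through unchanged.
    df.write.mode("overwrite")
      .option("quote", "\u0000")
      .option("escapeQuotes", "false")
      .format("csv")
      .save("/tmp/csv-unquoted")

    spark.stop()
  }
}
{code}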






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org