[jira] [Comment Edited] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up
[ https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405606#comment-17405606 ] Taran Saini edited comment on SPARK-24156 at 8/27/21, 6:05 AM: --- [~kabhwan] We have a Kafka broker from which we continuously read via a Spark structured stream, perform some aggregations, and write the results out to a file sink (S3) in APPEND mode. Watermarking is used to accommodate late events (15 minutes; we tried smaller values as well):
{code:java}
.withWatermark("localTimeStamp", config.getString("spark.watermarkInterval"))
{code}
The groupBy clause defines the window and sliding interval (15 minutes in our case):
{code:java}
.groupBy(window($"localTimeStamp", batchInterval, config.getString("spark.slideInterval")),..,..)
{code}
Post aggregation(s), here is the snippet that writes the stream results to the file sink:
{code:java}
.repartition(1)
.writeStream
.partitionBy("date", "hour", "windowMinute")
.format("parquet")
.option("checkpointLocation", config.getString("spark.checkpointLocation"))
.trigger(Trigger.ProcessingTime(config.getString("spark.triggerInterval")))
.outputMode("append")
.option("path", s"${s3SinkLocation}/parquet/")
.start()
{code}
Here are the issues we observe:
1. The stream doesn't write output to the sink unless there is new data; if no events arrive in the current window, the previous window's results don't get flushed out.
2. Even with a continuous inflow of events, the partitioned output directories are not consistently created every trigger interval (15 minutes); it works many times but not always.
We tried setting `.option("parquet.block.size", 1024)`, thinking it might flush events every window if the size is greater than 1024 bytes, but that did not produce the desired results either. To summarise, `watermarking + append mode + file sink` is not working as expected (as it should, per the Spark documentation).
We are using Spark 3.0.x.

> Enable no-data micro batches for more eager streaming state clean up
> -
>
> Key: SPARK-24156
> URL: https://issues.apache.org/jira/browse/SPARK-24156
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.3.0
> Reporter: Tathagata Das
> Assignee: Tathagata Das
> Priority: Major
> Fix For: 2.4.0
>
> Currently, MicroBatchExecution in Structured Streaming runs batches only when there is new data to process. This is sensible in most cases, as we don't want to unnecessarily use resources when there is nothing new to process. However, in some cases of stateful streaming queries, this delays state clean-up as well as clean-up-based output. For example, consider a streaming aggregation query with watermark-based state cleanup. The watermark is updated after every batch with new data completes. The updated value is used in the next batch to clean up state and output finalized aggregates in append mode. However, if there is no data, then the next batch does not occur, and cleanup/output gets delayed unnecessarily. This is true for all stateful streaming operators: aggregation, deduplication, joins, mapGroupsWithState. This issue tracks the work to enable no-data batches in MicroBatchExecution. The major challenge is that all the tests of the relevant stateful operations add dummy data to force another batch for testing the state cleanup, so a lot of the tests are going to be changed. My plan is to enable no-data batches for different stateful operators one at a time.
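For reference, the code fragments in the comment above assemble into a pipeline along these lines. This is a hedged sketch, not the reporter's actual job: the Kafka source, the partition-column derivations, and all inlined option values are assumptions standing in for the config lookups.

```scala
// Sketch only: assumed source, partition-column derivations, and option values.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "events")                    // hypothetical topic
  .load()
  .select($"timestamp".as("localTimeStamp"))

events
  .withWatermark("localTimeStamp", "15 minutes")                  // spark.watermarkInterval
  .groupBy(window($"localTimeStamp", "15 minutes", "15 minutes")) // batchInterval, spark.slideInterval
  .count()
  .withColumn("date", to_date($"window.start"))                   // hypothetical partition columns
  .withColumn("hour", hour($"window.start"))
  .withColumn("windowMinute", minute($"window.start"))
  .repartition(1)
  .writeStream
  .partitionBy("date", "hour", "windowMinute")
  .format("parquet")
  .option("checkpointLocation", "s3://bucket/ckpt/")              // spark.checkpointLocation
  .option("path", "s3://bucket/parquet/")                         // s3SinkLocation
  .trigger(Trigger.ProcessingTime("15 minutes"))                  // spark.triggerInterval
  .outputMode("append")
  .start()
```

In append mode a window's aggregate is only emitted after the watermark passes the window's end, and the watermark itself is only re-evaluated when a micro-batch runs with new data, which is the behaviour this ticket (fixed in 2.4.0 via no-data batches) was opened to improve.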
[jira] [Commented] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up
[ https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405599#comment-17405599 ] Taran Saini commented on SPARK-24156: - [~thebluephantom] Did you see any resolution for this? We are facing the same problem: non-uniform delays in writing to the S3/file sink when using both watermarking and append mode (lots of people have raised the same). This is a major bug which should be re-tested and at least mentioned in the documentation. [~tdas] [~cloud_fan]

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
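The root cause the ticket describes can be captured in a toy model (a simplification on my part, not Spark's actual implementation): the watermark trails the maximum observed event time by the configured delay, and append mode finalizes a window only once the watermark passes the window's end.

```scala
// Toy model of watermark + append-mode semantics. Simplified: real Spark
// also requires a micro-batch to run before the watermark is re-evaluated,
// which is exactly what SPARK-24156's no-data batches enable.
case class Window(startMs: Long, endMs: Long)

val delayMs = 15 * 60 * 1000L // withWatermark("localTimeStamp", "15 minutes")

// Watermark = max event time seen so far, minus the allowed lateness.
def watermark(maxEventTimeMs: Long): Long = maxEventTimeMs - delayMs

// Append mode emits a window's aggregate only once the watermark
// has passed the window's end.
def emittable(w: Window, maxEventTimeMs: Long): Boolean =
  watermark(maxEventTimeMs) >= w.endMs
```

For a 10:00-10:15 window with a 15-minute delay, `emittable` only becomes true after an event stamped 10:30 or later has been seen; if no further events arrive, the watermark never advances and the finished window is never flushed to the sink, matching the behaviour reported in the comments above.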
[jira] [Updated] (SPARK-21684) df.write double escaping all the already escaped characters except the first one
[ https://issues.apache.org/jira/browse/SPARK-21684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Taran Saini updated SPARK-21684: Attachment: SparkQuotesTest2.scala PFA the same.

> df.write double escaping all the already escaped characters except the first one
>
> Key: SPARK-21684
> URL: https://issues.apache.org/jira/browse/SPARK-21684
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Taran Saini
> Attachments: SparkQuotesTest2.scala

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21684) df.write double escaping all the already escaped characters except the first one
Taran Saini created SPARK-21684: --- Summary: df.write double escaping all the already escaped characters except the first one Key: SPARK-21684 URL: https://issues.apache.org/jira/browse/SPARK-21684 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Taran Saini

Hi, if we have a dataframe with a column value of {noformat} ab\,cd\,ef\,gh {noformat} then on write it comes out as {noformat} "ab\,cd\\,ef\\,gh" {noformat} i.e. it double-escapes all the already escaped commas/delimiters except the first one. This is weird behaviour: it should either do it for all or for none. If I set df.option("escape", "") to empty, that solves this problem, but then any double quotes inside values are preceded by a special character, '\u00'. Why does it do that when the escape character is set to "" (empty)?
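As a sketch of what the report describes (assumed Spark 2.2 CSV writer defaults; untested here, with hypothetical output paths), the writer escapes the configured escape character itself, so values that were escaped by hand get escaped a second time on write:

```scala
// Sketch only: reproduces the reported double-escaping (not a verified test).
import spark.implicits._

// A value whose delimiters were already escaped by hand:
val pre = Seq("""ab\,cd\,ef\,gh""").toDF("value")

// With an escape character configured, the writer escapes the embedded
// backslashes again, producing output like "ab\,cd\\,ef\\,gh".
pre.write.option("escape", "\\").csv("/tmp/double-escaped") // hypothetical path

// Likely workaround: hand the writer *unescaped* values and let it do all
// quoting/escaping in one place.
val raw = Seq("ab,cd,ef,gh").toDF("value")
raw.write.option("quote", "\"").option("escape", "\\").csv("/tmp/escaped-once")
```

The design point here is that escaping is idempotent only if exactly one layer performs it; pre-escaping values and then letting the writer escape again inevitably doubles the escape characters.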
[jira] [Reopened] (SPARK-21678) Disabling quotes while writing a dataframe
[ https://issues.apache.org/jira/browse/SPARK-21678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Taran Saini reopened SPARK-21678: -

> Disabling quotes while writing a dataframe
> --
>
> Key: SPARK-21678
> URL: https://issues.apache.org/jira/browse/SPARK-21678
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Taran Saini
>
> Hi,
> My dataframe column values can contain commas, double quotes, etc.
> I am transforming the dataframes to ensure that all the required values are escaped.
> However, on doing df.write.format("csv"), it again wraps the values in double quotes. How do I disable that?
> And even if the double quotes are there to stay, why does it do the following:
> {noformat}
> L"\, p' Y a\, C G
> {noformat}
> is written as
> {noformat}
> "L\"\\, p' Y a\\, C G\\, H"
> {noformat}
> i.e. it double-escapes the already escaped values.
> And if I escape it myself, like:
> {noformat}
> L\"\, p' Y a\, C G
> {noformat}
> then that is written as
> {noformat}
> "L\\"\\, p' Y a\\, C G\\, H"
> {noformat}
> How do we just disable this automatic escaping of characters?
[jira] [Comment Edited] (SPARK-21678) Disabling quotes while writing a dataframe
[ https://issues.apache.org/jira/browse/SPARK-21678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119898#comment-16119898 ] Taran Saini edited comment on SPARK-21678 at 8/9/17 1:30 PM: - This is not a question; this is a bug! If only somebody would read this and let me know whether it is a bug or a question.
[jira] [Created] (SPARK-21678) Disabling quotes while writing a dataframe
Taran Saini created SPARK-21678: --- Summary: Disabling quotes while writing a dataframe Key: SPARK-21678 URL: https://issues.apache.org/jira/browse/SPARK-21678 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Taran Saini

Hi, my dataframe column values can contain commas, double quotes, etc. I am transforming the dataframes to ensure that all the required values are escaped. However, on doing df.write.format("csv"), it again wraps the values in double quotes. How do I disable that? And even if the double quotes are there to stay, why is L"\, p' Y a\, C G written as "L\"\\, p' Y a\\, C G\\, H", i.e. why does it double-escape the already escaped values? And if I escape it myself, like L\"\, p' Y a\, C G, then that is written as "L\\"\\, p' Y a\\, C G\\, H". How do we just disable this automatic escaping of characters?
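A workaround often suggested for suppressing the writer's quoting is to set the quote character to the NUL character. This is a hedged sketch: the `\u0000` trick is community folklore for Spark 2.x, its behaviour varies across versions, and it is not a verified fix for this issue.

```scala
// Sketch: suppress the CSV quote character entirely (assumed behaviour on
// Spark 2.2; not a verified fix).
df.write
  .format("csv")
  .option("quote", "\u0000") // NUL as quote char, so no real quoting occurs
  .save("/tmp/unquoted")     // hypothetical path
```

The reporter's `option("escape", "")` experiment effectively does something similar for the escape character, which is presumably why a stray `\u00` byte then appears before embedded double quotes in the output.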