[jira] [Commented] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3
[ https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603819#comment-17603819 ]

Drew commented on SPARK-40287:
--

Hey [~ste...@apache.org], yes, I see the same behavior under those conditions as well. It looks like the data is again being moved to the new table location.

> Load Data using Spark by a single partition moves entire dataset under same location in S3
> --
>
> Key: SPARK-40287
> URL: https://issues.apache.org/jira/browse/SPARK-40287
> Project: Spark
> Issue Type: Question
> Components: Spark Core
> Affects Versions: 3.2.1
> Reporter: Drew
> Priority: Major
>
> Hello,
> I'm experiencing an issue in PySpark when creating a Hive table and loading data into it. I'm using an Amazon S3 bucket as the data location, creating the table as Parquet, and trying to load data into that table by a single partition, and I'm seeing some odd behavior: when I point the LOAD DATA command at the S3 location of a Parquet dataset, *all* of the data is moved into the location specified in my CREATE TABLE command, including the partitions I didn't specify in the LOAD DATA command. For example:
> {code:java}
> # create a data frame in pyspark with partitions
> df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")],
>                            ["c1", "c2", "p"])
> # save it to S3
> df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
> {code}
> At this point S3 should have a new folder `data` with one folder per partition, each containing its parquet file(s):
>
> - s3://bucket/data/p=x/
>   - part-1.snappy.parquet
> - s3://bucket/data/p=y/
>   - part-2.snappy.parquet
>   - part-3.snappy.parquet
>
> {code:java}
> # create new table
> spark.sql("create table src (c1 string, c2 int) PARTITIONED BY (p string) STORED AS parquet LOCATION 's3://bucket/new/'")
> # load the saved table data from s3, specifying the single partition value x
> spark.sql("LOAD DATA INPATH 's3://bucket/data/' INTO TABLE src PARTITION (p='x')")
> spark.sql("select * from src").show()
> # output:
> # +---+---+---+
> # | c1| c2|  p|
> # +---+---+---+
> # +---+---+---+
> {code}
> After running the LOAD DATA command, the table has no data loaded in. Checking S3, the data source we saved earlier has been moved under `s3://bucket/new/`; oddly enough, it also brought over the other partitions along with it. Directory structure listed below:
> - s3://bucket/new/
>   - p=x/
>     - p=x/
>       - part-1.snappy.parquet
>     - p=y/
>       - part-2.snappy.parquet
>       - part-3.snappy.parquet
>
> Is this the intended behavior when loading data in from a partitioned Parquet directory? Is the source data supposed to be moved/deleted from the source directory?

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
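A note on the semantics involved: Hive's LOAD DATA moves everything under the supplied INPATH into the target partition directory, which would explain the behavior above — pointing INPATH at the dataset root drags every partition folder along. A hedged workaround, assuming standard Hive LOAD DATA semantics, is to point INPATH at the single partition subdirectory instead. A plain-Python sketch that builds such a partition-scoped statement (the `load_partition_sql` helper is hypothetical; table and bucket names are taken from the example above):

```python
def load_partition_sql(source_root: str, table: str, part_col: str, part_val: str) -> str:
    """Build a LOAD DATA statement scoped to one partition directory.

    Pointing INPATH at the dataset root moves the files of *every*
    partition into the target location; pointing it at the
    <part_col>=<part_val> subdirectory moves only that partition's files.
    """
    src = f"{source_root.rstrip('/')}/{part_col}={part_val}/"
    return (
        f"LOAD DATA INPATH '{src}' "
        f"INTO TABLE {table} PARTITION ({part_col}='{part_val}')"
    )

sql = load_partition_sql("s3://bucket/data", "src", "p", "x")
print(sql)
# on a live session this would then be run as: spark.sql(sql)
```

This keeps the `p=y` files untouched in the source location, since LOAD DATA only ever sees the `p=x` subtree.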
[jira] [Commented] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3
[ https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598835#comment-17598835 ]

Steve Loughran commented on SPARK-40287:

does this happen when
# you switch to an ASF Spark build with the s3a connector
# and use an s3a committer safe to use with Spark?

this is clearly EMR (s3:// URLs), so they have to be the people to talk to if you can't replicate it in the Apache code

> Load Data using Spark by a single partition moves entire dataset under same location in S3
> (issue description quoted in the previous message)
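For anyone trying the suggestion above on an ASF build, a minimal sketch of wiring up an S3A committer in PySpark. The property names come from the Hadoop S3A committer documentation; the choice of the `magic` committer is an assumption, and the sketch presumes the `spark-hadoop-cloud` module is on the classpath:

```python
from pyspark.sql import SparkSession

# Sketch only: configures Spark SQL writes to go through an S3A committer
# instead of the default rename-based commit, which is unsafe on S3.
spark = (
    SparkSession.builder
    .appName("s3a-committer-repro")
    # pick the "magic" S3A committer (an assumption; "directory" also exists)
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    # route Spark SQL output commits through the S3A committer bindings
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

# writes would then use s3a:// URLs rather than EMR's s3://
# df.write.format("parquet").partitionBy("p").save("s3a://bucket/data/")
```

With this in place, a failure that still reproduces would point at Spark/Hadoop code rather than EMR's proprietary s3:// filesystem.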