Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22255
Hi, all.
A new PR has been opened. Please move to https://github.com/apache/spark/pull/22932
for further discussion.
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22255
That is the value used by the Parquet-MR library. We had better not touch
it. The Parquet-MR reader can behave differently depending on that version in
order to work around bugs in some older Parquet writers.
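
As a rough illustration of why the `created_by` value matters, a reader can split it into an application name, a version, and a build hash, and then branch on the version. The sketch below uses Python with only the standard library. The regular expression and the `parse_created_by` helper are assumptions made for illustration, not Parquet-MR's actual parser; the sample string is the one quoted in this thread.

```python
import re
from typing import NamedTuple, Optional


class CreatedBy(NamedTuple):
    application: str
    version: Optional[str]
    build: Optional[str]


# Assumed shape: "<application> version <version> (build <hash>)".
# The "(build ...)" part is treated as optional in this sketch.
_CREATED_BY = re.compile(
    r"^(?P<app>.+?)\s+version\s+(?P<version>[^\s(]+)"
    r"(?:\s*\(build\s+(?P<build>[^)]+)\))?\s*$"
)


def parse_created_by(value: str) -> Optional[CreatedBy]:
    """Split a created_by string into its parts, or return None if it does not match."""
    m = _CREATED_BY.match(value)
    if m is None:
        return None
    return CreatedBy(m.group("app"), m.group("version"), m.group("build"))


parsed = parse_created_by(
    "parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)"
)
```

Because a reader may key its workarounds on fields like these, overwriting `created_by` with a different application string could change how such a file is read.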
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/22255
Just to confirm: is `created_by` set to `parquet-mr version 1.10.0
(build 031a6654009e3b82020012a18434c582bd74c73a)`?
---
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22255
Wouldn't it cause some inconsistency if we choose either
`org.apache.spark.sql.create.version` or `spark.sql.create.version` as the key?
1) If we choose `spark.sql.create.version` as a
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22255
That would look like the following.
```
file:
file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22255
Currently, we write the metadata as follows.
```
file:
file:/tmp/p/part-5-dbb9a9ab-0d6a-49df-9f39-397c8505f22b-c000.snappy.parquet
creator: parquet-mr version
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22255
BTW, @rdblue recommended
[key_value_metadata](https://github.com/apache/spark/pull/22255#issuecomment-418169189).
Are we going to use `created_by` instead of `key_value_metadata`? Could you give
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22255
Sure, @gatorsmile .
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail:
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/22255
Also cc @hvanhovell
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/22255
@dongjoon-hyun Do you want to take this over?
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/22255
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L902
@rdblue Can we use `created_by`?
```
/** String for application that wrote this
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22255
Hi, @npoberezkin. Thank you for your first contribution. Could you update
your PR to use custom key-value metadata, following @rdblue's advice
above? Also, please use the tag `[SQL]`
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/22255
@npoberezkin, Parquet already supports custom key-value metadata in the
file footer. The Spark version would go there.
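
A minimal sketch of that idea, in Python with plain dictionaries standing in for the Parquet footer: `created_by` is left exactly as the library wrote it, and the Spark version is added as a custom key-value entry. The `with_spark_version` helper, the dictionary layout, and the `"x.y.z"` placeholder version are hypothetical; the key is one of the candidates mentioned in this thread, not a confirmed final choice.

```python
from typing import Any, Dict

# Candidate key taken from the discussion; the final name is an open question there.
SPARK_VERSION_KEY = "org.apache.spark.sql.create.version"


def with_spark_version(footer: Dict[str, Any], spark_version: str) -> Dict[str, Any]:
    """Return a copy of the footer with the Spark version added to key_value_metadata."""
    metadata = dict(footer.get("key_value_metadata", {}))
    metadata[SPARK_VERSION_KEY] = spark_version
    updated = dict(footer)
    updated["key_value_metadata"] = metadata
    return updated


# Hypothetical footer: only the two fields relevant to this discussion.
footer = {
    "created_by": "parquet-mr version 1.10.0 "
                  "(build 031a6654009e3b82020012a18434c582bd74c73a)",
    "key_value_metadata": {},
}
updated = with_spark_version(footer, "x.y.z")  # placeholder version string
```

The helper copies rather than mutates, so the original footer, including its `created_by` value, stays unchanged.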
---
Github user npoberezkin commented on the issue:
https://github.com/apache/spark/pull/22255
I understand your point now. Apparently I was a little confused by the
ticket descriptions.
I can try to implement these (writing info about writer.model, such as "avro",
in Spark), if
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/22255
I don't think this fits the intent of the model name. The model name is
intended to encode what the data model was that was written to Parquet. I can
write Avro records to a Parquet file, for
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22255
Hi @rdblue, does this look roughly OK to do here in Spark?
---
Github user npoberezkin commented on the issue:
https://github.com/apache/spark/pull/22255
Hello, @dbtsai, @HyukjinKwon. I added a test for reading writer.model.name to
the PR. The justification for this change is below.
This is the original JIRA:
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/22255
I would also suggest writing up the justification for this change, for instance
by linking to where this name is used on the Parquet side, potential uses, etc.
---
Github user dbtsai commented on the issue:
https://github.com/apache/spark/pull/22255
Does any other project write this into the footer? Are there tests that read
it back?
---
Github user npoberezkin commented on the issue:
https://github.com/apache/spark/pull/22255
@dbtsai Hello, I'm sorry for asking you directly, but for some reason
Jenkins did not generate the message "Can one of the admins verify this patch?".
I just saw that you've reviewed some other
Github user npoberezkin commented on the issue:
https://github.com/apache/spark/pull/22255
ok to test
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22255
Can one of the admins verify this patch?
---