Hi Will,

I am not familiar with Redshift, so take the following with a grain of
salt. But a few notes about Parquet/Arrow:

Parquet has "binary" (or "byte_array") and "fixed_len_byte_array" physical
types (
https://github.com/apache/parquet-format/blob/43c891a4494f85e2fe0e56f4ef408bcc60e8da48/src/main/thrift/parquet.thrift#L39-L40),
and a "string" logical type which annotates the binary physical type (
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string
).

When Arrow writes a Parquet file, it thus creates such "string" logical
typed columns for string data. And you are correct that there is no way to
influence how this is physically stored in the Parquet file (using anything
else as "binary" would also violate the spec, AFAIU), and Parquet also
doesn't know the concept of "VARCHAR(n)" (it's either variable length
(without configurable max length), or either fixed length)

But it would seem strange if Redshift did not support string columns in
Parquet whatsoever. Do you have control over the table's schema into which
the Parquet file is read? Eg I noticed
https://aws-data-wrangler.readthedocs.io/en/2.4.0-docs/stubs/awswrangler.redshift.copy_from_files.html
has a "varchar_lengths" keyword to control this.

Best,
Joris

On Thu, 30 Sept 2021 at 23:52, William Ayd <[email protected]> wrote:

> Greetings Arrow Community,
>
> I am working with on a project that drops data into Parquet files on an
> Amazon S3 bucket and reads from there into Redshift. From what I can tell,
> AWS Redshift will not read a parquet file properly that contains a VARCHAR
> data type with no max_length specification.
>
> Is there any way to pass that type information through to the parquet
> serializer? I searched through the documentation but nothing stood out at
> me.
>
> For reference, the best documentation I could find on the AWS side about
> not supporting a blanket VARCHAR without max_length is in the "Invalid
> Column Type Error" from this documentation:
>
>
> https://aws.amazon.com/premiumsupport/knowledge-center/redshift-spectrum-data-errors/
>
> Thanks,
> Will
>
>
>

Reply via email to