My Redshift knowledge is probably both rusty and inadequate, but I believe Redshift requires a data catalog in addition to your data. This is basically a dedicated dataset schema; AWS Glue is the technology I think they use for that data catalog. If I read the link you posted correctly, it is complaining that the data catalog (not the data itself) should have a parameterized varchar type. So you might look at whatever process you are using to generate your data catalog (e.g. are you using `CREATE EXTERNAL TABLE` or "crawlers"?) and see if you can update that catalog creation process to define the varchar length.
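If it helps, here is a rough, untested sketch of what I mean using AWS Data Wrangler's catalog helpers. The database, table, S3 path, column names, and the varchar lengths are all placeholders I made up, so please check the awswrangler docs for the exact signature before relying on this:

    # Untested sketch: register the Parquet files in the Glue data catalog
    # with explicit varchar lengths instead of a bare "string" type.
    # Database, table, S3 path, and column names below are placeholders.
    import awswrangler as wr

    wr.catalog.create_parquet_table(
        database="my_database",            # Glue database (placeholder)
        table="my_table",                  # Glue table name (placeholder)
        path="s3://my-bucket/my-prefix/",  # where the Parquet files live (placeholder)
        columns_types={
            "id": "bigint",
            "name": "varchar(256)",        # parameterized varchar instead of "string"
            "comment": "varchar(1024)",
        },
    )

The idea is just that the length lives in the catalog definition rather than in the Parquet data itself.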
On Fri, Oct 1, 2021 at 7:24 AM William Ayd <[email protected]> wrote:
>
> Thanks Joris for all the great info. I thought the VARCHAR(max_length) was
> part of the Parquet file specification, but there must be some intermediary
> layer AWS is using to map to that from the Logical string type you’ve
> referenced.
>
> The library you referenced looks promising. I’ll definitely give that a shot.
>
> Thanks again!
> Will
>
> Sent from my iPhone
>
> On Oct 1, 2021, at 8:06 AM, Joris Van den Bossche
> <[email protected]> wrote:
>
> Hi Will,
>
> I am not familiar with Redshift, so take the following with a grain of salt.
> But a few notes about Parquet/Arrow:
>
> Parquet has "binary" (or "byte_array") and "fixed_len_byte_array" physical
> types
> (https://github.com/apache/parquet-format/blob/43c891a4494f85e2fe0e56f4ef408bcc60e8da48/src/main/thrift/parquet.thrift#L39-L40),
> and a "string" logical type which annotates the binary physical type
> (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string).
>
> When Arrow writes a Parquet file, it thus creates such "string" logical typed
> columns for string data. And you are correct that there is no way to
> influence how this is physically stored in the Parquet file (using anything
> else as "binary" would also violate the spec, AFAIU), and Parquet also
> doesn't know the concept of "VARCHAR(n)" (it's either variable length,
> without configurable max length, or fixed length).
>
> But it would seem strange if Redshift did not support string columns in
> Parquet whatsoever. Do you have control over the table's schema into which
> the Parquet file is read? Eg I noticed
> https://aws-data-wrangler.readthedocs.io/en/2.4.0-docs/stubs/awswrangler.redshift.copy_from_files.html
> has a "varchar_lengths" keyword to control this.
>
> Best,
> Joris
>
> On Thu, 30 Sept 2021 at 23:52, William Ayd <[email protected]> wrote:
>>
>> Greetings Arrow Community,
>>
>> I am working on a project that drops data into Parquet files on an
>> Amazon S3 bucket and reads from there into Redshift. From what I can tell,
>> AWS Redshift will not read a parquet file properly that contains a VARCHAR
>> data type with no max_length specification.
>>
>> Is there any way to pass that type information through to the parquet
>> serializer? I searched through the documentation but nothing stood out at me.
>>
>> For reference, the best documentation I could find on the AWS side about not
>> supporting a blanket VARCHAR without max_length is in the "Invalid Column
>> Type Error" from this documentation:
>>
>> https://aws.amazon.com/premiumsupport/knowledge-center/redshift-spectrum-data-errors/
>>
>> Thanks,
>> Will
>>
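P.S. For the awswrangler route Joris mentioned, a minimal sketch might look something like the following (untested; the Glue connection name, IAM role, schema/table names, and the per-column lengths are placeholders, so double-check them against the copy_from_files docs linked above):

    # Untested sketch of the awswrangler route mentioned above: COPY the
    # Parquet files from S3 into Redshift with explicit varchar lengths.
    # Connection name, IAM role, schema/table, and lengths are placeholders.
    import awswrangler as wr

    con = wr.redshift.connect("my-glue-redshift-connection")  # Glue connection name (placeholder)
    try:
        wr.redshift.copy_from_files(
            path="s3://my-bucket/my-prefix/",   # Parquet files written by Arrow (placeholder)
            con=con,
            table="my_table",
            schema="public",
            iam_role="arn:aws:iam::123456789012:role/my-redshift-role",  # placeholder
            varchar_lengths={"name": 256, "comment": 1024},  # per-column max lengths
            varchar_lengths_default=256,        # fallback for other string columns
        )
    finally:
        con.close()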
