Thanks Joris for all the great info. I thought VARCHAR(max_length) was part of the Parquet file specification, but there must be some intermediary layer AWS uses to map the logical string type you referenced onto that.
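For what it's worth, here is a minimal sketch (assuming pyarrow; the file name is just an example) showing what actually lands in the Parquet schema when Arrow writes a string column:

import pyarrow as pa
import pyarrow.parquet as pq

# Arrow string columns carry no VARCHAR(n)-style max length.
table = pa.table({"name": pa.array(["alice", "bob"], type=pa.string())})
pq.write_table(table, "example.parquet")

# Printing the Parquet schema shows the column stored with the
# binary physical type, annotated with the String logical type.
print(pq.ParquetFile("example.parquet").schema)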
The library you referenced looks promising. I'll definitely give that a shot; a rough sketch of what I have in mind is at the bottom of this mail. Thanks again!

Will

Sent from my iPhone

> On Oct 1, 2021, at 8:06 AM, Joris Van den Bossche <[email protected]> wrote:
>
> Hi Will,
>
> I am not familiar with Redshift, so take the following with a grain of salt.
> But a few notes about Parquet/Arrow:
>
> Parquet has "binary" (or "byte_array") and "fixed_len_byte_array" physical
> types
> (https://github.com/apache/parquet-format/blob/43c891a4494f85e2fe0e56f4ef408bcc60e8da48/src/main/thrift/parquet.thrift#L39-L40),
> and a "string" logical type which annotates the binary physical type
> (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string).
>
> When Arrow writes a Parquet file, it thus creates such "string" logical typed
> columns for string data. And you are correct that there is no way to
> influence how this is physically stored in the Parquet file (using anything
> other than "binary" would also violate the spec, AFAIU), and Parquet also
> doesn't have the concept of a "VARCHAR(n)": a column is either variable
> length (without a configurable max length) or fixed length.
>
> But it would seem strange if Redshift did not support string columns in
> Parquet at all. Do you have control over the schema of the table into which
> the Parquet file is read? E.g. I noticed
> https://aws-data-wrangler.readthedocs.io/en/2.4.0-docs/stubs/awswrangler.redshift.copy_from_files.html
> has a "varchar_lengths" keyword to control this.
>
> Best,
> Joris
>
>> On Thu, 30 Sept 2021 at 23:52, William Ayd <[email protected]> wrote:
>> Greetings Arrow Community,
>>
>> I am working on a project that drops data into Parquet files in an
>> Amazon S3 bucket and reads from there into Redshift. From what I can tell,
>> AWS Redshift will not properly read a Parquet file that contains a VARCHAR
>> data type with no max_length specification.
>>
>> Is there any way to pass that type information through to the Parquet
>> serializer? I searched through the documentation but nothing stood out to me.
>>
>> For reference, the best documentation I could find on the AWS side about not
>> supporting a blanket VARCHAR without max_length is the "Invalid Column
>> Type Error" section of this page:
>>
>> https://aws.amazon.com/premiumsupport/knowledge-center/redshift-spectrum-data-errors/
>>
>> Thanks,
>> Will
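And the rough sketch mentioned above, using the varchar_lengths keyword from the docs Joris linked (the bucket, table, connection, and IAM role names here are all hypothetical):

import awswrangler as wr

# Copy staged Parquet files from S3 into Redshift, pinning an
# explicit VARCHAR(n) length for each string column so Redshift
# never sees an unsized VARCHAR.
con = wr.redshift.connect("my-glue-connection")  # hypothetical connection
wr.redshift.copy_from_files(
    path="s3://my-bucket/staging/",  # hypothetical S3 prefix
    con=con,
    table="my_table",                # hypothetical target table
    schema="public",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-role",  # hypothetical
    varchar_lengths={"name": 256},   # per-column VARCHAR(n) lengths
    varchar_lengths_default=512,     # fallback for unmapped string columns
)
con.close()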
