My Redshift knowledge is probably both rusty and inadequate, but I believe Redshift requires a data catalog in addition to your data. This is basically a dedicated dataset schema; AWS Glue is the technology I think they use for that data catalog. If I read the link you posted correctly, it is complaining that the data catalog (not the data itself) should have a parameterized varchar type. So you might look at whatever process you are using to generate your data catalog (e.g. are you using `CREATE EXTERNAL TABLE` or "crawlers"?) and see if you can update that catalog creation process to define the varchar length.
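If it helps, here is a rough, untested sketch of what I mean using AWS Data Wrangler's catalog helpers. The database, table, S3 path, column names, and the varchar lengths are all placeholders I made up, so please check the awswrangler docs for the exact signature before relying on this:

    # Untested sketch: register the Parquet files in the Glue data catalog
    # with explicit varchar lengths instead of a bare "string" type.
    # Database, table, S3 path, and column names below are placeholders.
    import awswrangler as wr

    wr.catalog.create_parquet_table(
        database="my_database",            # Glue database (placeholder)
        table="my_table",                  # Glue table name (placeholder)
        path="s3://my-bucket/my-prefix/",  # where the Parquet files live (placeholder)
        columns_types={
            "id": "bigint",
            "name": "varchar(256)",        # parameterized varchar instead of "string"
            "comment": "varchar(1024)",
        },
    )

The idea is just that the length lives in the catalog definition rather than in the Parquet data itself.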
On Fri, Oct 1, 2021 at 7:24 AM William Ayd <[email protected]> wrote:
>
> Thanks Joris for all the great info. I thought the VARCHAR(max_length) was
> part of the Parquet file specification, but there must be some intermediary
> layer AWS is using to map to that from the Logical string type you’ve
> referenced.
>
> The library you referenced looks promising. I’ll definitely give that a shot.
>
> Thanks again!
> Will
>
> Sent from my iPhone
>
> On Oct 1, 2021, at 8:06 AM, Joris Van den Bossche
> <[email protected]> wrote:
>
> Hi Will,
>
> I am not familiar with Redshift, so take the following with a grain of salt.
> But a few notes about Parquet/Arrow:
>
> Parquet has "binary" (or "byte_array") and "fixed_len_byte_array" physical
> types
> (https://github.com/apache/parquet-format/blob/43c891a4494f85e2fe0e56f4ef408bcc60e8da48/src/main/thrift/parquet.thrift#L39-L40),
> and a "string" logical type which annotates the binary physical type
> (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string).
>
> When Arrow writes a Parquet file, it thus creates such "string" logical typed
> columns for string data. And you are correct that there is no way to
> influence how this is physically stored in the Parquet file (using anything
> else as "binary" would also violate the spec, AFAIU), and Parquet also
> doesn't know the concept of "VARCHAR(n)" (it's either variable length,
> without configurable max length, or fixed length).
>
> But it would seem strange if Redshift did not support string columns in
> Parquet whatsoever. Do you have control over the table's schema into which
> the Parquet file is read? Eg I noticed
> https://aws-data-wrangler.readthedocs.io/en/2.4.0-docs/stubs/awswrangler.redshift.copy_from_files.html
> has a "varchar_lengths" keyword to control this.
>
> Best,
> Joris
>
> On Thu, 30 Sept 2021 at 23:52, William Ayd <[email protected]> wrote:
>>
>> Greetings Arrow Community,
>>
>> I am working on a project that drops data into Parquet files on an
>> Amazon S3 bucket and reads from there into Redshift. From what I can tell,
>> AWS Redshift will not read a parquet file properly that contains a VARCHAR
>> data type with no max_length specification.
>>
>> Is there any way to pass that type information through to the parquet
>> serializer? I searched through the documentation but nothing stood out at me.
>>
>> For reference, the best documentation I could find on the AWS side about not
>> supporting a blanket VARCHAR without max_length is in the "Invalid Column
>> Type Error" from this documentation:
>>
>> https://aws.amazon.com/premiumsupport/knowledge-center/redshift-spectrum-data-errors/
>>
>> Thanks,
>> Will
>>
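P.S. For the awswrangler route Joris mentioned, a minimal sketch might look something like the following (untested; the Glue connection name, IAM role, schema/table names, and the per-column lengths are placeholders, so double-check them against the copy_from_files docs linked above):

    # Untested sketch of the awswrangler route mentioned above: COPY the
    # Parquet files from S3 into Redshift with explicit varchar lengths.
    # Connection name, IAM role, schema/table, and lengths are placeholders.
    import awswrangler as wr

    con = wr.redshift.connect("my-glue-redshift-connection")  # Glue connection name (placeholder)
    try:
        wr.redshift.copy_from_files(
            path="s3://my-bucket/my-prefix/",   # Parquet files written by Arrow (placeholder)
            con=con,
            table="my_table",
            schema="public",
            iam_role="arn:aws:iam::123456789012:role/my-redshift-role",  # placeholder
            varchar_lengths={"name": 256, "comment": 1024},  # per-column max lengths
            varchar_lengths_default=256,        # fallback for other string columns
        )
    finally:
        con.close()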
