cool, thanks for sharing the sample!

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Fri, May 6, 2022 at 1:31 PM Howard Engelhart <[email protected]>
wrote:

> Thanks Aldrin and Weston!  Following your suggestions, I was able to encode
> the schema such that Athena recognized it.  In case it helps anyone,
> here's some sample code:
>
> import { Schema, Field, Utf8, Table, RecordBatchStreamWriter, Int32, Bool,
>   DateMillisecond, DateDay } from 'apache-arrow';
>
> // describe the table's columns
> const s = new Schema([
>   new Field('name', new Utf8),
>   new Field('address', new Utf8),
>   new Field('active', new Bool),
>   new Field('count', new Int32),
>   new Field('birthday', new DateDay),
>   new Field('created', new DateMillisecond)
> ]);
>
> // write an empty table (schema message only, no record batches) as an IPC stream
> const w = new RecordBatchStreamWriter();
> w.write(new Table(s));
>
> // base64-encode the stream bytes for the GetTableResponse
> const encodedSchema = Buffer.from(w.toUint8Array(true)).toString('base64');
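>
> As a quick sanity check, decoding that buffer again should give back the
> same fields. This is only an untested sketch, and it assumes an
> apache-arrow version that exports tableFromIPC:
>
> import { tableFromIPC } from 'apache-arrow';
>
> // read the schema-only stream back; the resulting table has no rows
> const decoded = tableFromIPC(Buffer.from(encodedSchema, 'base64'));
> console.log(decoded.schema.fields.map((f) => f.name));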
>
>
>
> On Fri, May 6, 2022 at 3:53 PM Aldrin <[email protected]> wrote:
>
>> I didn't think of this as a possible solution, for some reason, but I
>> think it actually makes a lot of sense. Just as a reference, this is
>> something I currently do when storing data in a key-value interface:
>>
>>    - I write one buffer that contains only the schema (no batches)
>>    - I write the batches to separate buffers
>>       - these are sized to fully utilize the space for each key-value
>>
>> The key-value that contains only the schema can then be read on its own.
>>
>> I believe my approach for doing this can be seen in [1]; I use the
>> StreamWriter because I want an in-memory format that is streamable.
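>>
>> In JS terms, the same layout would look roughly like the sketch below
>> (untested; my real code is the C++ in [1], and the table here is only a
>> placeholder for whatever data is being stored):
>>
>> import { tableFromArrays, Table, RecordBatchStreamWriter } from 'apache-arrow';
>>
>> const table = tableFromArrays({ count: Int32Array.from([1, 2, 3]) });
>>
>> // key 0: a stream that carries only the schema message, no batches
>> const schemaBuffer =
>>   RecordBatchStreamWriter.writeAll(new Table(table.schema)).toUint8Array(true);
>>
>> // one key per batch: each batch becomes its own self-describing stream
>> const batchBuffers = table.batches.map((batch) =>
>>   RecordBatchStreamWriter.writeAll([batch]).toUint8Array(true)
>> );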
>>
>> [1]:
>> https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/mainline/src/cpp/processing/dataformats.cpp#L16
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>>
>> On Fri, May 6, 2022 at 12:04 PM Weston Pace <[email protected]>
>> wrote:
>>
>>> Can you serialize the schema by creating an IPC file with zero record
>>> batches?  I apologize, but I do not know the JS API very well.  Maybe
>>> you can create a table from just a schema (or a schema and a set of
>>> empty arrays) and then turn that into an IPC file?  This shouldn't add
>>> too much overhead.
>>>
>>> On Thu, May 5, 2022 at 8:23 AM Howard Engelhart
>>> <[email protected]> wrote:
>>> >
>>> > I'm looking to implement an Athena federated query custom connector
>>> > using the arrow js lib.  I'm getting stuck on figuring out how to
>>> > encode a Schema properly for the Athena GetTableResponse.  I have found
>>> > an example using Python that does something like this (paraphrasing...):
>>> >
>>> > import pyarrow as pa
>>> > .....
>>> >        return {
>>> >             "@type": "GetTableResponse",
>>> >             "catalogName": self.catalogName,
>>> >             "tableName": {'schemaName': self.databaseName,
>>> >                           'tableName': self.tableName},
>>> >             "schema": {"schema": base64.b64encode(
>>> >                 pa.schema(....args...).serialize().slice(4)).decode("utf-8")},
>>> >             "partitionColumns": self.partitions,
>>> >             "requestType": self.request_type
>>> >         }
>>> > What I'm looking for is the JS equivalent of
>>> > pa.schema(....args...).serialize()
>>> >
>>> > Is there one?  If not, could someone point me in the right direction
>>> > of how to code up something similar?
>>>
>>
