Flight and sparse data

Matt Youill Wed, 01 Jun 2022 22:17:56 -0700

Hi,

I have a question regarding Flight and sparse data...


Suppose you have a data set where some records are missing values.
Consuming those records in batches may mean a different schema for each
batch.

If a field is known to be missing, it isn't possible to infer its type.
If the fields aren't known in advance, it isn't possible to include the
missing fields in the schema at all. For example, suppose the following
two partitions of a notionally single data set are read into two
batches of three records each.

A, B
1, 2
4, 5
7, 8

A,  B,  C
10, 11, 12
13, 14, 15
16, 17, 18

Batch 1 may get schema ((A, int), (B, int)) while batch 2 may get ((A,
int), (B, int), (C, int)). Or, in the case where we know C *should*
exist, we could set the batch 1 schema to ((A, int), (B, int), (C, null
or some other "undefined" type?)).
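
To make the schema mismatch concrete, here is a rough sketch in plain
Python. It is not Arrow code: a "schema" is just a dict of column name to
type name, "null" stands in for the "undefined" placeholder type, and the
helpers infer_schema and unify are hypothetical names used only for
illustration.

```python
def infer_schema(columns, rows):
    """Infer a {column name: type name} schema from one batch of rows."""
    schema = {}
    for index, name in enumerate(columns):
        kinds = {type(row[index]).__name__ for row in rows if row[index] is not None}
        if not kinds:
            # No values seen for this column, so the type is unknown.
            schema[name] = "null"
        elif len(kinds) == 1:
            schema[name] = kinds.pop()
        else:
            raise TypeError(f"column {name!r} has mixed types: {sorted(kinds)}")
    return schema


def unify(left, right):
    """Merge two schemas, letting a concrete type replace a "null" placeholder."""
    merged = dict(left)
    for name, kind in right.items():
        known = merged.get(name, "null")
        if known == "null":
            merged[name] = kind
        elif kind != "null" and kind != known:
            raise TypeError(f"column {name!r}: {known} conflicts with {kind}")
    return merged


batch1_schema = infer_schema(["A", "B"], [(1, 2), (4, 5), (7, 8)])
batch2_schema = infer_schema(["A", "B", "C"], [(10, 11, 12), (13, 14, 15), (16, 17, 18)])

# The two batches disagree: batch 1 has no column C at all.
print(batch1_schema)  # {'A': 'int', 'B': 'int'}
print(batch2_schema)  # {'A': 'int', 'B': 'int', 'C': 'int'}

# If C is known to exist, batch 1 can carry it with a placeholder type,
# which is only resolved once a batch with real C values has been seen.
batch1_padded = {**batch1_schema, "C": "null"}
print(unify(batch1_padded, batch2_schema))  # {'A': 'int', 'B': 'int', 'C': 'int'}
```

The catch is the last step: the placeholder can only be resolved after
batch 2 has been seen, which is exactly what a stream does not allow up
front.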

This isn't an issue when working with individual batches, but becomes
problematic when working with data structures that aggregate batches
(e.g. Table, RecordBatchReader, etc). Most of these data structures seem
to assume that the schema is that of the first contained record batch -
which is usually fine or can be worked around.

What I can't figure out, however, is how to deal with FlightDataStream,
which AFAICT wants a single schema for a stream of record batches. The
record batches may have different schemas, and it isn't possible to
have a view of the entire stream of batches to resolve discrepancies
prior to transmitting the stream. Or, indeed, how would one fix the
discrepancies at the receiving end?
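
As an illustration of what fixing discrepancies at the receiving end
could mean, here is a rough sketch in plain Python. It is not Flight or
Arrow code: each batch is assumed to arrive as a (columns, rows) pair,
and reconcile is a hypothetical helper. It also assumes the receiver can
buffer the whole stream before reconciling, which may not hold for large
streams.

```python
def reconcile(batches):
    """Merge batches with differing columns into one column list and padded rows."""
    columns = []  # union of column names, in first-seen order
    records = []  # one dict per row, keyed by the columns that row actually had
    for batch_columns, rows in batches:
        for name in batch_columns:
            if name not in columns:
                columns.append(name)
        for row in rows:
            records.append(dict(zip(batch_columns, row)))
    # Rows from batches that lacked a column get None in that position.
    table = [tuple(record.get(name) for name in columns) for record in records]
    return columns, table


def incoming():
    """Stand-in for a stream whose batches are only visible one at a time."""
    yield ["A", "B"], [(1, 2), (4, 5), (7, 8)]
    yield ["A", "B", "C"], [(10, 11, 12), (13, 14, 15), (16, 17, 18)]


columns, table = reconcile(incoming())
print(columns)   # ['A', 'B', 'C']
print(table[0])  # (1, 2, None)
print(table[3])  # (10, 11, 12)
```

This only works after the fact. It does nothing for the sending side,
which still has to declare one schema before the first batch goes out.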

Is there a natural way to work with Flight and sparse streaming data?

Thanks, Matt
