Hi, I have a question regarding Flight and sparse data...
Suppose you have a data set where some records are missing values. Consuming those records in batches may mean a different schema for each batch. When a field is known to be missing from a batch, it isn't possible to infer its type. When the fields aren't known in advance, it isn't possible to include the missing fields in the schema at all.

For example, suppose the following 2 partitions of a notionally single data set are read into 2 batches of 3 records each.

    A, B
    1, 2
    4, 5
    7, 8

    A, B, C
    10, 11, 12
    13, 14, 15
    16, 17, 18

Batch 1 may get schema ((A, int), (B, int)) while batch 2 may get ((A, int), (B, int), (C, int)). Or, in the case where we know C *should* exist, we could set the batch 1 schema to ((A, int), (B, int), (C, null or some other "undefined" type?)).

This isn't an issue when working with individual batches, but it becomes problematic when working with data structures that aggregate batches (e.g. Table, RecordBatchReader, etc). Most of these data structures seem to assume that the schema is that of the first contained record batch, which is usually fine or can be worked around.

What I can't figure out, however, is how to deal with FlightDataStream, which AFAICT wants a single schema for a stream of record batches. The record batches may have different schemas, and it isn't possible to have a view of the entire stream of batches to resolve discrepancies before transmitting the stream. Or should the discrepancies instead be fixed at the receiving end?

Is there a natural way to work with Flight and sparse streaming data?

Thanks,
Matt
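
P.S. To make the two cases concrete, here is a minimal Python sketch. It uses plain dicts of columns as stand-ins for record batches rather than any Arrow or Flight API, and `pad_to_schema` and `union_fields` are hypothetical helper names made up for illustration.

```python
from typing import Dict, List, Optional

# A "batch" here is just column name -> list of values (not an Arrow RecordBatch).
Batch = Dict[str, List[Optional[int]]]


def pad_to_schema(batch: Batch, fields: List[str]) -> Batch:
    """Return a copy of batch containing every name in fields, in that order.

    Columns absent from the batch are filled with None (standing in for nulls).
    """
    num_rows = len(next(iter(batch.values()))) if batch else 0
    return {
        name: list(batch[name]) if name in batch else [None] * num_rows
        for name in fields
    }


def union_fields(batches: List[Batch]) -> List[str]:
    """Collect field names in first-seen order. This needs to see every batch."""
    seen: List[str] = []
    for batch in batches:
        for name in batch:
            if name not in seen:
                seen.append(name)
    return seen


batch1: Batch = {"A": [1, 4, 7], "B": [2, 5, 8]}
batch2: Batch = {"A": [10, 13, 16], "B": [11, 14, 17], "C": [12, 15, 18]}

# Case 1: C is known to exist, so each batch can be padded on its own.
known_fields = ["A", "B", "C"]
padded = [pad_to_schema(b, known_fields) for b in (batch1, batch2)]

# Case 2: fields are not known in advance, so the full field list is only
# available after every batch has been inspected.
all_fields = union_fields([batch1, batch2])
```

The first case works one batch at a time, so it would fit a stream if there were a sensible type to give the empty column. The second needs the whole stream up front, which is exactly what I don't have.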
