Aldrin,

  Thanks for the detailed answer. I’m familiar with many of the concepts of 
distributed storage, but we don’t intend to do any data replication.
At most, we will have a single metadata server that stores information about 
data sources. Regardless, we’ll take a look at the links you provided.

From: Aldrin <[email protected]>
Date: Friday, January 13, 2023 at 1:44 PM
To: [email protected] <[email protected]>
Subject: Re: Sharded Flight Server: Continued
1. Setup a single flight metadata
 Implement the subset of flight interface that supports metadata operations 
(such as GetFlightInfo() mentioned by David)

2. Setup several sharded data Flight servers
Implement the subset of flight interface that supports data operations (such as 
DoGet() mentioned by David)

3. Setup clients that distribute data to sharded data servers:
Even looking at the original thread, I am mildly confused. Between A and B, I 
am not sure what you're asserting or what you're trying to clarify.

    a. All data is pushed to servers by clients, no direct access. This assumes 
that data is  placed in memory on the data servers?
When writing new data (and often when sending updates), the client sends the 
data to the server. This does not mean that the data needs to be in-memory 
only. From the original thread, though, it seems like that is a requirement you 
have?
Also, this usage of direct access is confusing even after reading your 
clarification in the original thread.

    B. How does data get distributed to the various data servers?
In this case, David's advice is particularly true: you just want to see the 
various ways other distributed systems handle this. The brief summary of the 
fundamental approaches is:
        A) At a client, partition data then send each data partition to the 
appropiate data server
        B) At a client, send data to a particular data server. At the data 
server, partition data then send each data partition to the appropriate data 
server in a peer-to-peer fashion
        C) hybrid of A and B where your client knows some coarse-grained data 
distribution and each data server is aware of some fine-grained data 
distribution (this is most relevant for data that is partitioned and replicated)

    c. Is this a valid use case?
This is a fundamental approach for distributed storage systems (many data 
servers, few metadata servers). Plus, you're trying to use Arrow, so it sounds 
especially valid.

4. If anyone can point us in the right direction with some code examples, we 
would very much appreciate.
The way to think about it is that Flight defines an interface or protocol to 
facilitate communication. The diagram in [1] shows exactly how the client and 
servers would interact for a GET request (downloading data), which should also 
contextualize David's responses if you haven't seen the diagram.

One thing you'll need to do is figure out what to put in your "ticket" that 
facilitates how your flight servers identify data; this should be dependent on 
your storage model (i.e. file names or key names). How your data servers store 
data, and how your metadata server stores metadata is totally up to you, but 
whatever info you get from the metadata server should be usable by the data 
server to actually retrieve data.

I agree with David that there is a lot of content and your questions don't make 
it clear how much of that info you need clarification on. In case you have 
little experience, here's some quick info to help orient you when looking 
through resources:
1. Distributed storage systems use the architecture you're mentioning (metadata 
server and data servers)
2. Key value stores and distributed databases frequently just have many 
database servers, where each server has both metadata and data
3. If you are just distributing data in disjoint partitions (data values on one 
data server do not exist on other data servers), then the design is much 
simpler.
4. If you are doing replication then things become much more complex and 
consistency becomes something that your system needs to accommodate.
5. 1-4 are orthogonal to Arrow Flight, which just provides the foundation for 
how clients and servers communicate.
6. 1-4 requires distributed system design principles be applied to what is 
stored in your metadata and how that metadata is propagated (or accessed) by 
each client and server participating in some series of requests.
7. Best approaches to start understanding consistency are quorum consistency 
and vector clocks.
    A. at a glance, [2] seems to concisely explain quorum consistency which is 
useful from a server or client that can send requests to all database servers, 
and usually is used with a small amount of replication (12ish servers)
    B. The wiki on vector clocks [3] decently illustrates how to determine that 
some events provably come after other events. This approach is necessary when 
some peer-to-peer communication happens, otherwise a simple version number 
should suffice.


[1]: 
https://arrow.apache.org/docs/format/Flight.html#downloading-data<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Farrow.apache.org%2Fdocs%2Fformat%2FFlight.html%23downloading-data&data=05%7C01%7Cphilip.carinhas%40zapatacomputing.com%7C8ec28d954dd24d672d3a08daf59e86dd%7C47c84d2a037549a0aea39a4db4172570%7C1%7C0%7C638092358487659454%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=TxkZHII9Gmt8MpwnevNUdnjGoygxmnVtOEMkuHhwZIY%3D&reserved=0>
[2]: 
https://tangarts.github.io/consistency-levels-and-quorums.html<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftangarts.github.io%2Fconsistency-levels-and-quorums.html&data=05%7C01%7Cphilip.carinhas%40zapatacomputing.com%7C8ec28d954dd24d672d3a08daf59e86dd%7C47c84d2a037549a0aea39a4db4172570%7C1%7C0%7C638092358487659454%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=j7t2IGB56HyBOPx%2Fojd7M88m9PgHTUWBv0Gmv1zyhC8%3D&reserved=0>
[3]: 
https://en.wikipedia.org/wiki/Vector_clock<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FVector_clock&data=05%7C01%7Cphilip.carinhas%40zapatacomputing.com%7C8ec28d954dd24d672d3a08daf59e86dd%7C47c84d2a037549a0aea39a4db4172570%7C1%7C0%7C638092358487659454%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=mpgd7zl6WMpaIuRSaFEWHOxJZ6iUpyF5LXHbwjPG%2Byo%3D&reserved=0>


Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

Reply via email to