Flight is meant for Arrow data and so I don't think there are any other 
comparable frameworks in the first place. 

I don't think there is a generic answer to your question without a lot more 
knowledge of your particular challenges. For instance, gRPC has technical 
features that make it faster, but gRPC users also generally choose a binary, 
non-self-describing wire format, so they are transferring less data and 
spending less time serializing/deserializing in the first place. That's a 
tradeoff that may or may not make sense for you. Compression may give you 
more effective bandwidth, but that may not help in your case, and may not 
help an individual stream's bandwidth, at least if you're compressing on the 
fly (we did some benchmarking and didn't see any improvements with 
compression in Flight, for instance). So without a lot of context, it's hard 
to recommend anything, and I'm really only familiar with Arrow-related things.
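
To make the compression point a bit more concrete, here is a rough sketch of 
how one might estimate the effect for a single stream. It is not the 
benchmark we ran for Flight, and it uses only the JDK's gzip rather than 
anything Arrow-specific. The class name, the sample payload, and the 
10 Gbit/s link speed are all assumptions for illustration. It models the 
compressor and the link as a pipeline, so the slower stage sets the 
throughput.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of Arrow or Flight.
class CompressionTradeoff {
    /** Gzip a buffer in memory and return the compressed bytes. */
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /**
     * Payload throughput (bytes/s) of one stream that compresses on the fly.
     * The link carries ratio-times more payload per second, but the stream
     * can never go faster than the compressor itself.
     */
    public static double effectiveBytesPerSec(double compressBytesPerSec,
                                              double linkBytesPerSec,
                                              double ratio) {
        return Math.min(compressBytesPerSec, linkBytesPerSec * ratio);
    }

    public static void main(String[] args) {
        // Assumed payload: repetitive text, which compresses much better
        // than real documents usually do.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("Line ").append(i).append(" of some document text.\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);

        long start = System.nanoTime();
        byte[] packed = gzip(raw);
        double seconds = (System.nanoTime() - start) / 1e9;

        double ratio = (double) raw.length / packed.length;
        double compressRate = raw.length / seconds;
        double link = 1.25e9; // assumed 10 Gbit/s link, in bytes/s

        System.out.printf("raw=%d bytes, gzip=%d bytes, ratio=%.1f%n",
                raw.length, packed.length, ratio);
        System.out.printf("compressor:          %.0f MB/s%n", compressRate / 1e6);
        System.out.printf("uncompressed stream: %.0f MB/s%n", link / 1e6);
        System.out.printf("compressed stream:   %.0f MB/s%n",
                effectiveBytesPerSec(compressRate, link, ratio) / 1e6);
    }
}
```

If the compressor turns out slower than the link, the compressed stream is 
the slower of the two even with a good ratio, which is one way compression 
can fail to help an individual stream. Real numbers depend heavily on the 
data, the codec, and the hardware, so measure with your own documents.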

On Fri, Mar 3, 2023, at 16:41, Vilayannur Sitaraman wrote:
> Got it, thanks for your thoughts, David. In terms of accelerating transfers, 
> are you aware of any other framework known to be faster than the gRPC-based 
> Arrow/Arrow Flight?
> Sitaraman
> 
> 
> *From:* David Li <lidav...@apache.org>
> *Sent:* Friday, March 3, 2023 1:37 PM
> *To:* dl <user@arrow.apache.org>
> *Subject:* Re: Is ArrowFlight/Arrow the right choice for transporting large 
> volumes of unstructured text
>  
> HTTP/2 is not a magic 'make things faster' button (and due to head-of-line 
> blocking, it can be slower than HTTP/1 in some circumstances!), plus you can 
> use HTTP/2 outside of Flight/gRPC. (Or you could just use gRPC itself, though 
> by default gRPC is going to force some extra copies on you.)
> 
> Flight encourages parallelization and separation of control/data, but of 
> course you can implement those things yourself.
> 
> I would still encourage you to try Arrow/Flight, especially as many 
> frameworks in this space do interoperate with Arrow, but I'm not sure I'd 
> treat Arrow as just a way to accelerate your network transfers.
> 
> On Fri, Mar 3, 2023, at 16:33, Vilayannur Sitaraman wrote:
>> Hi David,
>>    I was under the impression that gRPC, which uses HTTP/2 as its 
>> transport, is more efficient. Plus, what about the benefits listed for 
>> Arrow Flight, pasted below?
>> “
>> We wanted Flight to enable systems to create horizontally scalable data 
>> services without having to deal with such bottlenecks. A client request to a 
>> dataset using the GetFlightInfo RPC returns a list of *endpoints*, each of 
>> which contains a server location and a *ticket* to send that server in a 
>> DoGet request to obtain a part of the full dataset. To get access to the 
>> entire dataset, all of the endpoints must be consumed. While Flight streams 
>> are not necessarily ordered, we provide for application-defined metadata 
>> which can be used to serialize ordering information.
>> This multiple-endpoint pattern has a number of benefits:
>>  • Endpoints can be read by clients in parallel.
>>  • The service that serves the GetFlightInfo “planning” request can delegate 
>> work to sibling services to take advantage of data locality or simply to 
>> help with load balancing.
>>  • Nodes in a distributed cluster can take on different roles. For example, 
>> a subset of nodes might be responsible for planning queries while other 
>> nodes exclusively fulfill data stream (DoGet or DoPut) requests.
>>  
>>  
>> “
>>  
>> *From: *David Li <lidav...@apache.org>
>> *Date: *Friday, March 3, 2023 at 5:15 AM
>> *To: *dl <user@arrow.apache.org>
>> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large 
>> volumes of unstructured text
>> If you are always just going to convert back to string, then I don't see why 
>> you wouldn't just use HTTP.
>>  
>> On Thu, Mar 2, 2023, at 21:50, Vilayannur Sitaraman wrote:
>>> I am primarily concerned about the efficient transfer of such texts across 
>>> machines and across different programming languages. I am comfortable 
>>> treating a text chunk as a VarCharVector with the textual content, as 
>>> shown below, which I can then transfer to my NLP module for further 
>>> processing. But I want an expert opinion on whether this is the right way 
>>> to handle this requirement. Or are there more efficient ways of doing the 
>>> transfer than converting to Arrow first, doing the transfer, and then 
>>> converting back to a string for further processing?
>>> 
>>> Thanks for your thoughts and considered opinion on this.
>>> 
>>> // Assumes: import java.nio.charset.StandardCharsets;
>>> VarCharVector stateVector =
>>>     (VarCharVector) vectorSchemaRoot.getVector("state");
>>> stateVector.allocateNew(textlines.size());
>>> int k = 0;
>>> for (String thisStr : textlines) {
>>>   // setSafe grows the data buffer as needed; an explicit charset avoids
>>>   // depending on the platform default encoding.
>>>   stateVector.setSafe(k, thisStr.getBytes(StandardCharsets.UTF_8));
>>>   k++;
>>> }
>>> vectorSchemaRoot.setRowCount(textlines.size());
>>> clientStreamListener.start(vectorSchemaRoot);
>>> clientStreamListener.putNext();
>>> clientStreamListener.completed();
>>> System.out.println(vectorSchemaRoot.getRowCount());
>>> 
>>> Sitaraman
>>> 
>>> *From: *David Li <lidav...@apache.org>
>>> *Date: *Thursday, March 2, 2023 at 6:03 PM
>>> *To: *dl <user@arrow.apache.org>
>>> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting large 
>>> volumes of unstructured text
>>> 
>>> NLP is not something I'm familiar with. If your analysis works with Arrow 
>>> or Arrow-ecosystem tools at some point (e.g. pandas, RAPIDS, xgboost) then 
>>> it would likely benefit you to use Arrow up front instead of converting the 
>>> data down the line. (For example, HuggingFace datasets use Arrow partly for 
>>> its interoperability with other tools [1].) 
>>> 
>>>  
>>> 
>>> [1]: https://huggingface.co/docs/datasets/about_arrow 
>>> 
>>>  
>>> 
>>> On Thu, Mar 2, 2023, at 20:38, Vilayannur Sitaraman wrote:
>>> 
>>>> Hi David,
>>>> 
>>>> Thanks for the questions…
>>>> 
>>>> Are these two processes on the same machine:
>>>> 
>>>> No, two different processes on different machines
>>>> 
>>>>  
>>>> 
>>>> What exactly is the unstructured text
>>>> 
>>>> The text is the textual content of normal enterprise documents, such as 
>>>> PDF and DOCX files. I can split these into chunks before transferring 
>>>> if needed. 
>>>> 
>>>>  
>>>> 
>>>> What is the python side planning to do:
>>>> 
>>>> Analyze and run ML models such as NLP on the text.
>>>> 
>>>> Sitaraman
>>>> 
>>>> *From: *David Li <lidav...@apache.org>
>>>> *Date: *Thursday, March 2, 2023 at 5:33 PM
>>>> *To: *dl <user@arrow.apache.org>
>>>> *Subject: *Re: Is ArrowFlight/Arrow the right choice for transporting 
>>>> large volumes of unstructured text
>>>> 
>>>> Possibly, but more details might help. Are these two processes on the same 
>>>> machine, two components in the same process, two processes on different 
>>>> machines? What exactly is the unstructured text - does it at least fit 
>>>> into a column of data, or is it literally just a stream of text with no 
>>>> further structure? What is the Python side planning to do with the text 
>>>> (for instance, do you want to further analyze it with something like 
>>>> Pandas)?
>>>> 
>>>>  
>>>> 
>>>> On Thu, Mar 2, 2023, at 18:45, Vilayannur Sitaraman wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>>   My use case is the need to efficiently transport large volumes of 
>>>>> unstructured text from a module in Java to a module in Python, 
>>>>> possibly massaging the docs before transport. Is Arrow Flight/Arrow 
>>>>> the right choice for this? Why or why not? Any advice appreciated.
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Sitaraman