[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays

2021-04-26 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332103#comment-17332103
 ] 

Andrew Lamb commented on ARROW-9275:


Migrated to github: https://github.com/apache/arrow-rs/issues/82

> [Rust] – Async Sans IO: R/W into/to Arrow Arrays
> 
>
> Key: ARROW-9275
> URL: https://issues.apache.org/jira/browse/ARROW-9275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>
> This issue can be considered an epic level that spans across other arrow 
> projects.
> *Drill down*
> Currently, traits like `ParquetReader` only allow synchronous interface which 
> uses BufReader having 8KB constant buffer. Over the network, this becomes a 
> problem. This can be easily solvable with differential buffers. In addition 
> to this shortage, there is a problem of executor engine is needed to schedule 
> from async trait methods to sync trait methods which should sit somewhere in 
> between to make requests asynchronous to external IO. On-disk IO is 
> acceptable with the approach we currently have since no reliable evented IO 
> exists for on-disk IO on major platforms.
> All these considered abstractions that will expose asynchronous IO without 
> any side from executors, needs to be exposed.
>  
> *Design Suggestions & Considerations*
> The design should apply and consider:
>  * Sans IO, (for more information about Sans approach please see 
> [https://sans-io.readthedocs.io/] ) 
>  * Not including any executor specific data, at all.
>  * Tests should work with any executor with little to no modification.
>  * Buffers are adjusted accordingly and use differential buffers to optimize 
> network trips.
>  * Sync IO shouldn't be touched. At all costs. If we try to unify Sync IO 
> traits or we do overlapping implementation, that will make our life harder in 
> the future. Sans IO should be compartmentalized.
>  
> *Notes*
> If Sans approach is not taken, the project will:
>  * use an extreme amount of dependencies.
>  * be not compatible with other Rust code at all.
>  * break currently working code uses array ingestions.
>  * integrations tests are going to be harder.
>  * it will really hard to adapt to completion-based APIs stabilize in the 
> future. (in the user projects)
>  * this suggestion is not about the flight format or any flight-related 
> information atm. This is purely making on-disk, remote IO (provider backends 
> like AWS etc.) async.
>  
> *Open points*
> A couple of open points:
>  * Identifying traits that are going to be asyncized.
>  * Designing internal routines.
>  * package name to expose.
>  * Gather traits into the designated packages in all file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays

2020-08-26 Thread Max Burke (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185360#comment-17185360
 ] 

Max Burke commented on ARROW-9275:
--

I'd be a little concerned about over-generalizing out of the gate. Having done 
a similar song and dance with some internal code one of the things I like about 
tailoring to a specific runtime is that synchronization primitives taken from a 
particular runtime are able to leverage that runtime. Tokio's mutex, for 
example, will yield back to the executor if it contends on a mutex lock rather 
than tying up a pool thread, which can be a Very Good Thing with async-heavy 
workloads.

I'm not sure what you mean in terms of specifics for the "sans-IO" method, I 
assume by this you mean the user would be expected to pass in implementations 
of the AsyncRead/etc. traits which will read from disk or network or memory or 
wherever? 

> [Rust] – Async Sans IO: R/W into/to Arrow Arrays
> 
>
> Key: ARROW-9275
> URL: https://issues.apache.org/jira/browse/ARROW-9275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>
> This issue can be considered an epic level that spans across other arrow 
> projects.
> *Drill down*
> Currently, traits like `ParquetReader` only allow synchronous interface which 
> uses BufReader having 8KB constant buffer. Over the network, this becomes a 
> problem. This can be easily solvable with differential buffers. In addition 
> to this shortage, there is a problem of executor engine is needed to schedule 
> from async trait methods to sync trait methods which should sit somewhere in 
> between to make requests asynchronous to external IO. On-disk IO is 
> acceptable with the approach we currently have since no reliable evented IO 
> exists for on-disk IO on major platforms.
> All these considered abstractions that will expose asynchronous IO without 
> any side from executors, needs to be exposed.
>  
> *Design Suggestions & Considerations*
> The design should apply and consider:
>  * Sans IO, (for more information about Sans approach please see 
> [https://sans-io.readthedocs.io/] ) 
>  * Not including any executor specific data, at all.
>  * Tests should work with any executor with little to no modification.
>  * Buffers are adjusted accordingly and use differential buffers to optimize 
> network trips.
>  * Sync IO shouldn't be touched. At all costs. If we try to unify Sync IO 
> traits or we do overlapping implementation, that will make our life harder in 
> the future. Sans IO should be compartmentalized.
>  
> *Notes*
> If Sans approach is not taken, the project will:
>  * use an extreme amount of dependencies.
>  * be not compatible with other Rust code at all.
>  * break currently working code uses array ingestions.
>  * integrations tests are going to be harder.
>  * it will really hard to adapt to completion-based APIs stabilize in the 
> future. (in the user projects)
>  * this suggestion is not about the flight format or any flight-related 
> information atm. This is purely making on-disk, remote IO (provider backends 
> like AWS etc.) async.
>  
> *Open points*
> A couple of open points:
>  * Identifying traits that are going to be asyncized.
>  * Designing internal routines.
>  * package name to expose.
>  * Gather traits into the designated packages in all file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays

2020-08-25 Thread Andrew Lamb (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184086#comment-17184086
 ] 

Andrew Lamb commented on ARROW-9275:


In general, I think the notion of implementing async Parquet and Arrow APIs 
that don't rely on tokio or other executors is a good idea. 

I think in order to make the crate as widely useful as possible, it should also 
retain a synchronous API for use with the rust standard library.

One pattern I have seen is a using a `async` crate option that adds the 
appropriate async options (and possibly additional dependencies). For example, 
https://docs.rs/bzip2/0.4.1/bzip2/#async-io



> [Rust] – Async Sans IO: R/W into/to Arrow Arrays
> 
>
> Key: ARROW-9275
> URL: https://issues.apache.org/jira/browse/ARROW-9275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>
> This issue can be considered an epic level that spans across other arrow 
> projects.
> *Drill down*
> Currently, traits like `ParquetReader` only allow synchronous interface which 
> uses BufReader having 8KB constant buffer. Over the network, this becomes a 
> problem. This can be easily solvable with differential buffers. In addition 
> to this shortage, there is a problem of executor engine is needed to schedule 
> from async trait methods to sync trait methods which should sit somewhere in 
> between to make requests asynchronous to external IO. On-disk IO is 
> acceptable with the approach we currently have since no reliable evented IO 
> exists for on-disk IO on major platforms.
> All these considered abstractions that will expose asynchronous IO without 
> any side from executors, needs to be exposed.
>  
> *Design Suggestions & Considerations*
> The design should apply and consider:
>  * Sans IO, (for more information about Sans approach please see 
> [https://sans-io.readthedocs.io/] ) 
>  * Not including any executor specific data, at all.
>  * Tests should work with any executor with little to no modification.
>  * Buffers are adjusted accordingly and use differential buffers to optimize 
> network trips.
>  * Sync IO shouldn't be touched. At all costs. If we try to unify Sync IO 
> traits or we do overlapping implementation, that will make our life harder in 
> the future. Sans IO should be compartmentalized.
>  
> *Notes*
> If Sans approach is not taken, the project will:
>  * use an extreme amount of dependencies.
>  * be not compatible with other Rust code at all.
>  * break currently working code uses array ingestions.
>  * integrations tests are going to be harder.
>  * it will really hard to adapt to completion-based APIs stabilize in the 
> future. (in the user projects)
>  * this suggestion is not about the flight format or any flight-related 
> information atm. This is purely making on-disk, remote IO (provider backends 
> like AWS etc.) async.
>  
> *Open points*
> A couple of open points:
>  * Identifying traits that are going to be asyncized.
>  * Designing internal routines.
>  * package name to expose.
>  * Gather traits into the designated packages in all file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays

2020-08-24 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183667#comment-17183667
 ] 

Andy Grove commented on ARROW-9275:
---

[~vertexclique] For some reason I didn't see this issue until now. I am 
interested in discussing this further and especially how it relates to other 
issues we have open around async.

Also pinging [~alamb] and [~jorgecarleitao] who have been involved in 
discussions related to this in the DataFusion crate.

> [Rust] – Async Sans IO: R/W into/to Arrow Arrays
> 
>
> Key: ARROW-9275
> URL: https://issues.apache.org/jira/browse/ARROW-9275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>
> This issue can be considered an epic level that spans across other arrow 
> projects.
> *Drill down*
> Currently, traits like `ParquetReader` only allow synchronous interface which 
> uses BufReader having 8KB constant buffer. Over the network, this becomes a 
> problem. This can be easily solvable with differential buffers. In addition 
> to this shortage, there is a problem of executor engine is needed to schedule 
> from async trait methods to sync trait methods which should sit somewhere in 
> between to make requests asynchronous to external IO. On-disk IO is 
> acceptable with the approach we currently have since no reliable evented IO 
> exists for on-disk IO on major platforms.
> All these considered abstractions that will expose asynchronous IO without 
> any side from executors, needs to be exposed.
>  
> *Design Suggestions & Considerations*
> The design should apply and consider:
>  * Sans IO, (for more information about Sans approach please see 
> [https://sans-io.readthedocs.io/] ) 
>  * Not including any executor specific data, at all.
>  * Tests should work with any executor with little to no modification.
>  * Buffers are adjusted accordingly and use differential buffers to optimize 
> network trips.
>  * Sync IO shouldn't be touched. At all costs. If we try to unify Sync IO 
> traits or we do overlapping implementation, that will make our life harder in 
> the future. Sans IO should be compartmentalized.
>  
> *Notes*
> If Sans approach is not taken, the project will:
>  * use an extreme amount of dependencies.
>  * be not compatible with other Rust code at all.
>  * break currently working code uses array ingestions.
>  * integrations tests are going to be harder.
>  * it will really hard to adapt to completion-based APIs stabilize in the 
> future. (in the user projects)
>  * this suggestion is not about the flight format or any flight-related 
> information atm. This is purely making on-disk, remote IO (provider backends 
> like AWS etc.) async.
>  
> *Open points*
> A couple of open points:
>  * Identifying traits that are going to be asyncized.
>  * Designing internal routines.
>  * package name to expose.
>  * Gather traits into the designated packages in all file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays

2020-07-03 Thread Mahmut Bulut (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150943#comment-17150943
 ] 

Mahmut Bulut commented on ARROW-9275:
-

Yes, exactly Neville, so users can choose whatever they want to incorporate in 
their workloads, which enables plenty of projects with different workloads, 
scenarios, etc.

And yes again, I feel like there should be a collaborative effort together to 
add APIs around crates. Spans a little wider than other tickets.

Sure! I will send a similar email with similar content of this ticket. Tagging 
`[Rust]`. Thanks for the feedback, will send a mail asap.

> [Rust] – Async Sans IO: R/W into/to Arrow Arrays
> 
>
> Key: ARROW-9275
> URL: https://issues.apache.org/jira/browse/ARROW-9275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>
> This issue can be considered an epic level that spans across other arrow 
> projects.
> *Drill down*
> Currently, traits like `ParquetReader` only allow synchronous interface which 
> uses BufReader having 8KB constant buffer. Over the network, this becomes a 
> problem. This can be easily solvable with differential buffers. In addition 
> to this shortage, there is a problem of executor engine is needed to schedule 
> from async trait methods to sync trait methods which should sit somewhere in 
> between to make requests asynchronous to external IO. On-disk IO is 
> acceptable with the approach we currently have since no reliable evented IO 
> exists for on-disk IO on major platforms.
> All these considered abstractions that will expose asynchronous IO without 
> any side from executors, needs to be exposed.
>  
> *Design Suggestions & Considerations*
> The design should apply and consider:
>  * Sans IO, (for more information about Sans approach please see 
> [https://sans-io.readthedocs.io/] ) 
>  * Not including any executor specific data, at all.
>  * Tests should work with any executor with little to no modification.
>  * Buffers are adjusted accordingly and use differential buffers to optimize 
> network trips.
>  * Sync IO shouldn't be touched. At all costs. If we try to unify Sync IO 
> traits or we do overlapping implementation, that will make our life harder in 
> the future. Sans IO should be compartmentalized.
>  
> *Notes*
> If Sans approach is not taken, the project will:
>  * use an extreme amount of dependencies.
>  * be not compatible with other Rust code at all.
>  * break currently working code uses array ingestions.
>  * integrations tests are going to be harder.
>  * it will really hard to adapt to completion-based APIs stabilize in the 
> future. (in the user projects)
>  * this suggestion is not about the flight format or any flight-related 
> information atm. This is purely making on-disk, remote IO (provider backends 
> like AWS etc.) async.
>  
> *Open points*
> A couple of open points:
>  * Identifying traits that are going to be asyncized.
>  * Designing internal routines.
>  * package name to expose.
>  * Gather traits into the designated packages in all file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays

2020-07-01 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149648#comment-17149648
 ] 

Neville Dipale commented on ARROW-9275:
---

Hi [~vertexclique], I'm out of my depth with Sans IO. Are you proposing a way 
of using async IO without being bound to a specific runtime (tokio, async-std, 
etc.)?

There has been interest in async IO, so I presume that once we have a concrete 
implementation plan, we might be able to get more contributors to help 
(assuming it's a lot of effort).

As you mention that this might potentially span across other projects; perhaps 
you could bring this up in the mailing list, to get more feedback from the 
wider community?

> [Rust] – Async Sans IO: R/W into/to Arrow Arrays
> 
>
> Key: ARROW-9275
> URL: https://issues.apache.org/jira/browse/ARROW-9275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>
> This issue can be considered an epic level that spans across other arrow 
> projects.
> *Drill down*
> Currently, traits like `ParquetReader` only allow synchronous interface which 
> uses BufReader having 8KB constant buffer. Over the network, this becomes a 
> problem. This can be easily solvable with differential buffers. In addition 
> to this shortage, there is a problem of executor engine is needed to schedule 
> from async trait methods to sync trait methods which should sit somewhere in 
> between to make requests asynchronous to external IO. On-disk IO is 
> acceptable with the approach we currently have since no reliable evented IO 
> exists for on-disk IO on major platforms.
> All these considered abstractions that will expose asynchronous IO without 
> any side from executors, needs to be exposed.
>  
> *Design Suggestions & Considerations*
> The design should apply and consider:
>  * Sans IO, (for more information about Sans approach please see 
> [https://sans-io.readthedocs.io/] ) 
>  * Not including any executor specific data, at all.
>  * Tests should work with any executor with little to no modification.
>  * Buffers are adjusted accordingly and use differential buffers to optimize 
> network trips.
>  * Sync IO shouldn't be touched. At all costs. If we try to unify Sync IO 
> traits or we do overlapping implementation, that will make our life harder in 
> the future. Sans IO should be compartmentalized.
>  
> *Notes*
> If Sans approach is not taken, the project will:
>  * use an extreme amount of dependencies.
>  * be not compatible with other Rust code at all.
>  * break currently working code uses array ingestions.
>  * integrations tests are going to be harder.
>  * it will really hard to adapt to completion-based APIs stabilize in the 
> future. (in the user projects)
>  * this suggestion is not about the flight format or any flight-related 
> information atm. This is purely making on-disk, remote IO (provider backends 
> like AWS etc.) async.
>  
> *Open points*
> A couple of open points:
>  * Identifying traits that are going to be asyncized.
>  * Designing internal routines.
>  * package name to expose.
>  * Gather traits into the designated packages in all file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays

2020-06-30 Thread Mahmut Bulut (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148510#comment-17148510
 ] 

Mahmut Bulut commented on ARROW-9275:
-

[~nevi_me], [~andygrove], [~paddyhoran] I need input for this from you if 
possible.

> [Rust] – Async Sans IO: R/W into/to Arrow Arrays
> 
>
> Key: ARROW-9275
> URL: https://issues.apache.org/jira/browse/ARROW-9275
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mahmut Bulut
>Assignee: Mahmut Bulut
>Priority: Major
>
> This issue can be considered an epic level that spans across other arrow 
> projects.
> *Drill down*
> Currently, traits like `ParquetReader` only allow synchronous interface which 
> uses BufReader having 8KB constant buffer. Over the network, this becomes a 
> problem. This can be easily solvable with differential buffers. In addition 
> to this shortage, there is a problem of executor engine is needed to schedule 
> from async trait methods to sync trait methods which should sit somewhere in 
> between to make requests asynchronous to external IO. On-disk IO is 
> acceptable with the approach we currently have since no reliable evented IO 
> exists for on-disk IO on major platforms.
> All these considered abstractions that will expose asynchronous IO without 
> any side from executors, needs to be exposed.
>  
> *Design Suggestions & Considerations*
> The design should apply and consider:
>  * Sans IO, (for more information about Sans approach please see 
> [https://sans-io.readthedocs.io/] ) 
>  * Not including any executor specific data, at all.
>  * Tests should work with any executor with little to no modification.
>  * Buffers are adjusted accordingly and use differential buffers to optimize 
> network trips.
>  * Sync IO shouldn't be touched. At all costs. If we try to unify Sync IO 
> traits or we do overlapping implementation, that will make our life harder in 
> the future. Sans IO should be compartmentalized.
>  
> *Notes*
> If Sans approach is not taken, the project will:
>  * use an extreme amount of dependencies.
>  * be not compatible with other Rust code at all.
>  * break currently working code uses array ingestions.
>  * integrations tests are going to be harder.
>  * it will really hard to adapt to completion-based APIs stabilize in the 
> future. (in the user projects)
>  * this suggestion is not about the in-flight format or any in-flight related 
> information atm. This is purely making on-disk, remote IO (provider backends 
> like AWS etc.) async.
>  
> *Open points*
> A couple of open points:
>  * Identifying traits that are going to be asyncized.
>  * Designing internal routines.
>  * package name to expose.
>  * Gather traits into the designated packages in all file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)