[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays
[ https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332103#comment-17332103 ]

Andrew Lamb commented on ARROW-9275:
------------------------------------

Migrated to github: https://github.com/apache/arrow-rs/issues/82

> [Rust] – Async Sans IO: R/W into/to Arrow Arrays
>
>                 Key: ARROW-9275
>                 URL: https://issues.apache.org/jira/browse/ARROW-9275
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Mahmut Bulut
>            Assignee: Mahmut Bulut
>            Priority: Major
>
> This issue can be considered an epic that spans other Arrow projects.
>
> *Drill down*
> Currently, traits like `ParquetReader` expose only a synchronous interface,
> which uses a BufReader with a constant 8 KB buffer. Over the network this
> becomes a problem, one that differential buffers can easily solve. In
> addition to this shortcoming, an executor engine has to sit somewhere in
> between to schedule calls from async trait methods onto sync trait methods,
> so that requests to external IO become asynchronous. On-disk IO is
> acceptable with the approach we currently have, since no reliable evented IO
> exists for on-disk IO on major platforms.
> All things considered, we need to expose abstractions that provide
> asynchronous IO without taking any side among executors.
>
> *Design Suggestions & Considerations*
> The design should apply and consider:
> * Sans IO (for more information about the Sans IO approach, please see
> [https://sans-io.readthedocs.io/]).
> * Not including any executor-specific data, at all.
> * Tests should work with any executor with little to no modification.
> * Buffers are adjusted accordingly, and differential buffers are used to
> optimize network trips.
> * Sync IO shouldn't be touched, at any cost. If we try to unify the Sync IO
> traits or write overlapping implementations, that will make our lives harder
> in the future. Sans IO should be compartmentalized.
>
> *Notes*
> If the Sans IO approach is not taken, the project will:
> * use an extreme number of dependencies.
> * not be compatible with other Rust code at all.
> * break currently working code that uses array ingestion.
> * make integration tests harder.
> * be really hard to adapt once completion-based APIs stabilize in the
> future (in user projects).
>
> This suggestion is not about the flight format or any flight-related
> information atm. It is purely about making on-disk and remote IO (provider
> backends like AWS etc.) async.
>
> *Open points*
> A couple of open points:
> * Identifying the traits that are going to be made async.
> * Designing internal routines.
> * The package name to expose.
> * Gathering traits into the designated packages in all file formats.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
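
For readers unfamiliar with the Sans IO idea referenced in the description, here is a minimal sketch of what it could look like in Rust. Everything in it is an assumption made for illustration: `FrameDecoder`, `Step` and `read_frame` are hypothetical names, and the 4-byte length-prefixed framing is a toy format, not Parquet or Arrow IPC. The decoder never performs IO; it only reports how many bytes it still needs, so the caller decides how and when to read. That is also one possible reading of "differential buffers", a term the issue does not define.

```rust
use std::io::{self, Read};

/// What the decoder asks the caller to do next.
#[derive(Debug, PartialEq)]
pub enum Step {
    /// The caller should supply this many more bytes before stepping again.
    NeedBytes(usize),
    /// A complete frame payload was decoded.
    Frame(Vec<u8>),
}

/// Hypothetical sans-IO decoder for a toy format:
/// a 4-byte little-endian length followed by that many payload bytes.
#[derive(Default)]
pub struct FrameDecoder {
    buf: Vec<u8>,
}

impl FrameDecoder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Hand bytes to the decoder. No IO happens here.
    pub fn feed(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }

    /// Advance the state machine using only the bytes fed so far.
    pub fn step(&mut self) -> Step {
        const HEADER: usize = 4;
        if self.buf.len() < HEADER {
            return Step::NeedBytes(HEADER - self.buf.len());
        }
        let len =
            u32::from_le_bytes([self.buf[0], self.buf[1], self.buf[2], self.buf[3]]) as usize;
        let total = HEADER + len;
        if self.buf.len() < total {
            return Step::NeedBytes(total - self.buf.len());
        }
        let payload = self.buf[HEADER..total].to_vec();
        self.buf.drain(..total);
        Step::Frame(payload)
    }
}

/// One possible blocking driver. An async driver would run the same loop
/// and await its own reader instead of calling `read_exact`.
pub fn read_frame<R: Read>(reader: &mut R, dec: &mut FrameDecoder) -> io::Result<Vec<u8>> {
    loop {
        match dec.step() {
            Step::Frame(payload) => return Ok(payload),
            Step::NeedBytes(n) => {
                // Size the read to exactly what the decoder asked for.
                let mut chunk = vec![0u8; n];
                reader.read_exact(&mut chunk)?;
                dec.feed(&chunk);
            }
        }
    }
}
```

Because the decoder holds no reader and no executor handle, the same type can sit behind a synchronous driver like `read_frame` or behind an async one, and it can be tested with plain byte slices.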
[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays
[ https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185360#comment-17185360 ]

Max Burke commented on ARROW-9275:
----------------------------------

I'd be a little concerned about over-generalizing out of the gate. Having
done a similar song and dance with some internal code, one of the things I
like about tailoring to a specific runtime is that synchronization primitives
taken from a particular runtime are able to leverage that runtime. Tokio's
mutex, for example, will yield back to the executor if it contends on a mutex
lock rather than tying up a pool thread, which can be a Very Good Thing with
async-heavy workloads.

I'm not sure what you mean in terms of specifics for the "sans-IO" method. I
assume by this you mean the user would be expected to pass in implementations
of the AsyncRead/etc. traits, which will read from disk or network or memory
or wherever?
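
To make the question above concrete, here is a rough sketch of the "caller passes in the reader" shape, written against the standard library only. All names are hypothetical: `ByteSource` stands in for an AsyncRead-style trait (real ones in the ecosystem use pinned receivers and differ in detail), `MemSource` is an in-memory implementation, and `block_on` is a toy busy-polling executor included only so the example runs without a runtime. It is not a proposal for the actual API.

```rust
use std::future::{poll_fn, Future};
use std::io;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical poll-based reader trait that the caller implements.
pub trait ByteSource {
    fn poll_read(&mut self, cx: &mut Context<'_>, buf: &mut [u8]) -> Poll<io::Result<usize>>;
}

/// In-memory source that reports `Pending` once to mimic a slow backend.
pub struct MemSource {
    data: Vec<u8>,
    pos: usize,
    stall: bool,
}

impl MemSource {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data, pos: 0, stall: true }
    }
}

impl ByteSource for MemSource {
    fn poll_read(&mut self, cx: &mut Context<'_>, buf: &mut [u8]) -> Poll<io::Result<usize>> {
        if self.stall {
            self.stall = false;
            // Ask to be polled again, as a real source would once data arrives.
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        let n = buf.len().min(self.data.len() - self.pos);
        buf[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        Poll::Ready(Ok(n))
    }
}

/// Library-side code: it names no runtime, only the trait.
pub async fn read_to_end<S: ByteSource>(src: &mut S) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    let mut chunk = [0u8; 4];
    loop {
        let n = poll_fn(|cx| src.poll_read(cx, &mut chunk)).await?;
        if n == 0 {
            return Ok(out);
        }
        out.extend_from_slice(&chunk[..n]);
    }
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: polls in a loop until the future completes.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
        std::thread::yield_now();
    }
}
```

With this shape, `block_on(read_to_end(&mut source))` could be swapped for any other executor's entry point without touching `read_to_end`. The trade-off raised in the comment still applies: code written this way cannot lean on runtime-specific primitives such as a runtime-aware mutex.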
[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays
[ https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184086#comment-17184086 ]

Andrew Lamb commented on ARROW-9275:
------------------------------------

In general, I think the notion of implementing async Parquet and Arrow APIs
that don't rely on tokio or other executors is a good idea.

I think in order to make the crate as widely useful as possible, it should
also retain a synchronous API for use with the Rust standard library.

One pattern I have seen is using an `async` crate option that adds the
appropriate async options (and possibly additional dependencies). For
example, https://docs.rs/bzip2/0.4.1/bzip2/#async-io
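
As a rough illustration of the opt-in pattern mentioned above, the sketch below keeps a blocking function always available and compiles an async counterpart only when a Cargo feature is enabled. The feature name `async`, the module name `nonblocking` and both `read_magic` functions are assumptions for this example and are not taken from the Arrow or bzip2 crates. The only format fact relied on is that Parquet files begin with the 4-byte magic `PAR1`.

```rust
use std::io::{self, Read};

/// Always available: blocking API over `std::io::Read`.
pub fn read_magic<R: Read>(reader: &mut R) -> io::Result<[u8; 4]> {
    let mut magic = [0u8; 4];
    reader.read_exact(&mut magic)?;
    Ok(magic)
}

/// Compiled only when the hypothetical `async` Cargo feature is enabled,
/// so default builds carry no async code or extra dependencies.
#[cfg(feature = "async")]
pub mod nonblocking {
    use std::future::Future;
    use std::io;

    /// Async counterpart. `fetch(n)` is a caller-supplied function that
    /// returns a future resolving to the next `n` bytes.
    pub async fn read_magic<F, Fut>(mut fetch: F) -> io::Result<[u8; 4]>
    where
        F: FnMut(usize) -> Fut,
        Fut: Future<Output = io::Result<Vec<u8>>>,
    {
        let bytes = fetch(4).await?;
        if bytes.len() != 4 {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "short read"));
        }
        let mut magic = [0u8; 4];
        magic.copy_from_slice(&bytes);
        Ok(magic)
    }
}
```

In a real crate the feature would be declared in the `[features]` table of `Cargo.toml`, and any async-only dependencies would be marked optional and tied to that feature.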
[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays
[ https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183667#comment-17183667 ]

Andy Grove commented on ARROW-9275:
-----------------------------------

[~vertexclique] For some reason I didn't see this issue until now. I am
interested in discussing this further and especially how it relates to other
issues we have open around async.

Also pinging [~alamb] and [~jorgecarleitao] who have been involved in
discussions related to this in the DataFusion crate.
[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays
[ https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150943#comment-17150943 ]

Mahmut Bulut commented on ARROW-9275:
-------------------------------------

Yes, exactly, Neville, so users can choose whatever they want to incorporate
in their workloads, which enables plenty of projects with different
workloads, scenarios, etc.

And yes again, I feel like there should be a collaborative effort to add APIs
around the crates. This spans a little wider than other tickets.

Sure! I will send an email with content similar to this ticket, tagged
`[Rust]`.

Thanks for the feedback, will send a mail asap.
[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays
[ https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149648#comment-17149648 ]

Neville Dipale commented on ARROW-9275:
---------------------------------------

Hi [~vertexclique], I'm out of my depth with Sans IO. Are you proposing a way
of using async IO without being bound to a specific runtime (tokio,
async-std, etc.)?

There has been interest in async IO, so I presume that once we have a
concrete implementation plan, we might be able to get more contributors to
help (assuming it's a lot of effort).

Since you mention that this might span other projects, perhaps you could
bring this up on the mailing list to get more feedback from the wider
community?
[jira] [Commented] (ARROW-9275) [Rust] – Async Sans IO: R/W into/to Arrow Arrays
[ https://issues.apache.org/jira/browse/ARROW-9275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148510#comment-17148510 ]

Mahmut Bulut commented on ARROW-9275:
-------------------------------------

[~nevi_me], [~andygrove], [~paddyhoran] I need input for this from you if
possible.