[jira] [Updated] (ARROW-8648) [Rust] Optimize Rust CI Build Times

2020-04-30 Thread Mark Hildreth (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hildreth updated ARROW-8648:
-
Component/s: Rust

> [Rust] Optimize Rust CI Build Times
> ---
>
> Key: ARROW-8648
> URL: https://issues.apache.org/jira/browse/ARROW-8648
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mark Hildreth
>Priority: Major
>
> In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build 
> options used that are at odds with each other, resulting in multiple 
> redundant builds where a smaller number could do the same job. The following 
> tweaks, at minimal, could reduce this, speeding up build times:
>  * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. 
> Currently, it's only used for a single command (the {{build --all-targets}} 
> in {{rust_build.sh}}). Subsuquent runs of cargo will ignore this first build, 
> since RUSTFLAGS has changed.
>  * Don't run examples in release mode, as that would force a new (and slower) 
> rebuild, when the examples have already been built in debug mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8648) [Rust] Optimize Rust CI Build Times

2020-04-30 Thread Mark Hildreth (Jira)
Mark Hildreth created ARROW-8648:


 Summary: [Rust] Optimize Rust CI Build Times
 Key: ARROW-8648
 URL: https://issues.apache.org/jira/browse/ARROW-8648
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Mark Hildreth


In the Rust CI workflows (rust_build.sh, rust_test.sh), there are some build 
options used that are at odds with each other, resulting in multiple redundant 
builds where a smaller number could do the same job. The following tweaks, at 
minimal, could reduce this, speeding up build times:
 * Ensure that RUSTFLAGS="-D warnings" is used for all cargo commands. 
Currently, it's only used for a single command (the {{build --all-targets}} in 
{{rust_build.sh}}). Subsuquent runs of cargo will ignore this first build, 
since RUSTFLAGS has changed.
 * Don't run examples in release mode, as that would force a new (and slower) 
rebuild, when the examples have already been built in debug mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8637) [Rust] Resolve Issues with `prettytable-rs` dependency

2020-04-29 Thread Mark Hildreth (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hildreth updated ARROW-8637:
-
Summary: [Rust] Resolve Issues with `prettytable-rs` dependency  (was: 
Resolve Issues with `prettytable-rs` dependency)

> [Rust] Resolve Issues with `prettytable-rs` dependency
> --
>
> Key: ARROW-8637
> URL: https://issues.apache.org/jira/browse/ARROW-8637
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mark Hildreth
>Priority: Major
>
> {{prettytable-rs}} is a dependency of Arrow for creating a string for 
> displaying record batches in a tabular form (see [pretty 
> util|https://github.com/apache/arrow/blob/c546eef41e6ab20c4ca29a2d836987959843896f/rust/arrow/src/util/pretty.rs#L24-L25])
>  The crate, however, has some issues:
>  
> 1.) {{prettytable-rs}} has a dependency on the {{term}} crate. The {{term}} 
> crate is under minimal maintenance, and it is advised to switch to another 
> crate. This will probably pop up in an [informational security 
> advisory|https://rustsec.org/advisories/RUSTSEC-2018-0015] if it's decided 
> one day to audit the crates.
> 2.) The crate also has a dependency on {{encode-unicode}}. While not 
> problematic in its own right, this crate implements some traits which can 
> bring about confusing type inference issues. For example, after adding the 
> {{prettytable-rs}} dependency in arrow, the following error occurred what 
> attempting to compile the parquet crate:
>  
> {{let seed_vec: Vec =}}
> {{    Standard.sample_iter( rng).take(seed_len).collect();}}
>  
> {{error[E0282]: type annotations needed}}
> {{   --> parquet/src/encodings/rle.rs:833:26}}
> {{    |}}
> {{833 | Standard.sample_iter( rng).take(seed_len).collect();}}
> {{    | ^^^ cannot infer type for `T`}}
>  
> Any user of the arrow crate would see a similar style of error.
>  
> There are a few possible ways to resolve this:
>  
> 1.) Hopefully hear from the crate maintainer. There is a [PR 
> open|https://github.com/phsym/prettytable-rs/pull/125] for the encode-unicode 
> issue.
> 2.) Find a different table-generating crate with less issues.
> 3.) Fork and fix prettytable-rs.
> 4.) ???
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8637) [Rust] Resolve Issues with `prettytable-rs` dependency

2020-04-29 Thread Mark Hildreth (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hildreth updated ARROW-8637:
-
Priority: Minor  (was: Major)

> [Rust] Resolve Issues with `prettytable-rs` dependency
> --
>
> Key: ARROW-8637
> URL: https://issues.apache.org/jira/browse/ARROW-8637
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Mark Hildreth
>Priority: Minor
>
> {{prettytable-rs}} is a dependency of Arrow for creating a string for 
> displaying record batches in a tabular form (see [pretty 
> util|https://github.com/apache/arrow/blob/c546eef41e6ab20c4ca29a2d836987959843896f/rust/arrow/src/util/pretty.rs#L24-L25])
>  The crate, however, has some issues:
>  
> 1.) {{prettytable-rs}} has a dependency on the {{term}} crate. The {{term}} 
> crate is under minimal maintenance, and it is advised to switch to another 
> crate. This will probably pop up in an [informational security 
> advisory|https://rustsec.org/advisories/RUSTSEC-2018-0015] if it's decided 
> one day to audit the crates.
> 2.) The crate also has a dependency on {{encode-unicode}}. While not 
> problematic in its own right, this crate implements some traits which can 
> bring about confusing type inference issues. For example, after adding the 
> {{prettytable-rs}} dependency in arrow, the following error occurred what 
> attempting to compile the parquet crate:
>  
> {{let seed_vec: Vec =}}
> {{    Standard.sample_iter( rng).take(seed_len).collect();}}
>  
> {{error[E0282]: type annotations needed}}
> {{   --> parquet/src/encodings/rle.rs:833:26}}
> {{    |}}
> {{833 | Standard.sample_iter( rng).take(seed_len).collect();}}
> {{    | ^^^ cannot infer type for `T`}}
>  
> Any user of the arrow crate would see a similar style of error.
>  
> There are a few possible ways to resolve this:
>  
> 1.) Hopefully hear from the crate maintainer. There is a [PR 
> open|https://github.com/phsym/prettytable-rs/pull/125] for the encode-unicode 
> issue.
> 2.) Find a different table-generating crate with less issues.
> 3.) Fork and fix prettytable-rs.
> 4.) ???
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8637) Resolve Issues with `prettytable-rs` dependency

2020-04-29 Thread Mark Hildreth (Jira)
Mark Hildreth created ARROW-8637:


 Summary: Resolve Issues with `prettytable-rs` dependency
 Key: ARROW-8637
 URL: https://issues.apache.org/jira/browse/ARROW-8637
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Mark Hildreth


{{prettytable-rs}} is a dependency of Arrow for creating a string for 
displaying record batches in a tabular form (see [pretty 
util|https://github.com/apache/arrow/blob/c546eef41e6ab20c4ca29a2d836987959843896f/rust/arrow/src/util/pretty.rs#L24-L25])
 The crate, however, has some issues:

 

1.) {{prettytable-rs}} has a dependency on the {{term}} crate. The {{term}} 
crate is under minimal maintenance, and it is advised to switch to another 
crate. This will probably pop up in an [informational security 
advisory|https://rustsec.org/advisories/RUSTSEC-2018-0015] if it's decided one 
day to audit the crates.

2.) The crate also has a dependency on {{encode-unicode}}. While not 
problematic in its own right, this crate implements some traits which can bring 
about confusing type inference issues. For example, after adding the 
{{prettytable-rs}} dependency in arrow, the following error occurred what 
attempting to compile the parquet crate:

 

{{let seed_vec: Vec =}}

{{    Standard.sample_iter( rng).take(seed_len).collect();}}

 

{{error[E0282]: type annotations needed}}
{{   --> parquet/src/encodings/rle.rs:833:26}}
{{    |}}
{{833 | Standard.sample_iter( rng).take(seed_len).collect();}}
{{    | ^^^ cannot infer type for `T`}}

 

Any user of the arrow crate would see a similar style of error.

 

There are a few possible ways to resolve this:

 

1.) Hopefully hear from the crate maintainer. There is a [PR 
open|https://github.com/phsym/prettytable-rs/pull/125] for the encode-unicode 
issue.

2.) Find a different table-generating crate with less issues.

3.) Fork and fix prettytable-rs.

4.) ???

 
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8590) [Rust] Use Arrow pretty print utility in DataFusion

2020-04-24 Thread Mark Hildreth (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hildreth updated ARROW-8590:
-
Description: ARROW-8287 added some new utility methods for pretty printing 
into the rust arrow crate (see [PR 
6972|https://github.com/apache/arrow/pull/6972]). These were basically copied 
from DataFusion. Modify DataFusion to use the utility methods in the arrow 
crate, removing the duplicate code.  (was: ARROW-8287 added some new utility 
methods for pretty printing into the rust arrow crate (see [PR 
6972|https://github.com/apache/arrow/pull/6972]). These were basically pulled 
from DataFusion. Modify DataFusion to use the utility methods in the arrow 
crate, removing the duplicate code.)

> [Rust] Use Arrow pretty print utility in DataFusion
> ---
>
> Key: ARROW-8590
> URL: https://issues.apache.org/jira/browse/ARROW-8590
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Mark Hildreth
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-8287 added some new utility methods for pretty printing into the rust 
> arrow crate (see [PR 6972|https://github.com/apache/arrow/pull/6972]). These 
> were basically copied from DataFusion. Modify DataFusion to use the utility 
> methods in the arrow crate, removing the duplicate code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8590) [Rust] Use Arrow pretty print utility in DataFusion

2020-04-24 Thread Mark Hildreth (Jira)
Mark Hildreth created ARROW-8590:


 Summary: [Rust] Use Arrow pretty print utility in DataFusion
 Key: ARROW-8590
 URL: https://issues.apache.org/jira/browse/ARROW-8590
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Mark Hildreth


ARROW-8287 added some new utility methods for pretty printing into the rust 
arrow crate (see [PR 6972|https://github.com/apache/arrow/pull/6972]). These 
were basically pulled from DataFusion. Modify DataFusion to use the utility 
methods in the arrow crate, removing the duplicate code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8559) [Rust] Consolidate Record Batch iterator traits in main arrow crate

2020-04-24 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091739#comment-17091739
 ] 

Mark Hildreth edited comment on ARROW-8559 at 4/24/20, 5:00 PM:


Generally in favor, but one question and one bikeshed:

Question: perhaps my Rust-fu is lacking, but why would we also need a 
{{SendableBatchIterator}}? If we want to make sure that a type marks itself 
{{Send}} and/or {{Sync}}, it can do that. If an interface wants to accept only 
{{Send}} and/or {{Sync}} iterators, it could do {{BatchIterator + Send + Sync}}.

Bikeshed: There are no {{std::iter::Iterator}} trait implementation for either 
{{BatchIterator}} or {{RecordBatchReader}}. Thus, using the name {{Iterator}} 
seems a bit misleading.


was (Author: markhildreth):
Generally in favor, but one question and one bikeshed:

Question: perhaps my Rust-fu is lacking, but why would we need a 
{{SendableBatchIterator}}? If we want to make sure that a type marks itself 
{{Send}} and/or {{Sync}}, it can do that. If an interface wants to accept only 
{{Send}} and/or {{Sync}} iterators, it could do {{BatchIterator + Send + Sync}}.

Bikeshed: There are no {{std::iter::Iterator}} trait implementation for either 
{{BatchIterator}} or {{RecordBatchReader}}. Thus, using the name {{Iterator}} 
seems a bit misleading.

> [Rust] Consolidate Record Batch iterator traits in main arrow crate
> ---
>
> Key: ARROW-8559
> URL: https://issues.apache.org/jira/browse/ARROW-8559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>
> We have the `BatchIterator` trait in DataFusion and the `RecordBatchReader` 
> trait in the main arrow crate.
> They differ in that `BatchIterator` is Send + Sync.  They should both be in 
> the Arrow crate and be named `BatchIterator` and `SendableBatchIterator`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8559) [Rust] Consolidate Record Batch iterator traits in main arrow crate

2020-04-24 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091739#comment-17091739
 ] 

Mark Hildreth commented on ARROW-8559:
--

Generally in favor, but one question and one bikeshed:

Question: perhaps my Rust-fu is lacking, but why would we need a 
{{SendableBatchIterator}}? If we want to make sure that a type marks itself 
{{Send}} and/or {{Sync}}, it can do that. If an interface wants to accept only 
{{Send}} and/or {{Sync}} iterators, it could do {{BatchIterator + Send + Sync}}.

Bikeshed: There are no {{std::iter::Iterator}} trait implementation for either 
{{BatchIterator}} or {{RecordBatchReader}}. Thus, using the name {{Iterator}} 
seems a bit misleading.

> [Rust] Consolidate Record Batch iterator traits in main arrow crate
> ---
>
> Key: ARROW-8559
> URL: https://issues.apache.org/jira/browse/ARROW-8559
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>
> We have the `BatchIterator` trait in DataFusion and the `RecordBatchReader` 
> trait in the main arrow crate.
> They differ in that `BatchIterator` is Send + Sync.  They should both be in 
> the Arrow crate and be named `BatchIterator` and `SendableBatchIterator`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7574) [Rust] FileSource read implementation is seeking for each single byte

2020-04-22 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089670#comment-17089670
 ] 

Mark Hildreth commented on ARROW-7574:
--

Although this issue was reported first, ARROW-7681 appears to be trying to 
solve the same problem, and has some pull requests looking to make the change.

> [Rust] FileSource read implementation is seeking for each single byte
> -
>
> Key: ARROW-7574
> URL: https://issues.apache.org/jira/browse/ARROW-7574
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Jörn Horstmann
>Priority: Major
>
> on current master branch
> {code:java}
> $ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet
> ...
> lseek(3, -8, SEEK_END)  = 2937
> read(3, ",\10\0\0PAR1", 8192)   = 8
> lseek(3, 845, SEEK_SET) = 845
> read(3, "\25\2\31\334H schema"..., 8192) = 2100
> ...
> lseek(5, 4, SEEK_SET)   = 4
> read(5, 
> "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) 
> = 2941
> lseek(5, 5, SEEK_SET)   = 5
> read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020"..., 
> 8192) = 2940
> lseek(5, 6, SEEK_SET)   = 6
> read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200"..., 
> 8192) = 2939
> lseek(5, 7, SEEK_SET)   = 7
> read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000"..., 
> 8192) = 2938
> lseek(5, 8, SEEK_SET)   = 8
> read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) 
> = 2937
> lseek(5, 9, SEEK_SET)   = 9
> read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\024"..., 8192) = 
> 2936
> lseek(5, 10, SEEK_SET)  = 10
> read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\024\30"..., 8192) = 
> 2935
> {code}
>  Notice the seek position being incremented by one, despite reading up to 
> 8192 bytes at a time. Interestingly this does not seem to have a big 
> performance impact on a local file system with linux, but becomes a problem 
> when working with a custom implementation of ParquetReader, for example for 
> reading from s3.
> The problem seems to be in
> {code}
> impl Read for FileSource
> {code}
> which is unconditionally calling
> {code}
> reader.seek(SeekFrom::Start(self.start as u64))?
> {code}
> Instead it should probably keep track of the current position and only seek 
> on the first read.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8508) [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets

2020-04-21 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089198#comment-17089198
 ] 

Mark Hildreth commented on ARROW-8508:
--

I believe there are a few things going on here:

1.) I wouldn't consider myself an expert on these APIs, but it seems like the 
builders are being used correctly.

2.) The debug output definitely appears broken. I opened a [PR to fix 
this|https://github.com/apache/arrow/pull/7006], which puts it more in line 
with how the non-fixed size *ListArray* works. This should fix the *value()* 
method on the FixedSizeListArray to properly take the offset into the child 
array into account.

3.) As for the asserts that fail, this I'm less certain on. The values from 
these asserts are taken from the *values()* method, which seems to just return 
the underlying array without taking offsets into account. This seems to be 
similar to how other arrays work (including primitives), so my guess it is by 
design. I don't have an explanation for a better way of using the API, so maybe 
someone else can provide input.

> [Rust] ListBuilder of FixedSizeListBuilder creates wrong offsets
> 
>
> Key: ARROW-8508
> URL: https://issues.apache.org/jira/browse/ARROW-8508
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Christian Beilschmidt
>Priority: Major
>  Labels: pull-request-available
>
> I created an example of storing multi points with Arrow.
>  # A coordinate consists of two floats (Float64Builder)
>  # A multi point consists of one or more coordinates (FixedSizeListBuilder)
>  # A list of multi points consists of multiple multi points (ListBuilder)
> This is the corresponding code snippet:
> {code:java}
> let float_builder = arrow::array::Float64Builder::new(0);
> let coordinate_builder = 
> arrow::array::FixedSizeListBuilder::new(float_builder, 2);
> let mut multi_point_builder = 
> arrow::array::ListBuilder::new(coordinate_builder);
> multi_point_builder
> .values()
> .values()
> .append_slice(&[0.0, 0.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder
> .values()
> .values()
> .append_slice(&[1.0, 1.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder.append(true).unwrap(); // first multi point
> multi_point_builder
> .values()
> .values()
> .append_slice(&[2.0, 2.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder
> .values()
> .values()
> .append_slice(&[3.0, 3.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder
> .values()
> .values()
> .append_slice(&[4.0, 4.1])
> .unwrap();
> multi_point_builder.values().append(true).unwrap();
> multi_point_builder.append(true).unwrap(); // second multi point
> let multi_point = dbg!(multi_point_builder.finish());
> let first_multi_point_ref = multi_point.value(0);
> let first_multi_point: ::array::FixedSizeListArray = 
> first_multi_point_ref.as_any().downcast_ref().unwrap();
> let coordinates_ref = first_multi_point.values();
> let coordinates:  = 
> coordinates_ref.as_any().downcast_ref().unwrap();
> assert_eq!(coordinates.value_slice(0, 2 * 2), &[0.0, 0.1, 1.0, 1.1]);
> let second_multi_point_ref = multi_point.value(1);
> let second_multi_point: ::array::FixedSizeListArray = 
> second_multi_point_ref.as_any().downcast_ref().unwrap();
> let coordinates_ref = second_multi_point.values();
> let coordinates:  = 
> coordinates_ref.as_any().downcast_ref().unwrap();
> assert_eq!(coordinates.value_slice(0, 2 * 3), &[2.0, 2.1, 3.0, 3.1, 4.0, 
> 4.1]);
> {code}
> The second assertion fails and the output is {{[0.0, 0.1, 1.0, 1.1, 2.0, 
> 2.1]}}.
> Moreover, the debug output produced from {{dbg!}} confirms this:
> {noformat}
> [
>   FixedSizeListArray<2>
> [
>   PrimitiveArray
> [
>   0.0,
>   0.1,
> ],
>   PrimitiveArray
> [
>   1.0,
>   1.1,
> ],
> ],
>   FixedSizeListArray<2>
> [
>   PrimitiveArray
> [
>   0.0,
>   0.1,
> ],
>   PrimitiveArray
> [
>   1.0,
>   1.1,
> ],
>   PrimitiveArray
> [
>   2.0,
>   2.1,
> ],
> ],
> ]{noformat}
> The second list should contain the values 2-4.
>  
> So either I am using the builder wrong or there is a bug with the offsets. I 
> used {{0.16}} as well as the current {{master}} from GitHub.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8287) [Rust] Arrow examples should use utility to print results

2020-04-20 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086534#comment-17086534
 ] 

Mark Hildreth edited comment on ARROW-8287 at 4/20/20, 6:42 PM:


I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). This is probably a bit 
much. I would recommend only expose *print_batches* for now so we could switch 
out the internal implementation if desired.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a pretty complicated problem for someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].
 We could either; ignore this problem, look for another crate to do tables, or 
try to change the prettytable crate. *Update:* I have created a [PR for that 
crate|https://github.com/phsym/prettytable-rs/pull/125] to see if they would 
want to remove the dependency, but not sure how active the original developer 
is.
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new logical type called "Table" which tries to 
do this (although some of their APIs also don't seem to provide that 
guarantee). However, development of such a structure is way outside the scope 
of this project, so I would be happy to say forget about it and perhaps add an 
issue to revisit this. As a short-term solution, *print_table* could take a 
generic iterator of *RecordBatch* types, which if we did end up with a *Table* 
type later on probably wouldn't need to be changed.

 
 So, here are my blocking questions:
 * Stick with the original prettytable crate as is and just add the required 
type annotations in the Parquet test, or find an alternative? I recommend 
[using the branch in the 
PR|https://github.com/markhildreth/prettytable-rs/tree/remove-encode-unicode-dependency],
 switching back to the actual crate if that crate is merged.
 * Keep *create_table* public, or make it private? I recommend make it private.
 * Come up with a better wrapper for a "Table" to enforce 
one-schema-multiple-record batches, or don't worry about this for now? My 
recommendation is don't worry about it for now, but make *print_table* accept 
an iterator and to add an issue to think more about creating a *Table* type 
like other APIs do.


was (Author: markhildreth):
I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). This is probably a bit 
much. I would recommend only expose *print_batches* for now so we could switch 
out the internal implementation if desired.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a 

[jira] [Comment Edited] (ARROW-8287) [Rust] Arrow examples should use utility to print results

2020-04-20 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086534#comment-17086534
 ] 

Mark Hildreth edited comment on ARROW-8287 at 4/20/20, 6:41 PM:


I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). This is probably a bit 
much. I would recommend only expose *print_batches* for now so we could switch 
out the internal implementation if desired.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a pretty complicated problem for someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].
 We could either; ignore this problem, look for another crate to do tables, or 
try to change the prettytable crate. *Update:* I have created a [PR for that 
crate|[https://github.com/phsym/prettytable-rs/pull/125|https://github.com/phsym/prettytable-rs/pull/125)]]
 to see if they would want to remove the dependency.
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new logical type called "Table" which tries to 
do this (although some of their APIs also don't seem to provide that 
guarantee). However, development of such a structure is way outside the scope 
of this project, so I would be happy to say forget about it and perhaps add an 
issue to revisit this. As a short-term solution, *print_table* could take a 
generic iterator of *RecordBatch* types, which if we did end up with a *Table* 
type later on probably wouldn't need to be changed.

 
 So, here are my blocking questions:
 * Stick with the original prettytable crate as is and just add the required 
type annotations in the Parquet test, or find an alternative? I recommend 
[using the branch in the 
PR|https://github.com/markhildreth/prettytable-rs/tree/remove-encode-unicode-dependency],
 switching back to the actual crate if that crate is merged.
 * Keep *create_table* public, or make it private? I recommend make it private.
 * Come up with a better wrapper for a "Table" to enforce 
one-schema-multiple-record batches, or don't worry about this for now? My 
recommendation is don't worry about it for now, but make *print_table* accept 
an iterator and to add an issue to think more about creating a *Table* type 
like other APIs do.


was (Author: markhildreth):
I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). This is probably a bit 
much. I would recommend only expose *print_batches* for now so we could switch 
out the internal implementation if desired.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a 

[jira] [Comment Edited] (ARROW-8287) [Rust] Arrow examples should use utility to print results

2020-04-20 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086534#comment-17086534
 ] 

Mark Hildreth edited comment on ARROW-8287 at 4/20/20, 6:40 PM:


I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). This is probably a bit 
much. I would recommend only expose *print_batches* for now so we could switch 
out the internal implementation if desired.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a pretty complicated problem for someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].
 We could either; ignore this problem, look for another crate to do tables, or 
try to change the prettytable crate. *Update:* I have created a [PR for that 
crate]([https://github.com/phsym/prettytable-rs/pull/125)] to see if they would 
want to remove the dependency. 
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new logical type called "Table" which tries to 
do this (although some of their APIs also don't seem to provide that 
guarantee). However, development of such a structure is way outside the scope 
of this project, so I would be happy to say forget about it and perhaps add an 
issue to revisit this. As a short-term solution, *print_table* could take a 
generic iterator of *RecordBatch* types, which if we did end up with a *Table* 
type later on probably wouldn't need to be changed.

 
 So, here are my blocking questions:
 * Stick with the original prettytable crate as is and just add the required 
type annotations in the Parquet test, or find an alternative? I recommend 
[using the branch in the 
PR|https://github.com/markhildreth/prettytable-rs/tree/remove-encode-unicode-dependency],
 switching back to the actual crate if that crate is merged.
 * Keep *create_table* public, or make it private? I recommend make it private.
 * Come up with a better wrapper for a "Table" to enforce 
one-schema-multiple-record batches, or don't worry about this for now? My 
recommendation is don't worry about it for now, but make *print_table* accept 
an iterator and to add an issue to think more about creating a *Table* type 
like other APIs do.


was (Author: markhildreth):
I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). This is probably a bit 
much. I would recommend only expose *print_batches* for now so we could switch 
out the internal implementation if desired.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a pretty complicated problem for someone of my Rust 

[jira] [Comment Edited] (ARROW-8287) [Rust] Arrow examples should use utility to print results

2020-04-19 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086534#comment-17086534
 ] 

Mark Hildreth edited comment on ARROW-8287 at 4/20/20, 1:36 AM:


I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). This is probably a bit 
much. I would recommend only expose *print_batches* for now so we could switch 
out the internal implementation if desired.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a pretty complicated problem for someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new logical type called "Table" which tries to 
do this (although some of their APIs also don't seem to provide that 
guarantee). However, development of such a structure is way outside the scope 
of this project, so I would be happy to say forget about it and perhaps add an 
issue to revisit this. As a short-term solution, *print_table* could take a 
generic iterator of *RecordBatch* types, which if we did end up with a *Table* 
type later on probably wouldn't need to be changed.

 
 So, here are my blocking questions:
 * Stick with the original prettytable crate and just add the required type 
annotations in the Parquet test, or find another crate that doesn't have said 
side effect? I recommend finding a different one.
 * Keep *create_table* public, or make it private? I recommend make it private.
 * Come up with a better wrapper for a "Table" to enforce 
one-schema-multiple-record batches, or don't worry about this for now? My 
recommendation is don't worry about it for now, but make *print_table* accept 
an iterator and to add an issue to think more about creating a *Table* type 
like other APIs do.


was (Author: markhildreth):
I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). The simplest fix is to 
make *create_table* private, and only expose *print_batches* for now, which is 
what I would recommend.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a pretty complicated problem for someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new 

[jira] [Comment Edited] (ARROW-8287) [Rust] Arrow examples should use utility to print results

2020-04-18 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086534#comment-17086534
 ] 

Mark Hildreth edited comment on ARROW-8287 at 4/18/20, 5:10 PM:


I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). The simplest fix is to 
make *create_table* private, and only expose *print_batches* for now, which is 
what I would recommend.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a pretty complicated problem for someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new logical type called "Table" which tries to 
do this (although some of their APIs also don't seem to provide that 
guarantee). However, development of such a structure is way outside the scope 
of this project, so I would be happy to say forget about it and perhaps add an 
issue to revisit this. As a short-term solution, *print_table* could take a 
generic iterator of *RecordBatch* types, which if we did end up with a *Table* 
type later on probably wouldn't need to be changed.

 
 So, here are my blocking questions:
 * Stick with the original prettytable crate and just add the required type 
annotations in the Parquet test, or find another crate that doesn't have said 
side effect? I recommend finding a different one.
 * Keep *create_table* public, or make it private? I recommend make it private.
 * Come up with a better wrapper for a "Table" to enforce 
one-schema-multiple-record batches, or don't worry about this for now? My 
recommendation is don't worry about it for now, but make *print_table* accept 
an iterator and to add an issue to think more about creating a *Table* type 
like other APIs do.


was (Author: markhildreth):
I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). The simplest fix is to 
make *create_table* private, and only expose *print_batches* for now, which is 
what I would recommend.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a pretty complicated problem for someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new logical type called 

[jira] [Comment Edited] (ARROW-8287) [Rust] Arrow examples should use utility to print results

2020-04-18 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086534#comment-17086534
 ] 

Mark Hildreth edited comment on ARROW-8287 at 4/18/20, 5:10 PM:


I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). The simplest fix is to 
make *create_table* private, and only expose *print_batches* for now, which is 
what I would recommend.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate,|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 hence explaining the broken builds.

{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
This was a pretty complicated problem for someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new logical type called "Table" which tries to 
do this (although some of their APIs also don't seem to provide that 
guarantee). However, development of such a structure is way outside the scope 
of this project, so I would be happy to say forget about it and perhaps add an 
issue to revisit this. As a short-term solution, *print_table* could take a 
generic iterator of *RecordBatch* types, which if we did end up with a *Table* 
type later on probably wouldn't need to be changed.

 
 So, here are my blocking questions:
 * * Stick with the original prettytable crate and just add the required type 
annotations in the Parquet test, or find another crate that doesn't have said 
side effect? I recommend finding a different one.
 * Keep *create_table* public, or make it private? I recommend make it private.
 * Come up with a better wrapper for a "Table" to enforce 
one-schema-multiple-record batches, or don't worry about this for now? My 
recommendation is don't worry about it for now, but make *print_table* accept 
an iterator and to add an issue to think more about creating a *Table* type 
like other APIs do.


was (Author: markhildreth):
I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). The simplest fix is to 
make *create_table* private, and only expose *print_batches* for now, which is 
what I would recommend.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate.|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 This was a pretty complicated problem with someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].


{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new logical type called "Table" which tries to 
do this 

[jira] [Commented] (ARROW-8287) [Rust] Arrow examples should use utility to print results

2020-04-18 Thread Mark Hildreth (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086534#comment-17086534
 ] 

Mark Hildreth commented on ARROW-8287:
--

I took this PR hoping it would be a simple intro, but there's actually a bit 
more here than what meets the eye. Here are my notes.
 * If the utility methods were moved as-is to the arrow crate, then the public 
interface of the arrow crate would now include the *prettytable* crate's 
*Table* struct (as that is what *create_table* returns). The simplest fix is to 
make *create_table* private, and only expose *print_batches* for now, which is 
what I would recommend.
 * Second, the crate used to create the strings used in the output 
(*prettytable*) has a dependency on the crate *encode_unicode*. The 
*encode_unicode* crate does some funky stuff with implementing the trait 
*FromIterator* for *Vec*. This can cause issues with any code that would 
use the *arrow* crate that rely on there being only one way to collect an 
*Iterator<_>* into *Vec*, which actually [broke some code in a test in the 
parquet 
crate.|https://github.com/apache/arrow/blob/8648cd46fd990e5c2e76c265b6f927b84a194ffb/rust/parquet/src/encodings/rle.rs#L832-L833]
 This was a pretty complicated problem with someone of my Rust experience, I 
wrote up more information about it in [this reddit 
thread|https://www.reddit.com/r/rust/comments/g3iqan/crates_implementing_fromiterator_for_std/].


{code:java}
error[E0282]: type annotations needed
   --> parquet/src/encodings/rle.rs:833:26
|
833 | Standard.sample_iter( rng).take(seed_len).collect();
| ^^^ cannot infer type for `T`
 {code}
 * Additionally, the interface for print_batches accepts a vector of multiple 
RecordBatches. Unfortunately, there is no static guarantee that the 
RecordBatches have the same schema. The C++/Python and Javascript 
implementations have created a new logical type called "Table" which tries to 
do this (although some of their APIs also don't seem to provide that 
guarantee). However, development of such a structure is way outside the scope 
of this project, so I would be happy to say forget about it and perhaps add an 
issue to revisit this. As a short-term solution, *print_table* could take a 
generic iterator of *RecordBatch* types, which if we did end up with a *Table* 
type later on probably wouldn't need to be changed.

 
So, here are my blocking questions: * Stick with the original prettytable crate 
and just add the required type annotations in the Parquet test, or find another 
crate that doesn't have said side effect? I recommend finding a different one.
 * Keep *create_table* public, or make it private? I recommend make it private.
 * Come up with a better wrapper for a "Table" to enforce 
one-schema-multiple-record batches, or don't worry about this for now? My 
recommendation is don't worry about it for now, but make *print_table* accept 
an iterator and to add an issue to think more about creating a *Table* type 
like other APIs do.

> [Rust] Arrow examples should use utility to print results
> -
>
> Key: ARROW-8287
> URL: https://issues.apache.org/jira/browse/ARROW-8287
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/6773] added a utility for printing 
> record batches and the DataFusion examples were updated to use this. We 
> should now do the same for the Arrow examples. This will require moving the 
> utility method from the datafusion crate to the arrow crate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)