[jira] [Created] (ARROW-13300) [Integration] Add Rust map

2021-07-11 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-13300:
--

 Summary: [Integration] Add Rust map
 Key: ARROW-13300
 URL: https://issues.apache.org/jira/browse/ARROW-13300
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neville Dipale


I'm working on Rust map support at https://github.com/apache/arrow-rs/pull/491.
We can add integration testing support after the PR is merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12156) [Rust] Calculate the size of a RecordBatch

2021-03-30 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12156:
--

 Summary: [Rust] Calculate the size of a RecordBatch
 Key: ARROW-12156
 URL: https://issues.apache.org/jira/browse/ARROW-12156
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neville Dipale


We can compute the size of an array, but there's no facility yet to compute the size of a RecordBatch.

This is useful if we need to measure the size of data we're about to write.
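
A minimal sketch of the kind of helper this could add, assuming the existing per-array get_array_memory_size method is a good enough per-column measure (the helper name is illustrative):

{code}
use arrow::array::Array;
use arrow::record_batch::RecordBatch;

/// Illustrative helper: sum the memory size of every column in the batch.
/// A real implementation might also account for schema overhead.
fn record_batch_memory_size(batch: &RecordBatch) -> usize {
    batch
        .columns()
        .iter()
        .map(|column| column.get_array_memory_size())
        .sum()
}
{code}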



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12153) [Rust] [Parquet] Return file metadata after writing Parquet file

2021-03-30 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12153:
--

 Summary: [Rust] [Parquet] Return file metadata after writing 
Parquet file
 Key: ARROW-12153
 URL: https://issues.apache.org/jira/browse/ARROW-12153
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neville Dipale
Assignee: Neville Dipale


Parquet writers like delta-rs rely on the Parquet metadata to write file-level 
statistics for file pruning purposes.

We currently do not expose these stats, requiring the writer to read the file 
that has just been written, to get the stats. This is more problematic for 
in-memory sinks, as there is currently no way of getting the metadata from the 
sink before it's persisted.

Explore if we can expose these stats to the writer, to make the above easier.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12121) [Rust] [Parquet] Arrow writer benchmarks

2021-03-28 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12121:
--

 Summary: [Rust] [Parquet] Arrow writer benchmarks
 Key: ARROW-12121
 URL: https://issues.apache.org/jira/browse/ARROW-12121
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


The common concern with Parquet's Arrow readers and writers is that they're 
slow.
My diagnosis is that we rely on a chain of processes, which introduces overhead.
For example, writing an Arrow RecordBatch involves the following:

1. Iterate through arrays to create def/rep levels
2. Extract Parquet primitive values from arrays using these levels
3. Write primitive values, validating them in the process (when they already 
should be validated)
4. Split the already materialised values into small batches for Parquet chunks 
(consider where we have 1e6 values in a batch)
5. Write these batches, computing the stats of each batch, and encoding values

The above is a side-effect of convenience, as it would likely require a lot more effort to bypass some of the steps.

I have ideas around going from step 1 to step 5 directly, but I won't know whether it's better without performance benchmarks. I also struggle to see whether I'm making improvements while I clean up the writer code, especially when removing the allocations that I created to reduce the complexity of the level calculations.

With ARROW-12120 (random array & batch generator), it becomes more convenient 
to benchmark (and test many combinations of) the Arrow writer.

I would thus like to start adding benchmarks for the Arrow writer.
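
As a rough illustration of the shape such a benchmark could take (a sketch using the criterion crate; the exact ArrowWriter sink requirements and signatures may differ between parquet crate versions):

{code}
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use criterion::{criterion_group, criterion_main, Criterion};
use parquet::arrow::ArrowWriter;

fn write_batch(batch: &RecordBatch) {
    // Write to a scratch file so the benchmark measures the writer itself.
    let file = std::fs::File::create("/tmp/arrow_writer_bench.parquet").unwrap();
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None).unwrap();
    writer.write(batch).unwrap();
    writer.close().unwrap();
}

fn bench_writer(c: &mut Criterion) {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let values = Int32Array::from((0..1_000_000).collect::<Vec<i32>>());
    let batch = RecordBatch::try_new(schema, vec![Arc::new(values) as ArrayRef]).unwrap();
    c.bench_function("arrow_writer_int32_1e6", |b| b.iter(|| write_batch(&batch)));
}

criterion_group!(benches, bench_writer);
criterion_main!(benches);
{code}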



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12120) [Rust] Generate random arrays and batches

2021-03-27 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12120:
--

 Summary: [Rust] Generate random arrays and batches
 Key: ARROW-12120
 URL: https://issues.apache.org/jira/browse/ARROW-12120
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Neville Dipale
Assignee: Neville Dipale


I need a random data generator for the Parquet <> Arrow integration. It takes 
me a while to craft a test case, so being able to create random data would make 
it a bit easier to improve test coverage and catch edge-cases in the code.
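
A minimal sketch of the idea, assuming the rand crate; the function name and shape are illustrative, not the final API:

{code}
use arrow::array::Int32Array;
use rand::Rng;

/// Build a primitive array with random values and a random validity mask.
/// A full generator would cover more types and whole record batches.
fn random_int32_array(len: usize, null_density: f64) -> Int32Array {
    let mut rng = rand::thread_rng();
    (0..len)
        .map(|_| {
            if rng.gen::<f64>() < null_density {
                None
            } else {
                Some(rng.gen::<i32>())
            }
        })
        .collect()
}
{code}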



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12116) [Rust] Fix or ignore 1.51 clippy lints

2021-03-26 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12116:
--

 Summary: [Rust] Fix or ignore 1.51 clippy lints
 Key: ARROW-12116
 URL: https://issues.apache.org/jira/browse/ARROW-12116
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Neville Dipale
Assignee: Neville Dipale


Rust 1.51 introduces some lints that have broken CI. We can either fix or 
ignore them, depending on the amount of time it'll take to fix them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12043) [Rust] [Parquet] Write fixed size binary arrays

2021-03-22 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12043:
--

 Summary: [Rust] [Parquet] Write fixed size binary arrays
 Key: ARROW-12043
 URL: https://issues.apache.org/jira/browse/ARROW-12043
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Neville Dipale


We already write fixed size binary (FSB) values when writing binary arrays, so this extends the support by removing the unimplemented code paths.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12019) [Rust] [Parquet] Update README for 2.6.0 support

2021-03-18 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12019:
--

 Summary: [Rust] [Parquet] Update README for 2.6.0 support
 Key: ARROW-12019
 URL: https://issues.apache.org/jira/browse/ARROW-12019
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


The Parquet README still talks about supporting 2.4.0, with a TODO for 2.5.0.
When the 2.6.0 support is completed, we can update the README.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12018) [Rust] [Parquet] Write lower precision Arrow decimal to int32/64

2021-03-18 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-12018:
--

 Summary: [Rust] [Parquet] Write lower precision Arrow decimal to 
int32/64
 Key: ARROW-12018
 URL: https://issues.apache.org/jira/browse/ARROW-12018
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


When ARROW-10818 is completed, we should start writing decimal arrays of lower 
precisions as i32 and i64.

I have left a TODO in the code as part of ARROW-11824



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11898) [Rust] Pretty print columns

2021-03-07 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11898:
--

 Summary: [Rust] Pretty print columns
 Key: ARROW-11898
 URL: https://issues.apache.org/jira/browse/ARROW-11898
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


We can pretty print a slice of record batches, but it's also useful to pretty 
print a slice of columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11824) [Rust] [Parquet] Use logical types in Arrow writer

2021-02-28 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11824:
--

 Summary: [Rust] [Parquet] Use logical types in Arrow writer
 Key: ARROW-11824
 URL: https://issues.apache.org/jira/browse/ARROW-11824
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Neville Dipale


Start using the logical type for Arrow <> Parquet schema conversion, so that we 
can support more Arrow types, like nanosecond temporal types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11803) [Rust] [Parquet] Support v2 LogicalType

2021-02-27 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11803:
--

 Summary: [Rust] [Parquet] Support v2 LogicalType
 Key: ARROW-11803
 URL: https://issues.apache.org/jira/browse/ARROW-11803
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Neville Dipale
Assignee: Neville Dipale


We currently do not read or write the version 2 logical types. This is mainly because we do not have a mapping for it from parquet-format-rs.

To implement this, we can:
- convert "parquet::basic::LogicalType" to "parquet::basic::ConvertedType"
- implement "parquet::basic::LogicalType" which mirrors 
"parquet_format::LogicalType"
- create a mapping between ConvertedType and LogicalType
- write LogicalType to "parquet_format::SchemaElement" if v2 of the writer is 
used

This would be a good starting point for implementing 2.6 types (UUID, NANOS 
precision time & timestamp).
Follow-up work would be:
- parsing v2 of the schema
- Using v2 in the Arrow writer (mostly schema conversion)
- Supporting nanosecond precision time & timestamp



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11798) [Integration] Update testing submodule

2021-02-26 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11798:
--

 Summary: [Integration] Update testing submodule
 Key: ARROW-11798
 URL: https://issues.apache.org/jira/browse/ARROW-11798
 Project: Apache Arrow
  Issue Type: Task
Reporter: Neville Dipale


Updates submodule after ARROW-11666, and removes references to files that no 
longer exist (generated_large_batch)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11619) [Rust] String-based path field projection

2021-02-13 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11619:
--

 Summary: [Rust] String-based path field projection
 Key: ARROW-11619
 URL: https://issues.apache.org/jira/browse/ARROW-11619
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Neville Dipale


Similar to ARROW-11618, we could benefit from the ability to pluck out specific 
fields from an Arrow schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11618) [Rust] [Parquet] String-based path column projection

2021-02-13 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11618:
--

 Summary: [Rust] [Parquet] String-based path column projection
 Key: ARROW-11618
 URL: https://issues.apache.org/jira/browse/ARROW-11618
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Neville Dipale


There is currently no way to select a column by its path, e.g. 'a.b.c'. We have 
to select the column by its index, which is not trivial for nested structures.

For example, if a record has the following schema, the column indices are shown 
in parentheses:

{code}
schema:
  a [struct] ("a")
b [struct]   ("a.b")  
  c [int32]  ("a.b.c")
  d [struct] ("a.b.d")
e [int32]("a.b.d.e")  [0]
f [bool] ("a.b.d.f")  [1]
  g [int64]  ("a.b.g")[2]
{code}

If one wants to select 'a.b', they need to know that 'a.b' spans 3 columns (0 to 2). This is inconvenient, and potentially forces readers to read whole records to avoid this inconvenience.

A string-based projection could allow one to select columns 0 and 1 via "a.b.d" or column 2 via "a.b.g".
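
A rough sketch of the idea (the helper and the path-to-leaf map are hypothetical, not an existing API): expand a selected path prefix into the leaf column indices it spans.

{code}
use std::collections::HashMap;

/// Given a precomputed map from dotted leaf paths (e.g. "a.b.d.e") to leaf
/// column indices, return the sorted leaf indices covered by a selected path.
fn leaves_for_path(path_to_leaf: &HashMap<String, usize>, selected: &str) -> Vec<usize> {
    let prefix = format!("{}.", selected);
    let mut leaves: Vec<usize> = path_to_leaf
        .iter()
        .filter(|(path, _)| path.as_str() == selected || path.starts_with(&prefix))
        .map(|(_, index)| *index)
        .collect();
    leaves.sort_unstable();
    leaves
}

// With the example schema above: "a.b.d" -> [0, 1], "a.b.g" -> [2], "a.b" -> [0, 1, 2].
{code}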



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11605) [Rust] Adopt a MSRV policy

2021-02-11 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11605:
--

 Summary: [Rust] Adopt a MSRV policy
 Key: ARROW-11605
 URL: https://issues.apache.org/jira/browse/ARROW-11605
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Neville Dipale


With all our crates now supporting stable Rust, we can decide on a Minimum 
Supported Rust Version, so that we don't introduce breakage to people relying 
on older Rust versions.

We could:
* Determine the earliest Rust version that compiles (at least 1.39 due to async in DataFusion)
* Use this version in CI
* Decide on, and document, a policy for how we update versions

This might mean that when new features land in stable Rust, we would hold off on using them until they are available under our MSRV.

Thoughts [~Dandandan] [~alamb] [~jorgecarleitao] [~andygrove] [~paddyhoran] 
[~sunchao]?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11599) [Rust] Add function to create array with all nulls

2021-02-11 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11599:
--

 Summary: [Rust] Add function to create array with all nulls
 Key: ARROW-11599
 URL: https://issues.apache.org/jira/browse/ARROW-11599
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Neville Dipale
Assignee: Neville Dipale






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11381) [Rust] [Parquet] LZ4 compressed files written in Rust can't be opened with C++

2021-01-25 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11381:
--

 Summary: [Rust] [Parquet] LZ4 compressed files written in Rust 
can't be opened with C++
 Key: ARROW-11381
 URL: https://issues.apache.org/jira/browse/ARROW-11381
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Neville Dipale


Parquet files that are written with LZ4 compression cannot be read by pyarrow. It seems that the issue might be LZ4 block vs frame compression, which we're also seeing in ARROW-8767.

I'll update this JIRA with more info, as I'm struggling to get pyspark up on macOS (Rosetta 2 issues).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11365) [Rust] [Parquet] Implement parsers for v2 of the text schema

2021-01-23 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11365:
--

 Summary: [Rust] [Parquet] Implement parsers for v2 of the text 
schema
 Key: ARROW-11365
 URL: https://issues.apache.org/jira/browse/ARROW-11365
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 3.0.0
Reporter: Neville Dipale


V2 of the writer produces schemas like:

    required INT32 fieldname INTEGER(32, true);

We should support parsing this format, as it maps to logical types.
I'm unsure of what the implications are for fields that don't have a logical type representation, but have a converted type (e.g. INTERVAL). We can try to write a V2 file with parquet-cpp and observe the behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11364) [Rust] Umbrella issue for parquet 2.6.0 support

2021-01-23 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11364:
--

 Summary: [Rust] Umbrella issue for parquet 2.6.0 support
 Key: ARROW-11364
 URL: https://issues.apache.org/jira/browse/ARROW-11364
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 3.0.0
Reporter: Neville Dipale


This is the umbrella issue where we can collect everything related to parquet 
2.6.0 support (parquet-format-rs: 2.6.1).

It looks like there's some plumbing needed on the type system and parsing logic to fully support writing and reading v2 of the file format.

Existing compatibility issues can also be linked to this, or added as sub-tasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11312) [Rust] Implement FromIter for timestamps, that includes timezone info

2021-01-18 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11312:
--

 Summary: [Rust] Implement FromIter for timestamps, that includes 
timezone info
 Key: ARROW-11312
 URL: https://issues.apache.org/jira/browse/ARROW-11312
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale


We currently have TimestampArray::from_vec and TimestampArray::from_opt_vec in 
order to provide timezone information. We do not have an option that uses 
FromIter.

When implementing this, we should search the codebase (esp Parquet) and replace 
the vector-based methods above with iterators where it makes sense.
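
A sketch of the two shapes side by side; the iterator-based form is hypothetical and only illustrates the proposal, and the from_opt_vec call assumes its current (values, timezone) signature:

{code}
use arrow::array::TimestampNanosecondArray;

fn example() {
    // Existing: vector-based constructor that carries the timezone.
    let from_vec = TimestampNanosecondArray::from_opt_vec(
        vec![Some(1_599_565_300_000_000_000), None],
        Some("Africa/Johannesburg".to_string()),
    );

    // Proposed shape (hypothetical, not an existing API): build from any
    // iterator of Option<i64> and attach the timezone afterwards, e.g.
    //
    //     let array: TimestampNanosecondArray = iter.collect();
    //     let array = array.with_timezone("Africa/Johannesburg");
    let _ = from_vec;
}
{code}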



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11308) [Rust] [Parquet] Add Arrow decimal array writer

2021-01-18 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11308:
--

 Summary: [Rust] [Parquet] Add Arrow decimal array writer
 Key: ARROW-11308
 URL: https://issues.apache.org/jira/browse/ARROW-11308
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11271) [Rust] [Parquet] List schema to Arrow parser misinterpreting child nullability

2021-01-16 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11271:
--

 Summary: [Rust] [Parquet] List schema to Arrow parser 
misinterpreting child nullability
 Key: ARROW-11271
 URL: https://issues.apache.org/jira/browse/ARROW-11271
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale
Assignee: Neville Dipale


We currently do not propagate child nullability correctly when reading parquet 
files from Spark 3.0.1 (parquet-mr 1.10.1).

For example, the below taken from 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] is 
currently interpreted incorrectly:

 
{code:java}
// List (list nullable, elements non-null)
optional group my_list (LIST) {
  repeated group list {
    required binary element (UTF8);
  }
}{code}
The Arrow type should be:
{code:java}
Field::new(
    "my_list",
    DataType::List(Box::new(Field::new("element", DataType::Utf8, false))),
    true, // nullable
){code}
but we currently end up with:
{code:java}
Field::new(
    "my_list",
    DataType::List(Box::new(Field::new("list", DataType::Utf8, true))),
    true, // nullable
)
{code}
This doesn't seem to be an issue with the master branch as of opening this issue, so it might not be severe enough to try to force into the 3.0.0 release.

I tested null and non-null Spark files, and was able to read them correctly. 
This becomes an issue with nested lists, which I'm working on.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11194) [Rust] Enable SIMD for aarch64

2021-01-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11194:
--

 Summary: [Rust] Enable SIMD for aarch64
 Key: ARROW-11194
 URL: https://issues.apache.org/jira/browse/ARROW-11194
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


Enable SIMD for aarch64, which includes the Apple ARM CPUs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11187) [Rust] [Parquet] Pin specific parquet-format-rs version

2021-01-08 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11187:
--

 Summary: [Rust] [Parquet] Pin specific parquet-format-rs version
 Key: ARROW-11187
 URL: https://issues.apache.org/jira/browse/ARROW-11187
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


We released parquet-format-rs v2.7.0, which has some incompatibilities with v2.6.x, so we should pin to the latter version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11181) [Rust] [Parquet] Document supported features

2021-01-08 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11181:
--

 Summary: [Rust] [Parquet] Document supported features
 Key: ARROW-11181
 URL: https://issues.apache.org/jira/browse/ARROW-11181
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


Document supported Parquet features in the Rust implementation, similar to 
ARROW-10941



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11126) [Rust] Document and test ARROW-10656

2021-01-04 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11126:
--

 Summary: [Rust] Document and test ARROW-10656
 Key: ARROW-11126
 URL: https://issues.apache.org/jira/browse/ARROW-11126
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale


Looks like I rebased against the PR branch, but didn't push my changes before 
the PR was merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11125) [Rust] Implement logical equality for list arrays

2021-01-04 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11125:
--

 Summary: [Rust] Implement logical equality for list arrays
 Key: ARROW-11125
 URL: https://issues.apache.org/jira/browse/ARROW-11125
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale
Assignee: Neville Dipale


We implemented logical equality for struct arrays, but not list arrays.

This work is now required for the Parquet nested list writer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11063) [Rust] Validate null counts when building arrays

2020-12-29 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11063:
--

 Summary: [Rust] Validate null counts when building arrays
 Key: ARROW-11063
 URL: https://issues.apache.org/jira/browse/ARROW-11063
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


ArrayDataBuilder allows the user to specify a null count, or calculates it if one is not set.

The problem is that the user-specified null count is never validated against the actual count from the null buffer.

I suggest removing the ability to specify a null count, and instead always calculating it.
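
A minimal sketch of the recount (a free-standing helper over the raw validity bytes, not an existing API):

{code}
use arrow::util::bit_util;

/// Count unset bits in the validity bitmap and compare against the
/// caller-supplied null count.
fn null_count_matches(validity: &[u8], len: usize, claimed_nulls: usize) -> bool {
    let valid = (0..len).filter(|&i| bit_util::get_bit(validity, i)).count();
    len - valid == claimed_nulls
}
{code}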



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11061) [Rust] Validate array properties against schema

2020-12-29 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11061:
--

 Summary: [Rust] Validate array properties against schema
 Key: ARROW-11061
 URL: https://issues.apache.org/jira/browse/ARROW-11061
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


We have a problem when it comes to nested arrays, where one could create a nested array (e.g. a list of structs) whose child fields can't be null, but where the list itself can have null slots.

This creates a lot of work when working with such nested arrays, because we 
have to create work-arounds to account for this, and take unnecessarily slower 
paths.

I propose that we prevent this problem at the source, by:
 * checking that a batch can't be created with arrays that have incompatible 
null contracts
 * preventing list and struct children from being non-null if any descendant of such children is null (might be less of an issue for structs)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11060) [Rust] Logical equality for list arrays

2020-12-28 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-11060:
--

 Summary: [Rust] Logical equality for list arrays
 Key: ARROW-11060
 URL: https://issues.apache.org/jira/browse/ARROW-11060
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Neville Dipale


Apply logical equality to lists. This requires computing the merged nulls of a list and its child based on the list offsets.

For example, a list with 3 slots, 5 values and offsets (0, 1, 3, 5) needs its validity expanded to a 5-value mask. If the list's validity is [true, false, true], and the child's validity is [t, f, t, f, t], we would get:

[t, f, f, t, t] AND [t, f, t, f, t] = [t, f, f, f, t]
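
A small worked sketch of that merge, using plain booleans instead of bit-packed buffers (the helper name is illustrative):

{code}
/// Expand the list's validity across its offsets, then AND it with the
/// child's validity to get the merged per-value mask.
fn merged_child_validity(
    list_validity: &[bool],  // one entry per list slot
    offsets: &[usize],       // list offsets, e.g. [0, 1, 3, 5]
    child_validity: &[bool], // one entry per child value
) -> Vec<bool> {
    let mut merged = Vec::with_capacity(child_validity.len());
    for (slot, window) in offsets.windows(2).enumerate() {
        for value in window[0]..window[1] {
            merged.push(list_validity[slot] && child_validity[value]);
        }
    }
    merged
}

// With list validity [true, false, true], offsets [0, 1, 3, 5] and child
// validity [t, f, t, f, t], this yields [t, f, f, f, t] as in the example.
{code}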



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10925) [Rust] Validate temporal data that has restrictions

2020-12-15 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10925:
--

 Summary: [Rust] Validate temporal data that has restrictions
 Key: ARROW-10925
 URL: https://issues.apache.org/jira/browse/ARROW-10925
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Neville Dipale


Some temporal data types have restrictions (e.g. Date64 values should be a multiple of 86,400,000 milliseconds, i.e. whole days). We should validate them when creating the arrays.
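
A minimal sketch of such a check for Date64 (the helper name is illustrative):

{code}
use arrow::array::{Array, Date64Array};

const MILLISECONDS_IN_DAY: i64 = 86_400_000;

/// Date64 values are milliseconds since the epoch and should represent
/// whole days, i.e. be evenly divisible by the milliseconds in a day.
fn validate_date64(array: &Date64Array) -> bool {
    (0..array.len())
        .filter(|&i| array.is_valid(i))
        .all(|i| array.value(i) % MILLISECONDS_IN_DAY == 0)
}
{code}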



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10893) [Rust] [DataFusion] Easier clippy fixes

2020-12-13 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10893:
--

 Summary: [Rust] [DataFusion] Easier clippy fixes
 Key: ARROW-10893
 URL: https://issues.apache.org/jira/browse/ARROW-10893
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust - DataFusion
Affects Versions: 2.0.0
Reporter: Neville Dipale


Address some of the lints that clippy can fix automatically.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10771) [Rust] Extend JSON schema inference to nested types

2020-11-29 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10771:
--

 Summary: [Rust] Extend JSON schema inference to nested types
 Key: ARROW-10771
 URL: https://issues.apache.org/jira/browse/ARROW-10771
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


Schema inference is currently limited to primitive types and lists of primitive 
types.

This ticket is for work to extend it to nested types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10770) [Rust] Support reading nested JSON lists

2020-11-29 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10770:
--

 Summary: [Rust] Support reading nested JSON lists
 Key: ARROW-10770
 URL: https://issues.apache.org/jira/browse/ARROW-10770
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


The JSON reader now supports reading nested structs, but we are still left with 
nested lists, which can be lists of lists, or lists of structs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10766) [Rust] Compute nested definition and repetition for list arrays

2020-11-29 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10766:
--

 Summary: [Rust] Compute nested definition and repetition for list 
arrays
 Key: ARROW-10766
 URL: https://issues.apache.org/jira/browse/ARROW-10766
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


This extends ARROW-9728 by focusing only on list array repetition and definition levels.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10764) [Rust] Inline small JSON and CSV test files

2020-11-28 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10764:
--

 Summary: [Rust] Inline small JSON and CSV test files
 Key: ARROW-10764
 URL: https://issues.apache.org/jira/browse/ARROW-10764
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale


Some of our tests use small CSV and JSON files, which we could inline in the 
code, instead of adding more files to test data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10757) [Rust] [CI] Sporadic failures due to disk filling up

2020-11-28 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10757:
--

 Summary: [Rust] [CI] Sporadic failures due to disk filling up
 Key: ARROW-10757
 URL: https://issues.apache.org/jira/browse/ARROW-10757
 Project: Apache Arrow
  Issue Type: Bug
  Components: CI, Rust
Reporter: Neville Dipale
Assignee: Neville Dipale


CI is failing due to the disk filling up, affecting almost all Rust PRs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10684) [Rust] Logical equality should consider parent array nullability

2020-11-22 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10684:
--

 Summary: [Rust] Logical equality should consider parent array 
nullability
 Key: ARROW-10684
 URL: https://issues.apache.org/jira/browse/ARROW-10684
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale


When creating a struct array with a primitive child array, it is possible for 
the child to be non-nullable, while its parent struct array is nullable.

In this scenario, the child array's slots where the parent is null become invalidated, so a child with values [1, 2, 3] whose parent has slot 2 null should be interpreted as [1, 0, 3].

This issue becomes evident in Parquet roundtrip tests, as we end up unable to correctly compare nested structures that have non-null children.

The specification caters for the above behaviour, see 
[http://arrow.apache.org/docs/format/Columnar.html#struct-layout] .

When a struct has nulls, its child array(s) nullability is subject to the 
parent struct.
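
A small sketch of the merge on plain booleans rather than bit-packed buffers (the helper name is illustrative): a child slot only counts as valid if its parent struct slot is also valid.

{code}
/// Merge a struct's validity with a child's validity to obtain the child's
/// logical validity.
fn logical_child_validity(parent_validity: &[bool], child_validity: &[bool]) -> Vec<bool> {
    parent_validity
        .iter()
        .zip(child_validity.iter())
        .map(|(parent, child)| *parent && *child)
        .collect()
}

// For child values [1, 2, 3] with the parent's slot 2 null, the merged mask
// invalidates the second value, matching the [1, 0, 3] interpretation above.
{code}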



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10674) [Rust] Add integration tests for Decimal type

2020-11-21 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10674:
--

 Summary: [Rust] Add integration tests for Decimal type
 Key: ARROW-10674
 URL: https://issues.apache.org/jira/browse/ARROW-10674
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


We have basic decimal support, but we have not yet included decimals in the 
integration testing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10550) [Rust] [Parquet] Write nested types (struct, list)

2020-11-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10550:
--

 Summary: [Rust] [Parquet] Write nested types (struct, list)
 Key: ARROW-10550
 URL: https://issues.apache.org/jira/browse/ARROW-10550
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale
 Fix For: 3.0.0


After being able to compute arbitrarily nested definition and repetition levels, we should be able to write structs and lists.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10391) [Rust] [Parquet] Nested Arrow reader

2020-10-26 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10391:
--

 Summary: [Rust] [Parquet] Nested Arrow reader
 Key: ARROW-10391
 URL: https://issues.apache.org/jira/browse/ARROW-10391
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale


The objective here is to create a reader that complies with at least Parquet 
2.4.0.

It complements the tasks for the writer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10334) [Rust] [Parquet] Support reading and writing Arrow NullArray

2020-10-17 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10334:
--

 Summary: [Rust] [Parquet] Support reading and writing Arrow 
NullArray
 Key: ARROW-10334
 URL: https://issues.apache.org/jira/browse/ARROW-10334
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10299) [Rust] Support reading and writing V5 of IPC metadata

2020-10-13 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10299:
--

 Summary: [Rust] Support reading and writing V5 of IPC metadata
 Key: ARROW-10299
 URL: https://issues.apache.org/jira/browse/ARROW-10299
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale


This is mostly alignment issues and tracking when we encounter the v4 legacy 
padding.

I had done this work in another branch, but discarded it without noticing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10289) [Rust] Support reading dictionary streams

2020-10-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10289:
--

 Summary: [Rust] Support reading dictionary streams
 Key: ARROW-10289
 URL: https://issues.apache.org/jira/browse/ARROW-10289
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale


We support reading dictionaries in the IPC file reader.

We should do the same with the stream reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10269) [Rust] Update nightly: Oct 2020 Edition

2020-10-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10269:
--

 Summary: [Rust] Update nightly: Oct 2020 Edition
 Key: ARROW-10269
 URL: https://issues.apache.org/jira/browse/ARROW-10269
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Neville Dipale


We should update to a more recent nightly after the 2.0.0 release. It carries some clippy annoyances, which will mean that I have to revert much of what I did around float comparisons.

It might also be preferable to do this sooner, so that we can complete the clippy integration and throw away the carrot in favour of the stick.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10268) [Rust] Support writing dictionaries to IPC file and stream

2020-10-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10268:
--

 Summary: [Rust] Support writing dictionaries to IPC file and stream
 Key: ARROW-10268
 URL: https://issues.apache.org/jira/browse/ARROW-10268
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


We currently do not support writing dictionary arrays to the IPC file and 
stream format.

When this is supported, we can test the integration with other implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10261) [Rust] [BREAKING] Lists should take Field instead of DataType

2020-10-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10261:
--

 Summary: [Rust] [BREAKING] Lists should take Field instead of 
DataType
 Key: ARROW-10261
 URL: https://issues.apache.org/jira/browse/ARROW-10261
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Integration, Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


There is currently no way of tracking nested field metadata on lists. For 
example, if a list's children are nullable, there's no way of telling just by 
looking at the Field.

This causes problems with integration testing, and also affects Parquet 
roundtrips.

I propose the breaking change of [Large|FixedSize]List taking a Field instead of Box<DataType>, as this will overcome the issue, and ensure that the Rust implementation passes integration tests.

CC [~andygrove] [~jorgecarleitao] [~alamb]  [~jhorstmann] ([~carols10cents] as 
this addresses some of the roundtrip failures).

I'm leaning towards this landing in 3.0.0, as I'd love for us to have completed 
or made significant traction on the Arrow Parquet writer (and reader), and 
integration testing, by then.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10259) [Rust] Support field metadata

2020-10-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10259:
--

 Summary: [Rust] Support field metadata
 Key: ARROW-10259
 URL: https://issues.apache.org/jira/browse/ARROW-10259
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


The biggest hurdle to adding field metadata is HashMap and HashSet not 
implementing Hash, Ord and PartialOrd.

I was thinking of implementing the metadata as a Vec<(String, String)> to 
overcome this limitation, and then serializing correctly to JSON.
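
A minimal sketch of that workaround (illustrative names, not an existing API): keep the pairs sorted so the containing Field can still derive Hash and Ord, and convert to a map only when callers need lookups.

{code}
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct FieldMetadata(Vec<(String, String)>);

impl FieldMetadata {
    fn new(mut pairs: Vec<(String, String)>) -> Self {
        // Keep a canonical order so equality and hashing are stable.
        pairs.sort();
        FieldMetadata(pairs)
    }

    fn to_map(&self) -> HashMap<String, String> {
        self.0.iter().cloned().collect()
    }
}
{code}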



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10258) [Rust] Support extension arrays

2020-10-09 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10258:
--

 Summary: [Rust] Support extension arrays
 Key: ARROW-10258
 URL: https://issues.apache.org/jira/browse/ARROW-10258
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Integration, Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


This should include:
 * supporting the Arrow format
 * supporting field metadata

We can optionally:
 * support recognising known extensions (like UUID)

I'm mainly opening this up for wider visibility; I noticed that I was catching strays from metadata integration tests failing because Field doesn't support metadata :(



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10225) [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests

2020-10-07 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10225:
--

 Summary: [Rust] [Parquet] Fix null bitmap comparisons in roundtrip tests
 Key: ARROW-10225
 URL: https://issues.apache.org/jira/browse/ARROW-10225
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


The Arrow spec makes the null bitmap optional if an array has no nulls [~carols10cents], so the tests were failing because we're comparing `None` with a 100% populated bitmap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10198) [Dev] Python merge script doesn't close PRs if not merged on master

2020-10-06 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10198:
--

 Summary: [Dev] Python merge script doesn't close PRs if not merged 
on master
 Key: ARROW-10198
 URL: https://issues.apache.org/jira/browse/ARROW-10198
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Affects Versions: 1.0.1
Reporter: Neville Dipale


When using the merge script to merge PRs against non-master branches, the PR on 
Github doesn't get closed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10191) [Rust] [Parquet] Add roundtrip tests for single column batches

2020-10-06 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10191:
--

 Summary: [Rust] [Parquet] Add roundtrip tests for single column 
batches
 Key: ARROW-10191
 URL: https://issues.apache.org/jira/browse/ARROW-10191
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


To aid with test coverage and picking up information loss during Parquet and 
Arrow roundtrips, we can add tests that assert that all supported Arrow 
datatypes can be written and read correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10168) [Rust] Extend arrow schema conversion to projected fields

2020-10-03 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10168:
--

 Summary: [Rust] Extend arrow schema conversion to projected fields
 Key: ARROW-10168
 URL: https://issues.apache.org/jira/browse/ARROW-10168
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


When writing Arrow data to Parquet, we serialise the schema's IPC 
representation. This schema is then read back by the Parquet reader, and used 
to preserve the array type information from the original Arrow data.

However, we do not rely on the above mechanism when reading projected columns from a Parquet file; i.e. if we have a file with 3 columns but only read 2 of them, we do not yet use the serialised Arrow schema, and can thus lose type information.

This behaviour was deliberately left out, as the function *parquet_to_arrow_schema_by_columns* does not check for the existence of the Arrow schema in the metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10103) [Rust] Add a Contains kernel

2020-09-25 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10103:
--

 Summary: [Rust] Add a Contains kernel
 Key: ARROW-10103
 URL: https://issues.apache.org/jira/browse/ARROW-10103
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


Add a `contains` function that checks whether a list array contains a primitive value. The result of the function is a boolean array.
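
A minimal sketch of the kernel's shape for one concrete case, i32 lists (a real kernel would be generic over the primitive types; the function name is illustrative):

{code}
use arrow::array::{Array, BooleanArray, Int32Array, ListArray};

/// For each list slot, emit true if any valid child value equals `needle`,
/// false otherwise, and null if the list slot itself is null.
fn contains_i32(list: &ListArray, needle: i32) -> BooleanArray {
    (0..list.len())
        .map(|row| {
            if list.is_null(row) {
                return None;
            }
            let values = list.value(row); // this row's child values
            let values = values
                .as_any()
                .downcast_ref::<Int32Array>()
                .expect("expected an Int32 child array");
            Some((0..values.len()).any(|i| values.is_valid(i) && values.value(i) == needle))
        })
        .collect()
}
{code}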



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10095) [Rust] [Parquet] Update for IPC changes

2020-09-25 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10095:
--

 Summary: [Rust] [Parquet] Update for IPC changes
 Key: ARROW-10095
 URL: https://issues.apache.org/jira/browse/ARROW-10095
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


The IPC changes made to comply with MetadataVersion 4 broke the rust-parquet 
writer branch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10041) [Rust] Possible to create LargeStringArray with DataType::Utf8

2020-09-18 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10041:
--

 Summary: [Rust] Possible to create LargeStringArray with 
DataType::Utf8
 Key: ARROW-10041
 URL: https://issues.apache.org/jira/browse/ARROW-10041
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


We don't perform enough checks on ArrayData when creating StringArray and LargeStringArray. As they use different integer sizes for offsets, this can create a problem where an array with i32 offsets could be incorrectly reinterpreted as one with i64 offsets, and vice versa.

We should add checks that prevent this. The same might apply to List and LargeList.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10040) [Rust] Create a way to slice unaligned offset buffers

2020-09-18 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10040:
--

 Summary: [Rust] Create a way to slice unaligned offset buffers
 Key: ARROW-10040
 URL: https://issues.apache.org/jira/browse/ARROW-10040
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 1.0.1
Reporter: Neville Dipale


We have limitations on the boolean kernels, where we can't apply the kernels on 
buffers whose offsets aren't a multiple of 8. This has the potential of 
preventing users from applying some computations on arrays whose offsets aren't 
divisible by 8.

We could create methods on Buffer that allow slicing buffers and copying them 
into aligned buffers.

An idea would be Buffer::slice(&self, offset: usize, len: usize) -> Buffer;
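
A minimal sketch of the copying idea over raw validity bytes (free-standing helper, not an existing API; the exact Buffer construction may differ between versions):

{code}
use arrow::buffer::Buffer;
use arrow::util::bit_util;

/// Copy `len` bits starting at an arbitrary bit `offset` into a fresh,
/// zero-offset buffer so byte-aligned kernels can operate on it.
fn slice_bits_to_aligned(src: &[u8], offset: usize, len: usize) -> Buffer {
    let mut bytes = vec![0u8; bit_util::ceil(len, 8)];
    for i in 0..len {
        if bit_util::get_bit(src, offset + i) {
            bit_util::set_bit(&mut bytes, i);
        }
    }
    Buffer::from(bytes)
}
{code}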



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9981) [Rust] Allow configuring flight IPC with IpcWriteOptions

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9981:
-

 Summary: [Rust] Allow configuring flight IPC with IpcWriteOptions
 Key: ARROW-9981
 URL: https://issues.apache.org/jira/browse/ARROW-9981
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


We have introduced an IPC write option, but we use the default for the 
arrow-flight crate, which is not ideal. Change this to allow configuring writer 
options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9980) [Rust] Fix parquet crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9980:
-

 Summary: [Rust] Fix parquet crate clippy lints
 Key: ARROW-9980
 URL: https://issues.apache.org/jira/browse/ARROW-9980
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


This addresses most clippy lints on the parquet crate. Other remaining lints 
can be addressed as part of future PRs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9979) [Rust] Fix arrow crate clippy lints

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9979:
-

 Summary: [Rust] Fix arrow crate clippy lints
 Key: ARROW-9979
 URL: https://issues.apache.org/jira/browse/ARROW-9979
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


This fixes many clippy lints, but not all. It takes hours to address lints, and we can work on the remaining ones in future PRs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9978) [Rust] Umbrella issue for clippy integration

2020-09-12 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9978:
-

 Summary: [Rust] Umbrella issue for clippy integration
 Key: ARROW-9978
 URL: https://issues.apache.org/jira/browse/ARROW-9978
 Project: Apache Arrow
  Issue Type: New Feature
  Components: CI, Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


This is an umbrella issue to collate outstanding and new tasks to enable clippy 
integration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9957) [Rust] Remove unmaintained tempdir dependency

2020-09-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9957:
-

 Summary: [Rust] Remove unmaintained tempdir dependency
 Key: ARROW-9957
 URL: https://issues.apache.org/jira/browse/ARROW-9957
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Affects Versions: 1.0.0
Reporter: Neville Dipale


Replace tempdir with tempfile, also removing older versions of some 
dependencies like rand.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9848) [Rust] Implement changes to ensure flatbuffer alignment

2020-08-24 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9848:
-

 Summary: [Rust] Implement changes to ensure flatbuffer alignment
 Key: ARROW-9848
 URL: https://issues.apache.org/jira/browse/ARROW-9848
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


See ARROW-6313; changes were made to all IPC implementations except for Rust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9841) [Rust] Update checked-in flatbuffer files

2020-08-24 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9841:
-

 Summary: [Rust] Update checked-in flatbuffer files
 Key: ARROW-9841
 URL: https://issues.apache.org/jira/browse/ARROW-9841
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


We can't automatically generate flatbuffer files in Rust due to a bug with 
required fields. 

The currently checked-in generated files are outdated, and should be updated either manually or by building the flatbuffers project from master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9777) [Rust] Implement IPC changes to catch up to 1.0.0 format

2020-08-17 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9777:
-

 Summary: [Rust] Implement IPC changes to catch up to 1.0.0 format
 Key: ARROW-9777
 URL: https://issues.apache.org/jira/browse/ARROW-9777
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


There are a number of IPC changes and features which the Rust implementation 
has fallen behind on. It's effectively using the legacy format that was 
released in 0.14.x.

Some that I encountered are:
 * change padding from 4 bytes to 8 bytes (along with the padding algorithm)
 * add an IPC writer option to support the legacy format and updated format
 * add error handling for the different metadata versions; we should support v4+, so it's an oversight not to explicitly return errors if unsupported versions are read

Some of the work already has Jiras open (e.g. body compression), I'll find them 
and mark them as related to this.

I'm tight for spare time, but I'll try to work on this before the next release (along with the Parquet writer).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9728) [Rust] [Parquet] Compute nested spacing

2020-08-13 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9728:
-

 Summary: [Rust] [Parquet] Compute nested spacing
 Key: ARROW-9728
 URL: https://issues.apache.org/jira/browse/ARROW-9728
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 1.0.0
Reporter: Neville Dipale


When computing definition levels for deeply nested arrays that include lists, 
the definition levels are correctly calculated, but they are not translated 
into correct indexes for the eventual primitive arrays.

For example, an int32 array could have no null values, but be a child of a list that has null values. If, say, the first 5 values of the int32 array are members of the first list item (i.e. list_array[0] = [1,2,3,4,5]), and that list is itself a child of a struct whose index is null, all 5 values of the int32 array *should* be skipped. Further, the list's definition and repetition levels will be represented by 1 slot instead of the 5.

The current logic cannot cater for this, and potentially results in slicing the 
int32 array incorrectly (sometimes including some of those first 5 values).

This Jira is for the work necessary to compute the index into the eventual leaf 
arrays correctly.

I started doing it as part of the initial writer PR, but it's complex and is 
blocking progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9413) [Rust] Fix clippy lint on master

2020-07-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9413:
-

 Summary: [Rust] Fix clippy lint on master
 Key: ARROW-9413
 URL: https://issues.apache.org/jira/browse/ARROW-9413
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 0.17.1
Reporter: Neville Dipale


There was a clippy lint error with the float sort PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9411) [Rust] Update dependencies

2020-07-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9411:
-

 Summary: [Rust] Update dependencies
 Key: ARROW-9411
 URL: https://issues.apache.org/jira/browse/ARROW-9411
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Affects Versions: 0.17.1
Reporter: Neville Dipale


Update dependencies like tonic and rand (to reduce total dependencies)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9408) [Integration] Tests do not run in Windows due to numpy 64-bit errors

2020-07-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9408:
-

 Summary: [Integration] Tests do not run in Windows due to numpy 
64-bit errors
 Key: ARROW-9408
 URL: https://issues.apache.org/jira/browse/ARROW-9408
 Project: Apache Arrow
  Issue Type: Bug
  Components: Integration
Affects Versions: 0.17.1
Reporter: Neville Dipale


We found that the integer range check when generating integration data doesn't work on Windows, because the default C integers that numpy uses there are 32-bit.

This fixes that issue by forcing 64-bit integers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9292) [Rust] Update feature matrix with passing tests

2020-07-01 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9292:
-

 Summary: [Rust] Update feature matrix with passing tests
 Key: ARROW-9292
 URL: https://issues.apache.org/jira/browse/ARROW-9292
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 0.17.0
Reporter: Neville Dipale


When we created the feature matrix, I preemptively populated the Rust column 
with supported features. We've subsequently been having trouble with 
integration tests.

This blocker is so that I can update the feature matrix before 1.0.0 release 
based on which tests are passing by then.

CC [~wesm] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9274) [Rust] Builds failing due to IPC test failures

2020-06-30 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9274:
-

 Summary: [Rust] Builds failing due to IPC test failures
 Key: ARROW-9274
 URL: https://issues.apache.org/jira/browse/ARROW-9274
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Neville Dipale


I just saw this after merging 2 PRs; I'm investigating the cause of the failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9082) [Rust] Stream reader fails when stream not ended with (optional) 0xFFFFFFFF 0x00000000

2020-06-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9082:
--
Component/s: Rust

> [Rust] Stream reader fails when stream not ended with (optional) 0xFFFFFFFF 0x00000000
> 
>
> Key: ARROW-9082
> URL: https://issues.apache.org/jira/browse/ARROW-9082
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.17.1
>Reporter: Eyal Leshem
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> according to the spec: 
> [https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format], 
> the 0xFFFFFFFF 0x00000000 marker is optional in the arrow response stream, but 
> currently when a client receives such a response it reads all the batches well, 
> but returns an error at the end (instead of Ok(None))



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9082) [Rust] Stream reader fails when stream not ended with (optional) 0xFFFFFFFF 0x00000000

2020-06-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9082.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7384
[https://github.com/apache/arrow/pull/7384]

> [Rust] Stream reader fails when stream not ended with (optional) 0xFFFFFFFF 0x00000000
> 
>
> Key: ARROW-9082
> URL: https://issues.apache.org/jira/browse/ARROW-9082
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.1
>Reporter: Eyal Leshem
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> according to the spec: 
> [https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format], 
> the 0xFFFFFFFF 0x00000000 marker is optional in the arrow response stream, but 
> currently when a client receives such a response it reads all the batches well, 
> but returns an error at the end (instead of Ok(None))



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9062) [Rust] Support to read JSON into dictionary type

2020-06-12 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9062.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7379
[https://github.com/apache/arrow/pull/7379]

> [Rust] Support to read JSON into dictionary type
> 
>
> Key: ARROW-9062
> URL: https://issues.apache.org/jira/browse/ARROW-9062
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Sven Wagner-Boysen
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently a JSON reader built from a schema using the dictionary type for one 
> of the fields in the schema will fail with JsonError("struct types are not 
> yet supported")
> {code:java}
> let builder = ReaderBuilder::new().with_schema(..);
> let mut reader: Reader<File> =
>     builder.build::<File>(File::open(path).unwrap()).unwrap();
> let rb = reader.next().unwrap();
> {code}
>  
> Suggested solution:
> Support reading into a dictionary in Json Reader: 
> [https://github.com/apache/arrow/blob/master/rust/arrow/src/json/reader.rs#L368]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9088) [Rust] Recent version of arrow crate does not compile into wasm target

2020-06-10 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132699#comment-17132699
 ] 

Neville Dipale commented on ARROW-9088:
---

I've made prettytable-rs optional, so once the PR is merged, you should be able 
to turn it off. I forgot that we removed libc at some point, so it didn't dawn 
on me that we can now compile arrow to wasm.

> [Rust] Recent version of arrow crate does not compile into wasm target
> --
>
> Key: ARROW-9088
> URL: https://issues.apache.org/jira/browse/ARROW-9088
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Sergey Todyshev
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> arrow 0.16 compiles successfully into wasm32-unknown-unknown, but recent git 
> version does not. it would be nice to fix that.
> compiler errors:
>  
> {noformat}
> error[E0433]: failed to resolve: could not find `unix` in `os`
> --> 
> /home/regl/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18
>  |
>   41 | use std::os::unix::ffi::OsStringExt;
>  |   could not find `unix` in `os`
>   
>   error[E0432]: unresolved import `unix`
>--> 
> /home/regl/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5
> |
>   6 | use unix;
> |  no `unix` in the root{noformat}
> The problem is that the prettytable-rs dependency depends on term->dirs, which 
> causes this error.
> Consider making prettytable-rs a dev dependency.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9088) [Rust] Recent version of arrow crate does not compile into wasm target

2020-06-10 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9088:
-

Assignee: Neville Dipale

> [Rust] Recent version of arrow crate does not compile into wasm target
> --
>
> Key: ARROW-9088
> URL: https://issues.apache.org/jira/browse/ARROW-9088
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Sergey Todyshev
>Assignee: Neville Dipale
>Priority: Major
>
> arrow 0.16 compiles successfully into wasm32-unknown-unknown, but recent git 
> version does not. it would be nice to fix that.
> compiler errors:
>  
> {noformat}
> error[E0433]: failed to resolve: could not find `unix` in `os`
> --> 
> /home/regl/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18
>  |
>   41 | use std::os::unix::ffi::OsStringExt;
>  |   could not find `unix` in `os`
>   
>   error[E0432]: unresolved import `unix`
>--> 
> /home/regl/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5
> |
>   6 | use unix;
> |  no `unix` in the root{noformat}
> The problem is that the prettytable-rs dependency depends on term->dirs, which 
> causes this error.
> Consider making prettytable-rs a dev dependency.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9095) [Rust] Fix NullArray to comply with spec

2020-06-10 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9095:
-

 Summary: [Rust] Fix NullArray to comply with spec
 Key: ARROW-9095
 URL: https://issues.apache.org/jira/browse/ARROW-9095
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 0.17.0
Reporter: Neville Dipale


When I implemented the NullArray, I didn't comply with the spec under the 
premise that I'd handle reading and writing IPC in a spec-compliant way as that 
looked like the easier approach.

After some integration testing, I realised that I wasn't doing it correctly, so 
it's better to comply with the spec by not allocating any buffers for the array.
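A minimal sketch of what spec compliance looks like from the user's side, assuming 
the existing NullArray::new constructor; the assertions describe the intended 
layout (length only, zero buffers) rather than the current behaviour:

{code:rust}
use arrow::array::{Array, NullArray};

fn main() {
    // A Null array carries only a length: no validity bitmap and no value
    // buffers should be allocated for it.
    let nulls = NullArray::new(3);
    assert_eq!(nulls.len(), 3);
    assert_eq!(nulls.data().buffers().len(), 0);
}
{code}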



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-3089) [Rust] Add ArrayBuilder for different Arrow arrays

2020-06-09 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-3089.
---
  Assignee: Neville Dipale
Resolution: Implemented

The remaining task was completed

> [Rust] Add ArrayBuilder for different Arrow arrays
> --
>
> Key: ARROW-3089
> URL: https://issues.apache.org/jira/browse/ARROW-3089
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Chao Sun
>Assignee: Neville Dipale
>Priority: Major
> Fix For: 1.0.0
>
>
> Similar to the CPP version, we should add `ArrayBuilder` for different kinds 
> of Arrow arrays. This provides a convenient way to incrementally build Arrow 
> arrays.
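A short usage sketch, assuming the primitive builder API of the time (append 
methods returning Result); exact signatures have shifted between releases:

{code:rust}
use arrow::array::{Array, Int32Builder};

fn main() {
    // Incrementally build an Int32 array with a null in the middle.
    let mut builder = Int32Builder::new(3); // capacity hint
    builder.append_value(1).unwrap();
    builder.append_null().unwrap();
    builder.append_value(3).unwrap();
    let array = builder.finish();
    assert_eq!(array.len(), 3);
    assert_eq!(array.null_count(), 1);
}
{code}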



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9007) [Rust] Support appending arrays by merging array data

2020-06-08 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9007.
---
Resolution: Fixed

Issue resolved by pull request 7365
[https://github.com/apache/arrow/pull/7365]

> [Rust] Support appending arrays by merging array data
> -
>
> Key: ARROW-9007
> URL: https://issues.apache.org/jira/browse/ARROW-9007
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> ARROW-9005 introduces a concat kernel which allows for concatenating multiple 
> arrays of the same type into a single array. This is useful for sorting on 
> multiple arrays, among other things.
> The concat kernel is implemented for most array types, but not yet for nested 
> arrays (lists, structs, etc).
> This Jira is for creating a way of appending/merging all array types, so that 
> concat (and functionality that depends on it) can support all array types.
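A hedged usage sketch of the concat kernel this builds on; the exact signature 
(slice of ArrayRef vs. slice of &dyn Array) has varied between releases, so treat 
the call shape below as an assumption:

{code:rust}
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::kernels::concat::concat;

fn main() {
    // Concatenate two arrays of the same type into a single array.
    let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2]));
    let b: ArrayRef = Arc::new(Int32Array::from(vec![3, 4]));
    let combined = concat(&[a, b]).unwrap();
    assert_eq!(combined.len(), 4);
}
{code}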



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3089) [Rust] Add ArrayBuilder for different Arrow arrays

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-3089:
--
Fix Version/s: 1.0.0

> [Rust] Add ArrayBuilder for different Arrow arrays
> --
>
> Key: ARROW-3089
> URL: https://issues.apache.org/jira/browse/ARROW-3089
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Chao Sun
>Priority: Major
> Fix For: 1.0.0
>
>
> Similar to the CPP version, we should add `ArrayBuilder` for different kinds 
> of Arrow arrays. This provides a convenient way to incrementally build Arrow 
> arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8170) [Rust] [Parquet] Allow Position to support arbitrary Cursor type

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8170:
--
Summary: [Rust] [Parquet] Allow Position to support arbitrary Cursor type  
(was: [Rust] Allow Position to support arbitrary Cursor type)

> [Rust] [Parquet] Allow Position to support arbitrary Cursor type
> 
>
> Key: ARROW-8170
> URL: https://issues.apache.org/jira/browse/ARROW-8170
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.16.0
>Reporter: Jeong, Heon
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Hi, I'm currently writing an in-memory page writer in order to support a buffered 
> row group writer (just like in the C++ version), and...
>  * I'd like to reuse SerializedPageWriter
>  * SerializedPageWriter requires that the sink supports util::Position (which is 
> private)
>  * There's a Position impl for Cursor, but it unnecessarily restricts the internal 
> type to mutable references.
> So I'd like to make a one-line change to lift that type restriction 
> and allow my implementation.
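A hypothetical illustration of the lifted restriction (the real util::Position 
trait is private to the parquet crate, so the trait below is a stand-in): the 
impl covers any Cursor<T> rather than only a cursor over a mutable reference to 
the inner buffer:

{code:rust}
use std::io::Cursor;

// Stand-in for parquet's private util::Position trait.
trait Position {
    fn pos(&self) -> u64;
}

// Lift the restriction: any inner type the Cursor can wrap, not just &mut Vec<u8>.
impl<T> Position for Cursor<T> {
    fn pos(&self) -> u64 {
        self.position()
    }
}

fn main() {
    let owned = Cursor::new(vec![0u8; 16]);
    assert_eq!(owned.pos(), 0);
}
{code}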



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9053) [Rust] Add sort for lists and structs

2020-06-06 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-9053:
-

 Summary: [Rust] Add sort for lists and structs
 Key: ARROW-9053
 URL: https://issues.apache.org/jira/browse/ARROW-9053
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7252) [Rust] [Parquet] Reading UTF-8/JSON/ENUM field results in a lot of vec allocation

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-7252:
--
Summary: [Rust] [Parquet] Reading UTF-8/JSON/ENUM field results in a lot of 
vec allocation  (was: [Rust] Reading UTF-8/JSON/ENUM field results in a lot of 
vec allocation)

> [Rust] [Parquet] Reading UTF-8/JSON/ENUM field results in a lot of vec 
> allocation
> -
>
> Key: ARROW-7252
> URL: https://issues.apache.org/jira/browse/ARROW-7252
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Wong Shek Hei
>Priority: Minor
>
> Reading a very large parquet file with basically all string fields was 
> very slow (430MB gzipped). After profiling with OSX Instruments, I noticed 
> that a lot of time is spent in "convert_byte_array", in particular 
> "reserving" and allocating Vec::with_capacity, which is done before 
> String::from_utf8_unchecked.
> It seems like using String as the underlying storage is causing this (String 
> uses Vec<u8> for its underlying storage); this also requires copying from 
> slice to vec.
> "Field::Str" is part of a pub enum, so I am not sure how "refactorable" the 
> String part is; for example, converting it into a &[u8] (we can perhaps then defer 
> the conversion from &[u8] to Vec<u8> until the user really needs a String).
> But of course, changing it to &[u8] can result in quite a bit of interface 
> changes... So I am wondering if there are already some plans or a solution on 
> the way to improve the handling of the "Field::Str" case?
>  
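A hypothetical sketch of the "defer the conversion" idea: keep the raw Parquet 
bytes and only materialise a String when the caller asks for text. This is not 
the current parquet Field API, just an illustration of the shape such a change 
could take:

{code:rust}
// Hypothetical lazily-decoded string value.
enum LazyStr {
    Bytes(Vec<u8>),
    Text(String),
}

impl LazyStr {
    // Convert to &str only when the caller actually needs text.
    fn as_str(&mut self) -> &str {
        if let LazyStr::Bytes(bytes) = self {
            let taken = std::mem::take(bytes);
            *self = LazyStr::Text(String::from_utf8(taken).expect("valid UTF-8"));
        }
        match self {
            LazyStr::Text(t) => t,
            LazyStr::Bytes(_) => unreachable!(),
        }
    }
}

fn main() {
    let mut v = LazyStr::Bytes(b"hello".to_vec());
    assert_eq!(v.as_str(), "hello");
}
{code}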



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5153) [Rust] [Parquet] Use IntoIter trait for write_batch/write_mini_batch

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-5153:
--
Summary: [Rust] [Parquet] Use IntoIter trait for 
write_batch/write_mini_batch  (was: [Rust] Use IntoIter trait for 
write_batch/write_mini_batch)

> [Rust] [Parquet] Use IntoIter trait for write_batch/write_mini_batch
> 
>
> Key: ARROW-5153
> URL: https://issues.apache.org/jira/browse/ARROW-5153
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Xavier Lange
>Priority: Major
>
> Writing data to a parquet file requires a lot of copying and intermediate Vec 
> creation. Take a record struct like:
> {{struct MyData {}}{{  name: String,}}{{  address: Option<String>}}{{}}}
> Over the course of working with sets of this data, you'll have the bulk data in a 
> Vec<MyData>, the names column in a Vec<String>, and the address column in a 
> Vec<Option<String>>. This puts extra memory pressure on the system; at the 
> minimum we have to allocate a Vec the same size as the bulk data even if we 
> are using references.
> What I'm proposing is to use an IntoIter style. This will maintain backward 
> compat as a slice automatically implements IntoIter. 
> ColumnWriterImpl#write_batch would go from "values: &[T::T]" to values that 
> implement IntoIterator. Then you can do things like
> {{  write_batch(bulk.iter().map(|x| x.name), None, None)}}{{  
> write_batch(bulk.iter().map(|x| x.address), Some(bulk.iter().map(|x| 
> x.is_some())), None)}}
> and you can see there's no need for an intermediate Vec, so no short-term 
> allocations to write out the data.
> I am writing data with many columns and I think this would really help to 
> speed things up.
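A hypothetical sketch of the proposed signature change, not the actual parquet 
API: write_batch accepts any IntoIterator of values so callers can stream column 
values without first materialising a Vec (def/rep level handling is elided):

{code:rust}
// Illustrative stand-in for ColumnWriterImpl::write_batch.
fn write_batch<T, I>(values: I, def_levels: Option<&[i16]>, rep_levels: Option<&[i16]>)
where
    I: IntoIterator<Item = T>,
{
    for _value in values {
        // Encode each value as it is produced; no intermediate Vec is needed.
    }
    let _ = (def_levels, rep_levels);
}

fn main() {
    struct MyData {
        name: String,
    }
    let bulk = vec![MyData { name: "a".into() }, MyData { name: "b".into() }];
    // The names column is streamed straight out of the bulk data.
    write_batch(bulk.iter().map(|x| x.name.as_str()), None, None);
}
{code}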



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4927) [Rust] Update top level README to describe current functionality

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-4927:
--
Fix Version/s: 1.0.0

> [Rust] Update top level README to describe current functionality
> 
>
> Key: ARROW-4927
> URL: https://issues.apache.org/jira/browse/ARROW-4927
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.12.0
>Reporter: Andy Grove
>Priority: Minor
> Fix For: 1.0.0
>
>
> Update top level Rust README to reflect new functionality, such as SIMD, 
> cast, date/time, DataFusion, etc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4465) [Rust] [DataFusion] Add support for ORDER BY

2020-06-06 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127239#comment-17127239
 ] 

Neville Dipale commented on ARROW-4465:
---

[~andygrove] [~houqp] does ARROW-9005 completely cover this?

> [Rust] [DataFusion] Add support for ORDER BY
> 
>
> Key: ARROW-4465
> URL: https://issues.apache.org/jira/browse/ARROW-4465
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to specify an ORDER BY clause on my query.
> Work involved:
>  * Add OrderBy to LogicalPlan enum
>  * Write query planner code to translate SQL AST to OrderBy (SQL parser that 
> we use already supports parsing ORDER BY)
>  * Implement SortRelation
> My high level thoughts on implementing the SortRelation:
>  * Create an Arrow array of uint32 the same size as the batch and populate it such 
> that each element contains its own index, i.e. the array will be 0, 1, 2, 3
>  * Find a Rust crate for sorting that allows us to provide our own comparison 
> lambda
>  * Implement the comparison logic (probably can reuse existing execution code 
> - see filter.rs for how it implements comparison expressions)
>  * Use index array to store the result of the sort i.e. no need to rewrite 
> the whole batch, just the index
>  * Rewrite the batch after the sort has completed
> It would also be good to see how Gandiva has implemented this
>  
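A minimal sketch of the index-permutation idea above, using plain std sorting over 
a single u32 column; a real SortRelation would compare Arrow arrays via expression 
code and then reorder the batch with a take-style gather:

{code:rust}
// Sort a column indirectly: sort indices, not the values themselves.
fn sort_indices(values: &[u32]) -> Vec<u32> {
    // 1. Index array 0, 1, 2, ... the same length as the batch.
    let mut indices: Vec<u32> = (0..values.len() as u32).collect();
    // 2. Sort the indices with a comparison over the underlying values.
    indices.sort_by_key(|&i| values[i as usize]);
    // 3. The batch is rewritten afterwards by gathering rows in this order.
    indices
}

fn main() {
    let column = vec![30u32, 10, 20];
    assert_eq!(sort_indices(&column), vec![1, 2, 0]);
}
{code}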



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6189) [Rust] [Parquet] Plain encoded boolean column chunks limited to 2048 values

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-6189:
--
Summary: [Rust] [Parquet] Plain encoded boolean column chunks limited to 
2048 values  (was: [Rust] Plain encoded boolean column chunks limited to 2048 
values)

> [Rust] [Parquet] Plain encoded boolean column chunks limited to 2048 values
> ---
>
> Key: ARROW-6189
> URL: https://issues.apache.org/jira/browse/ARROW-6189
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 0.14.1
>Reporter: Simon Jones
>Priority: Major
>
> encoding::PlainEncoder::new creates a BitWriter with 256 bytes of storage, 
> which limits the data page size that can be used. 
> I suggest that in
> {{impl Encoder for PlainEncoder}}
> the return value of put_value is tested and the BitWriter flushed+cleared 
> whenever it runs out of space.
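A self-contained toy illustrating the suggested pattern (this is not the parquet 
BitWriter API; the real fix would test put_value on the existing writer): when the 
fixed 256-byte buffer fills up, flush it to the output and retry rather than 
dropping values, so plain-encoded boolean pages are no longer capped at 2048 values:

{code:rust}
struct TinyBitWriter {
    bytes: Vec<u8>,
    capacity: usize,
    bit_len: usize,
}

impl TinyBitWriter {
    fn new(capacity: usize) -> Self {
        Self { bytes: vec![0; capacity], capacity, bit_len: 0 }
    }
    // Returns false when the buffer is full, signalling the caller to flush.
    fn put_bool(&mut self, v: bool) -> bool {
        if self.bit_len >= self.capacity * 8 {
            return false;
        }
        if v {
            self.bytes[self.bit_len / 8] |= 1 << (self.bit_len % 8);
        }
        self.bit_len += 1;
        true
    }
    // Emit the used bytes and clear the writer for reuse.
    fn flush(&mut self) -> Vec<u8> {
        let out = self.bytes[..(self.bit_len + 7) / 8].to_vec();
        self.bytes.iter_mut().for_each(|b| *b = 0);
        self.bit_len = 0;
        out
    }
}

fn encode_bools(values: &[bool]) -> Vec<u8> {
    let mut writer = TinyBitWriter::new(256);
    let mut sink = Vec::new();
    for &v in values {
        // Test the return value and flush+clear whenever the writer runs out of space.
        if !writer.put_bool(v) {
            sink.extend(writer.flush());
            assert!(writer.put_bool(v));
        }
    }
    sink.extend(writer.flush());
    sink
}

fn main() {
    // 5000 values encode fully instead of being truncated at 2048.
    let encoded = encode_bools(&vec![true; 5000]);
    assert_eq!(encoded.len(), (5000 + 7) / 8);
}
{code}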



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8993) [Rust] Support reading non-seekable sources in text readers

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8993:
--
Fix Version/s: 1.0.0

> [Rust] Support reading non-seekable sources in text readers
> ---
>
> Key: ARROW-8993
> URL: https://issues.apache.org/jira/browse/ARROW-8993
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Mohamed Zenadi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> It would be interesting to be able to read already-compressed json files. 
> This is regularly used, with many storing their files as json.gz (we do 
> the same).
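A hedged sketch of reading a gzipped JSON file once the seek requirement is 
lifted, assuming the reader accepts any buffered Read source; flate2's GzDecoder 
is an external-crate assumption and the builder call shape may differ by version:

{code:rust}
use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::ReaderBuilder;
use flate2::read::GzDecoder;

fn main() {
    // GzDecoder implements Read but not Seek, which is exactly the kind of
    // non-seekable source this issue asks the text readers to accept.
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, true)]));
    let gz = GzDecoder::new(File::open("data.json.gz").unwrap());
    let mut reader = ReaderBuilder::new()
        .with_schema(schema)
        .build(BufReader::new(gz))
        .unwrap();
    let _batch = reader.next().unwrap();
}
{code}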



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-9047) [Rust] Setting 0-bits of a 0-length bitset segfaults

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-9047.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7360
[https://github.com/apache/arrow/pull/7360]

> [Rust] Setting 0-bits of a 0-length bitset segfaults
> 
>
> Key: ARROW-9047
> URL: https://issues.apache.org/jira/browse/ARROW-9047
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Max Burke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See PR for details



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9047) [Rust] Setting 0-bits of a 0-length bitset segfaults

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9047:
--
Component/s: Rust

> [Rust] Setting 0-bits of a 0-length bitset segfaults
> 
>
> Key: ARROW-9047
> URL: https://issues.apache.org/jira/browse/ARROW-9047
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Max Burke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See PR for details



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8723) [Rust] Remove SIMD specific benchmark code

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-8723.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7359
[https://github.com/apache/arrow/pull/7359]

> [Rust] Remove SIMD specific benchmark code
> --
>
> Key: ARROW-8723
> URL: https://issues.apache.org/jira/browse/ARROW-8723
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now that SIMD is behind a feature flag it's trivial to compare SIMD vs 
> non-SIMD and the SIMD versions of benchmarks can be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8723) [Rust] Remove SIMD specific benchmark code

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8723:
--
Affects Version/s: 0.17.0

> [Rust] Remove SIMD specific benchmark code
> --
>
> Key: ARROW-8723
> URL: https://issues.apache.org/jira/browse/ARROW-8723
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Now that SIMD is behind a feature flag it's trivial to compare SIMD vs 
> non-SIMD and the SIMD versions of benchmarks can be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9047) [Rust] Setting 0-bits of a 0-length bitset segfaults

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9047:
--
Affects Version/s: 0.17.0

> [Rust] Setting 0-bits of a 0-length bitset segfaults
> 
>
> Key: ARROW-9047
> URL: https://issues.apache.org/jira/browse/ARROW-9047
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.17.0
>Reporter: Max Burke
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See PR for details



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9007) [Rust] Support appending arrays by merging array data

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-9007:
--
Fix Version/s: 1.0.0

> [Rust] Support appending arrays by merging array data
> -
>
> Key: ARROW-9007
> URL: https://issues.apache.org/jira/browse/ARROW-9007
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ARROW-9005 introduces a concat kernel which allows for concatenating multiple 
> arrays of the same type into a single array. This is useful for sorting on 
> multiple arrays, among other things.
> The concat kernel is implemented for most array types, but not yet for nested 
> arrays (lists, structs, etc).
> This Jira is for creating a way of appending/merging all array types, so that 
> concat (and functionality that depends on it) can support all array types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-9007) [Rust] Support appending arrays by merging array data

2020-06-06 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-9007:
-

Assignee: Neville Dipale

> [Rust] Support appending arrays by merging array data
> -
>
> Key: ARROW-9007
> URL: https://issues.apache.org/jira/browse/ARROW-9007
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 0.17.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ARROW-9005 introduces a concat kernel which allows for concatenating multiple 
> arrays of the same type into a single array. This is useful for sorting on 
> multiple arrays, among other things.
> The concat kernel is implemented for most array types, but not yet for nested 
> arrays (lists, structs, etc).
> This Jira is for creating a way of appending/merging all array types, so that 
> concat (and functionality that depends on it) can support all array types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8993) [Rust] Support reading non-seekable sources in text readers

2020-06-05 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-8993:
--
Summary: [Rust] Support reading non-seekable sources in text readers  (was: 
[Rust] Support gzipped json files)

> [Rust] Support reading non-seekable sources in text readers
> ---
>
> Key: ARROW-8993
> URL: https://issues.apache.org/jira/browse/ARROW-8993
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Mohamed Zenadi
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> It would be interesting to be able to read already-compressed json files. 
> This is regularly used, with many storing their files as json.gz (we do 
> the same).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8906) [Rust] Support reading multiple CSV files for schema inference

2020-06-03 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-8906.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7252
[https://github.com/apache/arrow/pull/7252]

> [Rust] Support reading multiple CSV files for schema inference
> --
>
> Key: ARROW-8906
> URL: https://issues.apache.org/jira/browse/ARROW-8906
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: QP Hou
>Assignee: QP Hou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8907) [Rust] implement scalar comparison operations

2020-06-03 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124921#comment-17124921
 ] 

Neville Dipale commented on ARROW-8907:
---

I don't have permission on Jira to assign this to [~yordan-pavlov]

> [Rust] implement scalar comparison operations
> -
>
> Key: ARROW-8907
> URL: https://issues.apache.org/jira/browse/ARROW-8907
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Yordan Pavlov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently comparing an array to a scalar / literal value using the comparison 
> operations defined in the comparison kernel here:
> https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs
> is very inefficient because:
> (1) an array with the scalar value repeated has to be created, taking time 
> and wasting memory
> (2) time is spent during comparison to load the same literal values over and 
> over
> Initial benchmarking of a specialized scalar comparison function indicates 
> good performance gains:
> eq Float32 time: [938.54 us 950.28 us 962.65 us]
> eq scalar Float32 time: [836.47 us 838.47 us 840.78 us]
> eq Float32 simd time: [75.836 us 76.389 us 77.185 us]
> eq scalar Float32 simd time: [61.551 us 61.605 us 61.671 us]
> The benchmark results above show that the scalar comparison function is about 
> 12% faster for non-SIMD and about 20% faster for SIMD comparison operations.
> And this is before accounting for creating the literal array. 
> In a more complex benchmark, the scalar comparison version is about 40% 
> faster overall when we account for not having to create arrays of scalar / 
> literal values.
> Here are the benchmark results:
> filter/filter with arrow SIMD (array) time: [647.77 us 675.12 us 706.69 us]
> filter/filter with arrow SIMD (scalar) time: [402.19 us 404.23 us 407.22 us]
> And here is the code for the benchmark:
> https://github.com/yordan-pavlov/arrow-benchmark/blob/master/rust/arrow_benchmark/src/main.rs#L230
> My only concern is that I can't see an easy way to use scalar comparison 
> operations in Data Fusion as it is currently designed to only work on arrays.
> [~paddyhoran] [~andygrove]  let me know what you think, would there be value 
> in implementing scalar comparison operations?
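A hedged usage sketch of what the scalar kernels enable, assuming they land with 
eq/eq_scalar signatures roughly like the ones below (names per the comparison 
kernel module; exact paths and signatures may differ by release):

{code:rust}
use arrow::array::{Array, Float32Array};
use arrow::compute::kernels::comparison::{eq, eq_scalar};

fn main() {
    let values = Float32Array::from(vec![1.0_f32, 2.0, 3.0]);

    // Array-vs-array: the literal must first be materialised into a second
    // array of the same length.
    let literal = Float32Array::from(vec![2.0_f32; values.len()]);
    let a = eq(&values, &literal).unwrap();

    // Scalar comparison: no literal array is allocated at all.
    let b = eq_scalar(&values, 2.0_f32).unwrap();

    assert_eq!(a.value(1), true);
    assert_eq!(b.value(1), true);
}
{code}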



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8907) [Rust] implement scalar comparison operations

2020-06-03 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-8907.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7261
[https://github.com/apache/arrow/pull/7261]

> [Rust] implement scalar comparison operations
> -
>
> Key: ARROW-8907
> URL: https://issues.apache.org/jira/browse/ARROW-8907
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Yordan Pavlov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently comparing an array to a scalar / literal value using the comparison 
> operations defined in the comparison kernel here:
> https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs
> is very inefficient because:
> (1) an array with the scalar value repeated has to be created, taking time 
> and wasting memory
> (2) time is spent during comparison to load the same literal values over and 
> over
> Initial benchmarking of a specialized scalar comparison function indicates 
> good performance gains:
> eq Float32 time: [938.54 us 950.28 us 962.65 us]
> eq scalar Float32 time: [836.47 us 838.47 us 840.78 us]
> eq Float32 simd time: [75.836 us 76.389 us 77.185 us]
> eq scalar Float32 simd time: [61.551 us 61.605 us 61.671 us]
> The benchmark results above show that the scalar comparison function is about 
> 12% faster for non-SIMD and about 20% faster for SIMD comparison operations.
> And this is before accounting for creating the literal array. 
> In a more complex benchmark, the scalar comparison version is about 40% 
> faster overall when we account for not having to create arrays of scalar / 
> literal values.
> Here are the benchmark results:
> filter/filter with arrow SIMD (array) time: [647.77 us 675.12 us 706.69 us]
> filter/filter with arrow SIMD (scalar) time: [402.19 us 404.23 us 407.22 us]
> And here is the code for the benchmark:
> https://github.com/yordan-pavlov/arrow-benchmark/blob/master/rust/arrow_benchmark/src/main.rs#L230
> My only concern is that I can't see an easy way to use scalar comparison 
> operations in Data Fusion as it is currently designed to only work on arrays.
> [~paddyhoran] [~andygrove]  let me know what you think, would there be value 
> in implementing scalar comparison operations?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

