[jira] [Commented] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-12 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212683#comment-17212683
 ] 

Andy Grove commented on ARROW-10226:


I did get to the bottom of why this happened for me. When I converted the TPC-H 
CSV data to Parquet, I accidentally combined all of the tables when I intended 
to convert only lineitem. As a result, my lineitem Parquet files were a 
combination of all the tables, with varying schemas.
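
A check along the following lines could catch this kind of mix-up before the CSV data is converted. It is only a minimal sketch using the standard library, not part of Arrow or DataFusion: `rows_with_wrong_arity` and `LINEITEM_FIELDS` are hypothetical names, the split is naive (it ignores quoting), and the sample rows are synthetic.

```rust
/// Number of columns in the TPC-H `lineitem` table.
const LINEITEM_FIELDS: usize = 16;

/// Returns the 1-based line numbers of non-empty rows whose field count
/// differs from `expected`. Hypothetical helper; naive split, no quoting.
fn rows_with_wrong_arity(data: &str, delimiter: char, expected: usize) -> Vec<usize> {
    data.lines()
        .enumerate()
        .filter(|(_, line)| !line.is_empty())
        .filter(|(_, line)| line.split(delimiter).count() != expected)
        .map(|(i, _)| i + 1)
        .collect()
}

fn main() {
    // Synthetic rows: two with 16 fields and one with 9 fields
    // (9 is the width of the TPC-H `orders` table).
    let good = vec!["x"; LINEITEM_FIELDS].join(",");
    let bad = vec!["x"; 9].join(",");
    let data = format!("{}\n{}\n{}\n", good, bad, good);

    let mismatched = rows_with_wrong_arity(&data, ',', LINEITEM_FIELDS);
    println!("rows with unexpected field count: {:?}", mismatched);
}
```

Rows from another table show up as line numbers in the returned vector, so an empty result is a cheap signal that the input at least has a uniform width.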

> [Rust] [Parquet] Parquet reader reading wrong columns in some batches within 
> a parquet file
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS)  and 
> when I try and run the TPC-H benchmark, it never completes and eventually 
> uses up all 64 GB RAM.
> I can run Spark against the data  set and the query completes in 24 seconds, 
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also 
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show 
> the same issue so I don't think this is related to a recent code change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-10 Thread Josh Taylor (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211833#comment-17211833
 ] 

Josh Taylor commented on ARROW-10226:

I'm seeing the issue described by this ticket's initial title: the query never 
completes.

Test file: 
[https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing]

(This is from Snowflake's example data, exported as a single Parquet file; the 
same thing happens with many files.)

Code that fails (neither the group by with a sum of columns nor the builder 
pattern works):

https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs


[jira] [Commented] (ARROW-10226) [Rust] [Parquet] Parquet reader reading wrong columns in some batches within a parquet file

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210358#comment-17210358
 ] 

Andy Grove commented on ARROW-10226:


Here is a test case to reproduce the issue. I uploaded the Parquet file to 
Dropbox. It is ~100MB.

[https://www.dropbox.com/s/6cpz1h9juxl4c7t/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet?dl=0]

[~jorgecarleitao] Thanks for the offer of help. I don't know how much time we 
should spend on this, but if you have the time to take a look, at least to 
confirm that the test also fails for you, that would be an extra data point. 
{code:java}
#[test]
fn foo() {
    // Rc, SerializedFileReader and ParquetFileArrowReader are assumed to be
    // in scope in the enclosing test module.
    use crate::arrow::arrow_reader::ArrowReader;
    use arrow::array::{Array, StringArray};

    let file = std::fs::File::open(
        "/mnt/tpch/debug/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet",
    )
    .unwrap();
    let file_reader = Rc::new(SerializedFileReader::new(file).unwrap());
    let metadata = file_reader
        .metadata
        .file_metadata()
        .key_value_metadata()
        .as_ref()
        .unwrap();

    let mut arrow_reader = ParquetFileArrowReader::new(file_reader);
    let schema = arrow_reader.get_schema().unwrap();
    let projection = vec![4, 5, 6, 7, 8, 9, 10];
    let mut batch_reader =
        arrow_reader.get_record_reader_by_columns(projection, 40960).unwrap();

    while let Some(batch) = batch_reader.next() {
        let batch = batch.unwrap();

        // Count l_returnflag values longer than one character.
        let mut n = 0;
        // The type parameter was lost from the original message;
        // StringArray is assumed from the "not a string" message below.
        match batch.column(4).as_any().downcast_ref::<StringArray>() {
            Some(l_returnflag) => {
                for i in 0..batch.num_rows() {
                    if l_returnflag.is_valid(i) && l_returnflag.value(i).len() > 1 {
                        n += 1;
                    }
                }
            }
            None => println!("l_returnflag is not a string"),
        }
        println!("{} bad values in batch", n);
        assert_eq!(n, 0);
    }
}
 {code}
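
For readers unfamiliar with the projection in the test above, the sketch below shows why `batch.column(4)` is expected to be `l_returnflag`. It assumes the reader returns the projected columns in the order they are listed; `source_column` is a hypothetical helper, not an Arrow API, and the column names follow the TPC-H `lineitem` schema.

```rust
/// Column names of the TPC-H `lineitem` table, in schema order.
const LINEITEM_COLUMNS: [&str; 16] = [
    "l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
    "l_quantity", "l_extendedprice", "l_discount", "l_tax",
    "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate",
    "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment",
];

/// Maps a position in a projected batch back to the column index in the file.
/// Assumption: projected columns keep the order given in `projection`.
fn source_column(projection: &[usize], projected_idx: usize) -> Option<usize> {
    projection.get(projected_idx).copied()
}

fn main() {
    // Same projection as in the test case.
    let projection = vec![4, 5, 6, 7, 8, 9, 10];
    let src = source_column(&projection, 4).expect("index within projection");
    println!("batch.column(4) -> file column {} ({})", src, LINEITEM_COLUMNS[src]);
}
```

Under that assumption, position 4 of the projected batch maps to file column 8, so any value longer than one character in that column suggests the batch holds data from a different column or table.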