[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210341#comment-17210341
 ] 

Andy Grove commented on ARROW-10226:


{code:java}
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49880 bad values in batch

part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49979 bad values in batch

part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 374998 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50031 bad values in batch

part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375002 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50110 bad values in batch
{code}

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and
> when I try to run the TPC-H benchmark, it never completes and eventually
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds,
> which IIRC is how long it took before.
> It is possible that something is odd in my environment, but it is also
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show
> the same issue, so I don't think this is related to a recent code change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210328#comment-17210328
 ] 

Andy Grove commented on ARROW-10226:


Just tracking progress with debugging this. The issue is that the projection
behaves differently PER BATCH within these Parquet files. We expect
l_returnflag to be a single character, but sometimes the Parquet reader returns
the contents of the l_comment field instead.
{code:java}
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: A
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: s among the fluffily r
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: eposits a
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: y ironic foxes above t
{code}
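
A per-batch check of this kind could look roughly like the sketch below. It is plain Rust over a slice of optional strings rather than the Arrow array API, and both `check_batch` and the rule that a value is "bad" when it is not exactly one character long are assumptions made for illustration; the debugging code actually used is not shown in this thread.

```rust
/// Summary of one batch of a string column that should hold single-character flags.
#[derive(Debug, PartialEq)]
struct BatchCheck {
    /// First non-null value in the batch, if any.
    first_non_null: Option<String>,
    /// Number of non-null values that are not exactly one character long
    /// (assumed definition of a "bad" value).
    bad_values: usize,
}

/// Hypothetical helper: inspect one batch of l_returnflag values.
fn check_batch(values: &[Option<&str>]) -> BatchCheck {
    let first_non_null = values.iter().flatten().next().map(|s| s.to_string());
    let bad_values = values
        .iter()
        .flatten()
        .filter(|s| s.chars().count() != 1)
        .count();
    BatchCheck { first_non_null, bad_values }
}

fn main() {
    // Values of the kind seen in the log above; the batch layout is made up.
    let good = [Some("N"), None, Some("A"), Some("R")];
    let bad = [None, Some("s among the fluffily r"), Some("N"), Some("eposits a")];

    println!("{:?}", check_batch(&good));
    println!("{:?}", check_batch(&bad));
}
```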
 



[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210317#comment-17210317
 ] 

Neal Richardson commented on ARROW-10226:

Sounds good, thanks. Good luck!



[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210314#comment-17210314
 ] 

Andy Grove commented on ARROW-10226:


[~npr] Sure, I changed it to Major, but my plan was to resolve the issue before
we release tomorrow.



[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210280#comment-17210280
 ] 

Neal Richardson commented on ARROW-10226:

[~andygrove] can you explain why this is a release blocker, given that our 
release target date is tomorrow? It certainly sounds bad, but if this is not 
due to a recent change, and perhaps something that never worked, I'm curious 
why this should hold up 2.0.

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Blocker
> Fix For: 2.0.0
>
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and
> when I try to run the TPC-H benchmark, it never completes and eventually
> uses up all 64 GB of RAM.
> I can run Spark against the data set and the query completes in 24 seconds,
> which IIRC is how long it took before.
> It is possible that something is odd in my environment, but it is also
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show
> the same issue, so I don't think this is related to a recent code change.





[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-08 Thread Jorge Leitão (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210273#comment-17210273
 ] 

Jorge Leitão commented on ARROW-10226:

I am really sorry to hear that. Let me know if there is anything I can help
with ahead of the release. I can take time over the weekend to bootstrap an
environment in the cloud to run this and debug it.

I can also easily write some Terraform to bootstrap an environment, so that we
have a procedure to run these tests in an independent and "immutable"
environment.



[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209945#comment-17209945
 ] 

Andy Grove commented on ARROW-10226:


The query works fine against tbl files but not against Parquet files (it's
reading the wrong columns somehow). Spark works fine, so the issue is not with
the Parquet files. Really odd to find this now.



[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset

2020-10-07 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209937#comment-17209937
 ] 

Andy Grove commented on ARROW-10226:


The query also returns the wrong results ... grouping by l_comment (high 
cardinality) instead of l_returnflag (low cardinality)
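
To illustrate why the cardinality of the grouping key matters, here is a minimal sketch using only the Rust standard library. It is not DataFusion's hash aggregate; `group_counts` is a hypothetical helper and the input data is synthetic. The point is only that a hash-based GROUP BY keeps one entry per distinct key, so grouping on a near-unique column such as l_comment needs roughly as many entries as there are rows, which would plausibly account for the memory growth described in the issue.

```rust
use std::collections::HashMap;

/// Count rows per group key. A hash aggregate holds one entry like this
/// per distinct key, so the map size tracks the key's cardinality.
fn group_counts<'a>(keys: impl IntoIterator<Item = &'a str>) -> HashMap<&'a str, u64> {
    let mut groups = HashMap::new();
    for key in keys {
        *groups.entry(key).or_insert(0) += 1;
    }
    groups
}

fn main() {
    // Synthetic stand-in data, not taken from the TPC-H files.
    let flags = ["A", "N", "R"];
    let n = 100_000;
    let returnflags: Vec<&str> = (0..n).map(|i| flags[i % flags.len()]).collect();
    let comments: Vec<String> = (0..n).map(|i| format!("comment text {}", i)).collect();

    let by_flag = group_counts(returnflags.iter().copied());
    let by_comment = group_counts(comments.iter().map(String::as_str));

    // Low cardinality: a handful of groups. High cardinality: one group per row.
    println!("groups by l_returnflag: {}", by_flag.len());
    println!("groups by l_comment:    {}", by_comment.len());
}
```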
