[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210341#comment-17210341 ]

Andy Grove commented on ARROW-10226:

{code:java}
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 bad values in batch
part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49880 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375000 bad values in batch
part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 49979 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 374998 bad values in batch
part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50031 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 0 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 375002 bad values in batch
part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet has 50110 bad values in batch
{code}

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Fix For: 2.0.0
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and
> when I try and run the TPC-H benchmark, it never completes and eventually
> uses up all 64 GB RAM.
> I can run Spark against the data set and the query completes in 24 seconds,
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show
> the same issue so I don't think this is related to a recent code change.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
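
The "bad values" counts in the log above can be approximated with a check like the one below. This is a minimal sketch, not the debug code actually used in this thread: it assumes a "bad" value is any non-null l_returnflag that is not exactly one character, and it models a batch as a plain slice of optional strings rather than an Arrow array. The name count_bad_values is hypothetical.

```rust
/// Counts non-null values that cannot be a valid l_returnflag.
/// Assumption: a valid flag is exactly one character.
fn count_bad_values(batch: &[Option<&str>]) -> usize {
    batch
        .iter()
        .flatten() // skip null entries
        .filter(|v| v.chars().count() != 1)
        .count()
}

fn main() {
    // Hypothetical batch: two valid flags, one null, and two strings
    // that look like l_comment contents.
    let batch = [
        Some("N"),
        None,
        Some("s among the fluffily r"),
        Some("A"),
        Some("eposits a"),
    ];
    println!("batch has {} bad values", count_bad_values(&batch));
}
```

Running the same check per file and per batch would give output in the shape of the log above, with zero for healthy batches and a non-zero count where the wrong column was returned.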
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210328#comment-17210328 ]

Andy Grove commented on ARROW-10226:

Just tracking progress with debugging this. The issue is that the projection is behaving differently PER BATCH within these Parquet files. We expect l_returnflag to be a single char but sometimes the parquet reader is returning the contents of the l_comment field instead.

{code:java}
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: A
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-3-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: R
[/mnt/tpch/s1/parquet/lineitem/part-2-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: s among the fluffily r
[/mnt/tpch/s1/parquet/lineitem/part-0-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: eposits a
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: N
[/mnt/tpch/s1/parquet/lineitem/part-1-36eb4379-93a2-47a8-873a-d0f1ed13a85a-c000.snappy.parquet] first non-null value for l_returnflag in this batch: y ironic foxes above t
{code}
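
The per-batch probe described in the comment above (print the first non-null l_returnflag of each batch and see whether it is a single character) can be sketched as follows. This is an illustration under stated assumptions, not the instrumentation that produced the log: batches are modelled as vectors of optional strings instead of Arrow record batches, and the helper names first_non_null and looks_like_returnflag, as well as the shortened file labels, are hypothetical.

```rust
/// Returns the first non-null value in a batch, if any.
fn first_non_null<'a>(batch: &[Option<&'a str>]) -> Option<&'a str> {
    batch.iter().find_map(|v| *v)
}

/// Assumption from the thread: a valid l_returnflag is a single character.
fn looks_like_returnflag(value: &str) -> bool {
    value.chars().count() == 1
}

fn main() {
    // Hypothetical (file label, batch) pairs standing in for real record batches.
    let batches: [(&str, Vec<Option<&str>>); 2] = [
        ("part-0", vec![None, Some("N"), Some("R")]),
        ("part-1", vec![Some("y ironic foxes above t"), Some("A")]),
    ];
    for (file, batch) in &batches {
        match first_non_null(batch) {
            Some(v) if looks_like_returnflag(v) => println!("[{}] ok: {}", file, v),
            Some(v) => println!("[{}] suspicious: {}", file, v),
            None => println!("[{}] batch is all null", file),
        }
    }
}
```

Checking only the first non-null value is cheap, which makes it practical to run on every batch; it will miss a batch whose first value happens to be valid, so a full count per batch is the more thorough follow-up.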
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210317#comment-17210317 ]

Neal Richardson commented on ARROW-10226:

Sounds good, thanks. Good luck!
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210314#comment-17210314 ]

Andy Grove commented on ARROW-10226:

[~npr] Sure, I changed to major, but my plan was to resolve the issue before we release tomorrow.
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210280#comment-17210280 ]

Neal Richardson commented on ARROW-10226:

[~andygrove] can you explain why this is a release blocker, given that our release target date is tomorrow? It certainly sounds bad, but if this is not due to a recent change, and perhaps something that never worked, I'm curious why this should hold up 2.0.

> [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
> ---
>
> Key: ARROW-10226
> URL: https://issues.apache.org/jira/browse/ARROW-10226
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Blocker
> Fix For: 2.0.0
>
> I re-installed my desktop a few days ago (now using Ubuntu 20.04 LTS) and
> when I try and run the TPC-H benchmark, it never completes and eventually
> uses up all 64 GB RAM.
> I can run Spark against the data set and the query completes in 24 seconds,
> which IIRC is how long it took before.
> It is possible that something is odd on my environment, but it is also
> possible/likely that this is a real bug.
> I am investigating this and will update the Jira once I know more.
> I also went back to old commits that were working for me before and they show
> the same issue so I don't think this is related to a recent code change.
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210273#comment-17210273 ]

Jorge Leitão commented on ARROW-10226:

I am really sorry to hear that. Let me know if there is anything I can support on this ahead of the release. I can take time over the weekend to bootstrap an environment on the cloud to run this and debug it. I can also easily write some Terraform to bootstrap an environment, so that we have a procedure to run these tests on an independent and "immutable" environment.
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209945#comment-17209945 ]

Andy Grove commented on ARROW-10226:

Query works fine against tbl files but not against parquet files (it's reading the wrong columns somehow). Spark works fine so the issue is not with the Parquet files. Really odd to find this now.
[jira] [Commented] (ARROW-10226) [Rust] [DataFusion] TPC-H query 1 no longer completes for 100GB dataset
[ https://issues.apache.org/jira/browse/ARROW-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209937#comment-17209937 ]

Andy Grove commented on ARROW-10226:

The query also returns the wrong results ... grouping by l_comment (high cardinality) instead of l_returnflag (low cardinality)
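
To illustrate why grouping on the wrong column matters, here is a minimal sketch of a hash-based group count. It is not DataFusion's aggregate implementation; it only assumes that a hash aggregate keeps one in-memory entry per distinct grouping key, which is a plausible reason for memory to grow when the key is l_comment rather than l_returnflag. The function group_counts and the sample rows are hypothetical.

```rust
use std::collections::HashMap;

/// Counts rows per grouping key, keeping one map entry per distinct key.
fn group_counts<'a>(keys: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut groups = HashMap::new();
    for key in keys {
        *groups.entry(*key).or_insert(0) += 1;
    }
    groups
}

fn main() {
    // Hypothetical rows as (l_returnflag, l_comment) pairs.
    let rows = [
        ("N", "s among the fluffily r"),
        ("A", "eposits a"),
        ("N", "y ironic foxes above t"),
        ("R", "a fourth, different comment"),
        ("A", "a fifth, different comment"),
    ];
    let flags: Vec<&str> = rows.iter().map(|r| r.0).collect();
    let comments: Vec<&str> = rows.iter().map(|r| r.1).collect();

    // Low-cardinality key: the number of groups stays small as rows grow.
    println!("groups by l_returnflag: {}", group_counts(&flags).len());
    // High-cardinality key: roughly one group per row.
    println!("groups by l_comment: {}", group_counts(&comments).len());
}
```

With a low-cardinality key the map size is bounded by the handful of distinct flags, while with free-text comments it grows with the number of rows, which would be consistent with the runaway memory use described in the issue.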