[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844246#comment-17844246 ]

Michael Mior commented on CALCITE-2040:
---

Thanks to all who managed to push this over the finish line!

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Assignee: hongyu guo
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.37.0
>
> Attachments: arrow_data.py
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading,
> say, CSV files using the file adapter: an Arrow data set does not have a URL.
> (Unless we use Arrow's
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
> format, or use an in-memory file system such as Alluxio.) So we would need
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it
> would also be good to have Arrow as a calling convention. That is,
> implementations of relational operators such as Filter, Project, Aggregate in
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in
> CALCITE-2025) it would make a lot of sense to translate those formats
> directly into Arrow (applying simple projects and filters first if
> applicable). Those adapters would belong as a "contrib" module in the Arrow
> project better than in Calcite.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824357#comment-17824357 ]

hongyu guo commented on CALCITE-2040:
---

[~zabetak] I have added 5 co-authors, based on PR2133 and PR2810.

{code:java}
[CALCITE-2040] Create adapter for Apache Arrow

Co-authored-by: Alessandro Solimando
Co-authored-by: Jonathan Swenson
Co-authored-by: Julian Hyde
Co-authored-by: Karshit Shah
Co-authored-by: Michael Mior
{code}
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824299#comment-17824299 ]

Stamatis Zampetakis commented on CALCITE-2040:
---

Many people contributed to this work, so please add an appropriate mention (use "Co-authored-by" in the commit message) before merging this to main.
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824121#comment-17824121 ]

Alessandro Solimando commented on CALCITE-2040:
---

[~hongyuguo] thanks for creating the tickets and adding the missing details.

One last ask: please create a new umbrella ticket instead and move the tickets under it, as these tickets are meant to be addressed after completing CALCITE-2040. The new umbrella ticket should be marked as "depending/blocked" on CALCITE-2040.

I have marked the ticket's fix version as 1.37.0; the release is imminent, but I think we can get this in with some more effort.
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820512#comment-17820512 ]

Alessandro Solimando commented on CALCITE-2040:
---

[~hongyuguo], I have left a (partial) review. I feel there is already enough to look at, and I will finish the review sometime next week.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Assignee: Julian Hyde
> Priority: Major
> Labels: pull-request-available
> Attachments: arrow_data.py
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819076#comment-17819076 ]

hongyu guo commented on CALCITE-2040:
---

Does anyone have any suggestions for [https://github.com/apache/calcite/pull/3666]? Some discussion can be found in the mail thread [https://lists.apache.org/thread/z4qzgnzov7sdjorjvkx8w35m376dwm3y].
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541077#comment-17541077 ]

Julian Hyde commented on CALCITE-2040:
---

I agree with [~mmior]. If some basic queries work, and it's tested in CI in at least one configuration, let's merge the feature to trunk, and announce it as a feature in an 'alpha' level of quality. Log bugs for the stuff that doesn't work. After that, prioritize fixing the 'silent' failures, e.g. where we give wrong results as opposed to throwing an error.
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540891#comment-17540891 ]

Michael Mior commented on CALCITE-2040:
---

[~jswenson] If some simple useful queries work, I think it's fine if several things are broken. What is broken will need to be documented. Having this landed will make it much easier for others to try it out and hopefully also start contributing. If you already have tests written, I would mark them as known failures for now and create tasks to fix them. One of the advantages of having this landed is also that we can make sure all the CI processes are set up to build correctly and that everything continues to build correctly going forward. That in itself has proved to be a big accomplishment.
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539905#comment-17539905 ]

Jonathan Swenson commented on CALCITE-2040:
---

[~julianhyde] what do you think is required here to get this over the line?

It appears as though the PR is a reasonable baseline, but a lot of pieces are not yet implemented. For example: ORs, INs, IS (NOT) NULL, projection of dates, more complicated boolean logic (WHERE (x > 5) IS NOT FALSE), joins, unions, certain casts, certain filter literals.

I attempted to write some simple tests for many of the cases you called out, but many of them do not work. They fail in one of three places:
+ the Calcite calc layer, because data is not mapped as expected (dates come back as date_day integers, which get mapped to timestamps)
+ unimplemented parts of gandiva (comparison of a smallint (or byte) with an integer)
+ missing implementations in the translation layer from calcite to gandiva expression trees (no mapping of ORs or null checking clauses right now).

In most cases the query fails, but in some cases the results appear to be incorrect (projection of dates) or simply missing (joins) for unsupported functionality.
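The date symptom described above can be illustrated in isolation. Arrow's day-resolution date type stores a count of days since 1970-01-01, so a layer that reads the same integer as epoch milliseconds lands a few seconds after the epoch. The following is a minimal sketch of that mismatch using only `java.time`; the class `DateDayMapping` and its methods are hypothetical names for illustration and are not part of Calcite or Arrow, and the sketch does not claim to reproduce the adapter's actual code path.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Hypothetical helper for illustration; not part of Calcite or Arrow. */
final class DateDayMapping {
  private DateDayMapping() {}

  /** Intended reading: the raw value counts days since 1970-01-01. */
  static LocalDate fromEpochDays(int dateDay) {
    return LocalDate.ofEpochDay(dateDay);
  }

  /** Mistaken reading: the same raw value treated as epoch milliseconds. */
  static LocalDate misreadAsEpochMillis(int dateDay) {
    return Instant.ofEpochMilli(dateDay).atZone(ZoneOffset.UTC).toLocalDate();
  }

  public static void main(String[] args) {
    int raw = 19000; // an arbitrary sample value, assumed for the demo
    System.out.println("as epoch days:   " + fromEpochDays(raw));
    System.out.println("as epoch millis: " + misreadAsEpochMillis(raw));
  }
}
```

A test that projects a date column and compares it with the expected `LocalDate` would catch this kind of silent wrong result, as opposed to a query that simply throws.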
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539845#comment-17539845 ]

Julian Hyde commented on CALCITE-2040:
---

[~jswenson], Great to see your contributions in this area. Please be sure to nag me and [~mmior] to get it over the line. This bug has required a lot of persistence and patience but I think we can land it together.
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539613#comment-17539613 ]

Michael Mior commented on CALCITE-2040:
---

[~jswenson] Thanks for picking this up! I don't think tests failing on M1 is necessarily a blocker. However, it would be nice if we had a way to skip these tests on M1 silicon for now until [ARROW-16608|ARROW-16608] is resolved.
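One way to skip tests on Apple silicon is to detect the platform from the standard `os.name` and `os.arch` system properties. The sketch below is an assumption about how such a guard could look, not what the Calcite test suite does; `NativeArchCheck` and `isAppleSilicon` are hypothetical names. It assumes a native arm64 JVM on macOS reports `os.arch` as `aarch64` (it also accepts `arm64`); an x86_64 JVM running under Rosetta reports `x86_64` and is deliberately not matched.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Calcite. */
final class NativeArchCheck {
  private NativeArchCheck() {}

  /** Returns true when the given OS name and architecture look like macOS on arm64. */
  static boolean isAppleSilicon(String osName, String osArch) {
    String name = osName.toLowerCase(Locale.ROOT);
    String arch = osArch.toLowerCase(Locale.ROOT);
    boolean mac = name.contains("mac");
    boolean arm64 = arch.equals("aarch64") || arch.equals("arm64");
    return mac && arm64;
  }

  /** Same check, reading the values from the running JVM. */
  static boolean isAppleSilicon() {
    return isAppleSilicon(
        System.getProperty("os.name", ""),
        System.getProperty("os.arch", ""));
  }

  public static void main(String[] args) {
    System.out.println("Apple silicon JVM: " + isAppleSilicon());
  }
}
```

In a JUnit 5 suite, the result could feed `Assumptions.assumeFalse(NativeArchCheck.isAppleSilicon(), "gandiva native library unavailable")` in a `@BeforeAll` method, so the tests are reported as skipped instead of failed.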
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539117#comment-17539117 ]

Jonathan Swenson commented on CALCITE-2040:
---

It appears as though the first issue (filter was not applied over the whole dataset) is fixed with [this commit|https://github.com/jonathanswenson/calcite/commit/408b238d3cd853b2084b8d021d43efc1a99bee0e].

The issue was that the enumerator would exit on the first empty filtered batch and skip iteration or evaluation of filters on any subsequent record batch.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Assignee: Julian Hyde
> Priority: Major
> Labels: pull-request-available
> Attachments: arrow_data.py
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
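The enumerator bug described above can be reduced to a small model. The sketch below is not the adapter's code: it stands in plain `int[]` arrays for Arrow record batches and an `IntPredicate` for the Gandiva filter, and `BatchFilterSketch` with its methods is a hypothetical name. It contrasts a loop that stops at the first batch with no matching rows against one that moves on to the next batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical model of batch-wise filtering; not the Calcite Arrow enumerator. */
final class BatchFilterSketch {
  private BatchFilterSketch() {}

  /** Buggy variant: treats the first empty filtered batch as end of data. */
  static List<Integer> stopOnFirstEmptyBatch(List<int[]> batches, IntPredicate filter) {
    List<Integer> out = new ArrayList<>();
    for (int[] batch : batches) {
      List<Integer> selected = select(batch, filter);
      if (selected.isEmpty()) {
        break; // later batches are never read or filtered
      }
      out.addAll(selected);
    }
    return out;
  }

  /** Fixed variant: an empty filtered batch is skipped and iteration continues. */
  static List<Integer> skipEmptyBatches(List<int[]> batches, IntPredicate filter) {
    List<Integer> out = new ArrayList<>();
    for (int[] batch : batches) {
      out.addAll(select(batch, filter)); // adding an empty list is a no-op
    }
    return out;
  }

  /** Returns the values of one batch that satisfy the filter, in order. */
  private static List<Integer> select(int[] batch, IntPredicate filter) {
    List<Integer> selected = new ArrayList<>();
    for (int value : batch) {
      if (filter.test(value)) {
        selected.add(value);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<int[]> batches = Arrays.asList(
        new int[] {1, 2, 3}, new int[] {4, 5, 6}, new int[] {7, 8, 9});
    IntPredicate greaterThanSix = v -> v > 6;
    System.out.println("buggy: " + stopOnFirstEmptyBatch(batches, greaterThanSix));
    System.out.println("fixed: " + skipEmptyBatches(batches, greaterThanSix));
  }
}
```

This also suggests why a larger batch size masked the problem in the tests: with all rows in one batch, there is no later batch to lose.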
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539074#comment-17539074 ]

Jonathan Swenson commented on CALCITE-2040:
---

Building arrow-gandiva from source on an M1 mac and loading in that jar manually works, so it appears as though the gandiva library is compatible with M1 macs, but they just need to host a new target on maven.
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538626#comment-17538626 ]

Jonathan Swenson commented on CALCITE-2040:
---

For reference the linker error I get on an M1 mac is:

{code:java}
FAILURE 2.3sec, org.apache.calcite.adapter.arrow.ArrowAdapterTest > testArrowProjectFieldsWithFloatFilter()
java.lang.UnsatisfiedLinkError: /private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib804580c2-6fe4-4294-bdbb-c0c7d9e582a8: dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib804580c2-6fe4-4294-bdbb-c0c7d9e582a8, 0x0001): tried: '/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib804580c2-6fe4-4294-bdbb-c0c7d9e582a8' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e'))
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832)
    at java.lang.Runtime.load0(Runtime.java:811)
    at java.lang.System.load(System.java:1088)
    at org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74)
    at org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63)
    at org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53)
    at org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144)
    at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67)
{code}
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538624#comment-17538624 ] Jonathan Swenson commented on CALCITE-2040: --- [~julianhyde] it appears as though ARROW-11135 has been fixed (probably in arrow 7.0.0). I was able to upgrade arrow and arrow-gandiva to 8.0.0 (latest) and run the tests on Mac. Working off [julian's branch|https://github.com/julianhyde/calcite/tree/2040-arrow], I merged in the final changes from the [original branch|https://github.com/apache/calcite/pull/2133] and resolved a few differences in how the two had diverged. In addition I made the changes suggested by [~vladimirsitnikov] to implement the data generation as part of the test setup (using a @TempDir in junit). Those changes can be found [here|https://github.com/apache/calcite/compare/main...jonathanswenson:swenson/arrow-upgrade?expand=1]. There are a few things that I ran into that probably need to be considered: * I don't believe that filters apply over the whole dataset; they only seem to apply to the first record batch. I had to [update the batch size to 20|https://github.com/apache/calcite/compare/main...jonathanswenson:swenson/arrow-upgrade?expand=1#diff-857069b93915684ed8b3ffbfb26fe4aadc8a2eebd5ce9c956fc614757c77e48dR52] to accommodate. It is relatively easy to write a test that confirms this (2 of the tests will fail if you bump the batch size back down to 10). This feels like a blocker to me. * I had to move to an Intel mac in order to get this to execute the tests properly – I got a linker error when trying to run the tests on an M1 Mac (apple silicon). I don't know if that is a blocker for development – I had another suite (the cassandra test suite) failing inexplicably on the M1 mac that I'll probably dig more into / file an issue. I'm not deep enough in the arrow / gandiva support yet to know what's happening for either of these. 
> Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > Labels: pull-request-available > Attachments: arrow_data.py > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.20.7#820007)
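To illustrate the record-batch concern raised in the comment above, here is a minimal sketch that does not use Arrow or Gandiva at all: plain `int[]` arrays stand in for record batches, and all class and method names are hypothetical. It contrasts filtering every batch with filtering only the first one, which is the behaviour the comment suspects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

final class BatchFilterSketch {
  /** Splits values into consecutive batches of at most batchSize elements. */
  static List<int[]> toBatches(int[] values, int batchSize) {
    List<int[]> batches = new ArrayList<>();
    for (int start = 0; start < values.length; start += batchSize) {
      int end = Math.min(start + batchSize, values.length);
      batches.add(Arrays.copyOfRange(values, start, end));
    }
    return batches;
  }

  /** Applies the predicate to every batch and collects all matches. */
  static List<Integer> filterAllBatches(List<int[]> batches, IntPredicate p) {
    List<Integer> out = new ArrayList<>();
    for (int[] batch : batches) {
      for (int v : batch) {
        if (p.test(v)) {
          out.add(v);
        }
      }
    }
    return out;
  }

  /** Mimics the suspected bug: only the first batch is ever filtered. */
  static List<Integer> filterFirstBatchOnly(List<int[]> batches, IntPredicate p) {
    return batches.isEmpty()
        ? new ArrayList<>()
        : filterAllBatches(batches.subList(0, 1), p);
  }

  public static void main(String[] args) {
    int[] rows = new int[20];
    for (int i = 0; i < rows.length; i++) {
      rows[i] = i;
    }
    IntPredicate multipleOfFive = v -> v % 5 == 0;
    List<int[]> batches = toBatches(rows, 10);
    System.out.println(filterAllBatches(batches, multipleOfFive).size());
    System.out.println(filterFirstBatchOnly(batches, multipleOfFive).size());
  }
}
```

With 20 rows and a batch size of 20 the two methods agree, which is consistent with the observation that raising the batch size hides the problem; with a batch size of 10 they diverge.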
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441855#comment-17441855 ] Karshit Shah commented on CALCITE-2040: --- [~vladimirsitnikov], thanks for the suggestion. I'll try to implement it that way. > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > Labels: pull-request-available > Attachments: arrow_data.py > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439422#comment-17439422 ] Vladimir Sitnikov commented on CALCITE-2040: [~Karshit Shah], have you considered running "ArrowData" as part of the test (e.g. via JUnit @TempDir, etc.)? > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > Labels: pull-request-available > Attachments: arrow_data.py > > Time Spent: 1h > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.3.4#803005)
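A minimal sketch of the suggestion above, assuming the data file is generated into a temporary directory during test setup rather than by a separate build task. It uses only `java.nio.file` instead of JUnit's @TempDir so that it stays self-contained; the class and method names are hypothetical, and the placeholder bytes stand in for whatever the real "ArrowData" generator would write.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempDataSetup {
  /** Writes placeholder bytes standing in for the generated data file. */
  static Path generateDataFile(Path dir) {
    Path file = dir.resolve("test.arrow");
    try {
      Files.write(file, "placeholder".getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return file;
  }

  /** Creates a fresh temporary directory and generates the file in it. */
  static Path createAndGenerate() {
    try {
      return generateDataFile(Files.createTempDirectory("arrow-test"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Removes the generated file and its (then empty) parent directory. */
  static void cleanUp(Path file) {
    try {
      Files.deleteIfExists(file);
      Files.deleteIfExists(file.getParent());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path file = createAndGenerate();
    System.out.println("generated " + file.getFileName());
    cleanUp(file);
  }
}
```

In a JUnit 5 test, the directory creation and clean-up would be handled by the framework when a `Path` parameter or field is annotated with @TempDir, so only the generation step would remain in the setup method.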
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438878#comment-17438878 ] Karshit Shah commented on CALCITE-2040: --- I've been working on this issue with [~mmior]. I have a Java class that generates an Arrow data file. Currently, I have the following Gradle task to generate the data file. {code:java} task("runWithJavaExec", JavaExec::class) { main = "org.apache.calcite.adapter.arrow.ArrowData" classpath = sourceSets["test"].runtimeClasspath } {code} I need to run "./gradlew arrow:runWithJavaExec", which generates the data file. However, I would like to add this task as a dependency of "./gradlew arrow:test" so that the file is generated before running the tests. But with my limited knowledge of Gradle, I'm not able to get that to work. It would be great if anyone could help me out with this. > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > Labels: pull-request-available > Attachments: arrow_data.py > > Time Spent: 50m > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. 
That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325314#comment-17325314 ] Julian Hyde commented on CALCITE-2040: -- It seems that ARROW-11135 is not going to make it into Arrow 4.0 (they are just rolling out the first release candidate). No one seems interested in working on it. This case is blocked until that bug is fixed. If anyone is willing and able to fix ARROW-11135, I would be grateful. > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > Labels: pull-request-available > Attachments: arrow_data.py > > Time Spent: 50m > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). 
Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321797#comment-17321797 ] Julian Hyde commented on CALCITE-2040: -- [~mmior], Thanks. I added it to my dev branch, as {{arrow/libs/arrow_data.py}}, on the principle that generated files should be accompanied by the source to generate them. Before we merge to master I would like to convert the file to Java. (So that it is easier to maintain.) And even better, have Gradle invoke the script, so that we can remove the {{.arrow}} file from git. > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > Labels: pull-request-available > Attachments: arrow_data.py > > Time Spent: 50m > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. 
> Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320306#comment-17320306 ] Michael Mior commented on CALCITE-2040: --- [^arrow_data.py] > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > Labels: pull-request-available > Attachments: arrow_data.py > > Time Spent: 50m > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319642#comment-17319642 ] Michael Mior commented on CALCITE-2040: --- [~julianhyde] Good point. I believe Karshit has a Python script that was used to generate this data file. I'm not sure whether the equivalent APIs exist in Java, but I suppose it's likely that this could be written in Java as well. I'll share a copy of this program as soon as possible. > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). 
Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318935#comment-17318935 ] Julian Hyde commented on CALCITE-2040: -- [~mmior] and Karshit, Can you tell me how you created the {{test.arrow}} file? Before we merge to master, I want to remove this binary file from source control. If you could provide a java program that generates {{test.arrow}}, I could take it from there. (I will probably hook it into the gradle build scripts, so that test resources get generated. I will probably also generate Arrow files that contain the scott data set (EMP, DEPT, etc.).) > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. 
> Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318393#comment-17318393 ] Julian Hyde commented on CALCITE-2040: -- As [~mmior] [pointed out on dev@calcite|https://lists.apache.org/thread.html/r56003ae9392e9b759f46a5d94b7571a887a38712134753f7c9b33514%40%3Cdev.calcite.apache.org%3E], [PR 2133|https://github.com/apache/calcite/pull/2133] is ready for review. I plan to fix it up so that it builds and runs in CI (except in AppVeyor, due to issues noted in [Arrow/Gandiva dependency management in Java|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E]). > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Michael Mior >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. 
> Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836422#comment-16836422 ] Lai Zhou commented on CALCITE-2040: --- [~masayuki038] Great job. I added a link to the related issue https://issues.apache.org/jira/browse/CALCITE-2173 . I think it's more than an `adapter` for Calcite; it may be a new physical implementation, like the default Enumerable implementation. > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Priority: Major > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836162#comment-16836162 ] Lai Zhou commented on CALCITE-2040: --- I think performance may improve a lot if we have Arrow as a calling convention. [~julianhyde], do you mean that a new kind of Enumerable-style implementation of Filter, Project, Aggregate and TableScan needs to be introduced? I'm just getting familiar with Arrow. I will try to make Arrow a calling convention. > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Priority: Major > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441784#comment-16441784 ] Laurent Goujon commented on CALCITE-2040: - ARROW-1780 is kind of the opposite/complementary: converting a JDBC resultset into an Arrow batch record. > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow
[ https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440328#comment-16440328 ] Kevin Risden commented on CALCITE-2040: --- There is a related JDBC adapter being created in the Apache Arrow project: ARROW-1780 > Create adapter for Apache Arrow > --- > > Key: CALCITE-2040 > URL: https://issues.apache.org/jira/browse/CALCITE-2040 > Project: Calcite > Issue Type: Bug >Reporter: Julian Hyde >Assignee: Julian Hyde >Priority: Major > > Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would > allow people to execute SQL statements, via JDBC or ODBC, on data stored in > Arrow in-memory format. > Since Arrow is an in-memory format, it is not as straightforward as reading, > say, CSV files using the file adapter: an Arrow data set does not have a URL. > (Unless we use Arrow's > [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/] > format, or use an in-memory file system such as Alluxio.) So we would need > to devise a way of addressing Arrow data sets. > Also, since Arrow is an extremely efficient format for processing data, it > would also be good to have Arrow as a calling convention. That is, > implementations of relational operators such as Filter, Project, Aggregate in > addition to just TableScan. > Lastly, when we have an Arrow convention, if we build adapters for file > formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in > CALCITE-2025) it would make a lot of sense to translate those formats > directly into Arrow (applying simple projects and filters first if > applicable). Those adapters would belong as a "contrib" module in the Arrow > project better than in Calcite. -- This message was sent by Atlassian JIRA (v7.6.3#76005)