[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2024-05-07 Thread Michael Mior (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844246#comment-17844246
 ] 

Michael Mior commented on CALCITE-2040:
---

Thanks to all who managed to push this over the finish line!

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: hongyu guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.37.0
>
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2024-03-07 Thread hongyu guo (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824357#comment-17824357
 ] 

hongyu guo commented on CALCITE-2040:
-

[~zabetak] I have addad 5 co-authors, based on the PR2133 and PR2810.
{code:java}
[CALCITE-2040] Create adapter for Apache Arrow 
Co-authored-by: Alessandro Solimando 
Co-authored-by: Jonathan Swenson 
Co-authored-by: Julian Hyde 
Co-authored-by: Karshit Shah 
Co-authored-by: Michael Mior  {code}

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: hongyu guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.37.0
>
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2024-03-07 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824299#comment-17824299
 ] 

Stamatis Zampetakis commented on CALCITE-2040:
--

Many people contributed to this work so please give add an appropriate mention 
(use "Co-authored-by" in the commit message) before merging this to main.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: hongyu guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.37.0
>
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2024-03-06 Thread Alessandro Solimando (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824121#comment-17824121
 ] 

Alessandro Solimando commented on CALCITE-2040:
---

[~hongyuguo] thanks for creating the tickets and adding the missing details.

One last ask is to create instead a new umbrella ticket and move the tickets 
under it, as these tickets are meant to be addressed after completing 
CALCITE-2040.

The new umbrella ticket should be marked as "depending/blocked" on CALCITE-2040.

I have marked the ticket's fix version as 1.37.0, as the release is incumbent 
but I think we can get this in with some more effort.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: hongyu guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.37.0
>
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2024-02-25 Thread Alessandro Solimando (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820512#comment-17820512
 ] 

Alessandro Solimando commented on CALCITE-2040:
---

[~hongyuguo], I have left a (partial) review, there is enough to be looked at 
already I feel, I will finish the review sometime next week.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2024-02-20 Thread hongyu guo (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819076#comment-17819076
 ] 

hongyu guo commented on CALCITE-2040:
-

Does anyone has any suggestion for 
[https://github.com/apache/calcite/pull/3666] ?

Some discussion can be found in mail thread 
[https://lists.apache.org/thread/z4qzgnzov7sdjorjvkx8w35m376dwm3y]

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2022-05-23 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17541077#comment-17541077
 ] 

Julian Hyde commented on CALCITE-2040:
--

I agree with [~mmior]. If some basic queries work, and it's tested in CI in at 
least one configuration, let's merge the feature to trunk, and announce it as a 
feature in an 'alpha' level of quality. Log bugs for the stuff that doesn't 
work. After that, prioritize fixing the 'silent' failures, e.g. where we give 
wrong results as opposed to throwing an error.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2022-05-23 Thread Michael Mior (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540891#comment-17540891
 ] 

Michael Mior commented on CALCITE-2040:
---

[~jswenson] If some simple useful queries work, I think it's fine if several 
things are broken. What is broken will need to be documented. Having this 
landed will make it much easier for others to try it out and hopefully also 
start contributing. If you already have tests written, I would mark them as 
known failures for now and create tasks to fix them.

One of the advantages of having this landed is also that we can make sure all 
the CI processes are set up to build correctly and that everything continues to 
build correctly going forward. That in itself has proved to be a big 
accomplishment.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2022-05-19 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539905#comment-17539905
 ] 

Jonathan Swenson commented on CALCITE-2040:
---

[~julianhyde] what do you think is required here to get this over the line? 

It appears as though the PR is a reasonable baseline, but a lot of pieces are 
not yet implemented. 

For example: ORs, INs, IS (NOT) NULL, projection of dates, more complicated 
boolean logic (WHERE (x > 5) IS NOT FALSE), joins, unions, certain casts, 
certain filter literals. 

I attempted to write some simple tests for many of the cases you called out, 
but many of them do not work -- either failing in the:

+ Calcite calc layer due to data not being mapped as expected (dates come back 
as date_day integers which get mapped to timestamps)

+ unimplemented parts of gandiva (comparison of a smallint (or byte) with an 
integer)

+ missing implementations in the translation layer from calcite to gandiva 
expression trees (no mapping of ORs or null checking clauses right now). 

 

In most cases the query fails, but in some cases the results appear to be 
incorrect (projection of dates) or simply missing (joins) for unsupported 
functionality.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2022-05-19 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539845#comment-17539845
 ] 

Julian Hyde commented on CALCITE-2040:
--

[~jswenson], Great to see your contributions in this area. Please be sure to 
nag me and [~mmior] to get it over the line. This bug has required a lot of 
persistence and patience but I think we can land it together.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2022-05-19 Thread Michael Mior (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539613#comment-17539613
 ] 

Michael Mior commented on CALCITE-2040:
---

[~jswenson] Thanks for picking this up! I don't think tests failing on M1 is 
necessarily a blocker. However, it would be nice if we had a way to skip these 
tests on M1 silicon for now until [ARROW-16608|ARROW-16608] is resolved.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2022-05-18 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539117#comment-17539117
 ] 

Jonathan Swenson commented on CALCITE-2040:
---

It appears as though the first issue (filter was not applied over the whole 
dataset) is fixed with [this 
commit|https://github.com/jonathanswenson/calcite/commit/408b238d3cd853b2084b8d021d43efc1a99bee0e].

The issue was that the enumerator would exit on the first empty filtered batch 
and skip iteration or evaluation of filters on any subsequent record batch.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2022-05-18 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539074#comment-17539074
 ] 

Jonathan Swenson commented on CALCITE-2040:
---

Building arrow-gandiva from source on an M1 mac and loading in that jar 
manually works – so it appears as though the gandiva library is compatible with 
M1 macs, but they just needs to host a new target on maven. 

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2022-05-18 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538626#comment-17538626
 ] 

Jonathan Swenson commented on CALCITE-2040:
---

For reference the linker error I get on an M1 mac is: 
{code:java}
FAILURE   2.3sec, org.apache.calcite.adapter.arrow.ArrowAdapterTest > 
testArrowProjectFieldsWithFloatFilter()
    java.lang.UnsatisfiedLinkError: 
/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib804580c2-6fe4-4294-bdbb-c0c7d9e582a8:
 
dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib804580c2-6fe4-4294-bdbb-c0c7d9e582a8,
 0x0001): tried: 
'/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib804580c2-6fe4-4294-bdbb-c0c7d9e582a8'
 (mach-o file, but is an incompatible architecture (have 'x86_64', need 
'arm64e'))
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832)
        at java.lang.Runtime.load0(Runtime.java:811)
        at java.lang.System.load(System.java:1088)
        at 
org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74)
        at 
org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63)
        at 
org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53)
        at 
org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144)
        at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67) {code}
 

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2022-05-18 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538624#comment-17538624
 ] 

Jonathan Swenson commented on CALCITE-2040:
---

[~julianhyde] it appears as though ARROW-11135 has been fixed (probably in 
arrow 7.0.0). I was able to upgrade to arrow and arrow-gandiva to 8.0.0 
(latest) and was able to run the tests on Mac.

Working off [julian's 
branch|https://github.com/julianhyde/calcite/tree/2040-arrow], I merged in the 
final changes from the [original 
branch|https://github.com/apache/calcite/pull/2133] and resolved a few 
differences in how the two had diverged. In addition I made the changes 
suggested by [~vladimirsitnikov] to implement the data generation as part of 
the test setup (using a @TempDir in junit).

Those changes can be found 
[here|https://github.com/apache/calcite/compare/main...jonathanswenson:swenson/arrow-upgrade?expand=1]

There are a few things that I ran into that probably need to be considered:
 * I don't believe that filters apply over the whole dataset, but only seem to 
apply to the first record batch. I had to [update the batch size to 
20|https://github.com/apache/calcite/compare/main...jonathanswenson:swenson/arrow-upgrade?expand=1#diff-857069b93915684ed8b3ffbfb26fe4aadc8a2eebd5ce9c956fc614757c77e48dR52]
 to accommodate. It is relatively easy to write a test that confirms this (2 of 
the tests will fail if you bump the batch size back down to 10). This feels 
like a blocker to me. 
 * I had to move to an Intel mac in order to get this to execute the tests 
properly – I got a linker error when trying to run the tests on an M1 Mac 
(apple silicon). I don't know if that is a blocker for development – I had 
another suite (the cassandra test suite) failing inexplicably on the M1 mac 
that I'll probably dig more into / file an issue.

I'm not deep enough yet in the arrow / gandiva support to know what's happening 
yet for either of these. 

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2021-11-10 Thread Karshit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17441855#comment-17441855
 ] 

Karshit Shah commented on CALCITE-2040:
---

[~vladimirsitnikov], thanks for the suggestion. I'll try to implement it that 
way.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2021-11-05 Thread Vladimir Sitnikov (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439422#comment-17439422
 ] 

Vladimir Sitnikov commented on CALCITE-2040:


[~Karshit Shah], have you considered running "ArrowData" as a part of the test? 
(e.g. via JUnit @TempDir, etc , etc).

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2021-11-04 Thread Karshit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17438878#comment-17438878
 ] 

Karshit Shah commented on CALCITE-2040:
---

I've been working on this issue with [~mmior]. I've a Java class that generates 
an arrow data file. Currently, I've the following gradle task to generate the 
data file.
{code:java}
task("runWithJavaExec", JavaExec::class) {

  main = "org.apache.calcite.adapter.arrow.ArrowData"

  classpath = sourceSets["test"].runtimeClasspath

}

{code}
I need to run "./gradlew arrow:runWithJavaExec" which generates the data file. 
However, I would like to add this task as a dependency to "./gradle arrow:test" 
so that that the file is generated before running the tests. But with limited 
knowledge of gradle, I'm not able to get that to work. It would be great if 
anyone can help me out with this.


 

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2021-04-19 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325314#comment-17325314
 ] 

Julian Hyde commented on CALCITE-2040:
--

It seems that ARROW-11135 is not going to make it into Arrow 4.0 (they are just 
rolling out the first release candidate). No one seems interested in working on 
it. This case is blocked until that bug is fixed.

If anyone is willing and able to fix ARROW-11135, I would be grateful.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2021-04-14 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321797#comment-17321797
 ] 

Julian Hyde commented on CALCITE-2040:
--

[~mmior], Thanks. I added it to my dev branch, as {{arrow/libs/arrow_data.py}}, 
on the principle that generated files should be accompanied by the source to 
generate them.

Before we merge to master I would like to convert the file to Java. (So that it 
is easier to maintain.)

And even better, have Gradle invoke the script, so that we can remove the the 
{{.arrow}} file from git.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2021-04-13 Thread Michael Mior (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320306#comment-17320306
 ] 

Michael Mior commented on CALCITE-2040:
---

[^arrow_data.py]

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
> Attachments: arrow_data.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2021-04-12 Thread Michael Mior (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319642#comment-17319642
 ] 

Michael Mior commented on CALCITE-2040:
---

[~julianhyde] Good point. I believe Karshit has a Python script that was used 
to generate this data file. I'm not sure whether the equivalent APIs exist in 
Java, but I suppose that's likely that this could be written in Java as well. 
I'll share a copy of this program as soon as possible.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2021-04-11 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318935#comment-17318935
 ] 

Julian Hyde commented on CALCITE-2040:
--

[~mmior] and Karshit, Can you tell me how you created the {{test.arrow}} file?

Before we merge to master, I want to remove this binary file from source 
control. If you could provide a java program that generates {{test.arrow}}, I 
could take it from there. (I will probably hook it into the gradle build 
scripts, so that test resources get generated. I will probably also generate 
Arrow files that contain the scott data set (EMP, DEPT, etc.).)

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2021-04-09 Thread Julian Hyde (Jira)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318393#comment-17318393
 ] 

Julian Hyde commented on CALCITE-2040:
--

As [~mmior] [pointed out on 
dev@calcite|https://lists.apache.org/thread.html/r56003ae9392e9b759f46a5d94b7571a887a38712134753f7c9b33514%40%3Cdev.calcite.apache.org%3E],
 [PR 2133|https://github.com/apache/calcite/pull/2133] is ready for review.

I plan to fix it up so that it builds and runs in CI (except in AppVeyor, due 
to issues noted in [Arrow/Gandiva dependency management in 
Java|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E].

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Michael Mior
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2019-05-09 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836422#comment-16836422
 ] 

Lai Zhou commented on CALCITE-2040:
---

[~masayuki038]

Great job. I add a relation link to 
https://issues.apache.org/jira/browse/CALCITE-2173 .
I think it's more than an `adapter`  of Calcite, it may be a new physical 
implementation that like the default Enumerable implementation.
 

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Priority: Major
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2019-05-09 Thread Lai Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836162#comment-16836162
 ] 

Lai Zhou commented on CALCITE-2040:
---

I think it may improve a lot of performance  if we have Arrow as a calling 
convention.

[~julianhyde],Do you mean a new kind of  Enumerable-implementations for Filter, 
Project, Aggregate and TableScan need to be introduced ?

I'm just getting familiar with Arrow.I will have a try to make Arrow as a 
calling convention.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Priority: Major
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2018-04-17 Thread Laurent Goujon (JIRA)

[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441784#comment-16441784
 ] 

Laurent Goujon commented on CALCITE-2040:
-

ARROW-1780 is kind of the opposite/complementary: converting a JDBC resultset 
into an Arrow batch record.

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CALCITE-2040) Create adapter for Apache Arrow

2018-04-16 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/CALCITE-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440328#comment-16440328
 ] 

Kevin Risden commented on CALCITE-2040:
---

There is a related JDBC adapter being created in the Apache Arrow project: 
ARROW-1780

> Create adapter for Apache Arrow
> ---
>
> Key: CALCITE-2040
> URL: https://issues.apache.org/jira/browse/CALCITE-2040
> Project: Calcite
>  Issue Type: Bug
>Reporter: Julian Hyde
>Assignee: Julian Hyde
>Priority: Major
>
> Create an adapter for [Apache Arrow|http://arrow.apache.org/]. This would 
> allow people to execute SQL statements, via JDBC or ODBC, on data stored in 
> Arrow in-memory format.
> Since Arrow is an in-memory format, it is not as straightforward as reading, 
> say, CSV files using the file adapter: an Arrow data set does not have a URL. 
> (Unless we use Arrow's 
> [Feather|https://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/]
>  format, or use an in-memory file system such as Alluxio.) So we would need 
> to devise a way of addressing Arrow data sets.
> Also, since Arrow is an extremely efficient format for processing data, it 
> would also be good to have Arrow as a calling convention. That is, 
> implementations of relational operators such as Filter, Project, Aggregate in 
> addition to just TableScan.
> Lastly, when we have an Arrow convention, if we build adapters for file 
> formats (for instance the bioinformatics formats SAM, VCF, FASTQ discussed in 
> CALCITE-2025) it would make a lot of sense to translate those formats 
> directly into Arrow (applying simple projects and filters first if 
> applicable). Those adapters would belong as a "contrib" module in the Arrow 
> project better than in Calcite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)