[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914439#comment-16914439 ]

Antoine Pitrou commented on ARROW-5691:
---

Ah... Then perhaps the Parquet-Arrow bridge needs to be in a separate {{libarrow_parquet.so}} library. Not very pretty...

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in
> the same place as other file format <-> Arrow serialization code and dataset
> handling routines (e.g. schema normalization). Under this scheme, libparquet
> becomes a link time dependency of libarrow_dataset

--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914438#comment-16914438 ]

Wes McKinney commented on ARROW-5691:
-

I'm OK to leave things as they are now, since it's not really hurting anything. I mostly want to nix the {{parquet::arrow}} namespace and treat libparquet.so as a dependency of Arrow record batch assembly / disassembly
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914384#comment-16914384 ]

Antoine Pitrou commented on ARROW-5691:
---

If we have a "formats" directory I'd still like to have the per-format subdirectories.

As for the Parquet symbols I'm not sure I understand: if we move the Arrow-specific code from Parquet to Arrow, is there still a dependency from libparquet.so to libarrow.so?
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914378#comment-16914378 ]

Wes McKinney commented on ARROW-5691:
-

We can also flatten these directories and instead use {{$FORMAT_}} prefixes, for example {{src/arrow/formats/parquet_reader.h}} etc. Thoughts?
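
To make the flattening suggestion in this comment concrete, here is a minimal sketch of the path mapping it implies. Only {{src/arrow/formats/parquet_reader.h}} comes from the thread; the per-format input path {{src/arrow/formats/parquet/reader.h}} and the helper name flatten_format_path are assumptions made up for illustration, not part of any Arrow tooling.

```python
from pathlib import PurePosixPath


def flatten_format_path(path, root="src/arrow/formats"):
    """Map <root>/<format>/<file> to <root>/<format>_<file>.

    Hypothetical helper: the per-format input layout is an assumption;
    only the flattened output form appears in the discussion.
    """
    # relative_to raises ValueError if path is not under root.
    rel = PurePosixPath(path).relative_to(root)
    if len(rel.parts) != 2:
        raise ValueError(f"expected <format>/<file> under {root}, got {rel}")
    fmt, filename = rel.parts
    # Prefix the file name with its format instead of nesting it in a directory.
    return str(PurePosixPath(root) / f"{fmt}_{filename}")


flattened = flatten_format_path("src/arrow/formats/parquet/reader.h")
```

Under these assumptions a per-format header such as csv/reader.h would become csv_reader.h in the same directory, which keeps the hierarchy shallow at the cost of longer file names.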
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914381#comment-16914381 ]

Wes McKinney commented on ARROW-5691:
-

We still have the problem of which shared library to put the symbols in. libarrow.so is not an option for the Parquet symbols, as discussed above.
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914337#comment-16914337 ]

Antoine Pitrou commented on ARROW-5691:
---

Then "formats" plural, because "format" singular sounds too much like the Arrow format itself.
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914327#comment-16914327 ]

Francois Saint-Jacques commented on ARROW-5691:
---

I prefer `src/arrow/format`, where we'd drop csv, json, orc, ...
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914320#comment-16914320 ]

Antoine Pitrou commented on ARROW-5691:
---

Inside "src/arrow/dataset" directory, intuitively I'd only put the dataset-compliant interfaces, not the actual file format handling routines (especially as they expose other APIs already).
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914317#comment-16914317 ]

Antoine Pitrou commented on ARROW-5691:
---

I also dislike the "adapters" name (too generic, it could be anything).

As for the Parquet-to-Arrow dependency, isn't it broken precisely if the action outlined in the title is achieved? ("relocate src/parquet/arrow to src/arrow/dataset/parquet")

I guess we could have a "formats" directory where we would put JSON, CSV, Parquet, ORC... Though I'm not fond of deep hierarchies and would also be fine with putting "json", "csv", "parquet", "orc"... directories at the Arrow top-level.
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914304#comment-16914304 ]

Wes McKinney commented on ARROW-5691:
-

FWIW I dislike the "adapters" name and I don't know that the "arrow/adapters/tensorflow" belongs in the same place as file format interfaces
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913863#comment-16913863 ]

Micah Kornfield commented on ARROW-5691:

Given the current organization of the code base, and based on [~xhochy]'s comment above, I think we should put the core logic of reading files under the adaptor folders (where ORC is currently located), then consume that from datasets.

I don't have a good enough mental model of the current .so dependencies to offer a meaningful opinion on that.
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913778#comment-16913778 ]

Wes McKinney commented on ARROW-5691:
-

[~emkornfi...@gmail.com] [~pitrou] thoughts about organizing our various file format interfaces (CSV, JSON, ORC, Parquet, eventually Avro)?

Wherever we organize the code, we have to have it build in a shared library separate from {{libarrow.so}} since it will need to depend on e.g. {{libparquet.so}} (which in turn depends on {{libarrow.so}})?
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870675#comment-16870675 ]

Wes McKinney commented on ARROW-5691:
-

Well, the symbols have to go into some shared library, so pick your poison.

My proposed solution

* libparquet depends on libarrow
* libarrow_dataset depends on both libarrow and libparquet, and contains arrow::csv, arrow::json, and arrow::parquet symbols, usable directly without going through datasets API.

What other structure would you prefer? I don't think we should create a standalone "libarrow_parquet" library.

Personally I would prefer to have a single shared library that contains all symbols related to low level (like the current {{parquet::arrow}} symbols) and high level (the proposed {{arrow::dataset}} APIs) interactions with file formats. Both low- and high-level file APIs will continue to be provided
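
The dependency structure proposed in this comment can be sanity-checked with a small sketch. The library names come from the thread, but the dictionaries and the link_order helper below are illustrative assumptions, not the real Arrow build configuration. The sketch orders the libraries so that each comes after everything it links against, and shows why putting the Parquet bridge symbols into libarrow itself would not work: libarrow would then depend on libparquet, which already depends on libarrow.

```python
def link_order(deps):
    """Return libraries ordered so each appears after its dependencies.

    deps maps a library name to the libraries it links against.
    Raises ValueError if the dependencies are circular.
    """
    order, done = [], set()

    def visit(lib, stack=()):
        if lib in done:
            return
        if lib in stack:
            chain = " -> ".join(stack + (lib,))
            raise ValueError("circular dependency: " + chain)
        for dep in deps.get(lib, ()):
            visit(dep, stack + (lib,))
        done.add(lib)
        order.append(lib)

    for lib in sorted(deps):
        visit(lib)
    return order


# Structure proposed in the comment above (illustrative graph only).
proposed = {
    "libarrow": [],
    "libparquet": ["libarrow"],
    "libarrow_dataset": ["libarrow", "libparquet"],
}

# Hypothetical alternative: bridge symbols live inside libarrow,
# so libarrow would need libparquet at link time.
bridge_in_libarrow = {
    "libarrow": ["libparquet"],
    "libparquet": ["libarrow"],
}

proposed_order = link_order(proposed)
try:
    link_order(bridge_in_libarrow)
    cycle_detected = False
except ValueError:
    cycle_detected = True
```

With the proposed graph the order is libarrow, then libparquet, then libarrow_dataset; the alternative graph is rejected as circular.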
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870498#comment-16870498 ]

Uwe L. Korn commented on ARROW-5691:

I would be 100% fine with moving it into {{src/arrow/parquet}}, but I somewhat question making the Parquet adaptor a full subset of the dataset project. For me these are two different entities: the adaptor provides access to the Parquet file format, either as standalone low-level access or as high-level reads into Arrow, whereas the dataset project builds on top of various adaptors but is not required for simple interactions with the file formats it supports.
[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
[ https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870356#comment-16870356 ]

Wes McKinney commented on ARROW-5691:
-

cc [~xhochy] if you have thoughts