[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914439#comment-16914439
 ] 

Antoine Pitrou commented on ARROW-5691:
---

Ah... Then perhaps the Parquet-Arrow bridge needs to be in a separate 
{{libarrow_parquet.so}} library. Not very pretty...

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914438#comment-16914438
 ] 

Wes McKinney commented on ARROW-5691:
-

I'm OK to leave things as they are now, since it's not really hurting anything. 
I mostly want to nix the {{parquet::arrow}} namespace and treat libparquet.so 
as a dependency of Arrow record batch assembly / disassembly

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914384#comment-16914384
 ] 

Antoine Pitrou commented on ARROW-5691:
---

If we have a "formats" directory I'd still like to have the per-format 
subdirectories. 

As for the Parquet symbols I'm not sure I understand: if we move the 
Arrow-specific code from Parquet to Arrow, is there still a dependency from 
libparquet.so to libarrow.so?

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914378#comment-16914378
 ] 

Wes McKinney commented on ARROW-5691:
-

We can also flatten these directories and instead use {{$FORMAT_}} prefixes, 
for example

{{src/arrow/formats/parquet_reader.h}}

etc. Thoughts?

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914381#comment-16914381
 ] 

Wes McKinney commented on ARROW-5691:
-

We still have the problem of which shared library to put the symbols in. 
libarrow.so is not an option for the Parquet symbols as discussed above

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914337#comment-16914337
 ] 

Antoine Pitrou commented on ARROW-5691:
---

Then "formats" plural, because "format" singular sounds too much like the Arrow 
format itself.

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914327#comment-16914327
 ] 

Francois Saint-Jacques commented on ARROW-5691:
---

I prefer `src/arrow/format`, where we'd drop csv, json, orc, ...

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914320#comment-16914320
 ] 

Antoine Pitrou commented on ARROW-5691:
---

Inside "src/arrow/dataset" directory, intuitively I'd only put the 
dataset-compliant interfaces, not the actual file format handling routines 
(especially as they expose other APIs already).

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914317#comment-16914317
 ] 

Antoine Pitrou commented on ARROW-5691:
---

I also dislike the "adapters" name (too generic, it could be anything).

As for the Parquet-to-Arrow dependency, isn't it broken precisely if the action 
outlined in the title is achieved? ("relocate src/parquet/arrow to 
src/arrow/dataset/parquet")

I guess we could have a "formats" directory where we would put JSON, CSV, 
Parquet, ORC...
Though I'm not fond of deep hierarchies and would also be fine with putting 
"json", "csv", "parquet", "orc"... directories at the Arrow top-level.

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-23 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914304#comment-16914304
 ] 

Wes McKinney commented on ARROW-5691:
-

FWIW I dislike the "adapters" name and I don't know that the 
"arrow/adapters/tensorflow" belongs in the same place as file format interfaces

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-22 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913863#comment-16913863
 ] 

Micah Kornfield commented on ARROW-5691:


Given the current organization of the code base and based on [~xhochy] comment 
above.  I think we should put the core logic of reading files under the adaptor 
folders (where ORC is currently located), then consume that from datasets.  I 
don't have a good mental model of the current .so dependencies to offer a 
meaningful opinion on that. 

 

 

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-08-22 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913778#comment-16913778
 ] 

Wes McKinney commented on ARROW-5691:
-

[~emkornfi...@gmail.com] [~pitrou] thoughts about organizing our various file 
format interfaces (CSV, JSON, ORC, Parquet, eventually Avro)? Wherever we 
organize the code, we have to have it build in a shared library separate from 
{{libarrow.so}} since it will need to depend on e.g. {{libparquet.so}} (which 
in turn depends on {{libarrow.so}})?

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-06-23 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870675#comment-16870675
 ] 

Wes McKinney commented on ARROW-5691:
-

Well, the symbols have to go into some shared library, so pick your poison. My 
proposed solution

* libparquet depends on libarrow
* libarrow_dataset depends on both libarrow and libparquet, and contains 
arrow::csv, arrow::json, and arrow::parquet symbols, usable directly without 
going through datasets API.

What other structure would you prefer? I don't think we should create a 
standalone "libarrow_parquet" library. 

Personally I would prefer to have a single shared library that contains all 
symbols related to low level (like the current {{parquet::arrow}} symbols) and 
high level (the proposed {{arrow::dataset}} APIs) interactions with file 
formats. Both low- and high-level file APIs will continue to be provided 

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-06-23 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870498#comment-16870498
 ] 

Uwe L. Korn commented on ARROW-5691:


I would be 100% fine with moving it into {{src/arrow/parquet}} but I question a 
bit of making the Parquet adaptor a full subset of the dataset project. For me 
these are two different entities, an adaptor providing access to the Parquet 
file format, either standalone low-level access or high-level reads into Arrow. 
Whereas, the dataset project builds on top of various adaptors but is not 
required for simple interactions with the file formats it supports.

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5691) [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet

2019-06-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870356#comment-16870356
 ] 

Wes McKinney commented on ARROW-5691:
-

cc [~xhochy] if you have thoughts

> [C++] Relocate src/parquet/arrow code to src/arrow/dataset/parquet
> --
>
> Key: ARROW-5691
> URL: https://issues.apache.org/jira/browse/ARROW-5691
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> I think it may make sense to continue developing and maintaining this code in 
> the same place as other file format <-> Arrow serialization code and dataset 
> handling routines (e.g. schema normalization). Under this scheme, libparquet 
> becomes a link time dependency of libarrow_dataset



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)