[jira] [Commented] (ARROW-7495) [Java] Remove "empty" concept from ArrowBuf, replace with custom referencemanager

2020-05-25 Thread Ji Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116386#comment-17116386
 ] 

Ji Liu commented on ARROW-7495:
---

Issue resolved by pull request 6433

[https://github.com/apache/arrow/pull/6433]

> [Java] Remove "empty" concept from ArrowBuf, replace with custom 
> referencemanager
> -
>
> Key: ARROW-7495
> URL: https://issues.apache.org/jira/browse/ARROW-7495
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Jacques Nadeau
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> With the introduction of ReferenceManager in the codebase, a separate empty
> ArrowBuf is no longer necessary. Instead, one can create a new reference
> manager that is used for the empty ArrowBuf. As a reminder, empty ArrowBufs
> have special behavior in that they don't actually have any reference
> counting semantics and always stay at one. This allows us to better
> troubleshoot unallocated memory than what would otherwise be an NPE after
> calling ValueVector.clear().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8171) Consider pre-allocating memory for fix-width vector in Avro adapter iterator

2020-05-25 Thread Ji Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116385#comment-17116385
 ] 

Ji Liu commented on ARROW-8171:
---

Issue resolved by pull request 7211

[https://github.com/apache/arrow/pull/7211]

> Consider pre-allocating memory for fix-width vector in Avro adapter iterator
> 
>
> Key: ARROW-8171
> URL: https://issues.apache.org/jira/browse/ARROW-8171
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-7495) [Java] Remove "empty" concept from ArrowBuf, replace with custom referencemanager

2020-05-25 Thread Siddharth Teotia (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia resolved ARROW-7495.
-
Resolution: Fixed

> [Java] Remove "empty" concept from ArrowBuf, replace with custom 
> referencemanager
> -
>
> Key: ARROW-7495
> URL: https://issues.apache.org/jira/browse/ARROW-7495
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Jacques Nadeau
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> With the introduction of ReferenceManager in the codebase, a separate empty
> ArrowBuf is no longer necessary. Instead, one can create a new reference
> manager that is used for the empty ArrowBuf. As a reminder, empty ArrowBufs
> have special behavior in that they don't actually have any reference
> counting semantics and always stay at one. This allows us to better
> troubleshoot unallocated memory than what would otherwise be an NPE after
> calling ValueVector.clear().





[jira] [Resolved] (ARROW-8171) Consider pre-allocating memory for fix-width vector in Avro adapter iterator

2020-05-25 Thread Siddharth Teotia (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia resolved ARROW-8171.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

> Consider pre-allocating memory for fix-width vector in Avro adapter iterator
> 
>
> Key: ARROW-8171
> URL: https://issues.apache.org/jira/browse/ARROW-8171
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-8909) [Java] Out of order writes using setSafe

2020-05-25 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116350#comment-17116350
 ] 

Liya Fan commented on ARROW-8909:
-

[~saurabhm] Thank you for reporting the problem.
I think the behavior is by design. For variable width vectors, we do not 
support setting values in random order, as this might cause a severe 
performance penalty. 
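The constraint can be illustrated with a toy model of a variable-width layout. This is plain Python, not the actual Arrow Java implementation, and the class and method names are hypothetical; it only shows why rewriting an earlier index moves the write position backwards so that later entries are no longer delimited.

```python
# Toy model of a variable-width vector: a flat data buffer plus an offsets
# buffer, where offsets[i]..offsets[i+1] delimit value i. Writing at an
# index assumes all later values will be (re)written afterwards.
class ToyVarWidthVector:
    def __init__(self):
        self.data = bytearray()
        self.offsets = [0]

    def set(self, index, value):
        start = self.offsets[index]
        # Writing at `index` resets the end of the buffer to that position,
        # mirroring how offset tracking assumes monotonically increasing
        # indices; everything previously written after `index` is lost.
        del self.data[start:]
        del self.offsets[index + 1:]
        self.data.extend(value)
        self.offsets.append(start + len(value))

    def get(self, index):
        if index + 1 >= len(self.offsets):
            return b""  # value is no longer delimited by the offsets
        return bytes(self.data[self.offsets[index]:self.offsets[index + 1]])

vec = ToyVarWidthVector()
for i in range(10):
    vec.set(i, f"{i}_mtest".encode())
assert vec.get(8) == b"8_mtest"
vec.set(7, b"7_new")       # out-of-order rewrite of an earlier index
assert vec.get(8) == b""   # data after index 7 is gone
```

In the real vectors the symptom differs in detail (stale offsets can expose garbage before the row count is set), but the underlying cause is the same: the layout has no room to grow an earlier value in place.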

> [Java] Out of order writes using setSafe
> 
>
> Key: ARROW-8909
> URL: https://issues.apache.org/jira/browse/ARROW-8909
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Saurabh
>Priority: Major
>
> I noticed that calling setSafe on a VarCharVector with indices not in 
> increasing order causes the lastIndex to be set to the index in the last call 
> to setSafe.
> Is this documented and expected behavior?
> Sample code:
> {code:java}
> import java.util.Collections;
> import lombok.extern.slf4j.Slf4j;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.VarCharVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.types.pojo.ArrowType;
> import org.apache.arrow.vector.types.pojo.Field;
> import org.apache.arrow.vector.types.pojo.Schema;
> import org.apache.arrow.vector.util.Text;
> @Slf4j
> public class ATest {
>   public static void main(String[] args) {
>     Schema schema = new Schema(Collections.singletonList(
>         Field.nullable("Data", new ArrowType.Utf8())));
>     try (VectorSchemaRoot vroot = VectorSchemaRoot.create(schema,
>         new RootAllocator())) {
>       VarCharVector vec = (VarCharVector) vroot.getVector("Data");
>       for (int i = 0; i < 10; i++) {
>         vec.setSafe(i, new Text(Integer.toString(i) + "_mtest"));
>       }
>       vec.setSafe(7, new Text(Integer.toString(7) + "_new"));
>       log.info("Data at index 8 Before {}", vec.getObject(8));
>       vroot.setRowCount(10);
>       log.info("Data at index 8 After {}", vec.getObject(8));
>       log.info(vroot.contentToTSVString());
>     }
>   }
> }
> {code}
>  
> If I don't set index 7 after the loop, I get all the 0_mtest, 1_mtest, 
> ..., 9_mtest entries.
> If I set index 7 after the loop, I see 0_mtest, ..., 5_mtest, 6_mtest, 7_new.
> Before the setRowCount, the data at index 8 is *st8_mtest* and index 9 is 
> *9_mtest*.
> After the setRowCount, the data at index 8 is "" and index 9 is "".
> With a text longer than the 4 characters of "_new", it keeps eating into the 
> data at the following indices.
>  





[jira] [Commented] (ARROW-8214) [C++] Flatbuffers based serialization protocol for Expressions

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116339#comment-17116339
 ] 

Wes McKinney commented on ARROW-8214:
-

Yes, I would definitely like to see that happen. Using Flatbuffers is desirable 
to avoid the need to link libprotobuf.a

> [C++] Flatbuffers based serialization protocol for Expressions
> --
>
> Key: ARROW-8214
> URL: https://issues.apache.org/jira/browse/ARROW-8214
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: dataset
>
> It might provide a more scalable solution for serialization.
> cc [~bkietz] [~fsaintjacques]





[jira] [Commented] (ARROW-8214) [C++] Flatbuffers based serialization protocol for Expressions

2020-05-25 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116338#comment-17116338
 ] 

Micah Kornfield commented on ARROW-8214:


When someone takes this up, it would be nice to have a discussion on how this 
will be unified with the Gandiva protobuf expression representation.

> [C++] Flatbuffers based serialization protocol for Expressions
> --
>
> Key: ARROW-8214
> URL: https://issues.apache.org/jira/browse/ARROW-8214
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: dataset
>
> It might provide a more scalable solution for serialization.
> cc [~bkietz] [~fsaintjacques]





[jira] [Resolved] (ARROW-8772) [C++] Expand SumKernel benchmark to more types

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8772.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7267
[https://github.com/apache/arrow/pull/7267]

> [C++] Expand SumKernel benchmark to more types
> --
>
> Key: ARROW-8772
> URL: https://issues.apache.org/jira/browse/ARROW-8772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Expand the SumKernel benchmark to cover more types: Float, Double, Int8, 
> Int16, Int32, Int64.
> Currently it only covers Int64; broader coverage is useful for further 
> optimization work.





[jira] [Updated] (ARROW-8860) [C++] IPC/Feather decompression broken for nested arrays

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8860:

Priority: Critical  (was: Major)

> [C++] IPC/Feather decompression broken for nested arrays
> 
>
> Key: ARROW-8860
> URL: https://issues.apache.org/jira/browse/ARROW-8860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When writing a table with a Struct typed column, this is read back with 
> garbage values when using compression (which is the default):
> {code:python}
> >>> table = pa.table({'col': pa.StructArray.from_arrays([[0, 1, 2], [1, 2, 3]],
> ...                                                      names=["f1", "f2"])})
> # roundtrip through feather
> >>> feather.write_feather(table, "test_struct.feather")
> >>> table2 = feather.read_table("test_struct.feather")
> >>> table2.column("col")
> 
> [
>   -- is_valid: all not null
>   -- child 0 type: int64
> [
>   24,
>   1261641627085906436,
>   1369095386551025664
> ]
>   -- child 1 type: int64
> [
>   24,
>   1405756815161762308,
>   281479842103296
> ]
> ]
> {code}
> When not using compression, it is read back correctly:
> {code:python}
> >>> feather.write_feather(table, "test_struct.feather",
> ...                       compression="uncompressed")
> >>> table2 = feather.read_table("test_struct.feather")
> >>> table2.column("col")
> 
> [
>   -- is_valid: all not null
>   -- child 0 type: int64
> [
>   0,
>   1,
>   2
> ]
>   -- child 1 type: int64
> [
>   1,
>   2,
>   3
> ]
> ]
> {code}





[jira] [Updated] (ARROW-8873) [Plasma][C++] Usage model for Object IDs. Object IDs don't disappear after delete

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8873:

Summary: [Plasma][C++] Usage model for Object IDs. Object IDs don't 
disappear after delete  (was: Usage model for Object IDs. Object IDs don't 
disappear after delete)

> [Plasma][C++] Usage model for Object IDs. Object IDs don't disappear after 
> delete
> -
>
> Key: ARROW-8873
> URL: https://issues.apache.org/jira/browse/ARROW-8873
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++, Python
>Affects Versions: 0.17.0
>Reporter: Abe Mammen
>Priority: Major
>
> I have an environment that uses Arrow + Plasma to send requests between 
> Python clients and a C++ server that responds with search results etc.
> I use a sequence-number-based approach for Object ID creation so it's 
> understood on both sides. All that works well. Each request from the client 
> creates a unique Object ID, then creates and seals it. On the other end, a 
> get against that Object ID retrieves the request payload, then releases and 
> deletes the Object ID. A similar scheme is used for responses from the 
> server side to the client (to return search results etc.), where the server 
> creates its own unique Object ID understood by the client. The server side 
> creates and seals, and the Python client side does a get and deletes the 
> Object ID (there appears to be no release method in Python). I have 
> experimented with deleting the plasma buffer.
> The end result is that as transactions build up, the server-side memory use 
> goes way up, and I can see that a good number of the objects aren't deleted 
> from the Plasma store until the server exits. I have nulled out the search 
> result part too, so that is not what is accumulating. I have not done a 
> memory profile but wanted to get some feedback on what might be wrong here.
> Is there a better way to use Object IDs, for example? And what might be 
> causing the huge memory usage? In this example, I had ~4M transactions 
> between clients and the server, which hit a memory usage of about 10+ GB, 
> which is in the ballpark of the total size of all the payloads. Besides 
> doing release-deletes on Object IDs, is there a better way to purge and 
> remove these objects?
> Any help is appreciated.





[jira] [Updated] (ARROW-8801) [Python] Memory leak on read from parquet file with UTC timestamps using pandas

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8801:

Priority: Blocker  (was: Major)

> [Python] Memory leak on read from parquet file with UTC timestamps using 
> pandas
> ---
>
> Key: ARROW-8801
> URL: https://issues.apache.org/jira/browse/ARROW-8801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0, 0.17.0
> Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, python 3.7.5, 
> mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, 
> ubuntu 20.04 (linux).
>Reporter: Rauli Ruohonen
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Given dump.py script 
>  
> {code:java}
> import pandas as pd
> import numpy as np
> x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', 
> utc=True)
> pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', 
> compression=None)
> {code}
> and load.py script
>  
> {code:java}
> import sys
> import pandas as pd
> def foo(engine):
>     for _ in range(2**9):
>         pd.read_parquet('data.parquet', engine=engine)
>     print('Done')
>     input()
> foo(sys.argv[1])
> {code}
> running first "python dump.py" and then "python load.py pyarrow", on my 
> machine Python memory usage stays at 4+ GB while it waits for input. Using 
> "python load.py fastparquet" instead, it is about 100 MB, so this appears to 
> be a pyarrow issue rather than a pandas issue. The leak disappears if 
> "utc=True" is removed from dump.py, in which case the timestamp is 
> timezone-unaware.
>  
>  





[jira] [Closed] (ARROW-8580) Pyarrow exceptions are not helpful

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8580.
---
Resolution: Cannot Reproduce

If you can provide a reproducible example of such an unhelpful error message, 
we will certainly fix it.

> Pyarrow exceptions are not helpful
> --
>
> Key: ARROW-8580
> URL: https://issues.apache.org/jira/browse/ARROW-8580
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Soroush Radpour
>Priority: Major
>
> I'm trying to understand an exception in the code using pyarrow, and it is 
> not very helpful.
> File "pyarrow/_parquet.pyx", line 1036, in pyarrow._parquet.ParquetReader.open
>  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>  OSError: IOError: b'Service Unavailable'. Detail: Python exception: 
> RuntimeError
>   
>   It would be great if each of the three exceptions was unwrapped with full 
> stack trace and error messages that came with it.





[jira] [Updated] (ARROW-8671) [C++] Use IPC body compression metadata approved in ARROW-300

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8671:

Priority: Critical  (was: Major)

> [C++] Use IPC body compression metadata approved in ARROW-300 
> --
>
> Key: ARROW-8671
> URL: https://issues.apache.org/jira/browse/ARROW-8671
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 1.0.0
>
>
> This will adapt the existing code to use the new metadata, while maintaining 
> backward compatibility code to recognize the "experimental" metadata written 
> in 0.17.0





[jira] [Updated] (ARROW-8462) [Python] Crash in lib.concat_tables on Windows

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8462:

Summary: [Python] Crash in lib.concat_tables on Windows  (was: Crash in 
lib.concat_tables on Windows)

> [Python] Crash in lib.concat_tables on Windows
> --
>
> Key: ARROW-8462
> URL: https://issues.apache.org/jira/browse/ARROW-8462
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Tom Augspurger
>Priority: Major
>
> This crashes for me with pyarrow 0.16 on my Windows VM
> {code:python}
> import pyarrow as pa
> import pandas as pd
> t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
> print("concat")
> pa.lib.concat_tables([t])
> print('done')
> {code}
> Installed pyarrow from conda-forge. I'm not really sure how to get more debug 
> info on Windows, unfortunately. With `python -X faulthandler` I see
> {code}
> Windows fatal exception: access violation
> Current thread 0x04f8 (most recent call first):
>   File "bug.py", line 6 in (module)
> {code}





[jira] [Closed] (ARROW-8930) [C++] libz.so linking error with liborc.a

2020-05-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou closed ARROW-8930.
---
Resolution: Duplicate

> [C++] libz.so linking error with liborc.a
> -
>
> Key: ARROW-8930
> URL: https://issues.apache.org/jira/browse/ARROW-8930
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This is failing in the Travis CI ARM build
> https://travis-ci.org/github/apache/arrow/jobs/690722203
> {code}
> : && /usr/bin/ccache /usr/bin/c++  -Wno-noexcept-type  
> -fdiagnostics-color=always -ggdb -O0  -Wall -Wno-conversion 
> -Wno-sign-conversion -Wno-unused-variable -Werror -march=armv8-a  -g  
> -rdynamic 
> src/arrow/adapters/orc/CMakeFiles/arrow-orc-adapter-test.dir/adapter_test.cc.o
>   -o debug/arrow-orc-adapter-test  -Wl,-rpath,/build/cpp/debug  
> debug/libarrow_testing.a  debug/libarrow.a  debug//libgtest_maind.so  
> debug//libgtestd.so  /usr/lib/aarch64-linux-gnu/libsnappy.so.1.1.8  
> /usr/lib/aarch64-linux-gnu/liblz4.so  /usr/lib/aarch64-linux-gnu/libz.so  
> -lpthread  -ldl  orc_ep-install/lib/liborc.a  
> /usr/lib/aarch64-linux-gnu/libssl.so  /usr/lib/aarch64-linux-gnu/libcrypto.so 
>  /usr/lib/aarch64-linux-gnu/libbrotlienc.so  
> /usr/lib/aarch64-linux-gnu/libbrotlidec.so  
> /usr/lib/aarch64-linux-gnu/libbrotlicommon.so  
> /usr/lib/aarch64-linux-gnu/libbz2.so  /usr/lib/aarch64-linux-gnu/libzstd.so  
> /usr/lib/aarch64-linux-gnu/libprotobuf.so  
> /usr/lib/aarch64-linux-gnu/libglog.so  
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a  -pthread  
> -lrt && :
> /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): undefined 
> reference to symbol 'inflateEnd'
> /usr/bin/ld: /usr/lib/aarch64-linux-gnu/libz.so: error adding symbols: DSO 
> missing from command line
> collect2: error: ld returned 1 exit status
> {code}





[jira] [Updated] (ARROW-8293) [Python] Run flake8 on python/examples also

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8293:

Fix Version/s: 1.0.0

> [Python] Run flake8 on python/examples also
> ---
>
> Key: ARROW-8293
> URL: https://issues.apache.org/jira/browse/ARROW-8293
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There are flakes in these files





[jira] [Commented] (ARROW-8214) [C++] Flatbuffers based serialization protocol for Expressions

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116250#comment-17116250
 ] 

Wes McKinney commented on ARROW-8214:
-

We will need to create a serialization scheme for general array expressions 
(for use with arrow/compute)

> [C++] Flatbuffers based serialization protocol for Expressions
> --
>
> Key: ARROW-8214
> URL: https://issues.apache.org/jira/browse/ARROW-8214
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: dataset
>
> It might provide a more scalable solution for serialization.
> cc [~bkietz] [~fsaintjacques]





[jira] [Closed] (ARROW-8180) [C++] Should default_memory_pool() be in arrow/type_fwd.h?

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8180.
---
Resolution: Not A Problem

Closing as not a problem

> [C++] Should default_memory_pool() be in arrow/type_fwd.h?
> --
>
> Key: ARROW-8180
> URL: https://issues.apache.org/jira/browse/ARROW-8180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This seemed somewhat odd to me. It might be better from an IWYU-perspective 
> to move this to arrow/memory_pool.h





[jira] [Commented] (ARROW-8173) [C++] Validate ChunkedArray()'s arguments

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116246#comment-17116246
 ] 

Wes McKinney commented on ARROW-8173:
-

{{ChunkedArray::MakeSafe}}?

> [C++] Validate ChunkedArray()'s arguments
> -
>
> Key: ARROW-8173
> URL: https://issues.apache.org/jira/browse/ARROW-8173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> ChunkedArray has constraints on type uniformity of chunks which are currently 
> only expressed in comments. At minimum, debug checks should be added to ensure 
> (for example) that an explicit type is shared by all chunks; at best, the 
> public constructor should be replaced with 
> {{Result<std::shared_ptr<ChunkedArray>> ChunkedArray::Make(...)}}.
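As a sketch of the validation such a Make-style factory would perform, here is a plain-Python stand-in (not the actual Arrow C++ API; `make_chunked` and `Chunk` are hypothetical names): chunks with disagreeing types are rejected instead of being silently accepted.

```python
from collections import namedtuple

# Minimal stand-in for an array chunk: a type tag plus its values.
Chunk = namedtuple("Chunk", ["type", "values"])

def make_chunked(chunks, dtype=None):
    """Validating factory: returns (chunks, dtype) or raises ValueError,
    playing the role of a Result<shared_ptr<ChunkedArray>>."""
    if dtype is None:
        if not chunks:
            raise ValueError("cannot infer a type from zero chunks")
        dtype = chunks[0].type
    for i, chunk in enumerate(chunks):
        if chunk.type != dtype:
            raise ValueError(
                f"chunk {i} has type {chunk.type!r}, expected {dtype!r}")
    return chunks, dtype

chunks, typ = make_chunked([Chunk("int64", [1, 2]), Chunk("int64", [3])])
assert typ == "int64"
```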





[jira] [Commented] (ARROW-7871) [Python] Expose more compute kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116245#comment-17116245
 ] 

Wes McKinney commented on ARROW-7871:
-

I unassigned the issue from myself. Perhaps some others can write a PR that 
adds simple wrappers to {{pyarrow.compute.call_function}} for the missing 
function types
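The wrapper pattern being suggested can be sketched as follows. The registry below is a stub so the example runs standalone; in real code `call_function` would be `pyarrow.compute.call_function` and the names would be actual compute kernel names.

```python
# Stub kernel registry standing in for pyarrow's compute function registry.
_kernels = {
    "sum": lambda args: sum(args[0]),
    "min_value": lambda args: min(args[0]),
}

def call_function(name, args):
    # Stand-in for pyarrow.compute.call_function(name, args).
    return _kernels[name](args)

def _make_wrapper(name):
    # One thin Python wrapper per kernel name, generated from the name alone.
    def wrapper(arg):
        return call_function(name, [arg])
    wrapper.__name__ = name
    wrapper.__doc__ = f"Call the '{name}' compute kernel on a single input."
    return wrapper

wrappers = {name: _make_wrapper(name) for name in _kernels}

assert wrappers["sum"]([1, 2, 3]) == 6
```

A real PR would attach these wrappers to the `pyarrow.compute` module namespace rather than a dict, but the generation-by-name step is the same.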

> [Python] Expose more compute kernels
> 
>
> Key: ARROW-7871
> URL: https://issues.apache.org/jira/browse/ARROW-7871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>
> Currently only the sum kernel is exposed.
> Alternatively, consider deprecating/removing the pyarrow.compute module and 
> binding the compute kernels as methods instead.





[jira] [Commented] (ARROW-7871) [Python] Expose more compute kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116244#comment-17116244
 ] 

Wes McKinney commented on ARROW-7871:
-

This is extremely easy now that functions/kernels can be exposed in pure 
Python using only their names. 

> [Python] Expose more compute kernels
> 
>
> Key: ARROW-7871
> URL: https://issues.apache.org/jira/browse/ARROW-7871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Wes McKinney
>Priority: Major
>
> Currently only the sum kernel is exposed.
> Alternatively, consider deprecating/removing the pyarrow.compute module and 
> binding the compute kernels as methods instead.





[jira] [Assigned] (ARROW-7871) [Python] Expose more compute kernels

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7871:
---

Assignee: (was: Wes McKinney)

> [Python] Expose more compute kernels
> 
>
> Key: ARROW-7871
> URL: https://issues.apache.org/jira/browse/ARROW-7871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>
> Currently only the sum kernel is exposed.
> Alternatively, consider deprecating/removing the pyarrow.compute module and 
> binding the compute kernels as methods instead.





[jira] [Commented] (ARROW-7822) [C++] Allocation free error Status constants

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116243#comment-17116243
 ] 

Wes McKinney commented on ARROW-7822:
-

I'm not sure that non-OK Status should ever be found on a performance hot path. 
That would indicate that Status is being used inappropriately for control flow. 
Unless I have misunderstood the issue?

> [C++] Allocation free error Status constants
> 
>
> Key: ARROW-7822
> URL: https://issues.apache.org/jira/browse/ARROW-7822
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> {{Status::state_}} could be made a tagged pointer without affecting the fast 
> path (passing around a non-error status). The extra bit could be used to mark 
> a Status' state as heap allocated or not, allowing error statuses to be 
> extremely cheap when their error state is known to be immutable. For example, 
> this would allow a cheap default of {{Result<>::status_}}.
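The tagged-pointer idea can be modeled in a few lines. This is a toy illustration in plain Python (ints standing in for aligned pointers), not the Arrow C++ implementation; the names are hypothetical. The low bit marks a state as statically allocated, so destruction can skip the heap free.

```python
# Aligned pointers have a zero low bit, so that bit is free to carry a tag.
STATIC_TAG = 0x1

def tag_static(ptr):
    # Mark a (pretend) pointer to an immutable, statically allocated state.
    assert ptr % 2 == 0, "state pointers are at least 2-byte aligned"
    return ptr | STATIC_TAG

def is_static(tagged):
    # Destructor check: a static state must not be freed.
    return bool(tagged & STATIC_TAG)

def untag(tagged):
    # Recover the real pointer before dereferencing.
    return tagged & ~STATIC_TAG

heap_state = 0x7F0000001000        # pretend heap allocation
static_state = tag_static(0x5000)  # pretend address of a global constant

assert not is_static(heap_state)
assert is_static(static_state)
assert untag(static_state) == 0x5000
```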





[jira] [Commented] (ARROW-7784) [C++] diff.cc is extremely slow to compile

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116238#comment-17116238
 ] 

Wes McKinney commented on ARROW-7784:
-

"QuadraticSpaceMyersDiff" is being instantiated for every Arrow type. Given 
that this code is not performance sensitive, I would suggest refactoring this 
code to only instantiate a single implementation of the Diff algorithm (rather 
than 25+ instantiations) and where relevant introduce a virtual interface for 
interacting with values in different-type arrays. 

> [C++] diff.cc is extremely slow to compile
> --
>
> Key: ARROW-7784
> URL: https://issues.apache.org/jira/browse/ARROW-7784
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 1.0.0
>
>
> This comes up especially when doing an optimized build. {{diff.cc}} is always 
> enabled even if all components are disabled, and it takes multiple seconds to 
> compile. 





[jira] [Commented] (ARROW-7409) [C++][Python] Windows link error LNK1104: cannot open file 'python37_d.lib'

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116234#comment-17116234
 ] 

Wes McKinney commented on ARROW-7409:
-

[~rbocanegra] any update?

> [C++][Python] Windows link error LNK1104: cannot open file 'python37_d.lib'
> ---
>
> Key: ARROW-7409
> URL: https://issues.apache.org/jira/browse/ARROW-7409
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
>Reporter: Raul Bocanegra
>Assignee: Raul Bocanegra
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: fix-msvc-link-python-debug.patch
>
>
> When I build arrow_python on Windows in debug mode, it raises a link error: 
> {{LNK1104: cannot open file 'python37_d.lib'}}.
> I have been looking at the CMake files, and it seems that we are forcing a 
> link against the release Python lib in debug mode.
> I have edited the CMake files in order to fix this bug, see 
> [^fix-msvc-link-python-debug.patch].
> It is just a 3-line change and makes the debug version of arrow_python link 
> on Windows.
> I could do a PR if you find it useful.





[jira] [Updated] (ARROW-8939) [C++] Arrow-native C++ Data Frame-style programming interface for analytics (umbrella issue)

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8939:

Summary: [C++] Arrow-native C++ Data Frame-style programming interface for 
analytics (umbrella issue)  (was: [C++] Arrow C++ Data Frame-style programming 
interface for analytics (umbrella issue))

> [C++] Arrow-native C++ Data Frame-style programming interface for analytics 
> (umbrella issue)
> 
>
> Key: ARROW-8939
> URL: https://issues.apache.org/jira/browse/ARROW-8939
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This is an umbrella issue for the "C++ Data Frame" project that has been 
> discussed on the mailing list with the following Google docs overview
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit
> I will attach issues to this JIRA to help organize and track the project as 
> we make progress.





[jira] [Created] (ARROW-8939) [C++] Arrow C++ Data Frame-style programming interface for analytics (umbrella issue)

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8939:
---

 Summary: [C++] Arrow C++ Data Frame-style programming interface 
for analytics (umbrella issue)
 Key: ARROW-8939
 URL: https://issues.apache.org/jira/browse/ARROW-8939
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This is an umbrella issue for the "C++ Data Frame" project that has been 
discussed on the mailing list with the following Google docs overview

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit

I will attach issues to this JIRA to help organize and track the project as we 
make progress.





[jira] [Assigned] (ARROW-7394) [C++][DataFrame] Implement zero-copy optimizations when performing Filter

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7394:
---

Assignee: Wes McKinney

> [C++][DataFrame] Implement zero-copy optimizations when performing Filter
> -
>
> Key: ARROW-7394
> URL: https://issues.apache.org/jira/browse/ARROW-7394
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: dataframe
>
> For high-selectivity filters (most elements included), it may be wasteful and 
> slow to copy large contiguous ranges of array chunks into the resulting 
> ChunkedArray. Instead, we can scan the filter boolean array and slice off 
> chunks of the source data rather than copying. 
> We will need to empirically determine how large the contiguous range needs to 
> be in order to merit the slice-based approach versus simple/native 
> materialization. For example, in a filter array like
> 1 0 1 0 1 0 1 0 1
> it would not make sense to slice 5 times because slicing carries some 
> overhead. But if we had
> 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 
> then performing 4 slices may be faster than doing a copy materialization. 
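The run-scanning idea above can be sketched in plain Python (hypothetical helper names and a made-up `min_run` threshold standing in for the empirically determined cutoff; this is not the actual Arrow C++ implementation):

```python
def selected_runs(mask):
    """Yield (start, length) for each maximal run of truthy values in mask."""
    start = None
    for i, keep in enumerate(mask):
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            yield (start, i - start)
            start = None
    if start is not None:
        yield (start, len(mask) - start)

def filter_by_slicing(values, mask, min_run=32):
    """Filter values by mask, choosing slice-based or copy-based materialization per run."""
    out = []
    for start, length in selected_runs(mask):
        if length >= min_run:
            # Long contiguous run: in Arrow this would be a zero-copy
            # Array::Slice of the source chunk.
            out.extend(values[start:start + length])
        else:
            # Short run: element-wise copy, since slicing carries overhead.
            out.extend(values[i] for i in range(start, start + length))
    return out
```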





[jira] [Commented] (ARROW-7245) [C++] Allow automatic String -> LargeString promotions when concatenating tables

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116231#comment-17116231
 ] 

Wes McKinney commented on ARROW-7245:
-

Perhaps Concatenate can be reimplemented as a vector kernel, so that type 
promotions can be handled by the kernel execution machinery
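A minimal sketch of the promotion rule at stake (hypothetical function and type names, not an Arrow API): Arrow's string type uses 32-bit offsets, so concatenation must promote to large_string once the combined character data could overflow them.

```python
# Hypothetical promotion check for concatenating string-like columns.
# string uses int32 offsets, capping total character data at 2**31 - 1
# bytes; large_string uses int64 offsets.
INT32_MAX = 2**31 - 1

def promoted_type(types, total_data_bytes):
    """Pick the result type when concatenating string-like columns."""
    if any(t == "large_string" for t in types) or total_data_bytes > INT32_MAX:
        return "large_string"
    return "string"
```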

> [C++] Allow automatic String -> LargeString promotions when concatenating 
> tables
> 
>
> Key: ARROW-7245
> URL: https://issues.apache.org/jira/browse/ARROW-7245
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> inspired by GitHub issue https://github.com/apache/arrow/issues/5874





[jira] [Closed] (ARROW-7316) [C++] compile error due to incomplete type for unique_ptr

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-7316.
---
Resolution: Cannot Reproduce

> [C++] compile error due to incomplete type for unique_ptr
> -
>
> Key: ARROW-7316
> URL: https://issues.apache.org/jira/browse/ARROW-7316
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
> Environment: WSL, conda, arrow version 0.15
>Reporter: Danny Kim
>Priority: Major
>
> Hi, 
> I am getting following compile error from Arrow c++
> {code:java}
> Warning: Can't read registry to find the necessary compiler setting 
> Make sure that Python modules winreg, win32api or win32con are installed.C 
> compiler: /home/danny/miniconda3/envs/DEV/bin/x86_64-conda_cos6-linux-gnu-cc 
> -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall 
> -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC 
> -fstack-protector-strong -fno-plt -O2 -pipe -march=nocona -mtune=haswell 
> -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt 
> -O2 -pipe -march=nocona -mtune=haswell -ftree-vectorize -fPIC 
> -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -DNDEBUG 
> -D_FORTIFY_SOURCE=2 -O2 -fPIC 
> compile options: '-DBUILTIN_PARQUET_READER -I. 
> -I/home/danny/miniconda3/envs/DEV/include 
> -I/home/danny/miniconda3/envs/DEV/include/python3.7m -c'
> extra options: '-std=c++11 -g0 -O3'
> x86_64-conda_cos6-linux-gnu-cc: bodo/io/_parquet.cpp
> x86_64-conda_cos6-linux-gnu-cc: bodo/io/_parquet_reader.cpp
> cc1plus: warning: command line option '-Wstrict-prototypes' is valid for 
> C/ObjC but not for C++
> cc1plus: warning: command line option '-Wstrict-prototypes' is valid for 
> C/ObjC but not for C++
> In file included from 
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/memory:80:0,
>  from /home/danny/miniconda3/envs/DEV/include/parquet/arrow/reader.h:22,
>  from 
> bodo/io/_parquet.cpp:13:/home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/unique_ptr.h:
>  In instantiation of 'void std::default_delete<_Tp>::operator()(_Tp*) const 
> [with _Tp = arrow::RecordBatchReader]':
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/unique_ptr.h:268:17:
>  required from 'std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = 
> arrow::RecordBatchReader; _Dp = 
> std::default_delete]'/home/danny/miniconda3/envs/DEV/include/parquet/arrow/reader.h:161:49:
>  required from here
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/unique_ptr.h:76:22:
>  error: invalid application of 'sizeof' to incomplete type 
> 'arrow::RecordBatchReader'
>  static_assert(sizeof(_Tp)>0, ^In file included from 
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr.h:52:0,
>  from 
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/memory:81,
>  from /home/danny/miniconda3/envs/DEV/include/parquet/arrow/reader.h:22,
>  from bodo/io/_parquet.cpp:13:
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:
>  In instantiation of 'std::__shared_ptr<_Tp, _Lp>::__shared_ptr(_Yp*) 
> [with _Yp = arrow::RecordBatchReader;  = void; _Tp = 
> arrow::RecordBatchReader; __gnu_cxx::_Lock_policy _Lp = 
> (__gnu_cxx::_Lock_policy)2]':
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:1243:4:
>  required from 'std::__shared_ptr<_Tp, _Lp>::_SafeConv<_Yp> 
> std::__shared_ptr<_Tp, _Lp>::reset(_Yp*) [with _Yp = 
> arrow::RecordBatchReader; _Tp = arrow::RecordBatchReader; 
> __gnu_cxx::_Lock_policy _Lp = (__gnu_cxx::_Lock_policy)2; 
> std::__shared_ptr<_Tp, _Lp>::_SafeConv<_Yp> = void]'
> /home/danny/miniconda3/envs/DEV/include/parquet/arrow/reader.h:164:29: 
> required from here
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:1082:25:
>  error: invalid application of 'sizeof' to incomplete type 
> 'arrow::RecordBatchReader'
>  static_assert( sizeof(_Yp) > 0, "incomplete type" );
>  ^
> error: Command 
> "/home/danny/miniconda3/envs/DEV/bin/x86_64-conda_cos6-linux-gnu-cc 
> -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall 
> -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC 
> -fstack-protector-strong -fno-plt -O2 -pipe -march=nocona -mtune=haswell 
> -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe 
> -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
> -fno-plt -O2 -ffunction-sections -pipe -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -fPIC 
> -DBUILTIN_PARQUET_READER -I. -I/home/danny/miniconda3/envs/DEV/include 
> 

[jira] [Resolved] (ARROW-7230) [C++] Use vendored std::optional instead of boost::optional in Gandiva

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7230.
-
  Assignee: Neal Richardson  (was: Projjal Chanda)
Resolution: Fixed

This was done in 
https://github.com/apache/arrow/commit/96217193fc726b675969e91e86a63407bc8dce99

> [C++] Use vendored std::optional instead of boost::optional in Gandiva
> --
>
> Key: ARROW-7230
> URL: https://issues.apache.org/jira/browse/ARROW-7230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Gandiva
>Reporter: Wes McKinney
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> This may help with overall codebase consistency





[jira] [Commented] (ARROW-7179) [C++][Compute] Coalesce kernel

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116227#comment-17116227
 ] 

Wes McKinney commented on ARROW-7179:
-

We can implement this either as a Binary or VarArgs scalar kernel

> [C++][Compute] Coalesce kernel
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> Add a kernel which replaces null values in an array with a scalar value or 
> with values taken from another array:
> {code}
> coalesce([1, 2, null, 3], 5) -> [1, 2, 5, 3]
> coalesce([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}
> The code in {{take_internal.h}} should be of some use with a bit of 
> refactoring.
> A filter Expression should be added at the same time.
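The intended semantics can be illustrated in plain Python, with None standing in for null (an illustration of the contract only, not the kernel implementation):

```python
def coalesce(values, fill):
    """Replace nulls with a scalar, or element-wise from a second array."""
    if isinstance(fill, list):
        # Array fill: a null stays null only if both inputs are null there.
        return [f if v is None else v for v, f in zip(values, fill)]
    # Scalar fill: every null becomes the scalar.
    return [fill if v is None else v for v in values]
```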





[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116225#comment-17116225
 ] 

Wes McKinney commented on ARROW-7083:
-

Note that we should be able to add Gandiva-generated kernels (with some glue) 
to {{arrow::compute::Function}} instances. Perhaps we can create an 
{{arrow::compute::GandivaFunction}} that provides the wrapping magic

> [C++] Determine the feasibility and build a prototype to replace 
> compute/kernels with gandiva kernels
> -
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>  
> Requirements:
> 1.  No hard runtime dependency on LLVM
> 2.  Ability to run without LLVM static/shared libraries.
>  
> Open questions:
> 1.  What dependencies does this add to the build tool chain?





[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116224#comment-17116224
 ] 

Wes McKinney commented on ARROW-7083:
-

I'm inclined to close this issue. After much study, I believe the best we 
can do is to take the single-value kernel implementations found in

https://github.com/apache/arrow/tree/master/cpp/src/gandiva/precompiled

and move them to inline-able header files. Then two things happen:

* These inline functions are translated to LLVM IR for use in Gandiva
* The inline functions form the basis for pre-compiled array kernels in 
arrow/compute

> [C++] Determine the feasibility and build a prototype to replace 
> compute/kernels with gandiva kernels
> -
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>  
> Requirements:
> 1.  No hard runtime dependency on LLVM
> 2.  Ability to run without LLVM static/shared libraries.
>  
> Open questions:
> 1.  What dependencies does this add to the build tool chain?





[jira] [Resolved] (ARROW-7075) [C++] Boolean kernels should not allocate in Call()

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7075.
-
Resolution: Fixed

This was done in ARROW-8792

> [C++] Boolean kernels should not allocate in Call()
> ---
>
> Key: ARROW-7075
> URL: https://issues.apache.org/jira/browse/ARROW-7075
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Ben Kietzman
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> The boolean kernels currently allocate their value buffers ahead of time but 
> not their null bitmaps.





[jira] [Comment Edited] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116221#comment-17116221
 ] 

Wes McKinney edited comment on ARROW-7017 at 5/25/20, 7:56 PM:
---

I think the path forward here is to refactor to utilize common implementations 
of inline single-value functions for both the LLVM IR and pre-compiled kernels. 
In other words, what is currently in the gandiva/precompiled directory would be 
moved to some place where we can arrange so that these implementations are 
translated to LLVM IR for use in Gandiva, while available as inline C/C++ 
functions for use in creating pre-compiled vectorized kernels. Having multiple 
implementations of the scalar "unit of work" does not seem desirable

Note that Gandiva-generated kernels should be able (with some glue) to be 
registered in the new general function registry in arrow/compute/registry.h


was (Author: wesmckinn):
I think the path forward here is to refactor to utilize common implementations 
of for both the LLVM IR and pre-compiled kernels. In other words, what is 
currently in the gandiva/precompiled directory would be moved to some place 
where we can arrange so that these implementations are translated to LLVM IR 
for use in Gandiva, while available as inline C/C++ functions for use in 
creating pre-compiled vectorized kernels. Having multiple implementations of 
the scalar "unit of work" does not seem desirable

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Wes McKinney
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction, 
> Multiplication.
>  * Should have an overflow/underflow detection mode.





[jira] [Commented] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116221#comment-17116221
 ] 

Wes McKinney commented on ARROW-7017:
-

I think the path forward here is to refactor toward common implementations 
for both the LLVM IR and pre-compiled kernels. In other words, what is 
currently in the gandiva/precompiled directory would be moved to some place 
where we can arrange so that these implementations are translated to LLVM IR 
for use in Gandiva, while available as inline C/C++ functions for use in 
creating pre-compiled vectorized kernels. Having multiple implementations of 
the scalar "unit of work" does not seem desirable

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Wes McKinney
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction, 
> Multiplication.
>  * Should have an overflow/underflow detection mode.





[jira] [Commented] (ARROW-7012) [C++] Clarify ChunkedArray chunking strategy and policy

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116220#comment-17116220
 ] 

Wes McKinney commented on ARROW-7012:
-

In general, this is not something that users should be too concerned with. The 
new kernels framework provides a configurability knob 
({{ExecContext::exec_chunksize}}) for selecting the upper limit for the size of 
chunks that are processed

> [C++] Clarify ChunkedArray chunking strategy and policy
> ---
>
> Key: ARROW-7012
> URL: https://issues.apache.org/jira/browse/ARROW-7012
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> See discussion on ARROW-6784 and [https://github.com/apache/arrow/pull/5686]. 
> Among the questions:
>  * Do Arrow users control the chunking, or is it an internal implementation 
> detail they should not manage?
>  * If users control it, how do they control it? E.g. if I call Take and use a 
> ChunkedArray for the indices to take, does the chunking follow how the 
> indices are chunked? Or should we attempt to preserve the mapping of data to 
> their chunks in the input table/chunked array?
>  * If it's an implementation detail, what is the optimal chunk size? And when 
> is it worth reshaping (concatenating, slicing) input data to attain this 
> optimal size? 





[jira] [Closed] (ARROW-8905) [C++] Collapse Take APIs from 8 to 1 or 2

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8905.
---
Fix Version/s: (was: 1.0.0)
   Resolution: Duplicate

dup of ARROW-7009

> [C++] Collapse Take APIs from 8 to 1 or 2
> -
>
> Key: ARROW-8905
> URL: https://issues.apache.org/jira/browse/ARROW-8905
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> There are currently 8 {{arrow::compute::Take}} functions with different 
> function signatures. Fewer functions would make life easier for binding 
> developers





[jira] [Created] (ARROW-8938) [R] Provide binding and argument packing to use arrow::compute::CallFunction to use any compute kernel from R dynamically

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8938:
---

 Summary: [R] Provide binding and argument packing to use 
arrow::compute::CallFunction to use any compute kernel from R dynamically
 Key: ARROW-8938
 URL: https://issues.apache.org/jira/browse/ARROW-8938
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


This will drastically simplify exposing new functions to R users





[jira] [Commented] (ARROW-6982) [R] Add bindings for compare and boolean kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116218#comment-17116218
 ] 

Wes McKinney commented on ARROW-6982:
-

Like ARROW-6978, wrapping {{CallFunction}} would allow dynamic invocation of 
any kernel from R

> [R] Add bindings for compare and boolean kernels
> 
>
> Key: ARROW-6982
> URL: https://issues.apache.org/jira/browse/ARROW-6982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>
> See cpp/src/arrow/compute/kernels/compare.h and boolean.h. ARROW-6980 
> introduces an Expression class that works on Arrow Arrays, but to evaluate 
> the expressions, it has to pull the data into R first. This would enable us 
> to do the work in C++ and only pull in the result.





[jira] [Commented] (ARROW-6978) [R] Add bindings for sum and mean compute kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116215#comment-17116215
 ] 

Wes McKinney commented on ARROW-6978:
-

R should expose {{arrow::compute::CallFunction}} so that kernel bindings can be 
provided without having to touch C++ code
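The idea is a single dynamic entry point over a function registry, roughly as follows (a hypothetical pure-Python mock of the dispatch pattern, not the pyarrow or R API):

```python
# Mock of a compute-function registry with one dynamic CallFunction-style
# entry point, so bindings need no per-kernel glue code.
_REGISTRY = {}

def register_function(name, fn):
    """Register a kernel implementation under a string name."""
    _REGISTRY[name] = fn

def call_function(name, args):
    """Look up a function by name and invoke it on the packed arguments."""
    try:
        fn = _REGISTRY[name]
    except KeyError:
        raise KeyError(f"no compute function registered as {name!r}")
    return fn(*args)

register_function("sum", lambda xs: sum(xs))
register_function("mean", lambda xs: sum(xs) / len(xs))
```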

> [R] Add bindings for sum and mean compute kernels
> -
>
> Key: ARROW-6978
> URL: https://issues.apache.org/jira/browse/ARROW-6978
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>






[jira] [Closed] (ARROW-6959) [C++] Clarify what signatures are preferred for compute kernels

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6959.
---
  Assignee: Wes McKinney
Resolution: Fixed

This is addressed by the new {{arrow::compute::CallFunction}} API

> [C++] Clarify what signatures are preferred for compute kernels
> ---
>
> Key: ARROW-6959
> URL: https://issues.apache.org/jira/browse/ARROW-6959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Ben Kietzman
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: compute
> Fix For: 1.0.0
>
>
> Many of the compute kernels feature functions which accept only array inputs 
> in addition to functions which accept Datums. The former seems implicitly 
> like a convenience wrapper around the latter but I don't think this is 
> explicit anywhere. Is there a preferred overload for bindings to use? Is it 
> preferred that C++ implementers provide convenience wrappers for different 
> permutations of argument type? (for example, Filter now provides an overload 
> for record batch input as well as array input)





[jira] [Closed] (ARROW-6956) [C++] Status should use unique_ptr

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6956.
---
Resolution: Won't Fix

I'm not comfortable with this. I think this falls into the "if it ain't broke" 
category 

> [C++] Status should use unique_ptr
> --
>
> Key: ARROW-6956
> URL: https://issues.apache.org/jira/browse/ARROW-6956
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
>
> The logic of Status::State is _very_ similar to unique_ptr, except for the 
> deep copy performed on copy construction.





[jira] [Updated] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6856:

Fix Version/s: 1.0.0

> [C++] Use ArrayData instead of Array for ArrayData::dictionary
> --
>
> Key: ARROW-6856
> URL: https://issues.apache.org/jira/browse/ARROW-6856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This would be helpful for consistency. {{DictionaryArray}} may want to cache 
> a "boxed" version of this to return from {{DictionaryArray::dictionary}}





[jira] [Updated] (ARROW-6923) [C++] Option for Filter kernel how to handle nulls in the selection vector

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6923:

Fix Version/s: 1.0.0

> [C++] Option for Filter kernel how to handle nulls in the selection vector
> --
>
> Key: ARROW-6923
> URL: https://issues.apache.org/jira/browse/ARROW-6923
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> How nulls are handled in the boolean mask (selection vector) in a filter 
> kernel varies between languages / data analytics systems (e.g. base R 
> propagates nulls, dplyr R skips (sees as False), SQL generally skips them as 
> well I think, Julia raises an error).
> Currently, in Arrow C++ we "propagate" nulls (null in the selection vector 
> gives a null in the output):
> {code}
> In [7]: arr = pa.array([1, 2, 3]) 
> In [8]: mask = pa.array([True, False, None]) 
> In [9]: arr.filter(mask) 
> Out[9]: 
> 
> [
>   1,
>   null
> ]
> {code}
> Given the different ways this could be done (propagate, skip, error), should 
> we provide an option to control this behaviour?
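The three candidate behaviours can be illustrated with plain Python lists, None standing in for null (an illustration of the proposed option, not Arrow's implementation; the policy names are made up for the sketch):

```python
def filter_with_nulls(values, mask, null_selection="propagate"):
    """Filter values by a boolean mask that may itself contain nulls.

    null_selection: "propagate" emits null for a null mask slot,
    "skip" drops the slot, "error" raises -- the three policies
    discussed in the ticket.
    """
    out = []
    for v, m in zip(values, mask):
        if m is None:
            if null_selection == "propagate":
                out.append(None)
            elif null_selection == "error":
                raise ValueError("null in selection vector")
            # "skip": emit nothing for this slot
        elif m:
            out.append(v)
    return out
```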





[jira] [Commented] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116211#comment-17116211
 ] 

Wes McKinney commented on ARROW-6856:
-

Yes. I just added to the milestone

> [C++] Use ArrayData instead of Array for ArrayData::dictionary
> --
>
> Key: ARROW-6856
> URL: https://issues.apache.org/jira/browse/ARROW-6856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This would be helpful for consistency. {{DictionaryArray}} may want to cache 
> a "boxed" version of this to return from {{DictionaryArray::dictionary}}





[jira] [Closed] (ARROW-6799) [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6799.
---
Resolution: Cannot Reproduce

This is no longer an issue because Flatbuffers is not in our toolchain anymore

> [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)
> -
>
> Key: ARROW-6799
> URL: https://issues.apache.org/jira/browse/ARROW-6799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Does not appear to be tested in CI. Originally reported at 
> https://github.com/apache/arrow/issues/5575





[jira] [Updated] (ARROW-6523) [C++][Dataset] arrow_dataset target does not depend on anything

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6523:

Fix Version/s: 1.0.0

> [C++][Dataset] arrow_dataset target does not depend on anything
> ---
>
> Key: ARROW-6523
> URL: https://issues.apache.org/jira/browse/ARROW-6523
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Other subcomponents have targets to allow their libraries or unit tests to be 
> specifically built





[jira] [Closed] (ARROW-6514) [Developer][C++][CMake] LLVM tools are restricted to the exact version 7.0

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6514.
---
Resolution: Not A Problem

Closing since we've moved on from LLVM 7

> [Developer][C++][CMake] LLVM tools are restricted to the exact version 7.0
> --
>
> Key: ARROW-6514
> URL: https://issues.apache.org/jira/browse/ARROW-6514
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>
> I have LLVM 7.1 installed locally, and FindClangTools couldn't locate it 
> because ARROW_LLVM_VERSION is [hardcoded to 
> 7.0|https://github.com/apache/arrow/blob/3f2a33f902983c0d395e0480e8a8df40ed5da29c/cpp/CMakeLists.txt#L91-L99]
>  and clang tools is [restricted to the minor 
> version|https://github.com/apache/arrow/blob/3f2a33f902983c0d395e0480e8a8df40ed5da29c/cpp/cmake_modules/FindClangTools.cmake#L78].
> If it makes sense to restrict clang tools location down to the minor version, 
> then we need to pass the located LLVM's version instead of the hardcoded one.





[jira] [Updated] (ARROW-6548) [Python] consistently handle conversion of all-NaN arrays across types

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6548:

Fix Version/s: 1.0.0

> [Python] consistently handle conversion of all-NaN arrays across types
> --
>
> Key: ARROW-6548
> URL: https://issues.apache.org/jira/browse/ARROW-6548
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> In ARROW-5682 (https://github.com/apache/arrow/pull/5333), besides fixing 
> actual conversion bugs, I added the ability to convert all-NaN float arrays 
> when converting to string type (and only with {{from_pandas=True}}). So this 
> now works:
> {code}
> >>> pa.array(np.array([np.nan, np.nan], dtype=float), type=pa.string())
> 
> [
>   null,
>   null
> ]
> {code}
> However, I only added this for string type (and it already works for float 
> and int types). If we are happy with this behaviour, we should also add it 
> for other types.





[jira] [Commented] (ARROW-2079) [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available

2020-05-25 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116205#comment-17116205
 ] 

Francois Saint-Jacques commented on ARROW-2079:
---

Question to users/developers: why the need for two files? Is it because 
`_metadata` can be too big?

> [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't 
> available
> ---
>
> Key: ARROW-2079
> URL: https://issues.apache.org/jira/browse/ARROW-2079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Minor
>  Labels: dataset, dataset-parquet-read, parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not 
> `_metadata`. From what I understand these are intended to contain the dataset 
> schema but not any row group information.
>  
> A few (possibly naive) questions:
>  
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:java}
> if self.metadata_path is not None:
> with self.fs.open(self.metadata_path) as f:
> self.common_metadata = ParquetFile(f).metadata
> else:
> self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`, 
> as the latter is never written by `pyarrow`, and is given by the `_metadata` 
> file instead of `_common_metadata` (as seemingly intended?).
>  
> 2. In `validate_schemas` I believe an option should exist for using the 
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently 
> only writes the former, and as far as I can tell `_common_metadata` does 
> include all the schema information needed.
>  
> Perhaps the logic in `validate_schemas` could be ported over to:
>  
> {code:java}
> if self.schema is not None:
> pass  # schema explicitly provided
> elif self.metadata is not None:
> self.schema = self.metadata.schema
> elif self.common_metadata is not None:
> self.schema = self.common_metadata.schema
> else:
> self.schema = self.pieces[0].get_metadata(open_file).schema{code}
> If these changes are valid, I'd be happy to submit a PR. It's not 100% clear 
> to me the difference between `_common_metadata` and `_metadata`, but I 
> believe the schema in both should be the same. Figured I'd open this for 
> discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6456) [C++] Possible to reduce object code generated in compute/kernels/take.cc?

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6456:
---

Assignee: Wes McKinney

> [C++] Possible to reduce object code generated in compute/kernels/take.cc?
> --
>
> Key: ARROW-6456
> URL: https://issues.apache.org/jira/browse/ARROW-6456
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> According to 
> https://gist.github.com/wesm/90f73d050a81cbff6772aea2203cdf93
> take.cc is our largest piece of object code in the codebase. This is a pretty 
> important function but I wonder if it's possible to make the implementation 
> "leaner" than it is currently to reduce generated code, without sacrificing 
> performance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-3244) [Python] Multi-file parquet loading without scan

2020-05-25 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-3244.
---
Fix Version/s: 1.0.0
   Resolution: Implemented

> [Python] Multi-file parquet loading without scan
> 
>
> Key: ARROW-3244
> URL: https://issues.apache.org/jira/browse/ARROW-3244
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Major
>  Labels: dataset, dataset-parquet-read, parquet
> Fix For: 1.0.0
>
>
> A number of mechanisms are possible to avoid having to access and read the 
> parquet footers in a data set consisting of a number of files. In the case of 
> a large number of data files (perhaps split with directory partitioning) and 
> remote storage, this can be a significant overhead. This is significant from 
> the point of view of Dask, which must have the metadata available in the 
> client before setting up computational graphs.
>  
> Here are some suggestions of what could be done.
>  
>  * some parquet writing frameworks include a `_metadata` file, which contains 
> all the information from the footers of the various files. If this file is 
> present, then this data can be read from one place, with a single file 
> access. For a large number of files, parsing the thrift information may, by 
> itself, be a non-negligible overhead.
>  * the schema (dtypes) can be found in a `_common_metadata`, or from any one 
> of the data-files, then the schema could be assumed (perhaps at the user's 
> option) to be the same for all of the files. However, the information about 
> the directory partitioning would not be available. Although Dask may infer 
> the information from the filenames, it would be preferable to go through the 
> machinery with parquet-cpp, and view the whole data-set as a single object. 
> Note that the files will still need to have the footer read to access the 
> data, for the bytes offsets, but from Dask's point of view, this would be 
> deferred to tasks running in parallel.
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)
>  
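The single-file-access idea above can be sketched in plain Python. The footer dicts below are hypothetical stand-ins for real Parquet footer contents, not the pyarrow API; the point is only that a merged index lets a reader touch one file instead of every footer.

```python
# Sketch: merge per-file footers into one "_metadata"-style index so a
# client (e.g. Dask) can plan work from a single file access.
# The footer dicts are hypothetical stand-ins for real Parquet footers.

def build_metadata_index(footers_by_path):
    """Merge per-file footers into one index keyed by file path."""
    index = {"schema": None, "row_groups": []}
    for path, footer in sorted(footers_by_path.items()):
        if index["schema"] is None:
            index["schema"] = footer["schema"]  # assume a uniform schema
        for rg in footer["row_groups"]:
            # Remember which file each row group came from.
            index["row_groups"].append({"path": path, **rg})
    return index

footers = {
    "part=a/file0.parquet": {"schema": ["x:int64"], "row_groups": [{"num_rows": 10}]},
    "part=b/file1.parquet": {"schema": ["x:int64"], "row_groups": [{"num_rows": 5}]},
}
index = build_metadata_index(footers)
```

With such an index in hand, only the tasks that actually read data need to open the individual files.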



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6456) [C++] Possible to reduce object code generated in compute/kernels/take.cc?

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116202#comment-17116202
 ] 

Wes McKinney commented on ARROW-6456:
-

I will take care of this. 

> [C++] Possible to reduce object code generated in compute/kernels/take.cc?
> --
>
> Key: ARROW-6456
> URL: https://issues.apache.org/jira/browse/ARROW-6456
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> According to 
> https://gist.github.com/wesm/90f73d050a81cbff6772aea2203cdf93
> take.cc is our largest piece of object code in the codebase. This is a pretty 
> important function but I wonder if it's possible to make the implementation 
> "leaner" than it is currently to reduce generated code, without sacrificing 
> performance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6261) [C++] Install any bundled components and add installed CMake or pkgconfig configuration to enable downstream linkers to utilize bundled libraries when statically linking

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6261.
---
Fix Version/s: (was: 2.0.0)
   Resolution: Won't Fix

Closing in favor of the approach of splicing the bundled dependencies into 
libarrow.a

> [C++] Install any bundled components and add installed CMake or pkgconfig 
> configuration to enable downstream linkers to utilize bundled libraries when 
> statically linking
> -
>
> Key: ARROW-6261
> URL: https://issues.apache.org/jira/browse/ARROW-6261
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> The objective of this change would be to make it easier for toolchain 
> builders to ship bundled thirdparty libraries together with the Arrow 
> libraries in case there is a particular library version that is only used 
> when linking with {{libarrow.a}}. In theory configuration could be added to 
> arrowTargets.cmake (or pkgconfig) to simplify static linking



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6124) [C++] ArgSort kernel should sort in a single pass (with nulls)

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6124.
---
Fix Version/s: (was: 2.0.0)
   Resolution: Won't Fix

Sorting on large or chunked inputs will probably not be achieved by a 
VectorKernel, but rather by a query execution node, similar to those in various 
open source analytic databases

> [C++] ArgSort kernel should sort in a single pass (with nulls)
> --
>
> Key: ARROW-6124
> URL: https://issues.apache.org/jira/browse/ARROW-6124
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Minor
>
> There's a good chance that merge sort must be implemented (spill to disk, 
> ChunkedArray, ...)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6123) [C++] ArgSort kernel should not materialize the output internal

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116197#comment-17116197
 ] 

Wes McKinney commented on ARROW-6123:
-

[~fsaintjacques] could you clarify what you mean?

> [C++] ArgSort kernel should not materialize the output internal
> ---
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6122) [C++] SortToIndices kernel must support FixedSizeBinary

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6122:

Summary: [C++] SortToIndices kernel must support FixedSizeBinary  (was: 
[C++] ArgSort kernel must support FixedSizeBinary)

> [C++] SortToIndices kernel must support FixedSizeBinary
> ---
>
> Key: ARROW-6122
> URL: https://issues.apache.org/jira/browse/ARROW-6122
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5980) [Python] Missing libarrow.so and libarrow_python.so in wheel file

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5980.
---
Resolution: Not A Problem

Our current wheels don't have this problem

> [Python] Missing libarrow.so and libarrow_python.so in wheel file
> -
>
> Key: ARROW-5980
> URL: https://issues.apache.org/jira/browse/ARROW-5980
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Haowei Yu
>Priority: Major
>  Labels: wheel
>
> I have installed pyarrow 0.14.0, but it seems that by default you did not 
> provide symlinks for libarrow.so and libarrow_python.so; only .so files with a 
> version suffix are provided. Hence, I cannot use the output of 
> pyarrow.get_libraries() and pyarrow.get_library_dirs() to build my link 
> options. 
> If you provide symlinks, I can pass the following to the linker to specify the 
> libraries to link, e.g. g++ -L/ -larrow -larrow_python 
> However, right now, the ld output complains about not being able to find -larrow and 
> -larrow_python



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

2020-05-25 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8062.
---
Resolution: Fixed

Issue resolved by pull request 7180
[https://github.com/apache/arrow/pull/7180]

> [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file
> -
>
> Key: ARROW-8062
> URL: https://issues.apache.org/jira/browse/ARROW-8062
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Partitioned parquet datasets sometimes come with {{_metadata}} / 
> {{_common_metadata}} files. Those files include information about the schema 
> of the full dataset and potentially all RowGroup metadata as well (for 
> {{_metadata}}).
> Using those files during the creation of a parquet {{Dataset}} can give a 
> more efficient factory (using the stored schema instead of inferring the 
> schema from unioning the schemas of all files + using the paths to individual 
> parquet files instead of crawling the directory).
> Basically, based on those files, the schema, list of paths and partition 
> expressions (the information that is needed to create a Dataset) could be 
> constructed.   
> Such logic could be put in a different factory class, eg 
> {{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8485) [Integration][Java] Implement extension types integration

2020-05-25 Thread Ryan Murray (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Murray reassigned ARROW-8485:
--

Assignee: Ryan Murray

> [Integration][Java] Implement extension types integration
> -
>
> Key: ARROW-8485
> URL: https://issues.apache.org/jira/browse/ARROW-8485
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration, Java
>Reporter: Antoine Pitrou
>Assignee: Ryan Murray
>Priority: Major
> Fix For: 1.0.0
>
>
> Java should support the extension type integration tests added in ARROW-5649.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8114) [Java][Integration] Enable custom_metadata integration test

2020-05-25 Thread Ryan Murray (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Murray reassigned ARROW-8114:
--

Assignee: Ryan Murray

> [Java][Integration] Enable custom_metadata integration test
> ---
>
> Key: ARROW-8114
> URL: https://issues.apache.org/jira/browse/ARROW-8114
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration, Java
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ryan Murray
>Priority: Major
> Fix For: 1.0.0
>
>
> This will require refactoring the way metadata is serialized to JSON 
> following https://github.com/apache/arrow/pull/6556 (needs to be {{[{key: 
> "$key", value: "$value"}]}}, rather than {{ {"$key": "$value"} }}).
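The required shape change can be illustrated with a small Python sketch. The keys are illustrative and this is not the actual integration-test serializer; it only shows the object-to-pair-list conversion the description asks for.

```python
import json

def metadata_to_json(metadata):
    """Serialize a metadata mapping as [{key: ..., value: ...}] pairs
    instead of a plain {"key": "value"} object, preserving insertion order."""
    return json.dumps([{"key": k, "value": v} for k, v in metadata.items()])

encoded = metadata_to_json({"origin": "jdbc", "table": "users"})
```

The pair-list form also round-trips duplicate keys, which a plain JSON object cannot represent.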



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5916.
---
Resolution: Later

We didn't reach a conclusion on this, so closing for now

> [C++] Allow RecordBatch.length to be less than array lengths
> 
>
> Key: ARROW-5916
> URL: https://issues.apache.org/jira/browse/ARROW-5916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
> Attachments: test.arrow_ipc
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> 0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
> array length be equal.  As per 
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
>  , we discussed changing this so that RecordBatch.length can be [0,array 
> length].
>  If RecordBatch.length is less than array length, the reader should ignore 
> the portion of the array(s) beyond RecordBatch.length.  This will allow 
> partially populated batches to be read in scenarios identified in the above 
> discussion.
> {code:c++}
>   Status GetFieldMetadata(int field_index, ArrayData* out) {
> auto nodes = metadata_->nodes();
> // pop off a field
> if (field_index >= static_cast<int>(nodes->size())) {
>   return Status::Invalid("Ran out of field metadata, likely malformed");
> }
> const flatbuf::FieldNode* node = nodes->Get(field_index);
> *//out->length = node->length();*
> *out->length = metadata_->length();*
> out->null_count = node->null_count();
> out->offset = 0;
> return Status::OK();
>   }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5760) [C++] Optimize Take and Filter

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116112#comment-17116112
 ] 

Wes McKinney commented on ARROW-5760:
-

I'd like to work on this next week if it's alright

> [C++] Optimize Take and Filter
> --
>
> Key: ARROW-5760
> URL: https://issues.apache.org/jira/browse/ARROW-5760
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There is some question of whether these kernels allocate optimally; for 
> example when Filtering or Taking strings it might be more efficient to pass 
> over the filter/indices twice, first to determine how much character storage 
> will be needed then again into allocated memory: 
> https://github.com/apache/arrow/pull/4531#discussion_r297160457
> Additionally, these kernels could probably make good use of scatter/gather 
> SIMD instructions.
> Furthermore, Filter's bitmap is currently lazily expanded into the indices of 
> elements to be appended to the output array. It would probably be more 
> efficient to expand to indices in batches, then gather using an index batch.
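The two-pass idea for variable-width data can be sketched in Python, using an offsets-plus-character-buffer representation like Arrow's string arrays. This is an illustrative sketch, not the actual kernel code.

```python
def take_strings(values, indices):
    """Two-pass Take over an Arrow-style string layout.

    Pass 1 sizes the output character buffer exactly; pass 2 writes into
    preallocated storage, avoiding incremental buffer growth.
    """
    # Pass 1: total character storage needed for the selected values.
    total = sum(len(values[i].encode()) for i in indices)
    out = bytearray(total)          # single exact-size allocation
    offsets = [0]
    pos = 0
    # Pass 2: gather the bytes and build the offsets buffer.
    for i in indices:
        b = values[i].encode()
        out[pos:pos + len(b)] = b
        pos += len(b)
        offsets.append(pos)
    return offsets, bytes(out)

offsets, data = take_strings(["foo", "ba", "quux"], [2, 0])
```

The trade is one extra pass over the indices against never reallocating the character buffer, which tends to win when outputs are large.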



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5854) [Python] Expose compare kernels on Array class

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5854:

Fix Version/s: (was: 2.0.0)
   1.0.0

> [Python] Expose compare kernels on Array class
> --
>
> Key: ARROW-5854
> URL: https://issues.apache.org/jira/browse/ARROW-5854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> Expose the compare kernel for comparing with scalar or array (ARROW-3087, 
> ARROW-4990) on the python Array class.
> This can implement the {{\_\_eq\_\_}} et al dunder methods on the Array class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5854) [Python] Expose compare kernels on Array class

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116113#comment-17116113
 ] 

Wes McKinney commented on ARROW-5854:
-

This should be fairly trivial now

> [Python] Expose compare kernels on Array class
> --
>
> Key: ARROW-5854
> URL: https://issues.apache.org/jira/browse/ARROW-5854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 2.0.0
>
>
> Expose the compare kernel for comparing with scalar or array (ARROW-3087, 
> ARROW-4990) on the python Array class.
> This can implement the {{\_\_eq\_\_}} et al dunder methods on the Array class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5760) [C++] Optimize Take and Filter

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5760:

Fix Version/s: (was: 2.0.0)
   1.0.0

> [C++] Optimize Take and Filter
> --
>
> Key: ARROW-5760
> URL: https://issues.apache.org/jira/browse/ARROW-5760
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> There is some question of whether these kernels allocate optimally; for 
> example when Filtering or Taking strings it might be more efficient to pass 
> over the filter/indices twice, first to determine how much character storage 
> will be needed then again into allocated memory: 
> https://github.com/apache/arrow/pull/4531#discussion_r297160457
> Additionally, these kernels could probably make good use of scatter/gather 
> SIMD instructions.
> Furthermore, Filter's bitmap is currently lazily expanded into the indices of 
> elements to be appended to the output array. It would probably be more 
> efficient to expand to indices in batches, then gather using an index batch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5530) [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null behavior

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116111#comment-17116111
 ] 

Wes McKinney commented on ARROW-5530:
-

A HashOptions would also need to be introduced

> [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null 
> behavior
> 
>
> Key: ARROW-5530
> URL: https://issues.apache.org/jira/browse/ARROW-5530
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-5760) [C++] Optimize Take and Filter

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5760:
---

Assignee: Wes McKinney  (was: Ben Kietzman)

> [C++] Optimize Take and Filter
> --
>
> Key: ARROW-5760
> URL: https://issues.apache.org/jira/browse/ARROW-5760
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There is some question of whether these kernels allocate optimally; for 
> example when Filtering or Taking strings it might be more efficient to pass 
> over the filter/indices twice, first to determine how much character storage 
> will be needed then again into allocated memory: 
> https://github.com/apache/arrow/pull/4531#discussion_r297160457
> Additionally, these kernels could probably make good use of scatter/gather 
> SIMD instructions.
> Furthermore, Filter's bitmap is currently lazily expanded into the indices of 
> elements to be appended to the output array. It would probably be more 
> efficient to expand to indices in batches, then gather using an index batch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-5489) [C++] Normalize kernels and ChunkedArray behavior

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5489.
-
Fix Version/s: 1.0.0
 Assignee: Wes McKinney
   Resolution: Fixed

This is done in ARROW-8792

> [C++] Normalize kernels and ChunkedArray behavior
> -
>
> Key: ARROW-5489
> URL: https://issues.apache.org/jira/browse/ARROW-5489
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Some kernels (the wrappers, e.g. Unique) support ChunkedArray inputs, and 
> some don't. We should normalize this usage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5506) [C++] "Shredder" and "stitcher" functionality

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5506:

Fix Version/s: (was: 2.0.0)

> [C++] "Shredder" and "stitcher" functionality
> -
>
> Key: ARROW-5506
> URL: https://issues.apache.org/jira/browse/ARROW-5506
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Andrei Gudkov
>Priority: Major
>
> Discussion is here: [https://github.com/apache/arrow/pull/4066]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5506) [C++] "Shredder" and "stitcher" functionality

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5506.
---
Resolution: Won't Fix

> [C++] "Shredder" and "stitcher" functionality
> -
>
> Key: ARROW-5506
> URL: https://issues.apache.org/jira/browse/ARROW-5506
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Andrei Gudkov
>Priority: Major
>
> Discussion is here: [https://github.com/apache/arrow/pull/4066]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5193) [C++] Linker error with bundled zlib

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5193.
---
Resolution: Fixed

I believe this is fixed now

> [C++] Linker error with bundled zlib
> 
>
> Key: ARROW-5193
> URL: https://issues.apache.org/jira/browse/ARROW-5193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Antoine Pitrou
>Priority: Major
>
> {code}
> [98/146] Linking CXX executable debug/flight-test-integration-server
> FAILED: debug/flight-test-integration-server 
> : && /usr/bin/ccache /usr/lib/ccache/c++  -Wno-noexcept-type  
> -fdiagnostics-color=always -ggdb -O0  -Wall -Wno-conversion 
> -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2  -g  -rdynamic 
> src/arrow/flight/CMakeFiles/flight-test-integration-server.dir/test-integration-server.cc.o
>   -o debug/flight-test-integration-server  
> -Wl,-rpath,/home/antoine/arrow/bundledeps/cpp/build-test/debug 
> debug/libarrow_flight_testing.so.14.0.0 debug/libarrow_testing.so.14.0.0 
> double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlienc-static.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlidec-static.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlicommon-static.a -ldl 
> double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
> /usr/lib/x86_64-linux-gnu/libboost_system.so 
> /usr/lib/x86_64-linux-gnu/libboost_filesystem.so 
> /usr/lib/x86_64-linux-gnu/libboost_regex.so 
> googletest_ep-prefix/src/googletest_ep/lib/libgtest_maind.a 
> googletest_ep-prefix/src/googletest_ep/lib/libgtestd.a 
> googletest_ep-prefix/src/googletest_ep/lib/libgmockd.a -ldl 
> ../thirdparty/protobuf_ep-install/lib/libprotobuf.a 
> ../thirdparty/grpc_ep-install/lib/libgrpc++.a 
> ../thirdparty/grpc_ep-install/lib/libgrpc.a 
> ../thirdparty/grpc_ep-install/lib/libgpr.a 
> ../thirdparty/cares_ep-install/lib/libcares.a 
> ../thirdparty/grpc_ep-install/lib/libaddress_sorting.a 
> gflags_ep-prefix/src/gflags_ep/lib/libgflags.a 
> googletest_ep-prefix/src/googletest_ep/lib/libgtestd.a 
> debug/libarrow_flight.so.14.0.0 
> ../thirdparty/protobuf_ep-install/lib/libprotobuf.a 
> ../thirdparty/grpc_ep-install/lib/libgrpc++.a 
> ../thirdparty/grpc_ep-install/lib/libgrpc.a 
> ../thirdparty/grpc_ep-install/lib/libgpr.a 
> ../thirdparty/cares_ep-install/lib/libcares.a 
> ../thirdparty/grpc_ep-install/lib/libaddress_sorting.a 
> /usr/lib/x86_64-linux-gnu/libboost_system.so debug/libarrow.so.14.0.0 
> double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlienc-static.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlidec-static.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlicommon-static.a -ldl 
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -pthread -lrt 
> && :
> debug/libarrow_flight.so.14.0.0: undefined reference to `inflateInit2_'
> debug/libarrow_flight.so.14.0.0: undefined reference to `inflate'
> debug/libarrow_flight.so.14.0.0: undefined reference to `deflateInit2_'
> debug/libarrow_flight.so.14.0.0: undefined reference to `deflate'
> debug/libarrow_flight.so.14.0.0: undefined reference to `deflateEnd'
> debug/libarrow_flight.so.14.0.0: undefined reference to `inflateEnd'
> collect2: error: ld returned 1 exit status
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8933) [C++] Reduce generated code in vector_hash.cc

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8933:
---

 Summary: [C++] Reduce generated code in vector_hash.cc
 Key: ARROW-8933
 URL: https://issues.apache.org/jira/browse/ARROW-8933
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Since hashing doesn't need to know about logical types, we can do the following:

* Use same generated code for both BinaryType and StringType
* Use same generated code for primitive types having the same byte width

These two changes should reduce binary size and improve compilation speed.
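The byte-width idea can be illustrated with a small Python dispatch table. The type names are illustrative and the real change lives in C++ template instantiation; the sketch only shows kernels being keyed by physical width rather than logical type.

```python
# Sketch: key hash-kernel instantiations by physical byte width instead of
# logical type, so 4-byte types share one kernel and 8-byte types another.
WIDTH_OF = {"int32": 4, "float32": 4, "date32": 4,
            "int64": 8, "float64": 8, "timestamp_us": 8}

_kernels = {}  # one memoized kernel per byte width

def hasher_for(logical_type):
    """Return the shared hash kernel for a logical type's byte width."""
    width = WIDTH_OF[logical_type]
    if width not in _kernels:
        # A fixed-width value is hashed by its raw bytes only, so the
        # logical interpretation of those bytes never matters here.
        _kernels[width] = lambda raw, w=width: hash(bytes(raw[:w]))
    return _kernels[width]
```

Because hashing looks only at raw bytes, collapsing logical types onto widths loses nothing while cutting the number of generated functions.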



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5005) [C++] Implement support for using selection vectors in scalar aggregate function kernels

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5005:

Summary: [C++] Implement support for using selection vectors in scalar 
aggregate function kernels  (was: [C++] Add support for filter mask in 
AggregateFunction)

> [C++] Implement support for using selection vectors in scalar aggregate 
> function kernels
> 
>
> Key: ARROW-5005
> URL: https://issues.apache.org/jira/browse/ARROW-5005
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The aggregate kernels don't support a mask (the result of a filter). Add the 
> following method to `AggregateFunction`.
> {code:c++}
> virtual Status ConsumeWithFilter(const Array& input, const Array& mask, void* 
> state) const = 0;
> {code}
> The goal is to add support for an AST similar to:
> {code:sql}
> SELECT AGG(x) FROM table WHERE pred;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5005) [C++] Implement support for using selection vectors in scalar aggregate function kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116107#comment-17116107
 ] 

Wes McKinney commented on ARROW-5005:
-

I believe the best approach right now is to use selection vectors for this (see 
{{arrow::compute::ExecBatch::selection_vector}})
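A selection vector carries the indices of rows that passed a filter, rather than one boolean per row; a kernel then visits only those positions. A minimal Python sketch of the idea (illustrative names, not the Arrow API):

```python
def filter_to_selection(mask):
    # Convert a boolean filter result into a selection vector of indices.
    return [i for i, keep in enumerate(mask) if keep]

def sum_with_selection(values, selection):
    # Aggregate by visiting only the selected positions.
    return sum(values[i] for i in selection)

sel = filter_to_selection([True, False, True, True])
assert sel == [0, 2, 3]
assert sum_with_selection([10, 20, 30, 40], sel) == 80
```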

> [C++] Implement support for using selection vectors in scalar aggregate 
> function kernels
> 
>
> Key: ARROW-5005
> URL: https://issues.apache.org/jira/browse/ARROW-5005
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The aggregate kernels don't support mask (the result of a filter). Add the 
> the following method to `AggregateFunction`.
> {code:c++}
> virtual Status ConsumeWithFilter(const Array& input, const Array& mask, void* 
> state) const = 0;
> {code}
> The goal is to add support for AST similar to:
> {code:sql}
> SELECT AGG(x) FROM table WHERE pred;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-5002) [C++] Implement Hash Aggregation query execution node

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116106#comment-17116106
 ] 

Wes McKinney edited comment on ARROW-5002 at 5/25/20, 3:10 PM:
---

I renamed the issue. I need to be able to execute hash aggregations in the next 
few months, so I will be working to implement the appropriate machinery for this 
under arrow/compute (since hash aggregations need to compose with array/kernel 
expressions)


was (Author: wesmckinn):
I renamed the issue. I need to be able to execute hash aggregations in the next 
few months so I will implement the appropriate machinery for this under 
arrow/compute (since hash aggregations need to compose with array/kernel 
expressions)

> [C++] Implement Hash Aggregation query execution node
> -
>
> Key: ARROW-5002
> URL: https://issues.apache.org/jira/browse/ARROW-5002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: query-engine
>
> Dear all,
> I wonder what the best way forward is for implementing GroupBy kernels. 
> Initially this was part of
> https://issues.apache.org/jira/browse/ARROW-4124
> but is not contained in the current implementation as far as I can tell.
> It seems that the part of group by that just returns indices could be 
> conveniently implemented with the HashKernel. That seems useful in any case. 
> Is that indeed the best way forward/should this be done?
> GroupBy + Aggregate could then either be implemented with that + the Take 
> kernel + aggregation involving more memory copies than necessary though or as 
> part of the aggregate kernel. Probably the latter is preferred, any thoughts 
> on that?
> Am I missing any other JIRAs related to this?
> Best, Philipp.
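The hash-then-aggregate route discussed above (dictionary-encode the keys, then aggregate per code, avoiding a hash + Take + aggregate pipeline's extra copies) can be sketched generically in Python. This is a conceptual illustration, not the HashKernel API:

```python
def group_by_sum(keys, values):
    # Step 1: dictionary-encode the keys (what a hash kernel produces):
    # each distinct key gets a small integer code.
    uniques, codes, index = [], [], {}
    for k in keys:
        if k not in index:
            index[k] = len(uniques)
            uniques.append(k)
        codes.append(index[k])
    # Step 2: aggregate values per code directly, skipping the extra
    # materialization a Take-based pipeline would incur.
    sums = [0] * len(uniques)
    for code, v in zip(codes, values):
        sums[code] += v
    return dict(zip(uniques, sums))

assert group_by_sum(["a", "b", "a"], [1, 2, 3]) == {"a": 4, "b": 2}
```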



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5002) [C++] Implement Hash Aggregation query execution node

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116106#comment-17116106
 ] 

Wes McKinney commented on ARROW-5002:
-

I renamed the issue. I need to be able to execute hash aggregations in the next 
few months so I will implement the appropriate machinery for this under 
arrow/compute (since hash aggregations need to compose with array/kernel 
expressions)

> [C++] Implement Hash Aggregation query execution node
> -
>
> Key: ARROW-5002
> URL: https://issues.apache.org/jira/browse/ARROW-5002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: query-engine
>
> Dear all,
> I wonder what the best way forward is for implementing GroupBy kernels. 
> Initially this was part of
> https://issues.apache.org/jira/browse/ARROW-4124
> but is not contained in the current implementation as far as I can tell.
> It seems that the part of group by that just returns indices could be 
> conveniently implemented with the HashKernel. That seems useful in any case. 
> Is that indeed the best way forward/should this be done?
> GroupBy + Aggregate could then either be implemented with that + the Take 
> kernel + aggregation involving more memory copies than necessary though or as 
> part of the aggregate kernel. Probably the latter is preferred, any thoughts 
> on that?
> Am I missing any other JIRAs related to this?
> Best, Philipp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5002) [C++] Implement Hash Aggregation query execution node

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5002:

Labels: query-engine  (was: )

> [C++] Implement Hash Aggregation query execution node
> -
>
> Key: ARROW-5002
> URL: https://issues.apache.org/jira/browse/ARROW-5002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: query-engine
>
> Dear all,
> I wonder what the best way forward is for implementing GroupBy kernels. 
> Initially this was part of
> https://issues.apache.org/jira/browse/ARROW-4124
> but is not contained in the current implementation as far as I can tell.
> It seems that the part of group by that just returns indices could be 
> conveniently implemented with the HashKernel. That seems useful in any case. 
> Is that indeed the best way forward/should this be done?
> GroupBy + Aggregate could then either be implemented with that + the Take 
> kernel + aggregation involving more memory copies than necessary though or as 
> part of the aggregate kernel. Probably the latter is preferred, any thoughts 
> on that?
> Am I missing any other JIRAs related to this?
> Best, Philipp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5002) [C++] Implement Hash Aggregation query execution node

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5002:

Summary: [C++] Implement Hash Aggregation query execution node  (was: [C++] 
Implement GroupBy)

> [C++] Implement Hash Aggregation query execution node
> -
>
> Key: ARROW-5002
> URL: https://issues.apache.org/jira/browse/ARROW-5002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>
> Dear all,
> I wonder what the best way forward is for implementing GroupBy kernels. 
> Initially this was part of
> https://issues.apache.org/jira/browse/ARROW-4124
> but is not contained in the current implementation as far as I can tell.
> It seems that the part of group by that just returns indices could be 
> conveniently implemented with the HashKernel. That seems useful in any case. 
> Is that indeed the best way forward/should this be done?
> GroupBy + Aggregate could then either be implemented with that + the Take 
> kernel + aggregation involving more memory copies than necessary though or as 
> part of the aggregate kernel. Probably the latter is preferred, any thoughts 
> on that?
> Am I missing any other JIRAs related to this?
> Best, Philipp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-4798) [C++] Re-enable runtime/references cpplint check

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4798.
---
Fix Version/s: (was: 2.0.0)
   Resolution: Won't Fix

The benchmark thing is enough of a nuisance that I won't bother with this. 
We've been pretty effective about catching mutable references in code reviews.

> [C++] Re-enable runtime/references cpplint check
> 
>
> Key: ARROW-4798
> URL: https://issues.apache.org/jira/browse/ARROW-4798
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This will help keep the codebase clean.
> We might consider defining some custom filters for cpplint warnings we want 
> to suppress, like it doesn't like {{benchmark::State&}} because of the 
> non-const reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4633) [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4633:

Fix Version/s: 1.0.0

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> --
>
> Key: ARROW-4633
> URL: https://issues.apache.org/jira/browse/ARROW-4633
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
> Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>Reporter: Taylor Johnson
>Priority: Minor
>  Labels: dataset-parquet-read, newbie, parquet
> Fix For: 1.0.0
>
>
> The following code seems to suggest that ParquetFile.read(use_threads=False) 
> still creates a ThreadPool.  This is observed in 
> ParquetFile.read_row_group(use_threads=False) as well. 
> This does not appear to be a problem in 
> pyarrow.Table.to_pandas(use_threads=False).
> I've tried tracing the error.  Starting in python/pyarrow/parquet.py, both 
> ParquetReader.read_all() and ParquetReader.read_row_group() pass the 
> use_threads input along to self.reader which is a ParquetReader imported from 
> _parquet.pyx
> Following the calls into python/pyarrow/_parquet.pyx, we see that 
> ParquetReader.read_all() and ParquetReader.read_row_group() have the 
> following code which seems a bit suspicious
> {quote}if use_threads:
>     self.set_use_threads(use_threads)
> {quote}
> Why not just always call self.set_use_threads(use_threads)?
> The ParquetReader.set_use_threads simply calls 
> self.reader.get().set_use_threads(use_threads).  This self.reader is assigned 
> as unique_ptr[FileReader].  I think this points to 
> cpp/src/parquet/arrow/reader.cc, but I'm not sure about that.  The 
> FileReader::Impl::ReadRowGroup logic looks ok, as a call to 
> ::arrow::internal::GetCpuThreadPool() is only called if use_threads is True.  
> The same is true for ReadTable.
> So when is the ThreadPool getting created?
> Example code:
> --
> {quote}import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
> use_threads=False
> p=psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
> df = pd.DataFrame(\{'x':[0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> {quote}
> ---
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4530) [C++] Review Aggregate kernel state allocation/ownership semantics

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116099#comment-17116099
 ] 

Wes McKinney commented on ARROW-4530:
-

You may have noticed that the aggregation API was iterated in ARROW-8792. I 
think the current structure is adequate for non-hash-aggregations, but we 
should think about how to deal with implementing aggregations that can be used 
with hash aggregation (aka "GROUP BY")

> [C++] Review Aggregate kernel state allocation/ownership semantics
> --
>
> Key: ARROW-4530
> URL: https://issues.apache.org/jira/browse/ARROW-4530
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116096#comment-17116096
 ] 

Wes McKinney commented on ARROW-4333:
-

I partially addressed some of these questions in ARROW-8792, but there are 
other questions vis-à-vis memory reuse and dealing with ChunkedArrays. Perhaps 
it would be useful to go through these questions and discuss them in the 
context of the new generic kernel execution framework in arrow/compute.

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory? How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray? How to determine whether the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4333:
---

Assignee: (was: Wes McKinney)

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory? How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray? How to determine whether the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4333:
---

Assignee: Wes McKinney

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory? How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray? How to determine whether the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4097) [C++] Add function to "conform" a dictionary array to a target new dictionary

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116095#comment-17116095
 ] 

Wes McKinney commented on ARROW-4097:
-

This can be implemented as a ScalarFunction I think

> [C++] Add function to "conform" a dictionary array to a target new dictionary
> -
>
> Key: ARROW-4097
> URL: https://issues.apache.org/jira/browse/ARROW-4097
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Follow up work to ARROW-554. 
> Unifying multiple dictionary-encoded arrays is one use case. Another is 
> rewriting a DictionaryArray to be based on another dictionary. For example, 
> this would be used to implement Cast from one dictionary type to another.
> This will need to be able to insert nulls where there are values that are not 
> found in the target dictionary
> see also discussion at 
> https://github.com/apache/arrow/pull/3165#discussion_r243025730
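"Conforming" a dictionary array amounts to remapping each index from the source dictionary onto a target dictionary, emitting null where a source value is absent from the target. A hedged Python sketch of that remapping (illustrative names, not the C++ function this issue proposes):

```python
def conform_dictionary(indices, src_dict, target_dict):
    # Build a lookup from each source value to its position in the
    # target dictionary; values absent from the target map to None (null).
    pos = {v: i for i, v in enumerate(target_dict)}
    remap = [pos.get(v) for v in src_dict]
    # Remap each index; existing nulls stay null.
    return [remap[i] if i is not None else None for i in indices]

# Source dictionary ["a","b","c"]; the target lacks "c", so its rows become null.
out = conform_dictionary([0, 2, 1, None], ["a", "b", "c"], ["b", "a"])
assert out == [1, None, 0, None]
```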



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3978) [C++] Implement hashing, dictionary-encoding for StructArray

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3978:

Labels: query-engine  (was: )

> [C++] Implement hashing, dictionary-encoding for StructArray
> 
>
> Key: ARROW-3978
> URL: https://issues.apache.org/jira/browse/ARROW-3978
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: query-engine
>
> This is a central requirement for hash-aggregations such as
> {code}
> SELECT AGG_FUNCTION(expr)
> FROM table
> GROUP BY expr1, expr2, ...
> {code}
> The materialized keys in the GROUP BY section form a struct, which can be 
> incrementally hashed to produce dictionary codes suitable for computing 
> aggregates or any other purpose. 
> There are a few subtasks related to this, such as efficiently constructing a 
> record (that can be hashed quickly) to identify each "row" in the struct. 
> Maybe we should start with that first
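The "record per row" idea above can be illustrated by treating each struct row as the tuple of its field values and incrementally dictionary-encoding those tuples. A minimal Python sketch (not Arrow's hash machinery):

```python
def dictionary_encode_rows(columns):
    # Each "row" of the struct is the tuple of its field values; hashing
    # that tuple incrementally yields dictionary codes usable as
    # GROUP BY keys for computing aggregates.
    seen, codes = {}, []
    for row in zip(*columns):
        code = seen.setdefault(row, len(seen))
        codes.append(code)
    return codes, list(seen)

codes, uniques = dictionary_encode_rows([["x", "y", "x"], [1, 2, 1]])
assert codes == [0, 1, 0]
assert uniques == [("x", 1), ("y", 2)]
```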



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8878) [R] how to install when behind a firewall?

2020-05-25 Thread Olaf (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116028#comment-17116028
 ] 

Olaf commented on ARROW-8878:
-

[~npr] What do you think? Interestingly, I am able to install and use the 
nightly version. Is the nightly package stored on another website? Did you fix 
something in the nightly versions that might affect this?

 

thanks!

> [R] how to install when behind a firewall?
> --
>
> Key: ARROW-8878
> URL: https://issues.apache.org/jira/browse/ARROW-8878
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
> Environment: r
>Reporter: Olaf
>Priority: Major
>
> Hello there and thanks again for this beautiful package!
> I am trying to install {{arrow}} on Linux and I got a few problematic 
> warnings during the install. My computer is behind a firewall, so not all the 
> connections coming from RStudio are allowed.
>  
> {code:java}
> > sessionInfo()
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-ubuntu18-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.4 LTS
> Matrix products: default
> BLAS/LAPACK: 
> /apps/intel/2019.1/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
> locale:
>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
>  [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 
>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C 
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] MKLthreads_0.1
> loaded via a namespace (and not attached):
> [1] compiler_3.6.1 tools_3.6.1
> {code}
>  
> after running {{install.packages("arrow")}} I get
>  
> {code:java}
>  
> installing *source* package ?arrow? ...
> ** package ?arrow? successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Successfully retrieved C++ source
> *** Proceeding without C++ dependencies
> Warning message:
> In unzip(tf1, exdir = src_dir) : error 1 in extracting from zip file
> ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or 
> directory
> - NOTE ---
> After installation, please run arrow::install_arrow()
> for help installing required runtime libraries
> -
> {code}
>  
>  
> However, the installation ends normally.
>  
> {code:java}
>  ** R
> ** inst
> ** byte-compile and prepare package for lazy loading
> ** help
> *** installing help indices
> ** building package indices
> ** installing vignettes
> ** testing if installed package can be loaded from temporary location
> ** checking absolute paths in shared objects and dynamic libraries
> ** testing if installed package can be loaded from final location
> ** testing if installed package keeps a record of temporary installation path
> * DONE (arrow)
> {code}
>  
> So I go ahead and try to run arrow::install_arrow() and get a similar warning.
>  
> {code:java}
> installing *source* package ?arrow? ...
> ** package ?arrow? successfully unpacked and MD5 sums checked
> ** using staged installation
> *** Successfully retrieved C++ binaries for ubuntu-18.04
> Warning messages:
> 1: In file(file, "rt") :
>  URL 
> 'https://raw.githubusercontent.com/ursa-labs/arrow-r-nightly/master/linux/distro-map.csv':
>  status was 'Couldn't connect to server'
> 2: In unzip(bin_file, exdir = dst_dir) :
>  error 1 in extracting from zip file
> ./configure: line 132: cd: libarrow/arrow-0.17.1/lib: No such file or 
> directory
> - NOTE ---
> After installation, please run arrow::install_arrow()
> for help installing required runtime libraries
> {code}
> And unfortunately I cannot read any parquet file.
> {noformat}
> Error in fetch(key) : lazy-load database 
> '/mydata/R/x86_64-ubuntu18-linux-gnu-library/3.6/arrow/help/arrow.rdb' is 
> corrupt{noformat}
>  
> Could you please tell me how to fix this? Can I just copy the zip from github 
> and do a manual install in Rstudio?
>  
> Thanks!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3822) [C++] parquet::arrow::FileReader::GetRecordBatchReader has logical error on row groups with chunked columns

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3822:

Fix Version/s: 1.0.0

> [C++] parquet::arrow::FileReader::GetRecordBatchReader has logical error on 
> row groups with chunked columns
> ---
>
> Key: ARROW-3822
> URL: https://issues.apache.org/jira/browse/ARROW-3822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> If a BinaryArray / StringArray overflows a single column when reading a row 
> group, the resulting table will have a ChunkedArray. Using TableBatchReader 
> in 
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
> will therefore only return a part of the row group, discarding the rest



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8937) [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8937:
---

 Summary: [C++] Add "parse_strptime" function for string to 
timestamp conversions using the kernels framework
 Key: ARROW-8937
 URL: https://issues.apache.org/jira/browse/ARROW-8937
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This should be relatively straightforward to implement using the new kernels 
framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-3372) [C++] Introduce SlicedBuffer class

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3372.
---
Resolution: Won't Fix

> [C++] Introduce SlicedBuffer class
> --
>
> Key: ARROW-3372
> URL: https://issues.apache.org/jira/browse/ARROW-3372
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> The purpose of this class will be to forward certain function calls to the 
> parent buffer, like a request for the device (CPU, GPU, etc.).
> As a result of this, we can remove the {{parent_}} member from {{Buffer}} as 
> that member is only there to support slices. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1846) [C++] Implement "any" reduction kernel for boolean data

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116060#comment-17116060
 ] 

Wes McKinney commented on ARROW-1846:
-

With fresh eyes and ARROW-8792 in the rear view mirror, I believe Any should be 
implemented as a ScalarAggregateFunction, with some way for agg functions to 
communicate that they have short-circuited to the KernelContext
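What "short-circuiting" means for Any can be sketched in Python: once a True is seen the final result is fixed, so later chunks can be skipped entirely (the signal a real kernel would send through its context). Names here are illustrative, not the Arrow API:

```python
def any_consume(chunk, state):
    # Consume one chunk of a nullable boolean column. Returns True once
    # the aggregate has short-circuited: any observed True fixes the
    # final result, so remaining chunks need not be visited.
    if not state["result"]:
        state["result"] = any(v for v in chunk if v is not None)
    return state["result"]

state = {"result": False}
chunks = [[False, None, False], [True, False], [False]]
consumed = 0
for chunk in chunks:
    consumed += 1
    if any_consume(chunk, state):
        break
assert state["result"] is True
assert consumed == 2  # the third chunk was never visited
```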

> [C++] Implement "any" reduction kernel for boolean data
> ---
>
> Key: ARROW-1846
> URL: https://issues.apache.org/jira/browse/ARROW-1846
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics, dataframe
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-971) [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116043#comment-17116043
 ] 

Wes McKinney edited comment on ARROW-971 at 5/25/20, 2:03 PM:
--

The correct way to implement is now as {{arrow::compute::ScalarFunction}}


was (Author: wesmckinn):
The correct way to implement is as {{arrow::compute::ScalarFunction}}

> [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions
> ---
>
> Key: ARROW-971
> URL: https://issues.apache.org/jira/browse/ARROW-971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For arrays with nulls, this amounts to returning the validity bitmap. Without 
> nulls, an array of all 1 bits must be constructed. For isnull, the bits must 
> be flipped (in this case, the un-set part of the new bitmap must stay 0, 
> though).
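The bitmap semantics described above can be sketched at the bit level: isvalid returns the validity bitmap (or materializes all ones when there are no nulls), and isnull flips the bits while keeping the unused padding bits past the array length at 0. A hedged Python illustration, using Arrow's LSB-first bit order:

```python
def is_valid_bitmap(validity, length):
    # With nulls present this is just the validity bitmap; with no
    # nulls, materialize an all-ones bitmap of ceil(length / 8) bytes.
    nbytes = (length + 7) // 8
    out = bytearray(b"\xff" * nbytes) if validity is None else bytearray(validity)
    # Zero the padding bits past `length` so trailing bits stay 0.
    if length % 8:
        out[-1] &= (1 << (length % 8)) - 1
    return bytes(out)

def is_null_bitmap(validity, length):
    # Flip every validity bit, again keeping the unused tail bits at 0.
    valid = is_valid_bitmap(validity, length)
    out = bytearray(b ^ 0xFF for b in valid)
    if length % 8:
        out[-1] &= (1 << (length % 8)) - 1
    return bytes(out)

# 3 values, the middle one null: validity bits (LSB first) are 1,0,1 -> 0b101.
assert is_valid_bitmap(bytes([0b101]), 3) == bytes([0b101])
assert is_null_bitmap(bytes([0b101]), 3) == bytes([0b010])
assert is_valid_bitmap(None, 3) == bytes([0b111])
```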



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1888) [C++] Implement casts from one struct type to another (with same field names and number of fields)

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116062#comment-17116062
 ] 

Wes McKinney commented on ARROW-1888:
-

This should be implemented in scalar_cast_nested.cc

> [C++] Implement casts from one struct type to another (with same field names 
> and number of fields)
> --
>
> Key: ARROW-1888
> URL: https://issues.apache.org/jira/browse/ARROW-1888
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-971) [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116043#comment-17116043
 ] 

Wes McKinney commented on ARROW-971:


The correct way to implement is as {{arrow::compute::ScalarFunction}}

> [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions
> ---
>
> Key: ARROW-971
> URL: https://issues.apache.org/jira/browse/ARROW-971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For arrays with nulls, this amounts to returning the validity bitmap. Without 
> nulls, an array of all 1 bits must be constructed. For isnull, the bits must 
> be flipped (in this case, the un-set part of the new bitmap must stay 0, 
> though).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-1570) [C++] Define API for creating a kernel instance from function of scalar input and output with a particular signature

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1570.
-
Fix Version/s: 1.0.0
 Assignee: Wes McKinney
   Resolution: Fixed

This was basically achieved in ARROW-8792. Further work can be done with 
specific follow ups

> [C++] Define API for creating a kernel instance from function of scalar input 
> and output with a particular signature
> 
>
> Key: ARROW-1570
> URL: https://issues.apache.org/jira/browse/ARROW-1570
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 1.0.0
>
>
> This could include an {{std::function}} instance (but these cannot be inlined 
> by the C++ compiler), but should also permit use with inline-able functions 
> or functors



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1574) [C++] Implement kernel function that converts a dense array to dictionary given known dictionary

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116054#comment-17116054
 ] 

Wes McKinney commented on ARROW-1574:
-

This would be a useful expansion of the functions in vector_hash.cc. We must 
introduce a {{HashOptions}} to be able to supply the known dictionary when 
invoking the functions

> [C++] Implement kernel function that converts a dense array to dictionary 
> given known dictionary
> 
>
> Key: ARROW-1574
> URL: https://issues.apache.org/jira/browse/ARROW-1574
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This may simply be a special case of cast using a dictionary type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1568) [C++] Implement "drop null" kernels that return array without nulls

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116050#comment-17116050
 ] 

Wes McKinney commented on ARROW-1568:
-

This can be implemented as an {{arrow::compute::VectorFunction}} because the 
size of the array changes, so this function is not valid in a SQL-like 
context.
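The length-changing behavior that makes drop-null a vector function (rather than a row-per-row scalar function) is easy to show; this is a conceptual Python sketch, not the Arrow kernel:

```python
def drop_null(values):
    # A vector function: the output length depends on the data itself,
    # which is why this cannot be modeled as a row-per-row (scalar)
    # function whose output length always equals its input length.
    return [v for v in values if v is not None]

assert drop_null([1, None, 2, None, 3]) == [1, 2, 3]
```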

> [C++] Implement "drop null" kernels that return array without nulls
> ---
>
> Key: ARROW-1568
> URL: https://issues.apache.org/jira/browse/ARROW-1568
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

