[jira] [Assigned] (ARROW-10010) [Rust] Speedup arithmetic
[ https://issues.apache.org/jira/browse/ARROW-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10010: - Assignee: Jorge (was: Apache Arrow JIRA Bot) > [Rust] Speedup arithmetic > - > > Key: ARROW-10010 > URL: https://issues.apache.org/jira/browse/ARROW-10010 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > There are some optimizations possible in arithmetic kernels. > > PR to follow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10010) [Rust] Speedup arithmetic
[ https://issues.apache.org/jira/browse/ARROW-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10010: - Assignee: Apache Arrow JIRA Bot (was: Jorge) > [Rust] Speedup arithmetic > - > > Key: ARROW-10010 > URL: https://issues.apache.org/jira/browse/ARROW-10010 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Assignee: Apache Arrow JIRA Bot >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > There are some optimizations possible in arithmetic kernels. > > PR to follow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10010) [Rust] Speedup arithmetic
[ https://issues.apache.org/jira/browse/ARROW-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10010: --- Labels: pull-request-available (was: ) > [Rust] Speedup arithmetic > - > > Key: ARROW-10010 > URL: https://issues.apache.org/jira/browse/ARROW-10010 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > There are some optimizations possible in arithmetic kernels. > > PR to follow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10010) [Rust] Speedup arithmetic
Jorge created ARROW-10010: - Summary: [Rust] Speedup arithmetic Key: ARROW-10010 URL: https://issues.apache.org/jira/browse/ARROW-10010 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Assignee: Jorge There are some optimizations possible in arithmetic kernels. PR to follow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10009) [C++] LeastSignficantBitMask has typo in name.
Micah Kornfield created ARROW-10009: --- Summary: [C++] LeastSignficantBitMask has typo in name. Key: ARROW-10009 URL: https://issues.apache.org/jira/browse/ARROW-10009 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Micah Kornfield Assignee: Micah Kornfield We should fix the typo. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10008) pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False
Caleb Hattingh created ARROW-10008:
-----------------------------------

Summary: pyarrow.parquet.read_table fails with predicate pushdown on categorical data with use_legacy_dataset=False
Key: ARROW-10008
URL: https://issues.apache.org/jira/browse/ARROW-10008
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 1.0.1, 0.17.1
Environment: Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug 5 2020, 08:36:46) [GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1
Reporter: Caleb Hattingh

I apologise if this is a known issue; I looked both in this issue tracker and on GitHub and didn't find it. There seems to be a problem reading a dataset with predicate pushdown (filters) on columns with categorical data. The problem only occurs with `use_legacy_dataset=False`; with `use_legacy_dataset=True` the filter simply has no effect when the column isn't a partition key.

Reproducer:

{code:python}
import shutil
import sys, platform
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Settings
CATEGORICAL_DTYPE = True
USE_LEGACY_DATASET = False

print('Platform:', platform.platform())
print('Python version:', sys.version)
print('Pandas version:', pd.__version__)
print('pyarrow version:', pa.__version__)
print('categorical enabled:', CATEGORICAL_DTYPE)
print('use_legacy_dataset:', USE_LEGACY_DATASET)
print()

# Clean up test dataset if present
path = Path('blah.parquet')
if path.exists():
    shutil.rmtree(str(path))

# Simple data
d = dict(col1=['a', 'b'], col2=[1, 2])

# Either categorical or not
if CATEGORICAL_DTYPE:
    df = pd.DataFrame(data=d, dtype='category')
else:
    df = pd.DataFrame(data=d)

# Write dataset
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, str(path))

# Load dataset
table = pq.read_table(
    str(path),
    filters=[('col1', '=', 'a')],
    use_legacy_dataset=USE_LEGACY_DATASET,
)
df = table.to_pandas()
print(df.dtypes)
print(repr(df))
{code}

Output:

{code:java}
$ python categorical_predicate_pushdown.py
Platform: Linux-5.8.9-050809-generic-x86_64-with-glibc2.10
Python version: 3.8.5 (default, Aug 5 2020, 08:36:46) [GCC 7.3.0]
Pandas version: 1.1.2
pyarrow version: 1.0.1
categorical enabled: True
use_legacy_dataset: False
/arrow/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Type error: Cannot compare scalars of differing type: dictionary vs string
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4fc128)[0x7f50568c6128]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f50568c693d]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal14DieWithMessageERKSs+0x51)[0x7f50569757c1]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x4c)[0x7f505697716c]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression21AssumeGivenComparisonERKS1_+0x438)[0x7f5043334f18]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0x34)[0x7f5043334fa4]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset20ComparisonExpression6AssumeERKNS0_10ExpressionE+0xce)[0x7f504333503e]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset12RowGroupInfo7SatisfyERKNS0_10ExpressionE+0x1c)[0x7f50433116ac]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset19ParquetFileFragment15FilterRowGroupsERKNS0_10ExpressionE+0x563)[0x7f5043311cb3]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZNK5arrow7dataset17ParquetFileFormat8ScanFileESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEEPNS0_12FileFragmentE+0x203)[0x7f50433168a3]
/home/caleb/Documents/kapiche/chrysalis/venv38/lib/python3.8/site-packages/pyarrow/libarrow_dataset.so.100(_ZN5arrow7dataset12FileFragment4ScanESt10shared_ptrINS0_11ScanOptionsEES2_INS0_11ScanContextEE+0x55)[0x7f5043329785]
{code}
[jira] [Commented] (ARROW-9989) Arrow
[ https://issues.apache.org/jira/browse/ARROW-9989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195791#comment-17195791 ]

Wes McKinney commented on ARROW-9989:
-------------------------------------

This question might be more relevant for the dev@ or user@ mailing list. If you want to keep it as a JIRA, could you write an informative issue title?

> Arrow
> -----
>
> Key: ARROW-9989
> URL: https://issues.apache.org/jira/browse/ARROW-9989
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Affects Versions: 0.14.0
> Environment: Linux 18.04, Arrow from maven 0.14.0
> Reporter: Litchy Soong
> Priority: Major
>
> In Scala (using Arrow's Java API), the following code works:
> {quote}
> object A {
>   def write(): Unit = {
>     val vectorSchemaRoot = VectorSchemaRoot.create(getSchema, allocator)
>     val writer = new ArrowStreamWriter(vectorSchemaRoot, null, out)
>   }
> }
> {quote}
> But the following does not work:
> {quote}
> object A {
>   var vectorSchemaRoot: VectorSchemaRoot = null
>   var writer: ArrowStreamWriter = null
>   def write(): Unit = {
>     vectorSchemaRoot = VectorSchemaRoot.create(getSchema, allocator)
>     writer = new ArrowStreamWriter(vectorSchemaRoot, null, out)
>   }
> }
> {quote}
> The error is:
> {quote}
> java.lang.IllegalStateException: wrong buffer size: 601 != 4081
>   at org.apache.arrow.vector.ipc.message.MessageSerializer.writeBatchBuffers(MessageSerializer.java:297)
>   at org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:267)
>   at org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132)
>   at org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120)
>
> java.lang.IndexOutOfBoundsException: index: 0, length: 1 (expected: range(0, 0))
>   at io.netty.buffer.ArrowBuf.checkIndexD(ArrowBuf.java:337)
>   at io.netty.buffer.ArrowBuf.chk(ArrowBuf.java:324)
>   at io.netty.buffer.ArrowBuf.getByte(ArrowBuf.java:526)
>   at org.apache.arrow.vector.BitVectorHelper.setBit(BitVectorHelper.java:70)
>   at org.apache.arrow.vector.Float4Vector.set(Float4Vector.java:168)
>
> java.lang.IllegalStateException: RefCnt has gone negative
>   at org.apache.arrow.util.Preconditions.checkState(Preconditions.java:458)
>   at org.apache.arrow.memory.BufferLedger.release(BufferLedger.java:134)
>   at org.apache.arrow.memory.BufferLedger.release(BufferLedger.java:108)
>   at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:441)
>   at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:708)
>   at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:226)
> {quote}
>
> The error is raised every second time I call the method, and it seems neither ArrowStreamWriter nor VectorSchemaRoot can be initialized this way. Why?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10007) [Python][CI] Add a nightly build to exercise hypothesis tests
Krisztian Szucs created ARROW-10007: --- Summary: [Python][CI] Add a nightly build to exercise hypothesis tests Key: ARROW-10007 URL: https://issues.apache.org/jira/browse/ARROW-10007 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration, Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs We have a couple of hypothesis tests which are especially useful to discover corner cases. We should have a crossbow nightly build to regularly run them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10004) [Python] Consider to raise or normalize if a timezone aware datetime.time object is encountered during conversion
[ https://issues.apache.org/jira/browse/ARROW-10004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-10004: Issue Type: Improvement (was: New Feature) > [Python] Consider to raise or normalize if a timezone aware datetime.time > object is encountered during conversion > - > > Key: ARROW-10004 > URL: https://issues.apache.org/jira/browse/ARROW-10004 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Priority: Major > > Python datetime.time objects may have timezone information attached, but > since the time types (time32 and time64) don't have that property in Arrow we > simply ignore it. > We should either raise an error or normalize to UTC. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10005) [C++] Add an Append method to the time builders which validates the input range
[ https://issues.apache.org/jira/browse/ARROW-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-10005: Issue Type: Improvement (was: New Feature) > [C++] Add an Append method to the time builders which validates the input > range > --- > > Key: ARROW-10005 > URL: https://issues.apache.org/jira/browse/ARROW-10005 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Priority: Major > > Seems like we don't have a method which validates the input value range for > time types. It would be handy to do the validation after converting from a > python object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10006) [C++][Python] Do not collect python iterators if not necessary
Krisztian Szucs created ARROW-10006: --- Summary: [C++][Python] Do not collect python iterators if not necessary Key: ARROW-10006 URL: https://issues.apache.org/jira/browse/ARROW-10006 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Krisztian Szucs When converting python objects to arrow array currently we always collect the input to a sequence, but this may be memory consuming in certain cases. For unknown sized iterators we could consume and temporarily store the seen items during inference potentially improving both the conversion time and peak memory usage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
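The optimization described above (consume an unknown-sized iterator once, temporarily storing only the items seen during type inference, instead of collecting the whole input into a sequence) can be sketched in plain Python. The `infer_type` and `infer_and_convert` helpers below are illustrative assumptions, not part of the pyarrow API:

```python
from itertools import chain, islice

def infer_type(sample):
    # Toy stand-in for Arrow's type inference: look at the buffered prefix.
    return type(sample[0]).__name__ if sample else "null"

def infer_and_convert(values, inference_window=64):
    """Consume `values` once: buffer only the prefix needed for type
    inference, then stream the rest without materializing everything."""
    it = iter(values)
    seen = list(islice(it, inference_window))  # temporary storage of seen items
    dtype = infer_type(seen)
    # Re-attach the buffered prefix so no element is lost or consumed twice.
    return dtype, chain(seen, it)

dtype, stream = infer_and_convert(x * x for x in range(10))
```

In this sketch peak memory is bounded by `inference_window` rather than by the full input length, which mirrors the improvement the issue proposes for both conversion time and peak memory usage.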
[jira] [Updated] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface
[ https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9924: -- Labels: pull-request-available (was: ) > [Python] Performance regression reading individual Parquet files using > Dataset interface > > > Key: ARROW-9924 > URL: https://issues.apache.org/jira/browse/ARROW-9924 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Critical > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > I haven't investigated very deeply but this seems symptomatic of a problem: > {code} > In [27]: df = pd.DataFrame({'A': np.random.randn(1000)}) > > > In [28]: pq.write_table(pa.table(df), 'test.parquet') > > > In [29]: timeit pq.read_table('test.parquet') > > > 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) > In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True) > > > 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10005) [C++] Add an Append method to the time builders which validates the input range
Krisztian Szucs created ARROW-10005: --- Summary: [C++] Add an Append method to the time builders which validates the input range Key: ARROW-10005 URL: https://issues.apache.org/jira/browse/ARROW-10005 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Krisztian Szucs Seems like we don't have a method which validates the input value range for time types. It would be handy to do the validation after converting from a python object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
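A range-validating append of the kind proposed above can be modelled outside the C++ builders; the `checked_append` helper and the per-unit limit table below are illustrative assumptions (a Python sketch, not the Arrow builder API), using the one-day value ranges implied by Arrow's time32/time64 types:

```python
# Exclusive upper bounds for a time-of-day value, per time unit.
ONE_DAY = {
    "s": 86_400,
    "ms": 86_400_000,
    "us": 86_400_000_000,
    "ns": 86_400_000_000_000,
}

def checked_append(builder, value, unit="us"):
    """Append `value` to `builder` only if it is a valid time of day."""
    limit = ONE_DAY[unit]
    if not 0 <= value < limit:
        raise ValueError(
            f"time value {value} out of range [0, {limit}) for unit {unit!r}")
    builder.append(value)

values = []
checked_append(values, 12 * 3_600_000_000, unit="us")  # noon in microseconds
```

Doing the check at append time, right after converting from a Python object, rejects out-of-range values before they ever reach the array.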
[jira] [Created] (ARROW-10004) [Python] Consider to raise or normalize if a timezone aware datetime.time object is encountered during conversion
Krisztian Szucs created ARROW-10004: --- Summary: [Python] Consider to raise or normalize if a timezone aware datetime.time object is encountered during conversion Key: ARROW-10004 URL: https://issues.apache.org/jira/browse/ARROW-10004 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Krisztian Szucs Python datetime.time objects may have timezone information attached, but since the time types (time32 and time64) don't have that property in Arrow we simply ignore it. We should either raise an error or normalize to UTC. -- This message was sent by Atlassian Jira (v8.3.4#803005)
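Both options mentioned above (raise an error, or normalize to UTC) can be illustrated with the standard library alone; `time_to_utc` below is a hypothetical helper sketching the proposed behaviour, not what pyarrow currently does:

```python
from datetime import time, timedelta, timezone

def time_to_utc(t: time, *, strict: bool = False) -> time:
    """Raise on a timezone-aware time, or shift it to UTC,
    instead of silently dropping the offset."""
    offset = t.utcoffset()
    if offset is None:
        return t  # naive time: nothing to normalize
    if strict:
        raise ValueError(f"timezone-aware time not supported: {t!r}")
    # A time has no date, so wrap the shifted value around midnight.
    seconds = (t.hour * 3600 + t.minute * 60 + t.second
               - int(offset.total_seconds())) % 86_400
    return time(seconds // 3600, seconds % 3600 // 60, seconds % 60,
                t.microsecond)

aware = time(1, 30, tzinfo=timezone(timedelta(hours=2)))
```

Here `time_to_utc(aware)` yields `time(23, 30)`: the same instant expressed in UTC, wrapped around midnight because a bare time carries no date.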
[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package
[ https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195668#comment-17195668 ] Paul Taylor commented on ARROW-8394: I've started work on a branch in my fork here[1], but have been occupied the last few weeks (work, moving, back injury, etc.). There's not much left to do, so I think I should be able to get it finished and PR'd this week. 1. https://github.com/trxcllnt/arrow/tree/typescript-3.9 > [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm > package > --- > > Key: ARROW-8394 > URL: https://issues.apache.org/jira/browse/ARROW-8394 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.16.0 >Reporter: Shyamal Shukla >Priority: Blocker > > Attempting to use apache-arrow within a web application, but typescript > compiler throws the following errors in some of arrow's .d.ts files > import \{ Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow"; > export class SomeClass { > . > . > constructor() { > const t = Table.from(''); > } > *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: > Class static side 'typeof Column' incorrectly extends base class static side > 'typeof Chunked'. Types of property 'new' are incompatible. > *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: > Subsequent property declarations must have the same type. Property 'schema' > must be of type 'Schema', but here has type 'Schema'. > 238 schema: Schema; > *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error > TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. > The types of 'slice(...).clone' are incompatible between these types. > the tsconfig.json file looks like > { > "compilerOptions": { > "target":"ES6", > "outDir": "dist", > "baseUrl": "src/" > }, > "exclude": ["dist"], > "include": ["src/*.ts"] > } -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10003) [C++] Create directories in CopyFiles when copying within the same filesystem
[ https://issues.apache.org/jira/browse/ARROW-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10003: - Assignee: Ben Kietzman (was: Apache Arrow JIRA Bot) > [C++] Create directories in CopyFiles when copying within the same filesystem > - > > Key: ARROW-10003 > URL: https://issues.apache.org/jira/browse/ARROW-10003 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 1.0.1 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > CopyFiles creates parent directories for destination files, but only when > copying between different filesystems. This behavior should be made consistent -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10003) [C++] Create directories in CopyFiles when copying within the same filesystem
[ https://issues.apache.org/jira/browse/ARROW-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10003: - Assignee: Apache Arrow JIRA Bot (was: Ben Kietzman) > [C++] Create directories in CopyFiles when copying within the same filesystem > - > > Key: ARROW-10003 > URL: https://issues.apache.org/jira/browse/ARROW-10003 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 1.0.1 >Reporter: Ben Kietzman >Assignee: Apache Arrow JIRA Bot >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > CopyFiles creates parent directories for destination files, but only when > copying between different filesystems. This behavior should be made consistent -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10003) [C++] Create directories in CopyFiles when copying within the same filesystem
[ https://issues.apache.org/jira/browse/ARROW-10003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10003: --- Labels: pull-request-available (was: ) > [C++] Create directories in CopyFiles when copying within the same filesystem > - > > Key: ARROW-10003 > URL: https://issues.apache.org/jira/browse/ARROW-10003 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 1.0.1 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > CopyFiles creates parent directories for destination files, but only when > copying between different filesystems. This behavior should be made consistent -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9775) [C++] Automatic S3 region selection
[ https://issues.apache.org/jira/browse/ARROW-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-9775: - Assignee: Antoine Pitrou > [C++] Automatic S3 region selection > --- > > Key: ARROW-9775 > URL: https://issues.apache.org/jira/browse/ARROW-9775 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Python > Environment: macOS, Linux. >Reporter: Sahil Gupta >Assignee: Antoine Pitrou >Priority: Major > Labels: filesystem > Fix For: 2.0.0 > > > Currently, PyArrow and ArrowCpp need to be provided the region of the S3 > file/bucket, else it defaults to using 'us-east-1'. Ideally, PyArrow and > ArrowCpp can automatically detect the region and get the files, etc. For > instance, s3fs and boto3 can read and write files without having to specify > the region explicitly. Similar functionality to auto-detect the region would > be great to have in PyArrow and ArrowCpp. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10003) [C++] Create directories in CopyFiles when copying within the same filesystem
Ben Kietzman created ARROW-10003: Summary: [C++] Create directories in CopyFiles when copying within the same filesystem Key: ARROW-10003 URL: https://issues.apache.org/jira/browse/ARROW-10003 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 1.0.1 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 2.0.0 CopyFiles creates parent directories for destination files, but only when copying between different filesystems. This behavior should be made consistent -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface
[ https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195647#comment-17195647 ] Wes McKinney commented on ARROW-9924: - My principal concern is addressing the performance regressions, which are especially grave considering that they affect one of the (if not *the*) most-called user-facing APIs in the whole Arrow project. The other questions we can investigate as follow-up matters. > [Python] Performance regression reading individual Parquet files using > Dataset interface > > > Key: ARROW-9924 > URL: https://issues.apache.org/jira/browse/ARROW-9924 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Critical > Fix For: 2.0.0 > > > I haven't investigated very deeply but this seems symptomatic of a problem: > {code} > In [27]: df = pd.DataFrame({'A': np.random.randn(1000)}) > > > In [28]: pq.write_table(pa.table(df), 'test.parquet') > > > In [29]: timeit pq.read_table('test.parquet') > > > 79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) > In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True) > > > 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Strand updated ARROW-10002: Description: Trait specialization is widely used in the Rust Arrow implementation. Uses can be identified by searching for instances of {{default fn}} in the codebase: {code:java} $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3{code} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. was: Trait specialization is widely used in the Rust Arrow implementation. 
Uses can be identified by searching for instances of {{default fn}} in the codebase: {code:java} $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3{code} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , primarily due to an [unresolved soundness hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. > [Rust] Trait-specialization requries nightly > > > Key: ARROW-10002 > URL: https://issues.apache.org/jira/browse/ARROW-10002 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Kyle Strand >Priority: Major > > Trait specialization is widely used in the Rust Arrow implementation. 
Uses > can be identified by searching for instances of {{default fn}} in the > codebase: > > {code:java} > $> rg -c 'default fn' ../arrow/rust/ > ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 > ../arrow/rust/parquet/src/column/writer.rs:2 > ../arrow/rust/parquet/src/encodings/encoding.rs:16 > ../arrow/rust/parquet/src/arrow/record_reader.rs:1 > ../arrow/rust/parquet/src/encodings/decoding.rs:13 > ../arrow/rust/parquet/src/file/statistics.rs:1 > ../arrow/rust/arrow/src/array/builder.rs:7 > ../arrow/rust/arrow/src/array/array.rs:3 > ../arrow/rust/arrow/src/array/equal.rs:3{code} > > This feature requires Nightly Rust. Additionally, there is [no schedule for > stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] > , primarily due to an [unresolved soundness > hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there > has been further discussion and ideas for resolving the soundness issue, but > to my knowledge no definitive action.) > If we can remove specialization from the Rust codebase, we will not be > blocked on the Rust team's stabilization of that feature in order to move to > stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Strand updated ARROW-10002: Description: Trait specialization is widely used in the Rust Arrow implementation. Uses can be identified by searching for instances of {{default fn}} in the codebase: {code:java} $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3{code} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , primarily due to an [unresolved soundness hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. was: Trait specialization is widely used in the Rust Arrow implementation. 
Uses can be identified by searching for instances of `default fn` in the codebase: {code:java} $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3{code} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] , primarily due to an [unresolved soundness hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.)|http://aturon.github.io/tech/2017/07/08/lifetime-dispatch/].] If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. > [Rust] Trait-specialization requries nightly > > > Key: ARROW-10002 > URL: https://issues.apache.org/jira/browse/ARROW-10002 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Kyle Strand >Priority: Major > > Trait specialization is widely used in the Rust Arrow implementation. 
Uses > can be identified by searching for instances of {{default fn}} in the > codebase: > > {code:java} > $> rg -c 'default fn' ../arrow/rust/ > ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 > ../arrow/rust/parquet/src/column/writer.rs:2 > ../arrow/rust/parquet/src/encodings/encoding.rs:16 > ../arrow/rust/parquet/src/arrow/record_reader.rs:1 > ../arrow/rust/parquet/src/encodings/decoding.rs:13 > ../arrow/rust/parquet/src/file/statistics.rs:1 > ../arrow/rust/arrow/src/array/builder.rs:7 > ../arrow/rust/arrow/src/array/array.rs:3 > ../arrow/rust/arrow/src/array/equal.rs:3{code} > > This feature requires Nightly Rust. Additionally, there is [no schedule for > stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289] > , primarily due to an [unresolved soundness > hole|[http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch]. > (Note: there has been further discussion and ideas for resolving the > soundness issue, but to my knowledge no definitive action.) > If we can remove specialization from the Rust codebase, we will not be > blocked on the Rust team's stabilization of that feature in order to move to > stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Strand updated ARROW-10002: Description: Trait specialization is widely used in the Rust Arrow implementation. Uses can be identified by searching for instances of {{default fn}} in the codebase: {code:java} $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3{code} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. was: Trait specialization is widely used in the Rust Arrow implementation. 
Uses can be identified by searching for instances of {{default fn}} in the codebase: {code:java} $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3{code} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. > [Rust] Trait-specialization requires nightly > > > Key: ARROW-10002 > URL: https://issues.apache.org/jira/browse/ARROW-10002 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Kyle Strand >Priority: Major > > Trait specialization is widely used in the Rust Arrow implementation. 
Uses > can be identified by searching for instances of {{default fn}} in the > codebase: > > {code:java} > $> rg -c 'default fn' ../arrow/rust/ > ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 > ../arrow/rust/parquet/src/column/writer.rs:2 > ../arrow/rust/parquet/src/encodings/encoding.rs:16 > ../arrow/rust/parquet/src/arrow/record_reader.rs:1 > ../arrow/rust/parquet/src/encodings/decoding.rs:13 > ../arrow/rust/parquet/src/file/statistics.rs:1 > ../arrow/rust/arrow/src/array/builder.rs:7 > ../arrow/rust/arrow/src/array/array.rs:3 > ../arrow/rust/arrow/src/array/equal.rs:3{code} > > This feature requires Nightly Rust. Additionally, there is [no schedule for > stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], > primarily due to an [unresolved soundness > hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: > there has been further discussion and ideas for resolving the > soundness issue, but to my knowledge no definitive action.) > If we can remove specialization from the Rust codebase, we will not be > blocked on the Rust team's stabilization of that feature in order to move to > stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195638#comment-17195638 ] Andy Grove commented on ARROW-10002: Thanks [~batmanaod], this looks really interesting. [~paddyhoran] [~nevime] [~sunchao] [~alamb] [~jorgecarleitao] [~jhorstmann] will likely be interested in this. > [Rust] Trait-specialization requires nightly > > > Key: ARROW-10002 > URL: https://issues.apache.org/jira/browse/ARROW-10002 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Kyle Strand >Priority: Major > > Trait specialization is widely used in the Rust Arrow implementation. Uses > can be identified by searching for instances of `default fn` in the codebase: > > {code:java} > $> rg -c 'default fn' ../arrow/rust/ > ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 > ../arrow/rust/parquet/src/column/writer.rs:2 > ../arrow/rust/parquet/src/encodings/encoding.rs:16 > ../arrow/rust/parquet/src/arrow/record_reader.rs:1 > ../arrow/rust/parquet/src/encodings/decoding.rs:13 > ../arrow/rust/parquet/src/file/statistics.rs:1 > ../arrow/rust/arrow/src/array/builder.rs:7 > ../arrow/rust/arrow/src/array/array.rs:3 > ../arrow/rust/arrow/src/array/equal.rs:3{code} > > This feature requires Nightly Rust. Additionally, there is [no schedule for > stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], > primarily due to an [unresolved soundness > hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: > there has been further discussion and ideas for resolving the soundness > issue, but to my knowledge no definitive action.) > If we can remove specialization from the Rust codebase, we will not be > blocked on the Rust team's stabilization of that feature in order to move to > stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails
[ https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9859. --- Resolution: Fixed Issue resolved by pull request 8185 [https://github.com/apache/arrow/pull/8185] > [C++] S3 FileSystemFromUri with special char in secret key fails > > > Key: ARROW-9859 > URL: https://issues.apache.org/jira/browse/ARROW-9859 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Python >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > S3 Secret access keys can contain special characters like {{/}}. When they do > 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them > (e.g. replace / with %2F) > 2) When you do escape the special characters, requests that require > authorization fail with the message "The request signature we calculated does > not match the signature you provided. Check your key and signing method." > This may suggest that there's some extra URL encoding/decoding that needs to > happen inside. > I was only able to work around this by generating a new access key that > happened not to have special characters. -- This message was sent by Atlassian Jira (v8.3.4#803005)
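The percent-encoding workaround described in point 1 can be sketched in plain Python with the standard library. Note the credentials and the {{s3_uri}} helper below are made up for illustration, not a pyarrow API:

```python
from urllib.parse import quote, unquote

def s3_uri(bucket, access_key, secret_key, region="us-east-1"):
    # Percent-encode the credentials so characters such as '/' survive URI
    # parsing; safe="" makes quote() encode '/' as %2F instead of keeping it.
    user = quote(access_key, safe="")
    password = quote(secret_key, safe="")
    return f"s3://{user}:{password}@{bucket}?region={region}"

# Made-up credentials purely for illustration.
uri = s3_uri("my-bucket", "AKIAEXAMPLE", "abc/def+ghi")

# The filesystem implementation must decode the credentials again before
# signing requests; signing over the escaped text produces the mismatch
# error quoted above.
decoded = unquote("abc%2Fdef%2Bghi")
```

The second failure mode in the report is consistent with the decode step being skipped somewhere inside the C++ implementation.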
[jira] [Commented] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface
[ https://issues.apache.org/jira/browse/ARROW-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195633#comment-17195633 ] Ben Kietzman commented on ARROW-9924: - {quote} Looking at the top of the hierarchical perf report for the "new" code, the deeply nested layers of iterators strikes me as one thing to think more about whether that's the design we want {quote} To be clear, is the concern over clarity or performance? IIUC [https://gist.github.com/wesm/3e3eeb6b7f5f22650f18e69e206c2eb8#file-gistfile1-txt-L8-L20] represents minimal cost since 0.65% of runtime was spent managing the Iterator abstraction. If we wanted to replace our abstraction for lazy sequences we could potentially refactor to a {{Future}}-based iteration. Did you have a replacement in mind? {quote} why ProjectRecordBatch and FilterRecordBatch being used? Nothing is being projected nor filtered {quote} We don't explicitly elide them when the projection or filter is trivial. I could try to benchmark whether there is a significant performance benefit to adding a special case for trivial projection/filtering, but I'd guess we don't gain anything. Another potential bandaid fix would be to allow column level parallelism when scanning a single file (since no thread contention would be incurred) (combined with increasing batch size). > [Python] Performance regression reading individual Parquet files using > Dataset interface > > > Key: ARROW-9924 > URL: https://issues.apache.org/jira/browse/ARROW-9924 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Critical > Fix For: 2.0.0 > > > I haven't investigated very deeply but this seems symptomatic of a problem: > {code} > In [27]: df = pd.DataFrame({'A': np.random.randn(1000)}) > > > In [28]: pq.write_table(pa.table(df), 'test.parquet') > > > In [29]: timeit pq.read_table('test.parquet') > > > 79.8 ms ± 1.25 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each) > In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True) > > > 66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9465) [Python] Improve ergonomics of compute functions
[ https://issues.apache.org/jira/browse/ARROW-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-9465. --- Resolution: Fixed Issue resolved by pull request 8163 [https://github.com/apache/arrow/pull/8163] > [Python] Improve ergonomics of compute functions > > > Key: ARROW-9465 > URL: https://issues.apache.org/jira/browse/ARROW-9465 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Introspection of exported compute functions currently yields suboptimal output: > {code:python} > >>> from pyarrow import compute as pc > >>> pc.list_flatten > <function pyarrow.compute.func(arg)> > >>> ?pc.list_flatten > Signature: pc.list_flatten(arg) > Docstring: > File: ~/arrow/dev/python/pyarrow/compute.py > Type: function > >>> help(pc.list_flatten) > Help on function func in module pyarrow.compute: > func(arg) > {code} > The function should ideally have: > * the right global name > * an appropriate signature > * a docstring -- This message was sent by Atlassian Jira (v8.3.4#803005)
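The general mechanism for fixing this kind of introspection problem is to set {{__name__}}, {{__qualname__}}, {{__doc__}}, and {{__signature__}} on the dynamically generated wrapper. The sketch below shows that mechanism only; {{make_unary_wrapper}} is a hypothetical helper, not pyarrow's actual code:

```python
import inspect

def make_unary_wrapper(name, doc, impl):
    # Hypothetical wrapper factory for illustration; pyarrow's code differs.
    def func(arg):
        return impl(arg)
    # Without these assignments, introspection shows the generic 'func(arg)'.
    func.__name__ = name
    func.__qualname__ = name
    func.__doc__ = doc
    func.__signature__ = inspect.signature(func)
    return func

# A stand-in "kernel": identity instead of a real Arrow compute function.
list_flatten = make_unary_wrapper("list_flatten", "Flatten list values.", lambda a: a)
```

With this, {{help(list_flatten)}} and IPython's {{?}} show the right name, signature, and docstring instead of the generic wrapper.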
[jira] [Commented] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195629#comment-17195629 ] Kyle Strand commented on ARROW-10002: - I have put together a repository with a minimal example of one way in which we're using specialization in the `array` module: [https://github.com/BatmanAoD/arrow-rust-specialization-alternatives] The {{master}} branch shows how the code is written currently. This pull request shows how we could avoid specialization by introducing an "indexing" method associated with each primitive type: https://github.com/BatmanAoD/arrow-rust-specialization-alternatives/pull/1 > [Rust] Trait-specialization requires nightly > > > Key: ARROW-10002 > URL: https://issues.apache.org/jira/browse/ARROW-10002 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Kyle Strand >Priority: Major > > Trait specialization is widely used in the Rust Arrow implementation. Uses > can be identified by searching for instances of `default fn` in the codebase: > > {code:java} > $> rg -c 'default fn' ../arrow/rust/ > ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 > ../arrow/rust/parquet/src/column/writer.rs:2 > ../arrow/rust/parquet/src/encodings/encoding.rs:16 > ../arrow/rust/parquet/src/arrow/record_reader.rs:1 > ../arrow/rust/parquet/src/encodings/decoding.rs:13 > ../arrow/rust/parquet/src/file/statistics.rs:1 > ../arrow/rust/arrow/src/array/builder.rs:7 > ../arrow/rust/arrow/src/array/array.rs:3 > ../arrow/rust/arrow/src/array/equal.rs:3{code} > > This feature requires Nightly Rust. Additionally, there is [no schedule for > stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], > primarily due to an [unresolved soundness > hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. 
(Note: > there has been further discussion and ideas for resolving the soundness > issue, but to my knowledge no definitive action.) > If we can remove specialization from the Rust codebase, we will not be > blocked on the Rust team's stabilization of that feature in order to move to > stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
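The refactor linked above replaces a specialized {{default fn}} with an indexing method that each primitive type defines itself. The shape of that change can be sketched in Python (the actual code is Rust; all names below are illustrative only):

```python
# Sketch only: instead of one generic accessor specialized per type (the
# nightly-only `default fn` pattern in Rust), each primitive type carries
# its own indexing method, so ordinary per-type dispatch is enough.

class Int32Type:
    @staticmethod
    def value(buffer, i):
        return int(buffer[i])

class Float64Type:
    @staticmethod
    def value(buffer, i):
        return float(buffer[i])

def array_value(type_cls, buffer, i):
    # Generic code just calls the per-type method; no specialization needed.
    return type_cls.value(buffer, i)
```

The generic code stays generic, but the type-specific behavior lives on the type itself rather than in a specialized override, which is expressible on stable Rust.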
[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Strand updated ARROW-10002: Description: Trait specialization is widely used in the Rust Arrow implementation. Uses can be identified by searching for instances of `default fn` in the codebase: {code:java} $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3{code} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. was: Trait specialization is widely used in the Rust Arrow implementation. 
Uses can be identified by searching for instances of `default fn` in the codebase: {code:java} $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3{code} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. > [Rust] Trait-specialization requires nightly > > > Key: ARROW-10002 > URL: https://issues.apache.org/jira/browse/ARROW-10002 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Kyle Strand >Priority: Major > > Trait specialization is widely used in the Rust Arrow implementation. 
Uses > can be identified by searching for instances of `default fn` in the codebase: > > {code:java} > $> rg -c 'default fn' ../arrow/rust/ > ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 > ../arrow/rust/parquet/src/column/writer.rs:2 > ../arrow/rust/parquet/src/encodings/encoding.rs:16 > ../arrow/rust/parquet/src/arrow/record_reader.rs:1 > ../arrow/rust/parquet/src/encodings/decoding.rs:13 > ../arrow/rust/parquet/src/file/statistics.rs:1 > ../arrow/rust/arrow/src/array/builder.rs:7 > ../arrow/rust/arrow/src/array/array.rs:3 > ../arrow/rust/arrow/src/array/equal.rs:3{code} > > This feature requires Nightly Rust. Additionally, there is [no schedule for > stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], > primarily due to an [unresolved soundness > hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: > there has been further discussion and ideas for resolving the > soundness issue, but to my knowledge no definitive action.) > If we can remove specialization from the Rust codebase, we will not be > blocked on the Rust team's stabilization of that feature in order to move to > stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10002) [Rust] Trait-specialization requires nightly
[ https://issues.apache.org/jira/browse/ARROW-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Strand updated ARROW-10002: Description: Trait specialization is widely used in the Rust Arrow implementation. Uses can be identified by searching for instances of `default fn` in the codebase: {code:java} $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3{code} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. was: Trait specialization is widely used in the Rust Arrow implementation. 
Uses can be identified by searching for instances of `default fn` in the codebase: {{ $> rg -c 'default fn' ../arrow/rust/}} {{ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1}} {{ ../arrow/rust/parquet/src/column/writer.rs:2}} {{ ../arrow/rust/parquet/src/encodings/encoding.rs:16}} {{ ../arrow/rust/parquet/src/arrow/record_reader.rs:1}} {{ ../arrow/rust/parquet/src/encodings/decoding.rs:13}} {{ ../arrow/rust/parquet/src/file/statistics.rs:1}} {{ ../arrow/rust/arrow/src/array/builder.rs:7}} {{ ../arrow/rust/arrow/src/array/array.rs:3}} {{ ../arrow/rust/arrow/src/array/equal.rs:3}} This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. > [Rust] Trait-specialization requires nightly > > > Key: ARROW-10002 > URL: https://issues.apache.org/jira/browse/ARROW-10002 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Kyle Strand >Priority: Major > > Trait specialization is widely used in the Rust Arrow implementation. 
Uses > can be identified by searching for instances of `default fn` in the codebase: > {code:java} > $> rg -c 'default fn' ../arrow/rust/ > ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 > ../arrow/rust/parquet/src/column/writer.rs:2 > ../arrow/rust/parquet/src/encodings/encoding.rs:16 > ../arrow/rust/parquet/src/arrow/record_reader.rs:1 > ../arrow/rust/parquet/src/encodings/decoding.rs:13 > ../arrow/rust/parquet/src/file/statistics.rs:1 > ../arrow/rust/arrow/src/array/builder.rs:7 > ../arrow/rust/arrow/src/array/array.rs:3 > ../arrow/rust/arrow/src/array/equal.rs:3{code} > > This feature requires Nightly Rust. Additionally, there is [no schedule for > stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], > primarily due to an [unresolved soundness > hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: > there has been further discussion and ideas for resolving the soundness > issue, but to my knowledge no definitive action.) > If we can remove specialization from the Rust codebase, we will not be > blocked on the Rust team's stabilization of that feature in order to move to > stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10002) [Rust] Trait-specialization requires nightly
Kyle Strand created ARROW-10002: --- Summary: [Rust] Trait-specialization requires nightly Key: ARROW-10002 URL: https://issues.apache.org/jira/browse/ARROW-10002 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Kyle Strand Trait specialization is widely used in the Rust Arrow implementation. Uses can be identified by searching for instances of `default fn` in the codebase: ``` $> rg -c 'default fn' ../arrow/rust/ ../arrow/rust/parquet/src/util/test_common/rand_gen.rs:1 ../arrow/rust/parquet/src/column/writer.rs:2 ../arrow/rust/parquet/src/encodings/encoding.rs:16 ../arrow/rust/parquet/src/arrow/record_reader.rs:1 ../arrow/rust/parquet/src/encodings/decoding.rs:13 ../arrow/rust/parquet/src/file/statistics.rs:1 ../arrow/rust/arrow/src/array/builder.rs:7 ../arrow/rust/arrow/src/array/array.rs:3 ../arrow/rust/arrow/src/array/equal.rs:3 ``` This feature requires Nightly Rust. Additionally, there is [no schedule for stabilization|https://github.com/rust-lang/rust/issues/31844#issue-135807289], primarily due to an [unresolved soundness hole|http://aturon.github.io/blog/2017/07/08/lifetime-dispatch/]. (Note: there has been further discussion and ideas for resolving the soundness issue, but to my knowledge no definitive action.) If we can remove specialization from the Rust codebase, we will not be blocked on the Rust team's stabilization of that feature in order to move to stable Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package
[ https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195577#comment-17195577 ] Tim Conkling edited comment on ARROW-8394 at 9/14/20, 4:35 PM: --- This is intended with all respect - this is a complex project, and I appreciate the work being done on it! - but I'm surprised by this response. [~wesm], if nobody is looking at this issue, does that mean that the JavaScript library is not a priority (or not being maintained anymore)? (As a user of the project, I'm trying to calibrate my expectations for its future. And as a developer on other open source projects, I recognize that it can be supremely frustrating when others feel entitled to ongoing free support - that's not my intent! :)) was (Author: timconkling): This is intended with all respect - this is a complex project, and I appreciate the work being done on it! - but I'm surprised by this response. [~wesm], if nobody is looking at this issue, does that mean that the JavaScript library is not a priority (or not being maintained anymore)? (As a user of the project, I'm trying to gauge my expectations for the project. And as a developer on other open source projects, I recognize that it can be supremely frustrating when others feel entitled to ongoing free support - that's not my intent! :)) > [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm > package > --- > > Key: ARROW-8394 > URL: https://issues.apache.org/jira/browse/ARROW-8394 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.16.0 >Reporter: Shyamal Shukla >Priority: Blocker > > Attempting to use apache-arrow within a web application, but typescript > compiler throws the following errors in some of arrow's .d.ts files > import \{ Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow"; > export class SomeClass { > . > . 
> constructor() { > const t = Table.from(''); > } > *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: > Class static side 'typeof Column' incorrectly extends base class static side > 'typeof Chunked'. Types of property 'new' are incompatible. > *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: > Subsequent property declarations must have the same type. Property 'schema' > must be of type 'Schema', but here has type 'Schema'. > 238 schema: Schema; > *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error > TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. > The types of 'slice(...).clone' are incompatible between these types. > the tsconfig.json file looks like > { > "compilerOptions": { > "target":"ES6", > "outDir": "dist", > "baseUrl": "src/" > }, > "exclude": ["dist"], > "include": ["src/*.ts"] > } -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package
[ https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195577#comment-17195577 ] Tim Conkling commented on ARROW-8394: - This is intended with all respect - this is a complex project, and I appreciate the work being done on it! - but I'm surprised by this response. [~wesm], if nobody is looking at this issue, does that mean that the JavaScript library is not a priority (or not being maintained anymore)? (As a user of the project, I'm trying to gauge my expectations for the project. And as a developer on other open source projects, I recognize that it can be supremely frustrating when others feel entitled to ongoing free support - that's not my intent! :)) > [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm > package > --- > > Key: ARROW-8394 > URL: https://issues.apache.org/jira/browse/ARROW-8394 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.16.0 >Reporter: Shyamal Shukla >Priority: Blocker > > Attempting to use apache-arrow within a web application, but typescript > compiler throws the following errors in some of arrow's .d.ts files > import \{ Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow"; > export class SomeClass { > . > . > constructor() { > const t = Table.from(''); > } > *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: > Class static side 'typeof Column' incorrectly extends base class static side > 'typeof Chunked'. Types of property 'new' are incompatible. > *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: > Subsequent property declarations must have the same type. Property 'schema' > must be of type 'Schema', but here has type 'Schema'. > 238 schema: Schema; > *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error > TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types. > the tsconfig.json file looks like > { > "compilerOptions": { > "target":"ES6", > "outDir": "dist", > "baseUrl": "src/" > }, > "exclude": ["dist"], > "include": ["src/*.ts"] > } -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README
[ https://issues.apache.org/jira/browse/ARROW-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10001: --- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Add developer guide to README > - > > Key: ARROW-10001 > URL: https://issues.apache.org/jira/browse/ARROW-10001 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README
[ https://issues.apache.org/jira/browse/ARROW-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10001: - Assignee: Apache Arrow JIRA Bot (was: Jorge) > [Rust] [DataFusion] Add developer guide to README > - > > Key: ARROW-10001 > URL: https://issues.apache.org/jira/browse/ARROW-10001 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jorge >Assignee: Apache Arrow JIRA Bot >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README
[ https://issues.apache.org/jira/browse/ARROW-10001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-10001: - Assignee: Jorge (was: Apache Arrow JIRA Bot) > [Rust] [DataFusion] Add developer guide to README > - > > Key: ARROW-10001 > URL: https://issues.apache.org/jira/browse/ARROW-10001 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jorge >Assignee: Jorge >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-2651) [Python] Build & Test with PyPy
[ https://issues.apache.org/jira/browse/ARROW-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195571#comment-17195571 ] Niklas B edited comment on ARROW-2651 at 9/14/20, 4:29 PM: --- Besides GetContiguous we (and by "we" I mean Matti) needed to patch a few datetime-related things: [https://gist.github.com/mattip/c9c8398b58721ae5893dc8134c353f28] A build that works with the patch is available at [https://github.com/bivald/pyarrow-on-pypy3/tree/feature/latest-pypy-latest-pyarrow] As for the test suite, I had to disable the IO, misc and memory tests since they gave segfaults:
pytest pyarrow --ignore-glob='*test_io.py' --ignore-glob='*test_misc.py' --ignore-glob='*test_memory.py'
Gave: 33 failed, 2620 passed, 532 skipped, 13 xfailed, 10 warnings in 104.02s
== short test summary info ==
FAILED pyarrow2/tests/test_array.py::test_to_pandas_zero_copy - AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_array.py::test_array_slice - SystemError: Function returned an error result without setting an exception
FAILED pyarrow2/tests/test_array.py::test_array_ref_to_ndarray_base - AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_array.py::test_array_conversions_no_sentinel_values - AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_array.py::test_nbytes_sizeof - TypeError: getsizeof(...)
FAILED pyarrow2/tests/test_cffi.py::test_export_import_array - assert 1528 == 896
FAILED pyarrow2/tests/test_cffi.py::test_export_import_batch - assert 1048 == 128
FAILED pyarrow2/tests/test_convert_builtin.py::test_garbage_collection - assert 128 == 766912
FAILED pyarrow2/tests/test_convert_builtin.py::test_sequence_bytes - NotImplementedError: creating contiguous readonly buffer from non-contiguous not implemented yet
FAILED pyarrow2/tests/test_convert_builtin.py::test_map_from_dicts - AssertionError: Regex pattern 'integer is required' does not match 'expected integer, got str object'.
FAILED pyarrow2/tests/test_csv.py::test_read_options - Failed: DID NOT RAISE
FAILED pyarrow2/tests/test_csv.py::test_parse_options - Failed: DID NOT RAISE
FAILED pyarrow2/tests/test_csv.py::test_convert_options - Failed: DID NOT RAISE
FAILED pyarrow2/tests/test_csv.py::TestSerialStreamingCSVRead::test_batch_lifetime - AssertionError: assert 1464704 == 1464576
FAILED pyarrow2/tests/test_cython.py::test_cython_api - subprocess.CalledProcessError: Command '['/pyarrow/bin/pypy3', 'setup.py', 'build_ext', '--inplace']' returned non-zero exit status 1.
FAILED pyarrow2/tests/test_extension_type.py::test_ext_type__lifetime - AssertionError: assert UuidType(extension) is None
FAILED pyarrow2/tests/test_extension_type.py::test_uuid_type_pickle - AssertionError: assert UuidType(extension) is None
FAILED pyarrow2/tests/test_extension_type.py::test_ext_array_lifetime - AssertionError: assert ParamExtType(extension) is None
FAILED pyarrow2/tests/test_fs.py::test_py_filesystem_lifetime - AssertionError: assert is None
FAILED pyarrow2/tests/test_pandas.py::test_to_pandas_deduplicate_integers_as_objects - assert 100 == 991
FAILED pyarrow2/tests/test_pandas.py::test_array_uses_memory_pool - assert 103552 == 465152
FAILED pyarrow2/tests/test_pandas.py::test_to_pandas_self_destruct - assert 6112064 == 4112064
FAILED pyarrow2/tests/test_pandas.py::test_table_uses_memory_pool - assert 6249408 == 6112064
FAILED pyarrow2/tests/test_pandas.py::test_object_leak_in_numpy_array - AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_pandas.py::test_object_leak_in_dataframe - AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_schema.py::test_schema_sizeof - TypeError: getsizeof(...)
FAILED pyarrow2/tests/test_sparse_tensor.py::test_sparse_coo_tensor_base_object - AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_sparse_tensor.py::test_sparse_csr_matrix_base_object - AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_sparse_tensor.py::test_sparse_csf_tensor_base_object - AttributeError: module 'sys' has no attribute 'getrefcount'
FAILED pyarrow2/tests/test_table.py::test_chunked_array_basics - TypeError: getsizeof(...)
FAILED pyarrow2/tests/test_table.py::test_recordbatch_basics - TypeError: getsizeof(...)
FAILED pyarrow2/tests/test_table.py::test_table_basics - TypeError: getsizeof(...)
FAILED pyarrow2/tests/test_tensor.py::test_tensor_base_object - AttributeError: module 'sys' has no attribute 'getrefcount'
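Most of the failures above share one root cause: PyPy does not implement {{sys.getrefcount}}, which pyarrow's memory and lifetime tests rely on. A minimal sketch (plain Python, not pyarrow code) of guarding refcount-based checks so they degrade gracefully on such interpreters:

```python
import sys

def refcount_supported():
    # CPython exposes sys.getrefcount; PyPy and other GC-based
    # implementations do not, hence the AttributeError failures above.
    return hasattr(sys, "getrefcount")

def refcount_or_none(obj):
    # Return the interpreter's refcount when available, else None,
    # so lifetime checks can be skipped rather than crash.
    return sys.getrefcount(obj) if refcount_supported() else None
```

Under pytest, the same guard could drive a `skipif` marker on the affected tests (hypothetical placement, not the actual pyarrow test layout).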
[jira] [Created] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README
Jorge created ARROW-10001: - Summary: [Rust] [DataFusion] Add developer guide to README Key: ARROW-10001 URL: https://issues.apache.org/jira/browse/ARROW-10001 Project: Apache Arrow Issue Type: Improvement Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9995) [R] Snappy Codec Support not built
[ https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195563#comment-17195563 ] Joska Lako commented on ARROW-9995: --- Thanks, that worked in the end! > [R] Snappy Codec Support not built > -- > > Key: ARROW-9995 > URL: https://issues.apache.org/jira/browse/ARROW-9995 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.0, 1.0.1 >Reporter: Joska Lako >Assignee: Neal Richardson >Priority: Major > Labels: Snappy > Attachments: ErrorScreenshot.PNG > > Original Estimate: 24h > Remaining Estimate: 24h > > I am reading my file on a Linux based server which has no Snappy compression. > Even though I call the function to do uncompressed compression. I still get > an error Snappy codec support not built. How do I overcome this error and > read a parquet file without snappy codec on linux? > read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED') > Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: > NotImplemented: Snappy codec support not built -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-4432) [Python][Hypothesis] Empty table - pandas roundtrip produces unequal tables
[ https://issues.apache.org/jira/browse/ARROW-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-4432: --- Summary: [Python][Hypothesis] Empty table - pandas roundtrip produces unequal tables (was: [Python][Hypothesis] Empty table - pandas roundtrip produces inequal tables) > [Python][Hypothesis] Empty table - pandas roundtrip produces unequal tables > --- > > Key: ARROW-4432 > URL: https://issues.apache.org/jira/browse/ARROW-4432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Krisztian Szucs >Priority: Major > Labels: hypothesis > > The following test case fails for empty tables: > {code:python} > import hypothesis as h > import pyarrow.tests.strategies as past > @h.given(past.all_tables) > def test_pandas_roundtrip(table): > df = table.to_pandas() > table_ = pa.Table.from_pandas(df) > assert table == table_ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
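The failing property is a roundtrip invariant: converting a table to pandas and back should yield an equal table, with the empty table as the edge case. A stdlib analogue of the same test shape (using `json` instead of pyarrow, purely for illustration):

```python
import json

def roundtrip(value):
    # Serialize then deserialize; a faithful roundtrip returns an equal value.
    return json.loads(json.dumps(value))

# Empty containers are the classic edge case for roundtrip properties,
# which is exactly the case ARROW-4432 reports as failing for tables.
assert roundtrip({}) == {}
assert roundtrip([1, 2, 3]) == [1, 2, 3]
```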
[jira] [Updated] (ARROW-9992) [C++][Python] Refactor python to arrow conversions based on a reusable conversion API
[ https://issues.apache.org/jira/browse/ARROW-9992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9992: -- Labels: pull-request-available (was: ) > [C++][Python] Refactor python to arrow conversions based on a reusable > conversion API > -- > > Key: ARROW-9992 > URL: https://issues.apache.org/jira/browse/ARROW-9992 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > We have a lot of technical debt accumulated in the python to arrow conversion > code paths including hidden bugs. We need to simplify the implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10000) [C++][Python] Support constructing StructArray from list of key-value pairs
[ https://issues.apache.org/jira/browse/ARROW-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-10000: Fix Version/s: 2.0.0 > [C++][Python] Support constructing StructArray from list of key-value pairs > --- > > Key: ARROW-10000 > URL: https://issues.apache.org/jira/browse/ARROW-10000 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Krisztian Szucs >Priority: Major > Fix For: 2.0.0 > > > {code:python} > item = [ > ('a', 1), > ('b', 2) > ] > ty = pa.struct([ > pa.field('a', type=pa.int8()), > pa.field('b', type=pa.float64()) > ]) > pa.array([item], type=ty) > {code} > raises > {code} > ArrowTypeError: Could not convert [('a', 1), ('b', 2)] with type list: was > not a dict, tuple, or recognized null value for conversion to struct type > {code} > This feature is required for {{pa.repeat(scalar, n)}} roundtrip if the type > contains duplicated field names. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10000) [C++][Python] Support constructing StructArray from list of key-value pairs
[ https://issues.apache.org/jira/browse/ARROW-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-10000: Component/s: Python > [C++][Python] Support constructing StructArray from list of key-value pairs > --- > > Key: ARROW-10000 > URL: https://issues.apache.org/jira/browse/ARROW-10000 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Krisztian Szucs >Priority: Major > Fix For: 2.0.0 > > > {code:python} > item = [ > ('a', 1), > ('b', 2) > ] > ty = pa.struct([ > pa.field('a', type=pa.int8()), > pa.field('b', type=pa.float64()) > ]) > pa.array([item], type=ty) > {code} > raises > {code} > ArrowTypeError: Could not convert [('a', 1), ('b', 2)] with type list: was > not a dict, tuple, or recognized null value for conversion to struct type > {code} > This feature is required for {{pa.repeat(scalar, n)}} roundtrip if the type > contains duplicated field names. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10000) [C++][Python] Support constructing StructArray from list of key-value pairs
[ https://issues.apache.org/jira/browse/ARROW-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-10000: Description: {code:python} item = [ ('a', 1), ('b', 2) ] ty = pa.struct([ pa.field('a', type=pa.int8()), pa.field('b', type=pa.float64()) ]) pa.array([item], type=ty) {code} raises {code} ArrowTypeError: Could not convert [('a', 1), ('b', 2)] with type list: was not a dict, tuple, or recognized null value for conversion to struct type {code} This feature is required for {{pa.repeat(scalar, n)}} roundtrip if the type contains duplicated field names. > [C++][Python] Support constructing StructArray from list of key-value pairs > --- > > Key: ARROW-10000 > URL: https://issues.apache.org/jira/browse/ARROW-10000 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Krisztian Szucs >Priority: Major > > {code:python} > item = [ > ('a', 1), > ('b', 2) > ] > ty = pa.struct([ > pa.field('a', type=pa.int8()), > pa.field('b', type=pa.float64()) > ]) > pa.array([item], type=ty) > {code} > raises > {code} > ArrowTypeError: Could not convert [('a', 1), ('b', 2)] with type list: was > not a dict, tuple, or recognized null value for conversion to struct type > {code} > This feature is required for {{pa.repeat(scalar, n)}} roundtrip if the type > contains duplicated field names. -- This message was sent by Atlassian Jira (v8.3.4#803005)
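Until list-of-pairs input is supported, the tuple form can be converted to the dict form that struct-typed {{pa.array()}} already accepts. As the issue notes, though, this workaround breaks down for types with duplicated field names, since a dict cannot hold them. A sketch (hypothetical helper, not pyarrow API) assuming unique names:

```python
def pairs_to_dict(pairs):
    # Convert [('a', 1), ('b', 2)] into the {'a': 1, 'b': 2} form that
    # struct-typed pa.array() already accepts. Only valid when field
    # names are unique; duplicates would silently collapse.
    names = [name for name, _ in pairs]
    if len(names) != len(set(names)):
        raise ValueError("duplicate field names cannot be represented as a dict")
    return dict(pairs)

item = [('a', 1), ('b', 2)]
assert pairs_to_dict(item) == {'a': 1, 'b': 2}
```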
[jira] [Updated] (ARROW-9999) [Python] Support constructing dictionary array directly through pa.array()
[ https://issues.apache.org/jira/browse/ARROW-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9999: --- Description: {code:python} pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) {code} raises {code} ArrowNotImplementedError: Sequence converter for type dictionary not implemented {code} It would be a much more comfortable way than {code:python} pa.DictionaryArray.from_arrays(indices, dictionary) {code} was: {code:python} pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) {code} raises {code} ArrowNotImplementedError: Sequence converter for type dictionary not implemented {code} > [Python] Support constructing dictionary array directly through pa.array() > -- > > Key: ARROW-9999 > URL: https://issues.apache.org/jira/browse/ARROW-9999 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Krisztian Szucs >Priority: Major > > {code:python} > pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) > {code} > raises > {code} > ArrowNotImplementedError: Sequence converter for type > dictionary not implemented > {code} > It would be a much more comfortable way than > {code:python} > pa.DictionaryArray.from_arrays(indices, dictionary) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10000) [C++][Python] Support constructing StructArray from list of key-value pairs
Krisztian Szucs created ARROW-10000: --- Summary: [C++][Python] Support constructing StructArray from list of key-value pairs Key: ARROW-10000 URL: https://issues.apache.org/jira/browse/ARROW-10000 Project: Apache Arrow Issue Type: New Feature Reporter: Krisztian Szucs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9995) [R] Snappy Codec Support not built
[ https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-9995: --- Component/s: (was: C++) > [R] Snappy Codec Support not built > -- > > Key: ARROW-9995 > URL: https://issues.apache.org/jira/browse/ARROW-9995 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.0, 1.0.1 >Reporter: Joska Lako >Priority: Major > Labels: Snappy > Attachments: ErrorScreenshot.PNG > > Original Estimate: 24h > Remaining Estimate: 24h > > I am reading my file on a Linux based server which has no Snappy compression. > Even though I call the function to do uncompressed compression. I still get > an error Snappy codec support not built. How do I overcome this error and > read a parquet file without snappy codec on linux? > read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED') > Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: > NotImplemented: Snappy codec support not built -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9999) [Python] Support constructing dictionary array directly through pa.array()
[ https://issues.apache.org/jira/browse/ARROW-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9999: --- Description: {code:python} pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) {code} raises {code} ArrowNotImplementedError: Sequence converter for type dictionary not implemented {code} It would be a much more comfortable way than {code:python} pa.DictionaryArray.from_arrays(indices, dictionary) {code} And possibly more efficient as well thanks to the adaptive dictionary builders. was: {code:python} pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) {code} raises {code} ArrowNotImplementedError: Sequence converter for type dictionary not implemented {code} It would be a much more comfortable way than {code:python} pa.DictionaryArray.from_arrays(indices, dictionary) {code} > [Python] Support constructing dictionary array directly through pa.array() > -- > > Key: ARROW-9999 > URL: https://issues.apache.org/jira/browse/ARROW-9999 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > > {code:python} > pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) > {code} > raises > {code} > ArrowNotImplementedError: Sequence converter for type > dictionary not implemented > {code} > It would be a much more comfortable way than > {code:python} > pa.DictionaryArray.from_arrays(indices, dictionary) > {code} > And possibly more efficient as well thanks to the adaptive dictionary > builders. -- This message was sent by Atlassian Jira (v8.3.4#803005)
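The existing workaround, {{pa.DictionaryArray.from_arrays(indices, dictionary)}}, requires the caller to compute the encoding by hand. A sketch of that manual step in plain Python (not the adaptive dictionary builders the issue mentions):

```python
def dictionary_encode(values):
    # Produce (indices, dictionary): unique values in first-seen order,
    # plus one index per input value, the shape from_arrays() expects.
    index_of = {}
    dictionary = []
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return indices, dictionary

indices, dictionary = dictionary_encode(["some", "string", "some"])
assert indices == [0, 1, 0]
assert dictionary == ["some", "string"]
```

With direct support in {{pa.array()}}, this bookkeeping would move into the library.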
[jira] [Assigned] (ARROW-9999) [Python] Support constructing dictionary array directly through pa.array()
[ https://issues.apache.org/jira/browse/ARROW-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-9999: -- Assignee: Krisztian Szucs > [Python] Support constructing dictionary array directly through pa.array() > -- > > Key: ARROW-9999 > URL: https://issues.apache.org/jira/browse/ARROW-9999 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > > {code:python} > pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) > {code} > raises > {code} > ArrowNotImplementedError: Sequence converter for type > dictionary not implemented > {code} > It would be a much more comfortable way than > {code:python} > pa.DictionaryArray.from_arrays(indices, dictionary) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9999) [Python] Support constructing dictionary array directly through pa.array()
[ https://issues.apache.org/jira/browse/ARROW-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9999: --- Summary: [Python] Support constructing dictionary array directly through pa.array() (was: [Python] Support constructing dictionary array through pa.array()) > [Python] Support constructing dictionary array directly through pa.array() > -- > > Key: ARROW-9999 > URL: https://issues.apache.org/jira/browse/ARROW-9999 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Krisztian Szucs >Priority: Major > > {code:python} > pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) > {code} > raises > {code} > ArrowNotImplementedError: Sequence converter for type > dictionary not implemented > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9999) [Python] Support constructing dictionary array through pa.array()
[ https://issues.apache.org/jira/browse/ARROW-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9999: --- Description: {code:python} pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) {code} raises {code} ArrowNotImplementedError: Sequence converter for type dictionary not implemented {code} > [Python] Support constructing dictionary array through pa.array() > - > > Key: ARROW-9999 > URL: https://issues.apache.org/jira/browse/ARROW-9999 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Krisztian Szucs >Priority: Major > > {code:python} > pa.array(["some", "string"], type=pa.dictionary(pa.int8(), pa.string())) > {code} > raises > {code} > ArrowNotImplementedError: Sequence converter for type > dictionary not implemented > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9995) [R] Snappy Codec Support not built
[ https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195541#comment-17195541 ] Neal Richardson commented on ARROW-9995: {{read_parquet()}} doesn't ask you about compression--it detects what compression is used in the file. So it sounds like you're trying to read a snappy-compressed file and thus need a build with snappy enabled. To get that, since you already have arrow you could call `arrow::install_arrow()` and it should just work, installing a more complete version. Or you could set {{LIBARROW_MINIMAL=FALSE}} and reinstall by the usual ways. See https://arrow.apache.org/docs/r/articles/install.html for more. > [R] Snappy Codec Support not built > -- > > Key: ARROW-9995 > URL: https://issues.apache.org/jira/browse/ARROW-9995 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 1.0.0, 1.0.1 >Reporter: Joska Lako >Priority: Major > Labels: Snappy > Attachments: ErrorScreenshot.PNG > > Original Estimate: 24h > Remaining Estimate: 24h > > I am reading my file on a Linux based server which has no Snappy compression. > Even though I call the function to do uncompressed compression. I still get > an error Snappy codec support not built. How do I overcome this error and > read a parquet file without snappy codec on linux? > read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED') > Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: > NotImplemented: Snappy codec support not built -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9995) [R] Snappy Codec Support not built
[ https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-9995: -- Assignee: Neal Richardson > [R] Snappy Codec Support not built > -- > > Key: ARROW-9995 > URL: https://issues.apache.org/jira/browse/ARROW-9995 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.0, 1.0.1 >Reporter: Joska Lako >Assignee: Neal Richardson >Priority: Major > Labels: Snappy > Attachments: ErrorScreenshot.PNG > > Original Estimate: 24h > Remaining Estimate: 24h > > I am reading my file on a Linux based server which has no Snappy compression. > Even though I call the function to do uncompressed compression. I still get > an error Snappy codec support not built. How do I overcome this error and > read a parquet file without snappy codec on linux? > read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED') > Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: > NotImplemented: Snappy codec support not built -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9995) [R] Snappy Codec Support not built
[ https://issues.apache.org/jira/browse/ARROW-9995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-9995. Resolution: Information Provided > [R] Snappy Codec Support not built > -- > > Key: ARROW-9995 > URL: https://issues.apache.org/jira/browse/ARROW-9995 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 1.0.0, 1.0.1 >Reporter: Joska Lako >Assignee: Neal Richardson >Priority: Major > Labels: Snappy > Attachments: ErrorScreenshot.PNG > > Original Estimate: 24h > Remaining Estimate: 24h > > I am reading my file on a Linux based server which has no Snappy compression. > Even though I call the function to do uncompressed compression. I still get > an error Snappy codec support not built. How do I overcome this error and > read a parquet file without snappy codec on linux? > read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED') > Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: > NotImplemented: Snappy codec support not built -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9997) [Python] StructScalar.as_py() fails if the type has duplicate field names
Krisztian Szucs created ARROW-9997: -- Summary: [Python] StructScalar.as_py() fails if the type has duplicate field names Key: ARROW-9997 URL: https://issues.apache.org/jira/browse/ARROW-9997 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 2.0.0 {{StructScalar}} currently extends an abstract Mapping interface. Since the type allows duplicate field names we cannot provide that API. -- This message was sent by Atlassian Jira (v8.3.4#803005)
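The Mapping constraint is easy to see with plain dicts: duplicate keys cannot coexist, so a {{StructScalar}} over a type with duplicated field names cannot round-trip through one. A minimal illustration:

```python
# Building a dict from duplicated field names silently keeps only the
# last value, losing data; this is why StructScalar cannot implement
# a Mapping-style as_py() for types with duplicate field names.
fields = [('a', 1), ('a', 2)]
collapsed = dict(fields)

assert len(fields) == 2
assert collapsed == {'a': 2}  # the first ('a', 1) entry is gone
assert len(collapsed) == 1
```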
[jira] [Created] (ARROW-9999) [Python] Support constructing dictionary array through pa.array()
Krisztian Szucs created ARROW-9999: -- Summary: [Python] Support constructing dictionary array through pa.array() Key: ARROW-9999 URL: https://issues.apache.org/jira/browse/ARROW-9999 Project: Apache Arrow Issue Type: New Feature Reporter: Krisztian Szucs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9998) [Python] Support pickling DictionaryScalar
Krisztian Szucs created ARROW-9998: -- Summary: [Python] Support pickling DictionaryScalar Key: ARROW-9998 URL: https://issues.apache.org/jira/browse/ARROW-9998 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 2.0.0 Since the {{pa.array}} factory function doesn't support the creation of dictionary array pickling [has not been implemented|https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_scalars.py#L554] for dictionary scalars yet. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9995) [R] Snappy Codec Support not built
Joska Lako created ARROW-9995: - Summary: [R] Snappy Codec Support not built Key: ARROW-9995 URL: https://issues.apache.org/jira/browse/ARROW-9995 Project: Apache Arrow Issue Type: Bug Components: C++, R Affects Versions: 1.0.1, 1.0.0 Reporter: Joska Lako Attachments: ErrorScreenshot.PNG I am reading my file on a Linux based server which has no Snappy compression. Even though I call the function to do uncompressed compression. I still get an error Snappy codec support not built. How do I overcome this error and read a parquet file without snappy codec on linux? read_parquet(file,as_data_frame=TRUE,compression='UNCOMPRESSED') Error in parquet___arrow___FileReader__ReadTable1(self) : IOError: NotImplemented: Snappy codec support not built -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9994) [C++][Python] Auto chunking nested array containing binary-like fields result malformed output
[ https://issues.apache.org/jira/browse/ARROW-9994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9994: --- Issue Type: Bug (was: Improvement) > [C++][Python] Auto chunking nested array containing binary-like fields result > malformed output > -- > > Key: ARROW-9994 > URL: https://issues.apache.org/jira/browse/ARROW-9994 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 1.0.0 >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > > In case of nested types the binary-like arrays are chunked but not the > others, so after finalizing the builder the nested output array contains > different length children. > {code:python} >char = b'x' >ty = pa.binary() > v1 = char * 1 > v2 = char * 147483646 > struct_type = pa.struct([ > pa.field('bool', pa.bool_()), > pa.field('integer', pa.int64()), > pa.field('string-like', ty), > ]) > data = [{'bool': True, 'integer': 1, 'string-like': v1}] * 20 > data.append({'bool': True, 'integer': 1, 'string-like': v2}) > arr = pa.array(data, type=struct_type) > assert isinstance(arr, pa.Array) > data.append({'bool': True, 'integer': 1, 'string-like': char}) > arr = pa.array(data, type=struct_type) > assert isinstance(arr, pa.ChunkedArray) > {code} > {code:python} > len(arr.field(0)) == 22 > len(arr.field(1)) == 22 > len(arr.field(2)) == 1 # the string array gets chunked whereas the rest of > the fields do not > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9994) [C++][Python] Auto chunking nested array containing binary-like fields result malformed output
Krisztian Szucs created ARROW-9994: -- Summary: [C++][Python] Auto chunking nested array containing binary-like fields result malformed output Key: ARROW-9994 URL: https://issues.apache.org/jira/browse/ARROW-9994 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 1.0.0 Reporter: Krisztian Szucs Assignee: Krisztian Szucs In case of nested types the binary-like arrays are chunked but not the others, so after finalizing the builder the nested output array contains different length children. {code:python} char = b'x' ty = pa.binary() v1 = char * 1 v2 = char * 147483646 struct_type = pa.struct([ pa.field('bool', pa.bool_()), pa.field('integer', pa.int64()), pa.field('string-like', ty), ]) data = [{'bool': True, 'integer': 1, 'string-like': v1}] * 20 data.append({'bool': True, 'integer': 1, 'string-like': v2}) arr = pa.array(data, type=struct_type) assert isinstance(arr, pa.Array) data.append({'bool': True, 'integer': 1, 'string-like': char}) arr = pa.array(data, type=struct_type) assert isinstance(arr, pa.ChunkedArray) {code} {code:python} len(arr.field(0)) == 22 len(arr.field(1)) == 22 len(arr.field(2)) == 1 # the string array gets chunked whereas the rest of the fields do not {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
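The violated invariant can be stated simply: every child of a StructArray (or of each chunk of one) must have the same length. A sketch of checking it (hypothetical helper, not pyarrow API):

```python
def struct_children_consistent(child_lengths):
    # A valid struct array has equal-length children; ARROW-9994 shows
    # only the binary-like child being chunked, so its length diverges
    # from the bool and int64 children.
    return len(set(child_lengths)) <= 1

assert struct_children_consistent([22, 22, 22])
assert not struct_children_consistent([22, 22, 1])  # the reported malformed state
```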
[jira] [Assigned] (ARROW-9580) Docs have superfluous ()
[ https://issues.apache.org/jira/browse/ARROW-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9580: Assignee: Apache Arrow JIRA Bot (was: Dominik Moritz) > Docs have superfluous () > > > Key: ARROW-9580 > URL: https://issues.apache.org/jira/browse/ARROW-9580 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Dominik Moritz >Assignee: Apache Arrow JIRA Bot >Priority: Trivial > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9580) Docs have superfluous ()
[ https://issues.apache.org/jira/browse/ARROW-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9580: -- Labels: pull-request-available (was: ) > Docs have superfluous () > > > Key: ARROW-9580 > URL: https://issues.apache.org/jira/browse/ARROW-9580 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Dominik Moritz >Assignee: Dominik Moritz >Priority: Trivial > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9580) Docs have superfluous ()
[ https://issues.apache.org/jira/browse/ARROW-9580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9580: Assignee: Dominik Moritz (was: Apache Arrow JIRA Bot) > Docs have superfluous () > > > Key: ARROW-9580 > URL: https://issues.apache.org/jira/browse/ARROW-9580 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Dominik Moritz >Assignee: Dominik Moritz >Priority: Trivial > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9993) [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects
[ https://issues.apache.org/jira/browse/ARROW-9993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-9993: --- Issue Type: Bug (was: Improvement) > [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects > - > > Key: ARROW-9993 > URL: https://issues.apache.org/jira/browse/ARROW-9993 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > > Timezone roundtrip fails with {{pytz.StaticTzInfo}} objects on master: > {code:python} > tz = pytz.timezone('Etc/GMT+1') > pa.lib.string_to_tzinfo(pa.lib.tzinfo_to_string(tz)) > {code} > {code} > --- > UnknownTimeZoneError Traceback (most recent call last) > in > > 1 pa.lib.string_to_tzinfo(pa.lib.tzinfo_to_string(tz)) > ~/Workspace/arrow/python/pyarrow/types.pxi in pyarrow.lib.string_to_tzinfo() >1838 Time zone object >1839 """ > -> 1840 cdef PyObject* tz = > GetResultValue(StringToTzinfo(name.encode('utf-8'))) >1841 return PyObject_to_object(tz) >1842 > ~/Workspace/arrow/python/pyarrow/error.pxi in > pyarrow.lib.pyarrow_internal_check_status() > 120 cdef api int pyarrow_internal_check_status(const CStatus& status) \ > 121 nogil except -1: > --> 122 return check_status(status) > ~/.conda/envs/arrow38/lib/python3.8/site-packages/pytz/__init__.py in > timezone(zone) > 179 fp.close() > 180 else: > --> 181 raise UnknownTimeZoneError(zone) > 182 > 183 return _tzinfo_cache[zone] > UnknownTimeZoneError: '-01' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9993) [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects
Krisztian Szucs created ARROW-9993: -- Summary: [Python] Tzinfo - string roundtrip fails on pytz.StaticTzInfo objects Key: ARROW-9993 URL: https://issues.apache.org/jira/browse/ARROW-9993 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs Timezone roundtrip fails with {{pytz.StaticTzInfo}} objects on master: {code:python} tz = pytz.timezone('Etc/GMT+1') pa.lib.string_to_tzinfo(pa.lib.tzinfo_to_string(tz)) {code} {code} --- UnknownTimeZoneError Traceback (most recent call last) in > 1 pa.lib.string_to_tzinfo(pa.lib.tzinfo_to_string(tz)) ~/Workspace/arrow/python/pyarrow/types.pxi in pyarrow.lib.string_to_tzinfo() 1838 Time zone object 1839 """ -> 1840 cdef PyObject* tz = GetResultValue(StringToTzinfo(name.encode('utf-8'))) 1841 return PyObject_to_object(tz) 1842 ~/Workspace/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status() 120 cdef api int pyarrow_internal_check_status(const CStatus& status) \ 121 nogil except -1: --> 122 return check_status(status) ~/.conda/envs/arrow38/lib/python3.8/site-packages/pytz/__init__.py in timezone(zone) 179 fp.close() 180 else: --> 181 raise UnknownTimeZoneError(zone) 182 183 return _tzinfo_cache[zone] UnknownTimeZoneError: '-01' {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
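The traceback shows the roundtrip failing because {{tzinfo_to_string}} renders {{Etc/GMT+1}} as the fixed-offset string '-01', which {{pytz.timezone}} cannot look up. A stdlib-only sketch of the kind of fixed-offset fallback {{string_to_tzinfo}} could apply (the function name and accepted formats are illustrative assumptions, not Arrow's actual parser):

```python
from datetime import timedelta, timezone
import re

def offset_string_to_tzinfo(s):
    """Parse fixed-offset strings like '-01', '+0530' or '+05:30' into a
    datetime.timezone. Illustrative fallback only; real zone names such
    as 'Europe/Paris' would still need a tz-database lookup."""
    m = re.fullmatch(r"([+-])(\d{2})(?::?(\d{2}))?", s)
    if m is None:
        raise ValueError(f"not a fixed-offset string: {s!r}")
    sign = 1 if m.group(1) == "+" else -1
    hours, minutes = int(m.group(2)), int(m.group(3) or 0)
    return timezone(sign * timedelta(hours=hours, minutes=minutes))

tz = offset_string_to_tzinfo("-01")
assert tz.utcoffset(None) == timedelta(hours=-1)
```

With such a fallback, the '-01' produced for a {{pytz.StaticTzInfo}} zone would round-trip to an equivalent fixed-offset tzinfo instead of raising UnknownTimeZoneError.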
[jira] [Commented] (ARROW-9616) [C++] Support LTO for R
[ https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195515#comment-17195515 ] Antoine Pitrou commented on ARROW-9616: --- Deciding to LTO everything sounds more ideological than pragmatic. LTO can be useful in some select cases, but I fail to understand why it would be mandatory. Also it will increase build times again. > [C++] Support LTO for R > --- > > Key: ARROW-9616 > URL: https://issues.apache.org/jira/browse/ARROW-9616 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 1.0.0 >Reporter: Jeroen >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The next version of R might enable LTO on Windows, i.e. R packages will be > compiled with {{-flto}} by default. This works out of the box for most > packages, but for arrow, the linker crashes as below. > {code} > C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 > -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o > array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o > chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o > expression.o feather.o field.o filesystem.o imports.o io.o json.o > memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o > recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o > -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset > -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto > -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR > lto1.exe: internal compiler error: in add_symbol_to_partition_1, at > lto/lto-partition.c:153 > libbacktrace could not find executable to open > Please submit a full bug report, > with preprocessed source if appropriate. > See <[https://github.com/r-windows]> for instructions. 
> lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 > exit status > compilation terminated. > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > error: lto-wrapper failed > {code} > You can reproduce this in R on Windows for example like so: > {code:r} > dir.create("~/.R") > writeLines("CPPFLAGS=-flto", file = "~/.R/Makevars") > install.packages("arrow", type = 'source') > {code} > I am not sure if this is a bug in the toolchain, or in arrow. I tried with > both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this > issue|https://github.com/cycfi/elements/pull/56] in another project which > suggests to enable `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto > code with non-lto code (which is the case when we only build the r bindings > with lto, but not the c++ library). -- This message was sent by Atlassian Jira (v8.3.4#803005)
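The elements project issue linked in the report suggests enabling INTERPROCEDURAL_OPTIMIZATION in CMake so that LTO and non-LTO objects are not mixed. A hedged sketch of what such an opt-in could look like on the C++ side (the `ARROW_WITH_LTO` option name is hypothetical; only the `check_ipo_supported` / `CMAKE_INTERPROCEDURAL_OPTIMIZATION` machinery is standard CMake):

```cmake
# Hypothetical sketch: opt-in LTO via CMake's INTERPROCEDURAL_OPTIMIZATION,
# with a feature check so unsupported toolchains fall back cleanly.
include(CheckIPOSupported)

option(ARROW_WITH_LTO "Build with link-time optimization" OFF)

if(ARROW_WITH_LTO)
  check_ipo_supported(RESULT ipo_ok OUTPUT ipo_msg)
  if(ipo_ok)
    set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
  else()
    message(WARNING "LTO requested but not supported: ${ipo_msg}")
  endif()
endif()
```

This would build the C++ library itself with LTO when the bindings are, avoiding the mixed lto/non-lto link the report suspects.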
[jira] [Commented] (ARROW-9616) [C++] Support LTO for R
[ https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195513#comment-17195513 ] Antoine Pitrou commented on ARROW-9616: --- An internal compiler error is certainly not a bug in Arrow, but we have to workaround the issue at some point, no? > [C++] Support LTO for R > --- > > Key: ARROW-9616 > URL: https://issues.apache.org/jira/browse/ARROW-9616 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 1.0.0 >Reporter: Jeroen >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The next version of R might enable LTO on Windows, i.e. R packages will be > compiled with {{-flto}} by default. This works out of the box for most > packages, but for arrow, the linker crashes as below. > {code} > C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 > -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o > array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o > chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o > expression.o feather.o field.o filesystem.o imports.o io.o json.o > memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o > recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o > -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset > -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto > -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR > lto1.exe: internal compiler error: in add_symbol_to_partition_1, at > lto/lto-partition.c:153 > libbacktrace could not find executable to open > Please submit a full bug report, > with preprocessed source if appropriate. > See <[https://github.com/r-windows]> for instructions. > lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 > exit status > compilation terminated. 
> > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > error: lto-wrapper failed > {code} > You can reproduce this in R on Windows for example like so: > {code:r} > dir.create("~/.R") > writeLines("CPPFLAGS=-flto", file = "~/.R/Makevars") > install.packages("arrow", type = 'source') > {code} > I am not sure if this is a bug in the toolchain, or in arrow. I tried with > both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this > issue|https://github.com/cycfi/elements/pull/56] in another project which > suggests to enable `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto > code with non-lto code (which is the case when we only build the r bindings > with lto, but not the c++ library). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9616) [C++] Support LTO for R
[ https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195512#comment-17195512 ] Neal Richardson commented on ARROW-9616: If CRAN decides that it LTO's everything, then we wouldn't be able to turn that off. FWIW CRAN already has a LTO builder in its test setup (debian, I believe) and arrow is not failing that. So this is something in the Windows setup, and possibly not a problem in arrow at all. > [C++] Support LTO for R > --- > > Key: ARROW-9616 > URL: https://issues.apache.org/jira/browse/ARROW-9616 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 1.0.0 >Reporter: Jeroen >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The next version of R might enable LTO on Windows, i.e. R packages will be > compiled with {{-flto}} by default. This works out of the box for most > packages, but for arrow, the linker crashes as below. > {code} > C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 > -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o > array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o > chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o > expression.o feather.o field.o filesystem.o imports.o io.o json.o > memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o > recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o > -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset > -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto > -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR > lto1.exe: internal compiler error: in add_symbol_to_partition_1, at > lto/lto-partition.c:153 > libbacktrace could not find executable to open > Please submit a full bug report, > with preprocessed source if appropriate. 
> See <[https://github.com/r-windows]> for instructions. > lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 > exit status > compilation terminated. > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > error: lto-wrapper failed > {code} > You can reproduce this in R on Windows for example like so: > {code:r} > dir.create("~/.R") > writeLines("CPPFLAGS=-flto", file = "~/.R/Makevars") > install.packages("arrow", type = 'source') > {code} > I am not sure if this is a bug in the toolchain, or in arrow. I tried with > both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this > issue|https://github.com/cycfi/elements/pull/56] in another project which > suggests to enable `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto > code with non-lto code (which is the case when we only build the r bindings > with lto, but not the c++ library). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9991) [C++] split kernels for strings/binary
[ https://issues.apache.org/jira/browse/ARROW-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195508#comment-17195508 ] Joris Van den Bossche commented on ARROW-9991: -- And I suppose "whitespace" here is more than a split on " " ? (also multiple spaces, different kinds of newlines, tabs, etc?) In that case, a separate specialized kernel seems indeed best. > [C++] split kernels for strings/binary > -- > > Key: ARROW-9991 > URL: https://issues.apache.org/jira/browse/ARROW-9991 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Maarten Breddels >Assignee: Maarten Breddels >Priority: Major > > Similar to Python str.split and bytes.split, we'd like to have a way to > convert str into list[str] (and similarly for bytes). > When the separator is given, the algorithms for both types are the same. > Python, however, overloads split. When given no separator, the algorithm will > split considering all whitespace (unicode for str, ascii for bytes) as > separator. > I'd rather not see too many overloaded kernels, e.g. > binary_split (takes string/binary separator, and maxsplit arg, no special > utf8 version needed) > utf8_split_whitespace (similar to Python's version given no separator) > ascii_split_whitespace (similar to Python's version given no separator, but > considering ascii, although this could work on any binary data) > there can also be rsplit versions of these, or they could be an argument. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
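The distinction raised in the comment mirrors Python's own split semantics, which the proposed whitespace kernels would follow:

```python
# Python's str.split with no separator collapses runs of arbitrary
# whitespace (spaces, tabs, newlines) and drops leading/trailing runs;
# with an explicit separator, every occurrence produces a field.
assert "a \t b\nc".split() == ["a", "b", "c"]
assert "a  b".split(" ") == ["a", "", "b"]   # empty field between two spaces
assert "  a  ".split() == ["a"]              # no leading/trailing empties

# maxsplit limits the number of splits, as in the proposed binary_split arg
assert "a,b,c".split(",", 1) == ["a", "b,c"]
# rsplit is the right-to-left variant mentioned at the end
assert "a,b,c".rsplit(",", 1) == ["a,b", "c"]
```

So "whitespace" in the no-separator case is indeed more than a split on " ", which is why a separate specialized kernel (rather than an overloaded one) fits.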
[jira] [Commented] (ARROW-1385) [C++] Add Buffer implementation and helper functions for POSIX shared memory
[ https://issues.apache.org/jira/browse/ARROW-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195491#comment-17195491 ] Antoine Pitrou commented on ARROW-1385: --- Is there a target use case we're thinking about? Otherwise it's not obvious this deserves keeping an issue open. (especially, one annoyance with shared memory is garbage collecting unused shared memory segments: Windows is able to do this automatically, Unix unfortunately is not) > [C++] Add Buffer implementation and helper functions for POSIX shared memory > > > Key: ARROW-1385 > URL: https://issues.apache.org/jira/browse/ARROW-1385 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 2.0.0 > > > This should also include affordances for detaching and removing shm segments -- This message was sent by Atlassian Jira (v8.3.4#803005)
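For reference on the lifecycle concern raised in the comment, Python's stdlib wrapper over POSIX shared memory makes the detach-vs-remove distinction explicit: {{close()}} detaches a mapping, while the segment itself persists until {{unlink()}} is called, which is exactly the "detaching and removing" affordance the issue asks for. A minimal sketch:

```python
from multiprocessing import shared_memory

# Create a named POSIX shared memory segment and write into it.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"arrow"

# A second handle (here in the same process, but typically another
# process) attaches by name and reads the same bytes.
other = shared_memory.SharedMemory(name=shm.name)
data = bytes(other.buf[:5])
assert data == b"arrow"

other.close()   # detaches this mapping only; the segment still exists
shm.close()     # detach the creator's mapping...
shm.unlink()    # ...then remove the segment; without unlink() it leaks
```

As the comment notes, nothing on Unix removes the segment automatically if the last paragraph's `unlink()` never runs; that garbage-collection burden falls on the application.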
[jira] [Assigned] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails
[ https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9859: Assignee: Apache Arrow JIRA Bot (was: Antoine Pitrou) > [C++] S3 FileSystemFromUri with special char in secret key fails > > > Key: ARROW-9859 > URL: https://issues.apache.org/jira/browse/ARROW-9859 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Python >Reporter: Neal Richardson >Assignee: Apache Arrow JIRA Bot >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > S3 Secret access keys can contain special characters like {{/}}. When they do > 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them > (e.g. replace / with %2F) > 2) When you do escape the special characters, requests that require > authorization fail with the message "The request signature we calculated does > not match the signature you provided. Check your key and signing method." > This may suggest that there's some extra URL encoding/decoding that needs to > happen inside. > I was only able to work around this by generating a new access key that > happened not to have special characters. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails
[ https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Arrow JIRA Bot reassigned ARROW-9859: Assignee: Antoine Pitrou (was: Apache Arrow JIRA Bot) > [C++] S3 FileSystemFromUri with special char in secret key fails > > > Key: ARROW-9859 > URL: https://issues.apache.org/jira/browse/ARROW-9859 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Python >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > S3 Secret access keys can contain special characters like {{/}}. When they do > 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them > (e.g. replace / with %2F) > 2) When you do escape the special characters, requests that require > authorization fail with the message "The request signature we calculated does > not match the signature you provided. Check your key and signing method." > This may suggest that there's some extra URL encoding/decoding that needs to > happen inside. > I was only able to work around this by generating a new access key that > happened not to have special characters. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails
[ https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-9859: -- Labels: pull-request-available (was: ) > [C++] S3 FileSystemFromUri with special char in secret key fails > > > Key: ARROW-9859 > URL: https://issues.apache.org/jira/browse/ARROW-9859 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Python >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > S3 Secret access keys can contain special characters like {{/}}. When they do > 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them > (e.g. replace / with %2F) > 2) When you do escape the special characters, requests that require > authorization fail with the message "The request signature we calculated does > not match the signature you provided. Check your key and signing method." > This may suggest that there's some extra URL encoding/decoding that needs to > happen inside. > I was only able to work around this by generating a new access key that > happened not to have special characters. -- This message was sent by Atlassian Jira (v8.3.4#803005)
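A sketch of the expected encoding contract: the secret is percent-encoded once when building the URI and must be decoded exactly once inside the parser; signing with a still-encoded (or double-decoded) secret produces exactly the signature-mismatch error quoted above. A stdlib illustration (the credentials and bucket are made up):

```python
from urllib.parse import quote, unquote

access_key = "AKIAEXAMPLE"     # made-up credentials
secret_key = "abc/def+ghi"     # contains URI-reserved characters

# safe="" so that '/' and '+' are escaped too
enc = quote(secret_key, safe="")
assert enc == "abc%2Fdef%2Bghi"

uri = f"s3://{quote(access_key, safe='')}:{enc}@my-bucket/path"

# The URI parser must decode exactly once to recover the original secret;
# using the still-encoded form for request signing breaks authorization.
assert unquote(enc) == secret_key
```

If the filesystem layer signs requests with `enc` rather than `unquote(enc)`, AWS computes a different signature, matching the behavior described in the report.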
[jira] [Commented] (ARROW-9964) [C++] CSV date support
[ https://issues.apache.org/jira/browse/ARROW-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195469#comment-17195469 ] Antoine Pitrou commented on ARROW-9964: --- Thanks for the report. Indeed, for now, you cannot directly read those values as a date type. However, you can read them as a timestamp. I agree it would be good to allow specifying a date column in {{column_types}}. > [C++] CSV date support > -- > > Key: ARROW-9964 > URL: https://issues.apache.org/jira/browse/ARROW-9964 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 1.0.1 >Reporter: Maciej >Priority: Major > > There is no support for reading date type from CSV file. I'd like to read > such a value: > {code:java} > 1991-02-03 > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
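Until a date column can be requested through {{column_types}}, a stdlib-level sketch of the desired parse for the {{1991-02-03}} form (a workaround illustration, not the Arrow CSV reader):

```python
import csv
import io
from datetime import date

# Parse an ISO 'YYYY-MM-DD' date column out of CSV text;
# date.fromisoformat handles exactly the form shown in the report.
text = "d\n1991-02-03\n2000-12-31\n"
rows = list(csv.DictReader(io.StringIO(text)))
dates = [date.fromisoformat(r["d"]) for r in rows]
assert dates[0] == date(1991, 2, 3)
```

The eventual kernel would do the equivalent conversion natively, producing a date32/date64 column instead of a timestamp.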
[jira] [Commented] (ARROW-9859) [C++] S3 FileSystemFromUri with special char in secret key fails
[ https://issues.apache.org/jira/browse/ARROW-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195470#comment-17195470 ] Antoine Pitrou commented on ARROW-9859: --- Nevermind, I have such a test bucket myself :-) > [C++] S3 FileSystemFromUri with special char in secret key fails > > > Key: ARROW-9859 > URL: https://issues.apache.org/jira/browse/ARROW-9859 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Python >Reporter: Neal Richardson >Assignee: Antoine Pitrou >Priority: Major > Fix For: 2.0.0 > > > S3 Secret access keys can contain special characters like {{/}}. When they do > 1) FileSystemFromUri will fail to parse the URI unless you URL-encode them > (e.g. replace / with %2F) > 2) When you do escape the special characters, requests that require > authorization fail with the message "The request signature we calculated does > not match the signature you provided. Check your key and signing method." > This may suggest that there's some extra URL encoding/decoding that needs to > happen inside. > I was only able to work around this by generating a new access key that > happened not to have special characters. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9964) [C++] CSV date support
[ https://issues.apache.org/jira/browse/ARROW-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-9964: -- Fix Version/s: 2.0.0 > [C++] CSV date support > -- > > Key: ARROW-9964 > URL: https://issues.apache.org/jira/browse/ARROW-9964 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 1.0.1 >Reporter: Maciej >Priority: Major > Fix For: 2.0.0 > > > There is no support for reading date type from CSV file. I'd like to read > such a value: > {code:java} > 1991-02-03 > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9616) [C++] Support LTO for R
[ https://issues.apache.org/jira/browse/ARROW-9616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195464#comment-17195464 ] Antoine Pitrou commented on ARROW-9616: --- Can't we simply disable LTO? I doubt LTO would bring much to Arrow (and if it does, then I'd say it's a bug: we should structure our source code so that LTO is generally not useful). > [C++] Support LTO for R > --- > > Key: ARROW-9616 > URL: https://issues.apache.org/jira/browse/ARROW-9616 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 1.0.0 >Reporter: Jeroen >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The next version of R might enable LTO on Windows, i.e. R packages will be > compiled with {{-flto}} by default. This works out of the box for most > packages, but for arrow, the linker crashes as below. > {code} > C:/rtools40/mingw64/bin/g++ -shared -O2 -Wall -mfpmath=sse -msse2 > -mstackrealign -flto -s -static-libgcc -o arrow.dll tmp.def array.o > array_from_vector.o array_to_vector.o arraydata.o arrowExports.o buffer.o > chunkedarray.o compression.o compute.o csv.o dataset.o datatype.o > expression.o feather.o field.o filesystem.o imports.o io.o json.o > memorypool.o message.o parquet.o py-to-r.o recordbatch.o recordbatchreader.o > recordbatchwriter.o scalar.o schema.o symbols.o table.o threadpool.o > -L../windows//lib-8.3.0/x64 -L../windows//lib/x64 -lparquet -larrow_dataset > -larrow -lthrift -lsnappy -lz -lzstd -llz4 -lbcrypt -lpsapi -lcrypto > -lcrypt32 -lws2_32 -LC:/PROGRA~1/R/R-devel/bin/x64 -lR > lto1.exe: internal compiler error: in add_symbol_to_partition_1, at > lto/lto-partition.c:153 > libbacktrace could not find executable to open > Please submit a full bug report, > with preprocessed source if appropriate. > See <[https://github.com/r-windows]> for instructions. 
> lto-wrapper.exe: fatal error: C:\rtools40\mingw64\bin\g++.exe returned 1 > exit status > compilation terminated. > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/9.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > error: lto-wrapper failed > {code} > You can reproduce this in R on Windows for example like so: > {code:r} > dir.create("~/.R") > writeLines("CPPFLAGS=-flto", file = "~/.R/Makevars") > install.packages("arrow", type = 'source') > {code} > I am not sure if this is a bug in the toolchain, or in arrow. I tried with > both gcc-8.3.0 and gcc-9.3.0, and the result is the same. I did find [this > issue|https://github.com/cycfi/elements/pull/56] in another project which > suggests to enable `INTERPROCEDURAL_OPTIMIZATION` in cmake, when mixing lto > code with non-lto code (which is the case when we only build the r bindings > with lto, but not the c++ library). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5123) [Rust] derive RecordWriter from struct definitions
[ https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195410#comment-17195410 ] Neville Dipale commented on ARROW-5123: --- I'm unable to assign to Xavier > [Rust] derive RecordWriter from struct definitions > -- > > Key: ARROW-5123 > URL: https://issues.apache.org/jira/browse/ARROW-5123 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Xavier Lange >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 14h 20m > Remaining Estimate: 0h > > Migrated from previous github issue (which saw a lot of comments but at a > rough transition time in the project): > https://github.com/sunchao/parquet-rs/pull/197 > > Goal > === > Writing many columns to a file is a chore. If you can put your values into a > struct which mirrors the schema of your file, this > `derive(ParquetRecordWriter)` will write out all the fields, in the order in > which they are defined, to a row_group. > How to Use > === > ``` > extern crate parquet; > #[macro_use] extern crate parquet_derive; > #[derive(ParquetRecordWriter)] > struct ACompleteRecord<'a> { > pub a_bool: bool, > pub a_str: &'a str, > } > ``` > RecordWriter trait > === > This is the new trait which `parquet_derive` will implement for your structs. > ``` > use super::RowGroupWriter; > pub trait RecordWriter { > fn write_to_row_group(&self, row_group_writer: Box<RowGroupWriter>); > } > ``` > How does it work? > === > The `parquet_derive` crate adds code-generating functionality to the rust > compiler. The code generation takes rust syntax and emits additional syntax. > This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, > loaded by the machinery in cargo. Users don't have to do any special > `build.rs` steps or anything like that, it's automatic by including > `parquet_derive` in their project. 
The `parquet_derive/src/Cargo.toml` has a > section saying as much: > ``` > [lib] > proc-macro = true > ``` > The rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to > the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The > `syn` crate parses the struct from a string-representation to a AST (a > recursive enum value). The AST contains all the values I care about when > generating a `RecordWriter` impl: > - the name of the struct > - the lifetime variables of the struct > - the fields of the struct > The fields of the struct are translated from AST to a flat `FieldInfo` > struct. It has the bits I care about for writing a column: `field_name`, > `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`. > The code then does the equivalent of templating to build the `RecordWriter` > implementation. The templating functionality is provided by the `quote` > crate. At a high-level the template for `RecordWriter` looks like: > ``` > impl RecordWriter for $struct_name { > fn write_row_group(..) { > $({ > $column_writer_snippet > }) > } > } > ``` > this template is then added under the struct definition, ending up something > like: > ``` > struct MyStruct { > } > impl RecordWriter for MyStruct { > fn write_row_group(..) { > { > write_col_1(); > }; > { > write_col_2(); > } > } > } > ``` > and finally _THIS_ is the code passed to rustc. It's just code now, fully > expanded and standalone. If a user ever changes their `struct MyValue` > definition the `ParquetRecordWriter` will be regenerated. There's no > intermediate values to version control or worry about. 
> Viewing the Derived Code > === > To see the generated code before it's compiled, one very useful bit is to > install `cargo expand` [more info on > gh](https://github.com/dtolnay/cargo-expand), then you can do: > ``` > $WORK_DIR/parquet-rs/parquet_derive_test > cargo expand --lib > ../temp.rs > ``` > then you can dump the contents: > ``` > struct DumbRecord { > pub a_bool: bool, > pub a2_bool: bool, > } > impl RecordWriter for &[DumbRecord] { > fn write_to_row_group( > , > row_group_writer: Box, > ) { > let mut row_group_writer = row_group_writer; > { > let vals: Vec = self.iter().map(|x| x.a_bool).collect(); > let mut column_writer = > row_group_writer.next_column().unwrap().unwrap(); > if let > parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) = > column_writer > { > typed.write_batch([..], None, None).unwrap(); > } > row_group_writer.close_column(column_writer).unwrap(); > }; > { > let vals: Vec = self.iter().map(|x| x.a2_bool).collect(); > let mut
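The derive flow described above — inspect the struct's fields once, then emit one column write per field in declaration order — can be mimicked at runtime with dataclass introspection. A hedged Python analogy of the generated code's shape (field names borrowed from the {{DumbRecord}} example; this is not the parquet_derive implementation):

```python
from dataclasses import dataclass, fields

# Sketch of the derive idea: inspect a record type's fields once and build
# per-column writers, analogous to what #[derive(ParquetRecordWriter)]
# generates at compile time for each struct field.

@dataclass
class DumbRecord:
    a_bool: bool
    a2_bool: bool

def make_record_writer(record_type):
    names = [f.name for f in fields(record_type)]
    def write_to_row_group(records):
        # one "column write" per field, in declaration order
        return {n: [getattr(r, n) for r in records] for n in names}
    return write_to_row_group

writer = make_record_writer(DumbRecord)
cols = writer([DumbRecord(True, False), DumbRecord(False, False)])
assert cols == {"a_bool": [True, False], "a2_bool": [False, False]}
```

The Rust version does this inspection at compile time via the `syn` AST and emits real code through `quote`, so there is no runtime reflection cost; the shape of the output — one column batch per field — is the same.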
[jira] [Resolved] (ARROW-5123) [Rust] derive RecordWriter from struct definitions
[ https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-5123. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 4140 [https://github.com/apache/arrow/pull/4140] > [Rust] derive RecordWriter from struct definitions > -- > > Key: ARROW-5123 > URL: https://issues.apache.org/jira/browse/ARROW-5123 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Xavier Lange >Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 14h 10m > Remaining Estimate: 0h > > Migrated from previous github issue (which saw a lot of comments but at a > rough transition time in the project): > https://github.com/sunchao/parquet-rs/pull/197 > > Goal > === > Writing many columns to a file is a chore. If you can put your values in to a > struct which mirrors the schema of your file, this > `derive(ParquetRecordWriter)` will write out all the fields, in the order in > which they are defined, to a row_group. > How to Use > === > ``` > extern crate parquet; > #[macro_use] extern crate parquet_derive; > #[derive(ParquetRecordWriter)] > struct ACompleteRecord<'a> { > pub a_bool: bool, > pub a_str: &'a str, > } > ``` > RecordWriter trait > === > This is the new trait which `parquet_derive` will implement for your structs. > ``` > use super::RowGroupWriter; > pub trait RecordWriter { > fn write_to_row_group(, row_group_writer: Box); > } > ``` > How does it work? > === > The `parquet_derive` crate adds code generating functionality to the rust > compiler. The code generation takes rust syntax and emits additional syntax. > This macro expansion works on rust 1.15+ stable. This is a dynamic plugin, > loaded by the machinery in cargo. Users don't have to do any special > `build.rs` steps or anything like that, it's automatic by including > `parquet_derive` in their project. 
The `parquet_derive/Cargo.toml` has a > section saying as much: > ``` > [lib] > proc-macro = true > ``` > The Rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to > the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The > `syn` crate parses the struct from a string representation to an AST (a > recursive enum value). The AST contains all the values I care about when > generating a `RecordWriter` impl: > - the name of the struct > - the lifetime variables of the struct > - the fields of the struct > The fields of the struct are translated from AST to a flat `FieldInfo` > struct. It has the bits I care about for writing a column: `field_name`, > `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`. > The code then does the equivalent of templating to build the `RecordWriter` > implementation. The templating functionality is provided by the `quote` > crate. At a high level, the template for `RecordWriter` looks like: > ``` > impl RecordWriter for $struct_name { > fn write_row_group(..) { > $({ > $column_writer_snippet > }) > } > } > ``` > This template is then added under the struct definition, ending up something > like: > ``` > struct MyStruct { > } > impl RecordWriter for MyStruct { > fn write_row_group(..) { > { > write_col_1(); > }; > { > write_col_2(); > } > } > } > ``` > and finally _THIS_ is the code passed to rustc. It's just code now, fully > expanded and standalone. If a user ever changes their `struct MyValue` > definition, the `ParquetRecordWriter` will be regenerated. There are no > intermediate values to version control or worry about. 
> Viewing the Derived Code > === > To see the generated code before it's compiled, a very useful tool is > `cargo expand` [more info on > gh](https://github.com/dtolnay/cargo-expand); then you can do: > ``` > $WORK_DIR/parquet-rs/parquet_derive_test > cargo expand --lib > ../temp.rs > ``` > then you can dump the contents: > ``` > struct DumbRecord { > pub a_bool: bool, > pub a2_bool: bool, > } > impl RecordWriter for &[DumbRecord] { > fn write_to_row_group( > &self, > row_group_writer: Box<RowGroupWriter>, > ) { > let mut row_group_writer = row_group_writer; > { > let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect(); > let mut column_writer = > row_group_writer.next_column().unwrap().unwrap(); > if let > parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) = > column_writer > { > typed.write_batch(&vals[..], None, None).unwrap(); > } > row_group_writer.close_column(column_writer).unwrap(); > }; > { > let vals: Vec<bool> =
[jira] [Assigned] (ARROW-9976) [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
[ https://issues.apache.org/jira/browse/ARROW-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-9976: -- Assignee: Krisztian Szucs > [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe > - > > Key: ARROW-9976 > URL: https://issues.apache.org/jira/browse/ARROW-9976 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: quentin lhoest >Assignee: Krisztian Szucs >Priority: Minor > > When calling Table.from_pandas() with a large dataset with a column of > vectors (np.array), there is an `ArrowCapacityError` > To reproduce: > {code:python} > import pandas as pd > import numpy as np > import pyarrow as pa > n = 1713614 > df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)}) > pa.Table.from_pandas(df) > {code} > With a smaller n it works. > Error raised: > {noformat} > --- > ArrowCapacityErrorTraceback (most recent call last) > in > > 1 _ = pa.Table.from_pandas(df) > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table.from_pandas() > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py > in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe) > 591 for i, maybe_fut in enumerate(arrays): > 592 if isinstance(maybe_fut, futures.Future): > --> 593 arrays[i] = maybe_fut.result() > 594 > 595 types = [x.type for x in arrays] > ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py > in result(self, timeout) > 423 raise CancelledError() > 424 elif self._state == FINISHED: > --> 425 return self.__get_result() > 426 > 427 self._condition.wait(timeout) > ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py > in __get_result(self) > 382 def __get_result(self): > 383 if self._exception: > --> 384 raise self._exception > 385 else: > 386 return self._result > 
~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py > in run(self) > 55 > 56 try: > ---> 57 result = self.fn(*self.args, **self.kwargs) > 58 except BaseException as exc: > 59 self.future.set_exception(exc) > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py > in convert_column(col, field) > 557 > 558 try: > --> 559 result = pa.array(col, type=type_, from_pandas=True, > safe=safe) > 560 except (pa.ArrowInvalid, > 561 pa.ArrowNotImplementedError, > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in > pyarrow.lib.array() > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in > pyarrow.lib._ndarray_to_array() > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in > pyarrow.lib.check_status() > ArrowCapacityError: List array cannot contain more than 2147483646 child > elements, have 2147483648 > {noformat} > I guess one needs to chunk the data before creating the arrays ? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9984) [Rust] [DataFusion] DRY of function to string
[ https://issues.apache.org/jira/browse/ARROW-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-9984. --- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8176 [https://github.com/apache/arrow/pull/8176] > [Rust] [DataFusion] DRY of function to string > - > > Key: ARROW-9984 > URL: https://issues.apache.org/jira/browse/ARROW-9984 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jorge >Assignee: Jorge >Priority: Minor > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 20m > Remaining Estimate: 0h >
[jira] [Created] (ARROW-9992) [C++][Python] Refactor python to arrow conversions based on a reusable conversion API
Krisztian Szucs created ARROW-9992: -- Summary: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API Key: ARROW-9992 URL: https://issues.apache.org/jira/browse/ARROW-9992 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 2.0.0 We have a lot of technical debt accumulated in the Python-to-Arrow conversion code paths, including hidden bugs. We need to simplify the implementation.
[jira] [Updated] (ARROW-9991) [C++] split kernels for strings/binary
[ https://issues.apache.org/jira/browse/ARROW-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maarten Breddels updated ARROW-9991: Summary: [C++] split kernels for strings/binary (was: [C++] split kernsl for strings/binary) > [C++] split kernels for strings/binary > -- > > Key: ARROW-9991 > URL: https://issues.apache.org/jira/browse/ARROW-9991 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Maarten Breddels >Assignee: Maarten Breddels >Priority: Major > > Similar to Python str.split and bytes.split, we'd like to have a way to > convert str into list[str] (and similarly for bytes). > When the separator is given, the algorithms for both types are the same. > Python, however, overloads split. When given no separator, the algorithm will > split considering all whitespace (unicode for str, ascii for bytes) as > separator. > I'd rather not see too many overloaded kernels, e.g. > binary_split (takes string/binary separator, and maxsplit arg, no special > utf8 version needed) > utf8_split_whitespace (similar to Python's version given no separator) > ascii_split_whitespace (similar to Python's version given no separator, but > considering ascii, although this could work on any binary data) > There can also be rsplit versions of these, or they could be an argument. >
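The kernel separation proposed above mirrors the way Python itself overloads str.split and bytes.split. A minimal sketch of the intended semantics, using plain Python str/bytes values; the function names (binary_split, utf8_split_whitespace, ascii_split_whitespace) come from the proposal and are not an existing pyarrow API:

```python
# Sketch of the three proposed kernels, illustrated with plain Python
# str/bytes values rather than Arrow arrays.

def binary_split(value, sep, maxsplit=-1):
    # Explicit separator: the same algorithm serves both str and bytes,
    # so no separate utf8 variant is needed.
    return value.split(sep, maxsplit)

def utf8_split_whitespace(text):
    # No separator: str.split() treats any run of unicode whitespace
    # (including e.g. \xa0) as a single separator.
    return text.split()

def ascii_split_whitespace(data):
    # No separator: bytes.split() considers only ascii whitespace.
    return data.split()
```

rsplit variants would follow the same pattern, either as separate kernels or behind a flag, as suggested above.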
[jira] [Updated] (ARROW-9991) [C++] split kernsl for strings/binary
[ https://issues.apache.org/jira/browse/ARROW-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maarten Breddels updated ARROW-9991: Description: Similar to Python str.split and bytes.split, we'd like to have a way to convert str into list[str] (and similarly for bytes). When the separator is given, the algorithms for both types are the same. Python, however, overloads strip. When given no separator, the algorithm will split considering all whitespace (unicode for str, ascii for bytes) as separator. I'd rather see not too much overloaded kernels, e.g. binary_split (takes string/binary separator, and maxsplit arg, no special utf8 version needed) utf8_split_whitespace (similar to Python's version given no separator) ascii_split_whitespace (similar to Python's version given no separator, but considering ascii, although this could work on any binary data) there can also be rsplit versions of these, or they could be an argument. was: Similar to Python str.split and bytes.split, we'd like to have a way to convert str into list[str] (and similarly for bytes). When the separator is given, the algorithms for both types are the same. Python, however, overloads strip. When given no separator, the algorithm will split considering all whitespace (unicode for str, ascii for bytes) as separator. I'd rather see not too much overloaded kernels, e.g. # binary_split (takes string/binary separator, and maxsplit arg, no special utf8 version needed) utf8_split_whitespace (similar to Python's version given no separator) asi > [C++] split kernsl for strings/binary > - > > Key: ARROW-9991 > URL: https://issues.apache.org/jira/browse/ARROW-9991 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Maarten Breddels >Assignee: Maarten Breddels >Priority: Major > > Similar to Python str.split and bytes.split, we'd like to have a way to > convert str into list[str] (and similarly for bytes). 
> When the separator is given, the algorithms for both types are the same. > Python, however, overloads split. When given no separator, the algorithm will > split considering all whitespace (unicode for str, ascii for bytes) as > separator. > I'd rather not see too many overloaded kernels, e.g. > binary_split (takes string/binary separator, and maxsplit arg, no special > utf8 version needed) > utf8_split_whitespace (similar to Python's version given no separator) > ascii_split_whitespace (similar to Python's version given no separator, but > considering ascii, although this could work on any binary data) > There can also be rsplit versions of these, or they could be an argument. >
[jira] [Created] (ARROW-9991) [C++] split kernsl for strings/binary
Maarten Breddels created ARROW-9991: --- Summary: [C++] split kernsl for strings/binary Key: ARROW-9991 URL: https://issues.apache.org/jira/browse/ARROW-9991 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels Assignee: Maarten Breddels Similar to Python str.split and bytes.split, we'd like to have a way to convert str into list[str] (and similarly for bytes). When the separator is given, the algorithms for both types are the same. Python, however, overloads split. When given no separator, the algorithm will split considering all whitespace (unicode for str, ascii for bytes) as separator. I'd rather not see too many overloaded kernels, e.g. # binary_split (takes string/binary separator, and maxsplit arg, no special utf8 version needed) utf8_split_whitespace (similar to Python's version given no separator) asi
[jira] [Commented] (ARROW-9976) [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
[ https://issues.apache.org/jira/browse/ARROW-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17195249#comment-17195249 ] Joris Van den Bossche commented on ARROW-9976: -- [~lhoestq] Thanks for the report. Yes, for now you will need to chunk yourself before converting to pyarrow, but this might be something that pyarrow should do for you. cc [~kszucs] might be a relevant case for your python conversion refactor? > [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe > - > > Key: ARROW-9976 > URL: https://issues.apache.org/jira/browse/ARROW-9976 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 >Reporter: quentin lhoest >Priority: Minor > > When calling Table.from_pandas() with a large dataset with a column of > vectors (np.array), there is an `ArrowCapacityError` > To reproduce: > {code:python} > import pandas as pd > import numpy as np > import pyarrow as pa > n = 1713614 > df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)}) > pa.Table.from_pandas(df) > {code} > With a smaller n it works. 
> Error raised: > {noformat} > --- > ArrowCapacityErrorTraceback (most recent call last) > in > > 1 _ = pa.Table.from_pandas(df) > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table.from_pandas() > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py > in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe) > 591 for i, maybe_fut in enumerate(arrays): > 592 if isinstance(maybe_fut, futures.Future): > --> 593 arrays[i] = maybe_fut.result() > 594 > 595 types = [x.type for x in arrays] > ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py > in result(self, timeout) > 423 raise CancelledError() > 424 elif self._state == FINISHED: > --> 425 return self.__get_result() > 426 > 427 self._condition.wait(timeout) > ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py > in __get_result(self) > 382 def __get_result(self): > 383 if self._exception: > --> 384 raise self._exception > 385 else: > 386 return self._result > ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py > in run(self) > 55 > 56 try: > ---> 57 result = self.fn(*self.args, **self.kwargs) > 58 except BaseException as exc: > 59 self.future.set_exception(exc) > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py > in convert_column(col, field) > 557 > 558 try: > --> 559 result = pa.array(col, type=type_, from_pandas=True, > safe=safe) > 560 except (pa.ArrowInvalid, > 561 pa.ArrowNotImplementedError, > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in > pyarrow.lib.array() > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in > pyarrow.lib._ndarray_to_array() > ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in > pyarrow.lib.check_status() > ArrowCapacityError: List array cannot contain more than 
2147483646 child > elements, have 2147483648 > {noformat} > I guess one needs to chunk the data before creating the arrays?
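The chunking suggested in the comment above can be sketched in plain Python. This is a minimal sketch of the chunking arithmetic only, assuming the int32 child-element limit quoted in the error message; how pyarrow counts child elements for a given frame may differ, and chunk_bounds is a hypothetical helper, not a pyarrow function:

```python
# Sketch: compute row ranges for a column of fixed-length vectors so each
# chunk stays under the list-array child-element limit (2**31 - 2, per
# the ArrowCapacityError above).

CHILD_LIMIT = 2**31 - 2  # max child elements in one list array

def chunk_bounds(n_rows, vec_len, limit=CHILD_LIMIT):
    """Yield (start, stop) row ranges whose total child count fits."""
    rows_per_chunk = max(1, limit // vec_len)
    for start in range(0, n_rows, rows_per_chunk):
        yield start, min(start + rows_per_chunk, n_rows)
```

Each (start, stop) range could then be converted separately, e.g. with pa.Table.from_pandas(df.iloc[start:stop]), and the pieces combined with pa.concat_tables.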